This document provides details for a group of bugs exhibited by the Android operating system on many devices. We provide this information for individuals who would like detailed technical information about the issues.
These bugs involve the DHCP client software included in Android. They may also involve a wireless network interface driver or firmware included in Android. The bugs cause Android devices to continue to use DHCP-leased IP addresses after the leases expire, and cause Android devices to resume using IP addresses from expired DHCP leases.
These bugs cause the Android device to disrupt service to other devices on the network.
In September 2010, Princeton University reported the bugs to Google, the vendor responsible for the Android operating system. In November 2012, Google commented that the bugs should not be present in Android version 4.2. As we no longer have an environment where we can test for these bugs, we are unable to determine if these bugs have indeed been fixed, and if so, in which version of Android.
Devices exhibiting these bugs are unsuitable for use with all network services at Princeton University which rely on the client respecting DHCP lease times. These include wireless service provided by OIT, OIT Mobile IP Service, Visitor IP (VIP) Service, and Temporary Unregistered Dormnet (TUD) IP Address Service.
A device that uses DHCPv4 runs DHCP client software on a network (e.g., Ethernet or Wireless) interface. The DHCP client software contacts DHCP servers to obtain network configuration; in particular, it usually obtains a lease (a loan) of an IP address.
For example, the DHCP client tells the DHCP server "I am network interface 0:1:2:3:4:5; please lease an IP address to me." The DHCP server might respond "You may use IP address 192.168.1.2 for the next six hours; if you would like to continue using that address, please renew it when three hours have elapsed." When three hours have elapsed, the DHCP client contacts the DHCP server which granted the lease; the client asks that server to renew the lease. Typically the DHCP server responds to the client: "You may use IP address 192.168.1.2 for the next six hours; if you would like to continue using that address, please renew it again when three hours have elapsed." (If the DHCP client is unable to contact the DHCP server to renew its unexpired lease, it will retry from time to time, and is permitted to continue using the IP address until the lease is due to expire.)
Assuming the DHCP client successfully renews the lease before it expires, this repeats periodically until the device goes offline. Once the device is offline, it no longer contacts the DHCP server to renew the lease, so eventually the last lease renewal expires. Once the last lease renewal has expired, the DHCP server is free to lease the IP address to another client.
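The timing in the example above can be sketched in a few lines of Python. This is only an illustration of the lease arithmetic, not DHCP client code: the Lease class and its method names are hypothetical, times are plain seconds counted from an arbitrary starting point, and the six-hour lease and three-hour renewal time are the figures from the example.

```python
from dataclasses import dataclass

HOUR = 3600.0


@dataclass
class Lease:
    """Hypothetical record of one DHCP lease, with times in seconds."""
    ip: str
    granted_at: float   # when the server granted (or last renewed) the lease
    duration: float     # how long the client may use the address
    renew_after: float  # how long to wait before asking for a renewal

    @property
    def expires_at(self) -> float:
        return self.granted_at + self.duration

    def renewal_due(self, now: float) -> bool:
        # The client should be trying to renew from the renewal time
        # until the moment the lease expires.
        return self.granted_at + self.renew_after <= now < self.expires_at

    def may_use_address(self, now: float) -> bool:
        # A well-behaved client stops using the address once the lease expires.
        return now < self.expires_at

    def renewed(self, now: float) -> "Lease":
        # Assumption: the server grants the same terms again on renewal.
        return Lease(self.ip, now, self.duration, self.renew_after)


# The example from the text: a six-hour lease, renewable after three hours.
lease = Lease("192.168.1.2", granted_at=0.0,
              duration=6 * HOUR, renew_after=3 * HOUR)
```

A client that renews at the three-hour mark holds a fresh lease running until nine hours after the original grant; a client that never renews has no right to the address from the six-hour mark onward.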
If the device goes offline, when it later comes back online it broadcasts a DHCP request for a new lease. It may choose to request a brand-new lease, or (if it believes the old lease has not yet expired) may request a new lease on the old IP address.
We have seen these issues on a range of models from a variety of vendors; the bugs are not confined to one vendor or device model.
Based upon information provided by Princeton University customers, one or more of these issues are present in (at least) the Android versions below. These are the Android versions our customers report running on those devices we have detected exhibiting one or more of the issues.
We do not have data to allow us to determine whether one or more of these bugs remains present in Android 4.2 or later. Changes made to some of our wireless network(s) beginning in late 2012 prevent us from detecting devices exhibiting this malfunction on most of our wireless network(s).
Under some circumstances, a device running one of the affected versions of Android stops renewing its DHCP lease, yet continues (or resumes) using the IP address after the lease expires. Although the owner of the device may not realize there is a problem, this interferes with service to others on the network.
We've only observed Android devices exhibit these issues when connected via their Wi-Fi network interfaces. We do not know whether these issues also affect Android devices connected via Ethernet network interfaces; Android devices with Ethernet interfaces appear to be rare at this time.
We have observed a number of issues from affected Android devices. Some devices exhibit only a subset of these issues; this is likely because the issues are due to more than one bug.
After the lease has expired, the device continues using the IP address.
That is, it continues to respond to ARP Requests, claiming to own that IP address. It responds to ICMP Echo Requests for that IP address. It responds to UDP and TCP traffic sent to that IP address. It initiates traffic to the Internet from that IP address.
It may do so for hours after the lease has expired.
This issue is common to all the Android devices exhibiting the bugs.
Eventually the device uses DHCP (in the INIT or INIT-REBOOT state) to request a new lease, which usually ends the incident. One situation that appears to trigger this is that the device disconnects from the wireless network (or loses its connection), for example, as a result of leaving the ESSID's coverage area; when it next connects, it starts in the DHCP INIT state. We assume there are other circumstances that also will trigger a return to the DHCP INIT state.
Sometimes the malfunctioning device exhibits this issue in addition to issue #1.
While the device is continuing to use the IP address after the lease has expired, sometimes it unicasts DHCPREQUEST packets to the DHCP server for that lease, asking to renew the lease. It may do this for a few minutes, or for hours.
That makes no sense, as the lease has already expired. (A DHCP client wishing to renew a lease must renew it before the lease expires, not afterwards.)
(Our DHCP server refuses to renew the lease if the lease was for a dynamically-assigned IP address, because the lease has already expired.)
The client clearly believes that its DHCP lease has not yet expired.
A check of one affected client confirms that the problem is not due to the client's clock going backwards in time since it obtained the lease. The clock on the client seems to be counting forwards just fine.
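The server-side refusal mentioned above amounts to a simple time comparison. The sketch below is an assumption-laden illustration, not our DHCP server's code: the Decision enum and decide_renewal function are hypothetical names, times are plain seconds, and the handling of addresses that are not dynamically assigned is our guess, since the text above only describes the dynamically-assigned case.

```python
from enum import Enum


class Decision(Enum):
    RENEW = "renew"
    REFUSE = "refuse"


def decide_renewal(now: float, lease_expires_at: float,
                   dynamically_assigned: bool) -> Decision:
    """Decide how to treat a client's request to renew a lease."""
    if now < lease_expires_at:
        # The lease is still running, so a renewal request is legitimate.
        return Decision.RENEW
    if dynamically_assigned:
        # The lease has expired and the address may already have been
        # leased to another client, so the request is refused.
        return Decision.REFUSE
    # Assumption: an address that is not dynamically assigned is not
    # handed to anyone else, so a late request can still be granted.
    return Decision.RENEW
```

A device exhibiting issue #2 keeps landing in the refused case: every request it sends arrives after lease_expires_at, so none of them can succeed for a dynamically-assigned address.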
This issue appears independently of issues #1 and #2. Some devices exhibit this issue alone; others exhibit this issue along with issue #1, while others exhibit this issue along with issues #1 and #2.
As described above, eventually the device chooses to return to the DHCP INIT or INIT-REBOOT state. The device asks for a new lease, and obtains one. Sometimes this new lease is for a different IP address than the old lease. The device begins using the IP address from the new lease. This is normal.
But sometimes, the device continues (or resumes) using the IP address from the old expired lease as well. That is, the device is now using both IP addresses simultaneously. (It answers IP ARP Request packets for both IP addresses.)
Over time, the device may use DHCP to obtain a series of leases, but from time to time, it resumes using the IP address from that earlier expired lease as well. Some devices simply use the old expired lease all of the time (in addition to whatever other IP address they have leased), rather than doing so only from time to time.
Sometimes the device resumes using the IP address from an old expired lease "out of the blue" (at the same time it is using another IP address for which it has an unexpired DHCP lease). We have seen devices do so hours, days, even months after the old lease has expired. (We've even seen an example of this over 16 months after the original lease expired.) The device may have slept and awoken many times (or perhaps even rebooted) since the time the original lease expired.
We sometimes see devices simultaneously use multiple IP addresses from expired DHCP leases well after the leases have expired, as if the device were accumulating the expired leases.
Sometimes, the malfunctioning device exhibits issues #1, #2, and #3 simultaneously. For example, the original lease on IP address 'a' expires, and the device continues using that IP address after lease expiration. Later, the device enters DHCP INIT and obtains a lease on IP address 'b'. The device now uses both IP addresses 'a' and 'b' simultaneously. While doing so, the device also tries to renew the expired lease on IP address 'a'.
When a device continues to use an IP address from an expired DHCP lease after that lease expires, this can interfere with service to other devices. Once the malfunctioning device has allowed its DHCP lease to expire, the DHCP server may lease the same IP address to another client.
If two devices on the same IP network try to use the same IP address at the same time, one or both can experience difficulties using IP.
The DHCP servers try to reduce the impact of these malfunctioning clients. Before offering a client a new lease for a dynamically-assigned IP address, the servers perform a quick PING test to determine whether the IP address is unexpectedly in use. (For example, is some device "stealing" the IP address?) This quick test helps, but does not entirely work around the problem caused by the malfunctioning clients. For example, sometimes the malfunctioning device may not respond to PING at the time the DHCP server checks before leasing the IP address to another client. In some DHCP server implementations, the DHCP server may have limited time to perform the test, as other clients are waiting for responses from the DHCP server. And when a device exhibiting issue #3 resumes using an IP address from an expired lease "out of the blue," that IP address may have already been leased to another client; this makes it impossible for the DHCP server to discover the malfunctioning client at the time the server leases the IP address to a soon-to-be victim.
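The pre-offer check described above can be sketched as follows. This is a simplified illustration rather than our servers' implementation: pick_address_to_offer is a hypothetical name, and the answers_ping argument stands in for whatever probe the server actually sends, so the example can run without touching a network.

```python
from typing import Callable, Iterable, Optional


def pick_address_to_offer(candidates: Iterable[str],
                          answers_ping: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate address that does not answer a probe.

    candidates   -- addresses the server believes are free to lease
    answers_ping -- probe function: True if something answered at that address
    """
    for ip in candidates:
        if answers_ping(ip):
            # Something is using an address that should be free; skip it.
            continue
        return ip
    return None  # every candidate appeared to be in use


# A device still squatting on 192.168.1.2 that happens to answer the probe.
squatters = {"192.168.1.2"}
offer = pick_address_to_offer(["192.168.1.2", "192.168.1.3"],
                              lambda ip: ip in squatters)
```

The weakness is visible in the probe itself: if the malfunctioning device is silent at the moment of the check, or only resumes using the address after it has been leased to someone else, answers_ping returns False and the address is offered anyway.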
We have observed that one situation in which the device can exhibit issue #1 and issue #2 is for the device to choose to remain attached to a Wi-Fi network while the device is asleep. If the DHCP lease comes due for renewal while the device is asleep, the device doesn't renew the lease. If the device remains asleep through the time that the lease expires, the device allows the lease to expire. The device continues to behave as if it believes the lease has not yet expired; it continues to use the IP address, and in some cases, tries to renew the lease after expiration time. We have observed that it makes no difference whether the device is plugged into a power source throughout this period.
Some of the information from Google suggests that the cause (or one cause) of issue #3 may be a known bug in the Broadcom firmware supporting an Android device's wireless interface. Google has indicated that a bug in that firmware's "ARP Offload" feature can cause a device to claim IP addresses from expired leases. (They indicate the ARP Offload feature is used to allow the device to respond to ARP requests while the device is asleep, without fully waking the device.) We do not know if the problematic firmware is used on all Android devices, or only those with wireless hardware made by Broadcom. We do not know if the version of Android provided by some vendors for some devices might be customized to disable that feature in the Broadcom firmware, or to replace the problematic version of the Broadcom firmware with a fixed version. Any of these could explain why not all devices exhibit issue #3.
Princeton recognized this as a pattern involving Android devices during the Summer of 2010.
We first saw an Android device attached to our network exhibit the problem in February 2010, another in April 2010, one more in June 2010, nine more in July 2010, and ten more in August 2010. As students returned to campus during Fall 2010, we saw the numbers of malfunctioning Android devices grow rapidly.
Nearly all of the devices we've detected exhibiting the bugs have malfunctioned repeatedly. Often the device will malfunction in this way several times per day.
To help us better understand which Android platforms malfunction in this way, our customer support organization collected from owners of the malfunctioning devices the following information: Android version, device make and model. While only a small fraction of our Android customers responded, the data we collected indicates that the problem is widespread, present in Android versions 2.1 through 4.1.1 running on different device models from different vendors.
We collected data showing malfunctioning Android devices' DHCP behavior and IP address use, and determined that the devices were all exhibiting a set of bugs described above.
On September 14 2010, we filed bug report #11236 with Google, the vendor of the Android operating system.
On September 14 2010, we published the first version of the document you are presently reading. A day later, OIT added a pointer to this information to its KnowledgeBase, used by both Princeton University customers and support staff.
Over the months (and eventually years) since filing our bug report with Google, we have continued updating our bug report at Google with more information demonstrating the problems, and showing how they affected a wide variety of Android devices.
On December 24 2010, an engineer at Google acknowledged the bug report.
On April 19 2011, there was a mention of our bug report on Slashdot.
On April 20 2011, an engineer at Google updated our bug report to say that Google had identified a couple of the causes for the issue we reported. The engineer indicated that Google had identified multiple bugs causing these behaviors; this is not just a single bug. They found bugs associated with the way the device renews DHCP leases with respect to the way the device sleeps. And they reported there is a bug in the firmware for Broadcom Wi-Fi hardware, causing its ARP offload feature to claim old IP addresses after the DHCP lease has expired. The engineer indicated that they have fixes for the bugs they have identified, and would soon be releasing those fixes.
Through November 27 2012, we have not received word from Google that any Android fixes for these bugs have been released.
During late May 2011 through late July 2011, we tested a workaround proposed by a Princeton University customer. For each Android device previously identified as exhibiting these bugs, as well as those identified during the test period, we contacted the customer associated with the device (where that person was known). We invited these customers to participate in this test. Of the 730 malfunctioning devices identified through that time, it was practical to contact the customers associated with 205 of these devices. The remaining 525 devices belonged mostly to anonymous visitors; a few belonged to customers who were impractical to contact. Of the 205 customers contacted, 52 chose to participate in the test. The test ran for ten weeks. Some of the devices participated for the entire period; most joined as the test proceeded. The typical device participated for about a month. We found that the proposed workaround was effective for 75-80% of the devices exhibiting issues #1, #2, and/or #3. It was ineffective for the remaining devices; all three issues were represented among the failure cases.
Based upon the test above, on July 29 2011 we published the procedure as a Partial Workaround for "Android Allows DHCP Lease to Expire, Keeps Using IP Address" Bugs. While that procedure does not fix the bugs, it allows some of the malfunctioning Android devices to be used on a Wi-Fi network without disrupting service to others. We began including a pointer to that procedure in the information we provide to affected Android customers.
If a device malfunctions in a similar manner after the customer advises us that s/he had adopted the partial workaround, we take that as final indication that the partial workaround has proven not effective for that particular customer's device. We block the device, advise the customer, and then keep the block in place permanently (or until a fix for the device is available from Google). We do not allow the device to be unblocked and have "another try" to use the partial workaround, even if the customer believes the reason for the malfunction was that the customer didn't apply the partial workaround properly (or unwittingly removed the workaround). This is because we have experienced so many Android devices on our network exhibiting these bugs that it is impractical for us to allow each one to interfere with service repeatedly. Each device gets one opportunity to try the partial workaround.
During July 29 2011 - September 1 2011, we noticed that some of the test participants previously counted as successes malfunctioned again. We therefore updated our results on September 2 2011 to reflect that the partial workaround was effective for 70% of the tested devices exhibiting issues #1, #2, and/or #3.
On December 30 2011 we reviewed incident records for all devices which had attempted to use the partial workaround to-date. That data shows that over time, more of these devices eventually malfunction again in the same way. We therefore updated our results to reflect that the partial workaround has been effective for 61% of the devices exhibiting issues #1, #2, and/or #3.
We continue to encounter a growing number of Android devices exhibiting these bugs, disrupting network service for others on a daily basis. The partial workaround proved ineffective for a significant fraction of Android devices.
We continued updating our bug report at Google for two years, showing how the bugs have remained present in newer Android releases across a variety of devices.
Through March 3 2013, we have seen over 2100 Android devices malfunction in this way while attached to our campus network.
Changes were made to portions of Princeton's wireless network architecture beginning in September 2012 to accommodate the growing demand for wireless service. These changes included migrating some wireless network(s) from using globally-routable IPv4 addresses behind normal IPv4 routers to using private IPv4 addresses behind NAT routers, to accommodate a demand for more IPv4 addresses. And these changes included migrating from DHCP servers instrumented to detect malfunctioning clients to commodity DHCP servers, to accommodate growth in DHCP transaction rates from wireless clients. Side effects of those changes prevent us from detecting wireless clients exhibiting this issue while they are attached via most of our wireless networks. As we can no longer detect those malfunctioning clients when they are attached via most wireless networks, we no longer block these clients and contact the owner. As a result, those malfunctioning clients may disrupt service to other wireless clients and degrade overall wireless network service on an ongoing basis. (On one remaining small wireless network, we can still detect these malfunctioning clients, and continue to block the devices and contact owners if feasible.)
On November 26 2012, Google marked the bug as closed, and marked the bug as having a fix released.
On November 26 2012, Google commented that these bugs should not be present in Android version 4.2. It is unclear to us exactly what this means. This is because Google indicated in April 2011 that they'd soon be releasing fixes, but at that time did not indicate that fixes had been included beginning in any specific version of Android. Long after April 2011, we continued to find devices running newly-released Android updates exhibiting the malfunctions, all the way through Android 4.1.1. It is unclear to us whether Google's comment that Android 4.2 should not have the bugs is based on some new changes they made in version 4.2, or if Google was restating that they believed they had fixed the issue(s) in versions prior to 4.2. If the latter, we already know that any changes made to the earlier versions (through 4.1.1) didn't entirely fix these bugs. So at this time, we do not know whether the bugs remain present in Android 4.2. And because Princeton's wireless network architecture was changed beginning in September 2012 in such a way that we can no longer check for these bugs, we can't determine if Android 4.2 still has any of these bugs.
If Google has fixed these bugs starting in Android version 4.2 (or does so in some future version of Android), it is not clear to us that owners of most devices running older versions of Android will ever be able to obtain such bug fixes. Google's distribution model for Android updates does not result in timely Android OS updates for most owners. Often Android updates for existing devices are never made available in a way most customers can use.
We did not ban the use of Android devices at Princeton. Each Android device is welcome on our network, unless or until that device malfunctions in such a way as to disrupt or degrade service. Only those that are detected malfunctioning in this way were blocked from using the network. However, it is an unfortunate fact that most Android devices running the affected versions of the operating system do malfunction in this way, ultimately resulting in us blocking each of those devices, one at a time.
Once an individual Android device exhibited this bug, we contacted the customer to advise him or her of the problem. We advised the customer that if the device interfered with service a second time in this way, network service for the device would be blocked.
If the same device exhibited the problem more than once, we blocked that individual device from our network. (Most affected Android devices malfunction so frequently, often we detected the device malfunction several times in the same day, and so we blocked the device at the same time we first contacted the owner.)
Once blocked, if the device had a cellular network interface, the device could still be used with the customer's cellular network provider, of course.
If it was not practical to contact the customer (for example, because the device was using our visitor wireless service and the owner was anonymous), we blocked that individual device from our network the first time it exhibited this bug. If at a later time it became practical to contact the customer (for example, because the customer registered the device in the University's Host Database), we contacted the customer to advise him or her of the problem.
Beginning with the availability of the partial workaround during Summer 2011, if the owner of a blocked malfunctioning Android device chose to adopt the partial workaround, we unblocked the device, allowing it to resume using the campus network. Because the partial workaround is not fully effective, some of these devices would continue to disrupt service, or would resume disrupting service at a later date. When we detected one of these devices again disrupting service, we blocked the device from the network and contacted the customer again. In the absence of a fully-effective workaround or a fix from Google, once a device had malfunctioned after trying to use the partial workaround, that device remained blocked from our network. We did not unblock a device after the partial workaround had proved ineffective.
This is similar to how we handle other malfunctioning devices which disrupt or degrade service. We typically do not ban entire classes of devices. We have not singled out Android devices for special handling. We block individual devices after they actually disrupt or degrade service. In most cases, we unblock such devices when the owner takes acceptable action to address the issue. (Lacking a fix from Google, for those Android devices where the partial workaround is not fully effective, there was nothing those customers could do to address this particular issue.)
As described above, architectural changes made to our wireless networks beginning in September 2012 to accommodate growth now prevent us from detecting clients exhibiting this malfunction. As a result, we no longer contact the owner of these devices or block them. Instead, these malfunctioning devices continue to disrupt and degrade network service for others.
Beginning in September 2012, wireless clients blocked from network service because they were exhibiting this issue will be unblocked upon request from the customer. This is because (as described above) we no longer are able to detect wireless clients exhibiting this issue. Once unblocked, if the device continues/resumes malfunctioning in this way, it may disrupt service to other wireless clients and degrade overall wireless network service on an ongoing basis. If it does so, we will no longer be able to determine the cause of the problem or address it.
Some may wonder why Princeton was the only site to report this problem at first. Some may believe that because other sites did not report the problem at first, the problem must be due to a problem with Princeton's network.
Princeton detected this issue because at the time, we took a very pro-active stance to monitor for certain kinds of common network problems which interfere with service or degrade service, including this one.
At that time, our network monitoring included comparing actual IP address usage to DHCP server lease assignments on a daily basis. Specifically, we compared our IP router ARP cache data to our DHCP server logs. We were able to do so because we were using DHCP servers instrumented to provide us with the necessary information, and non-NAT'd IP networks in which we were able to reliably gather IP ARP data from the IP router. This allowed us to detect some devices using IP addresses not assigned for their use.
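The comparison described above can be sketched roughly as follows. This is an illustration only, not the tooling we used: the record types ArpSighting and LeaseRecord, their fields, and the function find_unleased_use are hypothetical names, times are plain seconds, and the sample addresses and hardware addresses are made up.

```python
from typing import Dict, List, NamedTuple, Tuple


class ArpSighting(NamedTuple):
    """One entry taken from a router's ARP cache (hypothetical shape)."""
    ip: str
    mac: str
    seen_at: float


class LeaseRecord(NamedTuple):
    """One lease taken from the DHCP server's logs (hypothetical shape)."""
    ip: str
    mac: str
    start: float
    end: float


def find_unleased_use(sightings: List[ArpSighting],
                      leases: List[LeaseRecord]) -> List[ArpSighting]:
    """Return sightings of a device using an address it held no lease on."""
    by_owner: Dict[Tuple[str, str], List[LeaseRecord]] = {}
    for lease in leases:
        by_owner.setdefault((lease.ip, lease.mac), []).append(lease)

    suspicious = []
    for s in sightings:
        # A sighting is legitimate only if some lease for this exact
        # (IP address, hardware address) pair was running at that time.
        covered = any(l.start <= s.seen_at < l.end
                      for l in by_owner.get((s.ip, s.mac), []))
        if not covered:
            suspicious.append(s)
    return suspicious


leases = [LeaseRecord("192.168.1.2", "0:1:2:3:4:5", start=0.0, end=3600.0)]
sightings = [
    ArpSighting("192.168.1.2", "0:1:2:3:4:5", seen_at=1800.0),  # lease running
    ArpSighting("192.168.1.2", "0:1:2:3:4:5", seen_at=7200.0),  # lease expired
]
flagged = find_unleased_use(sightings, leases)
```

Only the second sighting is flagged, since it falls after the lease ended. A comparison like this depends on both inputs being trustworthy, which is why NAT and uninstrumented DHCP servers make it impractical.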
This was a degree of monitoring that many sites did not perform. Many sites place client devices -- especially wireless clients -- behind NATs. And many sites operate a DHCP service that is not instrumented in a way to assist in such monitoring. Without such closely monitored ARP data and DHCP server data, detecting this kind of problem is difficult or impractical.
We also monitored our DHCP servers very closely for any problems they detected, including when they saw DHCP-leased IP addresses in-use when they should not be, or when a client tried to SELECT an offer that was not made to it, or when a client tried to renew or rebind an IP address after the client's lease on that IP address has already expired. We had instrumented our DHCP server software to make it (somewhat) easier to see such events. Our monitoring also reported DHCP clients which were the source of excessive transactions; occasionally these were victims of malfunctioning iPhone OS devices "stealing" IP addresses.
As a result of the close monitoring we performed to detect DHCP issues, Princeton tended to learn about some kinds of bugs in DHCP client implementations sooner and more often than many other sites.
A more common approach is to ignore the kinds of problems caused by devices using IP addresses not leased to them, allowing such malfunctioning devices to cause sporadic mysterious network problems for customers as their IP addresses are "stolen". Sites that use that approach may take action only when a victim of a malfunctioning device chooses to complain. Most victims probably don't complain because these kinds of problems appear random and short-lived to each victim, and often go away when they "try again."
We felt that the stance we took ultimately benefited our customers, as it resulted in more reliable network service to the customers. It reduced the frequency with which our customers experienced network disruptions due to others' malfunctioning devices.
As a side note, this pro-active stance also resulted in our discovering DHCP client issues a number of times over the years for a variety of common platforms. Typically we provided technical details of these issues to the DHCP client vendors, which helped the vendors to fix bugs and improve DHCP client behavior, as Apple did for this bug. Although identifying issues in vendors' DHCP client software was not our goal -- our mission was and remains to provide excellent network service to Princeton University customers -- it does speak to the technical accuracy of the bugs we discovered.
In the time since we reported this issue, a small number of other sites have indicated that they too are seeing one or more of the bugs described above.
Sites that are able to monitor for these problems closely are less likely to notice issue #1 if they assign DHCP leases with long expiration times, for example, on the order of days. Princeton's wireless services rely on DHCP leases in the 1-3 hour range. (Shorter leases allow us to recover unused IP addresses rapidly, in turn permitting us to assign globally-routable IP addresses to clients without requiring Princeton to impose a NAT between wireless clients and the Internet.) Expiration times so long that the Android device is likely to be woken from sleep by the customer before the lease expires might hide issue #1 in some cases, but we have found that even waking an Android device exhibiting issue #1 will not always cause the device to use DHCP to obtain a fresh lease. And issue #3 is not hidden by using long lease times, even on the order of months.
Sites which operate network infrastructure that modifies the behavior of ARP traffic may have difficulty detecting these problems. For example, some enterprise Wi-Fi infrastructures rewrite ARP request broadcast frames to become unicast frames destined to the device the infrastructure "believes" should be the owner of the requested IP address. Or the infrastructure may drop ARP request broadcast frames and instead reply to these requests on behalf of the device the infrastructure "believes" should be the owner of the requested IP address. Such interference with ARP traffic can make it difficult or impossible for network operators to detect these Android bugs. Depending on the way such infrastructures decide which device "should" be the legitimate owner of a requested IP address, these ARP changes may not reliably work around the damage (stolen IP addresses) caused by the Android bugs. That is, such modifications to the behavior of ARP may not hide the damage these Android bugs cause, but may prevent network operators from discovering that the damage is happening and that it's due to these Android bugs.
Since the time we detected this bug, Princeton University has re-architected most of our wireless networks to accommodate wireless growth. These network changes began in September 2012. One of the side effects of these changes is that we now are unable to detect and troubleshoot these kinds of problems on most of our wireless networks. Previously, each of our wireless networks used globally-routable IPv4 addresses and was not behind a NAT. A side effect of that design was that we could reliably retrieve IP ARP data from each network's IP router. Now most of our wireless networks are behind NAT routers so we may assign private IPv4 addresses to wireless clients; we are doing so to extend the time before we run out of globally-routable IPv4 addresses. A side effect of the shift from normal IP routers to NAT routers is that we are no longer able to record IP ARP data for wireless clients in a manner reliable enough to let us detect wireless clients malfunctioning in various ways, including the way described in this document. Additionally, our wireless networks previously used DHCP servers we had instrumented to help us to detect "stolen" IP addresses. Now most of our wireless networks use commodity DHCP servers, because those servers are rated to handle higher DHCP transaction rates than our instrumented DHCP servers. A side effect of the shift from instrumented DHCP servers to commodity DHCP servers is that we no longer have the DHCP server instrumentation to detect wireless clients malfunctioning in various ways, including the way described in this document. Furthermore, our wireless networks previously allowed IP ARP traffic to flow normally.
Now on most of our wireless networks, wireless controllers intercept most IP ARP requests to reduce the broadcast traffic rate; instead of flooding the requests to allow all possible clients to see and respond to ARP requests, the wireless controllers may choose to construct an ARP response themselves based on the controller's notion of the "correct" answer. A side effect of the move away from normal ARP traffic flow is that on our wireless networks, we no longer can truly determine which devices are using which IP addresses. The upshot is that as a side effect of the network changes we made starting in September 2012, we are no longer able to detect these kinds of malfunctioning wireless clients. Wireless clients which exhibit these kinds of issues can still disrupt or degrade network service for others, but we are not able to detect and troubleshoot these kinds of problems.