This document provides details of a DHCP bug exhibited by Mac OS X 10.5.x in rare circumstances. We have confirmed its presence in 10.5.6 and 10.5.8. We suspect it is present in other versions of Mac OS X 10.5.x.
The bug can cause the Mac to interfere with service to other devices on the network. This document provides information for individuals who would like detailed technical information about the issue.
Princeton University reported the bug to Apple. This bug was fixed in Mac OS X 10.6.
DHCPv4, the Dynamic Host Configuration Protocol for IPv4, allows a device attached to the network to automatically learn some or all of its network configuration, including its IPv4 (Internet) address. Most operating systems include DHCP client software.
A device that uses DHCPv4 runs DHCP client software on a network (e.g., Ethernet or Wireless) interface. The DHCP client software contacts DHCP servers to obtain network configuration; in particular, it usually obtains a lease (a loan) of an IP address.
For example, the DHCP client tells the DHCP server "I am network interface 0:1:2:3:4:5; please lease an IP address to me." The DHCP server might respond "You may use IP address 192.168.1.2 for the next six hours; if you would like to continue using that address, please renew it when three hours have elapsed." When three hours have elapsed, the DHCP client contacts the DHCP server which granted the lease; the client asks that server to renew the lease. Typically the DHCP server responds to the client: "You may use IP address 192.168.1.2 for the next six hours; if you would like to continue using that address, please renew it again when three hours have elapsed." (If the DHCP client is unable to contact the DHCP server to renew its unexpired lease, it will retry from time to time, and is permitted to continue using the IP address until the lease is due to expire.)
Assuming the DHCP client successfully renews the lease before it expires, this repeats periodically until the device goes offline. Once the device is offline, it no longer contacts the DHCP server to renew the lease, so eventually the last lease renewal expires. Once the last lease renewal has expired, the DHCP server is free to lease the IP address to another client.
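The timing in the example above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not any DHCP client's actual code; the function name lease_schedule and the renew_fraction parameter are hypothetical, and the default of one half simply matches the "renew after three of six hours" example.

```python
from datetime import datetime, timedelta

def lease_schedule(granted_at, lease_seconds, renew_fraction=0.5):
    """Return (renew_at, expires_at) for a lease granted at `granted_at`.

    renew_fraction is an assumption for illustration: the point in the
    lease at which the client first tries to renew (one half in the
    example in the text).
    """
    lease = timedelta(seconds=lease_seconds)
    renew_at = granted_at + lease * renew_fraction
    expires_at = granted_at + lease
    return renew_at, expires_at

# A six-hour lease granted at noon.
granted = datetime(2010, 1, 1, 12, 0, 0)
renew_at, expires_at = lease_schedule(granted, 6 * 3600)
print(renew_at)    # 2010-01-01 15:00:00
print(expires_at)  # 2010-01-01 18:00:00

# A successful renewal at 15:00 restarts the clock from that moment.
renew_at, expires_at = lease_schedule(renew_at, 6 * 3600)
print(expires_at)  # 2010-01-01 21:00:00
```

If the device goes offline and stops renewing, the last computed expiration time is the point after which the DHCP server may lease the address to another client.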
If the device goes offline, when it later comes back online it broadcasts a DHCP request for a new lease. It may choose to request a brand-new lease, or (if it believes the old lease has not yet expired) may request a new lease on the old IP address.
A DHCP client which wishes to surrender a leased IP address before the lease expires may optionally contact the DHCP server to release the lease on that IP address. The DHCP server may then lease that IP address to another client.
Devices running Mac OS X 10.5.x sometimes get into a state in which they release a DHCP lease, but continue to use the IP address from that lease.
The pattern we see is that the device obtains a lease on IP address 'A'. Prior to the expiration of the lease, the device sends a DHCPRELEASE message to the DHCP server, releasing the lease on IP address 'A'. Later, the device uses DHCP to obtain a lease on IP address 'B'. It proceeds to use IP address 'B' and also IP address 'A' at the same time. That is, it uses its new legitimate IP address 'B', but simultaneously "steals" IP address 'A'.
Later it may use DHCP to obtain a lease on IP address 'C'. It proceeds to use IP address 'C' and also IP address 'A' at the same time. That is, it uses its legitimately-leased IP address, but simultaneously "steals" IP address 'A' which it released hours or days ago.
This continues on and off until the problem with the device is addressed through manual intervention (described below).
If the problem is not addressed, eventually the problem worsens. The device exhibits the problem a second time, releasing a lease on IP address 'C'. From that time on, the device will steal both IP addresses 'A' and 'C' at the same time as it uses any other legitimately-obtained IP address 'D'. Over time, the device accumulates old IP addresses it has released, stealing all of them in addition to using its current legitimate IP address.
This interferes with service to other devices on the network: after the device releases an IP address, the DHCP server is free to lease that IP address to another device, which then finds its address also in use by the malfunctioning Mac.
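The accumulation described above can be modeled with a toy simulation. This is a sketch of the externally observed behavior only; it is not based on Apple's code, and the class and method names (BuggyClient, obtain, buggy_release, in_use) are hypothetical.

```python
class BuggyClient:
    """Toy model of a device that keeps using addresses it has released."""

    def __init__(self):
        self.current = None  # address legitimately leased right now
        self.stolen = []     # released addresses still (wrongly) in use

    def obtain(self, ip):
        """Normal lease acquisition; replaces any current address."""
        self.current = ip

    def buggy_release(self):
        """Send DHCPRELEASE but keep using the address (the malfunction)."""
        if self.current is not None:
            self.stolen.append(self.current)
            self.current = None

    def in_use(self):
        """All addresses the device is using, legitimately or not."""
        addrs = list(self.stolen)
        if self.current is not None:
            addrs.append(self.current)
        return addrs

client = BuggyClient()
client.obtain("A")
client.buggy_release()   # releases 'A' but keeps using it
client.obtain("B")
print(client.in_use())   # ['A', 'B']
client.obtain("C")       # ordinary move from 'B' to 'C'
print(client.in_use())   # ['A', 'C']
client.buggy_release()   # the malfunction occurs a second time
client.obtain("D")
print(client.in_use())   # ['A', 'C', 'D']
```

From the DHCP server's point of view, only the current address is leased to the device; 'A' and 'C' are free to be handed to other clients, which is where the conflict arises.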
We do not know what causes the device to enter this malfunctioning state.
Apparently the circumstances leading a device to enter this malfunctioning state are not common, as we have seen only several dozen devices (out of thousands) exhibit the problem. We have observed that some of the devices which have exhibited this problem have gotten into this malfunctioning state a number of times; we do not know what differentiates these devices from others.
A temporary workaround to the problem is to clear the networking configuration on the device (e.g., delete the configuration using the 'Network' System Preference pane), then recreate its networking configuration anew. This ends the current incident; however, the underlying bug is still present, so the device may still malfunction in this way at some time in the future.
Princeton began seeing this malfunction in September 2009. As the frequency of the problem was low, it took a number of months before we recognized the pattern, and several more months to collect the necessary data to describe the bug.
In April 2010, we reported the bug to Apple.
Apple responded to our bug report. The non-disclosure agreement involving Apple bug reports prevents us from disclosing Apple's response; however, we can note that this bug was fixed in Mac OS X 10.6, released August 2009.
When Princeton encounters a device malfunctioning in this way, we temporarily block the device from our network to stop it from interfering with service. We contact the owner and ask him or her to take action (delete the device's network configuration) to address the current incident. We also advise the owner that the bug remains present in Mac OS X 10.5.x, and so is likely to recur in the future, until the owner upgrades to Mac OS X 10.6 or later. Once the owner indicates that the current incident has been addressed, we remove the temporary block.
Changes were made to Princeton's wireless network architecture beginning in September 2012 to accommodate the growing demand for wireless service. These changes included migrating from globally-routable IPv4 addresses behind normal IPv4 routers to private IPv4 addresses behind NAT routers, to accommodate a demand for more IPv4 addresses. They also included migrating from DHCP servers instrumented to detect malfunctioning clients to commodity DHCP servers, to accommodate growth in DHCP transaction rates from wireless clients. Side effects of those changes prevent us from detecting wireless clients exhibiting this issue. As we can no longer detect those malfunctioning clients, we no longer block these clients and contact their owners. As a result, those malfunctioning clients may disrupt service to other wireless clients and degrade overall wireless network service on an ongoing basis.
Beginning in September 2012, wireless clients blocked from network service because they were exhibiting this issue will be unblocked upon request from the customer. This is because (as described above) we are no longer able to detect wireless clients exhibiting this issue. Once unblocked, if the device continues or resumes malfunctioning in this way, it may disrupt service to other wireless clients and degrade overall wireless network service on an ongoing basis. If it does so, we will no longer be able to determine the cause of the problem or address it.
Some may wonder why only Princeton has reported this problem. Some may believe that because other sites did not report it, the problem must have been due to a problem with Princeton's network.
Princeton detected this issue because at the time, we took a very pro-active stance to monitor for certain kinds of common network problems, including this one.
At that time, our network monitoring included comparing actual IP address usage to DHCP server lease assignments on a daily basis. Specifically, we compared our IP router ARP cache data to our DHCP server logs. We were able to do so because we were using DHCP servers instrumented to provide us with the necessary information, and non-NAT'd IP networks in which we were able to reliably gather IP ARP data from the IP router. This allowed us to detect some devices using IP addresses not assigned for their use.
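The comparison described above can be sketched as follows. This is a simplified illustration, not the tooling Princeton used: the function name find_unleased_usage and the data layouts (a list of (IP, MAC) pairs taken from a router's ARP cache, and a mapping from IP address to the MAC address holding the lease) are assumptions. A real comparison would also need to account for timestamps, statically assigned addresses, and leases that changed hands during the sampling period.

```python
def find_unleased_usage(arp_entries, leases):
    """Return (ip, mac) pairs seen in ARP data where `mac` holds no lease on `ip`.

    arp_entries: iterable of (ip, mac) pairs observed by the router (assumed format).
    leases: dict mapping ip -> mac of the current lease holder (assumed format).
    """
    suspects = []
    for ip, mac in arp_entries:
        leased_to = leases.get(ip)
        # Flag the entry if nobody holds a lease on this address,
        # or if the lease belongs to a different interface.
        if leased_to is None or leased_to.lower() != mac.lower():
            suspects.append((ip, mac))
    return suspects

# Addresses from the 192.0.2.0/24 documentation range.
leases = {
    "192.0.2.10": "00:01:02:03:04:05",
    "192.0.2.11": "00:0a:0b:0c:0d:0e",
}
arp_entries = [
    ("192.0.2.10", "00:01:02:03:04:05"),  # matches its lease
    ("192.0.2.11", "00:0A:0B:0C:0D:0E"),  # matches (case-insensitive)
    ("192.0.2.12", "00:01:02:03:04:05"),  # in use, but leased to nobody
]
print(find_unleased_usage(arp_entries, leases))
# [('192.0.2.12', '00:01:02:03:04:05')]
```

In this example the interface 00:01:02:03:04:05 appears on two addresses, only one of which it has leased, which is the signature of the malfunction described in this document.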
This was a degree of monitoring that many sites did not perform. Many sites place client devices -- especially wireless clients -- behind NATs. And many sites operate a DHCP service that is not instrumented in a way to assist in such monitoring. Without such closely monitored ARP data and DHCP server data, detecting this kind of problem is difficult or impractical.
We also monitored our DHCP servers very closely for any problems they detected, including when they saw DHCP-leased IP addresses in use when they should not have been, when a client tried to SELECT an offer that was not made to it, or when a client tried to renew or rebind an IP address after the client's lease on that IP address had already expired. We had instrumented our DHCP server software to make it (somewhat) easier to see such events. Our monitoring also reported DHCP clients which were the source of excessive transactions; occasionally these were victims of malfunctioning iPhone OS devices "stealing" IP addresses.
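One of the server-side checks mentioned above, a renew or rebind request arriving for a lease that has already expired or that belongs to someone else, might look like the following sketch. The function name check_renewal, the lease-table layout, and the result labels are all hypothetical; this does not describe the instrumentation actually added to Princeton's DHCP servers.

```python
from datetime import datetime

def check_renewal(request_time, client_mac, ip, leases):
    """Classify a renew/rebind request against the server's lease table.

    leases maps ip -> (mac, expires_at); this layout is an assumption
    made for illustration.
    """
    entry = leases.get(ip)
    if entry is None:
        return "no-lease"          # server has no record of a lease on ip
    mac, expires_at = entry
    if mac.lower() != client_mac.lower():
        return "leased-to-other"   # address belongs to a different client
    if request_time > expires_at:
        return "expired"           # client is renewing after expiration
    return "ok"

leases = {"192.0.2.10": ("00:01:02:03:04:05", datetime(2010, 1, 1, 18, 0, 0))}

print(check_renewal(datetime(2010, 1, 1, 15, 0, 0),
                    "00:01:02:03:04:05", "192.0.2.10", leases))  # ok
print(check_renewal(datetime(2010, 1, 1, 19, 0, 0),
                    "00:01:02:03:04:05", "192.0.2.10", leases))  # expired
```

Any result other than "ok" is the kind of event worth logging for later review.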
As a result of the close monitoring we performed to detect DHCP issues, Princeton tended to learn about some kinds of bugs in DHCP client implementations sooner and more often than many other sites.
A more common approach is to ignore the kinds of problems caused by devices using IP addresses not leased to them, allowing such malfunctioning devices to cause sporadic mysterious network problems for customers as their IP addresses are "stolen". Sites that use that approach may take action only when a victim of a malfunctioning device chooses to complain. Most victims probably don't complain because these kinds of problems appear random and short-lived to each victim, and often go away when they "try again."
We felt that the stance we took ultimately benefited our customers, as it resulted in more reliable network service for them. It reduced how often our customers experienced network disruptions due to others' malfunctioning devices.
As a side note, this pro-active stance also resulted in our discovering DHCP client issues a number of times over the years for a variety of common platforms. Typically we provided technical details of these issues to the DHCP client vendors, which helped the vendors to fix bugs and improve DHCP client behavior, as Apple did for this bug. Although identifying issues in vendors' DHCP client software was not our goal -- our mission was and remains to provide excellent network service to Princeton University customers -- it does speak to the technical accuracy of the bugs we discovered.
Sites which operate network infrastructure that modifies the behavior of ARP traffic may have difficulty detecting these problems. For example, some enterprise Wi-Fi infrastructures rewrite ARP request broadcast frames to become unicast frames destined to the device the infrastructure "believes" should be the owner of the requested IP address. Or the infrastructure may drop ARP request broadcast frames and instead reply to these requests on behalf of the device the infrastructure "believes" should be the owner of the requested IP address. Such interference with ARP traffic can make it difficult or impossible for network operators to detect this Mac OS bug. Depending on the way such infrastructures decide which device "should" be the legitimate owner of a requested IP address, these ARP changes may not reliably work around the damage (stolen IP addresses) caused by the Mac OS bug. That is, such modifications to the behavior of ARP may not hide the damage this Mac OS bug causes, but may prevent network operators from discovering that the damage is happening and that it's due to this Mac OS bug.
Since the time we detected this bug, Princeton University has begun re-architecting our wireless networks to accommodate wireless growth. These network changes began in September 2012, and are presently in progress. One of the side effects of these changes is that we are often unable to detect and troubleshoot these kinds of problems on our wireless networks.
Previously, each of our wireless networks used globally-routable IPv4 addresses and was not behind a NAT. A side effect of that design was that we could reliably retrieve IP ARP data from each network's IP router. Now some of our wireless networks are behind NAT routers so we may assign private IPv4 addresses to wireless clients; we are doing so to extend the time before we run out of globally-routable IPv4 addresses. A side effect of the shift from normal IP routers to NAT routers is that we are no longer able to record IP ARP data for wireless clients in a manner reliable enough to let us detect wireless clients malfunctioning in various ways, including the way described in this document.
Additionally, our wireless networks previously used DHCP servers we had instrumented to help us to detect "stolen" IP addresses. Now some of our wireless networks use commodity DHCP servers, because those servers are rated to handle higher DHCP transaction rates than our instrumented DHCP servers. A side effect of the shift from instrumented DHCP servers to commodity DHCP servers is that we no longer have the DHCP server instrumentation to detect wireless clients malfunctioning in various ways, including the way described in this document.
The upshot is that as a side effect of the network changes we made starting in September 2012, we are no longer able to detect these kinds of malfunctioning wireless clients. Wireless clients which exhibit these kinds of issues can still disrupt or degrade network service for others, but we are often not able to detect and troubleshoot these kinds of problems.