The modernization of PCIe hotplug in Linux

Benefits for LWN subscribers

The primary benefit from subscribing to LWN is helping to keep us publishing, but, beyond that, subscribers get immediate access to all site content and access to a number of extra site features. Please sign up today!

October 8, 2018

This article was contributed by Lukas Wunner

PCI Express hotplug has been supported in Linux for fourteen years. The code, which is aging, is currently undergoing a transformation to fit the needs of contemporary applications such as hot-swappable flash drives in data centers and power-manageable Thunderbolt controllers in laptops. Time for a roundup.

The initial PCI specification from 1992 had no provisions for the addition or removal of cards at runtime. In the late 1990s and early 2000s, various proprietary hotplug controllers, as well as the vendor-neutral standard hotplug controller, were conceived and became supported by Linux through drivers living in drivers/pci/hotplug. PCI Express (PCIe), instead, supported hotplug from the get-go in 2002, but its embodiments have changed over time. Originally intended to hot-swap PCIe cards in servers or ExpressCards in laptops, today it is commonly used in data centers (where NVMe flash drives need to be swapped at runtime) and by Thunderbolt (which tunnels PCIe through a hotpluggable chain of converged I/O switches, together with other protocols such as DisplayPort).

Threaded interrupts

Linux's PCIe hotplug driver, called pciehp, was introduced in 2004 by Dely Sy. The first major cleanup and rework was carried out by Kenji Kaneshige, who stopped doing this work in 2011. After that, contributions were largely confined to duct-taping over the driver's remaining weaknesses, in particular its event handling.

Threaded interrupts are the predominant interrupt-handling pattern in the kernel and a cornerstone of realtime Linux but, unfortunately, they were introduced after Kaneshige's rework had concluded. pciehp's hard interrupt handler therefore identified which event(s) had occurred, such as "link-up" or "link-down", and queued a work item for each of them. The problem with this approach was that, by the time the work item was executed, the link status may well have changed again. Moreover, if the link flipped more quickly than the hard interrupt handler ran, unbalanced link-up and link-down events would be detected. Finally, the possibility of multiple in-flight work items and how they interact made it difficult to reason about the event-handling code's correctness. Recently, Bjorn Helgaas (who is the PCI maintainer) referred to pciehp's event handling as "baroque". A point was reached where duct-taping was no longer an option and a fundamental rethink of the driver became unavoidable.

For Linux 4.19, I converted pciehp to threaded interrupt handling; events are now collected by the hard interrupt handler for later consumption by the interrupt thread. The detection of whether a link change is a link-up or link-down is deferred until the handling of the event by the interrupt thread to avoid handling stale events. The new approach can deal with a quick series of events (such as a link-down rapidly followed by a link-up) and tolerates link flips during bring-up of the slot, which may be caused by sketchy board layout or electromagnetic interference. The patch set also included a fair number of bug fixes and cleanups so, overall, robustness and reliability should improve noticeably. Follow-up patches queued for 4.20 shave off close to 500 lines from pciehp and other hotplug drivers, resulting in a further simplification and rationalization.

Runtime power management

Linux 4.19 will also add the ability to runtime suspend PCIe hotplug ports. This is necessary to power down Thunderbolt controllers, which show up in the operating system as a PCIe upstream port and multiple PCIe downstream ports (with hotplugged devices appearing below the latter once a tunnel has been established). Before a controller can power down, all its PCIe ports need to runtime suspend. Linux has been able to runtime suspend the upstream port since 4.8, but could not runtime suspend the downstream ports before 4.19.

Runtime suspending a Thunderbolt PCIe port does not itself result in any saved power: the port will encapsulate and decapsulate PCIe packets for transport over the converged I/O switch and consume energy as long as that switch is powered. However, Linux's power-management model requires that all child devices must suspend before their parent can. By runtime suspending all of the Thunderbolt controller's ports, its parent, a root port, is allowed to suspend which, in turn, triggers power-down of the controller through ACPI platform methods. Powering down the controller does result in a significant power saving of about 1.5W.

Put another way, runtime suspending Thunderbolt PCIe ports is done to satisfy the needs of Linux's hierarchical power-management model. A single Thunderbolt PCIe port consumes the same amount of energy regardless whether its PCI power state is D0 (full power) or D3hot (suspended), but when all ports are runtime suspended, the controller as a whole can be powered down. (Powering down Thunderbolt controllers on Macs will need further patches that may appear in the 4.21 time frame.)

An interesting detail is the handling of hotplug events that occur while a PCIe hotplug port is runtime suspended: if its parents are runtime suspended as well, the port is inaccessible. So it cannot signal the interrupt in-band, and the kernel can't react to it or even determine the type of event until the parents are runtime resumed. There are two known ways for hardware to deal with this.

The first is in accordance with the PCIe specification: the hotplug port signals a power-management event (PME), which may happen out-of-band through a means provided by the platform, such as a general-purpose I/O pin (a WAKE# signal in PCIe terminology). The PME causes wakeup of the hierarchy below the Thunderbolt host controller, whereupon the hotplug port becomes accessible. This method is known to be used on Lenovo and Dell laptops with Thunderbolt 3 and allows controllers to power down even if devices are attached. Mika Westerberg has submitted patches for 4.20 to support it.

The second method is nonstandard: the Thunderbolt hardware knows which tunnels are currently established and can therefore convert a hotplug event occurring at the converged I/O layer to a hotplug event occurring at the overlaid PCIe layer. Thus, when a device is attached or removed, an interrupt is magically received from the affected PCIe port regardless whether it and its parents are in D3hot. This method is known to be used on Apple Macs with Thunderbolt 1 and requires that the Thunderbolt host controller remains powered up as long as devices are attached. Support for it was added in 4.19.

Runtime power management is currently not enabled for non-Thunderbolt hotplug ports as they are known to cause issues such as non-maskable interrupts when put into D3hot. Vendors may pass "pcie_port_pm=force" on the command line to validate their hotplug ports for runtime suspend support and perhaps the feature can be enabled by default at a later point.

Surprise removal

The original PCIe specification defined a standard usage model that included a manually operated retention latch to hold a card in place and an attention button to request bring-down of a slot from the operating system. But contemporary implementations often omit those elements and exclusively use surprise removal of devices.

When a device is yanked out, pciehp asks its driver to unbind, then brings down the slot. But, until that happens, read requests to the device will time out after (typically) 17ms, and return a fabricated "all ones" response. The timeouts slow down the requesting task and, if the fabricated response is mistaken for real data, the task may crash or get stuck in an infinite loop. Drivers therefore need to validate data read from a device and, in particular, check for all ones in cases when that is not a valid response. A common idiom is to call pci_device_is_present(), which reads the vendor ID register and checks if it is all ones. However that is not a panacea; if a PCIe uncorrectable error occurs, the device may likewise respond with all ones, but revert to valid responses if the error can be recovered. Moreover, all ones is returned for unsupported requests or read requests that are inside a bridge's address window but outside any of the target device's base address registers (BARs). The only entity that can identify removal authoritatively and unambiguously is pciehp.

Many drivers — and even the PCI core — do not check every read for an all-ones response. Engineers working on Facebook's "Lightning" storage architecture had to learn this the hard way [YouTube]. Surprise-removing an entire array of NVMe drives took many seconds and occasionally caused machine-check exceptions. It was so slow that the driver would find itself talking with a new device plugged into that slot before the processing of the previous removal had completed. One of the outcomes was a patch set by Keith Busch in Linux 4.12 to have pciehp set a flag on surprise-removed devices and skip access to them in a few strategic places in the PCI core. This was sufficient to speed up removal to microseconds. In particular, pci_device_is_present() is short-circuited to return false if the flag is set. Before, if the device was quickly swapped with another one, the function incorrectly returned true for the removed device once the vendor ID of the new device became readable.

At Benjamin Herrenschmidt's behest, another patch by Busch is now queued for 4.20 to unify the flag with the error state of a PCI device. The error state allows distinguishing whether the device is temporarily inaccessible after an unrecoverable error but has a chance to come back, or whether it has failed permanently. Drivers can either check the error_state member in struct pci_dev directly or call pci_channel_offline() to determine the accessibility of a device.

However, Helgaas has voiced misgivings about the flag in general. For one, the flag is set asynchronously, so there is a latency between the device being removed and the flag being set. Driver authors need to be wary that, even if the device seems present per the flag, it may no longer be there. Conversely, if set, the flag does provide a definitive indication that any further device access is futile and can be skipped. The flag therefore does not relieve driver authors from validating responses from the device but, once set, it serves as a cache and avoids ambiguous vendor ID checks. In short, the problem is mitigated but not solved perfectly. A perfect solution seems nearly impossible though; we cannot acquire a mutex on the user to prevent them from yanking a device and we cannot check for a presence change after every device access for performance reasons. Austin Bolen pointed out that a new PCIe extension called "root port programmed I/O" allows for synchronous exception handling of failed device accesses and thus for a seemingly perfect solution, but "this feature won't be available in products for some time and is optional".

A second concern Helgaas has with the flag is that it may obscure bugs that occur upon surprise removal but which become less visible when the flag is set, complicating their discovery and resolution. For example, a search for the advanced error recovery (AER) capability on device removal caused numerous configuration-space accesses and, before introduction of the flag, was noticeable through a significant slowdown upon surprise removal. But the proper solution was to cache the position of the AER capability, rather than paper over the issue by skipping the configuration accesses using the flag.

Error-handling integration

The move to threaded interrupts also eases integrating pciehp with the handling of PCIe uncorrectable errors: when such an error occurs at or below a hotplug port, it may cause a link-down event as a side effect. But, sometimes the error can be recovered through software, by performing a secondary bus reset, for example. In this case, it is undesirable for pciehp to react to the link-down event by unbinding the attached devices from their drivers and bringing down the slot. Rather, it should wait to see whether the error can be recovered and ignore the link event if so. To this end, Busch and Sinan Kaya are currently working on patches to tie in pciehp with the AER and downstream port containment service drivers.

Moving BARs

A PCIe device is allocated memory ranges for memory-mapped I/O that are configured in the device's BARs. The memory ranges are usually predefined by the BIOS, but Linux may move them around on enumeration. Bridges upstream of a PCI device have their address windows configured in such a way that transactions targeting a device's BAR are routed correctly.

When devices are hot-added, their memory requirements may not fit into the windows of their upstream bridges, necessitating a reorganization of resources: adjacent BARs need to be moved and bridge windows adjusted. MacOS gained this capability in 2013 for improved Thunderbolt support and calls it "PCIe pause". Drivers are told to pause I/O to affected devices; on unpause, the BARs may have changed and drivers are required to reconfigure their devices and update internal data structures as necessary.

Sergey Miroshnichenko recently submitted initial patches to bring moving of BARs to Linux, to positive reactions. The patches use existing callbacks to pause access to a device before a PCI reset and restart access afterward. Drivers will have to opt into BAR movement. MacOS supports reallocation of PCI bus numbers and message-signaled interrupts in addition to BARs; Miroshnichenko is looking into adding that in a future revision of the patch set.

In summary

The SD Card 7.0 specification announced in June is based on PCIe and NVMe and may greatly extend the usage of PCIe hotplug in consumer electronics devices. The ongoing activities in the kernel therefore seem to come at the right time and promise to yield an up-to-par PCIe hotplug infrastructure over the coming releases.

Index entries for this article
Kernel	Hotplug
Kernel	PCIe
GuestArticles	Wunner, Lukas

(Log in to post comments)

The modernization of PCIe hotplug in Linux

Posted Oct 8, 2018 23:26 UTC (Mon) by davidstrauss (guest, #85867) [Link]

I've been struggling with this in my own adventures with using an eGPU over Thunderbolt. Unfortunately, I expect that PCIe-level hotplug improvements will only be half of the battle, as kernel modules for certain device types (certainly GPUs) respond violently to having their backing devices disappear, even if it happens cleanly at the PCIe level. Manually unloading the kernel modules before removing the device sometimes works, but it's often the case that they refuse to be unloaded because they're in-use by other kernel modules or userland. Perhaps there's a systematic way to hunt down and remove anything blocking the module unloading to prepare for disconnection, but I'm not convinced that's the right answer for user experience.

For PCIe hotplug (including removal) to "just work," kernel modules are going to need to get a lot more accustomed to an inversion of dependency removal. As suggested in "surprise removal," the traditional expectation has been an orderly retirement of consumers before their corresponding producers (to use the API terminology). We now have to accommodate the opposite: surprise electrical removal, followed by surprise PCIe device removal, followed by a kernel module surprising its consumers with an inability to service them, followed by graceful handling on downward. The ability to expose practically any PCIe device over Thunderbolt has changed the scope of hot removal support from a few types of server-style components (e.g. NVMe) to literally anything that may run over PCIe.

I would love to reach the point where I can safely and consistently unplug a Thunderbolt eGPU from my Linux machine -- without shutting it down first.

The modernization of PCIe hotplug in Linux

Posted Oct 9, 2018 14:56 UTC (Tue) by MarcB (subscriber, #101804) [Link]

I think it is not even half the battle. Even if the kernel drivers are improved, userspace will still be a big issue. "Some of the memory you mapped just physically disappeared" seems not easy to handle (perhaps in some cases it is, but certainly not always). In fact, it could very well be impossible in many cases.

I can easily see developers responding to that request with "just don't do that".

The modernization of PCIe hotplug in Linux

Posted Oct 9, 2018 17:38 UTC (Tue) by jg (guest, #17537) [Link]

In the case of screens, the hot plug case already has to be handed in user space, as you plug external monitors/projectors in routinely. The window system and UI already has support to detect and reconfigure your applications on such hotplug events. Works fine today on my laptop with USB3 dongles driving external monitors....

The modernization of PCIe hotplug in Linux

Posted Oct 9, 2018 17:58 UTC (Tue) by davidstrauss (guest, #85867) [Link]

Code accessing USB devices already anticipates this sort of disconnection; code accessing PCIe devices often does not. I'm not removing a display; I'm removing a GPU. Suddenly unplugging my Thunderbolt eGPU -- even when I'm not actually running any apps or displays with it -- always freezes my system. I've isolated the cause down to the amdgpu (and related) modules not tolerating disappearance of a GPU, as unloading those modules (which rarely works) prevents the freeze on removal.

This is not a unique deficiency of Linux; my understanding is that macOS and Windows have similar effects for many devices. I'm just saying that support for hot removal of PCIe devices requires high-level support as much as low-level. The low-level work is critical, though.

The modernization of PCIe hotplug in Linux

Posted Oct 10, 2018 18:59 UTC (Wed) by jg (guest, #17537) [Link]

Again, this situation is already (at least partially, IIRC) dealt with in the code base. There are these weird laptops that have a GPU to augment the (much lower performance one) in the laptop; you want to be able to switch back and forth (the high power GPU gets powered down).

I don't remember if all the work has been done in the X server, but certainly the applications and window managers already "do the right thing" for the most part. It gives me great pleasure that I now usually have less trouble handling adding displays/projectors than many/most Windoze users do.

Please don't declare the problem as unsolvable in advance. If it isn't completely solved, it's mostly solved for displays. We started working on these issues with the xrandr extension almost 20 years ago (which keeps getting augmented with time: thanks keithp!). The fundamental shift architecturally happened there, with applications no longer able to presume their root window was immutable.
- Jim

The modernization of PCIe hotplug in Linux

Posted Oct 10, 2018 19:03 UTC (Wed) by davidstrauss (guest, #85867) [Link]

> Please don't declare the problem as unsolvable in advance.

I'm not sure where you got the impression I was making this claim. I started the thread to highlight that the value of PCIe hotplug isn't realized for certain device types without work higher in the stack.

The modernization of PCIe hotplug in Linux

Posted Oct 9, 2018 19:43 UTC (Tue) by gioele (subscriber, #61675) [Link]

> I think it is not even half the battle. Even if the kernel drivers are improved, userspace will still be a big issue. "Some of the memory you mapped just physically disappeared" seems not easy to handle

Can userspace talk directly to PCIe memory?

And couldn't the userspace program just have a SIGBUS handler that notices the missing hardware, cleans up the state and goes back to normality?

The modernization of PCIe hotplug in Linux

Posted Oct 10, 2018 0:47 UTC (Wed) by excors (subscriber, #95769) [Link]

Vulkan lets you explicitly allocate device-local memory (i.e. on the GPU) then map it cache-coherently into userspace.

It has the concept of "lost" devices, which causes most API calls to return errors and the application is able to clean up and try again with a new graphics device. (Of course the application might choose to just crash instead). The spec says "The host address space corresponding to device memory mapped using vkMapMemory is still valid, and host memory accesses to these mapped regions are still valid, but the contents are undefined", which I guess means the kernel can't unmap the memory when it detects the GPU has gone away (because that would likely crash the application), but could map the whole range onto a single dummy page or something.

The modernization of PCIe hotplug in Linux

Posted Oct 10, 2018 16:38 UTC (Wed) by keithbusch (guest, #101531) [Link]

It is still not too hard to imagine deadlocking your machine due to the "Big PCI Lock", accessed through pci_lock_rescan_remove(). Various removal scenarios end up with circular lock dependencies. Ultimately I'd like to replace this coarse locking and with something more fine grain, or even move to lockless enumerations. This is proving very difficult as much of the pci bus driver's internal states are public, and many arch specific code access them directly, and without reference counting the devices they're using. But I think it's a worthwhile long-term goal.