Bug 215590 - PCI link is down after suspend after we try to plug device
Summary: PCI link is down after suspend after we try to plug device
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: x86-64 Linux
Importance: P1 high
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-02-11 14:29 UTC by Rahul Kumar
Modified: 2023-04-10 20:28 UTC
CC List: 6 users

See Also:
Kernel Version: 5.4 and 5.11
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg log (180.49 KB, text/plain), 2022-02-11 14:29 UTC, Rahul Kumar
kern log with all 3 patches (393.99 KB, text/plain), 2022-02-15 07:34 UTC, Rahul Kumar
lspci_before_suspend (4.83 KB, text/plain), 2022-02-15 07:35 UTC, Rahul Kumar
lspci_after_suspend (4.82 KB, text/plain), 2022-02-15 07:36 UTC, Rahul Kumar
Force reset bus (7.17 KB, application/mbox), 2022-06-14 16:49 UTC, Andrey Grodzovsky
force SBR (1.18 KB, application/mbox), 2022-06-14 16:53 UTC, Andrey Grodzovsky
lspci before S3 (61.19 KB, text/plain), 2022-06-15 19:41 UTC, Andrey Grodzovsky
lspci after S3 (49.13 KB, text/plain), 2022-06-15 19:42 UTC, Andrey Grodzovsky
lspci after S3 with Anatoli's work around (49.13 KB, text/plain), 2022-06-15 19:42 UTC, Andrey Grodzovsky
dmesg 5.11 with work around (145.62 KB, text/plain), 2022-06-16 14:13 UTC, Andrey Grodzovsky
Here is the dmesg 5.11 without work around (150.86 KB, text/plain), 2022-06-16 16:52 UTC, Anatoli Antonovitch
dmesg 5.11 with work around and subsequent hot plug of same card (387.81 KB, text/plain), 2022-06-17 17:14 UTC, Andrey Grodzovsky
PCI/PME: Fix race for NV23 on PME polling (822 bytes, application/mbox), 2022-06-17 17:18 UTC, Andrey Grodzovsky
PCI: Add support for PME state on suspended hot-add devices (2.18 KB, application/mbox), 2022-06-17 17:18 UTC, Andrey Grodzovsky
PCI: Fixes state transitions of NV23 dGPU from D3cold into D0 (3.66 KB, application/mbox), 2022-06-17 17:21 UTC, Andrey Grodzovsky
The dmesg for S3 resume time with AER. (192.05 KB, text/plain), 2022-10-06 14:25 UTC, Anatoli Antonovitch
The dmesg for S3 resume time with AER. (108.44 KB, text/plain), 2022-10-06 14:52 UTC, Anatoli Antonovitch
The dmesg for S3 resume time with AER. (135.18 KB, text/plain), 2022-10-06 16:59 UTC, Anatoli Antonovitch
The dmesg 6.1.0-rc5 for S3 resume time with AER (1.34 MB, text/plain), 2022-11-16 16:53 UTC, Anatoli Antonovitch
The dmesg 6.1.0-rc5 with reworked down_write_nested on ctrl->reset_lock, and subsequent hot plug of same AMD card. (1.86 MB, text/plain), 2022-11-16 16:57 UTC, Anatoli Antonovitch
The patch to resolve the issue for secondary bus reset. (3.92 KB, patch), 2022-11-16 17:02 UTC, Anatoli Antonovitch
The dmesg 6.2.0-rc2 for S3 resume time with AER (1.33 MB, text/plain), 2023-01-04 19:46 UTC, Anatoli Antonovitch
The dmesg 6.2.0-rc2 with reworked down_write_nested on ctrl->reset_lock, and subsequent hot plug of same AMD card. (2.50 MB, text/plain), 2023-01-04 19:47 UTC, Anatoli Antonovitch
The dmesg 6.2.0-rc4 for S3 resume with AER (1.46 MB, text/plain), 2023-01-20 21:25 UTC, Anatoli Antonovitch
The dmesg 6.2.0-rc5 for S3 resume with AER (2.22 MB, text/plain), 2023-01-23 19:20 UTC, Anatoli Antonovitch
The patch for testing (16.05 KB, patch), 2023-01-23 19:22 UTC, Anatoli Antonovitch
[PATCH v2] PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock (7.03 KB, patch), 2023-04-09 08:41 UTC, Lukas Wunner
The dmesg for S3 resume time with AER (1.78 MB, text/plain), 2023-04-10 20:26 UTC, Anatoli Antonovitch

Description Rahul Kumar 2022-02-11 14:29:33 UTC
Created attachment 300434 [details]
dmesg log

PCI link is not coming up

We have been facing a race condition between the AER and hotplug stacks, which happens after suspend when the device is unplugged. To avoid it, we have used the three patches below.

1 - To avoid a lockdep splat

https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/

2 - AER sleep transition patch 
https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/

3 - DPC sleep transition patch
https://lore.kernel.org/linux-pci/20220127025418.1989642-2-kai.heng.feng@canonical.com/


Additional info: when we unplug via sysfs, we don't see the issue.
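For context, here is a condensed sketch of the two code paths whose lock ordering conflicts. This is paraphrased from the v5.11 sources (drivers/pci/hotplug/pciehp_hpc.c and drivers/pci/pci.c); it is not compilable standalone and details are omitted:

/*
 * Path A: the pciehp IRQ thread takes ctrl->reset_lock and, while
 * bringing down the slot, ends up needing the device_lock to unbind
 * the GPU driver.
 */
static irqreturn_t pciehp_ist(int irq, void *dev_id)
{
	struct controller *ctrl = (struct controller *)dev_id;

	down_read(&ctrl->reset_lock);                /* lock 1 */
	pciehp_handle_presence_or_link_change(ctrl, events);
	/* -> pciehp_unconfigure_device()
	 * -> pci_stop_and_remove_bus_device()
	 * -> device_release_driver() -> device_lock()   lock 2 */
	up_read(&ctrl->reset_lock);
	return IRQ_HANDLED;
}

/*
 * Path B: a reset initiated by AER (or via sysfs) takes the locks in
 * the opposite order, hence the AB-BA deadlock potential.
 */
	pci_dev_lock(dev);                           /* lock 2 */
	/* -> pci_reset_function() / pci_slot_reset()
	 * -> pciehp_reset_slot()
	 * -> down_write(&ctrl->reset_lock)              lock 1 */
	pci_dev_unlock(dev);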
Comment 1 Rahul Kumar 2022-02-15 07:34:15 UTC
Created attachment 300465 [details]
kern log with all 3 patches
Comment 2 Rahul Kumar 2022-02-15 07:35:30 UTC
Created attachment 300466 [details]
lspci_before_suspend
Comment 3 Rahul Kumar 2022-02-15 07:36:10 UTC
Created attachment 300467 [details]
lspci_after_suspend
Comment 4 Andrey Grodzovsky 2022-06-14 16:48:25 UTC
We have a workaround for this, but it is obviously not a generic solution. It would be helpful if we could get advice on what the generic solution is here.
Comment 5 Andrey Grodzovsky 2022-06-14 16:49:31 UTC
Created attachment 301174 [details]
Force reset bus

Here is the workaround
Comment 6 Andrey Grodzovsky 2022-06-14 16:53:20 UTC
Created attachment 301175 [details]
force SBR

The actual patch - the previous one was wrong.
Comment 7 Lukas Wunner 2022-06-15 09:54:20 UTC
Would it be possible for you to provide dmesg for a contemporary kernel such as v5.18? A lot has changed since v5.4 and it's difficult to determine what the behavior might be today compared to the v5.4.2 kernel you've used.
Comment 8 Andrey Grodzovsky 2022-06-15 15:13:44 UTC
(In reply to Lukas Wunner from comment #7)
> Would it be possible for you to provide dmesg for a contemporary kernel such
> as v5.18? A lot has changed since v5.4 and it's difficult to determine what
> the behavior might be today compared to the v5.4.2 kernel you've used.

Unfortunately the system can only run on the 5.4 and 5.11 kernels to support hotplug. Would a log from 5.11 help?
Comment 9 Lukas Wunner 2022-06-15 15:30:52 UTC
Hm, why is that? A v5.11 kernel is certainly much closer to what we have today, so yes, that would help. Please add the following to the kernel command line:

log_buf_len=10M pciehp.pciehp_debug=1 dyndbg="file pciehp* +p" ignore_loglevel
Comment 10 Andrey Grodzovsky 2022-06-15 15:41:08 UTC
(In reply to Lukas Wunner from comment #9)
> Hm, why is that? A v5.11 kernel is certainly much closer to what we have
> today, so yes, that would help. Please add the following to the kernel
> command line:
> 
> log_buf_len=10M pciehp.pciehp_debug=1 dyndbg="file pciehp* +p"
> ignore_loglevel

It's just some internal constraint by the customer, who 'likes' this kernel for its stability, I guess?

Anyway - I will provide the logs with the required info from the 5.11 kernel soon. This kernel also includes your patch 'PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock'.
Comment 11 Lukas Wunner 2022-06-15 16:36:02 UTC
If you could provide output of "lspci -vvv" (a) before suspend, (b) after suspend (with my patch) and (c) after suspend (with my patch + Anatoli's workaround), that would be ideal. Thanks!
Comment 12 Andrey Grodzovsky 2022-06-15 19:41:57 UTC
Created attachment 301183 [details]
lspci before S3
Comment 13 Andrey Grodzovsky 2022-06-15 19:42:21 UTC
Created attachment 301184 [details]
lspci after S3
Comment 14 Andrey Grodzovsky 2022-06-15 19:42:46 UTC
Created attachment 301185 [details]
lspci after S3 with Anatoli's work around
Comment 15 Andrey Grodzovsky 2022-06-15 19:43:27 UTC
(In reply to Lukas Wunner from comment #11)
> If you could provide output of "lspci -vvv" (a) before suspend, (b) after
> suspend (with my patch) and (c) after suspend (with my patch + Anatoli's
> workaround), that would be ideal. Thanks!

Done in 3 files above - see their descriptions at the top.

Thanks
Comment 16 Lukas Wunner 2022-06-16 13:33:49 UTC
The dmesg output (https://bugzilla.kernel.org/attachment.cgi?id=300465) shows tons of AER messages right after the GPU driver starts probing. It looks like the electrical connection to the expansion card is flaky.

On resume, there's a fatal uncorrectable error at 67.797528. That's obviously the reason why the link fails to come up. AER then tries and fails to reset the port:

[   69.684921] pcieport 0000:00:01.1: AER: Root Port link has been reset
[   69.691438] pcieport 0000:00:01.1: AER: Device recovery failed

The first message comes from aer_root_reset() and the second from pcie_do_recovery(). You should try to find out why recovery failed. That's ultimately the reason why you need to explicitly perform a Secondary Bus Reset in the workaround patch.
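For orientation, a heavily condensed paraphrase of the v5.11-era recovery flow in drivers/pci/pcie/err.c and drivers/pci/pcie/aer.c, showing where the two messages come from (not the literal code, which has more steps and error handling):

pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
		pci_channel_state_t state,
		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
{
	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;

	if (state == pci_channel_io_frozen) {
		pci_walk_bus(bus, report_frozen_detected, &status);
		status = reset_subordinates(dev);	/* aer_root_reset(),
							 * which logs "Root Port
							 * link has been reset" */
		if (status != PCI_ERS_RESULT_RECOVERED)
			goto failed;
	}
	/* ... mmio_enabled and resume callbacks ... */
	pci_info(dev, "device recovery successful\n");
	return status;
failed:
	pci_info(dev, "device recovery failed\n");	/* the failure seen here */
	return status;
}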

The code in pcie_do_recovery() has changed significantly since v5.4. What we had in v5.11 is much closer to what we have today, so dmesg output for v5.11 would definitely help.

When I submitted the comment yesterday, I got a "mid-air collision" error from bugzilla. I'm not sure if I've accidentally deleted a comment from you when submitting mine, please check. Sorry for the inconvenience.
Comment 17 Andrey Grodzovsky 2022-06-16 14:13:28 UTC
Created attachment 301186 [details]
dmesg 5.11 with work around

Thanks for the analysis, Lukas. I am attaching a dmesg from 5.11 with the workaround for you to look at. Let me know if you also need a dmesg without the workaround.

Regarding AER - our SBIOS is a custom test SBIOS which generates AER errors on purpose to test the system's behaviour under this condition. So it's not physically flaky connectivity but an intended hack.
Comment 18 Rahul Kumar 2022-06-16 14:32:48 UTC
(In reply to Andrey Grodzovsky from comment #17)
> Created attachment 301186 [details]
> dmesg 5.11 with work around
> 
> Thanks for the analysis, Lukas. I am attaching a dmesg from 5.11 with the
> workaround for you to look at. Let me know if you also need a dmesg
> without the workaround.
> 
> Regarding AER - our SBIOS is a custom test SBIOS which generates AER
> errors on purpose to test the system's behaviour under this condition.
> So it's not physically flaky connectivity but an intended hack.

Just a correction here: the SBIOS seems to enable AER only, the errors are not purposely generated; the errors we are seeing are real ones. AER treats them as transaction errors because we are hot-adding the card to the slot on the go.
Comment 19 Lukas Wunner 2022-06-16 14:40:25 UTC
Yes, dmesg output without the workaround would be useful. With usec timestamps, please. (The previous log only contains timestamps with 1-second resolution.)
Comment 20 Anatoli Antonovitch 2022-06-16 16:52:44 UTC
Created attachment 301188 [details]
Here is the dmesg 5.11 without work around
Comment 21 Andrey Grodzovsky 2022-06-16 16:59:26 UTC
(In reply to Lukas Wunner from comment #19)
> Yes dmesg output without the workaround would be useful. With usec
> timestamps please. (The previous log only contains timestamps with 1 second
> resolution.)

Hey. Log attached above.
Comment 22 Lukas Wunner 2022-06-17 09:59:59 UTC
@Andrey Grodzovsky:

You wrote that with the workaround patch applied, you're able to re-plug the extension card after resume and the hotplug port is properly brought up. I'm not seeing that in dmesg_wa.txt. I can only see there that pciehp brings down the slot on resume because the extension card is gone, but there are no log entries showing that the slot is brought up again.

Upon resume, a race is visible between pciehp and AER:

In dmesg_no_wa.txt, the race is won by pciehp. It detects that the slot is empty and kicks the IRQ thread pciehp_ist() to bring the slot down:

[  159.973097] pcieport 0000:00:01.1: pciehp: pciehp_check_link_active: lnk_status = 1101
[  159.973267] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[  159.973288] pcieport 0000:00:01.1: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:01:00

These log messages correspond to the following call stack:

pciehp_resume()
  pciehp_check_presence()
    pciehp_card_present_or_link_active()
      pciehp_check_link_active()
    pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
...
pciehp_ist()
  pciehp_handle_presence_or_link_change()

And pciehp_ist() is holding ctrl->reset_lock, which prevents the concurrent AER thread from performing a reset while the slot is brought down:

[  159.973365] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[  159.973478] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[  159.973481] pcieport 0000:00:01.1:   device [1022:15d3] error status/mask=00004000/04400000
[  159.973484] pcieport 0000:00:01.1:    [14] CmpltTO                (First)
[  170.387351] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[  170.387367] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[  170.387370] pcieport 0000:00:01.1:   device [1022:15d3] error status/mask=00004000/04400000
[  170.387372] pcieport 0000:00:01.1:    [14] CmpltTO                (First)
[  171.115213] pci 0000:03:00.0: AER: can't recover (no error_detected callback)

The last message above is a little concerning: At this point, the GPU driver has already been unbound. In report_error_detected(), we return PCI_ERS_RESULT_NO_AER_DRIVER if the device is not bound to a driver or if its io_state is perm_failure. Both conditions are true here. I think we need to change the logic in report_error_detected() to return PCI_ERS_RESULT_NONE in those cases and thus allow forward progress. For unbound devices, it may be necessary to call pci_restore_state() after the Secondary Bus Reset.
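For reference, the v5.11 logic in question, condensed from report_error_detected() in drivers/pci/pcie/err.c (a simplified paraphrase; consult the tree for the exact code):

static int report_error_detected(struct pci_dev *dev,
				 pci_channel_state_t state,
				 enum pci_ers_result *result)
{
	pci_ers_result_t vote;

	device_lock(&dev->dev);
	if (!pci_dev_set_io_state(dev, state) ||	/* e.g. perm_failure */
	    !dev->driver ||				/* unbound */
	    !dev->driver->err_handler ||
	    !dev->driver->err_handler->error_detected) {
		/* A non-bridge without a callback blocks recovery: */
		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
			pci_info(dev, "can't recover (no error_detected callback)\n");
		} else {
			vote = PCI_ERS_RESULT_NONE;
		}
	} else {
		vote = dev->driver->err_handler->error_detected(dev, state);
	}
	pci_uevent_ers(dev, vote);
	*result = merge_result(*result, vote);
	device_unlock(&dev->dev);
	return 0;
}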

My expectation would be that recovery fails due to the GPU device being unbound and permanently failed. Curiously, the log messages claim success:

[  171.115953] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0
[  172.136373] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[  172.136387] pcieport 0000:00:01.1: AER: Root Port link has been reset (0)
[  172.136407] pcieport 0000:00:01.1: AER: device recovery successful

That's very strange. Have you modified the logic in report_error_detected()? Any custom patches in your v5.11 tree? What's going on here?

Comparing all of the above to dmesg_wa.txt, it's apparent that the race is won by AER there instead of pciehp. AER resets the slot right after coming out of suspend:

[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1:   device [1022:15d3] error status/mask=00004000/04400000
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1:    [14] CmpltTO                (First)
[Wed Jun 15 15:29:39 2022] [drm] PCI error: detected callback, state(2)!!
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 0

Only then does pciehp get a chance to check slot status and bring it down:

[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp: pciehp_check_link_active: lnk_status = 1101
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:01:00

However, there's something strange going on: While pciehp is still busy bringing down the slot, a slot reset is performed concurrently:

[Wed Jun 15 15:29:40 2022] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
[Wed Jun 15 15:29:40 2022] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot: SLOTCTRL 70 write cmd 1008
[Wed Jun 15 15:29:40 2022] pcieport 0000:00:01.1: AER: Root Port link has been reset (0)
[Wed Jun 15 15:29:40 2022] amdgpu 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)
...
[Wed Jun 15 15:29:51 2022] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (1332715520).
[Wed Jun 15 15:29:51 2022] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.

That is supposed to be prevented by the reset_lock. I don't see how this can happen.
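For reference, the serialization that should prevent this, condensed from v5.11 drivers/pci/hotplug/pciehp_hpc.c (simplified sketch, not the literal code):

/* IRQ thread: slot bring-up/down runs under the read side ... */
	down_read(&ctrl->reset_lock);
	if (events & DISABLE_SLOT)
		pciehp_handle_disable_request(ctrl);
	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
		pciehp_handle_presence_or_link_change(ctrl, events);
	up_read(&ctrl->reset_lock);

/* ... while the Secondary Bus Reset path takes the write side, so the
 * two should be mutually exclusive. The "SLOTCTRL 70 write cmd 0/1008"
 * messages are the mask/unmask writes around the reset. */
int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe)
{
	struct controller *ctrl = to_ctrl(hotplug_slot);

	down_write(&ctrl->reset_lock);
	pcie_write_cmd(ctrl, 0, ctrl_mask);	/* mask PDC/DLLSC events */
	rc = pci_bridge_secondary_bus_reset(ctrl->pcie->port);
	pcie_write_cmd_nowait(ctrl, ctrl_mask, ctrl_mask);	/* unmask */
	up_write(&ctrl->reset_lock);
	return rc;
}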

So the bottom line is:
(a) We may need to change the handling of unbound and permanently failed devices in report_error_detected().
(b) I can't really make sense of the logs because the messages do not correlate with the code. Also, neither dmesg_wa.txt nor dmesg_no_wa.txt show that the extension card is re-plugged and the slot is brought up again.

Both logs show a bunch of pciehp_check_link_active messages at the bottom; I'm unsure where they're originating from, but the DLLLA bit is not set in any of those messages, so the link never comes up. Also, both lspci_after_s3_wa.txt and lspci_after_s3.txt show DLActive- for the hotplug port 00:01.1.
Comment 23 Andrey Grodzovsky 2022-06-17 17:12:54 UTC
(In reply to Lukas Wunner from comment #22)
> @Andrey Grodzovsky:
> 
> You wrote that with the workaround patch applied, you're able to re-plug the
> extension card after resume and the hotplug port is properly brought up. I'm
> not seeing that in dmesg_wa.txt. I can only see there that pciehp brings
> down the slot on resume because the extension card is gone, but there are no
> log entries showing that the slot is brought up again.
> 
> Upon resume, a race is visible between pciehp and AER:
> 
> In dmesg_no_wa.txt, the race is won by pciehp. It detects that the slot is
> empty and kicks the IRQ thread pciehp_ist() to bring the slot down:
> 
> [  159.973097] pcieport 0000:00:01.1: pciehp: pciehp_check_link_active:
> lnk_status = 1101
> [  159.973267] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
> [  159.973288] pcieport 0000:00:01.1: pciehp: pciehp_unconfigure_device:
> domain:bus:dev = 0000:01:00
> 
> These log messages correspond to the following call stack:
> 
> pciehp_resume()
>   pciehp_check_presence()
>     pciehp_card_present_or_link_active()
>       pciehp_check_link_active()
>     pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
> ...
> pciehp_ist()
>   pciehp_handle_presence_or_link_change()
> 
> And pciehp_ist() is holding ctrl->reset_lock, which prevents the concurrent
> AER thread from performing a reset while the slot is brought down:
> 
> [  159.973365] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal)
> error received: 0000:00:01.0
> [  159.973478] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected
> (Fatal), type=Transaction Layer, (Requester ID)
> [  159.973481] pcieport 0000:00:01.1:   device [1022:15d3] error
> status/mask=00004000/04400000
> [  159.973484] pcieport 0000:00:01.1:    [14] CmpltTO                (First)
> [  170.387351] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal)
> error received: 0000:00:01.0
> [  170.387367] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected
> (Fatal), type=Transaction Layer, (Requester ID)
> [  170.387370] pcieport 0000:00:01.1:   device [1022:15d3] error
> status/mask=00004000/04400000
> [  170.387372] pcieport 0000:00:01.1:    [14] CmpltTO                (First)
> [  171.115213] pci 0000:03:00.0: AER: can't recover (no error_detected
> callback)
> 
> The last message above is a little concerning: At this point, the GPU driver
> has already been unbound. In report_error_detected(), we return
> PCI_ERS_RESULT_NO_AER_DRIVER if the device is not bound to a driver or if
> its io_state is perm_failure. Both conditions are true here. I think we need
> to change the logic in report_error_detected() to return PCI_ERS_RESULT_NONE
> in those cases and thus allow forward progress. For unbound devices, it may
> be necessary to call pci_restore_state() after the Secondary Bus Reset.
> 
> My expectation would be that recovery fails due to the GPU device being
> unbound and permanently failed. Curiously, the log messages claim success:
> 
> [  171.115953] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot: SLOTCTRL 70
> write cmd 0
> [  172.136373] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot: SLOTCTRL 70
> write cmd 1008
> [  172.136387] pcieport 0000:00:01.1: AER: Root Port link has been reset (0)
> [  172.136407] pcieport 0000:00:01.1: AER: device recovery successful
> 
> That's very strange. Have you modified the logic in report_error_detected()?
> Any custom patches in your v5.11 tree? What's going on here?


We haven't touched the logic of report_error_detected() and it's exactly the same as in the 5.11 kernel: https://elixir.bootlin.com/linux/v5.11/source/drivers/pci/hotplug/pciehp_hpc.c#L837

Yes - there are a few, but they are not related to the pciehp code. I will attach them here just to be sure, so you can take a look.

> 
> Comparing all of the above to dmesg_wa.txt, it's apparent that the race is
> won by AER there instead of pciehp. AER resets the slot right after coming
> out of suspend:
> 
> [Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: AER: Multiple Uncorrected
> (Fatal) error received: 0000:00:01.0
> [Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: PCIe Bus Error:
> severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> [Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1:   device [1022:15d3] error
> status/mask=00004000/04400000
> [Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1:    [14] CmpltTO           
> (First)
> [Wed Jun 15 15:29:39 2022] [drm] PCI error: detected callback, state(2)!!
> [Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot:
> SLOTCTRL 70 write cmd 0
> 
> Only then does pciehp get a chance to check slot status and bring it down:
> 
> [Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp:
> pciehp_check_link_active: lnk_status = 1101
> [Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp: Slot(0): Card not
> present
> [Wed Jun 15 15:29:39 2022] pcieport 0000:00:01.1: pciehp:
> pciehp_unconfigure_device: domain:bus:dev = 0000:01:00
> 
> However, there's something strange going on: While pciehp is still busy
> bringing down the slot, a slot reset is performed concurrently:
> 
> [Wed Jun 15 15:29:40 2022] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses
> VM inv eng 6 on hub 1
> [Wed Jun 15 15:29:40 2022] pcieport 0000:00:01.1: pciehp: pciehp_reset_slot:
> SLOTCTRL 70 write cmd 1008
> [Wed Jun 15 15:29:40 2022] pcieport 0000:00:01.1: AER: Root Port link has
> been reset (0)
> [Wed Jun 15 15:29:40 2022] amdgpu 0000:03:00.0: can't change power state
> from D3cold to D0 (config space inaccessible)
> ...
> [Wed Jun 15 15:29:51 2022] amdgpu 0000:03:00.0: amdgpu:
> amdgpu_device_ip_resume failed (1332715520).
> [Wed Jun 15 15:29:51 2022] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing
> device.
> 
> That is supposed to be prevented by the reset_lock. I don't see how this can
> happen.


I will try to look more into this.

> 
> So the bottom line is:
> (a) We may need to change the handling of unbound and permanently failed
> devices in report_error_detected().
> (b) I can't really make sense of the logs because the messages do not
> correlate with the code. Also, neither dmesg_wa.txt nor dmesg_no_wa.txt show
> that the extension card is re-plugged and the slot is brought up again.
> 
> Both logs show a bunch of pciehp_check_link_active messages at the bottom;
> I'm unsure where they're originating from, but the DLLLA bit is not set in
> any of those messages, so the link never comes up. Also, both
> lspci_after_s3_wa.txt and lspci_after_s3.txt show DLActive- for the hotplug
> port 00:01.1.

I am reattaching the log from 5.11 with the workaround and a subsequent successful hot plug - please take a look.
Comment 24 Andrey Grodzovsky 2022-06-17 17:14:33 UTC
Created attachment 301190 [details]
dmesg 5.11 with work around and subsequent hot plug of same card

Attaching the log with hotplug after resume as requested.
Comment 25 Andrey Grodzovsky 2022-06-17 17:18:09 UTC
Created attachment 301191 [details]
PCI/PME: Fix race for NV23 on PME polling
Comment 26 Andrey Grodzovsky 2022-06-17 17:18:45 UTC
Created attachment 301192 [details]
PCI: Add support for PME state on suspended hot-add devices
Comment 27 Andrey Grodzovsky 2022-06-17 17:21:24 UTC
Created attachment 301193 [details]
PCI: Fixes state transitions of NV23 dGPU from D3cold into D0
Comment 28 Andrey Grodzovsky 2022-06-17 17:23:04 UTC
Attached above are the 3 other PCI workarounds we still have in our branch - I think they are not related to our issue? Besides these, your deadlock fix patch 'PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock' is the only other custom PCI patch we have in that branch.
Comment 29 Andrey Grodzovsky 2022-07-13 15:02:20 UTC
Lukas - just a gentle ping to recheck the logs we attached.
Comment 30 Anatoli Antonovitch 2022-10-06 14:25:58 UTC
Created attachment 302942 [details]
The dmesg for S3 resume time with AER.

Here is the dmesg for 5.18 kernel.
Comment 31 Anatoli Antonovitch 2022-10-06 14:52:38 UTC
Created attachment 302943 [details]
The dmesg for S3 resume time with AER.

Here is the dmesg for 5.18 kernel with blacklisted amdgpu.
Comment 32 Anatoli Antonovitch 2022-10-06 16:59:22 UTC
Created attachment 302945 [details]
The dmesg for S3 resume time with AER.

The issue has been observed in the latest kernel 6.0.
Comment 33 Anatoli Antonovitch 2022-11-16 16:53:42 UTC
Created attachment 303186 [details]
The dmesg 6.1.0-rc5 for S3 resume time with AER

The issue has been observed in the latest kernel with amdgpu blacklisted.
Comment 34 Anatoli Antonovitch 2022-11-16 16:57:05 UTC
Created attachment 303187 [details]
The dmesg 6.1.0-rc5 with reworked down_write_nested on ctrl->reset_lock, and subsequent hot plug of same AMD card.

Attaching the log with hotplug after resume.
Comment 35 Anatoli Antonovitch 2022-11-16 17:02:12 UTC
Created attachment 303188 [details]
The patch to resolve the issue for secondary bus reset.


@Lukas Wunner:

Does this patch make sense for this case at all?
Comment 36 Anatoli Antonovitch 2023-01-04 19:46:01 UTC
Created attachment 303524 [details]
The dmesg 6.2.0-rc2 for S3 resume time with AER
Comment 37 Anatoli Antonovitch 2023-01-04 19:47:07 UTC
Created attachment 303525 [details]
The dmesg 6.2.0-rc2 with reworked down_write_nested on ctrl->reset_lock, and subsequent hot plug of same AMD card.
Comment 38 Anatoli Antonovitch 2023-01-20 21:25:59 UTC
Created attachment 303634 [details]
The dmesg 6.2.0-rc4 for S3 resume with AER

The dmesg 6.2.0-rc4 for the patch:
https://patchwork.kernel.org/project/linux-pci/patch/3dc88ea82bdc0e37d9000e413d5ebce481cbd629.1674205689.git.lukas@wunner.de/
Comment 39 Anatoli Antonovitch 2023-01-23 19:20:50 UTC
Created attachment 303639 [details]
The dmesg 6.2.0-rc5 for S3 resume with AER
Comment 40 Anatoli Antonovitch 2023-01-23 19:22:54 UTC
Created attachment 303640 [details]
The patch for testing

The merged patch "PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock"
for the kernel 6.2.0-rc5
Comment 41 Lukas Wunner 2023-04-09 08:41:28 UTC
Created attachment 304099 [details]
[PATCH v2] PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock

My apologies for the delay. Could you test if this patch fixes the issue for you?

It should apply cleanly on top of any recent kernel. If you test with anything older than v6.3-rc1, be sure to also cherry-pick commit 74ff8864cc84 ("PCI: hotplug: Allow marking devices as disconnected during bind/unbind").

Thanks a lot.
Comment 42 Anatoli Antonovitch 2023-04-10 20:26:13 UTC
Created attachment 304111 [details]
The dmesg for S3 resume time with AER

The dmesg for current kernel 6.3.0-rc5 on the same setup.
Comment 43 Anatoli Antonovitch 2023-04-10 20:28:15 UTC
Thanks Lukas.

The patch has been tested with current kernel 6.3.0-rc5 on the same setup.
The deadlock between reset_lock and device_lock has been fixed.

See details in the dmesg log: dmesg_6.3.0-rc5_fix.txt
