tg3 triggers AER kernel panic on reboot
Stefan Radman
stefan.radman at me.com
Tue Jan 21 15:59:24 CET 2025
PS: Subject changed from "6.5.13-3-pve kernel panic on shutdown" to "tg3 triggers AER kernel panic on reboot" to better describe the issue.
I hit the same bug while/after upgrading an R640 cluster to PVE 8.3.2 (proxmox-kernel-6.8.12-7-pve).
The R640s are running the latest firmware, and the BCM5720 LOM firmware is version 23.11.4, the latest currently available from Dell.
Reboot = kernel panic triggered by tg3 driver and AER
Shutdown = OK
Please consider cherry-picking the related fix for the next kernel:
tg3: Disable tg3 device on system reboot to avoid triggering AER
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2ca1c94ce0b65a2ce7512b718f3d8a0fe6224bca
I’ll gladly volunteer to test pre-release kernels on Dell PowerEdge R540, R640 and R740 (I just can’t build the kernel myself).
Thanks
Stefan
> On Jul 2, 2024, at 16:36, Stefan Radman <stefan.radman at me.com> wrote:
>
> Still happening with PVE 8.2.4 (kernel 6.8.8-2-pve) on R540/R740 after upgrading the BCM5720 firmware to 22.91.5.
>
> While the real problem may be rooted in the Dell/Broadcom firmware, I believe that the regression was introduced by the following commit in the tg3 driver:
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=9fc3bc764334
> -       tg3_power_down(tp);
> +       if (system_state == SYSTEM_POWER_OFF)
> +               tg3_power_down(tp);
>
> The patch fixed a regression in the R650xs but apparently introduced another one in the R540 and R740.
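
To make the effect of the quoted diff concrete, here is a toy Python model of the condition it introduces. This is a sketch only and not driver code: the state names mirror the kernel's `system_state` values, while the function name `calls_power_down` and its `conditional` flag are hypothetical and exist only for this illustration.

```python
# Toy model of the condition in the quoted diff; this is NOT the tg3 driver.
# State names mirror the kernel's system_state values.
SYSTEM_HALT = "SYSTEM_HALT"
SYSTEM_POWER_OFF = "SYSTEM_POWER_OFF"
SYSTEM_RESTART = "SYSTEM_RESTART"


def calls_power_down(system_state: str, conditional: bool) -> bool:
    """Return True if tg3_power_down() would be reached in this model.

    conditional=False models the '-' line (unconditional power-down).
    conditional=True models the '+' lines (power-down only on power-off).
    """
    if not conditional:
        return True
    return system_state == SYSTEM_POWER_OFF


# Compare both variants for a reboot and for a power-off.
for state in (SYSTEM_RESTART, SYSTEM_POWER_OFF):
    print(
        state,
        "| before diff:", calls_power_down(state, conditional=False),
        "| after diff:", calls_power_down(state, conditional=True),
    )
```

In this model only the reboot path changes behavior, which is consistent with the panic being seen on reboot but not on shutdown.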
>
> In the meantime, a Debian user contacted me who has the same problem on several R540s running Debian Bookworm (kernel 6.1) and Trixie (6.8.x):
> kernel panic from PCI error triggered by the BCM5720 device while entering S5 state during reboot.
>
> Stefan
>
> root@pve:~# pveversion
> pve-manager/8.2.4/faa83925c9641325 (running kernel: 6.8.8-2-pve)
> root@pve:~# ethtool -i eno3 | head -5
> driver: tg3
> version: 6.8.8-2-pve
> firmware-version: FFV22.91.5 bc 5720-v1.39
> expansion-rom-version:
> bus-info: 0000:01:00.0
> root@pve:~# lspci -s 0000:01:00.0
> 01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
>
>
>> On Apr 25, 2024, at 00:57, Stefan Radman <stefan.radman at me.com> wrote:
>>
>> Still happening after upgrade to Proxmox VE 8.2
>>
>> Affected kernels so far:
>> 6.5.13-1-pve
>> 6.5.13-3-pve
>> 6.5.13-5-pve
>> 6.8.4-2-pve
>>
>> Stefan
>>
>>> On Apr 15, 2024, at 15:46, Stefan Radman <stefan.radman at me.com> wrote:
>>>
>>> Still happening with kernel
>>> 6.5.13-5-pve
>>>
>>> Stefan
>>>
>>>> On Apr 2, 2024, at 13:09, Stefan Radman <stefan.radman at me.com> wrote:
>>>>
>>>> Workaround: No more kernel panics on reboot when pinning kernel 6.2.16-20-pve.
>>>>
>>>> Affected kernels:
>>>> 6.5.13-1-pve
>>>> 6.5.13-3-pve
>>>>
>>>> The original issue [1] was solved long ago [2] but apparently re-introduced recently [3].
>>>>
>>>> Regression [4] being discussed on kernel.org
>>>>
>>>> Looks like a back and forth in the tg3 driver.
>>>>
>>>> Note that the kernel panic is only triggered by “reboot” and not by “shutdown”.
>>>>
>>>> Stefan
>>>>
>>>> root@per740:~# proxmox-boot-tool kernel list
>>>> Manually selected kernels:
>>>> None.
>>>>
>>>> Automatically selected kernels:
>>>> 6.2.16-20-pve
>>>> 6.5.13-1-pve
>>>> 6.5.13-3-pve
>>>>
>>>> Pinned kernel:
>>>> 6.2.16-20-pve
>>>> root@per740:~# pveversion
>>>> pve-manager/8.1.10/4b06efb5db453f29 (running kernel: 6.2.16-20-pve)
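
The pin shown above is normally set with `proxmox-boot-tool kernel pin 6.2.16-20-pve`. As a minimal sketch for checking several hosts, the following Python groups the `proxmox-boot-tool kernel list` output by section. It assumes the output has exactly the layout quoted above; the function name `parse_kernel_list` and the `AFFECTED` set are hypothetical helpers for this example.

```python
def parse_kernel_list(text: str) -> dict[str, list[str]]:
    """Group `proxmox-boot-tool kernel list` output by section header.

    Assumes the layout quoted in this thread: a header line ending in ':'
    followed by one kernel version per line, or 'None.' if empty.
    """
    sections: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            current = line[:-1]
            sections[current] = []
        elif current is not None and line != "None.":
            sections[current].append(line)
    return sections


# Output copied from the message above.
SAMPLE = """\
Manually selected kernels:
None.

Automatically selected kernels:
6.2.16-20-pve
6.5.13-1-pve
6.5.13-3-pve

Pinned kernel:
6.2.16-20-pve
"""

# Kernels reported as affected at the time of that message.
AFFECTED = {"6.5.13-1-pve", "6.5.13-3-pve"}

kernels = parse_kernel_list(SAMPLE)
pinned = kernels.get("Pinned kernel", [])
print("pinned:", pinned)
print("pinned kernel is an affected one:", bool(AFFECTED & set(pinned)))
```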
>>>>
>>>> [1] Use ACPI S5 for reboot #1904225: causes reboot crash on Dell T440
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1962730
>>>>
>>>> [2] tg3: Disable tg3 device on system reboot to avoid triggering AER
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2ca1c94ce0b65a2ce7512b718f3d8a0fe6224bca
>>>>
>>>> [3] tg3: power down device only on SYSTEM_POWER_OFF
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9fc3bc7643341dc5be7d269f3d3dbe441d8d7ac3
>>>>
>>>> [4] * [PATCH] tg3: add new module param to force device power down on reboot
>>>> https://lore.kernel.org/lkml/d8ed4af1-5c83-4895-9fc3-9aea25724fd9@gmail.com/T/
>>>>
>>>>
>>>>> On Apr 2, 2024, at 09:37, Gilberto Ferreira <gilberto.nunes32 at gmail.com> wrote:
>>>>>
>>>>> Perhaps you should try another kernel besides 6.5, like 6.2 for instance.
>>>>>
>>>>> On Tue, Apr 2, 2024, 02:43, Stefan Radman via pve-user <pve-user at lists.proxmox.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> ---------- Forwarded message ----------
>>>>>> From: Stefan Radman <stefan.radman at me.com>
>>>>>> To: Proxmox VE user list <pve-user at lists.proxmox.com>
>>>>>> Cc: PVE User List <pve-user at pve.proxmox.com>
>>>>>> Bcc:
>>>>>> Date: Tue, 2 Apr 2024 07:42:32 +0200
>>>>>> Subject: Re: [PVE-User] 6.5.13-3-pve kernel panic on shutdown
>>>>>> Yesterday I had the same thing happen when shutting down a Dell PowerEdge R740.
>>>>>>
>>>>>> Again, the kernel panic was triggered by a BCM5720 running Broadcom firmware 22.71.3 and the tg3 driver from kernel 6.5.13-3-pve.
>>>>>>
>>>>>> R740 BIOS 2.21.2 (but also happened with 2.20.1)
>>>>>>
>>>>>> Stefan
>>>>>>
>>>>>> [1325586.715465] ACPI: PM: Preparing to enter system sleep state S5
>>>>>> [1325589.991219] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
>>>>>> [1325589.991223] {1}[Hardware Error]: event severity: fatal
>>>>>> [1325589.991225] {1}[Hardware Error]: Error 0, type: fatal
>>>>>> [1325589.991227] {1}[Hardware Error]: section_type: PCIe error
>>>>>> [1325589.991228] {1}[Hardware Error]: port_type: 0, PCIe end point
>>>>>> [1325589.991231] {1}[Hardware Error]: version: 3.0
>>>>>> [1325589.991233] {1}[Hardware Error]: command: 0x0002, status: 0x0010
>>>>>> [1325589.991235] {1}[Hardware Error]: device_id: 0000:01:00.1
>>>>>> [1325589.991237] {1}[Hardware Error]: slot: 0
>>>>>> [1325589.991239] {1}[Hardware Error]: secondary_bus: 0x00
>>>>>> [1325589.991240] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
>>>>>> [1325589.991242] {1}[Hardware Error]: class_code: 020000
>>>>>> [1325589.991244] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
>>>>>> [1325589.991246] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
>>>>>> [1325589.991248] {1}[Hardware Error]: TLP Header: 40000001 0000010f 90028090 00000000
>>>>>> [1325589.991252] Kernel panic - not syncing: Fatal hardware error!
>>>>>> [1325589.991254] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 6.5.13-1-pve #1
>>>>>> [1325589.991258] Hardware name: Dell Inc. PowerEdge R740/0WXD1Y, BIOS 2.20.1 09/13/2023
>>>>>> [1325589.991259] Call Trace:
>>>>>> [1325589.991261] <NMI>
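
The `aer_uncor_status` and `aer_uncor_mask` values in the log above are bit fields. As a minimal sketch, the following Python decodes them using the bit layout of the PCIe AER Uncorrectable Error Status register. Only a subset of bits is listed, and the names `UNCOR_BITS` and `decode` are hypothetical helpers for this example.

```python
import re

# Subset of the PCIe AER Uncorrectable Error Status register bits.
UNCOR_BITS = {
    0x00000010: "Data Link Protocol",
    0x00000020: "Surprise Down",
    0x00001000: "Poisoned TLP",
    0x00002000: "Flow Control Protocol",
    0x00004000: "Completion Timeout",
    0x00008000: "Completer Abort",
    0x00010000: "Unexpected Completion",
    0x00020000: "Receiver Overflow",
    0x00040000: "Malformed TLP",
    0x00080000: "ECRC Error",
    0x00100000: "Unsupported Request",
    0x00200000: "ACS Violation",
}


def decode(value: int) -> list[str]:
    """Return the names of all known bits set in value, lowest bit first."""
    return [name for bit, name in sorted(UNCOR_BITS.items()) if value & bit]


# Line copied from the log above (timestamp removed).
LINE = "{1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000"

# Extract each 'aer_uncor_<name>: 0x...' pair and convert the hex value.
fields = {
    key: int(val, 16)
    for key, val in re.findall(r"(aer_uncor_\w+): (0x[0-9a-fA-F]+)", LINE)
}

for key, val in fields.items():
    print(key, hex(val), decode(val))
```

For the values in the log, the status decodes to Unsupported Request and the mask to Unexpected Completion.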
>>>>>>
>>>>>> root@per740:~# pveversion
>>>>>> pve-manager/8.1.10/4b06efb5db453f29 (running kernel: 6.5.13-3-pve)
>>>>>>
>>>>>> root@per740:~# ethtool -i eno4
>>>>>> driver: tg3
>>>>>> version: 6.5.13-3-pve
>>>>>> firmware-version: FFV22.71.3 bc 5720-v1.39
>>>>>> expansion-rom-version:
>>>>>> bus-info: 0000:01:00.1
>>>>>> supports-statistics: yes
>>>>>> supports-test: yes
>>>>>> supports-eeprom-access: yes
>>>>>> supports-register-dump: yes
>>>>>> supports-priv-flags: no
>>>>>>
>>>>>>
>>>>>> > On Mar 28, 2024, at 15:50, Stefan Radman via pve-user <pve-user at lists.proxmox.com> wrote:
>>>>>> >
>>>>>> >
>>>>>> > From: Stefan Radman <stefan.radman at me.com>
>>>>>> > Subject: 6.5.13-3-pve kernel panic on shutdown
>>>>>> > Date: March 28, 2024 at 15:50:02 GMT+1
>>>>>> > To: PVE User List <pve-user at pve.proxmox.com>
>>>>>> >
>>>>>> >
>>>>>> > I recently noticed that a Dell PowerEdge R540 currently running Proxmox VE 8.1.8 (kernel 6.5.13-3-pve) throws a kernel panic on shutdown.
>>>>>> >
>>>>>> > The kernel panic is triggered 3-4 seconds after the last network interface goes down (onboard BCM5720 LOM), while the system enters S5 (sleep) state.
>>>>>> >
>>>>>> > [84459.970212] bond0: (slave eno1): link status definitely down, disabling slave
>>>>>> > [84459.982170] bond0: (slave eno2): link status definitely down, disabling slave
>>>>>> > [84459.990037] tg3 0000:04:00.0 eno1: left promiscuous mode
>>>>>> > [84459.995822] tg3 0000:04:00.0 eno1: left allmulticast mode
>>>>>> > [84460.001615] bond0: now running without any active interface!
>>>>>> > [84460.018133] vmbr0: port 1(bond0) entered disabled state
>>>>>> > [84460.291379] ACPI: PM: Preparing to enter system sleep state S5
>>>>>> > [84463.685113] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
>>>>>> >
>>>>>> > This is reproducible on every reboot.
>>>>>> >
>>>>>> > R540 and BCM5720 are running the latest firmware available from the Dell support website.
>>>>>> >
>>>>>> > Link [2] below seems to suggest that my problem is related to a combination of ACPI S5, the tg3 driver and the BCM5720 on-board NIC.
>>>>>> >
>>>>>> > Has anyone else seen this lately (or ever) with Proxmox VE?
>>>>>> >
>>>>>> > Thank you
>>>>>> >
>>>>>> > Stefan
>>>>>> >
>>>>>> > [1] Use ACPI S5 for reboot #1904225: causes reboot crash on Dell T440
>>>>>> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1962730
>>>>>> >
>>>>>> > [2] [SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which causes Bus Fatal Error when rebooting system with BCM5720 NIC
>>>>>> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471
>>>>>> >
>>>>>> > [3] tg3: Disable tg3 device on system reboot to avoid triggering AER
>>>>>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2ca1c94ce0b65a2ce7512b718f3d8a0fe6224bca
>>>>>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/broadcom/tg3.c?id=2ca1c94ce0b65a2ce7512b718f3d8a0fe6224bca#n18074
>>>>>> >
>>>>>> > [4] * [PATCH] tg3: Disable tg3 device on system reboot to avoid triggering AER
>>>>>> > https://lore.kernel.org/netdev/CAAd53p7PmEp+vWLz+fGdDntGQ2KqgL54fo86Bpy7oy9tKzXsAg@mail.gmail.com/T/
>>>>>> >
>>>>>> > [5] [v4,2/2] PM: ACPI: reboot: Reinstate S5 for reboot
>>>>>> > https://patches.linaro.org/project/linux-acpi/patch/20220916043319.119716-2-kai.heng.feng@canonical.com/
>>>>>> >
>>>>>> > [6] * [PATCH] tg3: add new module param to force device power down on reboot
>>>>>> > https://lore.kernel.org/lkml/d8ed4af1-5c83-4895-9fc3-9aea25724fd9@gmail.com/T/
>>>>>> >
>>>>>> >
>>>>>> > [84458.600189] systemd-shutdown[1]: Syncing filesystems and block devices.
>>>>>> > [84458.607141] systemd-shutdown[1]: Rebooting.
>>>>>> > [84458.612283] spi-nor spi0.0: Software reset failed: -524
>>>>>> > [84459.777370] megaraid_sas 0000:17:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
>>>>>> > [84459.970212] bond0: (slave eno1): link status definitely down, disabling slave
>>>>>> > [84459.982170] bond0: (slave eno2): link status definitely down, disabling slave
>>>>>> > [84459.990037] tg3 0000:04:00.0 eno1: left promiscuous mode
>>>>>> > [84459.995822] tg3 0000:04:00.0 eno1: left allmulticast mode
>>>>>> > [84460.001615] bond0: now running without any active interface!
>>>>>> > [84460.018133] vmbr0: port 1(bond0) entered disabled state
>>>>>> > [84460.291379] ACPI: PM: Preparing to enter system sleep state S5
>>>>>> > [84463.685113] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
>>>>>> > [84463.685116] {1}[Hardware Error]: event severity: fatal
>>>>>> > [84463.685117] {1}[Hardware Error]: Error 0, type: fatal
>>>>>> > [84463.685119] {1}[Hardware Error]: section_type: PCIe error
>>>>>> > [84463.685120] {1}[Hardware Error]: port_type: 0, PCIe end point
>>>>>> > [84463.685121] {1}[Hardware Error]: version: 3.0
>>>>>> > [84463.685122] {1}[Hardware Error]: command: 0x0002, status: 0x0010
>>>>>> > [84463.685123] {1}[Hardware Error]: device_id: 0000:04:00.1
>>>>>> > [84463.685125] {1}[Hardware Error]: slot: 0
>>>>>> > [84463.685126] {1}[Hardware Error]: secondary_bus: 0x00
>>>>>> > [84463.685127] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
>>>>>> > [84463.685128] {1}[Hardware Error]: class_code: 020000
>>>>>> > [84463.685129] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
>>>>>> > [84463.685130] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
>>>>>> > [84463.685131] {1}[Hardware Error]: TLP Header: 40000001 0000010f 90028090 00000000
>>>>>> > [84463.685134] Kernel panic - not syncing: Fatal hardware error!
>>>>>> > [84463.685136] CPU: 0 PID: 1 Comm: systemd-shutdow Tainted: P O 6.5.13-3-pve #1
>>>>>> > [84463.685139] Hardware name: Dell Inc. PowerEdge R540/0VC7DK, BIOS 2.21.1 03/07/2024
>>>>>> > [84463.685140] Call Trace:
>>>>>> > [84463.685142] <NMI>
>>>>>> > …
>>>>>> >
>>>>>> > root@pve:~# pveversion
>>>>>> > pve-manager/8.1.8/d29041d9f87575d0 (running kernel: 6.5.13-3-pve)
>>>>>> > root@pve:~# ethtool -i eno2
>>>>>> > driver: tg3
>>>>>> > version: 6.5.13-3-pve
>>>>>> > firmware-version: FFV22.71.3 bc 5720-v1.39
>>>>>> > expansion-rom-version:
>>>>>> > bus-info: 0000:04:00.1
>>>>>> > supports-statistics: yes
>>>>>> > supports-test: yes
>>>>>> > supports-eeprom-access: yes
>>>>>> > supports-register-dump: yes
>>>>>> > supports-priv-flags: no
>>>>>> > root@pve:~# lspci | fgrep 04:00.1
>>>>>> > 04:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > _______________________________________________
>>>>>> > pve-user mailing list
>>>>>> > pve-user at lists.proxmox.com
>>>>>> > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user