[PVE-User] 6.5.13-3-pve kernel panic on shutdown
Stefan Radman
stefan.radman at me.com
Thu Mar 28 18:13:01 CET 2024
Hi Gilberto
Thank you for the general guidelines on dealing with APEI Generic Hardware Errors.
1. Identify the Error Source:
The APEI error message clearly identifies the source as the second port of the BCM5720 LOM (onboard NIC).
[84463.685123] {1}[Hardware Error]: device_id: 0000:04:00.1
root at pve:~# ethtool -i eno2 | egrep '^(driver|version|bus)'
driver: tg3
version: 6.5.13-3-pve
bus-info: 0000:04:00.1
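For anyone who wants to repeat this check on their own box, the comparison above can be sketched in a few lines of Python. This is only an illustration: the sample text is copied from the log and ethtool output in this thread, and the helper names (apei_device, ethtool_bus_info) are made up for the example.

```python
import re

# Sample lines copied from the panic log and the ethtool output in this thread.
LOG = """\
[84463.685120] {1}[Hardware Error]: port_type: 0, PCIe end point
[84463.685123] {1}[Hardware Error]: device_id: 0000:04:00.1
[84463.685127] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
"""
ETHTOOL = """\
driver: tg3
version: 6.5.13-3-pve
bus-info: 0000:04:00.1
"""

# PCI address in domain:bus:device.function form, e.g. 0000:04:00.1
BDF = r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]"


def apei_device(log):
    """Return the PCI address named in an APEI PCIe error record, or None."""
    # Only the device_id line that directly follows the "[Hardware Error]:"
    # tag carries the address; the vendor_id line carries a hex device code.
    m = re.search(r"\[Hardware Error\]:\s*device_id:\s*(" + BDF + r")", log)
    return m.group(1) if m else None


def ethtool_bus_info(text):
    """Return the bus-info value from `ethtool -i` output, or None."""
    m = re.search(r"^bus-info:\s*(\S+)", text, re.MULTILINE)
    return m.group(1) if m else None


if __name__ == "__main__":
    dev = apei_device(LOG)
    nic = ethtool_bus_info(ETHTOOL)
    print(f"APEI device: {dev}, ethtool bus-info: {nic}, match: {dev == nic}")
```

On a live system the two strings would come from the captured console log and from `ethtool -i <interface>` rather than from the hard-coded samples.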
2. Update System Firmware:
The system firmware (provided by Dell support) is up to date.
BIOS: 2.21.1 <https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hkpg0>
BCM 5720 firmware: 22.71.3 <https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=4jjw6>
iDRAC: 7.00.00.171
3. Kernel Parameter Tuning:
"the apei=off kernel parameter can be used to disable APEI error handling altogether, although this is not recommended as it may compromise system stability."
I’d rather not do that. The system is stable, apart from the kernel panic during reboot (which does not affect operation).
4. Hardware Diagnostics:
Hardware diagnostics didn’t return any errors.
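Even with clean diagnostics, the AER words in the panic log (aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000) can be decoded by hand. The sketch below does this, assuming the bit layout of the PCIe AER Uncorrectable Error Status register as I understand it; the bit names should be checked against the PCIe specification before relying on them, and the function name is made up for the example.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status/Mask/Severity
# registers (assumed layout, to be verified against the PCIe specification).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncor(value):
    """Return the names of all known bits set in an AER uncorrectable word.

    Bits that are not listed in AER_UNCOR_BITS are silently ignored.
    """
    return [name for bit, name in sorted(AER_UNCOR_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    # Values taken from the panic log quoted in this thread.
    print("status:", decode_aer_uncor(0x00100000))
    print("mask:  ", decode_aer_uncor(0x00010000))
```

With the values from the log, the status word decodes to "Unsupported Request" and the mask to "Unexpected Completion".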
7. Kernel and Module Updates:
The system is running the latest stable PVE kernel (6.5.13-3-pve) and tg3 kernel module.
Please let me know if you have any further suggestions directly related to this specific issue (see [1][2][3][4][5] and [6]).
Thank you
Best regards
Stefan
[1] Use ACPI S5 for reboot #1904225: causes reboot crash on Dell T440
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1962730
[2] [SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which
causes Bus Fatal Error when rebooting system with BCM5720 NIC
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471
[3] tg3: Disable tg3 device on system reboot to avoid triggering AER
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2ca1c94ce0b65a2ce7512b718f3d8a0fe6224bca
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/broadcom/tg3.c?id=2ca1c94ce0b65a2ce7512b718f3d8a0fe6224bca#n18074
[4] * [PATCH] tg3: Disable tg3 device on system reboot to avoid triggering AER
https://lore.kernel.org/netdev/CAAd53p7PmEp+vWLz+fGdDntGQ2KqgL54fo86Bpy7oy9tKzXsAg@mail.gmail.com/T/
[5] [v4,2/2] PM: ACPI: reboot: Reinstate S5 for reboot
https://patches.linaro.org/project/linux-acpi/patch/20220916043319.119716-2-kai.heng.feng@canonical.com/
[6] * [PATCH] tg3: add new module param to force device power down on reboot
https://lore.kernel.org/lkml/d8ed4af1-5c83-4895-9fc3-9aea25724fd9@gmail.com/T/
> On Mar 28, 2024, at 16:57, Gilberto Ferreira <gilberto.nunes32 at gmail.com> wrote:
>
> https://medium.com/@nothanjack/dealing-with-apei-generic-hardware-error-source-problems-in-linux-a8ee8a67c8c1
> ---
> Gilberto Nunes Ferreira
> (47) 99676-7530 - Whatsapp / Telegram
>
> On Thu, Mar 28, 2024 at 12:54, Stefan Radman via pve-user
> <pve-user at lists.proxmox.com> wrote:
>
>>
>>
>>
>> ---------- Forwarded message ----------
>> From: Stefan Radman <stefan.radman at me.com>
>> To: Proxmox VE user list <pve-user at lists.proxmox.com>
>> Cc:
>> Bcc:
>> Date: Thu, 28 Mar 2024 16:47:43 +0100
>> Subject: Re: [PVE-User] 6.5.13-3-pve kernel panic on shutdown
>> Hi Gilberto
>>
>> The server firmware is up to date.
>>
>> Stefan
>>
>>> On Mar 28, 2024, at 16:18, Gilberto Ferreira <gilberto.nunes32 at gmail.com> wrote:
>>>
>>> Try to update the server firmware.
>>> ---
>>> Gilberto Nunes Ferreira
>>> (47) 99676-7530 - Whatsapp / Telegram
>>>
>>> On Thu, Mar 28, 2024 at 11:58, Stefan Radman via pve-user
>>> <pve-user at lists.proxmox.com <mailto:pve-user at lists.proxmox.com>> wrote:
>>>
>>>>
>>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Stefan Radman <stefan.radman at me.com <mailto:stefan.radman at me.com>>
>>>> To: PVE User List <pve-user at pve.proxmox.com <mailto:pve-user at pve.proxmox.com>>
>>>> Cc:
>>>> Bcc:
>>>> Date: Thu, 28 Mar 2024 15:50:02 +0100
>>>> Subject: 6.5.13-3-pve kernel panic on shutdown
>>>> I recently noticed that a Dell PowerEdge R540 currently running Proxmox VE
>>>> 8.1.8 (kernel 6.5.13-3-pve) throws a kernel panic on shutdown.
>>>>
>>>> The kernel panic is triggered 3-4 seconds after the last network interface
>>>> goes down (onboard BCM5720 LOM), while the system enters S5 (sleep) state.
>>>>
>>>> [84459.970212] bond0: (slave eno1): link status definitely down, disabling slave
>>>> [84459.982170] bond0: (slave eno2): link status definitely down, disabling slave
>>>> [84459.990037] tg3 0000:04:00.0 eno1: left promiscuous mode
>>>> [84459.995822] tg3 0000:04:00.0 eno1: left allmulticast mode
>>>> [84460.001615] bond0: now running without any active interface!
>>>> [84460.018133] vmbr0: port 1(bond0) entered disabled state
>>>> [84460.291379] ACPI: PM: Preparing to enter system sleep state S5
>>>> [84463.685113] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
>>>>
>>>> This is reproducible on every reboot.
>>>>
>>>> R540 and BCM5720 are running the latest firmware available from the Dell
>>>> support website.
>>>>
>>>> Link [2] below seems to suggest that my problem is related to a combination
>>>> of ACPI S5, the tg3 driver and the BCM5720 on-board NIC.
>>>>
>>>> Has anyone else seen this lately (or ever) with Proxmox VE?
>>>>
>>>> Thank you
>>>>
>>>> Stefan
>>>>
>>>> [1] Use ACPI S5 for reboot #1904225: causes reboot crash on Dell T440
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1962730
>>>>
>>>> [2] [SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which
>>>> causes Bus Fatal Error when rebooting system with BCM5720 NIC
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471
>>>>
>>>> [3] tg3: Disable tg3 device on system reboot to avoid triggering AER
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2ca1c94ce0b65a2ce7512b718f3d8a0fe6224bca
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/broadcom/tg3.c?id=2ca1c94ce0b65a2ce7512b718f3d8a0fe6224bca#n18074
>>>>
>>>> [4] * [PATCH] tg3: Disable tg3 device on system reboot to avoid triggering AER
>>>> https://lore.kernel.org/netdev/CAAd53p7PmEp+vWLz+fGdDntGQ2KqgL54fo86Bpy7oy9tKzXsAg@mail.gmail.com/T/
>>>>
>>>> [5] [v4,2/2] PM: ACPI: reboot: Reinstate S5 for reboot
>>>> https://patches.linaro.org/project/linux-acpi/patch/20220916043319.119716-2-kai.heng.feng@canonical.com/
>>>>
>>>> [6] * [PATCH] tg3: add new module param to force device power down on reboot
>>>> https://lore.kernel.org/lkml/d8ed4af1-5c83-4895-9fc3-9aea25724fd9@gmail.com/T/
>>>>
>>>>
>>>> [84458.600189] systemd-shutdown[1]: Syncing filesystems and block devices.
>>>> [84458.607141] systemd-shutdown[1]: Rebooting.
>>>> [84458.612283] spi-nor spi0.0: Software reset failed: -524
>>>> [84459.777370] megaraid_sas 0000:17:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
>>>> [84459.970212] bond0: (slave eno1): link status definitely down, disabling slave
>>>> [84459.982170] bond0: (slave eno2): link status definitely down, disabling slave
>>>> [84459.990037] tg3 0000:04:00.0 eno1: left promiscuous mode
>>>> [84459.995822] tg3 0000:04:00.0 eno1: left allmulticast mode
>>>> [84460.001615] bond0: now running without any active interface!
>>>> [84460.018133] vmbr0: port 1(bond0) entered disabled state
>>>> [84460.291379] ACPI: PM: Preparing to enter system sleep state S5
>>>> [84463.685113] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
>>>> [84463.685116] {1}[Hardware Error]: event severity: fatal
>>>> [84463.685117] {1}[Hardware Error]: Error 0, type: fatal
>>>> [84463.685119] {1}[Hardware Error]: section_type: PCIe error
>>>> [84463.685120] {1}[Hardware Error]: port_type: 0, PCIe end point
>>>> [84463.685121] {1}[Hardware Error]: version: 3.0
>>>> [84463.685122] {1}[Hardware Error]: command: 0x0002, status: 0x0010
>>>> [84463.685123] {1}[Hardware Error]: device_id: 0000:04:00.1
>>>> [84463.685125] {1}[Hardware Error]: slot: 0
>>>> [84463.685126] {1}[Hardware Error]: secondary_bus: 0x00
>>>> [84463.685127] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
>>>> [84463.685128] {1}[Hardware Error]: class_code: 020000
>>>> [84463.685129] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
>>>> [84463.685130] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
>>>> [84463.685131] {1}[Hardware Error]: TLP Header: 40000001 0000010f 90028090 00000000
>>>> [84463.685134] Kernel panic - not syncing: Fatal hardware error!
>>>> [84463.685136] CPU: 0 PID: 1 Comm: systemd-shutdow Tainted: P O 6.5.13-3-pve #1
>>>> [84463.685139] Hardware name: Dell Inc. PowerEdge R540/0VC7DK, BIOS 2.21.1 03/07/2024
>>>> [84463.685140] Call Trace:
>>>> [84463.685142] <NMI>
>>>> …
>>>>
>>>> root at pve:~# pveversion
>>>> pve-manager/8.1.8/d29041d9f87575d0 (running kernel: 6.5.13-3-pve)
>>>> root at pve:~# ethtool -i eno2
>>>> driver: tg3
>>>> version: 6.5.13-3-pve
>>>> firmware-version: FFV22.71.3 bc 5720-v1.39
>>>> expansion-rom-version:
>>>> bus-info: 0000:04:00.1
>>>> supports-statistics: yes
>>>> supports-test: yes
>>>> supports-eeprom-access: yes
>>>> supports-register-dump: yes
>>>> supports-priv-flags: no
>>>> root at pve:~# lspci | fgrep 04:00.1
>>>> 04:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
>>>>
>>>> _______________________________________________
>>>> pve-user mailing list
>>>> pve-user at lists.proxmox.com <mailto:pve-user at lists.proxmox.com>
>>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>>