[PVE-User] Proxmox node hang

Luke Thompson luke.t at tncrew.com.au
Thu Jul 15 11:31:21 CEST 2021


Hi Eneko,

Consumer RAM is always a tricky starting point. When under heavy load, 
the fault rates tend to be quite surprising (hence why ECC is preferable 
in enterprise/etc settings).

Your Gigabyte B450 AORUS M motherboard has several newer BIOS releases 
available. They're definitely worth reading into, and applying after 
your own research.
https://www.gigabyte.com/Motherboard/B450-AORUS-M-rev-1x/support#support-dl-bios 
(F60/F61c in 2021 compared to F50 in 2019)

Where there's been a hardware fault downstream, we tend to see kernel 
taint flags/states similar to the ones you're seeing (P O / P D O / etc.).
https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html 
(We've had hardware like OOB cards trigger module flags)
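
For reference, here's a rough Python sketch for decoding the taint letters 
from the numeric value the kernel exposes in /proc/sys/kernel/tainted. The 
bit positions follow the kernel document linked above; the table is 
deliberately only a subset, and the function names are my own, so treat it 
as an illustration rather than a complete tool.

```python
# Subset of the taint bits from the kernel's tainted-kernels guide.
# Only the flags most relevant to this thread are listed here.
TAINT_BITS = {
    0: "P",   # proprietary module was loaded
    1: "F",   # module was force loaded
    4: "M",   # processor reported a Machine Check Exception
    5: "B",   # bad page referenced or unexpected page flags
    7: "D",   # kernel died recently (there was an OOPS or BUG)
    9: "W",   # kernel issued a warning
    12: "O",  # externally-built (out-of-tree) module was loaded
    13: "E",  # unsigned module was loaded
}


def decode_taint(value: int) -> str:
    """Return the taint letters set in `value`, lowest bit first."""
    return "".join(
        letter
        for bit, letter in sorted(TAINT_BITS.items())
        if value & (1 << bit)
    )


def read_taint(path: str = "/proc/sys/kernel/tainted") -> int:
    """Read the current taint value on a running Linux host."""
    with open(path) as fh:
        return int(fh.read().strip())


# "P O" as in the first trace, "P D O" as in the later ones.
first_fault = decode_taint((1 << 0) | (1 << 12))
later_faults = decode_taint((1 << 0) | (1 << 7) | (1 << 12))
```

On the node itself, decode_taint(read_taint()) would show what is set right 
now. Going by that table, P and O refer to loaded modules, while D is the 
flag that gets added once the kernel has oopsed.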

Have you run the system through extended memory testing outside of PVE? 
(PVE includes memtest as a boot option, at least on the ISO.)

Your BIOS version is dated 1 month after this article, which suggests 
that a BIOS update may help to avoid the RNG bug it describes.
https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd-microcode-bug-destroyed-my-weekend

With the kernel taint lines, do you have the procs/calls that marry up 
to the PIDs listed? What were they doing at the time?
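
To line the faults up against processes, something like this Python sketch 
could pull the CPU, PID, command and taint letters out of each oops header 
line. It assumes the raw syslog has each entry on a single line (the mail 
client wrapped them in your message); the regex and function name are my 
own and only tested against the lines you posted.

```python
import re
from collections import Counter

# Matches the "CPU: ... PID: ... Comm: ... Tainted: ..." oops header line.
OOPS_LINE = re.compile(
    r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>.+?) "
    r"Tainted: (?P<taint>[A-Z ]+?)\s+(?P<kernel>\S+) #\d+"
)


def parse_oops_headers(lines):
    """Return one dict per oops header line found in `lines`."""
    records = []
    for line in lines:
        m = OOPS_LINE.search(line)
        if m:
            records.append({
                "cpu": int(m["cpu"]),
                "pid": int(m["pid"]),
                "comm": m["comm"],
                # Taint letters are column-aligned; drop the padding.
                "taint": m["taint"].replace(" ", ""),
                "kernel": m["kernel"],
            })
    return records


# Three header lines from the quoted syslog, unwrapped.
sample = [
    "Jul 15 06:45:10 sanmarko kernel: [145747.429175] CPU: 11 PID: 1914237 "
    "Comm: ceph Tainted: P           O      5.4.124-1-pve #1",
    "Jul 15 06:45:12 sanmarko kernel: [145749.037695] CPU: 11 PID: 2433 "
    "Comm: tp_fstore_op Tainted: P      D    O      5.4.124-1-pve #1",
    "Jul 15 06:45:29 sanmarko kernel: [145765.573922] CPU: 11 PID: 1733 "
    "Comm: pve-firewall Tainted: P      D    O      5.4.124-1-pve #1",
]

records = parse_oops_headers(sample)
faults_per_cpu = Counter(r["cpu"] for r in records)
faults_per_comm = Counter(r["comm"] for r in records)
```

Run over the full 437 lines, the two counters would show at a glance how 
the faults are spread across CPUs and commands. In the excerpt quoted 
below, every header line reports CPU 11.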

From the brief logs you've included, the faulting processes look varied, 
which implies that the problem is likely hardware-based. I'd guess RAM.

As it's only faulted once, I'd say a decent course of action would be to 
test the memory extensively, and go from there.

If it can identify a faulting module, then you can remove that DIMM and 
swap it for a known-good one instead, etc.

The testing can take a while, and in our experience it can be worth 
leaving it to cycle through, esp. with non-ECC.

A tainted kernel never "loses its taint" while it's running, but if you 
remove the underlying cause, the D flag shouldn't come back after a reboot.

It'll be good to hear about how you get on with it all. Best of luck 
with it!

Cheers,

Luke Thompson
Operations Manager

luke.t at tncrew.com.au
PO Box 111, West Wallsend

On 15/7/21 6:40 pm, Eneko Lacunza via pve-user wrote:
> Hi all,
>
> Tonight a node of our 5-node Proxmox 6.4+Ceph cluster froze at
> ~6:45. A reset brought it back online later in the morning, and it has
> been working well for 2 hours now.
>
> HA worked like a charm and Ceph has recovered in some minutes.
>
> A fantastic success story really, thanks for your excellent work, Proxmox
> developers and contributors!
>
> Now for the "post-mortem", I see node's 8 cores "general protection
> fault"ing one after another in a minute, with different processes.
>
> I suspect a memory module or main board fault (Ryzen 3700X 8-core,
> 4x32GB non-ECC RAM and gigabyte mainboard, all "consumer" parts, it has
> been working well since dec 2019). What do you think?
>
> Here a shortened syslog (I can provide all 437 lines if necessary):
>
> ---
>
> Jul 15 06:45:00 sanmarko systemd[1]: Starting Proxmox VE replication
> runner...
>
> Jul 15 06:45:00 sanmarko systemd[1]: pvesr.service: Succeeded.
>
> Jul 15 06:45:00 sanmarko systemd[1]: Started Proxmox VE replication runner.
>
> Jul 15 06:45:01 sanmarko CRON[1913457]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
>
> Jul 15 06:45:01 sanmarko CRON[1913458]: (root) CMD (if [ -x
> /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update
> 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then
> /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
>
> Jul 15 06:45:10 sanmarko kernel: [145747.429110] general protection
> fault: 0000 [#1] SMP NOPTI
>
> Jul 15 06:45:10 sanmarko kernel: [145747.429175] CPU: 11 PID: 1914237
> Comm: ceph Tainted: P           O      5.4.124-1-pve #1
>
> Jul 15 06:45:10 sanmarko kernel: [145747.429245] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:10 sanmarko kernel: [145747.429322] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:10 sanmarko kernel: [145747.429382] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:10 sanmarko kernel: [145747.430245] Call Trace:
>
> Jul 15 06:45:10 sanmarko kernel: [145747.430314]  ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:10 sanmarko kernel: [145747.431244]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:12 sanmarko kernel: [145749.037616] general protection
> fault: 0000 [#2] SMP NOPTI
>
> Jul 15 06:45:12 sanmarko kernel: [145749.037695] CPU: 11 PID: 2433 Comm:
> tp_fstore_op Tainted: P      D    O      5.4.124-1-pve #1
>
> Jul 15 06:45:12 sanmarko kernel: [145749.037793] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:12 sanmarko kernel: [145749.037898] RIP:
> 0010:apparmor_file_free_security+0x22/0x40
>
> Jul 15 06:45:12 sanmarko kernel: [145749.037975] Code: 2c ff ff eb a2 0f
> 1f 00 0f 1f 44 00 00 48 63 05 28 fb fc 00 48 03 87 c0 00 00 00 74 1a 48
> 8b 78 08 48 85 ff 74 11 55 48 89 e5 <f0> ff 0f 0f 88 dc 96 62 00 74 03
> 5d c3 c3 e8 db 55 00 00 5d c3 66
>
> [...]
>
> Jul 15 06:45:12 sanmarko kernel: [145749.038942] Call Trace:
>
> Jul 15 06:45:12 sanmarko kernel: [145749.039015]
> security_file_free+0x27/0x60
>
> [...]
>
> Jul 15 06:45:12 sanmarko kernel: [145749.039441]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:29 sanmarko kernel: [145765.573841] general protection
> fault: 0000 [#3] SMP NOPTI
>
> Jul 15 06:45:29 sanmarko kernel: [145765.573922] CPU: 11 PID: 1733 Comm:
> pve-firewall Tainted: P      D    O      5.4.124-1-pve #1
>
> Jul 15 06:45:29 sanmarko kernel: [145765.574021] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:29 sanmarko kernel: [145765.574127] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:29 sanmarko kernel: [145765.574201] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:29 sanmarko kernel: [145765.576321] Call Trace:
>
> Jul 15 06:45:29 sanmarko kernel: [145765.576391]  ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:29 sanmarko kernel: [145765.577258]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:29 sanmarko systemd[1]: pve-firewall.service: Main process
> exited, code=killed, status=11/SEGV
>
> Jul 15 06:45:29 sanmarko systemd[1]: pve-firewall.service: Failed with
> result 'signal'.
>
> Jul 15 06:45:35 sanmarko kernel: [145772.194438] general protection
> fault: 0000 [#4] SMP NOPTI
>
> Jul 15 06:45:35 sanmarko kernel: [145772.194516] CPU: 11 PID: 1776 Comm:
> ms_dispatch Tainted: P      D    O      5.4.124-1-pve #1
>
> Jul 15 06:45:35 sanmarko kernel: [145772.194614] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:35 sanmarko kernel: [145772.194718] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:35 sanmarko kernel: [145772.194792] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:35 sanmarko kernel: [145772.195750] Call Trace:
>
> Jul 15 06:45:35 sanmarko kernel: [145772.195819]  ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:35 sanmarko kernel: [145772.197176]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:37 sanmarko kernel: [145774.137506] general protection
> fault: 0000 [#5] SMP NOPTI
>
> Jul 15 06:45:37 sanmarko kernel: [145774.137586] CPU: 11 PID: 2466 Comm:
> tp_fstore_op Tainted: P      D    O      5.4.124-1-pve #1
>
> Jul 15 06:45:37 sanmarko kernel: [145774.137687] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:37 sanmarko kernel: [145774.137791] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:37 sanmarko kernel: [145774.137865] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:37 sanmarko kernel: [145774.139990] Call Trace:
>
> Jul 15 06:45:37 sanmarko kernel: [145774.140059]  ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:37 sanmarko kernel: [145774.140991]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:40 sanmarko kernel: [145776.830930] general protection
> fault: 0000 [#6] SMP NOPTI
>
> Jul 15 06:45:40 sanmarko kernel: [145776.831010] CPU: 11 PID: 7234 Comm:
> kvm Tainted: P      D    O      5.4.124-1-pve #1
>
> Jul 15 06:45:40 sanmarko kernel: [145776.831109] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:40 sanmarko kernel: [145776.831217] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:40 sanmarko kernel: [145776.831294] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:40 sanmarko kernel: [145776.832264] Call Trace:
>
> Jul 15 06:45:40 sanmarko kernel: [145776.832334]  ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:40 sanmarko kernel: [145776.833336]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Jul 15 06:45:40 sanmarko pve-ha-lrm[1914439]: starting service vm:149
>
> Jul 15 06:45:40 sanmarko pve-ha-lrm[1914439]: <root@pam> starting task
> UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:
>
> Jul 15 06:45:40 sanmarko pve-ha-lrm[1914441]: start VM 149:
> UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:
>
> Jul 15 06:45:40 sanmarko systemd[1]: 149.scope: Succeeded.
>
> Jul 15 06:45:40 sanmarko systemd[1]: Stopped 149.scope.
>
> Jul 15 06:45:43 sanmarko kernel: [145779.840863] general protection
> fault: 0000 [#7] SMP NOPTI
>
> Jul 15 06:45:43 sanmarko kernel: [145779.840942] CPU: 11 PID: 1740 Comm:
> pvestatd Tainted: P      D    O      5.4.124-1-pve #1
>
> Jul 15 06:45:43 sanmarko kernel: [145779.842207] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:43 sanmarko kernel: [145779.842310] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:43 sanmarko kernel: [145779.842383] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:43 sanmarko kernel: [145779.843337] Call Trace:
>
> Jul 15 06:45:43 sanmarko kernel: [145779.843406]  ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:43 sanmarko kernel: [145779.844545]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:43 sanmarko systemd[1]: pvestatd.service: Main process
> exited, code=killed, status=11/SEGV
>
> Jul 15 06:45:43 sanmarko systemd[1]: pvestatd.service: Failed with
> result 'signal'.
>
> Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: Task
> 'UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:' still
> active, waiting
>
> Jul 15 06:45:45 sanmarko pve-ha-lrm[1914441]: timeout waiting on systemd
>
> Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: <root@pam> end task
> UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam: timeout
> waiting on systemd
>
> Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: unable to start service vm:149
>
> Jul 15 06:45:50 sanmarko pve-ha-lrm[1804]: restart policy: retry number
> 1 for service 'vm:149'
>
> Jul 15 06:45:56 sanmarko kernel: [145792.695695] general protection
> fault: 0000 [#8] SMP NOPTI
>
> Jul 15 06:45:56 sanmarko kernel: [145792.695777] CPU: 11 PID: 1783 Comm:
> pve-ha-crm Tainted: P      D    O      5.4.124-1-pve #1
>
> Jul 15 06:45:56 sanmarko kernel: [145792.695876] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:56 sanmarko kernel: [145792.695980] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:56 sanmarko kernel: [145792.696054] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:56 sanmarko kernel: [145792.697012] Call Trace:
>
> Jul 15 06:45:56 sanmarko kernel: [145792.697081]  ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:56 sanmarko kernel: [145792.698225]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:56 sanmarko watchdog-mux[895]: client did not stop watchdog
> - disable watchdog updates
>
> Jul 15 06:45:56 sanmarko systemd[1]: pve-ha-crm.service: Main process
> exited, code=killed, status=11/SEGV
>
> Jul 15 06:45:56 sanmarko systemd[1]: pve-ha-crm.service: Failed with
> result 'signal'.
>
> Jul 15 06:45:56 sanmarko kernel: [145792.701730] FS:
> 00007fae7b4141c0(0000) GS:ffff96b69eac0000(0000) knlGS:0000000000000000
>
> Jul 15 06:45:56 sanmarko kernel: [145792.701826] CS:  0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
>
> Jul 15 06:45:56 sanmarko kernel: [145792.701902] CR2: 00007fa6bbf73008
> CR3: 0000001f29900000 CR4: 0000000000340ee0
>
> [... no more logs until reset ...]
>
>
> # pveversion -v
>
> proxmox-ve: 6.4-1 (running kernel: 5.4.124-1-pve)
> pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
> pve-kernel-5.4: 6.4-4
> pve-kernel-helper: 6.4-4
> pve-kernel-5.3: 6.1-6
> pve-kernel-5.4.124-1-pve: 5.4.124-1
> pve-kernel-5.4.119-1-pve: 5.4.119-1
> pve-kernel-5.3.18-3-pve: 5.3.18-3
> ceph: 15.2.13-pve1~bpo10
> ceph-fuse: 15.2.13-pve1~bpo10
> corosync: 3.1.2-pve1
> criu: 3.11-3
> glusterfs-client: 5.5-3
> ifupdown: residual config
> ifupdown2: 3.0.0-1+pve4~bpo10
> libjs-extjs: 6.0.1-10
> libknet1: 1.20-pve1
> libproxmox-acme-perl: 1.1.0
> libproxmox-backup-qemu0: 1.1.0-1
> libpve-access-control: 6.4-3
> libpve-apiclient-perl: 3.1-3
> libpve-common-perl: 6.4-3
> libpve-guest-common-perl: 3.1-5
> libpve-http-server-perl: 3.2-3
> libpve-storage-perl: 6.4-1
> libqb0: 1.0.5-1
> libspice-server1: 0.14.2-4~pve6+1
> lvm2: 2.03.02-pve4
> lxc-pve: 4.0.6-2
> lxcfs: 4.0.6-pve1
> novnc-pve: 1.1.0-1
> proxmox-backup-client: 1.1.10-1
> proxmox-mini-journalreader: 1.1-1
> proxmox-widget-toolkit: 2.6-1
> pve-cluster: 6.4-1
> pve-container: 3.3-6
> pve-docs: 6.4-2
> pve-edk2-firmware: 2.20200531-1
> pve-firewall: 4.1-4
> pve-firmware: 3.2-4
> pve-ha-manager: 3.1-1
> pve-i18n: 2.3-1
> pve-qemu-kvm: 5.2.0-6
> pve-xtermjs: 4.7.0-3
> qemu-server: 6.4-2
> smartmontools: 7.2-pve2
> spiceterm: 3.1-1
> vncterm: 1.6-2
> zfsutils-linux: 2.0.4-pve1
>
>
> Thanks a lot
>
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 |https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/



