Proxmox node hang

Eneko Lacunza elacunza at binovo.es
Thu Jul 15 10:40:06 CEST 2021


Hi all,

Early this morning a node of our 5-node Proxmox 6.4 + Ceph cluster froze
at ~6:45. A reset brought it back online later in the morning, and it has
been working well for 2 hours now.

HA worked like a charm and Ceph recovered within a few minutes.


Fantastic success story really, thanks for your excellent work, Proxmox
developers and contributors! :-)


Now for the "post-mortem": I see the node's 8 cores hitting "general
protection fault"s one after another within a minute, in different
processes.

I suspect a memory module or mainboard fault (Ryzen 3700X 8-core,
4x32GB non-ECC RAM and Gigabyte mainboard, all "consumer" parts; it has
been working well since Dec 2019). What do you think?

Here is a shortened syslog (I can provide all 437 lines if necessary):

---
Jul 15 06:45:00 sanmarko systemd[1]: Starting Proxmox VE replication 
runner...
Jul 15 06:45:00 sanmarko systemd[1]: pvesr.service: Succeeded.
Jul 15 06:45:00 sanmarko systemd[1]: Started Proxmox VE replication runner.
Jul 15 06:45:01 sanmarko CRON[1913457]: (root) CMD (command -v 
debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 15 06:45:01 sanmarko CRON[1913458]: (root) CMD (if [ -x 
/etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 
7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then 
/etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul 15 06:45:10 sanmarko kernel: [145747.429110] general protection 
fault: 0000 [#1] SMP NOPTI
Jul 15 06:45:10 sanmarko kernel: [145747.429175] CPU: 11 PID: 1914237 
Comm: ceph Tainted: P           O      5.4.124-1-pve #1
Jul 15 06:45:10 sanmarko kernel: [145747.429245] Hardware name: Gigabyte 
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jul 15 06:45:10 sanmarko kernel: [145747.429322] RIP: 
0010:kmem_cache_alloc+0x89/0x240
Jul 15 06:45:10 sanmarko kernel: [145747.429382] Code: 08 65 4c 03 05 30 
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01 
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01 
00 00 4c 89 e0 48 0f c9 48 31 cb
[...]
Jul 15 06:45:10 sanmarko kernel: [145747.430245] Call Trace:
Jul 15 06:45:10 sanmarko kernel: [145747.430314]  ? 
security_file_alloc+0x29/0x90
[...]
Jul 15 06:45:10 sanmarko kernel: [145747.431244] 
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
Jul 15 06:45:12 sanmarko kernel: [145749.037616] general protection 
fault: 0000 [#2] SMP NOPTI
Jul 15 06:45:12 sanmarko kernel: [145749.037695] CPU: 11 PID: 2433 Comm: 
tp_fstore_op Tainted: P      D    O      5.4.124-1-pve #1
Jul 15 06:45:12 sanmarko kernel: [145749.037793] Hardware name: Gigabyte 
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jul 15 06:45:12 sanmarko kernel: [145749.037898] RIP: 
0010:apparmor_file_free_security+0x22/0x40
Jul 15 06:45:12 sanmarko kernel: [145749.037975] Code: 2c ff ff eb a2 0f 
1f 00 0f 1f 44 00 00 48 63 05 28 fb fc 00 48 03 87 c0 00 00 00 74 1a 48 
8b 78 08 48 85 ff 74 11 55 48 89 e5 <f0> ff 0f 0f 88 dc 96 62 00 74 03 
5d c3 c3 e8 db 55 00 00 5d c3 66
[...]
Jul 15 06:45:12 sanmarko kernel: [145749.038942] Call Trace:
Jul 15 06:45:12 sanmarko kernel: [145749.039015] 
security_file_free+0x27/0x60
[...]
Jul 15 06:45:12 sanmarko kernel: [145749.039441] 
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
Jul 15 06:45:29 sanmarko kernel: [145765.573841] general protection 
fault: 0000 [#3] SMP NOPTI
Jul 15 06:45:29 sanmarko kernel: [145765.573922] CPU: 11 PID: 1733 Comm: 
pve-firewall Tainted: P      D    O      5.4.124-1-pve #1
Jul 15 06:45:29 sanmarko kernel: [145765.574021] Hardware name: Gigabyte 
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jul 15 06:45:29 sanmarko kernel: [145765.574127] RIP: 
0010:kmem_cache_alloc+0x89/0x240
Jul 15 06:45:29 sanmarko kernel: [145765.574201] Code: 08 65 4c 03 05 30 
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01 
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01 
00 00 4c 89 e0 48 0f c9 48 31 cb
[...]
Jul 15 06:45:29 sanmarko kernel: [145765.576321] Call Trace:
Jul 15 06:45:29 sanmarko kernel: [145765.576391]  ? 
security_file_alloc+0x29/0x90
[...]
Jul 15 06:45:29 sanmarko kernel: [145765.577258] 
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
Jul 15 06:45:29 sanmarko systemd[1]: pve-firewall.service: Main process 
exited, code=killed, status=11/SEGV
Jul 15 06:45:29 sanmarko systemd[1]: pve-firewall.service: Failed with 
result 'signal'.
Jul 15 06:45:35 sanmarko kernel: [145772.194438] general protection 
fault: 0000 [#4] SMP NOPTI
Jul 15 06:45:35 sanmarko kernel: [145772.194516] CPU: 11 PID: 1776 Comm: 
ms_dispatch Tainted: P      D    O      5.4.124-1-pve #1
Jul 15 06:45:35 sanmarko kernel: [145772.194614] Hardware name: Gigabyte 
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jul 15 06:45:35 sanmarko kernel: [145772.194718] RIP: 
0010:kmem_cache_alloc+0x89/0x240
Jul 15 06:45:35 sanmarko kernel: [145772.194792] Code: 08 65 4c 03 05 30 
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01 
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01 
00 00 4c 89 e0 48 0f c9 48 31 cb
[...]
Jul 15 06:45:35 sanmarko kernel: [145772.195750] Call Trace:
Jul 15 06:45:35 sanmarko kernel: [145772.195819]  ? 
security_file_alloc+0x29/0x90
[...]
Jul 15 06:45:35 sanmarko kernel: [145772.197176] 
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
Jul 15 06:45:37 sanmarko kernel: [145774.137506] general protection 
fault: 0000 [#5] SMP NOPTI
Jul 15 06:45:37 sanmarko kernel: [145774.137586] CPU: 11 PID: 2466 Comm: 
tp_fstore_op Tainted: P      D    O      5.4.124-1-pve #1
Jul 15 06:45:37 sanmarko kernel: [145774.137687] Hardware name: Gigabyte 
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jul 15 06:45:37 sanmarko kernel: [145774.137791] RIP: 
0010:kmem_cache_alloc+0x89/0x240
Jul 15 06:45:37 sanmarko kernel: [145774.137865] Code: 08 65 4c 03 05 30 
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01 
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01 
00 00 4c 89 e0 48 0f c9 48 31 cb
[...]
Jul 15 06:45:37 sanmarko kernel: [145774.139990] Call Trace:
Jul 15 06:45:37 sanmarko kernel: [145774.140059]  ? 
security_file_alloc+0x29/0x90
[...]
Jul 15 06:45:37 sanmarko kernel: [145774.140991] 
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
Jul 15 06:45:40 sanmarko kernel: [145776.830930] general protection 
fault: 0000 [#6] SMP NOPTI
Jul 15 06:45:40 sanmarko kernel: [145776.831010] CPU: 11 PID: 7234 Comm: 
kvm Tainted: P      D    O      5.4.124-1-pve #1
Jul 15 06:45:40 sanmarko kernel: [145776.831109] Hardware name: Gigabyte 
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jul 15 06:45:40 sanmarko kernel: [145776.831217] RIP: 
0010:kmem_cache_alloc+0x89/0x240
Jul 15 06:45:40 sanmarko kernel: [145776.831294] Code: 08 65 4c 03 05 30 
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01 
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01 
00 00 4c 89 e0 48 0f c9 48 31 cb
[...]
Jul 15 06:45:40 sanmarko kernel: [145776.832264] Call Trace:
Jul 15 06:45:40 sanmarko kernel: [145776.832334]  ? 
security_file_alloc+0x29/0x90
[...]
Jul 15 06:45:40 sanmarko kernel: [145776.833336] 
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 15 06:45:40 sanmarko pve-ha-lrm[1914439]: starting service vm:149
Jul 15 06:45:40 sanmarko pve-ha-lrm[1914439]: <root@pam> starting task 
UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:
Jul 15 06:45:40 sanmarko pve-ha-lrm[1914441]: start VM 149: 
UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:
Jul 15 06:45:40 sanmarko systemd[1]: 149.scope: Succeeded.
Jul 15 06:45:40 sanmarko systemd[1]: Stopped 149.scope.
Jul 15 06:45:43 sanmarko kernel: [145779.840863] general protection 
fault: 0000 [#7] SMP NOPTI
Jul 15 06:45:43 sanmarko kernel: [145779.840942] CPU: 11 PID: 1740 Comm: 
pvestatd Tainted: P      D    O      5.4.124-1-pve #1
Jul 15 06:45:43 sanmarko kernel: [145779.842207] Hardware name: Gigabyte 
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jul 15 06:45:43 sanmarko kernel: [145779.842310] RIP: 
0010:kmem_cache_alloc+0x89/0x240
Jul 15 06:45:43 sanmarko kernel: [145779.842383] Code: 08 65 4c 03 05 30 
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01 
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01 
00 00 4c 89 e0 48 0f c9 48 31 cb
[...]
Jul 15 06:45:43 sanmarko kernel: [145779.843337] Call Trace:
Jul 15 06:45:43 sanmarko kernel: [145779.843406]  ? 
security_file_alloc+0x29/0x90
[...]
Jul 15 06:45:43 sanmarko kernel: [145779.844545] 
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
Jul 15 06:45:43 sanmarko systemd[1]: pvestatd.service: Main process 
exited, code=killed, status=11/SEGV
Jul 15 06:45:43 sanmarko systemd[1]: pvestatd.service: Failed with 
result 'signal'.
Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: Task 
'UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:' still 
active, waiting
Jul 15 06:45:45 sanmarko pve-ha-lrm[1914441]: timeout waiting on systemd
Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: <root@pam> end task 
UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam: timeout 
waiting on systemd
Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: unable to start service vm:149
Jul 15 06:45:50 sanmarko pve-ha-lrm[1804]: restart policy: retry number 
1 for service 'vm:149'
Jul 15 06:45:56 sanmarko kernel: [145792.695695] general protection 
fault: 0000 [#8] SMP NOPTI
Jul 15 06:45:56 sanmarko kernel: [145792.695777] CPU: 11 PID: 1783 Comm: 
pve-ha-crm Tainted: P      D    O      5.4.124-1-pve #1
Jul 15 06:45:56 sanmarko kernel: [145792.695876] Hardware name: Gigabyte 
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jul 15 06:45:56 sanmarko kernel: [145792.695980] RIP: 
0010:kmem_cache_alloc+0x89/0x240
Jul 15 06:45:56 sanmarko kernel: [145792.696054] Code: 08 65 4c 03 05 30 
e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01 
00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01 
00 00 4c 89 e0 48 0f c9 48 31 cb
[...]
Jul 15 06:45:56 sanmarko kernel: [145792.697012] Call Trace:
Jul 15 06:45:56 sanmarko kernel: [145792.697081]  ? 
security_file_alloc+0x29/0x90
[...]
Jul 15 06:45:56 sanmarko kernel: [145792.698225] 
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
Jul 15 06:45:56 sanmarko watchdog-mux[895]: client did not stop watchdog 
- disable watchdog updates
Jul 15 06:45:56 sanmarko systemd[1]: pve-ha-crm.service: Main process 
exited, code=killed, status=11/SEGV
Jul 15 06:45:56 sanmarko systemd[1]: pve-ha-crm.service: Failed with 
result 'signal'.
Jul 15 06:45:56 sanmarko kernel: [145792.701730] FS: 
00007fae7b4141c0(0000) GS:ffff96b69eac0000(0000) knlGS:0000000000000000
Jul 15 06:45:56 sanmarko kernel: [145792.701826] CS:  0010 DS: 0000 ES: 
0000 CR0: 0000000080050033
Jul 15 06:45:56 sanmarko kernel: [145792.701902] CR2: 00007fa6bbf73008 
CR3: 0000001f29900000 CR4: 0000000000340ee0
[... no more logs until reset ...]
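
For anyone wanting to tally the faults from the full log, here is a minimal
Python sketch, not a definitive tool. It assumes it is fed the original,
unwrapped syslog lines (the excerpt above is wrapped by the mail client), and
the function name summarize_faults and the SAMPLE text are illustrative
only. In the shortened excerpt above, all eight faults report CPU 11, and
seven of the eight have their RIP at kmem_cache_alloc+0x89/0x240 (#2 is in
apparmor_file_free_security+0x22/0x40).

```python
import re
from collections import Counter

# Patterns for the three oops lines of interest (unwrapped syslog lines).
FAULT_RE = re.compile(r"general protection fault: \w+ \[#(\d+)\]")
CPU_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")
RIP_RE = re.compile(r"RIP: \w+:(\S+)")


def summarize_faults(lines):
    """Return one dict per general protection fault found in `lines`."""
    faults = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            # A new oops starts here; later lines fill in its details.
            faults.append({"oops": int(m.group(1)), "cpu": None,
                           "pid": None, "comm": None, "rip": None})
            continue
        if not faults:
            continue
        current = faults[-1]
        m = CPU_RE.search(line)
        if m and current["cpu"] is None:
            current["cpu"] = int(m.group(1))
            current["pid"] = int(m.group(2))
            current["comm"] = m.group(3)
            continue
        m = RIP_RE.search(line)
        # Keep only the first RIP line seen for each oops.
        if m and current["rip"] is None:
            current["rip"] = m.group(1)
    return faults


# Two faults taken from the excerpt above, with the mail wrapping undone.
SAMPLE = """\
Jul 15 06:45:10 sanmarko kernel: [145747.429110] general protection fault: 0000 [#1] SMP NOPTI
Jul 15 06:45:10 sanmarko kernel: [145747.429175] CPU: 11 PID: 1914237 Comm: ceph Tainted: P           O      5.4.124-1-pve #1
Jul 15 06:45:10 sanmarko kernel: [145747.429322] RIP: 0010:kmem_cache_alloc+0x89/0x240
Jul 15 06:45:12 sanmarko kernel: [145749.037616] general protection fault: 0000 [#2] SMP NOPTI
Jul 15 06:45:12 sanmarko kernel: [145749.037695] CPU: 11 PID: 2433 Comm: tp_fstore_op Tainted: P      D    O      5.4.124-1-pve #1
Jul 15 06:45:12 sanmarko kernel: [145749.037898] RIP: 0010:apparmor_file_free_security+0x22/0x40
"""

if __name__ == "__main__":
    faults = summarize_faults(SAMPLE.splitlines())
    print("faults per CPU:", Counter(f["cpu"] for f in faults))
    print("faults per RIP:", Counter(f["rip"] for f in faults))
    for f in faults:
        print(f["oops"], f["comm"], f["pid"])
```

To run it against a real log, one could pass the lines of the saved syslog
file instead of SAMPLE. If the counts stay concentrated on one CPU and one
code location across the full 437 lines, that is worth mentioning when
asking for a diagnosis.
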


# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.124-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-4
pve-kernel-helper: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph: 15.2.13-pve1~bpo10
ceph-fuse: 15.2.13-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1


Thanks a lot

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/





More information about the pve-user mailing list