PVE 7 Ceph 16.2.6 osd crash
Eneko Lacunza
elacunza@binovo.es
Tue Oct 26 16:03:49 CEST 2021
Hi all,
This morning an OSD in our office cluster crashed:
Oct 26 12:52:17 sanmarko ceph-osd[2161]: ceph-osd:
../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion
`mutex->__data.__owner == 0' failed.
Oct 26 12:52:17 sanmarko ceph-osd[2161]: *** Caught signal (Aborted) **
Oct 26 12:52:17 sanmarko ceph-osd[2161]: in thread 7fb2a6722700
thread_name:filestore_sync
Oct 26 12:52:17 sanmarko ceph-osd[2161]: ceph version 16.2.6
(1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 1:
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fb2b1805140]
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 2: gsignal()
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 3: abort()
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 4:
/lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7fb2b12ba40f]
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 5:
/lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7fb2b12c9662]
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 6: pthread_mutex_lock()
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 7:
(JournalingObjectStore::ApplyManager::commit_start()+0xbc) [0x555bd4e5ab2c]
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 8:
(FileStore::sync_entry()+0x32e) [0x555bd4e2448e]
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 9:
(FileStore::SyncThread::entry()+0xd) [0x555bd4e4ed2d]
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 10:
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7fb2b17f9ea7]
Oct 26 12:52:17 sanmarko ceph-osd[2161]: 11: clone()
Oct 26 12:52:17 sanmarko ceph-osd[2161]: NOTE: a copy of the
executable, or `objdump -rdS <executable>` is needed to interpret this.
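Regarding that NOTE: for what it's worth, the named frames can probably be resolved against the binary, roughly like this, assuming matching debug symbols for /usr/bin/ceph-osd are installed (on Debian-based systems usually a -dbgsym package; availability may vary):

# objdump -rdS /usr/bin/ceph-osd > ceph-osd.dis
# gdb /usr/bin/ceph-osd -batch -ex "info line *('FileStore::sync_entry()' + 0x32e)"

(the function and offset are just the ones from frame 8 above; without debug symbols gdb will only print the raw address)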
Oct 26 12:52:17 sanmarko systemd[1]: ceph-osd@6.service: Main process exited, code=killed, status=6/ABRT
Oct 26 12:52:17 sanmarko systemd[1]: ceph-osd@6.service: Failed with result 'signal'.
Oct 26 12:52:17 sanmarko systemd[1]: ceph-osd@6.service: Consumed 1h 7min 59.828s CPU time.
This is a filestore OSD. The node has 4 OSDs: 1 SSD OSD and 3 HDD OSDs with their journals on the system SSD.
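The backend can be double-checked per OSD; for osd.6 (the one that crashed, per the systemd unit above) something like:

# ceph osd metadata 6 | grep osd_objectstore

should report "filestore".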
I don't see any useful detail in ceph-osd.6.log, but it is full of entries like these before the crash (they're there even back at -9999):
-15> 2021-10-26T12:52:11.422+0200 7fb29ff23700 10 monclient: tick
-14> 2021-10-26T12:52:11.422+0200 7fb29ff23700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-10-26T12:51:41.424126+0200)
-13> 2021-10-26T12:52:12.422+0200 7fb29ff23700 10 monclient: tick
-12> 2021-10-26T12:52:12.422+0200 7fb29ff23700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-10-26T12:51:42.424200+0200)
-11> 2021-10-26T12:52:13.422+0200 7fb29ff23700 10 monclient: tick
-10> 2021-10-26T12:52:13.422+0200 7fb29ff23700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-10-26T12:51:43.424270+0200)
-9> 2021-10-26T12:52:14.422+0200 7fb29ff23700 10 monclient: tick
-8> 2021-10-26T12:52:14.422+0200 7fb29ff23700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-10-26T12:51:44.424342+0200)
-7> 2021-10-26T12:52:15.422+0200 7fb29ff23700 10 monclient: tick
-6> 2021-10-26T12:52:15.422+0200 7fb29ff23700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-10-26T12:51:45.424414+0200)
-5> 2021-10-26T12:52:15.726+0200 7fb29370a700 5 osd.6 39883
heartbeat osd_stat(store_statfs(0x999ad46000/0xd0000/0xe8c3af0000, data
0x999ae16000/0x999ae16000, compress 0x0/0x0/0x0, omap 0x2d2d2dc, meta 0x0)
, peers [0,2,3,4,7,8,9,10,11,15] op hist [])
-4> 2021-10-26T12:52:16.422+0200 7fb29ff23700 10 monclient: tick
-3> 2021-10-26T12:52:16.422+0200 7fb29ff23700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-10-26T12:51:46.424489+0200)
-2> 2021-10-26T12:52:17.422+0200 7fb29ff23700 10 monclient: tick
-1> 2021-10-26T12:52:17.422+0200 7fb29ff23700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after
2021-10-26T12:51:47.424562+0200)
I don't see those entries after the daemon was automatically restarted.
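If it happens again I might try to capture more detail by raising the filestore/journal debug levels beforehand, something like (the values are just a guess at what's useful):

# ceph tell osd.6 config set debug_filestore 20
# ceph tell osd.6 config set debug_journal 20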
Any idea what could be the issue?
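For what it's worth, the crash should also be visible through the crash module (assuming the ceph-crash service is running), e.g.:

# ceph crash ls
# ceph crash info <crash-id>

(with <crash-id> taken from the ls output)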
System has plenty of RAM:
# free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi        28Gi       752Mi        65Mi        96Gi        96Gi
Swap:             0B          0B          0B
And the CPU is ~90% idle (16 cores).
# pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-5-pve)
pve-manager: 7.0-13 (running version: 7.0-13/7aa7e488)
pve-kernel-helper: 7.1-2
pve-kernel-5.11: 7.0-8
pve-kernel-5.4: 6.4-6
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.124-1-pve: 5.4.124-2
ceph: 16.2.6-pve2
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-10
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-12
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.11-1
proxmox-backup-file-restore: 2.0.11-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-10
pve-docs: 7.0-5
pve-edk2-firmware: 3.20210831-1
pve-firewall: 4.2-4
pve-firmware: 3.3-2
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-16
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1
Thanks
Eneko Lacunza
Technical director
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/