PVE 7 Ceph 16.2.6 osd crash

Eneko Lacunza <elacunza@binovo.es>
Tue Oct 26 16:03:49 CEST 2021


Hi all,

This morning an OSD in our office cluster crashed:

Oct 26 12:52:17 sanmarko ceph-osd[2161]: ceph-osd: 
../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion 
`mutex->__data.__owner == 0' failed.
Oct 26 12:52:17 sanmarko ceph-osd[2161]: *** Caught signal (Aborted) **
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  in thread 7fb2a6722700 
thread_name:filestore_sync
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  ceph version 16.2.6 
(1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  1: 
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fb2b1805140]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  2: gsignal()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  3: abort()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  4: 
/lib/x86_64-linux-gnu/libc.so.6(+0x2540f) [0x7fb2b12ba40f]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  5: 
/lib/x86_64-linux-gnu/libc.so.6(+0x34662) [0x7fb2b12c9662]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  6: pthread_mutex_lock()
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  7: 
(JournalingObjectStore::ApplyManager::commit_start()+0xbc) [0x555bd4e5ab2c]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  8: 
(FileStore::sync_entry()+0x32e) [0x555bd4e2448e]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  9: 
(FileStore::SyncThread::entry()+0xd) [0x555bd4e4ed2d]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  10: 
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7fb2b17f9ea7]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  11: clone()
[the same backtrace is repeated three more times in the journal, each time 
followed by:]
Oct 26 12:52:17 sanmarko ceph-osd[2161]:  NOTE: a copy of the 
executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 26 12:52:17 sanmarko systemd[1]: ceph-osd@6.service: Main process 
exited, code=killed, status=6/ABRT
Oct 26 12:52:17 sanmarko systemd[1]: ceph-osd@6.service: Failed with 
result 'signal'.
Oct 26 12:52:17 sanmarko systemd[1]: ceph-osd@6.service: Consumed 1h 
7min 59.828s CPU time.

This is a filestore OSD. This node has 4 OSDs: 1 SSD OSD and 3 HDD OSDs 
with their journals on the system SSD.
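
(For reference, the objectstore type and backing devices of osd.6 can be 
double-checked from its metadata, something like:)

# ceph osd metadata 6 | grep -E '"osd_objectstore"|"devices"'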

I don't see any useful detail in ceph-osd.6.log... it's just full of entries 
like these before the crash (they look the same all the way back to -9999 in 
the recent-events dump):
    -15> 2021-10-26T12:52:11.422+0200 7fb29ff23700 10 monclient: tick
    -14> 2021-10-26T12:52:11.422+0200 7fb29ff23700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2021-10-26T12:51:41.424126+0200)
    -13> 2021-10-26T12:52:12.422+0200 7fb29ff23700 10 monclient: tick
    -12> 2021-10-26T12:52:12.422+0200 7fb29ff23700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2021-10-26T12:51:42.424200+0200)
    -11> 2021-10-26T12:52:13.422+0200 7fb29ff23700 10 monclient: tick
    -10> 2021-10-26T12:52:13.422+0200 7fb29ff23700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2021-10-26T12:51:43.424270+0200)
     -9> 2021-10-26T12:52:14.422+0200 7fb29ff23700 10 monclient: tick
     -8> 2021-10-26T12:52:14.422+0200 7fb29ff23700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2021-10-26T12:51:44.424342+0200)
     -7> 2021-10-26T12:52:15.422+0200 7fb29ff23700 10 monclient: tick
     -6> 2021-10-26T12:52:15.422+0200 7fb29ff23700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2021-10-26T12:51:45.424414+0200)
     -5> 2021-10-26T12:52:15.726+0200 7fb29370a700  5 osd.6 39883 
heartbeat osd_stat(store_statfs(0x999ad46000/0xd0000/0xe8c3af0000, data 
0x999ae16000/0x999ae16000, compress 0x0/0x0/0x0, omap 0x2d2d2dc, meta 0x0)
, peers [0,2,3,4,7,8,9,10,11,15] op hist [])
     -4> 2021-10-26T12:52:16.422+0200 7fb29ff23700 10 monclient: tick
     -3> 2021-10-26T12:52:16.422+0200 7fb29ff23700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2021-10-26T12:51:46.424489+0200)
     -2> 2021-10-26T12:52:17.422+0200 7fb29ff23700 10 monclient: tick
     -1> 2021-10-26T12:52:17.422+0200 7fb29ff23700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2021-10-26T12:51:47.424562+0200)


I don't see those log entries anymore after the daemon was automatically restarted.
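
In case it is useful, the crash should also have been picked up by the crash 
module, so something like this may give a bit more context (assuming 
ceph-crash is running on the node):

# ceph crash ls
# ceph crash info <crash_id>

And if it happens again, temporarily raising the filestore/journal debug 
levels on that OSD might catch more detail:

# ceph tell osd.6 config set debug_filestore 10
# ceph tell osd.6 config set debug_journal 10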

Any idea what could be the issue?

System has plenty of RAM:

# free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi        28Gi       752Mi        65Mi        96Gi        96Gi
Swap:             0B          0B          0B

And the CPU is ~90% idle (across 16 cores).
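
As a side note, to rule out hardware trouble around that time I can also 
check the kernel log for that window, something like:

# journalctl -k --since "2021-10-26 12:45" --until "2021-10-26 12:55"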

# pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-5-pve)
pve-manager: 7.0-13 (running version: 7.0-13/7aa7e488)
pve-kernel-helper: 7.1-2
pve-kernel-5.11: 7.0-8
pve-kernel-5.4: 6.4-6
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.124-1-pve: 5.4.124-2
ceph: 16.2.6-pve2
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-10
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-12
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.11-1
proxmox-backup-file-restore: 2.0.11-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-10
pve-docs: 7.0-5
pve-edk2-firmware: 3.20210831-1
pve-firewall: 4.2-4
pve-firmware: 3.3-2
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-16
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Thanks


Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/




