[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Tue Sep 15 16:32:52 CEST 2020

On 9/15/20 4:09 PM, Alexandre DERUMIER wrote:
>>> Can you try to give pmxcfs real time scheduling, e.g., by doing: 
>>>
>>> # systemctl edit pve-cluster 
>>>
>>> And then add snippet: 
>>>
>>>
>>> [Service] 
>>> CPUSchedulingPolicy=rr 
>>> CPUSchedulingPriority=99 
> yes, sure, I'll do it now
> 
> 
>> I'm currently digging the logs 
>>> Is your simplest/most stable reproducer still a periodic restart of corosync on one node? 
> yes, a simple "systemctl restart corosync" on 1 node each minute
> 
> 
> 
> After 1 hour, it's still locked.
> 
> on other nodes, I still have pmxcfs logs like:
> 

I mean this is bad, but also great!
Can you take a coredump of the whole thing and upload it somewhere, together with the
version info used (for the dbgsym package)? That could help a lot.
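
A rough sketch of how the dump could be collected, in case it helps. It assumes gdb is
installed (for gcore) and that pmxcfs is still running. The function name and the output
directory are made up for illustration. The directory is deliberately outside /etc/pve,
since that mount is what is hanging.

```shell
# Sketch only: collect_pmxcfs_core and the default output directory are
# hypothetical names chosen for this example.
collect_pmxcfs_core() {
    outdir="${1:-/var/tmp/pmxcfs-debug}"
    # Bail out early if there is no running pmxcfs to dump.
    pid="$(pidof pmxcfs)" || { echo "pmxcfs is not running" >&2; return 1; }
    mkdir -p "$outdir" || return 1
    # Record package versions so the matching dbgsym packages can be picked.
    pveversion -v > "$outdir/versions.txt"
    # Dump the running process without killing it; gcore appends ".<pid>"
    # to the prefix given with -o.
    gcore -o "$outdir/pmxcfs.core" "$pid"
}
```

Calling `collect_pmxcfs_core` with no argument would then leave versions.txt and the core
file in /var/tmp/pmxcfs-debug, ready to be compressed and uploaded.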

> manual "pmxcfs -d"
> https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e
> 

Hmm, the fuse connection of the previous one got into a weird state (or something is still
running) but I'd rather say this is a side-effect not directly connected to the real bug.

> 
> some interesting dmesg about "pvesr"
> 
> [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds.
> [Tue Sep 15 14:45:34 2020]       Tainted: P           O      5.4.60-1-pve #1
> [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Tue Sep 15 14:45:34 2020] pvesr           D    0 19038      1 0x00000080
> [Tue Sep 15 14:45:34 2020] Call Trace:
> [Tue Sep 15 14:45:34 2020]  __schedule+0x2e6/0x6f0
> [Tue Sep 15 14:45:34 2020]  ? filename_parentat.isra.57.part.58+0xf7/0x180
> [Tue Sep 15 14:45:34 2020]  schedule+0x33/0xa0
> [Tue Sep 15 14:45:34 2020]  rwsem_down_write_slowpath+0x2ed/0x4a0
> [Tue Sep 15 14:45:34 2020]  down_write+0x3d/0x40
> [Tue Sep 15 14:45:34 2020]  filename_create+0x8e/0x180
> [Tue Sep 15 14:45:34 2020]  do_mkdirat+0x59/0x110
> [Tue Sep 15 14:45:34 2020]  __x64_sys_mkdir+0x1b/0x20
> [Tue Sep 15 14:45:34 2020]  do_syscall_64+0x57/0x190
> [Tue Sep 15 14:45:34 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 

Hmm, so it hangs in mkdir (cluster-wide locking).
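
To see whether other tasks are stuck the same way, something like the following could be
used to list everything in uninterruptible sleep (state D) along with the kernel function
it is waiting in. This is just a sketch: the helper names are made up, and it assumes the
procps version of ps.

```shell
# Sketch only: filter_dstate and list_blocked_tasks are hypothetical helper names.

# Keep the header line plus every row whose STAT column starts with "D".
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Print PID, state, wait channel and command name for all blocked tasks.
list_blocked_tasks() {
    ps -eo pid,stat,wchan:32,comm | filter_dstate
}
```

Running `list_blocked_tasks` on each node while the cluster is locked should show whether
pvesr is the only task blocked on that lock.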