[PVE-User] node not rebooted after corosync crash
Dmitry Petuhov
mityapetuhov at gmail.com
Wed Aug 15 10:41:43 CEST 2018
A week ago, corosync suddenly crashed on one of my PVE nodes:
-------------->8=========
corosync[4701]: error [TOTEM ] FAILED TO RECEIVE
corosync[4701]: [TOTEM ] FAILED TO RECEIVE
corosync[4701]: notice [TOTEM ] A new membership (10.19.92.53:1992) was formed. Members left: 1 2 4
corosync[4701]: notice [TOTEM ] Failed to receive the leave message. failed: 1 2 4
corosync[4701]: [TOTEM ] A new membership (10.19.92.53:1992) was formed. Members left: 1 2 4
corosync[4701]: [TOTEM ] Failed to receive the leave message. failed: 1 2 4
corosync[4701]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
corosync[4701]: notice [QUORUM] Members[1]: 3
corosync[4701]: notice [MAIN ] Completed service synchronization, ready to provide service.
corosync[4701]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
corosync[4701]: [QUORUM] Members[1]: 3
corosync[4701]: [MAIN ] Completed service synchronization, ready to provide service.
kernel: [29187555.500409] dlm: closing connection to node 2
corosync[4701]: notice [TOTEM ] A new membership (10.19.92.51:2000) was formed. Members joined: 1 2 4
corosync[4701]: [TOTEM ] A new membership (10.19.92.51:2000) was formed. Members joined: 1 2 4
corosync[4701]: notice [QUORUM] This node is within the primary component and will provide service.
corosync[4701]: notice [QUORUM] Members[4]: 1 2 3 4
corosync[4701]: notice [MAIN ] Completed service synchronization, ready to provide service.
corosync[4701]: [QUORUM] This node is within the primary component and will provide service.
corosync[4701]: notice [CFG ] Killed by node 1: dlm_controld
corosync[4701]: error [MAIN ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
corosync[4701]: [QUORUM] Members[4]: 1 2 3 4
corosync[4701]: [MAIN ] Completed service synchronization, ready to provide service.
dlm_controld[688]: 29187298 daemon node 4 stateful merge
dlm_controld[688]: 29187298 receive_start 4:6 add node with started_count 2
dlm_controld[688]: 29187298 daemon node 1 stateful merge
dlm_controld[688]: 29187298 receive_start 1:5 add node with started_count 4
dlm_controld[688]: 29187298 daemon node 2 stateful merge
dlm_controld[688]: 29187298 receive_start 2:17 add node with started_count 13
corosync[4701]: [CFG ] Killed by node 1: dlm_controld
corosync[4701]: [MAIN ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
dlm_controld[688]: 29187298 cpg_dispatch error 2
dlm_controld[688]: 29187298 process_cluster_cfg cfg_dispatch 2
dlm_controld[688]: 29187298 cluster is down, exiting
dlm_controld[688]: 29187298 process_cluster quorum_dispatch 2
dlm_controld[688]: 29187298 daemon cpg_dispatch error 2
systemd[1]: corosync.service: Main process exited, code=exited, status=255/n/a
systemd[1]: corosync.service: Unit entered failed state.
systemd[1]: corosync.service: Failed with result 'exit-code'.
kernel: [29187556.903177] dlm: closing connection to node 4
kernel: [29187556.906730] dlm: closing connection to node 3
dlm_controld[688]: 29187298 abandoned lockspace hp-big-gfs
kernel: [29187556.924279] dlm: dlm user daemon left 1 lockspaces
-------------->8=========
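In case it is useful, this is how the service and quorum state can be inspected after such a crash (stock systemd/PVE commands, just a sketch; output omitted since it is node-specific):
-------------->8=========
# state of the crashed service on this node
systemctl status corosync
# quorum and membership as seen by the PVE cluster stack
pvecm status
# full corosync log for the current boot
journalctl -u corosync -b
-------------->8=========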
But the node did not reboot.
I use WATCHDOG_MODULE=ipmi_watchdog. The watchdog is still running:
-------------->8=========
# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 10 sec
Present Countdown: 9 sec
-------------->8=========
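For reference, WATCHDOG_MODULE comes from /etc/default/pve-ha-manager, and which daemon actually holds the watchdog device can be checked like this (assuming the stock PVE setup, where watchdog-mux opens /dev/watchdog):
-------------->8=========
# /etc/default/pve-ha-manager
WATCHDOG_MODULE=ipmi_watchdog

# who keeps /dev/watchdog open (normally watchdog-mux on PVE)
fuser -v /dev/watchdog
systemctl status watchdog-mux
-------------->8=========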
The only service that is down is corosync.
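One way to confirm that (plain systemd, nothing PVE-specific):
-------------->8=========
# confirm corosync is the only failed unit
systemctl list-units --state=failed
# the PVE daemons that would normally react to a corosync loss
systemctl status pve-cluster pve-ha-lrm pve-ha-crm
-------------->8=========
Package versions on this node: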
-------------->8=========
# pveversion --verbose
proxmox-ve: 5.0-21 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-31 (running version: 5.0-31/27769b1f)
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.17-3-pve: 4.10.17-21
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-5
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
gfs2-utils: 3.1.9-2
openvswitch-switch: 2.7.0-2
ceph: 12.2.0-pve1
-------------->8=========
I also have GFS2 in this cluster, which did not stop working after the
corosync crash (and that scares me the most).
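For completeness, the DLM side can be inspected with dlm_tool (it ships with dlm-controld; just the commands, output omitted):
-------------->8=========
# lockspaces still known to the kernel DLM ("hp-big-gfs" from the log above)
dlm_tool ls
# GFS2 mounts that kept working despite the abandoned lockspace
mount -t gfs2
-------------->8=========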
Shouldn't the node reboot when corosync fails, and why is it still
running? Or does a node only reboot if it has HA VMs, and simply stay as
it is when there are only regular autostarted VMs and no HA machines present?
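For reference, whether any HA resources are configured at all can be seen with the stock HA CLI (just a sketch of the check, not a fix; as far as I understand, the LRM only arms the watchdog while HA resources are active on the node):
-------------->8=========
# HA manager/LRM state per node plus configured resources
ha-manager status
# raw HA resource configuration (empty or absent if no HA VMs are defined)
cat /etc/pve/ha/resources.cfg
-------------->8=========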