[PVE-User] node not rebooted after corosync crash
Dmitry Petuhov
mityapetuhov at gmail.com
Wed Aug 15 10:41:43 CEST 2018
A week ago, corosync suddenly crashed on one of my PVE nodes:
-------------->8=========
corosync[4701]: error [TOTEM ] FAILED TO RECEIVE
corosync[4701]: [TOTEM ] FAILED TO RECEIVE
corosync[4701]: notice [TOTEM ] A new membership (10.19.92.53:1992) was formed. Members left: 1 2 4
corosync[4701]: notice [TOTEM ] Failed to receive the leave message. failed: 1 2 4
corosync[4701]: [TOTEM ] A new membership (10.19.92.53:1992) was formed. Members left: 1 2 4
corosync[4701]: [TOTEM ] Failed to receive the leave message. failed: 1 2 4
corosync[4701]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
corosync[4701]: notice [QUORUM] Members[1]: 3
corosync[4701]: notice [MAIN ] Completed service synchronization, ready to provide service.
corosync[4701]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
corosync[4701]: [QUORUM] Members[1]: 3
corosync[4701]: [MAIN ] Completed service synchronization, ready to provide service.
kernel: [29187555.500409] dlm: closing connection to node 2
corosync[4701]: notice [TOTEM ] A new membership (10.19.92.51:2000) was formed. Members joined: 1 2 4
corosync[4701]: [TOTEM ] A new membership (10.19.92.51:2000) was formed. Members joined: 1 2 4
corosync[4701]: notice [QUORUM] This node is within the primary component and will provide service.
corosync[4701]: notice [QUORUM] Members[4]: 1 2 3 4
corosync[4701]: notice [MAIN ] Completed service synchronization, ready to provide service.
corosync[4701]: [QUORUM] This node is within the primary component and will provide service.
corosync[4701]: notice [CFG ] Killed by node 1: dlm_controld
corosync[4701]: error [MAIN ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
corosync[4701]: [QUORUM] Members[4]: 1 2 3 4
corosync[4701]: [MAIN ] Completed service synchronization, ready to provide service.
dlm_controld[688]: 29187298 daemon node 4 stateful merge
dlm_controld[688]: 29187298 receive_start 4:6 add node with started_count 2
dlm_controld[688]: 29187298 daemon node 1 stateful merge
dlm_controld[688]: 29187298 receive_start 1:5 add node with started_count 4
dlm_controld[688]: 29187298 daemon node 2 stateful merge
dlm_controld[688]: 29187298 receive_start 2:17 add node with started_count 13
corosync[4701]: [CFG ] Killed by node 1: dlm_controld
corosync[4701]: [MAIN ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
dlm_controld[688]: 29187298 cpg_dispatch error 2
dlm_controld[688]: 29187298 process_cluster_cfg cfg_dispatch 2
dlm_controld[688]: 29187298 cluster is down, exiting
dlm_controld[688]: 29187298 process_cluster quorum_dispatch 2
dlm_controld[688]: 29187298 daemon cpg_dispatch error 2
systemd[1]: corosync.service: Main process exited, code=exited, status=255/n/a
systemd[1]: corosync.service: Unit entered failed state.
systemd[1]: corosync.service: Failed with result 'exit-code'.
kernel: [29187556.903177] dlm: closing connection to node 4
kernel: [29187556.906730] dlm: closing connection to node 3
dlm_controld[688]: 29187298 abandoned lockspace hp-big-gfs
kernel: [29187556.924279] dlm: dlm user daemon left 1 lockspaces
-------------->8=========
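In case it is useful, this is how the service and quorum state can be inspected after such a crash (stock systemd/PVE commands, just a sketch; output omitted since it is node-specific):
-------------->8=========
# state of the crashed service on this node
systemctl status corosync
# quorum and membership as seen by the PVE cluster stack
pvecm status
# full corosync log for the current boot
journalctl -u corosync -b
-------------->8=========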
But the node did not reboot.
I use WATCHDOG_MODULE=ipmi_watchdog. The watchdog is still running:
-------------->8=========
# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 10 sec
Present Countdown: 9 sec
-------------->8=========
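For reference, WATCHDOG_MODULE comes from /etc/default/pve-ha-manager, and which daemon actually holds the watchdog device can be checked like this (assuming the stock PVE setup, where watchdog-mux opens /dev/watchdog):
-------------->8=========
# /etc/default/pve-ha-manager
WATCHDOG_MODULE=ipmi_watchdog

# who keeps /dev/watchdog open (normally watchdog-mux on PVE)
fuser -v /dev/watchdog
systemctl status watchdog-mux
-------------->8=========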
The only service that is down is corosync.
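One way to confirm that (plain systemd, nothing PVE-specific):
-------------->8=========
# confirm corosync is the only failed unit
systemctl list-units --state=failed
# the PVE daemons that would normally react to a corosync loss
systemctl status pve-cluster pve-ha-lrm pve-ha-crm
-------------->8=========
Package versions on this node: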
-------------->8=========
# pveversion --verbose
proxmox-ve: 5.0-21 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-31 (running version: 5.0-31/27769b1f)
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.17-3-pve: 4.10.17-21
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-5
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
gfs2-utils: 3.1.9-2
openvswitch-switch: 2.7.0-2
ceph: 12.2.0-pve1
-------------->8=========
I also have GFS2 in this cluster, which did not stop working after the
corosync crash (and that scares me the most).
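For completeness, the DLM side can be inspected with dlm_tool (it ships with dlm-controld; just the commands, output omitted):
-------------->8=========
# lockspaces still known to the kernel DLM ("hp-big-gfs" from the log above)
dlm_tool ls
# GFS2 mounts that kept working despite the abandoned lockspace
mount -t gfs2
-------------->8=========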
Shouldn't the node reboot when corosync fails, and why is it still
running? Or does a node only reboot if it has HA VMs, and simply stay as
it is when there are only regular autostarted VMs and no HA machines present?
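For reference, whether any HA resources are configured at all can be seen with the stock HA CLI (just a sketch of the check, not a fix; as far as I understand, the LRM only arms the watchdog while HA resources are active on the node):
-------------->8=========
# HA manager/LRM state per node plus configured resources
ha-manager status
# raw HA resource configuration (empty or absent if no HA VMs are defined)
cat /etc/pve/ha/resources.cfg
-------------->8=========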