[PVE-User] WARNING: Upgrade and Watchdog kills Server in HA-Mode

Andreas Herrmann andreas at mx20.org
Thu Dec 7 14:07:41 CET 2017


Hi again,

On 07.12.2017 08:57, Thomas Lamprecht wrote:
> Do you have any log entries from around that time?
> Or a persistent journal?

Some more filtered watchdog logs are attached. nethcn-b(1|2|5)
"crashed", while nethcn-b(3|4) stayed online. Ceph monitors are running on
nethcn-b(1|3|5).

Andreas
-------------- next part --------------
root@nethcn-b1:~# cat /var/log/syslog.1|egrep watchdog\|ipcc
Dec  6 18:33:53 nethcn-b1 pvestatd[10770]: ipcc_send_rec[4] failed: Transport endpoint is not connected
Dec  6 18:33:53 nethcn-b1 pvestatd[10770]: ipcc_send_rec[4] failed: Connection refused
Dec  6 18:33:53 nethcn-b1 pvestatd[10770]: ipcc_send_rec[4] failed: Connection refused
Dec  6 18:33:53 nethcn-b1 pvestatd[10770]: ipcc_send_rec[4] failed: Connection refused
Dec  6 18:33:53 nethcn-b1 pvestatd[10770]: ipcc_send_rec[4] failed: Connection refused
Dec  6 18:33:56 nethcn-b1 pve-ha-lrm[13875]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 18:33:56 nethcn-b1 pve-ha-lrm[13875]: ipcc_send_rec[2] failed: Connection refused
Dec  6 18:33:56 nethcn-b1 pve-ha-lrm[13875]: ipcc_send_rec[3] failed: Connection refused
Dec  6 18:33:56 nethcn-b1 watchdog-mux[3565]: client did not stop watchdog - disable watchdog updates
Dec  6 18:33:58 nethcn-b1 pve-ha-crm[10964]: ipcc_send_rec[1] failed: Transport endpoint is not connected

root@nethcn-b2:~# cat /var/log/syslog.1|egrep watchdog\|ipcc
Dec  6 17:46:40 nethcn-b2 pve-ha-crm[10842]: watchdog active
Dec  6 17:51:20 nethcn-b2 pve-ha-crm[10842]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:51:20 nethcn-b2 pve-ha-crm[10842]: ipcc_send_rec[2] failed: Connection refused
Dec  6 17:51:20 nethcn-b2 pve-ha-crm[10842]: ipcc_send_rec[3] failed: Connection refused
Dec  6 17:51:20 nethcn-b2 watchdog-mux[3397]: client did not stop watchdog - disable watchdog updates
Dec  6 17:51:21 nethcn-b2 pve-ha-lrm[13145]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:51:21 nethcn-b2 watchdog-mux[3397]: exit watchdog-mux with active connections
Dec  6 17:51:21 nethcn-b2 kernel: [88876.361477] watchdog: watchdog0: watchdog did not stop!
Dec  6 17:51:23 nethcn-b2 pvestatd[10618]: ipcc_send_rec[1] failed: Transport endpoint is not connected

root@nethcn-b3:~# cat /var/log/syslog.1|egrep watchdog\|ipcc
Dec  6 17:46:15 nethcn-b3 pveproxy[15923]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:46:15 nethcn-b3 pveproxy[15923]: ipcc_send_rec[2] failed: Connection refused
Dec  6 17:46:15 nethcn-b3 pveproxy[15923]: ipcc_send_rec[3] failed: Connection refused
Dec  6 17:46:19 nethcn-b3 pvestatd[10805]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:46:20 nethcn-b3 pve-ha-crm[10996]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:46:20 nethcn-b3 pve-ha-lrm[13497]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:46:30 nethcn-b3 pve-ha-lrm[13497]: watchdog closed (disabled)
Dec  6 17:46:40 nethcn-b3 pve-ha-crm[10996]: watchdog closed (disabled)
Dec  6 17:47:03 nethcn-b3 systemd[1]: Stopping Proxmox VE watchdog multiplexer...
Dec  6 17:47:03 nethcn-b3 watchdog-mux[3580]: got terminate request
Dec  6 17:47:03 nethcn-b3 watchdog-mux[3580]: clean exit
Dec  6 17:47:03 nethcn-b3 systemd[1]: Stopped Proxmox VE watchdog multiplexer.
Dec  6 17:47:03 nethcn-b3 systemd[1]: Started Proxmox VE watchdog multiplexer.
Dec  6 17:47:03 nethcn-b3 watchdog-mux[834]: Watchdog driver 'Software Watchdog', version 0
Dec  6 17:49:21 nethcn-b3 pve-ha-lrm[30589]: watchdog active

root@nethcn-b4:~# cat /var/log/syslog.1|egrep watchdog\|ipcc
Dec  6 17:37:08 nethcn-b4 pveproxy[12998]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:37:08 nethcn-b4 pveproxy[12998]: ipcc_send_rec[2] failed: Connection refused
Dec  6 17:37:08 nethcn-b4 pveproxy[12998]: ipcc_send_rec[3] failed: Connection refused
Dec  6 17:37:10 nethcn-b4 pve-ha-lrm[12950]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:37:11 nethcn-b4 pve-ha-crm[10654]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:37:13 nethcn-b4 pvestatd[10424]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:37:20 nethcn-b4 pve-ha-lrm[12950]: watchdog closed (disabled)
Dec  6 17:39:01 nethcn-b4 systemd[1]: Stopping Proxmox VE watchdog multiplexer...
Dec  6 17:39:01 nethcn-b4 watchdog-mux[3564]: got terminate request
Dec  6 17:39:01 nethcn-b4 watchdog-mux[3564]: clean exit
Dec  6 17:39:01 nethcn-b4 systemd[1]: Stopped Proxmox VE watchdog multiplexer.
Dec  6 17:39:01 nethcn-b4 systemd[1]: Started Proxmox VE watchdog multiplexer.
Dec  6 17:39:01 nethcn-b4 watchdog-mux[5395]: Watchdog driver 'Software Watchdog', version 0
Dec  6 17:44:51 nethcn-b4 pve-ha-lrm[31595]: watchdog active
Dec  6 17:53:26 nethcn-b4 pve-ha-crm[31896]: watchdog active

root@nethcn-b5:/var/log# cat /var/log/syslog.1|egrep watchdog\|ipcc
Dec  6 17:27:33 nethcn-b5 pve-ha-crm[11175]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:27:33 nethcn-b5 pve-ha-crm[11175]: ipcc_send_rec[2] failed: Connection refused
Dec  6 17:27:33 nethcn-b5 pve-ha-crm[11175]: ipcc_send_rec[3] failed: Connection refused
Dec  6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[1] failed: Connection refused
Dec  6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[1] failed: Connection refused
Dec  6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[2] failed: Connection refused
Dec  6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[2] failed: Connection refused
Dec  6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[3] failed: Connection refused
Dec  6 17:27:33 nethcn-b5 pve-ha-crm[14737]: ipcc_send_rec[3] failed: Connection refused
Dec  6 17:27:38 nethcn-b5 pve-ha-lrm[14351]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:27:39 nethcn-b5 pvestatd[10636]: ipcc_send_rec[1] failed: Transport endpoint is not connected
Dec  6 17:27:48 nethcn-b5 pve-ha-lrm[14351]: watchdog closed (disabled)
Dec  6 17:28:00 nethcn-b5 pve-ha-lrm[16775]: watchdog active
Dec  6 17:28:06 nethcn-b5 systemd[1]: Stopping Proxmox VE watchdog multiplexer...
Dec  6 17:28:06 nethcn-b5 watchdog-mux[3747]: got terminate request
Dec  6 17:28:06 nethcn-b5 watchdog-mux[3747]: exit watchdog-mux with active connections
Dec  6 17:28:06 nethcn-b5 systemd[1]: Stopped Proxmox VE watchdog multiplexer.
Dec  6 17:28:06 nethcn-b5 systemd[1]: Started Proxmox VE watchdog multiplexer.
Dec  6 17:28:06 nethcn-b5 kernel: [88725.955509] watchdog: watchdog0: watchdog did not stop!
Dec  6 17:28:06 nethcn-b5 watchdog-mux[18946]: watchdog active - unable to restart watchdog-mux
Dec  6 17:28:06 nethcn-b5 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Dec  6 17:28:06 nethcn-b5 systemd[1]: watchdog-mux.service: Unit entered failed state.
Dec  6 17:28:06 nethcn-b5 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'.
Dec  6 17:28:10 nethcn-b5 pve-ha-lrm[16775]: watchdog update failed - Broken pipe
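
In the excerpts above, the nodes that "crashed" (b1, b2, b5) all show
watchdog-mux either logging "client did not stop watchdog" or exiting "with
active connections", while the nodes that stayed online (b3, b4) show a "clean
exit". A minimal sketch for sorting a node's syslog into those two groups could
look like the following. The function name classify_watchdog_log is
hypothetical, and treating these messages as the sign of an unclean watchdog
shutdown is an assumption based only on the correlation in these five logs,
not on documented watchdog-mux behaviour.

```shell
#!/bin/sh
# Hypothetical helper: read one node's syslog on stdin and report how
# watchdog-mux went down, using only message strings seen in the logs above.
classify_watchdog_log() {
    awk '
        # Messages that appear only on the nodes that went down (b1, b2, b5).
        /watchdog-mux.*client did not stop watchdog/ { unclean = 1 }
        /watchdog-mux.*exit watchdog-mux with active connections/ { unclean = 1 }
        # Message that appears only on the nodes that stayed up (b3, b4).
        /watchdog-mux.*clean exit/ { clean = 1 }
        END {
            if (unclean)    print "unclean"
            else if (clean) print "clean"
            else            print "unknown"
        }
    '
}
```

It could be run on each node as, for example,
`classify_watchdog_log < /var/log/syslog.1`. An "unknown" result only means
that none of the three messages was found in that file.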

