[PVE-User] Cluster Host Node Rebooted and VMs Did not restart?
JR Richardson
jmr.richardson at gmail.com
Tue Jul 2 04:19:55 CEST 2019
Hey Folks,
Ran into an interesting problem today. The topology is a 4-node cluster on v5.2 that has been up and running without issues for 10 months or so. Several VMs run on nodes 1, 2 & 3, maybe 30% loaded. I keep node 4 empty for HA failover and new VM build testing. All VMs are set for HA and to auto-start on node reboot.
Node 1 rebooted, with no indication why and nothing in the logs identifying the cause. When node 1 came back up, its clock was 1 hour ahead, none of the VMs would start, and HA did not start the VMs on the empty node 4.
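Before the syslog excerpt, for reference, these are the sort of checks I'm running on the affected node to confirm cluster and HA health (standard pvecm/ha-manager commands; the lines below are just the commands, not output from the incident):
# confirm quorum and membership as seen from node 1
pvecm status
# show the HA resources and their requested state
ha-manager config
# show what the HA stack currently thinks each resource is doing
ha-manager status
# make sure the HA services themselves are running on the node
systemctl status pve-ha-crm pve-ha-lrm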
Syslog:
Jul 1 14:42:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
Jul 1 14:42:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
Jul 1 14:42:20 vmpve01 pvedaemon[18577]: <root at pam> successful auth for user 'NOC at pve'
Jul 1 14:43:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
Jul 1 14:43:00 vmpve01 pvedaemon[18577]: <root at pam> successful auth for user 'NOC at pve'
Jul 1 14:43:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
Jul 1 14:43:40 vmpve01 pvedaemon[32611]: <root at pam> successful auth for user 'NOC at pve'
Jul 1 14:43:50 vmpve01 pveproxy[9369]: worker exit
Jul 1 14:43:50 vmpve01 pveproxy[2395]: worker 9369 finished
Jul 1 14:43:50 vmpve01 pveproxy[2395]: starting 1 worker(s)
Jul 1 14:43:50 vmpve01 pveproxy[2395]: worker 21373 started
Jul 1 14:44:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
Jul 1 14:44:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
Jul 1 14:44:03 vmpve01 systemd[1]: Time has been changed
Jul 1 14:44:03 vmpve01 systemd[1]: apt-daily-upgrade.timer: Adding 51min 27.103008s random time.
Jul 1 14:44:03 vmpve01 systemd[1]: pve-daily-update.timer: Adding 49min 46.847880s random time.
Jul 1 14:44:03 vmpve01 systemd[1]: apt-daily.timer: Adding 8h 36min 11.274821s random time.
Jul 1 14:44:13 vmpve01 pvedaemon[14464]: <root at pam> successful auth for user 'NOC at pve'
Jul 1 14:44:54 vmpve01 pvedaemon[32611]: <root at pam> successful auth for user 'NOC at pve'
Jul 1 14:45:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
Jul 1 14:45:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
****NODE REBOOTED****
Jul 1 15:45:38 vmpve01 systemd[1]: Started Load/Save Random Seed.
Jul 1 15:45:38 vmpve01 systemd[1]: Started Flush Journal to Persistent Storage.
Jul 1 15:45:38 vmpve01 systemd-modules-load[408]: Inserted module 'iscsi_tcp'
Jul 1 15:45:38 vmpve01 systemd[1]: Mounted RPC Pipe File System.
Jul 1 15:45:38 vmpve01 systemd[1]: Started udev Coldplug all Devices.
......
*****All normal cluster startup logging, then I start getting these recurring errors*****
Jul 1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local-vmstor01: -1
Jul 1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local: -1
Jul 1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local-lvm: -1
Jul 1 15:03:23 vmpve01 rrdcached[2114]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-node/vmpve01) failed with status -1. (/var/lib/rrdcached/db/pve2-node/vmpve01: illegal attempt to update using time 1562010251 when last update time is 1562011399 (minimum one second step))
Jul 1 15:03:23 vmpve01 systemd[1]: Starting Cleanup of Temporary Directories...
Jul 1 15:03:23 vmpve01 systemd[1]: Started Cleanup of Temporary Directories.
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/vmpve01: -1
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/105: -1
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/107: -1
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local-lvm: -1
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local: -1
Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local-vmstor01: -1
Jul 1 15:03:37 vmpve01 pvedaemon[2366]: <root at pam> starting task UPID:vmpve01:00001600:00016749:5D1A6719:hastart:107:root at pam:
Jul 1 15:03:37 vmpve01 pmxcfs[2185]: [main] notice: ignore duplicate
Jul 1 15:03:38 vmpve01 pvedaemon[2366]: <root at pam> end task UPID:vmpve01:00001600:00016749:5D1A6719:hastart:107:root at pam: OK
Jul 1 15:03:38 vmpve01 pmxcfs[2185]: [main] notice: ignore duplicate
Jul 1 15:03:39 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/vmpve01: -1
After the system time caught up with the actual current time, the RRDC errors went away. Unless I missed it, there was no log indication of any attempt to restart the VMs. I manually tried to start the VMs, but none on node 1 would start up. After the system time synchronized with the hardware clock, I rebooted the node again and could then start all the VMs on node 1. After that manual reboot I waited about 10 minutes with the time in sync, but the cluster still did not auto-start the VMs, and HA did not start any of node 1's VMs on node 4.
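My plan for digging further on the HA side is roughly this (standard PVE 5.x service names; the manager_status path is from memory, so apologies if it is slightly off):
# HA manager logs from the previous boot, i.e. the one that came up an hour ahead
journalctl -b -1 -u pve-ha-lrm -u pve-ha-crm
# current HA view: master node and the state of each resource
ha-manager status
# raw HA manager state kept on the cluster filesystem
cat /etc/pve/ha/manager_status
# double-check the node clock and NTP sync after the correction
timedatectl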
Cluster is normal at this time:
root at vmpve01:/var/log# pvecm status
Quorum information
------------------
Date: Mon Jul 1 21:12:29 2019
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000001
Ring ID: 1/80
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.99.96.191 (local)
0x00000002 1 10.99.96.192
0x00000003 1 10.99.96.193
0x00000004 1 10.99.96.194
I was just about to roll into Proxmox updates this week, too. Any suggestions on diagnosing the root cause, not of the node reboot itself, but of why the VMs would not HA-start when the node came up an hour in the future?
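Separately, to keep a node from coming up an hour ahead again, I'm planning something along these lines on each node (this assumes the nodes are using systemd-timesyncd, which I still need to confirm):
# check NTP sync status and whether the RTC is set to local time or UTC
timedatectl
# make sure NTP synchronization is enabled
timedatectl set-ntp true
# once system time is correct, write it back to the hardware clock
hwclock --systohc
# confirm the time sync service is running cleanly
systemctl status systemd-timesyncd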
Thanks.
JR
JR Richardson
Engineering for the Masses
Chasing the Azeotrope