[PVE-User] Cluster Host Node Rebooted and VMs Did not restart?

JR Richardson jmr.richardson at gmail.com
Thu Jul 4 21:37:36 CEST 2019


Sorry for posting twice; I wasn't sure I used the correct mailing list.
> 
> Hey Folks,
>  
> Ran into an interesting problem today. Topology is a 4-node cluster on v5.2 that has been up and running without issues for 10 months or so. Several VMs run on nodes 1, 2 and 3, maybe 30% loaded. I keep node 4 empty for HA and for new-build VM testing. All VMs are set for HA and to auto-start on node reboot.
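(For context, by "set for HA" I mean each guest has an entry in /etc/pve/ha/resources.cfg along these lines; the VM ID and group name here are placeholders, not my actual config:)

```
vm: 101
	state started
	group default
```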
>  
> Node 1 rebooted, with no indication why; nothing in the logs identifies the cause. When node 1 came back up, its clock was 1 hour ahead, none of the VMs would start, and HA did not start the VMs on the empty node 4.
>  
> Syslog:
> Jul  1 14:42:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul  1 14:42:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> Jul  1 14:42:20 vmpve01 pvedaemon[18577]: <root at pam> successful auth for user 'NOC at pve'
> Jul  1 14:43:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul  1 14:43:00 vmpve01 pvedaemon[18577]: <root at pam> successful auth for user 'NOC at pve'
> Jul  1 14:43:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> Jul  1 14:43:40 vmpve01 pvedaemon[32611]: <root at pam> successful auth for user 'NOC at pve'
> Jul  1 14:43:50 vmpve01 pveproxy[9369]: worker exit
> Jul  1 14:43:50 vmpve01 pveproxy[2395]: worker 9369 finished
> Jul  1 14:43:50 vmpve01 pveproxy[2395]: starting 1 worker(s)
> Jul  1 14:43:50 vmpve01 pveproxy[2395]: worker 21373 started
> Jul  1 14:44:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul  1 14:44:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> Jul  1 14:44:03 vmpve01 systemd[1]: Time has been changed
> Jul  1 14:44:03 vmpve01 systemd[1]: apt-daily-upgrade.timer: Adding 51min 27.103008s random time.
> Jul  1 14:44:03 vmpve01 systemd[1]: pve-daily-update.timer: Adding 49min 46.847880s random time.
> Jul  1 14:44:03 vmpve01 systemd[1]: apt-daily.timer: Adding 8h 36min 11.274821s random time.
> Jul  1 14:44:13 vmpve01 pvedaemon[14464]: <root at pam> successful auth for user 'NOC at pve'
> Jul  1 14:44:54 vmpve01 pvedaemon[32611]: <root at pam> successful auth for user 'NOC at pve'
> Jul  1 14:45:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul  1 14:45:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> ****NODE REBOOTED****
> Jul  1 15:45:38 vmpve01 systemd[1]: Started Load/Save Random Seed.
> Jul  1 15:45:38 vmpve01 systemd[1]: Started Flush Journal to Persistent Storage.
> Jul  1 15:45:38 vmpve01 systemd-modules-load[408]: Inserted module 'iscsi_tcp'
> Jul  1 15:45:38 vmpve01 systemd[1]: Mounted RPC Pipe File System.
> Jul  1 15:45:38 vmpve01 systemd[1]: Started udev Coldplug all Devices.
> …………..
> *****All normal log entries as the cluster starts up, then I start getting these recurring errors*****
> Jul  1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local-vmstor01: -1
> Jul  1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local: -1
> Jul  1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local-lvm: -1
> Jul  1 15:03:23 vmpve01 rrdcached[2114]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-node/vmpve01) failed with status -1. (/var/lib/rrdcached/db/pve2-node/vmpve01: illegal attempt to update using time 1562010251 when last update time is 1562011399 (minimum one second step))
> Jul  1 15:03:23 vmpve01 systemd[1]: Starting Cleanup of Temporary Directories...
> Jul  1 15:03:23 vmpve01 systemd[1]: Started Cleanup of Temporary Directories.
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/vmpve01: -1
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/105: -1
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/107: -1
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local-lvm: -1
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local: -1
> Jul  1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/vmpve01/local-vmstor01: -1
> Jul  1 15:03:37 vmpve01 pvedaemon[2366]: <root at pam> starting task UPID:vmpve01:00001600:00016749:5D1A6719:hastart:107:root at pam:
> Jul  1 15:03:37 vmpve01 pmxcfs[2185]: [main] notice: ignore duplicate
> Jul  1 15:03:38 vmpve01 pvedaemon[2366]: <root at pam> end task UPID:vmpve01:00001600:00016749:5D1A6719:hastart:107:root at pam: OK
> Jul  1 15:03:38 vmpve01 pmxcfs[2185]: [main] notice: ignore duplicate
> Jul  1 15:03:39 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/vmpve01: -1
>  
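For what it's worth, decoding the two epoch timestamps in the rrdcached error shows the rejected update is about 19 minutes behind RRD's last recorded update, which fits NTP stepping the clock backwards after boot. A quick sketch (assumes GNU date):

```shell
# Timestamps copied from the rrdcached error above.
last=1562011399     # RRD's last recorded update time
attempt=1562010251  # timestamp of the rejected update

date -u -d @"$last" '+%F %T'     # 2019-07-01 20:03:19
date -u -d @"$attempt" '+%F %T'  # 2019-07-01 19:44:11
echo "RRD is $(( last - attempt ))s ahead of the rejected update"  # 1148s
```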
> After the system time caught up with the actual current time, the RRDC errors went away. Unless I missed it, there was no log indication of any attempt to restart the VMs. I manually tried to start the VMs, but none on node 1 would start. After the system time synchronized with the hardware clock, I rebooted the node again and could then start all the VMs on node 1. After the manual reboot I waited about 10 minutes; time was in sync OK, but the cluster still did not auto-start the VMs, and HA did not start any of node 1's VMs on node 4.
>  
> Cluster is normal at this time:
>  
> root at vmpve01:/var/log# pvecm status
> Quorum information
> ------------------
> Date:             Mon Jul  1 21:12:29 2019
> Quorum provider:  corosync_votequorum
> Nodes:            4
> Node ID:          0x00000001
> Ring ID:          1/80
> Quorate:          Yes
>  
> Votequorum information
> ----------------------
> Expected votes:   4
> Highest expected: 4
> Total votes:      4
> Quorum:           3 
> Flags:            Quorate
>  
> Membership information
> ----------------------
>     Nodeid      Votes Name
> 0x00000001          1 10.99.96.191 (local)
> 0x00000002          1 10.99.96.192
> 0x00000003          1 10.99.96.193
> 0x00000004          1 10.99.96.194
>  
> I was just about to roll into Proxmox updates this week, too. Any suggestions on diagnosing the root cause, not of the node reboot, but of why the VMs would not HA-start when the node came up an hour in the future?
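In case it helps anyone reproduce this, here is roughly what I'd gather from each node next time; the commands are stock PVE CLI, the loop is just scaffolding that tolerates a missing command:

```shell
# Collect HA and quorum state from a PVE node.
for cmd in 'ha-manager status' 'ha-manager config' 'pvecm status'; do
    echo "== $cmd =="
    $cmd 2>&1 || echo "(command unavailable on this host)"
done
```

The pve-ha-lrm and pve-ha-crm journals around the reboot window (`journalctl -u pve-ha-lrm -u pve-ha-crm`) would presumably also show whether the LRM ever attempted to start the services.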
>  
> Thanks.
> 
> JR
>  
> JR Richardson
> Engineering for the Masses
> Chasing the Azeotrope
>  


