[PVE-User] periodic Node Crash/freeze

Ml Ml mliebherr99 at googlemail.com
Thu Aug 23 08:57:48 CEST 2018


I could use some hints/help, since one cluster has been letting me down since
29.07.2018.
That's when one of my three nodes started to freeze and stop.

In syslog the last entries are:

Aug 21 02:33:00 node10 systemd[1]: Starting Proxmox VE replication runner...
Aug 21 02:33:01 node10 systemd[1]: Started Proxmox VE replication runner.
Aug 21 02:33:01 node10 CRON[1870491]: (root) CMD (/usr/bin/puppet
agent -vt --color false --logdest /var/log/puppet/agent.log


Aug 22 16:11:12 node08 pmxcfs[5227]: [dcdb] notice: cpg_send_message
retried 1 times
Aug 22 16:11:12 node08 pmxcfs[5227]: [status] notice: members: 1/5227, 2/5058
Aug 22 16:11:12 node08 pmxcfs[5227]: [status] notice: starting data

I already posted it here:

It happened at:
29.07.2018 node09 / pve 4.4
07.08.2018 node08 / pve 4.4 (then I decided to upgrade)
21.08.2018 node10 / pve 5.2
22.08.2018 node08 / pve 5.2

...and I am getting nervous now, since there are 60 important VMs on it.
As you can see, it happened across multiple nodes with different PVE versions.
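
To check whether the freezes follow any pattern, the dates above can be tabulated. This is only a rough sketch: the list is copied from the dates in this mail, and the names `freezes`, `gaps_in_days` and `freezes_per_node` are made up for illustration.

```python
from collections import Counter
from datetime import date

# Freeze events as listed above: (date, node, PVE version at that time).
freezes = [
    (date(2018, 7, 29), "node09", "4.4"),
    (date(2018, 8, 7), "node08", "4.4"),
    (date(2018, 8, 21), "node10", "5.2"),
    (date(2018, 8, 22), "node08", "5.2"),
]


def gaps_in_days(events):
    """Return the number of days between consecutive events."""
    days = sorted(event[0] for event in events)
    return [(later - earlier).days for earlier, later in zip(days, days[1:])]


def freezes_per_node(events):
    """Count how often each node appears in the event list."""
    return Counter(event[1] for event in events)


print("gaps (days):", gaps_in_days(freezes))
print("per node:", dict(freezes_per_node(freezes)))
```

With only four events this proves nothing, but it makes it easy to add further freezes and see whether the intervals or the affected nodes cluster.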

Memtest is okay.

As far as I have googled, the "^@^@^@^@^@^" appears in syslog because the
file could not be fully written to disk?
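
"^@" is the caret notation for a NUL byte, so the exact point of each freeze can be located by searching the log for runs of NUL bytes and looking at the last line written before them. A minimal sketch, assuming the log fits in memory; `find_nul_runs` and `scan_file` are hypothetical helper names and the path passed to `scan_file` is whatever log file is being inspected.

```python
import re
from pathlib import Path

# One or more consecutive NUL bytes (shown as ^@ by many pagers and editors).
NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data: bytes):
    """Return (offset, length, last_line_before) for each run of NUL bytes."""
    results = []
    for match in NUL_RUN.finditer(data):
        start = match.start()
        # Text before the run, without the trailing newline of the last line.
        prefix = data[:start].rstrip(b"\n")
        last_line = prefix.rsplit(b"\n", 1)[-1].decode("utf-8", errors="replace")
        results.append((start, match.end() - start, last_line))
    return results


def scan_file(path):
    """Print every NUL run found in the given log file."""
    for offset, length, last_line in find_nul_runs(Path(path).read_bytes()):
        print(f"offset {offset}: {length} NUL bytes after: {last_line}")
```

Calling `scan_file` on a copy of the syslog from each affected node would show the last message logged before every freeze, which can then be compared across nodes.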

Maybe something triggers some totem/watchdog stuff which then ends in
a disaster?

My ideas from here:
- disable corosync/totem and see if the problems stop

Do you have any ideas that could help narrow my problem down?

My setup is a 3-node cluster (node08, node09, node10) with Ceph.

I have 4 other 3-node clusters running just fine.

Thanks a lot.

