[PVE-User] periodic Node Crash/freeze

Thu Aug 23 08:57:48 CEST 2018

Hello,

I could use some hints/help, since one cluster has been letting me down
since 29.07.2018.
That's when one of my three nodes first froze and stopped.

In syslog the last entries are:

Aug 21 02:33:00 node10 systemd[1]: Starting Proxmox VE replication runner...
Aug 21 02:33:01 node10 systemd[1]: Started Proxmox VE replication runner.
Aug 21 02:33:01 node10 CRON[1870491]: (root) CMD (/usr/bin/puppet agent -vt --color false --logdest /var/log/puppet/agent.log 1>/dev/null)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^

  or:

Aug 22 16:11:12 node08 pmxcfs[5227]: [dcdb] notice: cpg_send_message retried 1 times
Aug 22 16:11:12 node08 pmxcfs[5227]: [status] notice: members: 1/5227, 2/5058
Aug 22 16:11:12 node08 pmxcfs[5227]: [status] notice: starting data syncronisation
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

I have already posted this on the forum:
  https://forum.proxmox.com/threads/periodic-node-crash-freeze.46407/

It happened on:
29.07.2018 node09 / pve 4.4
07.08.2018 node08 / pve 4.4 (then I decided to upgrade)
21.08.2018 node10 / pve 5.2
22.08.2018 node08 / pve 5.2

...and I am getting nervous now, since there are 60 important VMs on it.
As you can see, it happened across multiple nodes with different PVE versions.

Memtest is okay.

As far as I could find by googling, the "^@^@^@^@^@^" appears in syslog
because the file could not be fully written to disk?
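
To find every place where such a NUL run occurs (and the last entry
logged before it), a small script along these lines might help. It is
only a sketch: the function names, the threshold of 8 NUL bytes and the
default path /var/log/syslog are assumptions, not anything Proxmox
ships.

```python
import re

# Runs of 8 or more NUL bytes. The threshold is an arbitrary assumption
# chosen to ignore stray single bytes.
NUL_RUN = re.compile(rb"\x00{8,}")


def find_nul_runs(data: bytes):
    """Return (offset, length, last_line_before_run) for each NUL run."""
    results = []
    for match in NUL_RUN.finditer(data):
        start = match.start()
        # Drop the newline that ends the last entry before the run,
        # then take everything after the previous newline.
        head = data[:start].rstrip(b"\r\n")
        last_line = head.rsplit(b"\n", 1)[-1]
        results.append(
            (start, match.end() - start, last_line.decode("utf-8", "replace"))
        )
    return results


def scan_file(path="/var/log/syslog"):
    """Print every NUL run found in the given log file (path is an assumption)."""
    with open(path, "rb") as fh:
        for offset, length, line in find_nul_runs(fh.read()):
            print(f"{length} NUL bytes at offset {offset}, preceded by: {line}")
```

Running scan_file() on each node's current and rotated (uncompressed)
syslog files would list the last message written before each NUL run,
which makes it easier to compare what each node was doing at that point.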

Maybe something triggers some totem/watchdog stuff which then ends in
a disaster?

My ideas so far:
- disable corosync/totem and see if the problems stop

Do you have any ideas that could help me narrow the problem down?

My setup is a 3-node cluster (node08, node09, node10) with Ceph.

I have 4 other 3-node clusters running just fine.

Thanks a lot.

Mario