[PVE-User] 5.4 Seeing Corosync TOTEM Retransmits

JR Richardson jmr.richardson at gmail.com
Tue Aug 18 15:48:19 CEST 2020


Hi Folks,

I have a 13-node cluster on v5.4.13 that has been running fine in
production, with separate networks for storage, data, heartbeat, and
management. All interfaces are Linux bonds to Cisco 3750G LACP port
channels, all Gigabit. Last evening I started maintenance to update to
the latest 5.4 release in preparation for upgrading to 6.2.

The first 3 nodes went OK, as expected, with no issues. When I started
to migrate some VMs back to a node I had just upgraded, the whole
cluster crashed and all 13 nodes rebooted. After recovering, the
network bonds on two nodes were blocking and several VMs were locked
(migration). I was able to recover the cluster and all VMs OK; overall
the system was pretty resilient and it didn't take too long to get
everything restored.

This morning I started diagnosing what had happened and found a common
log message on all nodes at roughly the same time as the all-node
reboot:
Aug 17 20:36:07 pbxpve01 corosync[2184]: notice  [TOTEM ] Retransmit List: 2d5 2d6 2d7 2d8 2e3 2e4 2e5 2e7 2e8 2ea 2eb 2ec 2ed 2ee 2f1 2f2
Aug 17 20:36:07 pbxpve01 corosync[2184]:  [TOTEM ] Retransmit List: 2d5 2d6 2d7 2d8 2e3 2e4 2e5 2e7 2e8 2ea 2eb 2ec 2ed 2ee 2f1 2f2
Aug 17 20:36:07 pbxpve01 corosync[2184]: notice  [TOTEM ] Retransmit List: 307 308 309 30a 30d 30e 30f 313 314 315 316 317 31b 31c 31d 31e
Aug 17 20:36:07 pbxpve01 corosync[2184]:  [TOTEM ] Retransmit List: 307 308 309 30a 30d 30e 30f 313 314 315 316 317 31b 31c 31d 31e
Aug 17 20:36:07 pbxpve01 corosync[2184]: notice  [TOTEM ] Retransmit List: 34a 353 35b 35c 35d 35e 35f 360 364 365 366 368 369 36a 36b 36c
.........
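
For anyone wanting to compare these entries across nodes, here is a
minimal Python sketch that tallies the Retransmit List messages above.
The function names (retransmit_entries, summarize) are made up for
illustration, and the parsing assumes the log format is exactly as
pasted here: lowercase hex IDs after "Retransmit List:", with each new
log line starting with a capitalized month name.

```python
import re

# Captures the run of lowercase hex IDs that follows "Retransmit List:".
# Assumption: the next log line starts with a capitalized month ("Aug"),
# which cannot match [0-9a-f], so the capture stops at the entry's end.
RETRANSMIT_RE = re.compile(r"\[TOTEM \] Retransmit List:((?: [0-9a-f]+\b)+)")


def retransmit_entries(log_text):
    """Return one list of integer sequence IDs per Retransmit List entry."""
    # Long lines may have been wrapped (as in a mail body), so fold every
    # line break into a single space before matching.
    joined = re.sub(r"\s*\n\s*", " ", log_text)
    return [
        [int(tok, 16) for tok in match.group(1).split()]
        for match in RETRANSMIT_RE.finditer(joined)
    ]


def summarize(log_text):
    """Count entries and distinct IDs, and report the ID range seen."""
    entries = retransmit_entries(log_text)
    ids = {seq for entry in entries for seq in entry}
    return {
        "entries": len(entries),
        "distinct_ids": len(ids),
        "lowest": min(ids) if ids else None,
        "highest": max(ids) if ids else None,
    }
```

Feeding it the text of a node's syslog (for example the contents of
/var/log/syslog, if that is where corosync logs on your nodes) gives a
quick count to compare between nodes. Because each message appears
twice in the log, the distinct ID count is more telling than the entry
count.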

I did some reading and I understand that some sort of latency was
introduced on the heartbeat network during the live migration. But
since my networks are separate, is the VM memory transfer between
nodes performed on the heartbeat network? Can I specify which network
to use for migration, such as storage (jumbo frames enabled) or
management, to relieve any congestion on the heartbeat network segment?

Another question is tuning: should I try to tune the corosync '<totem
netmtu="1480"/>' or '<totem window_size="170"/>' settings, or just
push through the upgrade to 6.2?

Any suggestions are welcome.

Thanks.

JR
-- 
JR Richardson
Engineering for the Masses
Chasing the Azeotrope



