BIG cluster questions

Eneko Lacunza elacunza at
Fri Jun 25 10:06:20 CEST 2021


We have now tested without bonding and see the same issues.

On 24/6/21 at 16:30, Eneko Lacunza wrote:
> Hi all,
> We're currently helping a customer to configure a virtualization 
> cluster with 88 servers for VDI.
> Right now we're testing the feasibility of building just one Proxmox 
> cluster of 88 nodes. A 4-node cluster has also been configured for 
> comparison (same servers and networking/racks).
> Each node has 2 NICs, 2x25Gbps each. Currently there are two LACP bonds 
> configured (one for each NIC); one for storage (NFS v4.2) and the 
> other for the rest (VMs, cluster).
> The cluster has two rings, one on each bond.
> - With clusters at rest (no significant number of VMs running), we see 
> quite different average corosync/knet latencies on our 88-node cluster 
> (~300-400) and our 4-node cluster (<100).
> For the 88-node cluster:
> - Creating some VMs (let's say 16), one every 30s, works well.
> - Destroying some VMs (let's say 16), one every 30s, outputs error 
> messages (storage cfs lock related) and fails to remove some of the VMs.
> - Rebooting 32 nodes, one every 30 seconds (boot for a node takes about 
> 120s) so that quorum is not lost, creates a cluster traffic "flood". 
> Some of the rebooted nodes don't rejoin the cluster, and the web UI shows 
> all nodes in cluster quorum with a grey ?, instead of a green OK. In this 
> situation corosync latency on some nodes can skyrocket to tens or hundreds 
> of times the values before the reboots. Access to pmxcfs is very slow and 
> we have been able to fix the issue only by rebooting all nodes.
> - We have tried changing the transport of knet in a ring from UDP to 
> SCTP as reported here:
> that gives better latencies for corosync, but the reboot issue continues.
> We don't know whether both issues are related or not.
> Could LACP bonds be the issue?
> "
> If your switch support the LACP (IEEE 802.3ad) protocol then we 
> recommend using the corresponding bonding mode (802.3ad). Otherwise 
> you should generally use the active-backup mode.
> If you intend to run your cluster network on the bonding interfaces, 
> then you have to use active-passive mode on the bonding interfaces, 
> other modes are unsupported.
> "
> As per the second paragraph, we understand that running cluster 
> networking over an LACP bond is not supported (just to confirm our 
> interpretation)? 
> We're in the process of reconfiguring nodes/switches to test without a 
> bond, to see if that gives us a stable cluster (will report on this). 
> Do you think this could be the issue?
> Now for more general questions: do you think an 88-node Proxmox VE 
> cluster is feasible?
> Those 88 nodes will host about 14,000 VMs. Will the HA manager be able to 
> manage them, or are they too many? (HA for those VMs doesn't seem to 
> be a requirement right now).
> Thanks a lot
> Eneko
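
As a rough sanity check on the rolling-reboot test quoted above, here is a minimal Python sketch of the quorum arithmetic. It assumes the default corosync votequorum behaviour (one vote per node, quorum at a strict majority) and that each rebooting node is out of the cluster for the whole ~120s boot time. The function names are hypothetical, for illustration only.

```python
import math


def quorum_votes(total_nodes: int) -> int:
    """Votes needed for quorum, assuming one vote per node and strict majority."""
    return total_nodes // 2 + 1


def max_nodes_down(boot_seconds: float, interval_seconds: float) -> int:
    """Approximate upper bound on nodes down at once in a rolling reboot.

    Assumes one node is rebooted every `interval_seconds` and each node
    stays out of the cluster for `boot_seconds`.
    """
    return math.ceil(boot_seconds / interval_seconds)


if __name__ == "__main__":
    total = 88                      # cluster size from the quoted message
    down = max_nodes_down(120, 30)  # ~120s boot, one reboot every 30s
    needed = quorum_votes(total)
    up = total - down
    print(f"nodes down at once: {down}")
    print(f"nodes still up:     {up}")
    print(f"votes needed:       {needed}")
    print(f"quorum kept:        {up >= needed}")
```

Under these assumptions only about 4 of the 88 nodes are down at any moment, far from the 45 votes needed, which is consistent with the quoted statement that the reboot schedule itself should not cost quorum.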


CTO | Zuzendari teknikoa

Binovo IT Human Project

	943 569 206

	elacunza at

	Astigarragako Bidea, 2 - 2 izda. Oficina 10-11, 20180 Oiartzun


More information about the pve-user mailing list