[pve-devel] [PATCH docs 4/7] pvecm: extend cluster Requirements
Robin Christ
robin at rchrist.io
Thu May 8 13:54:37 CEST 2025
On 07.05.25 17:22, Kevin Schneider wrote:
> IMO this isn't strict enough and we should emphasize the importance
> of the problem. I would go for
>
> To ensure reliable Corosync redundancy, it's essential to use at least
> two separate physical and logical networks. Single bonded interfaces do
> not provide Corosync redundancy. When a bonded interface fails without
> redundancy, it can lead to asymmetric communication, causing all nodes
> to lose quorum—even if more than half of them can still communicate with
> each other.
That said, a bond on the interface together with MLAG'd switches CAN
provide further resiliency in case of switch or single NIC PHY failure.
It does not protect against total failure of the NIC, of course.
I think adding a "typical topologies" or "example topologies" section
to the docs might be a good idea?
Below is my personal, opinionated recommendation after deploying quite
a number of Proxmox clusters. Of course, I don't expect everyone to
agree with this... But hopefully it can serve as a starting point?
Typical topologies:
In most cases, a server for a Proxmox cluster will have at least two
physical NICs. One is usually a low or medium speed dual-port onboard
NIC (1GBase-T or 10GBase-T). The other one is typically a medium or high
speed add-in PCIe NIC (e.g. 10G SFP+, 40G QSFP+, 25G SFP28, 100G
QSFP28). There may be more NICs depending on the specific use case, e.g.
a separate NIC for Ceph Cluster (private, replication, back-side) traffic.
In such a setup, it is recommended to reserve the low or medium speed
onboard NICs for cluster traffic (and potentially management purposes).
These NICs should be connected using a switch.
Although a ring topology could be used to connect the nodes directly in
very small clusters (3 nodes) with a dual-port NIC, this is not
recommended, as it makes later expansion more troublesome.
It is recommended to use a physically separate switch just for the
cluster network. If your main switch is the only way for nodes to
communicate, failure of this switch will take out your entire cluster
with potentially catastrophic consequences.
For single-port onboard NICs there are no further design decisions to
make. However, onboard NICs are almost always dual port, which allows
some more freedom in the design of the cluster network.
Design of the dedicated cluster network:
a) Two separate cluster switches, switches support MLAG or Stacking /
Virtual Chassis
This is the ideal scenario: you deploy two managed switches in an MLAG
or Stacking / Virtual Chassis configuration. This requires the switches
to have a link between them, called an IPL ("Inter-Peer Link"). MLAG or
Stacking / Virtual Chassis makes the two switches behave as if they
were one, but if one switch fails, the remaining one keeps working and
takes over seamlessly!
Each cluster node is connected to both switches. Both NIC ports on each
node are bonded together (LACP recommended).
This topology provides a very good degree of resiliency.
The bond is configured as Ring0 for corosync.
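As a rough sketch, the bond on each node could look like this in
/etc/network/interfaces (the interface names eno1/eno2 and the
10.10.10.0/24 subnet are just placeholder assumptions):

# Onboard ports, enslaved to the cluster bond
iface eno1 inet manual
iface eno2 inet manual

# LACP bond dedicated to corosync traffic, connected to the MLAG pair
auto bond0
iface bond0 inet static
        address 10.10.10.11/24
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

A matching LACP (802.3ad) port channel has to be configured across the
two switches.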
b) Two separate cluster switches, switches DO NOT support MLAG or
Stacking / Virtual Chassis
In this scenario you deploy two separate switches (potentially
unmanaged). There should not be a link between the switches, as this can
easily lead to loops and makes the entire configuration more complex.
Each cluster node is connected to both switches, but the NIC ports are
not bonded together. Typically, both NIC ports will be in separate IP
subnets.
This topology provides a slightly lower degree of resiliency than the
MLAG setup.
One switch / broadcast domain is configured as Ring0 for corosync, and
the other one as Ring1.
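A rough sketch of the relevant corosync.conf pieces (node names, the
10.10.10.0/24 and 10.10.20.0/24 subnets and the cluster name are
made-up assumptions; on Proxmox the file lives at
/etc/pve/corosync.conf and should be edited with the usual care):

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    # first switch / broadcast domain
    ring0_addr: 10.10.10.11
    # second switch / broadcast domain
    ring1_addr: 10.10.20.11
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.12
    ring1_addr: 10.10.20.12
  }
}

totem {
  cluster_name: democluster
  config_version: 2
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  # remaining totem options omitted for brevity
}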
c) Single separate cluster switch
If you only want to deploy a single switch that is reserved for cluster
traffic, you can either use a single NIC port on each node, or bond
both ports together. It will not make much of a difference, as bonding
will only protect against single PHY / port failure.
The interface is configured as Ring0 for corosync.
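When creating the cluster, the dedicated interface can simply be
selected as link 0; a rough sketch (all addresses are placeholders,
check the pvecm man page for the exact syntax on your version):

# On the first node: create the cluster with the dedicated NIC as link 0
pvecm create democluster --link0 10.10.10.11

# On every further node: join via the first node and pass the local
# address on the dedicated network as link 0
pvecm add 10.10.10.11 --link0 10.10.10.12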
Usage of the other NICs for redundancy purposes:
It is recommended to add the other NICs / networks in the system as
backup links / additional rings for corosync. Bad connectivity over a
potentially congested storage network is still better than no
connectivity at all when the dedicated cluster network has failed and
there is no backup.
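A rough sketch of how such a backup link could be prioritized in the
totem section (the link numbers, the priority values and the assumption
that higher knet_link_priority values are preferred in the default
passive mode are mine; please double-check against corosync.conf(5) /
the pvecm docs for your version):

totem {
  # link 0: dedicated cluster network, preferred
  interface {
    linknumber: 0
    knet_link_priority: 20
  }
  # link 1: e.g. the (potentially congested) storage network, backup
  interface {
    linknumber: 1
    knet_link_priority: 10
  }
  # remaining totem options omitted for brevity
}

Each node additionally needs a matching ring1_addr entry in the
nodelist, of course.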