[pve-devel] [PATCH docs 4/7] pvecm: extend cluster Requirements
Robin Christ
robin at rchrist.io
Thu May 8 13:54:37 CEST 2025
On 07.05.25 17:22, Kevin Schneider wrote:
> IMO this isn't strict enough and we should emphasize the importance
> of the problem. I would go for
>
> To ensure reliable Corosync redundancy, it's essential to use at least
> two separate physical and logical networks. Single bonded interfaces do
> not provide Corosync redundancy. When a bonded interface fails without
> redundancy, it can lead to asymmetric communication, causing all nodes
> to lose quorum—even if more than half of them can still communicate with
> each other.
That said, a bond on the interface together with MLAG'd switches CAN
provide further resiliency in case of switch or single NIC PHY failure.
It does not protect against total failure of the NIC, of course.
I think adding a "typical topologies" or "example topologies" section
to the docs might be a good idea?
Below is my personal, opinionated recommendation after deploying quite
a number of Proxmox clusters. Of course, I don't expect everyone to
agree with this... But hopefully it can serve as a starting point?
Typical topologies:
In most cases, a server for a Proxmox cluster will have at least two
physical NICs. One is usually a low or medium speed dual-port onboard
NIC (1GBase-T or 10GBase-T). The other one is typically a medium or high
speed add-in PCIe NIC (e.g. 10G SFP+, 40G QSFP+, 25G SFP28, 100G
QSFP28). There may be more NICs depending on the specific use case, e.g.
a separate NIC for Ceph Cluster (private, replication, back-side) traffic.
In such a setup, it is recommended to reserve the low or medium speed
onboard NICs for cluster traffic (and potentially management purposes).
These NICs should be connected using a switch.
Although a ring topology could be used to connect the nodes directly in
very small clusters (3 nodes) with a dual-port NIC, this is not
recommended, as it makes later expansion more troublesome.
It is recommended to use a physically separate switch just for the
cluster network. If your main switch is the only way for nodes to
communicate, failure of this switch will take out your entire cluster
with potentially catastrophic consequences.
For single-port onboard NICs there are no further design decisions to
make. However, onboard NICs are almost always dual port, which allows
some more freedom in the design of the cluster network.
Design of the dedicated cluster network:
a) Two separate cluster switches, switches support MLAG or Stacking /
Virtual Chassis
This is the ideal scenario: you deploy two managed switches in an MLAG
or Stacking / Virtual Chassis configuration. This requires the switches
to have a link between them, called an IPL ("Inter-Peer Link"). MLAG or
Stacking / Virtual Chassis makes the two switches behave as if they
were one, but if one switch fails, the remaining one keeps working and
takes over seamlessly!
Each cluster node is connected to both switches. Both NIC ports on each
node are bonded together (LACP recommended).
This topology provides a very good degree of resiliency.
The bond is configured as Ring0 for corosync.
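As a rough sketch, the bond on each node could look like this in
/etc/network/interfaces (the interface names eno1/eno2 and the
10.10.10.0/24 subnet are just placeholder assumptions):

# Onboard ports, enslaved to the cluster bond
iface eno1 inet manual
iface eno2 inet manual

# LACP bond dedicated to corosync traffic, connected to the MLAG pair
auto bond0
iface bond0 inet static
        address 10.10.10.11/24
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

A matching LACP (802.3ad) port channel has to be configured across the
two switches.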
b) Two separate cluster switches, switches DO NOT support MLAG or
Stacking / Virtual Chassis
In this scenario you deploy two separate switches (potentially
unmanaged). There should not be a link between the switches, as this can
easily lead to loops and makes the entire configuration more complex.
Each cluster node is connected to both switches, but the NIC ports are
not bonded together. Typically, both NIC ports will be in separate IP
subnets.
This topology provides a slightly lower degree of resiliency than the
MLAG setup.
One switch / broadcast domain is configured as Ring0 for corosync, and
the other one as Ring1.
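A rough sketch of the relevant corosync.conf pieces (node names, the
10.10.10.0/24 and 10.10.20.0/24 subnets and the cluster name are
made-up assumptions; on Proxmox the file lives at
/etc/pve/corosync.conf and should be edited with the usual care):

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    # first switch / broadcast domain
    ring0_addr: 10.10.10.11
    # second switch / broadcast domain
    ring1_addr: 10.10.20.11
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.12
    ring1_addr: 10.10.20.12
  }
}

totem {
  cluster_name: democluster
  config_version: 2
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  # remaining totem options omitted for brevity
}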
c) Single separate cluster switch
If you only want to deploy a single switch that is reserved for cluster
traffic, you can either use a single NIC port on each node, or bond
both ports together. It will not make much of a difference, as bonding
will only protect against single PHY / port failure.
The interface is configured as Ring0 for corosync.
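When creating the cluster, the dedicated interface can simply be
selected as link 0; a rough sketch (all addresses are placeholders,
check the pvecm man page for the exact syntax on your version):

# On the first node: create the cluster with the dedicated NIC as link 0
pvecm create democluster --link0 10.10.10.11

# On every further node: join via the first node and pass the local
# address on the dedicated network as link 0
pvecm add 10.10.10.11 --link0 10.10.10.12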
Usage of the other NICs for redundancy purposes:
It is recommended to add the other NICs / networks in the system as
backup links / additional rings for corosync. Bad connectivity over a
potentially congested storage network is still better than no
connectivity at all when the dedicated cluster network has failed and
there is no backup.
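A rough sketch of how such a backup link could be prioritized in the
totem section (the link numbers, the priority values and the assumption
that higher knet_link_priority values are preferred in the default
passive mode are mine; please double-check against corosync.conf(5) /
the pvecm docs for your version):

totem {
  # link 0: dedicated cluster network, preferred
  interface {
    linknumber: 0
    knet_link_priority: 20
  }
  # link 1: e.g. the (potentially congested) storage network, backup
  interface {
    linknumber: 1
    knet_link_priority: 10
  }
  # remaining totem options omitted for brevity
}

Each node additionally needs a matching ring1_addr entry in the
nodelist, of course.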