[pve-devel] applied: [PATCH kronosnet] cherry-pick pmtud fix

Wed Nov 16 09:37:18 CET 2022

On 10/11/2022 at 16:28, Fabian Grünbichler wrote:
> as reported in https://forum.proxmox.com/threads/sudden-reboot-of-multiple-nodes-while-adding-a-new-node.116714/
> 
> this patch just fixes a particular issue where a node joins (as in
> quorum membership change, not limited to PVE cluster join) an existing
> cluster, but has a lower MTU than the existing links to the already
> joined part of the cluster.
> 
> i.e.:
> 
> Node A: MTU 9000
> Node B: MTU 9000
> Node C: MTU 1500
> 
> A & B are already up and running and have established that they can talk
> to each other with MTU 9000 (minus overhead). Now C joins as well - without
> the reset and re-schedule of MTU discovery in this patch, A and B will
> use MTU 9000 when talking to C, but those packets might never arrive
> (depending on network hardware and configuration). Since the heartbeat
> packets used to detect the link status are always small, they are able
> to arrive at C without any problems. If the network along the way
> doesn't reject the packets, but just drops them, the MTU discovery is
> also severely delayed (up to tens of minutes until the actual, low MTU
> is correctly detected!).
> 
> In the regular case, the reset will be immediately followed by detecting
> the correct MTU for the new link (and, depending on whether it's lower
> than the other links, the global MTU used for fragmenting by knet), and
> the window with additional overhead (smaller MTU => more fragmentation
> => more packets) should be fairly small. In case of a network blackhole
> negatively affecting MTU discovery, the window might be big, but without
> this patch, the result is a complete outage of the whole cluster, which
> is far less desirable than a cluster running with degraded performance.
> 
> Upstream is working on further improving similar failure scenarios, such as:
> - improved handling of MTU being lowered at runtime (either at the link
>   level, or somewhere along the network path)
> - improving MTU discovery timeouts and intervals to speed up recovery
>   even with blackholing networks
> 
> These other changes are still work in progress and will follow at a
> later date.
> 
> This patch is cherry-picked from upstream branch stable1-proposed
> (slated for inclusion in the next stable 1.x release of libknet).
> 
> Signed-off-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
> ---
> We might evaluate setting netmtu to 1500-overhead in our cluster
> creation code to avoid MTU related issues - the net benefit for setting
> up high MTU for corosync traffic is likely negligible, and almost always
> a side-effect of re-using network links also used as uplinks or storage
> links.
> 
> netmtu is used by corosync to fragment its messages *before* passing
> them to knet, avoiding the need to fragment at the knet layer. There is
> also a (new, git-only at the moment) corosync.conf option for setting
> the MTU used by knet, skipping the pMTU-discovered one entirely. We
> could cherry-pick and set this option as well in case we want to default
> to "non-jumbo MTU".
> 
>  ...eset-restart-pmtud-when-a-node-joins.patch | 156 ++++++++++++++++++
>  debian/patches/series                         |   1 +
>  2 files changed, 157 insertions(+)
>  create mode 100644 debian/patches/0001-pmtud-Reset-restart-pmtud-when-a-node-joins.patch
> 
>

applied, thanks!
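
For readers following along, the A/B/C scenario from the commit message can be
sketched as a toy model. This is not knet code: the OVERHEAD and HEARTBEAT_SIZE
values and all function names are assumptions made up for illustration, and
real path MTU discovery probes the network instead of reading a table.

```python
# Toy model of the scenario in the commit message, not knet's implementation.
OVERHEAD = 100        # hypothetical per-packet overhead in bytes
HEARTBEAT_SIZE = 64   # hypothetical size of a small heartbeat packet


def global_data_mtu(link_mtus):
    """Lowest usable payload size across all links (link MTU minus overhead)."""
    return min(mtu - OVERHEAD for mtu in link_mtus.values())


def fits(payload_size, link_mtu):
    """True if a payload plus overhead fits into the given link MTU."""
    return payload_size + OVERHEAD <= link_mtu


# A and B are up and have settled on a jumbo-frame MTU.
links_before = {"A": 9000, "B": 9000}
stale_mtu = global_data_mtu(links_before)

# C joins with a standard MTU.
links_after = {**links_before, "C": 1500}

# Without a reset, the stale value keeps being used for traffic towards C:
# large data packets do not fit, while small heartbeats still get through,
# so the link looks healthy even though data is lost.
data_reaches_c_without_reset = fits(stale_mtu, links_after["C"])
heartbeat_reaches_c = fits(HEARTBEAT_SIZE, links_after["C"])

# With a reset on join, the value is recomputed including the new link.
fresh_mtu = global_data_mtu(links_after)
data_reaches_c_with_reset = fits(fresh_mtu, links_after["C"])
```

In this model the stale value stays at the jumbo size after C joins, which is
the window the patch closes by resetting and re-scheduling discovery on a
membership change.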