[pve-devel] [PATCH kronosnet] cherry-pick pmtud fix

Fabian Grünbichler f.gruenbichler at proxmox.com
Thu Nov 10 16:28:33 CET 2022


as reported in https://forum.proxmox.com/threads/sudden-reboot-of-multiple-nodes-while-adding-a-new-node.116714/

this patch just fixes a particular issue where a node joins (as in
quorum membership change, not limited to PVE cluster join) an existing
cluster, but has a lower MTU than the existing links to the already
joined part of the cluster.

i.e.:

Node A: MTU 9000
Node B: MTU 9000
Node C: MTU 1500

A & B are already up and running and have established that they can talk
to eachother with MTU 9000 (-overhead). Now C joins as well - without
the reset and re-schedule of MTU discovery in this patch, A and B will
use MTU 9000 when talking to C, but those packets might never arrive
(depending on network hardware and configuration). Since the heartbeat
packets used to detect the link status are always small, they are able
to arrive at C without any problems. If the network along the way
doesn't reject the packets, but just drops them, the MTU discovery is
also severely delayed (up to tens of minutes until the actual, low MTU
is correctly detected!).

In the regular case, the reset will be immediately followed by detecting
the correct MTU for the new link (and depending on whether its lower
than the other links, the global MTU used for fragmenting by knet), and
the window with additional overhead (smaller MTU => more fragmentation
=> more packets) should be fairly small. In case of a network blackhole
negatively affecting MTU discovery, the window might be big, but without
this patch, the result is a complete outage of the whole cluster, which
is even less desirable than a cluster running with performance impacted.

Upstream is working on further improving similar failure scenarios, such as:
- improved handling of MTU being lowered at runtime (either at the link
  level, or somewhere along the network path)
- improving MTU discovery timeouts and intervals to speedup recovery
  even with blackholing networks

These other changes are still work in progress and will follow at a
later date.

This patch is cherry-picked from upstream branch stable1-proposed
(slated for inclusion in the next stable 1.x release of libknet).

Signed-off-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
---
We might evaluate setting netmtu to 1500-overhead in our cluster
creation code to avoid MTU related issues - the net benefit for setting
up high MTU for corosync traffic is likely neglible, and almost always
a side-effect of re-using network links also used as uplinks or storage
links.

netmtu is used by corosync to fragment its messages *before* passing
them to knet, avoiding the need to fragment at the knet layer. There is
also a (new, git-only at the moment) corosync.conf option for setting
the MTU used by knet, skipping the pMTU-discovered one entirely. we
could cherry-pick and set this option as well in case we want to default
to "non-jumbo MTU".

 ...eset-restart-pmtud-when-a-node-joins.patch | 156 ++++++++++++++++++
 debian/patches/series                         |   1 +
 2 files changed, 157 insertions(+)
 create mode 100644 debian/patches/0001-pmtud-Reset-restart-pmtud-when-a-node-joins.patch

diff --git a/debian/patches/0001-pmtud-Reset-restart-pmtud-when-a-node-joins.patch b/debian/patches/0001-pmtud-Reset-restart-pmtud-when-a-node-joins.patch
new file mode 100644
index 00000000..2a73ad6d
--- /dev/null
+++ b/debian/patches/0001-pmtud-Reset-restart-pmtud-when-a-node-joins.patch
@@ -0,0 +1,156 @@
+From 3b4c22ea18b96e21a055a9e44652d873d7a567a7 Mon Sep 17 00:00:00 2001
+From: Christine Caulfield <ccaulfie at redhat.com>
+Date: Mon, 31 Oct 2022 12:58:48 +0000
+Subject: [PATCH kronosnet] pmtud: Reset/restart pmtud when a node joins
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+If a new node joins with a 'black hole' that sets the effective
+MTU to be less than the hardware MTU then it will fail to join.
+
+So this patch restarts the pMTU processes each time a new node
+joins so it has a chance to declare it's real MTU in time.
+
+Signed-off-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
+---
+ libknet/crypto.c         |  2 +-
+ libknet/links.c          |  4 ++++
+ libknet/threads_common.c | 28 ++++++++++++++++++++--------
+ libknet/threads_common.h |  2 +-
+ libknet/threads_pmtud.c  |  2 +-
+ libknet/transport_udp.c  |  2 +-
+ 6 files changed, 28 insertions(+), 12 deletions(-)
+
+diff --git a/libknet/crypto.c b/libknet/crypto.c
+index 032a13e3..dc41aac8 100644
+--- a/libknet/crypto.c
++++ b/libknet/crypto.c
+@@ -144,7 +144,7 @@ static int crypto_use_config(
+ 		knet_h->sec_salt_size = 0;
+ 	}
+ 
+-	force_pmtud_run(knet_h, KNET_SUB_CRYPTO, 1);
++	force_pmtud_run(knet_h, KNET_SUB_CRYPTO, 1, 0);
+ 
+ 	return 0;
+ }
+diff --git a/libknet/links.c b/libknet/links.c
+index 5ee86896..185f62e5 100644
+--- a/libknet/links.c
++++ b/libknet/links.c
+@@ -46,6 +46,10 @@ int _link_updown(knet_handle_t knet_h, knet_node_id_t host_id, uint8_t link_id,
+ 
+ 	if (!connected) {
+ 		transport_link_is_down(knet_h, link);
++	} else {
++		/* Reset MTU in case new link can't use full line MTU */
++		log_info(knet_h, KNET_SUB_LINK, "Resetting MTU for link %u because host %u joined", link_id, host_id);
++		force_pmtud_run(knet_h, KNET_SUB_LINK, 1, 1);
+ 	}
+ 
+ 	if (lock_stats) {
+diff --git a/libknet/threads_common.c b/libknet/threads_common.c
+index bd847460..c7419861 100644
+--- a/libknet/threads_common.c
++++ b/libknet/threads_common.c
+@@ -37,13 +37,8 @@ int shutdown_in_progress(knet_handle_t knet_h)
+ 	return ret;
+ }
+ 
+-static int pmtud_reschedule(knet_handle_t knet_h)
++static int _pmtud_reschedule(knet_handle_t knet_h)
+ {
+-	if (pthread_mutex_lock(&knet_h->pmtud_mutex) != 0) {
+-		log_debug(knet_h, KNET_SUB_PMTUD, "Unable to get mutex lock");
+-		return -1;
+-	}
+-
+ 	if (knet_h->pmtud_running) {
+ 		knet_h->pmtud_abort = 1;
+ 
+@@ -51,9 +46,20 @@ static int pmtud_reschedule(knet_handle_t knet_h)
+ 			pthread_cond_signal(&knet_h->pmtud_cond);
+ 		}
+ 	}
++	return 0;
++}
+ 
++static int pmtud_reschedule(knet_handle_t knet_h)
++{
++	int res;
++
++	if (pthread_mutex_lock(&knet_h->pmtud_mutex) != 0) {
++		log_debug(knet_h, KNET_SUB_PMTUD, "Unable to get mutex lock");
++		return -1;
++	}
++	res = _pmtud_reschedule(knet_h);
+ 	pthread_mutex_unlock(&knet_h->pmtud_mutex);
+-	return 0;
++	return res;
+ }
+ 
+ int get_global_wrlock(knet_handle_t knet_h)
+@@ -219,7 +225,7 @@ int wait_all_threads_status(knet_handle_t knet_h, uint8_t status)
+ 	return 0;
+ }
+ 
+-void force_pmtud_run(knet_handle_t knet_h, uint8_t subsystem, uint8_t reset_mtu)
++void force_pmtud_run(knet_handle_t knet_h, uint8_t subsystem, uint8_t reset_mtu, uint8_t force_restart)
+ {
+ 	if (reset_mtu) {
+ 		log_debug(knet_h, subsystem, "PMTUd has been reset to default");
+@@ -243,6 +249,12 @@ void force_pmtud_run(knet_handle_t knet_h, uint8_t subsystem, uint8_t reset_mtu)
+ 				log_debug(knet_h, subsystem, "Notifying PMTUd to rerun");
+ 				knet_h->pmtud_forcerun = 1;
+ 			}
++		} else {
++			if (force_restart) {
++				if (_pmtud_reschedule(knet_h) < 0) {
++					log_info(knet_h, KNET_SUB_PMTUD, "Unable to notify PMTUd to reschedule. A joining node may struggle to connect properly");
++				}
++			}
+ 		}
+ 		pthread_mutex_unlock(&knet_h->pmtud_mutex);
+ 	}
+diff --git a/libknet/threads_common.h b/libknet/threads_common.h
+index 057723a8..0d2e9b7e 100644
+--- a/libknet/threads_common.h
++++ b/libknet/threads_common.h
+@@ -51,6 +51,6 @@ int set_thread_flush_queue(knet_handle_t knet_h, uint8_t thread_id, uint8_t stat
+ int wait_all_threads_flush_queue(knet_handle_t knet_h);
+ int set_thread_status(knet_handle_t knet_h, uint8_t thread_id, uint8_t status);
+ int wait_all_threads_status(knet_handle_t knet_h, uint8_t status);
+-void force_pmtud_run(knet_handle_t knet_h, uint8_t subsystem, uint8_t reset_mtu);
++void force_pmtud_run(knet_handle_t knet_h, uint8_t subsystem, uint8_t reset_mtu, uint8_t force_restart);
+ 
+ #endif
+diff --git a/libknet/threads_pmtud.c b/libknet/threads_pmtud.c
+index ab1e435e..3bf99b86 100644
+--- a/libknet/threads_pmtud.c
++++ b/libknet/threads_pmtud.c
+@@ -761,7 +761,7 @@ int knet_handle_pmtud_set(knet_handle_t knet_h,
+ 
+ 	knet_h->manual_mtu = iface_mtu;
+ 
+-	force_pmtud_run(knet_h, KNET_SUB_PMTUD, 0);
++	force_pmtud_run(knet_h, KNET_SUB_PMTUD, 0, 0);
+ 
+ 	pthread_rwlock_unlock(&knet_h->global_rwlock);
+ 
+diff --git a/libknet/transport_udp.c b/libknet/transport_udp.c
+index f0257997..2f158658 100644
+--- a/libknet/transport_udp.c
++++ b/libknet/transport_udp.c
+@@ -343,7 +343,7 @@ static int read_errs_from_sock(knet_handle_t knet_h, int sockfd)
+ 									pthread_mutex_unlock(&knet_h->kmtu_mutex);
+ 								}
+ 
+-								force_pmtud_run(knet_h, KNET_SUB_TRANSP_UDP, 0);
++								force_pmtud_run(knet_h, KNET_SUB_TRANSP_UDP, 0, 0);
+ 							}
+ 							/*
+ 							 * those errors are way too noisy
+-- 
+2.30.2
+
diff --git a/debian/patches/series b/debian/patches/series
index e69de29b..f14ee886 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -0,0 +1 @@
+0001-pmtud-Reset-restart-pmtud-when-a-node-joins.patch
-- 
2.30.2






More information about the pve-devel mailing list