[pve-devel] [RFC kernel] cherry-pick scheduler fix to avoid temporary VM freezes on NUMA hosts

Wed Jan 17 15:45:21 CET 2024

Users have been reporting [1] that VMs occasionally become
unresponsive with high CPU usage for some time (varying between ~1 and
more than 60 seconds). After that time, the guests come back and
continue running. Windows VMs seem most affected (not responding to
pings during the hang, RDP sessions time out), but we also got reports
about Linux VMs (reporting soft lockups). The issue was not present on
host kernel 5.15 and was first reported with kernel 6.2. Users
reported that the issue becomes easier to trigger the more memory is
assigned to the guests. Setting mitigations=off was reported to
alleviate (but not eliminate) the issue. For most users the issue
seems to disappear after (also) disabling KSM [2], but some users
reported freezes even with KSM disabled [3].

It turned out the reports concerned NUMA hosts only, and that the
freezes correlated with runs of the NUMA balancer [4]. Users reported
that disabling the NUMA balancer resolves the issue (even with KSM
enabled).

We put together a Linux VM reproducer, ran a git-bisect on the kernel
to find the commit introducing the issue and asked upstream for help
[5]. As it turned out, an upstream bugreport was recently opened [6]
and a preliminary fix to the KVM TDP MMU was proposed [7]. With that
patch [7] on top of kernel 6.7, the reproducer does not trigger
freezes anymore. As of now, the patch (or its v2 [8]) is not yet
merged in the mainline kernel, and backporting it may be difficult due
to dependencies on other KVM changes [9].

However, the bugreport [6] also prompted an upstream developer to
propose a patch to the kernel scheduler logic that decides whether a
contended spinlock/rwlock should be dropped [10]. Without the patch,
PREEMPT_DYNAMIC kernels (such as ours) would always drop contended
locks. With the patch, the kernel only drops contended locks if the
kernel is currently set to preempt=full. As noted in the commit
message [10], this can (counter-intuitively) improve KVM performance.
Our kernel defaults to preempt=voluntary (according to
/sys/kernel/debug/sched/preempt), so with the patch it does not drop
contended locks anymore, and the reproducer does not trigger freezes
anymore. Hence, backport [10] to our kernel.

[1] https://forum.proxmox.com/threads/130727/
[2] https://forum.proxmox.com/threads/130727/page-4#post-575886
[3] https://forum.proxmox.com/threads/130727/page-8#post-617587
[4] https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing
[5] https://lore.kernel.org/kvm/832697b9-3652-422d-a019-8c0574a188ac@proxmox.com/
[6] https://bugzilla.kernel.org/show_bug.cgi?id=218259
[7] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com/
[8] https://lore.kernel.org/all/20240110012045.505046-1-seanjc@google.com/
[9] https://lore.kernel.org/kvm/Zaa654hwFKba_7pf@google.com/
[10] https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com/

Signed-off-by: Friedrich Weber <f.weber at proxmox.com>
---

Notes:
    This RFC is not meant to be applied immediately, but is intended to
    sum up the current state of the issue and point out potential fixes.
    
    The patch [10] backported in this RFC hasn't been reviewed upstream
    yet. And while it fixes the reproducer, it is not certain that it will
    fix freezes seen by users on real-world workloads. Hence, it would be
    desirable to also apply some variant of [7] [8] once it is applied
    upstream, however there may be difficulties backporting it, as noted
    above.
    
    So, in any case, for now it might sense to monitor how upstream
    handles the situation, and then react accordingly. I'll continue to
    participate upstream and send a v2 in due time.

 ...spinlocks-on-contention-iff-kernel-i.patch | 78 +++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch

diff --git a/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch b/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch
new file mode 100644
index 0000000..932e2f2
--- /dev/null
+++ b/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch
@@ -0,0 +1,78 @@
+From 39f2bfe0177d3f56c9feac4e70424e4952949e2a Mon Sep 17 00:00:00 2001
+From: Sean Christopherson <seanjc at google.com>
+Date: Wed, 10 Jan 2024 13:47:23 -0800
+Subject: [PATCH] sched/core: Drop spinlocks on contention iff kernel is
+ preemptible
+
+Use preempt_model_preemptible() to detect a preemptible kernel when
+deciding whether or not to reschedule in order to drop a contended
+spinlock or rwlock.  Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
+built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
+preemption model is "none" or "voluntary".  In short, make kernels with
+dynamically selected models behave the same as kernels with statically
+selected models.
+
+Somewhat counter-intuitively, NOT yielding a lock can provide better
+latency for the relevant tasks/processes.  E.g. KVM x86's mmu_lock, a
+rwlock, is often contended between an invalidation event (takes mmu_lock
+for write) and a vCPU servicing a guest page fault (takes mmu_lock for
+read).  For _some_ setups, letting the invalidation task complete even
+if there is mmu_lock contention provides lower latency for *all* tasks,
+i.e. the invalidation completes sooner *and* the vCPU services the guest
+page fault sooner.
+
+But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
+can vary depending on the host VMM, the guest workload, the number of
+vCPUs, the number of pCPUs in the host, why there is lock contention, etc.
+
+In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
+opposite and removing contention yielding entirely) needs to come with a
+big pile of data proving that changing the status quo is a net positive.
+
+Cc: Valentin Schneider <valentin.schneider at arm.com>
+Cc: Peter Zijlstra (Intel) <peterz at infradead.org>
+Cc: Marco Elver <elver at google.com>
+Cc: Frederic Weisbecker <frederic at kernel.org>
+Cc: David Matlack <dmatlack at google.com>
+Signed-off-by: Sean Christopherson <seanjc at google.com>
+---
+ include/linux/sched.h | 14 ++++++--------
+ 1 file changed, 6 insertions(+), 8 deletions(-)
+
+diff --git a/include/linux/sched.h b/include/linux/sched.h
+index 292c31697248..a274bc85f222 100644
+--- a/include/linux/sched.h
++++ b/include/linux/sched.h
+@@ -2234,11 +2234,10 @@ static inline bool preempt_model_preemptible(void)
+  */
+ static inline int spin_needbreak(spinlock_t *lock)
+ {
+-#ifdef CONFIG_PREEMPTION
++	if (!preempt_model_preemptible())
++		return 0;
++
+ 	return spin_is_contended(lock);
+-#else
+-	return 0;
+-#endif
+ }
+ 
+ /*
+@@ -2251,11 +2250,10 @@ static inline int spin_needbreak(spinlock_t *lock)
+  */
+ static inline int rwlock_needbreak(rwlock_t *lock)
+ {
+-#ifdef CONFIG_PREEMPTION
++	if (!preempt_model_preemptible())
++		return 0;
++
+ 	return rwlock_is_contended(lock);
+-#else
+-	return 0;
+-#endif
+ }
+ 
+ static __always_inline bool need_resched(void)
+-- 
+2.39.2
+
-- 
2.39.2