[pve-devel] [PATCH pve-kernel 2/2] cherry-pick fix for uefi guests hanging upon guest-initialized reboot

Stoiko Ivanov s.ivanov at proxmox.com
Thu Aug 24 16:30:21 CEST 2023


This was identified as a potential fix for an issue we analyzed in our
Enterprise support, where guests would hang before the boot-loader
after being rebooted from within the guest (after applying updates for
RHEL 8).

https://lore.kernel.org/lkml/20230608090348.414990-1-gshan@redhat.com/

Suggested-by: Stefan Hanreich <s.hanreich at proxmox.com>
Signed-off-by: Stoiko Ivanov <s.ivanov at proxmox.com>
---
 ...l-stage2-mapping-on-invalid-memory-s.patch | 122 ++++++++++++++++++
 1 file changed, 122 insertions(+)
 create mode 100644 patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch

diff --git a/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch b/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch
new file mode 100644
index 000000000000..d50aab8e4d7c
--- /dev/null
+++ b/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch
@@ -0,0 +1,122 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Gavin Shan <gshan at redhat.com>
+Date: Thu, 15 Jun 2023 15:42:59 +1000
+Subject: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot
+
+commit 2230f9e1171a2e9731422a14d1bbc313c0b719d1 upstream.
+
+We run into guest hang in edk2 firmware when KSM is kept as running on
+the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
+device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
+buffered write. The status is returned by reading the memory region of
+the pflash device and the read request should have been forwarded to QEMU
+and emulated by it. Unfortunately, the read request is covered by an
+illegal stage2 mapping when the guest hang issue occurs. The read request
+is completed with QEMU bypassed and wrong status is fetched. The edk2
+firmware runs into an infinite loop with the wrong status.
+
+The illegal stage2 mapping is populated due to same page sharing by KSM
+at (C) even the associated memory slot has been marked as invalid at (B)
+when the memory slot is requested to be deleted. It's notable that the
+active and inactive memory slots can't be swapped when we're in the middle
+of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
+is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
+to zero again. Besides, the swapping from the active to the inactive memory
+slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
+corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
+
+  CPU-A                    CPU-B
+  -----                    -----
+                           ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
+                           kvm_vm_ioctl_set_memory_region
+                           kvm_set_memory_region
+                           __kvm_set_memory_region
+                           kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
+                             kvm_invalidate_memslot
+                               kvm_copy_memslot
+                               kvm_replace_memslot
+                               kvm_swap_active_memslots        (A)
+                               kvm_arch_flush_shadow_memslot   (B)
+  same page sharing by KSM
+  kvm_mmu_notifier_invalidate_range_start
+        :
+  kvm_mmu_notifier_change_pte
+    kvm_handle_hva_range
+    __kvm_handle_hva_range
+    kvm_set_spte_gfn            (C)
+        :
+  kvm_mmu_notifier_invalidate_range_end
+
+Fix the issue by skipping the invalid memory slot at (C) to avoid the
+illegal stage2 mapping so that the read request for the pflash's status
+is forwarded to QEMU and emulated by it. In this way, the correct pflash's
+status can be returned from QEMU to break the infinite loop in the edk2
+firmware.
+
+We tried a git-bisect and the first problematic commit is cd4c71835228 ("
+KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
+clean_dcache_guest_page() is called after the memory slots are iterated
+in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called
+before the iteration on the memory slots before this commit. This change
+literally enlarges the racy window between kvm_mmu_notifier_change_pte()
+and memory slot removal so that we're able to reproduce the issue in a
+practical test case. However, the issue exists since commit d5d8184d35c9
+("KVM: ARM: Memory virtualization setup").
+
+Cc: stable at vger.kernel.org # v3.9+
+Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
+Reported-by: Shuai Hu <hshuai at redhat.com>
+Reported-by: Zhenyu Zhang <zhenyzha at redhat.com>
+Signed-off-by: Gavin Shan <gshan at redhat.com>
+Reviewed-by: David Hildenbrand <david at redhat.com>
+Reviewed-by: Oliver Upton <oliver.upton at linux.dev>
+Reviewed-by: Peter Xu <peterx at redhat.com>
+Reviewed-by: Sean Christopherson <seanjc at google.com>
+Reviewed-by: Shaoqin Huang <shahuang at redhat.com>
+Message-Id: <20230615054259.14911-1-gshan at redhat.com>
+Signed-off-by: Paolo Bonzini <pbonzini at redhat.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh at linuxfoundation.org>
+(cherry picked from commit 953dd7e2df8181d5ce4117fca347992d616f0621)
+Signed-off-by: Stoiko Ivanov <s.ivanov at proxmox.com>
+---
+ virt/kvm/kvm_main.c | 20 +++++++++++++++++++-
+ 1 file changed, 19 insertions(+), 1 deletion(-)
+
+diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
+index db159be9d5b8..6deb43c2d091 100644
+--- a/virt/kvm/kvm_main.c
++++ b/virt/kvm/kvm_main.c
+@@ -636,6 +636,24 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
+ 
+ 	return __kvm_handle_hva_range(kvm, &range);
+ }
++
++static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
++{
++	/*
++	 * Skipping invalid memslots is correct if and only change_pte() is
++	 * surrounded by invalidate_range_{start,end}(), which is currently
++	 * guaranteed by the primary MMU.  If that ever changes, KVM needs to
++	 * unmap the memslot instead of skipping the memslot to ensure that KVM
++	 * doesn't hold references to the old PFN.
++	 */
++	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
++
++	if (range->slot->flags & KVM_MEMSLOT_INVALID)
++		return false;
++
++	return kvm_set_spte_gfn(kvm, range);
++}
++
+ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
+ 					struct mm_struct *mm,
+ 					unsigned long address,
+@@ -656,7 +674,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
+ 	if (!READ_ONCE(kvm->mmu_notifier_count))
+ 		return;
+ 
+-	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
++	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
+ }
+ 
+ void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
-- 
2.39.2






More information about the pve-devel mailing list