[PVE-User] QEMU crash with dpdk 22.11 app on Proxmox 8
Fiona Ebner
f.ebner at proxmox.com
Wed Sep 4 11:58:57 CEST 2024
Hi,
On 28.08.24 at 16:56, Knight, Joshua via pve-user wrote:
>
>
> We are seeing an issue on Proxmox 8 hosts where a guest's underlying QEMU process crashes when a DPDK application is started inside the guest.
>
>
> * Proxmox 8.2.4 with QEMU 9.0.2-2
> * Guest running Ubuntu 22.04, application is dpdk 22.11 testpmd
> * Using virtio network interfaces that are up/connected
> * Binding interfaces with the (legacy) igb_uio driver
>
> When the application starts, the SSH connection to the VM drops and the UI shows the VM as powered off.
>
> root@karma06:~/dpdk-22.11# python3 /root/dpdk-22.11/res/usr/local/bin/dpdk-devbind.py --bind=igb_uio enp6s20
> root@karma06:~/dpdk-22.11# python3 /root/dpdk-22.11/res/usr/local/bin/dpdk-devbind.py --bind=igb_uio enp6s21
> root@karma06:~/dpdk-22.11# python3 /root/dpdk-22.11/res/usr/local/bin/dpdk-devbind.py --bind=igb_uio enp6s22
> root@karma06:~/dpdk-22.11# python3 /root/dpdk-22.11/res/usr/local/bin/dpdk-devbind.py --bind=igb_uio enp6s23
>
> root@karma06:~/dpdk-22.11# /root/dpdk-22.11/res/usr/local/bin/dpdk-testpmd -- -i --port-topology=chained --rxq=1 --txq=1 --rss-ip
> EAL: Detected CPU lcores: 6
> EAL: Detected NUMA nodes: 1
> EAL: Detected static linkage of DPDK
> EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
> EAL: Selected IOVA mode 'PA'
> EAL: VFIO support initialized
> EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:06:12.0 (socket -1)
> eth_virtio_pci_init(): Failed to init PCI device
> EAL: Requested device 0000:06:12.0 cannot be used
> EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:06:13.0 (socket -1)
> eth_virtio_pci_init(): Failed to init PCI device
> EAL: Requested device 0000:06:13.0 cannot be used
> EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:06:14.0 (socket -1)
> EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:06:15.0 (socket -1)
> EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:06:16.0 (socket -1)
> EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:06:17.0 (socket -1)
> TELEMETRY: No legacy callbacks, legacy socket not created
> Interactive-mode selected
> Warning: NUMA should be configured manually by using --port-numa-config and --ring-numa-config parameters along with --numa.
> testpmd: create a new mbuf pool <mb_pool_0>: n=187456, size=2176, socket=0
> testpmd: preferred mempool ops selected: ring_mp_mc
> Configuring Port 0 (socket 0)
>
> client_loop: send disconnect: Broken pipe
>
>
>
> A failed QEMU assertion shows up in the host’s system log, and GDB shows that QEMU aborts on it.
>
> karma QEMU[27334]: kvm: ../accel/kvm/kvm-all.c:1836: kvm_irqchip_commit_routes: Assertion `ret == 0' failed.
>
> Thread 10 "CPU 0/KVM" received signal SIGABRT, Aborted.
> [Switching to Thread 0x7d999cc006c0 (LWP 36256)]
> __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
> 44 ./nptl/pthread_kill.c: No such file or directory.
> (gdb) bt
> #0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
> #1 0x00007d99a10a9e8f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
> #2 0x00007d99a105afb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> #3 0x00007d99a1045472 in __GI_abort () at ./stdlib/abort.c:79
> #4 0x00007d99a1045395 in __assert_fail_base (fmt=0x7d99a11b9a90 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
> assertion=assertion@entry=0x5a9eb5a20f5e "ret == 0", file=file@entry=0x5a9eb5a021a5 "../accel/kvm/kvm-all.c", line=line@entry=1836,
> function=function@entry=0x5a9eb5a03ca0 <__PRETTY_FUNCTION__.23> "kvm_irqchip_commit_routes") at ./assert/assert.c:92
> #5 0x00007d99a1053eb2 in __GI___assert_fail (assertion=assertion@entry=0x5a9eb5a20f5e "ret == 0",
> file=file@entry=0x5a9eb5a021a5 "../accel/kvm/kvm-all.c", line=line@entry=1836,
> function=function@entry=0x5a9eb5a03ca0 <__PRETTY_FUNCTION__.23> "kvm_irqchip_commit_routes") at ./assert/assert.c:101
> #6 0x00005a9eb566248c in kvm_irqchip_commit_routes (s=0x5a9eb79eed10) at ../accel/kvm/kvm-all.c:1836
> #7 kvm_irqchip_commit_routes (s=0x5a9eb79eed10) at ../accel/kvm/kvm-all.c:1821
> #8 0x00005a9eb540bed2 in virtio_pci_one_vector_unmask (proxy=proxy@entry=0x5a9eb9f5ada0, queue_no=queue_no@entry=4294967295,
> vector=vector@entry=0, msg=..., n=0x5a9eb9f63368) at ../hw/virtio/virtio-pci.c:991
> #9 0x00005a9eb540c09c in virtio_pci_vector_unmask (dev=0x5a9eb9f5ada0, vector=0, msg=...) at ../hw/virtio/virtio-pci.c:1056
> #10 0x00005a9eb536ff62 in msix_fire_vector_notifier (is_masked=false, vector=0, dev=0x5a9eb9f5ada0) at ../hw/pci/msix.c:120
> #11 msix_handle_mask_update (dev=0x5a9eb9f5ada0, vector=0, was_masked=<optimized out>) at ../hw/pci/msix.c:140
> #12 0x00005a9eb5602260 in memory_region_write_accessor (mr=0x5a9eb9f5b3e0, addr=12, value=<optimized out>, size=4, shift=<optimized out>,
> mask=<optimized out>, attrs=...) at ../system/memory.c:497
> #13 0x00005a9eb5602f4e in access_with_adjusted_size (addr=addr@entry=12, value=value@entry=0x7d999cbfae58, size=size@entry=4,
> access_size_min=<optimized out>, access_size_max=<optimized out>, access_fn=0x5a9eb56021e0 <memory_region_write_accessor>,
> mr=<optimized out>, attrs=...) at ../system/memory.c:573
> #14 0x00005a9eb560403c in memory_region_dispatch_write (mr=mr@entry=0x5a9eb9f5b3e0, addr=addr@entry=12, data=<optimized out>,
> op=<optimized out>, attrs=attrs@entry=...) at ../system/memory.c:1528
> #15 0x00005a9eb560b95f in flatview_write_continue_step (attrs=attrs@entry=..., buf=buf@entry=0x7d99a3433028 "", mr_addr=12,
> l=l@entry=0x7d999cbfaf80, mr=0x5a9eb9f5b3e0, len=4) at ../system/physmem.c:2713
> #16 0x00005a9eb560bbed in flatview_write_continue (mr=<optimized out>, l=<optimized out>, mr_addr=<optimized out>, len=4, ptr=0xfdf8500c,
> attrs=..., addr=4260909068, fv=0x7d8d6c0796b0) at ../system/physmem.c:2743
> #17 flatview_write (fv=0x7d8d6c0796b0, addr=addr@entry=4260909068, attrs=attrs@entry=..., buf=buf@entry=0x7d99a3433028, len=len@entry=4)
> at ../system/physmem.c:2774
> #18 0x00005a9eb560f251 in address_space_write (len=4, buf=0x7d99a3433028, attrs=..., addr=4260909068, as=0x5a9eb66f1f20 <address_space_memory>)
> at ../system/physmem.c:2894
> #19 address_space_rw (as=0x5a9eb66f1f20 <address_space_memory>, addr=4260909068, attrs=attrs@entry=..., buf=buf@entry=0x7d99a3433028, len=4,
> is_write=<optimized out>) at ../system/physmem.c:2904
> #20 0x00005a9eb56660e8 in kvm_cpu_exec (cpu=cpu@entry=0x5a9eb81e6890) at ../accel/kvm/kvm-all.c:2917
> #21 0x00005a9eb56676d5 in kvm_vcpu_thread_fn (arg=arg@entry=0x5a9eb81e6890) at ../accel/kvm/kvm-accel-ops.c:50
> #22 0x00005a9eb581dfe8 in qemu_thread_start (args=0x5a9eb81ee390) at ../util/qemu-thread-posix.c:541
> #23 0x00007d99a10a8134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
> #24 0x00007d99a11287dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>
>
> One interesting thing about this backtrace is that it seems to match an existing QEMU issue exactly. That issue is marked as fixed, and the fix should be present in QEMU 9.0.2, the version running on this Proxmox host.
>
> https://gitlab.com/qemu-project/qemu/-/issues/1928
>
> We’ve found a workaround: binding the interfaces for DPDK with the vfio-pci driver instead of the deprecated igb_uio driver. With vfio-pci, the VM does not crash. But I’m wondering whether anyone has hit this before or whether it’s a known issue. I would certainly not expect any operation in the guest to crash QEMU. It’s also odd that this crash is supposedly fixed in 9.0.2.
>
> We’ve been able to reproduce this on Proxmox 8.0, 8.1 and 8.2, on both AMD and Intel processors. The crash does not occur on earlier releases such as Proxmox 6.4, nor with earlier DPDK versions such as 20.08.
>
> Thanks,
> Josh
>
we currently carry a revert of that patch, because it caused some
regressions that sounded just as bad as the original issue [0].
A fix for those regressions has now landed upstream [1], and I'll look
into pulling it in and dropping the revert.
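In the meantime, the vfio-pci workaround from the report could look roughly like the script below. This is a minimal, untested sketch, not an official procedure. It assumes that enp6s20..enp6s23 correspond to PCI addresses 0000:06:14.0..0000:06:17.0 (the slot number in a predictable interface name is decimal, and those are the devices EAL probed successfully in the log above). It also assumes the guest has no vIOMMU, so vfio has to be put into no-IOMMU mode. Neither assumption is confirmed in the thread. By default the script only prints the commands; run it with DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: rebind the DPDK test ports in the guest from igb_uio to vfio-pci.
# ASSUMPTIONS (not from the thread): enp6s20..enp6s23 == 0000:06:14.0..17.0,
# and the guest has no vIOMMU, so vfio runs in (unsafe) no-IOMMU mode.
# Commands are only printed unless DRY_RUN=0 is set.
set -eu

DEVBIND=/root/dpdk-22.11/res/usr/local/bin/dpdk-devbind.py
PORTS="0000:06:14.0 0000:06:15.0 0000:06:16.0 0000:06:17.0"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

rebind_ports() {
    run modprobe vfio-pci
    # Only needed without a vIOMMU; skip this step if the guest has one.
    run sh -c 'echo 1 > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode'
    for dev in $PORTS; do
        # Use PCI addresses: a port bound to igb_uio no longer has a
        # kernel netdev name such as enp6s20.
        run python3 "$DEVBIND" --bind=vfio-pci "$dev"
    done
    # Show which driver each network device is bound to now.
    run python3 "$DEVBIND" --status-dev net
}

rebind_ports
```

Using PCI addresses rather than interface names is deliberate: dpdk-devbind.py can only resolve names like enp6s20 while the device is still bound to a kernel network driver.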
[0]:
https://git.proxmox.com/?p=pve-qemu.git;a=blob;f=debian/patches/extra/0006-Revert-virtio-pci-fix-use-of-a-released-vector.patch;h=d2de6d11ba1e2a2bd2ea8dccf660ac6e66b047d4;hb=582fd47901356342b8e0bef19d7d8fdc324d2d96
[1]:
https://lore.kernel.org/qemu-devel/a8e63ff289d137197ad7a701a587cc432872d798.1724151593.git.mst@redhat.com/
Best Regards,
Fiona