[pve-devel] [PATCH qemu] add fix for io_uring issue to unbreak backup with TPM disk potentially getting stuck

Fabian Grünbichler f.gruenbichler at proxmox.com
Fri Dec 19 09:19:02 CET 2025


On November 26, 2025 12:03 pm, Fiona Ebner wrote:
> As reported in the community forum [0], backups with a TPM disk on BTRFS
> storages would likely get stuck since the switch to '-blockdev' in
> qemu-server, i.e. commit 439f6e2a ("backup: use blockdev for TPM state
> file"), which moved away from using a 'drive_add' HMP command to
> attach the TPM state drive. Before that, the aio mode was QEMU's
> default 'threads'. Since that commit, the aio mode is the PVE default
> 'io_uring'.
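> 
> For illustration, with '-blockdev' the TPM state file ends up attached
> roughly along these lines (node name and path are made up here),
> picking up the PVE default aio mode unless set otherwise:
> 
>   -blockdev '{"driver":"raw","node-name":"drive-tpmstate0",
>     "file":{"driver":"file","filename":"/path/to/tpmstate.raw",
>     "aio":"io_uring"}}'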
> 
> The issue is actually not BTRFS-specific, but a logic bug in QEMU's
> current io_uring implementation, see the added patch for details. QEMU
> 10.2 includes a major rework of the io_uring feature, so the issue is
> already fixed there with QEMU commit 047dabef97 ("block/io_uring: use
> aio_add_sqe()"). While still on 10.1, a different fix is needed.
> 
> Upstream submission for the patch [1].
> 
> [0]: https://forum.proxmox.com/threads/170045/
> [1]: https://lists.nongnu.org/archive/html/qemu-stable/2025-11/msg00321.html

This has been reviewed and applied upstream for 10.1.3, so we should
either pull it in or update to 10.1.3 (I think 10.2 still had other
issues?).
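
For reference, assuming the usual stable tags in qemu.git, something
like the following should show the backported fix (a sketch, not
verified against the actual tag):

    git fetch https://gitlab.com/qemu-project/qemu.git --tags
    git log --oneline v10.1.3 -- block/io_uring.c | head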

> Signed-off-by: Fiona Ebner <f.ebner at proxmox.com>
> ---
>  ...void-potentially-getting-stuck-after.patch | 153 ++++++++++++++++++
>  debian/patches/series                         |   1 +
>  2 files changed, 154 insertions(+)
>  create mode 100644 debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> 
> diff --git a/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> new file mode 100644
> index 0000000..372ecad
> --- /dev/null
> +++ b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> @@ -0,0 +1,153 @@
> +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
> +From: Fiona Ebner <f.ebner at proxmox.com>
> +Date: Mon, 24 Nov 2025 15:28:27 +0100
> +Subject: [PATCH] block/io_uring: avoid potentially getting stuck after
> + resubmit at the end of ioq_submit()
> +
> +Note that this issue seems to be already fixed in current master by
> +the large io_uring rework with 047dabef97 ("block/io_uring: use
> +aio_add_sqe()"), so this patch is purely for QEMU stable branches.
> +
> +At the end of ioq_submit(), there is an opportunistic call to
> +luring_process_completions(). This is the single caller of
> +luring_process_completions() that doesn't use the
> +luring_process_completions_and_submit() wrapper.
> +
> +Other callers use the wrapper, because luring_process_completions()
> +might require a subsequent call to ioq_submit() after resubmitting a
> +request. As noted for luring_resubmit():
> +
> +> Resubmit a request by appending it to submit_queue.  The caller must ensure
> +> that ioq_submit() is called later so that submit_queue requests are started.
> +
> +So the caller at the end of ioq_submit() violates the contract and can
> +in fact be problematic if no other requests come in later. In such a
> +case, the request intended to be resubmitted will never actually be
> +submitted via io_uring_submit().
> +
> +A reproducer exposing this issue is [0], which is based on user
> +reports from [1]. Another reproducer is iotest 109 with '-i io_uring'.
> +
> +I had the most success triggering the issue with [0] when using a
> +BTRFS RAID 1 storage. With tmpfs, it can take quite a few iterations,
> +but it also triggers eventually on my machine. With iotest 109 and
> +'-i io_uring', the issue triggers reliably on my ext4 file system.
> +
> +Have ioq_submit() submit any resubmitted requests after calling
> +luring_process_completions(). The return value from io_uring_submit()
> +is checked to be non-negative before opportunistically processing
> +completions and entering the new resubmit logic, to ensure that a
> +failure of io_uring_submit() is not missed. Also note that, even
> +before this change, the return value was not necessarily the total
> +number of submissions, since the loop might have been iterated more
> +than once.
> +
> +Only trigger the resubmission logic when it is actually needed, to
> +avoid changing behavior more than necessary. For example, iotest 109
> +would produce more 'mirror ready' events if always resubmitting after
> +luring_process_completions() at the end of ioq_submit().
> +
> +Note that iotest 109 still does not pass as-is when run with '-i
> +io_uring', because two offset values for BLOCK_JOB_COMPLETED events
> +are zero instead of non-zero as in the expected output. The two
> +affected test cases are expected failures and still fail, they just
> +fail "faster". These test cases do not actually trigger the resubmit
> +logic, so the reason seems to be a different ordering of requests and
> +completions in the current aio=io_uring implementation compared to
> +aio=threads.
> +
> +[0]:
> +
> +> #!/bin/bash -e
> +> #file=/mnt/btrfs/disk.raw
> +> file=/tmp/disk.raw
> +> filesize=256
> +> readsize=512
> +> rm -f $file
> +> truncate -s $filesize $file
> +> ./qemu-system-x86_64 --trace '*uring*' --qmp stdio \
> +> --blockdev raw,node-name=node0,file.driver=file,file.cache.direct=off,file.filename=$file,file.aio=io_uring \
> +> <<EOF
> +> {"execute": "qmp_capabilities"}
> +> {"execute": "human-monitor-command", "arguments": { "command-line": "qemu-io node0 \"read 0 $readsize \"" }}
> +> {"execute": "quit"}
> +> EOF
> +
> +[1]: https://forum.proxmox.com/threads/170045/
> +
> +Cc: qemu-stable at nongnu.org
> +Signed-off-by: Fiona Ebner <f.ebner at proxmox.com>
> +---
> + block/io_uring.c | 16 +++++++++++++---
> + 1 file changed, 13 insertions(+), 3 deletions(-)
> +
> +diff --git a/block/io_uring.c b/block/io_uring.c
> +index dd4f304910..5dbafc8f7b 100644
> +--- a/block/io_uring.c
> ++++ b/block/io_uring.c
> +@@ -120,11 +120,14 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
> +  * event loop.  When there are no events left  to complete the BH is being
> +  * canceled.
> +  *
> ++ * Returns whether ioq_submit() must be called again afterwards since requests
> ++ * were resubmitted via luring_resubmit().
> +  */
> +-static void luring_process_completions(LuringState *s)
> ++static bool luring_process_completions(LuringState *s)
> + {
> +     struct io_uring_cqe *cqes;
> +     int total_bytes;
> ++    bool resubmit = false;
> + 
> +     defer_call_begin();
> + 
> +@@ -182,6 +185,7 @@ static void luring_process_completions(LuringState *s)
> +              */
> +             if (ret == -EINTR || ret == -EAGAIN) {
> +                 luring_resubmit(s, luringcb);
> ++                resubmit = true;
> +                 continue;
> +             }
> +         } else if (!luringcb->qiov) {
> +@@ -194,6 +198,7 @@ static void luring_process_completions(LuringState *s)
> +             if (luringcb->is_read) {
> +                 if (ret > 0) {
> +                     luring_resubmit_short_read(s, luringcb, ret);
> ++                    resubmit = true;
> +                     continue;
> +                 } else {
> +                     /* Pad with zeroes */
> +@@ -224,6 +229,8 @@ end:
> +     qemu_bh_cancel(s->completion_bh);
> + 
> +     defer_call_end();
> ++
> ++    return resubmit;
> + }
> + 
> + static int ioq_submit(LuringState *s)
> +@@ -231,6 +238,7 @@ static int ioq_submit(LuringState *s)
> +     int ret = 0;
> +     LuringAIOCB *luringcb, *luringcb_next;
> + 
> ++resubmit:
> +     while (s->io_q.in_queue > 0) {
> +         /*
> +          * Try to fetch sqes from the ring for requests waiting in
> +@@ -260,12 +268,14 @@ static int ioq_submit(LuringState *s)
> +     }
> +     s->io_q.blocked = (s->io_q.in_queue > 0);
> + 
> +-    if (s->io_q.in_flight) {
> ++    if (ret >= 0 && s->io_q.in_flight) {
> +         /*
> +          * We can try to complete something just right away if there are
> +          * still requests in-flight.
> +          */
> +-        luring_process_completions(s);
> ++        if (luring_process_completions(s)) {
> ++            goto resubmit;
> ++        }
> +     }
> +     return ret;
> + }
> diff --git a/debian/patches/series b/debian/patches/series
> index b1afcd4..83e7f6d 100644
> --- a/debian/patches/series
> +++ b/debian/patches/series
> @@ -8,6 +8,7 @@ extra/0007-vfio-only-check-region-info-cache-for-initial-region.patch
>  extra/0008-ui-vdagent-fix-windows-agent-regression.patch
>  extra/0009-file-posix-populate-pwrite_zeroes_alignment.patch
>  extra/0010-block-use-pwrite_zeroes_alignment-when-writing-first.patch
> +extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
>  bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
>  bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
>  bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
> -- 
> 2.47.3