[pve-devel] [PATCH qemu] add fix for io_uring issue to unbreak backup with TPM disk potentially getting stuck
Fabian Grünbichler
f.gruenbichler at proxmox.com
Fri Dec 19 09:19:02 CET 2025
On November 26, 2025 12:03 pm, Fiona Ebner wrote:
> As reported in the community forum [0], backups with TPM disk on BTRFS
> storages would likely get stuck since the switch to '-blockdev' in
> qemu-server, i.e. commit 439f6e2a ("backup: use blockdev for TPM state
> file"), which moved away from using a 'drive_add' HMP command to
> attach the TPM state drive. Before that, the aio mode was QEMU's
> default 'threads'. Since that commit, the aio mode is the PVE default
> 'io_uring'.
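>
> For illustration, the difference boils down to the aio option of the
> TPM state blockdev (simplified sketch only, node name and path are
> placeholders, not the exact qemu-server command line):
>
>     --blockdev raw,node-name=tpm0,file.driver=file,file.filename=/path/to/tpm-state.raw,file.aio=threads
>     --blockdev raw,node-name=tpm0,file.driver=file,file.filename=/path/to/tpm-state.raw,file.aio=io_uring
>
> The first variant corresponds to the old 'threads' behavior, the
> second to the PVE default that exposes the io_uring bug.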
>
> The issue is actually not BTRFS-specific, but a logic bug in QEMU's
> current io_uring implementation, see the added patch for details. QEMU
> 10.2 includes a major rework of the io_uring feature, so the issue is
> already fixed there with QEMU commit 047dabef97 ("block/io_uring: use
> aio_add_sqe()"). While still on 10.1, a different fix is needed.
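>
> In a nutshell, the problematic pattern is the opportunistic
> completion processing at the very end of ioq_submit() (simplified
> sketch of the pre-patch code, see the full context in the patch
> below):
>
>     static int ioq_submit(LuringState *s)
>     {
>         int ret = 0;
>         /* ... submit queued requests via io_uring_submit() ... */
>         if (s->io_q.in_flight) {
>             /*
>              * May call luring_resubmit(), which only appends the
>              * request to submit_queue and relies on a later
>              * ioq_submit() call -- if no further I/O comes in, that
>              * call never happens and the request is stuck.
>              */
>             luring_process_completions(s);
>         }
>         return ret;
>     }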
>
> Upstream submission for the patch [1].
>
> [0]: https://forum.proxmox.com/threads/170045/
> [1]: https://lists.nongnu.org/archive/html/qemu-stable/2025-11/msg00321.html
This has been reviewed and applied upstream for 10.1.3, so we should
either pull it in or update to 10.1.3 (I think 10.2 still had other issues?).
> Signed-off-by: Fiona Ebner <f.ebner at proxmox.com>
> ---
> ...void-potentially-getting-stuck-after.patch | 153 ++++++++++++++++++
> debian/patches/series | 1 +
> 2 files changed, 154 insertions(+)
> create mode 100644 debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
>
> diff --git a/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> new file mode 100644
> index 0000000..372ecad
> --- /dev/null
> +++ b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> @@ -0,0 +1,153 @@
> +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
> +From: Fiona Ebner <f.ebner at proxmox.com>
> +Date: Mon, 24 Nov 2025 15:28:27 +0100
> +Subject: [PATCH] block/io_uring: avoid potentially getting stuck after
> + resubmit at the end of ioq_submit()
> +
> +Note that this issue seems already fixed as a consequence of the large
> +io_uring rework with 047dabef97 ("block/io_uring: use aio_add_sqe()")
> +in current master, so this is purely for QEMU stable branches.
> +
> +At the end of ioq_submit(), there is an opportunistic call to
> +luring_process_completions(). This is the single caller of
> +luring_process_completions() that doesn't use the
> +luring_process_completions_and_submit() wrapper.
> +
> +Other callers use the wrapper, because luring_process_completions()
> +might require a subsequent call to ioq_submit() after resubmitting a
> +request. As noted for luring_resubmit():
> +
> +> Resubmit a request by appending it to submit_queue. The caller must ensure
> +> that ioq_submit() is called later so that submit_queue requests are started.
> +
> +So the call at the end of ioq_submit() violates the contract and can
> +in fact be problematic if no other requests come in later. In such a
> +case, the request intended to be resubmitted will never actually be
> +submitted via io_uring_submit().
> +
> +A reproducer exposing this issue is [0], which is based on user
> +reports from [1]. Another reproducer is iotest 109 with '-i io_uring'.
> +
> +I had the most success triggering the issue with [0] when using a
> +BTRFS RAID 1 storage. With tmpfs, it can take quite a few iterations,
> +but it also triggers eventually on my machine. Iotest 109 with
> +'-i io_uring' reliably triggers the issue on my ext4 file system.
> +
> +Have ioq_submit() submit any resubmitted requests after calling
> +luring_process_completions(). The return value from io_uring_submit()
> +is checked to be non-negative before opportunistically processing
> +completions and entering the new resubmit logic, to ensure that a
> +failure of io_uring_submit() is not missed. Also note that the return
> +value was already not necessarily the total number of submissions,
> +since the loop might have been iterated more than once even before
> +the current change.
> +
> +Only trigger the resubmission logic when it is actually needed, to
> +avoid changing behavior more than necessary. For example, iotest 109
> +would produce more 'mirror ready' events if always resubmitting after
> +luring_process_completions() at the end of ioq_submit().
> +
> +Note that iotest 109 still does not pass as-is when run with '-i
> +io_uring', because two offset values for BLOCK_JOB_COMPLETED events
> +are zero instead of non-zero as in the expected output. The two
> +affected test cases are expected failures and still fail, they just
> +fail "faster". These test cases do not actually trigger the resubmit
> +logic, so the reason seems to be a different ordering of requests
> +and completions in the current aio=io_uring implementation versus
> +aio=threads.
> +
> +[0]:
> +
> +> #!/bin/bash -e
> +> #file=/mnt/btrfs/disk.raw
> +> file=/tmp/disk.raw
> +> filesize=256
> +> readsize=512
> +> rm -f $file
> +> truncate -s $filesize $file
> +> ./qemu-system-x86_64 --trace '*uring*' --qmp stdio \
> +> --blockdev raw,node-name=node0,file.driver=file,file.cache.direct=off,file.filename=$file,file.aio=io_uring \
> +> <<EOF
> +> {"execute": "qmp_capabilities"}
> +> {"execute": "human-monitor-command", "arguments": { "command-line": "qemu-io node0 \"read 0 $readsize \"" }}
> +> {"execute": "quit"}
> +> EOF
> +
> +[1]: https://forum.proxmox.com/threads/170045/
> +
> +Cc: qemu-stable at nongnu.org
> +Signed-off-by: Fiona Ebner <f.ebner at proxmox.com>
> +---
> + block/io_uring.c | 16 +++++++++++++---
> + 1 file changed, 13 insertions(+), 3 deletions(-)
> +
> +diff --git a/block/io_uring.c b/block/io_uring.c
> +index dd4f304910..5dbafc8f7b 100644
> +--- a/block/io_uring.c
> ++++ b/block/io_uring.c
> +@@ -120,11 +120,14 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
> + * event loop. When there are no events left to complete the BH is being
> + * canceled.
> + *
> ++ * Returns whether ioq_submit() must be called again afterwards since requests
> ++ * were resubmitted via luring_resubmit().
> + */
> +-static void luring_process_completions(LuringState *s)
> ++static bool luring_process_completions(LuringState *s)
> + {
> + struct io_uring_cqe *cqes;
> + int total_bytes;
> ++ bool resubmit = false;
> +
> + defer_call_begin();
> +
> +@@ -182,6 +185,7 @@ static void luring_process_completions(LuringState *s)
> + */
> + if (ret == -EINTR || ret == -EAGAIN) {
> + luring_resubmit(s, luringcb);
> ++ resubmit = true;
> + continue;
> + }
> + } else if (!luringcb->qiov) {
> +@@ -194,6 +198,7 @@ static void luring_process_completions(LuringState *s)
> + if (luringcb->is_read) {
> + if (ret > 0) {
> + luring_resubmit_short_read(s, luringcb, ret);
> ++ resubmit = true;
> + continue;
> + } else {
> + /* Pad with zeroes */
> +@@ -224,6 +229,8 @@ end:
> + qemu_bh_cancel(s->completion_bh);
> +
> + defer_call_end();
> ++
> ++ return resubmit;
> + }
> +
> + static int ioq_submit(LuringState *s)
> +@@ -231,6 +238,7 @@ static int ioq_submit(LuringState *s)
> + int ret = 0;
> + LuringAIOCB *luringcb, *luringcb_next;
> +
> ++resubmit:
> + while (s->io_q.in_queue > 0) {
> + /*
> + * Try to fetch sqes from the ring for requests waiting in
> +@@ -260,12 +268,14 @@ static int ioq_submit(LuringState *s)
> + }
> + s->io_q.blocked = (s->io_q.in_queue > 0);
> +
> +- if (s->io_q.in_flight) {
> ++ if (ret >= 0 && s->io_q.in_flight) {
> + /*
> + * We can try to complete something just right away if there are
> + * still requests in-flight.
> + */
> +- luring_process_completions(s);
> ++ if (luring_process_completions(s)) {
> ++ goto resubmit;
> ++ }
> + }
> + return ret;
> + }
> diff --git a/debian/patches/series b/debian/patches/series
> index b1afcd4..83e7f6d 100644
> --- a/debian/patches/series
> +++ b/debian/patches/series
> @@ -8,6 +8,7 @@ extra/0007-vfio-only-check-region-info-cache-for-initial-region.patch
> extra/0008-ui-vdagent-fix-windows-agent-regression.patch
> extra/0009-file-posix-populate-pwrite_zeroes_alignment.patch
> extra/0010-block-use-pwrite_zeroes_alignment-when-writing-first.patch
> +extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
> bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
> bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
> --
> 2.47.3
>
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>