[pve-devel] [RFC] PVE-Backup: create jobs in a drained section

Fiona Ebner f.ebner at proxmox.com
Tue Sep 5 15:18:55 CEST 2023


Am 08.02.23 um 15:07 schrieb Fiona Ebner:
> 
> Inserting the copy-before-write node is already protected with a
> drained section, which is why there should be no actual issue right
> now. While that drained section does not extend until the bcs bitmap
> initialization, it should also not be an issue currently, because the
> job is not created from a coroutine (and even if, there would need to
> be a yield point in between).
> 

My explanation why it's currently not required is wrong. Because there
can be an IO thread which can interact with the bitmap and that is what
happens (with my reproducer) for the crash reported here [0]. I couldn't
reproduce it anymore with this RFC applied.

A sketch of what happens is:

Notes:
* Each time a block-copy request is created, it resets the dirty bitmap
at the corresponding range.
* Thread 1 is still doing the backup setup, cbw_open() and
backup_init_bcs_bitmap() happen in backup_job_create()
* The check if a request can be created for a given range relies on the
dirty bitmap.

  Thread 1 bdrv_dirty_bitmap_merge_internal() as part of cbw_open()
A Thread 3 bdrv_reset_dirty_bitmap(offset=x, bytes=4MiB)
  Thread 1 bdrv_clear_dirty_bitmap() as part of backup_init_bcs_bitmap()
  Thread 1 bdrv_dirty_bitmap_merge_internal() as part of
           backup_init_bcs_bitmap()
B
  Thread 3 bdrv_reset_dirty_bitmap(offset=0, bytes=4MiB)
  Thread 3 bdrv_reset_dirty_bitmap(offset=4MiB, bytes=4MiB)
  ....
C Thread 3 bdrv_reset_dirty_bitmap(offset=x, bytes=4MiB)

Note that at time B, there can be a mismatch between the bitmap and the
request list, if merging didn't by chance set the bits at the location
for request A again.

Then, if C happens before A has finished, an assert will trigger,
because there already is a request in the same range:

> block_copy_task_create: Assertion `!reqlist_find_conflict(&s->reqs, offset, bytes)' failed
C doesn't need to cover exactly the same range as A of course, just
overlap, but it did for me.

That said, I also tried reproducing the issue with QEMU 7.2, but didn't
manage to yet. I'll take another look and then re-send this for QEMU 8.0
with a fixed commit message.

[0]: https://forum.proxmox.com/threads/133149/





More information about the pve-devel mailing list