[pve-devel] [PATCH qemu 03/11] block: add alloc-track driver

Wolfgang Bumiller w.bumiller at proxmox.com
Tue Jan 12 14:42:49 CET 2021


On Tue, Jan 12, 2021 at 12:29:05PM +0100, Stefan Reiter wrote:
> On 12/01/2021 11:54, Wolfgang Bumiller wrote:
> > one inline comment, otherwise lgtm
> > 
> > On Mon, Jan 11, 2021 at 12:14:01PM +0100, Stefan Reiter wrote:
> > > Add a new filter node 'alloc-track', which seperates reads and writes to
> > > different children, thus allowing to put a backing image behind any
> > > blockdev (regardless of driver support). Since we can't detect any
> > > pre-allocated blocks, we can only track new writes, hence the write
> > > target ('file') for this node must always be empty.
> > > 
> > > Intended use case is for live restoring, i.e. add a backup image as a
> > > block device into a VM, then put an alloc-track on the restore target
> > > and set the backup as backing. With this, one can use a regular
> > > 'block-stream' to restore the image, while the VM can already run in the
> > > background. Copy-on-read will help make progress as the VM reads as
> > > well.
> > > 
> > > This only worked if the target supports backing images, so up until now
> > > only for qcow2, with alloc-track any driver for the target can be used.
> > > 
> > > If 'auto-remove' is set, alloc-track will automatically detach itself
> > > once the backing image is removed. It will be replaced by 'file'.
> > > 
> > > Signed-off-by: Stefan Reiter <s.reiter at proxmox.com>
> > > ---
> > > 
> > > Potentially for upstream as well, but let's maybe test it a bit here first ;)
> > > 
> > >   block/Makefile.objs |   1 +
> > >   block/alloc-track.c | 319 ++++++++++++++++++++++++++++++++++++++++++++
> > >   2 files changed, 320 insertions(+)
> > >   create mode 100644 block/alloc-track.c
> > > 
> > > diff --git a/block/Makefile.objs b/block/Makefile.objs
> > > index 28fb3b7f7c..04f91b8cc9 100644
> > > --- a/block/Makefile.objs
> > > +++ b/block/Makefile.objs
> > > @@ -13,6 +13,7 @@ block-obj-$(CONFIG_QED) += qed-check.o
> > >   block-obj-y += vhdx.o vhdx-endian.o vhdx-log.o
> > >   block-obj-y += quorum.o
> > >   block-obj-y += zeroinit.o
> > > +block-obj-y += alloc-track.o
> > >   block-obj-y += blkdebug.o blkverify.o blkreplay.o
> > >   block-obj-$(CONFIG_PARALLELS) += parallels.o
> > >   block-obj-y += blklogwrites.o
> > > diff --git a/block/alloc-track.c b/block/alloc-track.c
> > > new file mode 100644
> > > index 0000000000..f2a4080579
> > > --- /dev/null
> > > +++ b/block/alloc-track.c
> > > @@ -0,0 +1,319 @@
> > > +/*
> > > + * Filter to allow backing images to be applied to any node. Assumes a blank
> > > + * image to begin with, only new writes are tracked as allocated, thus this
> > > + * must never be put on a node that already contains data.
> > > + *
> > > + * Copyright (c) 2020 Proxmox Server Solutions GmbH
> > > + * Copyright (c) 2020 Stefan Reiter <s.reiter at proxmox.com>
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > > + * See the COPYING file in the top-level directory.
> > > + */
> > > +
> > > +#include "qemu/osdep.h"
> > > +#include "qapi/error.h"
> > > +#include "block/block_int.h"
> > > +#include "qapi/qmp/qdict.h"
> > > +#include "qapi/qmp/qstring.h"
> > > +#include "qemu/cutils.h"
> > > +#include "qemu/option.h"
> > > +#include "qemu/module.h"
> > > +
> > > +#define TRACK_OPT_AUTO_REMOVE "auto-remove"
> > > +
> > > +typedef struct {
> > > +    BdrvDirtyBitmap *bitmap;
> > > +    bool dropping;
> > > +    bool auto_remove;
> > > +} BDRVAllocTrackState;
> > > +
> > > +static QemuOptsList runtime_opts = {
> > > +    .name = "alloc-track",
> > > +    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
> > > +    .desc = {
> > > +        {
> > > +            .name = TRACK_OPT_AUTO_REMOVE,
> > > +            .type = QEMU_OPT_BOOL,
> > > +            .help = "automatically replace this node with 'file' when 'backing'"
> > > +                    "is detached",
> > > +        },
> > > +        { /* end of list */ }
> > > +    },
> > > +};
> > > +
> > > +static void track_refresh_limits(BlockDriverState *bs, Error **errp)
> > > +{
> > > +    BlockDriverInfo bdi;
> > > +
> > > +    if (!bs->file) {
> > > +        return;
> > > +    }
> > > +
> > > +    /* always use alignment from underlying write device so RMW cycle for
> > > +     * bdrv_pwritev reads data from our backing via track_co_preadv (no partial
> > > +     * cluster allocation in 'file') */
> > > +    bdrv_get_info(bs->file->bs, &bdi);
> > > +    bs->bl.request_alignment = MAX(bs->file->bs->bl.request_alignment,
> > > +                                   MAX(bdi.cluster_size, BDRV_SECTOR_SIZE));
> > > +}
> > > +
> > > +static int track_open(BlockDriverState *bs, QDict *options, int flags,
> > > +                      Error **errp)
> > > +{
> > > +    BDRVAllocTrackState *s = bs->opaque;
> > > +    QemuOpts *opts;
> > > +    Error *local_err = NULL;
> > > +    int ret = 0;
> > > +
> > > +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> > > +    qemu_opts_absorb_qdict(opts, options, &local_err);
> > > +    if (local_err) {
> > > +        error_propagate(errp, local_err);
> > > +        ret = -EINVAL;
> > > +        goto fail;
> > > +    }
> > > +
> > > +    s->auto_remove = qemu_opt_get_bool(opts, TRACK_OPT_AUTO_REMOVE, false);
> > > +
> > > +    /* open the target (write) node, backing will be attached by block layer */
> > > +    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
> > > +                               BDRV_CHILD_DATA | BDRV_CHILD_METADATA, false,
> > > +                               &local_err);
> > > +    if (local_err) {
> > > +        ret = -EINVAL;
> > > +        error_propagate(errp, local_err);
> > > +        goto fail;
> > > +    }
> > > +
> > > +    track_refresh_limits(bs, errp);
> > > +    uint64_t gran = bs->bl.request_alignment;
> > > +    s->bitmap = bdrv_create_dirty_bitmap(bs->file->bs, gran, NULL, &local_err);
> > > +    if (local_err) {
> > > +        ret = -EIO;
> > > +        error_propagate(errp, local_err);
> > > +        goto fail;
> > > +    }
> > > +
> > > +    s->dropping = false;
> > > +
> > > +fail:
> > > +    if (ret < 0) {
> > > +        bdrv_unref_child(bs, bs->file);
> > > +        if (s->bitmap) {
> > > +            bdrv_release_dirty_bitmap(s->bitmap);
> > > +        }
> > > +    }
> > > +    qemu_opts_del(opts);
> > > +    return ret;
> > > +}
> > > +
> > > +static void track_close(BlockDriverState *bs)
> > > +{
> > > +    BDRVAllocTrackState *s = bs->opaque;
> > > +    if (s->bitmap) {
> > > +        bdrv_release_dirty_bitmap(s->bitmap);
> > > +    }
> > > +}
> > > +
> > > +static int64_t track_getlength(BlockDriverState *bs)
> > > +{
> > > +    return bdrv_getlength(bs->file->bs);
> > > +}
> > > +
> > > +static int coroutine_fn track_co_preadv(BlockDriverState *bs,
> > > +    uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
> > > +{
> > > +    BDRVAllocTrackState *s = bs->opaque;
> > > +    QEMUIOVector local_qiov;
> > > +    int ret;
> > > +
> > > +    /* 'cur_offset' is relative to 'offset', 'local_offset' to image start */
> > > +    uint64_t cur_offset, local_offset;
> > > +    int64_t local_bytes;
> > > +    bool alloc;
> > > +
> > > +    /* a read request can span multiple granularity-sized chunks, and can thus
> > > +     * contain blocks with different allocation status - we could just iterate
> > > +     * granularity-wise, but for better performance use bdrv_dirty_bitmap_next_X
> > > +     * to find the next flip and consider everything up to that in one go */
> > > +    for (cur_offset = 0; cur_offset < bytes; cur_offset += local_bytes) {
> > > +        local_offset = offset + cur_offset;
> > > +        alloc = bdrv_dirty_bitmap_get(s->bitmap, local_offset);
> > > +        if (alloc) {
> > > +            local_bytes = bdrv_dirty_bitmap_next_zero(s->bitmap, local_offset,
> > > +                                                      bytes - cur_offset);
> > > +        } else {
> > > +            local_bytes = bdrv_dirty_bitmap_next_dirty(s->bitmap, local_offset,
> > > +                                                       bytes - cur_offset);
> > > +        }
> > > +
> > > +        /* _bitmap_next_X return is -1 if no end found within limit, otherwise
> > > +         * offset of next flip (to start of image) */
> > > +        local_bytes = local_bytes < 0 ?
> > > +            bytes - cur_offset :
> > > +            local_bytes - local_offset;
> > > +
> > > +        qemu_iovec_init_slice(&local_qiov, qiov, cur_offset, local_bytes);
> > > +
> > > +        if (alloc) {
> > > +            ret = bdrv_co_preadv(bs->file, local_offset, local_bytes,
> > > +                                 &local_qiov, flags);
> > > +        } else if (bs->backing) {
> > > +            ret = bdrv_co_preadv(bs->backing, local_offset, local_bytes,
> > > +                                 &local_qiov, flags);
> > > +        } else {
> > 
> > Here you handle the case where there's no backing file, but other
> > functions (eg. track_co_block_status) simply assume that `bs->backing`
> > is always valid.
> > 
> > Is this intentional?
> > 
> 
> Not intentional, but the only place where bs->backing is assumed is in
> track_co_block_status. Fix for this is simple enough, since according to the
> comment in 'include/block/block_int.h' when you return 0, the block layer
> automatically checks if a backing image is attached and assumes zero
> otherwise.
> 
> So basically just to avoid a potential NULL deref:
> 
> -----------------
> diff --git a/block/alloc-track.c b/block/alloc-track.c
> index f2a4080579..b1a41be05a 100644
> --- a/block/alloc-track.c
> +++ b/block/alloc-track.c
> @@ -222,10 +222,10 @@ static int coroutine_fn
> track_co_block_status(BlockDriverState *bs,
>      if (alloc) {
>          *file = bs->file->bs;
>          return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
> -    } else {
> +    } else if (bs->backing) {
>          *file = bs->backing->bs;
> -        return 0;
>      }
> +    return 0;
>  }
> 
>  static void track_child_perm(BlockDriverState *bs, BdrvChild *c,
> -----------------
> 
> Yeah?
> 
> > It makes me also wonder - since you just zero out the iov below - if
> > this driver could maybe also replace the current zeroinit filter?
> > Otherwise it's at least a good reminder that the zeroinit code could
> > start making use of bitmaps as well ;-)
> > 
> 
> But zeroinit operates on writes, not reads? Not sure how this driver would
> help...

It tracks write offsets, this one tracks writes in general. The main
task of zeroinit is to make `bdrv_co_pwrite_zeroes` a nop if the
destination isn't "dirty" (under the assumption that its initial state
is all zeroed out already). So the difference would just be that there's
no backing file and that `pwrite_zeroes` checks the bitmap.
But yeah, come to think of it, that might be conflating too many things
into one driver.





More information about the pve-devel mailing list