[pve-devel] [PATCH v2 qemu-server 2/2] remote-migration: add target-cpu param

Fri Apr 28 11:12:54 CEST 2023

On April 28, 2023 8:43 am, DERUMIER, Alexandre wrote:
>> >
>>> And currently we don't yet support offline storage migration. (BTW,
>>> this also breaks migration with unused disks.)
>>> I don't know if we can send a send|receive transfer through the
>>> tunnel?
>>> (I never tested it)
> 
>> we do, but maybe you tested with RBD which doesn't support storage
>> migration yet? within a cluster it doesn't need to, since it's a
>> shared storage, but between clusters we need to implement it (it's on
>> my TODO list and shouldn't be too hard since there is 'rbd
>> export/import').
>> 
> Yes, this was with an unused rbd device indeed.
> (Another way could be to implement qemu-storage-daemon (never tested
> it) for offline sync with any storage, like lvm)
> 
> Also cloud-init drives seem to be unmigratable currently. (I wonder if
> we couldn't simply regenerate it on the target: as we now have the
> cloud-init pending section, we can correctly generate the cloudinit
> drive with the current running config).
> 
> 
> 
>> > > given that it might make sense to safeguard this implementation
>> > > here,
>> > > and maybe switch to a new "mode" parameter?
>> > > 
>> > > online => switching CPU not allowed
>> > > offline or however-we-call-this-new-mode (or in the future, two-
>> > > phase-restart) => switching CPU allowed
>> > > 
>> > 
>> > Yes, I was thinking about that too.
>> > Maybe not "offline", because maybe we want to implement a real
>> > offline
>> > mode later.
>> > But simply "restart" ?
>> 
>> no, I meant moving the existing --online switch to a new mode
>> parameter,
>> then we'd have "online" and "offline", and then add your new mode on
>> top
>> "however-we-call-this-new-mode", and then we could in the future also
>> add "two-phase-restart" for the sync-twice mode I described :)
>> 
>> target-cpu would of course also be supported for the (existing)
>> offline
>> mode, since it just needs to adapt the target-cpu in the config.
>> 
>> the main thing I'd want to avoid is somebody accidentally setting
>> "target-cpu", not knowing/noticing that that entails what amounts to
>> a
>> reset of the VM as part of the migration..
>> 
> Yes, that's what I had understood
>  ;)
> 
> It was more about the "offline" term, because we don't take the source
> vm offline until the disk migration is finished. (to reduce downtime)
> More like "online-restart" instead of "offline".
> 
> Offline, for me, really means we shut down the vm, then do the disk
> migration.

hmm, I guess it depends on how you see it. for me, online means without interruption,
anything else is offline :) but yeah, naming is hard, as always ;)

>> there were a few things down below that might also be worthy of
>> discussion. I also wonder whether the two variants of "freeze FS" and
>> "suspend without state" are enough - that only ensures that no more
>> I/O
>> happens so the volumes are bitwise identical, but shouldn't we also
>> at
>> least have the option of doing a clean shutdown at that point so that
>> applications can serialize/flush their state properly and that gets
>> synced across as well? else this is the equivalent of cutting the
>> power
>> cord, which might not be a good fit for all use cases ;)
>> 
> I tried the clean shutdown in my v1 patch
> https://lists.proxmox.com/pipermail/pve-devel/2023-March/056291.html
> (without doing the block-job-complete) in phase3, and I sometimes got
> fs corruption.
> Not sure why exactly (maybe the OS didn't shut down correctly, or maybe
> some data was still in the buffer?)
> Maybe doing the block-job-complete first would make it safe
> (transfer access to the nbd, then do the clean shutdown).

possibly we need a special "shutdown guest, but leave qemu running" way
of shutting down (so that the guest and any applications within can do
their thing, and the block job can transfer all the delta across).
completing or cancelling the block job before the guest has shut down
would mean the source and target are not consistent (since shutdown can
change the disk content, and that would then not be mirrored anymore?),
so I don't see any way that that could be an improvement. it would mean
that starting the shutdown is already the point of no return -
cancelling before would mean writes are not transferred to the target,
completing before would mean writes are not written to the source
anymore, so we can't fallback to the source node in error handling.

I guess we could have two approaches:

A - freeze or suspend (depending on QGA availability), then complete
block job and (re)start target VM
B - shutdown guest OS, complete, then exit source VM and (re)start
target VM

as always, there's a tradeoff there - A is faster, but less consistent
from the guest's point of view (somewhat similar to pulling the power
cable). B can take a while (== service downtime!), but it has the same
semantics as a reboot.
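
to make the ordering concrete, here is a rough Python sketch of the two
source-side sequences. this is not qemu-server code: the function names and
step names are made-up labels for illustration, not real QMP commands or
API calls. it only encodes the order described above, plus the fact that
once the block job is completed, writes no longer reach the source and we
can't fall back to it anymore.

```python
# Illustrative only: step names are hypothetical labels, not QMP/API calls.
def source_steps(variant, qga_available):
    """Return the ordered source-side steps for variant 'A' or 'B'."""
    if variant == "A":
        # freeze via guest agent if available, otherwise suspend the VM
        quiesce = "fsfreeze" if qga_available else "suspend"
        return [quiesce, "complete-block-job", "start-target-vm"]
    if variant == "B":
        return ["shutdown-guest-os", "complete-block-job",
                "exit-source-vm", "start-target-vm"]
    raise ValueError(f"unknown variant: {variant}")


def can_fall_back(steps, failed_at):
    """True if the step at index failed_at comes before block job completion.

    After completion, writes are no longer written to the source volumes,
    so error handling cannot simply resume the source VM.
    """
    return failed_at < steps.index("complete-block-job")
```

for example, can_fall_back(source_steps("B", True), 0) is true (the guest
shutdown failed, the mirror is still running), while any failure after
index 1 is past the point of no return.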

there are also IMHO multiple ways to think about the target side:

A start VM in migration mode, but kill it without ever doing any
migration, then start it again with modified config (current approach)
B start VM paused (like when doing a backup of a stopped VM, without
incoming migration), but with overridden CPU parameter, then just
'resume' it when the block migration is finished
C don't start a VM at all, just the block devices via
qemu-storage-daemon for the block migration, then do a regular start
after the block migration and config update are done

B has the advantage over A that we don't risk the VM not being able to
restart (e.g., because of a race for memory or pass-through resources),
and also the resume should be (depending on exact environment possibly
quite a bit) faster than kill+start
C has the advantage over A and B that the migration itself is cheaper
resource-wise, but the big downside that we don't even know if the VM is
startable on the target node, and of course, it's a lot more code to
write. possibly I just included it because I am looking for an excuse to
play around with qemu-storage-daemon - it's probably the least relevant
variant for now ;)

> 
> I'll give a try in the V3. 
> 
> I just wonder if we can add a new param, like:
> 
> --online --fsfreeze
> 
> --online --shutdown
> 
> --online --2phase-restart

that would also be an option. not sure by heart if it's possible to
make --online into a property string that is backwards compatible with
the "plain boolean" option? if so, we could do

--online [mode=live,qga,suspend,shutdown,2phase,..]

with live being the default (not supporting target-cpu) and
qga,suspend,shutdown all handling target-cpu (2phase just included for
completeness' sake)
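
roughly the semantics I have in mind, as a Python sketch. qemu-server is
Perl and has its own schema/property string handling, so this is only meant
to illustrate the behaviour, not how it would be implemented. the function
names are made up, the accepted boolean spellings are an assumption, and
treating 2phase as not handling target-cpu is also just an assumption for
the sketch.

```python
# Sketch only: hypothetical names, not the qemu-server option parser.
MODES = ("live", "qga", "suspend", "shutdown", "2phase")
# assumption: 2phase is left out here, it is undecided in this thread
TARGET_CPU_MODES = ("qga", "suspend", "shutdown")


def parse_online(value):
    """Parse an --online value; return None for offline, else the mode."""
    # plain boolean form stays valid for backwards compatibility
    if value in ("0", "no", "off", "false"):
        return None
    if value in ("1", "yes", "on", "true"):
        return "live"  # live is the default mode
    key, sep, mode = value.partition("=")
    if sep != "=" or key != "mode" or mode not in MODES:
        raise ValueError(f"invalid --online value: {value}")
    return mode


def check_target_cpu(online, target_cpu):
    """Reject target-cpu for modes that cannot switch the CPU model."""
    mode = parse_online(online)
    # mode None is the existing offline migration, which only needs the
    # target-cpu adapted in the config, so it is always allowed there
    if target_cpu is not None and mode is not None \
            and mode not in TARGET_CPU_MODES:
        raise ValueError(f"target-cpu not supported with mode '{mode}'")
    return mode
```

with that, "--online 1" keeps meaning a regular live migration, and
somebody setting target-cpu without explicitly picking one of the
restarting modes gets an error instead of a surprise reset.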

alternatively, if that doesn't work, having --online [--online-mode live,qga,suspend,..]

would be my second choice I guess, if we are reasonably sure that all
the possible extensions would be for running VMs only. the only thing
counter to that that I can think of would be storage migration using
qemu-storage-daemon (e.g., like you said, to somehow bolt on
incremental support using persistent bitmaps for storages/image formats
that don't support that otherwise), and there I am not even sure whether
that couldn't be somehow handled in pve-storage anyway

>  (I'm currently migrating a lot of vm between an old intel cluster to
> the new amd cluster, on different datacenter, with a different ceph
> cluster, so I can still do real production tests)

technically, target-cpu might also be a worthwhile feature for
heterogeneous clusters where a proper/full live migration is not possible
for certain node/CPU combinations.. we do already update volume IDs when
using 'targetstorage', so also updating the CPU should be doable there
as well. using the still experimental remote migration as a field for
evaluation is fine, just something to keep in mind while thinking about
options, so that we don't accidentally maneuver ourselves into a corner
that makes that part impossible :)