[pve-devel] [RFC qemu-server] vm_resume: correctly honor $nocheck

Thomas Lamprecht t.lamprecht at proxmox.com
Sat May 25 11:27:28 CEST 2019


On 5/24/19 10:39 PM, Fabian Grünbichler wrote:
> On Fri, May 24, 2019 at 09:02:45PM +0200, Thomas Lamprecht wrote:
>> On 5/24/19 8:37 PM, Fabian Grünbichler wrote:
>>> On Fri, May 24, 2019 at 11:05:41AM +0200, Thomas Lamprecht wrote:
>>>> On 5/24/19 10:52 AM, Dominik Csapak wrote:
>>>>> On 5/24/19 9:52 AM, Thomas Lamprecht wrote:
> 
> [... trimming a bit, and sorry in advance for adding another long
> response to the thread ]
> 

no worries, thanks for keeping up here with me ;-)

>>>>>> You cannot really mix pmxcfs / totem / cpg operations with an SSH
>>>>>> connection and assume any ordering guarantees between them; there are none.
>>>>>>
>>>>>
>>>>> is exactly what we do here
>>>>>
>>>>> 1. migrate
>>>>> 2. move file
>>>>> 3. tell the target to resume (and take ownership)
>>>>
>>>> yeah, that's why I noted it - it's the underlying issue behind this bug, IMO.
>>>>
>>>>>
>>>>> so maybe we should have a better mechanism to trigger the resume
>>>>> on the target side?
>>>>>
>>>>
>>>> There are a few things one could do here. A relatively simple one, leveraging
>>>> existing functionality, would be taking a cfs lock on the source before the
>>>> move, e.g., a cfs_lock_domain("migrate-$vmid"), then moving the config,
>>>> releasing the lock on the source, connecting to the target and doing the
>>>> resume under the same lock with a relatively big timeout (~90 seconds) -
>>>> this should add the necessary synchrony here.
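
(for reference, the idea as an untested sketch - function names from memory,
the rename helper is just a placeholder for what phase3 does today:)

    # source node, after the qemu migration itself has completed:
    PVE::Cluster::cfs_lock_domain("migrate-$vmid", 10, sub {
        rename_config_to_target_node($vmid, $targetnode); # placeholder
    });

    # target node (triggered via SSH/mtunnel), same lock name, big timeout,
    # so we wait until the release from the source has propagated:
    PVE::Cluster::cfs_lock_domain("migrate-$vmid", 90, sub {
        PVE::QemuServer::vm_resume($vmid, $skiplock, $nocheck);
    });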
>>>
>>> IMHO none of that is a good idea since we want minimal downtime..
>>
>> normally this is not expensive, and if you overload the network then yes, it
>> may take a bit for this to happen, but that's not our issue - if one does not
>> want this then they should not overload things in the first place..
>>
>> On not overloaded clusters/network the sequence:
>>
>> S 1. migrate done
>> S 2. lock (uncontended, so "immediate")
>> S 3. rename to target
>> S 4. release lock
>> S 5. ssh to target for resume
>> T 6. lock (normally the release from S has already propagated, so this is a
>>      no-op; otherwise milliseconds to a few seconds of wait)
>> T 7. running again.
>>
>> should not add relevant overhead, I'd guess - would need to try it out though.
> 
> it's 4 times as many cfs operations - at least in the ideal case of live
> migration with shared everything, no replication.
> 
> block migration adds an additional SSH and block migration handover,
> replication adds an additional SSH for state transfer, and an additional
> cfs_lock {
>   parse_replication_cfg;
>   update_replication_cfg;
> }
> 
> but see below for further possible improvements for those non-ideal
> cases.
> 
>> As it mostly works even with $nocheck not honored, one can derive that most
>> of the time the locking would not add any delay - else many more migrations
>> would fail.. So IMO really not such a bad idea..
> 
> the forum reports seem to indicate that it can happen quite regularly
> with bulk migration though, without the artificial massive load that I
> used to trigger it reliably when testing.
> 
>> And the thing is, we have really, really fundamental assumptions that the VM
>> belongs to the host its config is on, and if you start doing things with a
>> VM which, from your (node) POV, does not belong to you, it opens a can of
>> worms..
> 
> but we only do it here under very specific circumstances:
> 
> - we know we just successfully transferred the config

yeah, true, that's correct and already makes the $nocheck valid on its own..

I'll re-check your patch, and if nothing strange hits me I'll apply it
to master and stable-5..
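
For the record, honoring it there boils down to roughly the following
(paraphrased sketch from memory, not the literal patch):

    sub vm_resume {
        my ($vmid, $skiplock, $nocheck) = @_;

        PVE::QemuConfig->lock_config($vmid, sub {
            if ($nocheck) {
                # the config may not (yet) exist on this node, so do not
                # touch it at all - just tell QEMU to continue
                vm_mon_cmd_nocheck($vmid, "cont");
            } else {
                my $conf = PVE::QemuConfig->load_config($vmid);
                PVE::QemuConfig->check_lock($conf)
                    if !($skiplock || PVE::QemuConfig->has_lock($conf, 'backup'));
                vm_mon_cmd($vmid, "cont");
            }
        });
    }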

> - the VM is already running and the "real" one, although it is still
>   suspended
> - we want to resume it ASAP, with as little overhead as possible to
>   minimize downtime (I still remember [1])
> 
> [snip]
> 
> there are two more later operations that don't use nocheck, but they are
> probably already outside of the possible race window:
> - conditionally trimming cloned disks via SSH + 'qm guest cmd' (only for
>   block+live migration & opt-in, errors are ignored anyway)
> - unlocking the vm on the target node via SSH + 'qm unlock' (not fatal
>   for the migration, but requires manual cleanup)

quite the overhead produced, yes ^^

> 
> the following stuff is more problematic I think w.r.t. violating this
> basic assumption:
> 
> qemu_drive_mirror_monitor calls vm_resume with $nocheck set
> unconditionally (unless the guest agent is used, in which case we
> freeze/unfreeze instead of suspend/resume). I don't actually see a
> reason why it would need to set it at all - with live+block migration,
> the rename happens after block migration is completed, so we still own
> the VM. all other block migrations don't transfer the source VM anyway.
> 
> from a quick trip via grep and random spot checks, it looks like many
> vm_mon_cmd_nocheck calls are actually wrong, and should be vm_mon_cmd
> (all the _nocheck variant does is call check_running with $nocheck set,
> which just skips the 'die' in case the config file does not exist).
> AFAICT the only cases where using vm_mon_cmd_nocheck/check_running with
> $nocheck set is appropriate are:
> - live migration, on the target node, up until vm resume
> - live migration, on the source node, for the final cleanup/tear down
>   after the rename
> - live migration, some error handling cases (possibly completely covered
>   by the cases above)

agreed
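
for reference, simplified, the whole difference is just this (sketch from
memory, not verbatim):

    # simplified sketch of PVE::QemuServer::check_running
    sub check_running {
        my ($vmid, $nocheck, $node) = @_;

        my $filename = PVE::QemuConfig->config_file($vmid, $node);

        # without $nocheck a missing config file is fatal, i.e. we insist
        # that the VM (config) belongs to this node
        die "unable to find configuration file for VM $vmid - no such machine\n"
            if !$nocheck && ! -f $filename;

        # ... then check the pid file and /proc/<pid>/cmdline to see whether
        # the QEMU process is actually alive, independent of the config
    }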

> - code paths where no config exists yet, but check_running gets called
>   somehow (not sure if those exist? not sure if those SHOULD exist? :-P)

We allow it in the vm_resume API path, left over from the pre-mtunnel resume
command, AFAIK, but that could be removed with 6.0 as a breaking change - IMO
it should not exist anymore, at least this one

> 
> further possible/potential improvements for reducing downtime:
> - move 'qm nbdstop' into mtunnel (live+block migration, saves
>   establishing an SSH connection, should be safe)
+1
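
a rough sketch of what that could look like, assuming the existing
write_tunnel() plumbing on the source and the 'qm mtunnel' command loop on
the target (the 'nbdstop' branch there is hypothetical):

    # source side (QemuMigrate.pm), instead of the extra SSH + 'qm nbdstop':
    $self->write_tunnel($tunnel, 30, "nbdstop $vmid");

    # target side - hypothetical extra branch in the mtunnel command
    # handling, next to the existing 'resume' handling:
    if ($cmd =~ m/^nbdstop (\d+)$/) {
        PVE::QemuServer::nbd_stop($1);
    }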

> - move 'pvesr set-state' into mtunnel (transfer_replication_state, saves
>   establishing an SSH connection, should be safe)
+1

> - move switch_replication_job_target to after vm_resume, but before 'qm
>   unlock' (not sure of the ramifications - needs a closer look, would
>   remove the cfs_lock block I mentioned earlier from this "hot path")
yes, that one should be OK too, from a quick glance.


> 
> housekeeping:
> - drop nocheck parameter from vm_resume API call in PVE 6.x (we can rely
>   on mtunnel v1 being available, so no need for the SSH 'qm resume ...'
>   fallback which is probably the only reason why this ever ended up
>   being exposed over the API)

OK, I answered while reading, so I only read this now - but it's probably a
good sign that we both came to the same conclusion ^^

> 
>>> we don't want to introduce unnecessary delays into the window between
>>> "qemu migrate is complete" (and from qemu's POV, the target VM is now
>>> the active one) and "actually resume suspended VM on target node", which
>>> is exactly where this sequence of "rename config" -> "tell target node
>>> to resume VM" happens..
>>>
>>> I think the current approach is fine - provided vm_resume actually
>>> honors $nocheck, which should be the case (again) with this patch. an
>>> alternative approach would be to do the rename (with all the proper cfs
>>> locking) before starting the actual migration in phase2, but that would
>>> mean that lots of stuff in phase3(_cleanup) would need to be done with
>>> the config file no longer belonging to the source node, which basically
>>> means doing more stuff with "nocheck" semantics, as well as doing a
>>> revert of the handover in case the migration fails, further complicating
>>> things.
>>>
>>
>> As long as the VM cannot yet run as-is (with all VM state transferred) I
>> would not see it as ready to belong to the target, so I would not rename the
>> config earlier - that just brings headaches and violates our assumptions..
>> The question is how one would handle an eventual post-copy migration
> 
> I don't prefer the early rename either, in case that was not obvious ;)

I thought so, just wanted to underline why it is really not a good idea given
our assumptions :)

> 
> 1: https://forum.proxmox.com/threads/pve-5-live-migration-downtime-degradation-2-4-sec.35890/




