[pve-devel] [RFC qemu-server] vm_resume: correctly honor $nocheck

Fabian Grünbichler f.gruenbichler at proxmox.com
Fri May 24 22:39:57 CEST 2019


On Fri, May 24, 2019 at 09:02:45PM +0200, Thomas Lamprecht wrote:
> On 5/24/19 8:37 PM, Fabian Grünbichler wrote:
> > On Fri, May 24, 2019 at 11:05:41AM +0200, Thomas Lamprecht wrote:
> >> On 5/24/19 10:52 AM, Dominik Csapak wrote:
> >>> On 5/24/19 9:52 AM, Thomas Lamprecht wrote:

[... trimming a bit, and sorry in advance for adding another long
response to the thread ]

> >>>> You cannot really mix pmxcfs / totem / cpg operations with an SSH
> >>>> connection and assume any order guarantees between them - there are none.
> >>>>
> >>>
> >>> that is exactly what we do here
> >>>
> >>> 1. migrate
> >>> 2. move file
> >>> 3. tell the target to resume (and take ownership)
> >>
> >> yeah, that's why I noted it - it's the underlying issue behind this bug, IMO.
> >>
> >>>
> >>> so maybe we should have a better mechanism to trigger the resume
> >>> on the target side?
> >>>
> >>
> >> There are a few things one could do here. A relatively simple one,
> >> leveraging existing functionality, would be taking a cfs lock before the
> >> move on the source, e.g., a cfs_lock_domain("migrate-$vmid"), then moving
> >> the config, releasing the lock on the source, and connecting to the target
> >> to do the resume under the same lock with a relatively big timeout
> >> (~ 90 seconds) - this should add the necessary synchronization here.
> > 
> > IMHO none of that is a good idea, since we want minimal downtime..
> 
> normally this is not expensive, and if you overload the network then
> yes, it may take a moment, but that's not our issue - if one does not
> want this delay, they should not overload things in the first place..
> 
> On non-overloaded clusters/networks the sequence:
> 
> S 1. migrate done
> S 2. lock (uncontended, so "immediate")
> S 3. rename to target
> S 4. release lock
> S 5. ssh to target for resume
> T 6. lock (normally the release from S has already propagated, making this
>      a no-op; otherwise milliseconds to a few seconds of wait)
> T 7. running again.
> 
> should not add relevant overhead, I'd guess - would need to try it out though.

it's 4 times as many cfs operations - at least in the ideal case of live
migration with shared everything, no replication.

block migration adds an additional SSH connection and the block migration
handover; replication adds another SSH connection for the state transfer,
and an additional
cfs_lock {
  parse_replication_cfg;
  update_replication_cfg;
}
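
for reference, that lock block lives in PVE::ReplicationConfig - roughly
the following shape, paraphrased from memory, so treat the helper names
as approximate:

    PVE::Cluster::cfs_lock_file('replication.cfg', undef, sub {
        my $cfg = PVE::ReplicationConfig->new(); # parse replication.cfg
        # ... swap source and target of $vmid's jobs ...
        $cfg->write(); # write the updated config back via pmxcfs
    });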

but see below for further possible improvements for those non-ideal
cases.

> As it mostly works now, with nocheck effectively not set, one can derive
> that most of the time the locking would not add any delay - else many
> more migrations would fail.. So IMO it's really not such a bad idea..

the forum reports seem to indicate that it can happen quite regularly
with bulk migration though, without the artificial massive load that I
used to trigger it reliably when testing.

> And the thing is, we have really really fundamental assumptions that the
> VM belongs to the host the config is on, and if you start doing things
> with a VM which from your (node) POV is not belonging to you it opens a
> can of worms..

but we only do it here under very specific circumstances:

- we know we just successfully transferred the config
- the VM on the target is already running and is the "real" one, although
  it is still suspended
- we want to resume it ASAP, with as little overhead as possible to
  minimize downtime (I still remember [1])

and we are using the exact mechanism (i.e., $nocheck) that was created
for those few cases (live migration is the only big one) where you might
need to override this basic assumption.
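
the whole $nocheck mechanism boils down to a single guard in
check_running - roughly the following, paraphrased from memory:

    sub check_running {
        my ($vmid, $nocheck, $node) = @_;

        my $filename = PVE::QemuConfig->config_file($vmid, $node);

        # $nocheck is the only escape hatch from the ownership assumption:
        # skip dying when the config is not (yet) visible on this node
        die "unable to find configuration file for VM $vmid - no such machine\n"
            if !$nocheck && ! -f $filename;

        # ... actual check for a running qemu process follows ...
    }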

even if we'd add locking around the actual transfer and resume, the
whole live migration code path would still be full of calls with
nocheck semantics, so we would not really gain anything tangible in the
big picture. (adding a comment in the appropriate places that we need
nocheck semantics between rename -> resume, to keep this visibility
race from affecting the migration, is probably a good idea though,
unless we go with the full cfs_lock approach.)
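
for completeness, the full cfs_lock approach would look roughly like
this (a sketch only - move_config_to_node is a hypothetical stand-in
for the actual rename code):

    # source node, after qemu reports 'migrate done':
    PVE::Cluster::cfs_lock_domain("migrate-$vmid", 90, sub {
        move_config_to_node($vmid, $target); # rename config via pmxcfs
    });

    # target node, triggered via SSH/mtunnel:
    PVE::Cluster::cfs_lock_domain("migrate-$vmid", 90, sub {
        # once we hold the lock, the rename must be visible here,
        # so plain check_running semantics suffice ($nocheck = 0)
        PVE::QemuServer::vm_resume($vmid, $skiplock, 0);
    });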

there are two further operations later on that don't use nocheck, but
they are probably already outside of the possible race window:
- conditionally trimming cloned disks via SSH + 'qm guest cmd' (only for
  block+live migration & opt-in, errors are ignored anyway)
- unlocking the vm on the target node via SSH + 'qm unlock' (not fatal
  for the migration, but requires manual cleanup)

the following is more problematic, I think, w.r.t. violating this
basic assumption:

qemu_drive_mirror_monitor calls vm_resume with $nocheck set
unconditionally (unless the guest agent is used, in which case we
freeze/unfreeze instead of suspend/resume). I don't actually see a
reason why it would need to set it at all - with live+block migration,
the rename happens after block migration is completed, so we still own
the VM. all other block migrations don't transfer the source VM anyway.
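
i.e., dropping it there should be safe - a sketch, assuming the current
call site passes $nocheck unconditionally as described above:

    # in qemu_drive_mirror_monitor, after mirroring has converged:
    # currently: vm_resume($vmid, $skiplock, 1);
    # the config still belongs to this node at that point, so:
    vm_resume($vmid, $skiplock, 0);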

from a quick trip via grep and some random spot checks, it looks like
many vm_mon_cmd_nocheck calls are actually wrong and should be
vm_mon_cmd (all the _nocheck variant does is call check_running with
$nocheck set, which just skips the 'die' in case the config file does
not exist - see the sketch after the following list). AFAICT the only
cases where using vm_mon_cmd_nocheck/check_running with $nocheck set is
appropriate are:
- live migration, on the target node, up until vm resume
- live migration, on the source node, for the final cleanup/tear down
  after the rename
- live migration, some error handling cases (possibly completely covered
  by the cases above)
- code paths where no config exists yet, but check_running gets called
  somehow (not sure if those exist? not sure if those SHOULD exist? :-P)
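
for reference, the two variants only differ in that flag - roughly the
following, paraphrased from memory:

    sub vm_mon_cmd {
        my ($vmid, $execute, %params) = @_;
        my $cmd = { execute => $execute, arguments => \%params };
        vm_qmp_command($vmid, $cmd); # check_running($vmid) may die
    }

    sub vm_mon_cmd_nocheck {
        my ($vmid, $execute, %params) = @_;
        my $cmd = { execute => $execute, arguments => \%params };
        vm_qmp_command($vmid, $cmd, 1); # check_running($vmid, 1) won't
    }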

further potential improvements for reducing downtime (see the sketch
after this list for the tunnel side):
- move 'qm nbdstop' into mtunnel (live+block migration, saves
  establishing an SSH connection, should be safe)
- move 'pvesr set-state' into mtunnel (transfer_replication_state, saves
  establishing an SSH connection, should be safe)
- move switch_replication_job_target to after vm_resume, but before 'qm
  unlock' (not sure of the ramifications - needs a closer look, would
  remove the cfs_lock block I mentioned earlier from this "hot path")
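
the tunnel side of the first two would just mean teaching mtunnel two
new commands and replacing the SSH round trips with something like the
following (hypothetical command names, not existing ones):

    $self->write_tunnel($tunnel, 30, "nbdstop $vmid");
    $self->write_tunnel($tunnel, 30, "set-state $vmid $state");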

housekeeping:
- drop the nocheck parameter from the vm_resume API call in PVE 6.x (we
  can rely on mtunnel v1 being available, so there is no need for the
  SSH 'qm resume ...' fallback, which is probably the only reason this
  ever ended up being exposed over the API - see below)
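
for reference, the mtunnel v1 variant is simply (roughly, from memory)

    $self->write_tunnel($tunnel, 30, "resume $vmid");

vs. the fallback of running 'qm resume $vmid --skiplock --nocheck' over
SSH, which is presumably the only reason nocheck is exposed there.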

> > we don't want to introduce unnecessary delays into the window between
> > "qemu migrate is complete" (and from qemu's POV, the target VM is now
> > the active one) and "actually resume suspended VM on target node", which
> > is exactly where this sequence of "rename config" -> "tell target node
> > to resume VM" happens..
> > 
> > I think the current approach is fine - provided vm_resume actually
> > honors $nocheck, which should be the case (again) with this patch. an
> > alternative approach would be to do the rename (with all the proper cfs
> > locking) before starting the actual migration in phase2, but that would
> > mean that lots of stuff in phase3(_cleanup) would need to be done with
> > the config file no longer belonging to the source node, which basically
> > means doing more stuff with "nocheck" semantics, as well as doing a
> > revert of the handover in case the migration fails, further complicating
> > things.
> > 
> 
> As long as the VM can not run as-is (all VM state transferred), I would
> not consider it ready to belong to the target, so I would not rename the
> config earlier - that just brings headaches and violates our assumptions..
> The question is how one would handle an eventual post-copy migration

I don't prefer the early rename either, in case that was not obvious ;)

1: https://forum.proxmox.com/threads/pve-5-live-migration-downtime-degradation-2-4-sec.35890/



