[pve-devel] [RFC qemu-server] vm_resume: correctly honor $nocheck

Fri May 24 20:37:45 CEST 2019

On Fri, May 24, 2019 at 11:05:41AM +0200, Thomas Lamprecht wrote:
> On 5/24/19 10:52 AM, Dominik Csapak wrote:
> > On 5/24/19 9:52 AM, Thomas Lamprecht wrote:
> >> On 5/24/19 9:36 AM, Fabian Grünbichler wrote:
> >>> On Fri, May 24, 2019 at 08:24:17AM +0200, Dominik Csapak wrote:
> >>>> LGTM, i introduced this seemingly sometime last year... (oops)
> >>>>
> >>>> the question remains why it sometimes takes so long for a rename to
> >>>> be propagated
> >>>
> >>> the race window is actually very small - I guess it gets a bit bigger
> >>> and thus triggers more easily with the additional load.
> >>>
> >>> the delay that the rename takes to return can go up into the seconds
> >>> range (which is okay - if the pmxcfs is very very busy, write operations
> >>> can block a bit, I tested with two nodes writing non-stop ;)).
> >>>
> >>> the delay between visibility on source and target is so small that it is
> >>> within the margin of errors (we are talking about measuring timestamps
> >>> across node boundaries after all).
> >>>
> >>>> in my opinion this violates the assumptions we make regarding ownership
> >>>> of files/vms since it seems here that nobody own the vm when this happens
> >>>> (the source node believes the target is the owner and vice versa)
> >>>
> >>> maybe we can take a closer look next week at pmxcfs debug output.. we
> >>> don't have many instances of moving ownership from one node to the other
> >>> though, and migrating happens under a config lock anyway so modulo this
> >>> missing nocheck I don't see a way that this is problematic..
> >>>
> >>> it's probably an issue of node T having received and acked the change,
> >>> but not yet fully processed it. if you ack the change after making it
> >>> visible, you have the reverse problem (T getting updated before S).
> >>
> >> Virtual Synchrony / TOTEM [0] just says that if node a sees events happen
> >> in order: A -> B -> C then all nodes will see it in this order.
> >>
> >> But it's _not_ guaranteed that they see it at the same instant, that's not
> >> really possible.
> >>
> >> Corosync uses Extendend Virtual Synchrony[1] which in addition to above
> >> also ensures that group changes are ordered, but we undo this in our
> >> distributed final state machine (dfsm) in pmxcfs to virtual synchrony
> >> again, I won't say that there' so bug, but at least this beahvior is not
> >> one.
> >>
> >> You cannot really mix pmxcfs / totem / cpg operations with a SSH connection
> >> and assume any order guarantees between them, there are none.
> >>
> >> One would need to also sent a "event" over pmxcfs to signal the target
> >> note about the continue of the file, this then _would_ be ordered correctly,
> >> else, yes, there's a bug.
> >>
> >> [0]: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.767
> >> [1]: https://pdfs.semanticscholar.org/99a2/89cf0c97804cec4bd6ad70459d21267d525a.pdf
> >>
> > 
> > ok makes sense, but
> > 
> >>
> >> You cannot really mix pmxcfs / totem / cpg operations with a SSH connection
> >> and assume any order guarantees between them, there are none.
> >>
> > 
> > is exactly what we do here
> > 
> > 1. migrate
> > 2. move file
> > 3. tell the target to resume (and take ownership)
> 
> yeah, that's why I noted it, it's the underlying issue behind this bug, IMO.
> 
> > 
> > so maybe we should have a better mechanism to trigger the resume
> > on the target side?
> > 
> 
> There are a few things one could do here, a relative simple one, leveraging
> existing functionality could be doing a cfs lock before the move on source,
> e.g., a cfs_lock_domain("migrate-$vmid"), then move, lock release on source,
> connect to target and do the resume under the same lock with a relatively big
> timeout (~ 90 seconds) - this should add the necessary synchrony here.

IMHO none of that is a good idea since we want minimal down time..

we don't want to introduce unnecessary delays into the window between
"qemu migrate is complete" (and from qemu's POV, the target VM is now
the active one) and "actually resume suspended VM on target node", which
is exactly where this sequence of "rename config" -> "tell target node
to resume VM" happens..

I think the current approach is fine - provided vm_resume actually
honors $nocheck, which should be the case (again) with this patch. an
alternative approach would be to do the rename (with all the proper cfs
locking) before starting the actual migration in phase2, but that would
mean that lots of stuff in phase3(_cleanup) would need to be done with
the config file no longer belonging to the source node, which basically
means doing more stuff with "nocheck" semantics, as well as doing a
revert of the handover in case the migration fails, further complicating
things.