[pve-devel] [RFC qemu-server] vm_resume: correctly honor $nocheck

Thomas Lamprecht t.lamprecht at proxmox.com
Fri May 24 21:02:45 CEST 2019


On 5/24/19 8:37 PM, Fabian Grünbichler wrote:
> On Fri, May 24, 2019 at 11:05:41AM +0200, Thomas Lamprecht wrote:
>> On 5/24/19 10:52 AM, Dominik Csapak wrote:
>>> On 5/24/19 9:52 AM, Thomas Lamprecht wrote:
>>>> On 5/24/19 9:36 AM, Fabian Grünbichler wrote:
>>>>> On Fri, May 24, 2019 at 08:24:17AM +0200, Dominik Csapak wrote:
>>>>>> LGTM, I introduced this seemingly sometime last year... (oops)
>>>>>>
>>>>>> the question remains why it sometimes takes so long for a rename to
>>>>>> be propagated
>>>>>
>>>>> the race window is actually very small - I guess it gets a bit bigger
>>>>> and thus triggers more easily with the additional load.
>>>>>
>>>>> the delay that the rename takes to return can go up into the seconds
>>>>> range (which is okay - if the pmxcfs is very very busy, write operations
>>>>> can block a bit, I tested with two nodes writing non-stop ;)).
>>>>>
>>>>> the delay between visibility on source and target is so small that it is
>>>>> within the margin of error (we are talking about measuring timestamps
>>>>> across node boundaries, after all).
>>>>>
>>>>>> in my opinion this violates the assumptions we make regarding ownership
>>>>>> of files/VMs, since it seems that nobody owns the VM when this happens
>>>>>> (the source node believes the target is the owner and vice versa)
>>>>>
>>>>> maybe we can take a closer look next week at pmxcfs debug output.. we
>>>>> don't have many instances of moving ownership from one node to the other,
>>>>> though, and migration happens under a config lock anyway, so modulo this
>>>>> missing nocheck I don't see how this could be problematic..
>>>>>
>>>>> it's probably an issue of node T having received and acked the change,
>>>>> but not yet fully processed it. if you ack the change after making it
>>>>> visible, you have the reverse problem (T getting updated before S).
>>>>
>>>> Virtual Synchrony / TOTEM [0] just says that if one node sees events happen
>>>> in the order A -> B -> C, then all nodes will see them in that order.
>>>>
>>>> But it's _not_ guaranteed that they see them at the same instant; that's not
>>>> really possible.
>>>>
>>>> Corosync uses Extended Virtual Synchrony [1], which in addition to the above
>>>> also ensures that group changes are ordered, but we undo this in our
>>>> distributed finite state machine (dfsm) in pmxcfs, back to plain virtual
>>>> synchrony. I won't say that there's no bug, but at least this behavior is
>>>> not one.
>>>>
>>>> You cannot really mix pmxcfs / totem / cpg operations with an SSH connection
>>>> and assume any ordering guarantees between them - there are none.
>>>>
>>>> One would also need to send an "event" over pmxcfs to signal the target
>>>> node to continue with the (moved) config file; this then _would_ be ordered
>>>> correctly - else, yes, there's a bug.
>>>>
>>>> [0]: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.767
>>>> [1]: https://pdfs.semanticscholar.org/99a2/89cf0c97804cec4bd6ad70459d21267d525a.pdf
>>>>
>>>
>>> ok makes sense, but
>>>
>>>>
>>>> You cannot really mix pmxcfs / totem / cpg operations with an SSH connection
>>>> and assume any ordering guarantees between them - there are none.
>>>>
>>>
>>> is exactly what we do here
>>>
>>> 1. migrate
>>> 2. move file
>>> 3. tell the target to resume (and take ownership)
>>
>> yeah, that's why I noted it, it's the underlying issue behind this bug, IMO.
>>
>>>
>>> so maybe we should have a better mechanism to trigger the resume
>>> on the target side?
>>>
>>
>> There are a few things one could do here. A relatively simple one, leveraging
>> existing functionality, would be to take a cfs lock on the source before the
>> move, e.g., cfs_lock_domain("migrate-$vmid"), then move the config, release
>> the lock on the source, connect to the target and do the resume under the same
>> lock with a relatively big timeout (~90 seconds) - this should add the
>> necessary synchrony here.
> 
> IMHO none of that is a good idea since we want minimal down time..

normally this is not expensive, and well, if you overload the network,
then yes, it may take a bit for this to happen, but that's not our issue -
if one does not want that, then one should not overload things in the
first place..

On not overloaded clusters/network the sequence:

S 1. migrate done
S 2. lock (uncontended, so "immediate")
S 3. rename to target
S 4. release lock
S 5. ssh to target for resume
T 6. lock (normally the release from S has already propagated, so this is a
     noop, else milliseconds to a few seconds of wait)
T 7. running again.

should not add any relevant overhead, I'd guess; would need to try it out though.
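
To make that a bit more concrete, a minimal sketch of how this could look
(assuming the usual PVE::Cluster::cfs_lock_domain($name, $timeout, $code)
interface; the config paths and the resume call are simplified placeholders,
not the real migration code):

    use strict;
    use warnings;

    use PVE::Cluster;

    my $timeout = 90; # relatively big timeout, as mentioned above

    # source side, after the qemu migration finished (steps S 2.-4.)
    sub handover_config {
        my ($vmid, $source, $target) = @_;

        PVE::Cluster::cfs_lock_domain("migrate-$vmid", $timeout, sub {
            my $old = "/etc/pve/nodes/$source/qemu-server/$vmid.conf";
            my $new = "/etc/pve/nodes/$target/qemu-server/$vmid.conf";
            rename($old, $new) or die "config rename failed - $!\n";
        });
        die $@ if $@;
    }

    # target side, called via SSH (steps T 6.-7.)
    sub resume_under_lock {
        my ($vmid) = @_;

        # taking the same lock serializes us behind the source's rename and
        # lock release, so once we get it the moved config should already be
        # visible here as well
        PVE::Cluster::cfs_lock_domain("migrate-$vmid", $timeout, sub {
            print "resume VM $vmid here (i.e., vm_resume with nocheck)\n";
        });
        die $@ if $@;
    }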

As it mostly works now, even with nocheck not set, one can derive that most
of the time the locking would not add any delay - else many more migrations
would fail.. So IMO really not such a bad idea..

And the thing is, we have really, really fundamental assumptions that a
VM belongs to the host its config is on, and if you start doing things
with a VM which, from your (node's) POV, does not belong to you, it opens
a can of worms..
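
That assumption boils down to something as simple as this everywhere - a
node treats a VMID as its own exactly if the config sits below its own node
directory (illustrative only, not a specific existing helper):

    use strict;
    use warnings;

    use PVE::INotify;

    # a node "owns" a VM iff the config lives below its own node directory
    sub vm_is_local {
        my ($vmid) = @_;
        my $nodename = PVE::INotify::nodename();
        return -f "/etc/pve/nodes/$nodename/qemu-server/$vmid.conf";
    }

During the handover window neither side may pass such a check anymore, which
is exactly why the target's resume needs $nocheck in the first place.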

> 
> we don't want to introduce unnecessary delays into the window between
> "qemu migrate is complete" (and from qemu's POV, the target VM is now
> the active one) and "actually resume suspended VM on target node", which
> is exactly where this sequence of "rename config" -> "tell target node
> to resume VM" happens..
> 
> I think the current approach is fine - provided vm_resume actually
> honors $nocheck, which should be the case (again) with this patch. an
> alternative approach would be to do the rename (with all the proper cfs
> locking) before starting the actual migration in phase2, but that would
> mean that lots of stuff in phase3(_cleanup) would need to be done with
> the config file no longer belonging to the source node, which basically
> means doing more stuff with "nocheck" semantics, as well as doing a
> revert of the handover in case the migration fails, further complicating
> things.
> 

As long as the VM cannot run as-is on the target (i.e., all VM state
transferred) I would not see it as ready to belong to the target, so I
would not rename the config earlier - that just brings headaches and
violates our assumptions.. The question is how one would handle this with
an eventually coming post-copy migration.
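
For completeness, what "honoring $nocheck" boils down to for vm_resume - a
simplified sketch from memory, not the literal patch (helper names as used
in qemu-server around this time):

    sub vm_resume {
        my ($vmid, $skiplock, $nocheck) = @_;

        PVE::QemuConfig->lock_config($vmid, sub {
            if ($nocheck) {
                # incoming migration: the config may not have arrived on
                # this node yet, so skip everything that requires local
                # ownership and only talk to the running QEMU instance
                vm_mon_cmd_nocheck($vmid, "cont");
            } else {
                my $conf = PVE::QemuConfig->load_config($vmid);
                PVE::QemuConfig->check_lock($conf) if !$skiplock;
                vm_mon_cmd($vmid, "cont");
            }
        });
    }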




