[pve-devel] [RFC qemu-server] vm_resume: correctly honor $nocheck

Dominik Csapak d.csapak at proxmox.com
Fri May 24 10:52:24 CEST 2019


On 5/24/19 9:52 AM, Thomas Lamprecht wrote:
> On 5/24/19 9:36 AM, Fabian Grünbichler wrote:
>> On Fri, May 24, 2019 at 08:24:17AM +0200, Dominik Csapak wrote:
>>> LGTM, I introduced this, seemingly sometime last year... (oops)
>>>
>>> the question remains why it sometimes takes so long for a rename to
>>> be propagated
>>
>> the race window is actually very small - I guess it gets a bit bigger
>> and thus triggers more easily with the additional load.
>>
>> the time the rename takes to return can go up into the seconds
>> range (which is okay - if pmxcfs is very, very busy, write operations
>> can block a bit; I tested with two nodes writing non-stop ;)).
>>
>> the delay between visibility on source and target is so small that it is
>> within the margin of error (we are talking about measuring timestamps
>> across node boundaries, after all).
>>
>>> in my opinion this violates the assumptions we make regarding ownership
>>> of files/VMs, since it seems that nobody owns the VM when this happens
>>> (the source node believes the target is the owner and vice versa)
>>
>> maybe we can take a closer look at pmxcfs debug output next week.. we
>> don't have many instances of moving ownership from one node to another
>> though, and migration happens under a config lock anyway, so modulo this
>> missing nocheck I don't see a way that this is problematic..
>>
>> it's probably an issue of node T having received and acked the change,
>> but not yet fully processed it. if you ack the change after making it
>> visible, you have the reverse problem (T getting updated before S).
> 
> Virtual Synchrony / TOTEM [0] just says that if one node sees events happen
> in the order A -> B -> C, then all nodes will see them in that order.
> 
> But it's _not_ guaranteed that they see it at the same instant, that's not
> really possible.
> 
> Corosync uses Extended Virtual Synchrony [1], which in addition to the
> above also ensures that group membership changes are ordered, but we undo
> this in our distributed finite state machine (dfsm) in pmxcfs, going back
> to plain virtual synchrony. I won't say that there's no bug, but at least
> this behavior is not one.
> 
> You cannot really mix pmxcfs / totem / cpg operations with a SSH connection
> and assume any order guarantees between them, there are none.
> 
> One would also need to send an "event" over pmxcfs to signal the target
> node to continue once the file has moved; that then _would_ be ordered
> correctly. Otherwise, yes, there's a bug.
> 
> [0]: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.767
> [1]: https://pdfs.semanticscholar.org/99a2/89cf0c97804cec4bd6ad70459d21267d525a.pdf
> 

ok, makes sense, but

>
> You cannot really mix pmxcfs / totem / cpg operations with a SSH connection
> and assume any order guarantees between them, there are none.
>

is exactly what we do here:

1. migrate
2. move the config file
3. tell the target to resume (and take ownership), as sketched below
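
for reference, a rough sketch of what honoring $nocheck in vm_resume
could look like - this just reuses the existing vm_mon_cmd /
vm_mon_cmd_nocheck helpers, the actual patch may differ in detail:

sub vm_resume {
    my ($vmid, $skiplock, $nocheck) = @_;

    PVE::QemuConfig->lock_config($vmid, sub {
        if ($nocheck) {
            # the renamed config may not be visible on this node yet,
            # so skip loading/checking it and talk to the VM directly
            vm_mon_cmd_nocheck($vmid, 'cont');
        } else {
            my $conf = PVE::QemuConfig->load_config($vmid);
            PVE::QemuConfig->check_lock($conf) if !$skiplock;
            vm_mon_cmd($vmid, 'cont');
        }
    });
}

with $nocheck set we skip the load_config/check_lock pair entirely,
since during migration the config may not have arrived here yet.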

so maybe we should have a better mechanism to trigger the resume
on the target side?
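
one (purely hypothetical) way to do that: route the trigger through
pmxcfs itself, e.g. by writing a marker file after the rename, so both
operations travel over the same totally ordered cpg channel and the
target can never see the trigger before the moved config. neither the
".resume" path nor this flow exist today, it's only meant to illustrate:

# source side, right after the config rename:
PVE::Tools::file_set_contents(
    "/etc/pve/nodes/$target/qemu-server/$vmid.resume", '');

# target side: resume once the trigger shows up - by then the moved
# config is guaranteed to be visible as well
use Time::HiRes qw(usleep);
my $trigger = "/etc/pve/nodes/$nodename/qemu-server/$vmid.resume";
for (1 .. 100) { # ~10s timeout
    if (-e $trigger) {
        unlink $trigger;
        PVE::QemuServer::vm_resume($vmid, 0, 0); # ownership check is safe now
        last;
    }
    usleep(100_000);
}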



