[pve-devel] [RFC qemu-server] vm_resume: correctly honor $nocheck

Thomas Lamprecht t.lamprecht at proxmox.com
Fri May 24 11:05:41 CEST 2019


On 5/24/19 10:52 AM, Dominik Csapak wrote:
> On 5/24/19 9:52 AM, Thomas Lamprecht wrote:
>> On 5/24/19 9:36 AM, Fabian Grünbichler wrote:
>>> On Fri, May 24, 2019 at 08:24:17AM +0200, Dominik Csapak wrote:
>>>> LGTM, I introduced this sometime last year, it seems... (oops)
>>>>
>>>> the question remains why it sometimes takes so long for a rename to
>>>> be propagated
>>>
>>> the race window is actually very small - I guess it gets a bit bigger
>>> and thus triggers more easily with the additional load.
>>>
>>> the delay that the rename takes to return can go up into the seconds
>>> range (which is okay - if the pmxcfs is very very busy, write operations
>>> can block a bit, I tested with two nodes writing non-stop ;)).
>>>
>>> the delay between visibility on source and target is so small that it is
>>> within the margin of error (we are talking about measuring timestamps
>>> across node boundaries, after all).
>>>
>>>> in my opinion this violates the assumptions we make regarding ownership
>>>> of files/VMs, since it seems that nobody owns the VM when this happens
>>>> (the source node believes the target is the owner and vice versa)
>>>
>>> maybe we can take a closer look at pmxcfs debug output next week.. we
>>> don't have many instances of moving ownership from one node to the other,
>>> though, and migration happens under a config lock anyway, so modulo this
>>> missing nocheck I don't see how this could be problematic..
>>>
>>> it's probably an issue of node T having received and acked the change,
>>> but not yet fully processed it. If you ack the change only after making it
>>> visible, you have the reverse problem (T getting updated before S).
>>
>> Virtual Synchrony / TOTEM [0] just says that if one node sees events happen
>> in the order A -> B -> C, then all nodes will see them in that order.
>>
>> But it's _not_ guaranteed that they see them at the same instant; that's not
>> really possible.
>>
>> Corosync uses Extended Virtual Synchrony [1], which in addition to the above
>> also ensures that membership changes are ordered, but we reduce this back to
>> plain virtual synchrony in our distributed finite state machine (dfsm) in
>> pmxcfs. I won't say that there's no bug, but at least this behavior is not
>> one.
>>
>> You cannot really mix pmxcfs / totem / cpg operations with an SSH connection
>> and assume any ordering guarantees between them; there are none.
>>
>> One would also need to send an "event" over pmxcfs to signal the target
>> node that it can continue with the (now moved) config file; that then
>> _would_ be ordered correctly, else, yes, there's a bug.
>>
>> [0]: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.767
>> [1]: https://pdfs.semanticscholar.org/99a2/89cf0c97804cec4bd6ad70459d21267d525a.pdf
>>
> 
> OK, makes sense, but
> 
>>
>> You cannot really mix pmxcfs / totem / cpg operations with an SSH connection
>> and assume any ordering guarantees between them; there are none.
>>
> 
> is exactly what we do here
> 
> 1. migrate
> 2. move file
> 3. tell the target to resume (and take ownership)

Yeah, that's why I noted it; it's the underlying issue behind this bug, IMO.
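
To make the ordering problem concrete, here is a minimal sketch of the racy
sequence - simplified Perl, the helper name and the exact SSH invocation are
placeholders, not the actual QemuMigrate.pm code:

    use strict;
    use warnings;
    use PVE::Tools;

    # hypothetical helper, run on the source node at the end of the migration
    sub finish_migration {
        my ($vmid, $source, $target) = @_;

        # 2. move the config to the target node via a rename on /etc/pve;
        #    pmxcfs on the target may not have processed this yet when we return
        my $old = "/etc/pve/nodes/$source/qemu-server/$vmid.conf";
        my $new = "/etc/pve/nodes/$target/qemu-server/$vmid.conf";
        rename($old, $new) or die "config move failed: $!\n";

        # 3. tell the target to resume over SSH - nothing orders this SSH
        #    command relative to the rename above from the target's point of
        #    view, so its ownership check can still see the old state unless
        #    $nocheck is honored all the way down in vm_resume
        PVE::Tools::run_command(
            ['ssh', "root\@$target", 'qm', 'resume', $vmid, '--skiplock', '--nocheck']);
    }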

> 
> so maybe we should have a better mechanism to trigger the resume
> on the target side?
> 

There are a few things one could do here. A relatively simple one, leveraging
existing functionality, would be to take a cfs lock on the source before the
move, e.g., cfs_lock_domain("migrate-$vmid"), then move the config, release the
lock on the source, connect to the target and do the resume under the same
lock, with a relatively big timeout (~ 90 seconds) - this should add the
necessary synchrony here.
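
A rough sketch of that idea - assuming PVE::Cluster::cfs_lock_domain with the
usual ($domain, $timeout, $code) signature; the surrounding sub names are
placeholders, and the two parts would of course run on the source and target
node respectively:

    use strict;
    use warnings;
    use PVE::Cluster;
    use PVE::QemuServer;

    my $timeout = 90; # relatively big - the rename itself may block when pmxcfs is busy

    # source node: move the config while holding the cluster-wide migration lock
    sub move_config_locked {
        my ($vmid, $source, $target) = @_;
        PVE::Cluster::cfs_lock_domain("migrate-$vmid", $timeout, sub {
            my $old = "/etc/pve/nodes/$source/qemu-server/$vmid.conf";
            my $new = "/etc/pve/nodes/$target/qemu-server/$vmid.conf";
            rename($old, $new) or die "config move failed: $!\n";
        });
        die $@ if $@;
    }

    # target node (still triggered over SSH as before): resume under the same
    # lock, so the resume can only start once the source has released it, i.e.
    # after the move - and the local pmxcfs has processed that lock state
    sub resume_locked {
        my ($vmid, $skiplock, $nocheck) = @_;
        PVE::Cluster::cfs_lock_domain("migrate-$vmid", $timeout, sub {
            PVE::QemuServer::vm_resume($vmid, $skiplock, $nocheck);
        });
        die $@ if $@;
    }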




