[pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node

Tue Apr 1 15:20:22 CEST 2025

On 4/1/25 14:54, Thomas Lamprecht wrote:
> On 01.04.25 at 13:37, Dominik Csapak wrote:
>> Mhmm, what I meant here is that instructing the user to manually
>> do 'mv some-path some-other-path' has more error potential (e.g.
>> typos, misremembering nodenames/vmids/etc.) than e.g. clicking
>> the vm on the offline node and pressing a button (or
>> following a CLI tool output/options)
> 
> Which all have their error potential too, especially with hostnames
> being free-form and not exclusive.
> 
>> I mentioned it because Fabian wrote we could maybe solve it with a
>> cluster-wide VM lock. I think restricting the move to such a lock
>> could work, under the assumption that the admin makes sure the offline
>> node is and stays offline (which they have to do anyway).
> 
> Still not sure what this would provide, pmxcfs guarantees that the VMID
> config can exist only once already anyway, so only one node can do a
> move and such moves can only happen if they would be equal to a file
> rename as any resource must be shared already to make this work.
> Well, replication could be fixed up I guess, but that can be handled on
> VM start too. Cannot think of anything else (without an in-depth
> evaluation though) that an API can/should do differently for the actual
> move itself. Doing some up-front checks is a different story, but that
> could also result in a false sense of safety.
> 
>> It still improves the UX for that situation since it's then a
>> provided/guided way vs. mv'ing files on the filesystem.
> 
> I'd not touch the move part though, at least for starters, just like the
> upgrade checker scripts it should only assist.
> 
>> Just to clarify, I'm not for blindly implementing such an API call/CLI tool/etc.
>> but wanted to argue that we probably want to improve the UX of that situation
>> as well as we can and offered my thoughts on how we could do it.
>   
> That's certainly fine; having it improved would be good, but I'm very wary
> of hot takes and hand waving (not meaning you here, just in general). This
> isn't a purge/remove/wipe of some resource on a working system, like wiping
> disks or removing guests, as that can present the information to the admin
> from a known good node that manages its state itself.
> An unknown/dead node is literally breaking a core clustering assumption that
> we build upon in a lot of places, IMO a very different thing. Mentioning this
> as it might be easy to question why other destructive actions are exposed in
> the UI.
> 
> And FWIW, if I should reconsider this it would be much easier to argue for
> further integration if the basic assistant/checker guide/tool already
> existed for some time and was somewhat battle tested, as that would allow a
> much more confident evaluation of options, whatever those then look like;
> some "scary" hint in the UI with lots of exclamation marks does not cut it
> for me though, no offense to anybody.

I agree with all of your points, so I think the best and easiest way to
improve the current situation would be to:

* Improve the docs to emphasize more that this situation should be an exception
   and that working around cluster assumptions can have severe consequences.
   (Maybe nudge users towards HA if this is a common situation for them.)
   It would also be good for the docs to be in a check-list style (as you
   suggested), so that admins have a guided way to check things like
   storage, running nodes, etc.

* Change the migration UI to show a warning that the node is offline
   and provide a direct link to the above-mentioned improved docs.
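
To make the check-list idea a bit more concrete, here is a rough sketch of
what an assist-only pre-check could look at. Everything in it is hypothetical
(the function name, the input structure, and the field names are made up for
illustration and are not existing qemu-server or pve-cluster API). It only
reports findings and never moves anything itself:

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    ok: bool
    hint: str


def precheck_dead_node_move(source_node, online_nodes, vm_disks, shared_storages):
    """Hypothetical advisory checks; this never performs the move itself.

    source_node:     name of the node assumed to be dead
    online_nodes:    set of node names the cluster currently sees as online
    vm_disks:        mapping of disk key -> storage ID (made-up structure)
    shared_storages: set of storage IDs that are marked as shared
    """
    results = []

    # The source node must not be visible as online. This cannot prove the
    # node is really powered off; the admin still has to ensure that.
    offline = source_node not in online_nodes
    results.append(CheckResult(
        "source node offline", offline,
        "make sure the node is powered off and stays off",
    ))

    # Every disk has to be on shared storage, otherwise moving the config
    # would not be equivalent to a plain file rename.
    local = sorted(d for d, s in vm_disks.items() if s not in shared_storages)
    results.append(CheckResult(
        "all disks on shared storage", not local,
        "non-shared disks: " + ", ".join(local) if local else "none",
    ))
    return results
```

A passing result would still only be a hint for the admin, not a guarantee,
which is why I would keep it separate from the actual move.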

What do you think?