[pve-devel] Volume live migration concurrency
Andrei Perapiolkin
andrei.perepiolkin at open-e.com
Wed May 28 16:49:43 CEST 2025
Hi Fabian,
Thank you for your time dedicated to this issue.
>> My current understanding is that all assets related to snapshots should
>> be removed when the volume is deactivated, is that correct?
>> Or are all volumes and snapshots expected to be present across the entire
>> cluster until they are explicitly deleted?
> I am not quite sure what you mean by "present" - do you mean "exist in an
> activated state"?
Exists in an active state - activated.
>> How should the cleanup tasks be triggered across the remaining nodes?
> it should not be needed
Consider the following scenarios for a live migration of a VM from 'node1' to
'node2':
1. An error occurs on 'node2', resulting in partial activation
2. An error occurs on 'node1', resulting in partial deactivation
3. Errors occur on both 'node1' and 'node2', resulting in dangling
artifacts remaining on both nodes
Any of these can leave behind a partial activation (some artifacts were
created) and a partial deactivation (some artifacts remain uncleared).
Now, suppose the user unlocks the VM (if it was previously locked due to
the failure) and proceeds with another migration attempt, this time to
'node3', hoping for success.
What would happen to the artifacts on 'node1' and 'node2' in such a case?
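To make this concrete, below is roughly the kind of defensive, idempotent
cleanup I currently rely on in deactivate_volume so that leftovers from a
failed attempt do not break the next one. This is only a sketch of my plugin's
side: the iSCSI/multipath commands and the naming scheme are my own
assumptions, not anything Proxmox mandates.

use strict;
use warnings;

use PVE::Tools qw(run_command);

# Hypothetical helper (not part of the Proxmox API): idempotent teardown of
# the node-local artifacts the plugin creates on activation.
sub cleanup_volume_artifacts {
    my ($class, $scfg, $volname) = @_;

    my $target = "iqn.2025-01.com.example:$volname";   # assumed target naming
    my $mpath  = "mpath-$volname";                     # assumed multipath alias

    # Flush the multipath map if it is still there; 'noerr' keeps this
    # idempotent when a previous (partial) cleanup already removed it.
    run_command(['multipath', '-f', $mpath], noerr => 1);

    # Log out of the iSCSI session if one is still present.
    run_command(['iscsiadm', '-m', 'node', '-T', $target, '--logout'],
        noerr => 1);

    return;
}

But this only covers the local node; it does not help with artifacts left
behind on a node the VM is no longer running on.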
Regarding the 'path' function:
In my case it is difficult to deterministically predict the actual path of
the device.
Determining this path essentially requires activating the volume.
This approach is questionable, as it implies calling activate_volume
without Proxmox being aware that the activation has occurred.
What would happen if a failure occurs within Proxmox before it reaches
the stage of officially activating the volume?
Additionally, I believe that providing the 'physical path' of a resource
that is not yet present (i.e. activated and usable) is a questionable
practice.
This creates a risk, as there is always a temptation to use the path
directly, under the assumption that the resource is ready.
This approach assumes that all developers are fully aware that a given
$path might merely be a placeholder, and that additional activation is
required before use.
The issue becomes even more complex in larger code bases that integrate
third-party software, such as QEMU.
I might be mistaken, but during my experiments with the 'path' function,
I encountered an error where the virtualization system failed to open a
volume that had not been fully activated.
Perhaps this has been addressed in newer versions, but previously, there
appeared to be a race condition between volume activation and QEMU
attempting to operate on the expected block device path.
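For reference, here is a minimal sketch of the purely 'logical' path() that
Fabian describes further below, assuming a plugin whose device names can be
derived from the volume name alone (the /dev/mapper naming and the
parse_volname fields used here are assumptions on my side). In my case the
device name cannot be derived like this without querying the storage, which
is exactly the problem.

# No command is executed and nothing is activated here; the device node only
# exists once activate_volume() has run.
sub path {
    my ($class, $scfg, $volname, $storeid, $snapname) = @_;

    my ($vtype, $name, $vmid) = $class->parse_volname($volname);

    my $devname = defined($snapname) ? "$name-snap-$snapname" : $name;
    my $path = "/dev/mapper/$devname";

    return wantarray ? ($path, $vmid, $vtype) : $path;
}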
Andrei
On 5/28/25 03:06, Fabian Grünbichler wrote:
>> Andrei Perapiolkin <andrei.perepiolkin at open-e.com> wrote on 27.05.2025 18:08 CEST:
>>
>>
>>> 3. In the context of live migration: Will Proxmox skip calling
>>> /deactivate_volume/ for snapshots that have already been activated?
>>> Should the storage plugin explicitly deactivate all snapshots of a
>>> volume during migration?
>>> a live migration is not concerned with snapshots of shared volumes, and local
>>> volumes are removed on the source node after the migration has finished..
>>>
>>> but maybe you could expand this part?
>> My original idea was that since both 'activate_volume' and
>> 'deactivate_volume' methods have a 'snapname' argument they would both
>> be used to activate and deactivate snapshots respectively.
>> And for each snapshot activation, there would be a corresponding
>> deactivation.
> deactivating volumes (and snapshots) is a lot trickier than activating
> them, because you might have multiple readers in parallel that we don't
> know about.
>
> so if you have the following pattern
>
> activate
> do something
> deactivate
>
> and two instances of that are interleaved:
>
> A: activate
> B: activate
> A: do something
> A: deactivate
> B: do something -> FAILURE, volume not active
>
> you have a problem.
>
> that's why we deactivate in special circumstances:
> - as part of error handling for freshly activated volumes
> - as part of migration when finally stopping the source VM or before
> freeing local source volumes
> - ..
>
> where we can be reasonably sure that no other user exists, or it is
> required for safety purposes.
>
> otherwise, we'd need to do refcounting on volume activations and have
> some way to hook that for external users, to avoid premature deactivation.
>
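Just to illustrate the refcounting idea: something like the node-local sketch
below is what I would picture. The lock helper usage and the state directory
are my assumptions; as far as I know nothing like this exists in the current
plugin API, and it still would not cover readers outside of Proxmox.

use strict;
use warnings;

use PVE::Tools qw(lock_file file_read_firstline file_set_contents);

my $refdir = '/run/mystorage-refcount';   # assumed node-local state directory

# Adjust the per-volume activation refcount by $delta and return the new
# value.  Serialized via a lock file so concurrent workers on the same node
# do not race each other.
sub adjust_refcount {
    my ($volname, $delta) = @_;

    mkdir $refdir if !-d $refdir;
    (my $key = $volname) =~ s|/|_|g;   # volume names may contain '/'
    my $file = "$refdir/$key";

    my $count = lock_file("$file.lock", 10, sub {
        my $current = -e $file ? (file_read_firstline($file) // 0) : 0;
        $current += $delta;
        $current = 0 if $current < 0;
        file_set_contents($file, "$current\n");
        return $current;
    });
    die "refcount update failed: $@\n" if $@;

    return $count;
}

# deactivate_volume() would then only tear things down once the count drops
# to zero, e.g.:
#   return if adjust_refcount($volname, -1) > 0;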
>> However, from observing the behavior during migration, I found that
>> 'deactivate_volume' is not called for snapshots that were previously
>> activated with 'activate_volume'.
> were they activated for the migration? or for cloning from a snapshot?
> or ..?
>
> maybe there is a call path that should deactivate that snapshot after using
> it..
>
>> Therefore, I assumed that 'deactivate_volume' is responsible for
>> deactivating all snapshots related to the volume that was previously
>> activated.
>> The purpose of this question was to confirm this.
>>
>> From your response I conclude the following:
>> 1. Migration does not manage (i.e. activate or deactivate) volume
>> snapshots.
> that really depends. a storage migration might activate a snapshot if
> that is required for transferring the volume. this mostly applies to
> offline migration or unused volumes though, and only for some storages.
>
>> 2. All volumes are expected to be present across all nodes in the cluster
>> for the 'path' function to work.
> if at all possible, path should just do a "logical" conversion of volume ID
> to a stable/deterministic path, or the information required for Qemu to
> access the volume if no path exists. ideally, this means it works without
> activating the volume, but it might require querying the storage.
>
>> 3. For migration to work, the volume should be simultaneously present on
>> both nodes.
> for a live migration and shared storage, yes. for an offline migration with
> shared storage, the VM is never started on the target node, so no volume
> activation is required until that happens later. for local storages, volumes
> only exist on one node anyway (they are copied during the migration).
>
>> However, I couldn't find explicit instructions or guides on when and by
>> whom volume snapshot deactivation should be triggered.
> yes, this is a bit under-specified unfortunately. we are currently working
> on improving the documentation (and the storage plugin API).
>
>> Is it possible for a volume snapshot to remain active after the
>> volume itself was deactivated?
> I'd have to check all the code paths to give an answer to that.
> snapshots are rarely activated in general - IIRC mostly for
> - cloning from a snapshot
> - replication (limited to ZFS at the moment)
> - storage migration
>
> so I just did that:
> - cloning from a snapshot only deactivates if the clone is to a different
> node, for both VM and CT -> see below
> - CT backup in snapshot mode deletes the snapshot which implies deactivation
> - storage_migrate (move_disk or offline migration) if a snapshot is passed,
> IIRC this only affects ZFS, which doesn't do activation anyway
>
>> During testing with Proxmox 8.2 I've encountered situations where cloning a
>> volume from a snapshot did not result in snapshot deactivation.
>> This leads to the creation of 'dangling' snapshots if the volume is
>> later migrated.
> ah, that probably answers my question above.
>
> I think this might be one of those cases where deactivation is hard - you
> can have multiple clones from the same source VM running in parallel, and
> only the last one would be allowed to deactivate the snapshot/volume..
>
>> My current understanding is that all assets related to snapshots should
>> be removed when the volume is deactivated, is that correct?
>> Or are all volumes and snapshots expected to be present across the entire
>> cluster until they are explicitly deleted?
> I am not quite sure what you mean by "present" - do you mean "exist in an
> activated state"?
>
>> The second option requires additional recommendations on artifact management.
>> Maybe it should be sent as a separate email, but I will draft it here.
>>
>> If all volumes and snapshots are consistently present across the entire
>> cluster, and their creation/operation results in the creation of additional
>> artifacts (such as iSCSI targets, multipath sessions, etc.), then these
>> artifacts should be removed on deletion of the associated volume or snapshot.
>> Currently, it is unclear how all nodes in the cluster are notified of
>> such a deletion, as only one node in the cluster receives the 'free_image' or
>> 'volume_snapshot_delete' request.
>> What is the proper way to instruct the plugin on the other nodes in the cluster
>> that a given volume/snapshot has been requested for deletion and that all
>> artifacts related to it have to be removed?
> I now get where you are coming from I think! a volume should only be active
> on a single node, except during a live migration, where the source node
> will always get a deactivation call at the end.
>
> deactivating a volume should also tear down related, volume-specific
> resources, if applicable.
>
>> How should the cleanup tasks be triggered across the remaining nodes?
> it should not be needed, but I think you've found an edge case where we
> need to improve.
>
> I think our RBD plugin is also affected by this, all the other plugins
> either:
> - don't support snapshots (or cloning from them)
> - are local only
> - don't need any special activation/deactivation
>
> I think the safe approach is likely to deactivate all snapshots when
> deactivating the volume itself, for now.
>
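For what it is worth, on the plugin side I read "deactivate all snapshots when
deactivating the volume itself" roughly as the sketch below. list_snapshots()
is a hypothetical helper that asks the backend which snapshots exist for the
volume; I am not aware of such a call in the current API, so each plugin would
have to bring its own.

sub deactivate_volume {
    my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;

    if (!defined($snapname)) {
        # Deactivating the volume itself: also tear down any snapshot that
        # might still be active on this node.
        for my $snap ($class->list_snapshots($scfg, $storeid, $volname)) {
            eval { $class->deactivate_volume($storeid, $scfg, $volname, $snap, $cache) };
            warn "failed to deactivate snapshot '$snap': $@" if $@;
        }
    }

    # ... actual teardown of the volume/snapshot device would follow here ...
    return;
}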