[pdm-devel] [RFC PATCH datacenter-manager 0/3] implement bulk start

Thu Jan 30 09:14:17 CET 2025

On 1/29/25 19:48, Thomas Lamprecht wrote:
> Am 29.01.25 um 11:51 schrieb Dominik Csapak:
>> Sending as RFC, because it's still very rough and I want to get some
>> early feedback.
>>
>> This series implements an API call 'bulk-start' that runs on the PDM
>> itself and mimics the bulk start from PVE, but without the node
>> limitation of PVE.
>>
>> Does that make sense? Or would it be better to try to implement that
>> on pve side? The advantage we have here is that we have an
>> external view of the cluster, which means that things like node
>> failures, synchronisation, etc. are much easier to handle.
> 
> I think we talked offlist about this a while ago, albeit rather casually,
> and yes IMO exposing this on the PVE side would be better – it can be done
> more efficiently there, better control for overall active job count and
> avoids some oddities. TBH I'd be surprised if it's easier to do from
> external with the same feature set.
> 
> Having an external service handle this over a potentially flaky connection
> seems much more error-prone to me compared to going over the LAN that
> clusters require.
> 
> IMO we should actually avoid having much of this stuff, or dedicated state
> (that affects the remotes or their resources), in the PDM directly. The
> more things are handled by the end products:
> 1) the simpler PDM stays (PVE needs some complexity anyway, and coupling
>    two complex projects will IMO amplify the maintenance cost further);
> 2) the more PVE provides a powerful feature set on its own, i.e., PVE
>    already has a good architecture and is not as limited as VMware ESXi,
>    which requires vSphere for relatively simple (from the user's POV, not
>    implementation-wise) things even if they only affect nodes in the same
>    LAN, so we should continue to mainly "empower" PVE and plug that into
>    PDM;
> 3) PDM will become relatively complex anyway, even when trying to avoid
>    state and features implemented only there: all the metrics, tasks,
>    health and SDN tracking are already quite a bit to handle if done
>    well, flexibly and powerfully.
> 
>> If we'd implement something like this on PVE, there has to be a node
>> that has control of the API calls to make (or to schedule something via
>> pmxcfs), and that is probably much harder to do there (pmxcfs sync queue)
>> or brings some problems with it (node dies in the middle of an API call).
> 
> In the simplest architecture it could be like the SDN reload is
> implemented; I'm quite sure that I mentioned that, but would not bet that
> much on my (or most) brain(s) that is.
> 
> I.e. a single task on one node that connects to all involved cluster nodes
> through the API and creates the respective bulk-tasks for the guests residing
> on each node and then polls these. Some generic infrastructure for doing such
> things might be nice and would have some reuse between different bulk tasks
> and SDN, potentially others in the future.
> Switching to an even more efficient channel or method could be done
> transparently (from POV of the external user/program of the cluster-wide
> bulk-action API), so I'd not worry too much about that now.
> 
> Besides that, there are (most of the time) fewer points of failure between
> nodes than between PDM and nodes, network-wise. If node(s) indeed die in the
> middle of an API call, the PDM naturally cannot magically fix that either,
> and as node failure is not expected behavior but rather an extraordinary
> event, an interrupted bulk action is not really a big problem there.
> 
> In short: let's do this in PVE directly.

Sounds good to me, with one caveat: when implementing this, I would go for
a new API call on the PVE side to keep it properly separated. I think the
existing call would otherwise get much more complex (but I have to try it
first). On the PDM side, I'd implement it with a fallback that calls the
"old" bulkstart API call on each node in case the new API call does not
exist.

That way older nodes/clusters can still profit from the functionality without
much state/logic handling on the pdm side.
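
To make the fallback idea a bit more concrete, here is a rough sketch of how
it could look on the PDM side. This is only an illustration under
assumptions: the `PveRemote` trait, the `ApiError` type, and the function
names are all hypothetical and not the actual pdm-client or PVE API; the
sketch also assumes that an older remote answers the new call with some
"not implemented" style error that the client can detect.

```rust
use std::collections::BTreeMap;

/// Hypothetical error type; a real client would carry the HTTP status etc.
#[derive(Debug, PartialEq)]
enum ApiError {
    /// Assumption: older remotes reject the unknown endpoint in a detectable way.
    NotImplemented,
    /// Any other failure (network, permissions, ...).
    Other(String),
}

/// Hypothetical abstraction over one PVE remote (NOT the real pdm-client API).
trait PveRemote {
    /// Assumed new cluster-wide bulk start; returns a task ID on success.
    fn cluster_bulk_start(&self, vmids: &[u32]) -> Result<String, ApiError>;
    /// The existing per-node bulk start; returns a task ID on success.
    fn node_bulk_start(&self, node: &str, vmids: &[u32]) -> Result<String, ApiError>;
}

/// Group `(vmid, node)` pairs by node, so one call per node is enough.
fn group_by_node(guests: &[(u32, String)]) -> BTreeMap<String, Vec<u32>> {
    let mut map: BTreeMap<String, Vec<u32>> = BTreeMap::new();
    for (vmid, node) in guests {
        map.entry(node.clone()).or_default().push(*vmid);
    }
    map
}

/// Try the new call first; only if the remote does not know it, fall back to
/// the old per-node call. Returns the IDs of all tasks that were started.
fn bulk_start<R: PveRemote>(
    remote: &R,
    guests: &[(u32, String)],
) -> Result<Vec<String>, ApiError> {
    let vmids: Vec<u32> = guests.iter().map(|(vmid, _)| *vmid).collect();
    match remote.cluster_bulk_start(&vmids) {
        Ok(task) => Ok(vec![task]),
        Err(ApiError::NotImplemented) => group_by_node(guests)
            .iter()
            .map(|(node, ids)| remote.node_bulk_start(node, ids))
            .collect(), // stops at the first per-node error
        Err(err) => Err(err),
    }
}
```

The PDM would keep no state of its own here: it either gets one task back
from the cluster, or one task per node from the fallback, and all
scheduling stays on the PVE side. In this sketch the fallback stops at the
first failing node; whether to continue with the remaining nodes instead
is a design choice left open.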

We can still remove that fallback from PDM again once it is sufficiently old,
especially since PDM has not had a first release yet.

How does that sound?

> 
>> It's very early, so please don't judge the actual API call code just
>> now; I'd extend it with failure resolution, polling the task, etc.
>>
>> OTOH there is the question if the UI makes sense this way, or if we want
>> to combine the 'select to view details' and 'select to do a bulk action'
>> into one. Or if we want to do the bulk actions more like in pve with
>> a popup that shows the vm list again.
>>
>> Dominik Csapak (3):
>>    server: pve api: add new bulkstart api call
>>    pdm-client: add bulk_start method
>>    ui: pve tree: add bulk start action
>>
>>   lib/pdm-client/src/lib.rs |   9 ++-
>>   server/src/api/pve/mod.rs |  98 +++++++++++++++++++++++++++-
>>   ui/src/pve/tree.rs        | 133 ++++++++++++++++++++++++++++++++++++--
>>   3 files changed, 234 insertions(+), 6 deletions(-)
>>
>