[pdm-devel] RFC: Synchronizing configuration changes across remotes

Thu Jan 30 16:48:10 CET 2025

I'm currently working on the SDN integration and for that I need a way
to deploy SDN configuration changes to multiple remotes
simultaneously.

In general I will need to do the following:

* Create / Update / Delete some parts of the SDN configuration of
multiple remotes, preferably synchronized across the remotes.
* Apply the new SDN configuration (possibly opt-in) for all/some nodes
in multiple remotes

Before starting such an operation, we should make sure that there are
no pending changes in the SDN configuration, so users do not
accidentally apply unrelated changes via PDM. For the same reason, we
also need to prevent any concurrent SDN configuration changes while
the operation is in progress.

The question is: Do we also want to be able to prevent concurrent
changes across multiple remotes, or are we fine with only preventing
concurrent changes on a remote level? With network configuration
affecting more than one remote, I think it would be better to
synchronize changes across remotes since oftentimes applying the
configuration to only one remote doesn't really make sense and the
failure to apply configuration could affect the other remote.

The two options I see, depending on the answer to that question:
* introducing some form of lock that prevents any changes to the SDN
configuration from other sources
* do something based on the current digest functionality

The general process for making changes to the SDN configuration would
look as follows with the lock-based approach:
* check for pending changes, and if there are none: lock the SDN
configuration (atomically in one API call)
* make the changes to the SDN configuration
* apply the SDN configuration changes
* release the lock
* In case of errors, we can roll back the configuration changes and
then release all locks.
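
The steps above could be sketched roughly as follows. This is only a
minimal in-memory model in Python, not actual PVE or PDM code:
FakeRemote, lock_if_clean, stage, apply, restore, unlock and deploy are
all hypothetical names, and a real implementation would go through the
API of each remote instead.

```python
class LockError(Exception):
    """Raised when the SDN configuration cannot be locked or is not held."""


class FakeRemote:
    """Hypothetical in-memory stand-in for the SDN config of one remote."""

    def __init__(self, name, pending=None, fail_on_apply=False):
        self.name = name
        self.running = {}                   # applied configuration
        self.pending = dict(pending or {})  # staged, not yet applied
        self.lock_holder = None
        self.fail_on_apply = fail_on_apply  # simulate a failing apply

    def _require_lock(self, holder):
        if self.lock_holder != holder:
            raise LockError(f"{self.name}: lock not held by {holder}")

    def lock_if_clean(self, holder):
        # Pending check and locking happen in one step, modelling the
        # "atomically in one API call" requirement.
        if self.pending:
            raise LockError(f"{self.name}: pending changes present")
        if self.lock_holder is not None:
            raise LockError(f"{self.name}: already locked")
        self.lock_holder = holder

    def stage(self, holder, changes):
        self._require_lock(holder)
        self.pending.update(changes)

    def apply(self, holder):
        self._require_lock(holder)
        if self.fail_on_apply:
            raise RuntimeError(f"{self.name}: apply failed")
        self.running.update(self.pending)
        self.pending.clear()

    def restore(self, holder, snapshot):
        self._require_lock(holder)
        self.running = dict(snapshot)
        self.pending.clear()

    def unlock(self, holder):
        self._require_lock(holder)
        self.lock_holder = None


def deploy(remotes, changes, holder="pdm"):
    """Lock all remotes, stage and apply changes, roll back on any error."""
    locked = []     # remotes whose lock we hold
    snapshots = {}  # running config per remote, taken right after locking
    try:
        for remote in remotes:
            remote.lock_if_clean(holder)
            snapshots[remote.name] = dict(remote.running)
            locked.append(remote)
        for remote in locked:
            remote.stage(holder, changes)
        for remote in locked:
            remote.apply(holder)
    except Exception:
        # Only touch remotes we actually locked; others stay as they were.
        for remote in locked:
            remote.restore(holder, snapshots[remote.name])
        raise
    finally:
        for remote in locked:
            remote.unlock(holder)
```

Since all locks are taken before anything is staged, a remote that
already has pending changes or is locked elsewhere aborts the whole
operation before any remote has been modified.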

I currently gravitate towards the lock-based approach due to the
following reasons:
* It enables us to synchronize changes across multiple remotes, which
a digest-based approach does not.
* It's a lot more ergonomic for developers, since you simply
acquire/release the lock. With a digest-based approach, modifications
that require multiple API calls need to acquire a new digest
every time and track it across those calls. With SDN specifically,
we would also need to provide and check the digest when applying the
configuration.
* It is just easier to prevent concurrent changes in the first place
rather than reacting to them. If they cannot occur, then rolling back
is easier and less error-prone, since the developer can assume that
nothing changed in the previously handled remotes either.
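
To illustrate the ergonomics point, here is a minimal Python sketch of
what digest tracking across multiple calls could look like. It is not
the PVE implementation: DigestConfig, DigestMismatch and
add_zone_with_vnet are hypothetical names, and the choice of SHA-256
over a JSON serialization is an assumption made for the example.

```python
import hashlib
import json


class DigestMismatch(Exception):
    """Raised when the caller's digest no longer matches the config."""


class DigestConfig:
    """Hypothetical config store guarded by a content digest."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})

    def digest(self):
        # Hash a canonical serialization; the hash function and the
        # serialization format are assumptions made for this sketch.
        payload = json.dumps(self.entries, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def update(self, key, value, expected_digest):
        # Reject the write if the config changed since the digest was read.
        if expected_digest != self.digest():
            raise DigestMismatch("config changed since digest was read")
        self.entries[key] = value
        return self.digest()  # caller must use this for the next call


def add_zone_with_vnet(config, zone, vnet):
    """A two-step change: the digest is threaded through both calls."""
    digest = config.digest()
    digest = config.update(f"zone/{zone}", "vxlan", digest)
    digest = config.update(f"vnet/{vnet}", zone, digest)
    return digest
```

Every caller that makes more than one change has to carry the digest
from call to call like this, and has to decide what to do when a
DigestMismatch shows up halfway through.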

The downsides of this approach I can see:
* It requires sweeping changes to basically the whole SDN API, and
keeping backwards compatibility is harder.
* Many API endpoints in PVE already provide the digest functionality,
so it would be a lot easier to retrofit the digest-based approach for
usage with PDM, possibly without requiring any changes at all.
* In case of failures on the PDM side it is harder to recover, since
it requires manual intervention (removing the lock manually).

For single configuration files the digest-based approach could work
quite well in cases where we don't need to synchronize changes across
multiple remotes. But for SDN the digest-based approach is a bit more
complicated. First, we currently generate digests for each section in
the configuration file, instead of for the configuration file as a
whole. A whole-file digest would be relatively easy to add though.
Second, the configuration is split across multiple files, so we'd need
to either look at the digests of all configuration files in every API
call or check a 'global' SDN configuration digest on every call. Again,
certainly solvable, but it also requires some work.
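
One way such a 'global' digest could be built is to hash the per-file
digests together in a fixed order. The following Python sketch only
illustrates the idea: file_digest and global_digest are hypothetical
helpers, SHA-256 is an assumed hash function, and the file names in the
usage note are just examples.

```python
import hashlib


def file_digest(content):
    """Digest of a single configuration file's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def global_digest(files):
    """Combine per-file digests into one digest over all files.

    `files` maps a file name to its raw content (bytes). Names are
    sorted so the result does not depend on iteration order.
    """
    combined = hashlib.sha256()
    for name in sorted(files):
        combined.update(name.encode())
        combined.update(b"\0")  # separator between name and digest
        combined.update(file_digest(files[name]).encode())
        combined.update(b"\n")  # separator between entries
    return combined.hexdigest()
```

Calling global_digest({"zones.cfg": b"...", "vnets.cfg": b"..."})
yields a single value that changes whenever any of the files changes,
so a caller would only need to track one digest per remote.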

Since even with our best effort we will run into situations where the
lock doesn't get properly released, a simple escape hatch to unlock
the SDN config should be provided (like qm unlock). One such scenario
would be PDM losing connectivity to one of the remotes while holding
the lock; there's not really anything we can do there.
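
As a rough Python sketch of how the normal release path and such an
escape hatch could differ (SdnLock and its methods are hypothetical
names, not an existing API):

```python
class SdnLock:
    """Hypothetical lock on the SDN configuration of one remote."""

    def __init__(self):
        self.holder = None

    def acquire(self, holder):
        if self.holder is not None:
            raise RuntimeError(f"already locked by {self.holder}")
        self.holder = holder

    def release(self, holder):
        # Normal path: only the current holder may release the lock.
        if self.holder != holder:
            raise RuntimeError("lock is not held by this caller")
        self.holder = None

    def force_release(self):
        # Escape hatch: clear a stale lock regardless of who holds it.
        # Returns the previous holder, e.g. for logging.
        previous, self.holder = self.holder, None
        return previous
```

The forced variate is deliberately separate from release, so that
clearing a lock held by someone else is always an explicit decision.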

Since we probably need some way of doing this with other
configuration files as well, I wanted to ask for your input. I think
this concept could be applied generally to configuration changes that
need to be synchronized across multiple remotes (syncing firewall
configuration comes to mind). This is just a rough draft of how this
could work and I probably overlooked some edge cases. I'm happy about
any input or alternative ideas!