[pdm-devel] RFC: Synchronizing configuration changes across remotes
Thomas Lamprecht
t.lamprecht at proxmox.com
Mon Feb 3 18:02:49 CET 2025
On 30.01.25 at 16:48, Stefan Hanreich wrote:
> During this operation it would make sense to make sure that there are
> no pending changes in the SDN configuration, so users do not
> accidentally apply unrelated changes via PDM. We also need to prevent
> any concurrent SDN configuration changes for the same reason - so we
> don't apply any unrelated configuration.
>
> The question is: Do we also want to be able to prevent concurrent
> changes across multiple remotes, or are we fine with only preventing
> concurrent changes on a remote level? With network configuration
> affecting more than one remote, I think it would be better to
> synchronize changes across remotes since oftentimes applying the
> configuration to only one remote doesn't really make sense and the
> failure to apply configuration could affect the other remote.
To answer yes to your specific question and really mean it (as in:
actually safe, generically reliable, and maybe even atomic) would mean
that we need to add a cluster layer with an algorithm that actually
ensures consensus over all remotes.
I.e., Paxos, like corosync uses, or Raft, or something like that
(depending on the exact properties wanted).
All of these are very costly and do not scale well at all, so
interpreting your question rather narrowly I'd have to answer with a
strong no; requiring that would severely limit PDM in its usefulness
and IME bring major complexity and headaches along with it. And as
rolling out the network config to FRR and whatnot else has tons of
side effects, doing all this consensus work would quite definitively
be rather useless anyway.
But, squinting a bit more and interpreting the question to not mean
that we should add ways for doing things in guaranteed lockstep
through an FSM distributed over all remotes, but rather that one adds
some way to ensure one can do a sequence of edits without others being
able to make any modifications during that sequence, it should be
doable.
Especially if we transparently shift the responsibility for cleaning
things up, if anything goes wrong, to the user (with some methods to
empower them to do so, documentation- and tooling-wise).
> I currently gravitate towards the lock-based approach due to the
> following reasons:
Yeah, a digest is not giving you anything here, at least for anything
that consists of more than one change; and adding a dedicated central
API endpoint for every variant of batch update we might need seems
neither scalable nor like good API design.
> * It enables us to synchronize changes across multiple remotes - as
> compared to a digest based approach.
> * It's a lot more ergonomic for developers, since you simply
> acquire/release the lock. With a digest-based approach, modifications
> that require multiple API calls need to acquire a new digest
> every time and track it across multiple API calls. With SDN specifically,
> when applying the configuration, we need to provide and check the digest
> as well.
> * It is just easier to prevent concurrent changes in the first place
> rather than reacting to them. If they cannot occur, then rolling back
> is easier and less error-prone since the developer can assume nothing
> changed in the previously handled remotes as well.
>
> The downsides of this approach I can see:
> * It requires sweeping changes to basically the whole SDN API, and
> keeping backwards compatibility is harder.
Does it really require sweeping changes? I'd think modifications are
already hedging against concurrent access now, so this should not mean
we change to a completely new edit paradigm here.
My thought when we talked was to go roughly for:
Add a new endpoint that 1) ensures basic healthiness and 2) registers
a lock for the whole, or potentially only some parts, of the SDN
stack. This should work by returning a random lock-cookie string to be
used by subsequent calls to do various updates in one go while
ensuring nothing else can do so or just steal our lock. Then check
this lock centrally on any config write and we should be basically
done, I think?
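To make that a bit more concrete, here is a rough, hypothetical sketch
of such a lock-cookie mechanism in Rust; none of the names exist yet,
the state would of course live next to the SDN configs (e.g., on
pmxcfs) instead of in memory, and the token would come from a proper
CSPRNG:

use std::time::SystemTime;

struct SdnLock {
    cookie: String, // random token handed out to the caller
    holder: String, // who took the lock, for display/debugging
}

#[derive(Default)]
struct SdnLockState {
    lock: Option<SdnLock>,
}

impl SdnLockState {
    /// Register the SDN-wide lock and return the cookie that every
    /// subsequent write call has to present.
    fn acquire(&mut self, holder: &str) -> Result<String, String> {
        if self.lock.is_some() {
            return Err("SDN configuration is already locked".into());
        }
        // stand-in for a CSPRNG-generated token
        let nanos = SystemTime::now()
            .duration_since(SystemTime::UNIX_EPOCH)
            .unwrap_or_default()
            .as_nanos();
        let cookie = format!("{:x}", nanos);
        self.lock = Some(SdnLock {
            cookie: cookie.clone(),
            holder: holder.to_string(),
        });
        Ok(cookie)
    }

    /// Central check done by every SDN write-config path.
    fn check(&self, presented: Option<&str>) -> Result<(), String> {
        match &self.lock {
            Some(lock) if presented == Some(lock.cookie.as_str()) => Ok(()),
            Some(lock) => Err(format!("SDN config locked by {}", lock.holder)),
            None => Ok(()), // unlocked: normal edits proceed as today
        }
    }

    /// Release, e.g. along with a successful commit or a manual unlock.
    fn release(&mut self, presented: &str) -> Result<(), String> {
        match &self.lock {
            Some(lock) if lock.cookie == presented => {}
            Some(_) => return Err("wrong lock cookie".into()),
            None => return Err("SDN configuration is not locked".into()),
        }
        self.lock = None;
        Ok(())
    }
}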
A slightly more elaborate variant might be to also split the edit step,
i.e.
1. check all remotes and get lock
2. extend the config(s) with a section (or a separate ".new" config) for
pending changes, write all new changes to that.
3. commit the pending sections or .new config file.
With that you would have the smallest possibility of failure due to
unrelated node/connection hiccups and reduce the time gap for actually
activating the changes. If something is off, an admin could even
manually apply these directly on the cluster/nodes.
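Spelled out from the PDM side, that split variant could look roughly
like the sketch below; the SdnRemote trait and its methods are purely
illustrative stand-ins for whatever per-remote client API ends up
existing:

// Stand-in for the per-remote client API PDM would use.
trait SdnRemote {
    fn acquire_sdn_lock(&self) -> Result<String, String>;
    fn stage_pending_change(&self, cookie: &str, change: &str) -> Result<(), String>;
    fn commit_pending_change(&self, cookie: &str) -> Result<(), String>;
    fn release_sdn_lock(&self, cookie: &str) -> Result<(), String>;
}

fn rollout_sdn_change(
    remotes: &[Box<dyn SdnRemote>],
    change: &str,
) -> Result<(), String> {
    // 1. check health and take the lock on every involved remote first,
    //    so nothing else can modify the SDN config during the sequence
    let mut cookies = Vec::with_capacity(remotes.len());
    for remote in remotes {
        cookies.push(remote.acquire_sdn_lock()?);
    }

    // 2. write the changes into a pending section / ".new" config on
    //    each remote; nothing is activated yet, so a failure here is
    //    comparatively easy to recover from
    for (remote, cookie) in remotes.iter().zip(&cookies) {
        remote.stage_pending_change(cookie, change)?;
    }

    // 3. commit the pending changes and release the lock along with a
    //    successful commit; on error the pending config stays around, so
    //    an admin can inspect it or apply it manually on the cluster
    for (remote, cookie) in remotes.iter().zip(&cookies) {
        remote.commit_pending_change(cookie)?;
        remote.release_sdn_lock(cookie)?;
    }
    Ok(())
}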
> * Also, many API endpoints in PVE already provide the digest
> functionality, so it would be a lot easier to retro-fit this for usage
> with PDM and possibly require no changes at all.
Digest should be able to co-exist: if the config is unlocked and the
digest is the same, then the edit is generally safe.
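As a small illustration of how the two mechanisms could be combined on
a config write (a single hypothetical gate function, all parameter
names made up):

/// Sketch: when a lock is held, only the cookie holder may write; when
/// unlocked, the usual digest guard against concurrent edits applies.
fn write_allowed(
    current_lock_cookie: Option<&str>,
    presented_cookie: Option<&str>,
    presented_digest: Option<&str>,
    config_digest: &str,
) -> Result<(), String> {
    match current_lock_cookie {
        Some(cookie) if presented_cookie == Some(cookie) => Ok(()),
        Some(_) => Err("SDN configuration is locked".into()),
        None => match presented_digest {
            Some(d) if d != config_digest => {
                Err("digest mismatch, config changed meanwhile".into())
            }
            _ => Ok(()),
        },
    }
}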
> * In case of failures on the PDM side it is harder to recover, since
> it requires manual intervention (removing the lock manually).
Well, a partially rolled out SDN update might always be (relatively)
hard to recover from; which approach would avoid that (and not require
Paxos- or Raft-level guarantees)?
> For single configuration files the digest-based approach could work
> quite well in cases where we don't need to synchronize changes across
> multiple remotes. But for SDN the digest-based approach is a bit more
> complicated: We currently generate digests for each section in the
> configuration file, instead of for the configuration file as a whole.
> This would be relatively easy to add though. The second problem is
> that the configuration is split across multiple files, so we'd need to
> either look at all digests of all configuration files in all API calls
> or check a 'global' SDN configuration digest on every call. Again,
> certainly solvable but also requires some work.
FWIW, we already have pmxcfs-backed domain locks, which I added for
the HA stack back in the day. These allow one to relatively cheaply
take a lock that only one pmxcfs instance (i.e., one node) at a time
can hold. Pair that with some local lock (e.g., flock; in
single-process, many-threads rust land it could be an even cheaper
mutex) and you can quite simply and not too expensively lock edits.
And I'd figure SDN modifications do not happen at _that_ high a
frequency for the performance of such locking to become a problem.
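In (hypothetical) Rust terms the pairing could look something like the
sketch below; the ClusterLock trait is just a stand-in for whatever
would wrap the pmxcfs domain lock, and the local part is modelled as a
mutex for a single-process, multi-threaded daemon:

use std::sync::{Mutex, MutexGuard};

/// Stand-in for a wrapper around the pmxcfs-backed domain lock, which
/// only one node at a time can hold.
trait ClusterLock {
    fn lock_domain(&self, domain: &str) -> Result<(), String>;
    fn unlock_domain(&self, domain: &str) -> Result<(), String>;
}

/// Holds both locks; dropping it releases the cluster-wide lock again.
struct SdnEditGuard<'a, C: ClusterLock> {
    cluster: &'a C,
    domain: &'a str,
    _local: MutexGuard<'a, ()>,
}

/// Take the cheap local lock first (serializes threads on this node),
/// then the pmxcfs domain lock (serializes nodes in the cluster).
fn lock_sdn_edit<'a, C: ClusterLock>(
    local: &'a Mutex<()>,
    cluster: &'a C,
    domain: &'a str,
) -> Result<SdnEditGuard<'a, C>, String> {
    let local_guard = local
        .lock()
        .map_err(|_| "local lock poisoned".to_string())?;
    cluster.lock_domain(domain)?;
    Ok(SdnEditGuard { cluster, domain, _local: local_guard })
}

impl<'a, C: ClusterLock> Drop for SdnEditGuard<'a, C> {
    fn drop(&mut self) {
        // best effort; the cluster-side lock should also expire on its own
        let _ = self.cluster.unlock_domain(self.domain);
    }
}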
> Since even with our best effort we will run into situations where the
> lock doesn't get properly released, a simple escape hatch to unlock
> the SDN config should be provided (like qm unlock). One such scenario
> would be PDM losing connectivity to one of the remotes while holding
> the lock, there's not really anything we can do there.
With the variant that allows separately committing a change, the lock
could be released along with that commit (only if there is no error,
naturally); that should avoid the most problematic situations where an
admin cannot be sure what to do. Otherwise the new config, or pending
section, would still exist and can help to interpret the status quo of
the network and decide the best course of action.
> Since we probably need some form of doing this with other
> configuration files as well, I wanted to ask for your input. I think
> this concept could be applied generally to configuration changes that
> need to be made synchronized across multiple remotes (syncing firewall
> configuration comes to mind). This is just a rough draft on how this
> could work and I probably overlooked some edge-cases. I'm happy for any
> input or alternative ideas!
There are further details to flesh out here, but in any case I think
that we really should focus on SDN here and avoid some overly generic
solution, rather tailoring it to the specific SDN use case(s) at hand.
The firewall can IMO be placed under the SDN umbrella and might be
fine to use similar mechanics, maybe even exactly the same, but I
would not concentrate on building something generic here now. If we
can re-use it later then great, but SDN is quite specific; not a lot
of things depend on rolling out changes in (best-effort) lockstep to
ensure the end result is actually a functioning thing. Meaning, most
other things should not require any such inter-remote synchronization
building blocks in the first place; we need to ensure we do not use
that hammer too often. As I recently replied to Dominik's bulk-action
mail: the PDM should be minimal in the state and configs it manages
itself for some remote management feature, else things will get
complex and coupled way too fast.