[pve-devel] StorPool storage plugin concerns
Fabian Grünbichler
f.gruenbichler at proxmox.com
Fri Feb 14 12:42:16 CET 2025
> Ivaylo Markov <ivaylo.markov at storpool.com> wrote on 13.02.2025 16:21 CET:
>
> On 04/02/2025 16:46, Fabian Grünbichler wrote:
>
> > > Ivaylo Markov via pve-devel <pve-devel at lists.proxmox.com> wrote on 04.02.2025 13:44 CET:
> > > Greetings,
> > >
> > > I was pointed here to discuss the StorPool storage plugin[0] with the
> > > dev team.
> > > If I understand correctly, there is a concern with our HA watchdog
> > > daemon, and I'd like to explain the why and how.
> > >
> > Hi!
> >
> > I am not sure whether there were previous discussions on some other channel; it might be helpful to include pointers to them if there are any! Thanks for reaching out to our devel list - IMHO it's always best to get to a common understanding and hopefully a solution together, instead of on our own :)
> Apologies for the confusion - there was a conversation at the management level between companies in regard to StorPool becoming a solution provider partner, and my understanding was that the PVE team had some concerns regarding the HA changes in our storage plugin.
Yes, I heard/found out about that after my reply :)
> The storage plugin is functional by itself and is used by some customers with the stock PVE watchdog in non-HCI scenarios.
>
> The replacement watchdog is supposed to be used only in Proxmox+StorPool HCI deployments, where no other storage is used. All of the deployments are under continuous monitoring by our team, so we can put guard rails in place to avoid unsupported configurations. We take responsibility for debugging HA issues in these deployments.
> It is developed and tested by us and it is in use in production by a couple of customers.
>
> It makes sense to move the HCI-specific watchdog functionality into a separate repo, so that the storage plugin repo is cleaner. We will do so shortly.
That is probably a good idea - it makes it more obvious whether one is using just the storage (client) part or the full HCI-with-alternate-HA-semantics package, and makes it harder to introduce accidental interdependencies between the two.
> > > As a distributed storage system, StorPool has its own internal
> > > clustering mechanisms; it can run
> > > on networks that are independent from the PVE cluster one, and thus
> > > remain unaffected by network
> > > partitions or other problems that would cause the standard PVE watchdog
> > > to reboot a node.
> > > In the case of HCI (compute + storage) nodes, this reboot can interrupt
> > > the normal operation of the
> > > StorPool cluster, causing reduced performance or downtime, which could
> > > be avoided if the host is not restarted.
> > > This is why we do our best to avoid such behavior across the different
> > > cloud management platforms.
> > >
> > This is similar to other storage providers like Ceph, which come with their own quorum/clustering/.. mechanism. In general, co-hosting two different systems like that will not increase overall availability or reliability, unless you can make them cooperate with each other, which is usually quite tricky/hard.
> >
> > E.g., in the case of Ceph+PVE (which I am obviously much more familiar with than your approach/solution):
> > - PVE clustering uses corosync+pmxcfs+PVE's HA stack, with HA enabled this entails fencing, otherwise the cluster mostly goes read-only
> > - Ceph will use its own monitors to determine quorum, and go read-only or inaccessible depending on how much of the cluster is up and how it is configured
> >
> > Since the quorum mechanisms are mostly independent (which doesn't mean they can't go down at the same time for the same or unrelated reasons), you can have partial failure scenarios:
> > - Ceph could go read-only or down, while PVE itself is fine, but guests using Ceph are still experiencing I/O errors
> > - PVE could go read-only, but already running guests can still write to the Ceph storage
> > - PVE could fence a node which only hosts OSDs, and the remaining cluster can take over with just a short downtime of HA guests which were running on the fenced node
> > - PVE could fence all nodes running Ceph monitors, Ceph goes down hard, but PVE itself remains operable with the remaining majority of nodes
> > - ...
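> >
> > To make that concrete, here is a toy sketch (Python, purely illustrative, not actual PVE or Ceph code) that just enumerates the coarse states of two independent quorum mechanisms:
> >
> >     from itertools import product
> >
> >     # Toy model: PVE (corosync) quorum and Ceph (monitor) quorum are
> >     # mostly independent booleans, so a co-hosted cluster has four
> >     # coarse states instead of a single up/down.
> >     for pve_ok, ceph_ok in product([True, False], repeat=2):
> >         if pve_ok and ceph_ok:
> >             outcome = "normal operation"
> >         elif pve_ok:
> >             outcome = "PVE fine, but guests on Ceph see I/O errors"
> >         elif ceph_ok:
> >             outcome = "PVE read-only/fencing, running guests still write to Ceph"
> >         else:
> >             outcome = "both down"
> >         print(f"PVE quorate={pve_ok}, Ceph quorate={ceph_ok}: {outcome}")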
> >
> > If you want to reduce this interference, then HCI is not the way to go, but separating compute and storage into entirely independent parts of your environment (you probably already know this ;) and we both know this can be a hard sell, as it's the more expensive approach for small to medium setups).
> I agree, non-HCI setups are simpler (and simple can often be better), but HCI also has advantages and is in demand by customers. We run a couple of KVM HCI clouds for our own production workloads and test/dev/lab use-cases, so we know why customers choose HCI.
yes, HCI Ceph is also very popular with our users for valid reasons! :)
> > > Currently, when our daemon detects an unexpected exit of a resource
> > > manager, it will SIGKILL PVE
> > > HA services and running VMs on the node, which should prevent 2
> > > instances of the same VM running at
> > > the same time. PVE services and our block storage client daemon are
> > > restarted as well.
> > >
> > > We're open to discussion and suggestions for our approach and
> > > implementation.
> > >
> > I just took a very quick peek, and maybe I understood something wrong (please correct me if I did!). As far as I can tell, your watchdog implementation replaces ours, which means that there would be no more fencing in case a HA-enabled node leaves the quorate partition of the corosync cluster (this seems to be the whole point of your watchdog takeover - to avoid fencing)? Even if you kill all HA resources/guests and the HA services, this is still dangerous, as the other nodes in the cluster will assume that the node has fenced itself after the grace period is over. This self-fencing property is a hard requirement for our HA stack; if that is undesirable for your use case, you'd need to not allow HA in the first place (in which case you also don't need to take over the watchdog, since it won't be armed). Note that while running guests and tasks are the most "high risk" parts, you simply cannot know what other processes on the failing node are potentially accessing (writing to) state (such as VM disks) on shared storage(s) and thus can cause corruption if the node is not fully fenced by the time another node takes over.
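> >
> > For reference, the self-fencing pattern itself is simple - a minimal sketch (Python, illustrative only, this is not the actual watchdog-mux code, and the quorum probe is a placeholder):
> >
> >     import os
> >     import time
> >
> >     def is_quorate():
> >         # placeholder - the real implementation asks corosync/pmxcfs
> >         return True
> >
> >     # Pet the hardware watchdog only while quorate. On quorum loss we
> >     # simply stop petting; since we never write the magic 'V' before
> >     # closing, the timer stays armed and the node resets itself once
> >     # the watchdog timeout expires - that reset is the fencing
> >     # guarantee the other nodes rely on.
> >     wd = os.open("/dev/watchdog", os.O_WRONLY)
> >     while is_quorate():
> >         os.write(wd, b"\0")
> >         time.sleep(1)
> >     # quorum lost: fall through without disarming - self-fencing follows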
> >
> > Could you maybe describe a bit more how your clustering works, and what your watchdog setup entails? The repo didn't provide much high level details and I don't want to read through all the code to try to map that back to a rough design (feel free to link to documentation of course!), since you can probably provide that overview much better and easier.
> >
> > Fabian
> The goal of our StorPool+Proxmox HCI efforts has been to enable HCI deployments without decreasing the availability of the StorPool and Proxmox clusters. This is achieved by making sure Proxmox's clustering cannot restart nodes and making sure that VMs and other Proxmox services are killed when Proxmox wants to fence a node. The StorPool cluster doesn't need or use node fencing (how is a matter of a separate, longer conversation), so it does not affect the Proxmox cluster directly.
>
> In HCI scenarios with StorPool, which are supported only when StorPool is the only shared storage configured, we replace the standard PVE watchdog with our own implementation.
>
> When a node needs to be fenced, our watchdog replacement performs the following actions (sketched in rough form below):
> - SIGKILLs all guests
> - force-detaches StorPool volumes and ensures our client block device cannot submit new I/O. "Force detach" in StorPool ensures that no further I/O can be submitted by the client, even if it was temporarily disconnected.
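>
> In rough Python form, the sequence is approximately the following (an illustrative sketch only - the helper names are made up and this is not the actual daemon code):
>
>     import os
>     import signal
>     import subprocess
>
>     def qemu_pids():
>         # find running QEMU guests (simplified full-command-line match)
>         out = subprocess.run(["pgrep", "-f", "qemu-system"],
>                              capture_output=True, text=True)
>         return [int(p) for p in out.stdout.split()]
>
>     def force_detach_all_volumes():
>         # placeholder for the StorPool force-detach call, which
>         # guarantees the client cannot submit any further I/O
>         pass
>
>     def fence_node():
>         for pid in qemu_pids():
>             os.kill(pid, signal.SIGKILL)  # kill guests first
>         force_detach_all_volumes()        # then cut off storage access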
>
> Additionally, when a VM is started, the storage plugin first force-detaches its volumes from all hosts other than the one it is about to be started on. With these precautions in place there should be sufficient protection against parallel writes from multiple nodes. Writes to pmxcfs are handled by PVE’s clustering components, and we don’t expect any problems there.
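>
> As a sketch (again with made-up helper names, not our actual plugin code), the pre-start precaution looks roughly like this:
>
>     def force_detach(volume, node):
>         # placeholder for the StorPool force-detach API call
>         pass
>
>     def prepare_vm_start(vm_volumes, this_node, all_nodes):
>         # cut off any stale writer before the VM starts on this node
>         for node in all_nodes:
>             if node == this_node:
>                 continue
>             for volume in vm_volumes:
>                 force_detach(volume, node)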
>
> We will also make sure that there are no other storages configured, by monitoring the Proxmox storage configuration.
>
> What we've done so far seems to be sufficient to achieve the goals - it effectively removes the possibility of the Proxmox cluster killing off a storage node, while still fencing VMs and other services. As with any piece of software, there are things which can be done to make it even better. A few non-committed examples:
> - support for containers, not just VMs
> - automatic recovery so it has UX similar to the default watchdog
> Please let us know your thoughts and any further concerns; we'd like to address them, as Proxmox HCI support is important to us.
AFAICT from the description above (not looking at code or actually testing anything), issues on your storage layer should be ruled out. But it still leaves issues with anything else, e.g. any long running task (either by PVE, or by the admin) that involves a HA-managed guest is at risk of being "split-brained".

In a regular (HA) setup, another node will only recover the config (and thus ownership) of the guest once the requisite timeouts have passed, which means it *knows* the failed node must have fenced itself. In your setup, this is not the case anymore - the non-quorate node still has the VM config (since it is not quorate, it cannot notice the "theft" of the config by the HA stack running on the quorate partition of the cluster) and thus (from a local point of view) at least RO ownership of that guest. Depending on the sequence of events, such a task might have passed a quorum check earlier and not yet reached the next such check, and thus even think it still has full ownership and act accordingly!

Obviously, writes to your shared storage or to /etc/pve would be blocked, but that doesn't mean that nothing dangerous can happen (e.g., local or external state being corrupted or running out of sync by writes on/from two different nodes).
The only way to make this safe(r) would be to basically disallow any custom integration (to ensure no non-PVE tasks are running) and kill the whole PVE stack on quorum loss, including any spawned tasks and pmxcfs. At that point, all the configs and API would become unavailable as well, so the risk of something/somebody misinterpreting anything should become zero - if there is no information, nothing can be misinterpreted after all ;) This would basically mean "downgrading" a PVE+StorPool node to a StorPool node on quorum loss, which is your intended semantics (I think?).
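
To illustrate, a rough sketch of what such a "downgrade" could look like (the unit names are the real PVE services, but the procedure itself is just an illustration, not something that exists today):

    import subprocess

    # stop the whole PVE stack so no config or API is left around to be
    # misinterpreted - HA stack first, pmxcfs (pve-cluster) last
    PVE_UNITS = [
        "pve-ha-lrm", "pve-ha-crm",
        "pvescheduler", "pvestatd",
        "pvedaemon", "pveproxy",
        "pve-cluster",
    ]

    def downgrade_to_storage_node():
        for unit in PVE_UNITS:
            subprocess.run(["systemctl", "stop", unit], check=False)
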
This approach does come with a new problem though - once this node rejoins the cluster, you'd need to bring up all of the PVE stack again in an orderly fashion.
I hope the above explains why and how PVE is using self-fencing via watchdogs, and the implications of disabling that while keeping HA "enabled". If something is unclear or you have more questions, please reach out!