[pve-devel] StorPool storage plugin concerns

Fabian Grünbichler f.gruenbichler at proxmox.com
Tue Feb 4 15:46:07 CET 2025


> Ivaylo Markov via pve-devel <pve-devel at lists.proxmox.com> wrote on 04.02.2025 13:44 CET:
> Greetings,
> 
> I was pointed here to discuss the StorPool storage plugin[0] with the 
> dev team.
> If I understand correctly, there is a concern with our HA watchdog 
> daemon, and I'd like to explain the why and how.

Hi!

I am not sure whether there were previous discussions on some other channel; if there were, it might be helpful to include pointers to them! Thanks for reaching out to our devel list, IMHO it's always best to get to a common understanding and hopefully a solution together, instead of each side going it alone :)

> As a distributed storage system, StorPool has its own internal 
> clustering mechanisms; it can run
> on networks that are independent from the PVE cluster one, and thus 
> remain unaffected by network
> partitions or other problems that would cause the standard PVE watchdog 
> to reboot a node.
> In the case of HCI (compute + storage) nodes, this reboot can interrupt 
> the normal operation of the
> StorPool cluster, causing reduced performance or downtime, which could 
> be avoided if the host is not restarted.
> This is why we do our best to avoid such behavior across the different 
> cloud management platforms.

This is similar to other storage providers like Ceph, which come with their own quorum/clustering/.. mechanism. In general, co-hosting two different systems like that will not increase overall availability or reliability unless you can make them cooperate with each other, which is usually quite tricky/hard.

E.g., in the case of Ceph+PVE (which I am obviously much more familiar with than your approach/solution):
- PVE clustering uses corosync + pmxcfs + PVE's HA stack; with HA enabled this entails fencing, otherwise the cluster mostly goes read-only
- Ceph will use its own monitors to determine quorum, and go read-only or inaccessible depending on how much of the cluster is up and how it is configured

Since the quorum mechanisms are mostly independent (which doesn't mean they can't go down at the same time for the same or unrelated reasons), you can have partial failure scenarios (see the sketch after this list):
- Ceph could go read-only or down while PVE itself is fine, but guests using Ceph then experience I/O errors
- PVE could go read-only, but already running guests can still write to the Ceph storage
- PVE could fence a node which only hosts OSDs, and the remaining cluster can take over with just a short downtime of HA guests which were running on the fenced node
- PVE could fence all nodes running Ceph monitors, Ceph goes down hard, but PVE itself remains operable with the remaining majority of nodes
- ...
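
To illustrate how independent these two views can be, here is a minimal sketch (hypothetical, not part of PVE, Ceph or StorPool) that probes both mechanisms on a node; it assumes the standard corosync-quorumtool and ceph CLIs are available there, and any combination of the two results is possible:

#!/usr/bin/env python3
# Hypothetical illustration: query the PVE/corosync quorum state and the Ceph
# health state independently, to show that the two mechanisms can disagree
# (any combination of the two results is possible, which is exactly the
# partial-failure space listed above).
import subprocess

def pve_quorate() -> bool:
    # corosync-quorumtool -s prints a "Quorate: Yes/No" line in its status output
    try:
        out = subprocess.run(["corosync-quorumtool", "-s"],
                             capture_output=True, text=True, timeout=10).stdout
    except (OSError, subprocess.TimeoutExpired):
        return False
    return any(line.startswith("Quorate:") and "Yes" in line
               for line in out.splitlines())

def ceph_health() -> str:
    # "ceph health" prints HEALTH_OK / HEALTH_WARN ... / HEALTH_ERR ...
    try:
        out = subprocess.run(["ceph", "health"], capture_output=True,
                             text=True, timeout=10).stdout.strip()
        return out.split()[0] if out else "UNREACHABLE"
    except (OSError, subprocess.TimeoutExpired):
        return "UNREACHABLE"

if __name__ == "__main__":
    print("PVE/corosync quorate:", pve_quorate())
    print("Ceph health:         ", ceph_health())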

If you want to reduce this interference, then HCI is not the way to go; instead, separate compute and storage into entirely independent parts of your environment (you probably already know this ;) and we both know this can be a hard sell, as it's the more expensive approach for small to medium setups).

> Currently, when our daemon detects an unexpected exit of a resource 
> manager, it will SIGKILL PVE
> HA services and running VMs on the node, which should prevent 2 
> instances of the same VM running at
> the same time. PVE services and our block storage client daemon are 
> restarted as well.
> 
> We're open to discussion and suggestions for our approach and 
> implementation.

I just took a very quick peek, and maybe I misunderstood something (please correct me if I did!). As far as I can tell, your watchdog implementation replaces ours, which means there would be no more fencing in case a HA-enabled node leaves the quorate partition of the corosync cluster (this seems to be the whole point of your watchdog takeover - to avoid fencing)? Even if you kill all HA resources/guests and the HA services, this is still dangerous, as the other nodes in the cluster will assume that the node has fenced itself once the grace period is over. This self-fencing property is a hard requirement for our HA stack; if that is undesirable for your use case, you'd need to not allow HA in the first place (in which case you also don't need to take over the watchdog, since it won't be armed).

Note that while running guests and tasks are the most "high risk" parts, you simply cannot know what other processes/.. on the failing node are potentially accessing (writing to) state (such as VM disks) on shared storage(s), and could thus cause corruption if the node is not fully fenced by the time another node takes over.
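
To make that requirement concrete, here is an illustrative Python sketch of the self-fencing principle against the standard Linux watchdog device. This is of course NOT how our watchdog-mux/HA stack is actually implemented, and it assumes corosync-quorumtool as the quorum source; it only shows the property a replacement watchdog would have to preserve: keep petting the watchdog while the node is quorate, and simply stop once it isn't.

#!/usr/bin/env python3
# Illustrative sketch only - NOT how PVE's watchdog-mux/ha-manager work.
# The hardware watchdog is only "petted" while this node is in the quorate
# partition; once quorum is lost the petting stops, and the watchdog resets
# (fences) the node after its timeout, before other nodes start recovering
# the HA guests that were running here.
import subprocess
import time

WATCHDOG_DEV = "/dev/watchdog"   # standard Linux watchdog device
PET_INTERVAL = 5                 # seconds; must stay well below the watchdog timeout

def node_is_quorate() -> bool:
    # assumption: corosync-quorumtool -s is the quorum source ("Quorate: Yes/No")
    try:
        out = subprocess.run(["corosync-quorumtool", "-s"],
                             capture_output=True, text=True, timeout=10).stdout
    except (OSError, subprocess.TimeoutExpired):
        return False
    return any(l.startswith("Quorate:") and "Yes" in l for l in out.splitlines())

def main():
    # Opening the device arms the watchdog.
    wd = open(WATCHDOG_DEV, "wb", buffering=0)
    while node_is_quorate():
        wd.write(b"\0")          # pet the watchdog, postponing the reset
        time.sleep(PET_INTERVAL)
    # Quorum lost: stop petting and leave the device armed. Exiting without
    # writing the magic 'V' character does not disarm drivers that support
    # "magic close", so the node is reset once the timeout expires.

if __name__ == "__main__":
    main()

A watchdog replacement that never lets this reset happen removes exactly that guarantee, which is why the other nodes' assumption after the grace period becomes unsafe.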

Could you maybe describe a bit more how your clustering works, and what your watchdog setup entails? The repo doesn't provide many high-level details, and I don't want to read through all the code to try to map it back to a rough design (feel free to link to documentation of course!), since you can probably provide that overview much better and more easily.

Fabian



