[pve-devel] StorPool storage plugin concerns

Ivaylo Markov ivaylo.markov at storpool.com
Thu Feb 13 16:21:33 CET 2025


On 04/02/2025 16:46, Fabian Grünbichler wrote:
>> Ivaylo Markov via pve-devel <pve-devel at lists.proxmox.com> wrote on 04.02.2025 13:44 CET:
>> Greetings,
>>
>> I was pointed here to discuss the StorPool storage plugin[0] with the
>> dev team.
>> If I understand correctly, there is a concern with our HA watchdog
>> daemon, and I'd like to explain the why and how.
> Hi!
>
> I am not sure whether there were previous discussions on some other channel; it might be helpful to include pointers to them if there are! Thanks for reaching out to our devel list; IMHO it's always best to get to a common understanding and hopefully a solution together, instead of on our own :)
Apologies for the confusion - there was a conversation at the management 
level between companies in regard to StorPool becoming a solution 
provider partner, and my understanding was that the PVE team had some 
concerns regarding the HA changes in our storage plugin.

The storage plugin is functional by itself and is used by some customers 
with the stock PVE watchdog in non-HCI scenarios.

The replacement watchdog is supposed to be used only in Proxmox+StorPool
HCI deployments, where no other storage is used. All of these deployments
are under continuous monitoring by our team, so we can put guard rails in
place to avoid unsupported configurations, and we take responsibility for
debugging HA issues in them. The watchdog is developed and tested by us
and is in production use by a couple of customers.

It makes sense to move the HCI-specific watchdog functionality into a
separate repo, so that the storage plugin repo is cleaner. We will do so
shortly.

>
>> As a distributed storage system, StorPool has its own internal
>> clustering mechanisms; it can run
>> on networks that are independent from the PVE cluster one, and thus
>> remain unaffected by network
>> partitions or other problems that would cause the standard PVE watchdog
>> to reboot a node.
>> In the case of HCI (compute + storage) nodes, this reboot can interrupt
>> the normal operation of the
>> StorPool cluster, causing reduced performance or downtime, which could
>> be avoided if the host is not restarted.
>> This is why we do our best to avoid such behavior across the different
>> cloud management platforms.
> This is similar to other storage providers like Ceph, which come with their own quorum/clustering/.. mechanism. In general, co-hosting two different systems like that will not increase overall availability or reliability, unless you can make them cooperate with each other, which is usually quite tricky/hard.
>
> E.g., in the case of Ceph+PVE (which I am obviously much more familiar with than your approach/solution):
> - PVE clustering uses corosync+pmxcfs+PVE's HA stack, with HA enabled this entails fencing, otherwise the cluster mostly goes read-only
> - Ceph will use its own monitors to determine quorum, and go read-only or inaccessible depending on how much of the cluster is up and how it is configured
>
> Since the quorum mechanisms are mostly independent (which doesn't mean they can't go down at the same time for the same or unrelated reasons), you can have partial failure scenarios:
> - Ceph could go read-only or down, while PVE itself is fine, but guests using Ceph are still experiencing I/O errors
> - PVE could go read-only, but already running guests can still write to the Ceph storage
> - PVE could fence a node which only hosts OSDs, and the remaining cluster can take over with just a short downtime of HA guests which were running on the fenced node
> - PVE could fence all nodes running Ceph monitors, Ceph goes down hard, but PVE itself remains operable with the remaining majority of nodes
> - ...
>
> If you want to reduce this interference, then HCI is not the way to go, but separating compute and storage into entirely independent parts of your environment (you probably already know this ;) and we both know this can be a hard sell as it's the more expensive approach for small to medium setups).

I agree, non-HCI setups are simpler (and simple can often be better),
but HCI also has advantages and is in demand from customers. We run a
couple of KVM HCI clouds for our own production workloads and
test/dev/lab use cases, so we know why customers choose HCI.


>
>> Currently, when our daemon detects an unexpected exit of a resource
>> manager, it will SIGKILL PVE HA services and running VMs on the node,
>> which should prevent two instances of the same VM from running at the
>> same time. PVE services and our block storage client daemon are
>> restarted as well.
>>
>> We're open to discussion and suggestions for our approach and
>> implementation.
> I just took a very quick peek, and maybe I understood something wrong (please correct me if I did!). As far as I can tell, your watchdog implementation replaces ours, which means that there would be no more fencing in case an HA-enabled node leaves the quorate partition of the corosync cluster (this seems to be the whole point of your watchdog takeover - to avoid fencing)? Even if you kill all HA resources/guests and the HA services, this is still dangerous, as the other nodes in the cluster will assume that the node has fenced itself after the grace period is over. This self-fencing property is a hard requirement for our HA stack; if that is undesirable for your use case you'd need to not allow HA in the first place (in which case, you also don't need to take over the watchdog, since it won't be armed). Note that while running guests and tasks are the most "high risk" parts, you simply cannot know what other processes/.. on the failing node are potentially accessing (writing to) state (such as VM disks) on shared storage(s) and thus can cause corruption if the node is not fully fenced by the time another node takes over.
>
> Could you maybe describe a bit more how your clustering works, and what your watchdog setup entails? The repo didn't provide much high level details and I don't want to read through all the code to try to map that back to a rough design (feel free to link to documentation of course!), since you can probably provide that overview much better and easier.
>
> Fabian
The goal of our StorPool+Proxmox HCI efforts has been to enable HCI
deployments without decreasing the availability of either the StorPool
or the Proxmox cluster. This is achieved by making sure that Proxmox's
clustering cannot restart nodes, and that VMs and other Proxmox services
are killed when Proxmox wants to fence a node. The StorPool cluster
doesn't need or use node fencing (how is a matter of a separate, longer
conversation), so it does not affect the Proxmox cluster directly.

In HCI scenarios with StorPool, which are supported only when StorPool 
is the only shared storage configured, we replace the standard PVE 
watchdog with our own implementation.

When a node needs to be fenced, our watchdog replacement performs the
following actions (see the sketch below):
  - SIGKILLs all guests
  - force-detaches the StorPool volumes and ensures our client block
    device cannot submit new I/O. "Force detach" in StorPool ensures
    that no further I/O can be submitted by the client, even if it was
    temporarily disconnected.
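
To illustrate, the sequence looks roughly like this (a minimal Python
sketch; the helper names and the StorPool management call are
placeholders for illustration, not our actual daemon code):

    import os
    import signal
    import subprocess

    def running_guest_pids():
        # Illustrative only: find QEMU guest processes on this node.
        out = subprocess.run(["pgrep", "-f", "qemu-system-"],
                             capture_output=True, text=True)
        return [int(pid) for pid in out.stdout.split()]

    def force_detach_all_volumes():
        # Placeholder for the StorPool management call that force-detaches
        # every volume attached to this host. Once done, the block client
        # cannot submit new I/O, even if it was temporarily disconnected.
        raise NotImplementedError("depends on the StorPool management API")

    def fence_local_node():
        # 1. SIGKILL all guests so no new writes can originate here.
        for pid in running_guest_pids():
            os.kill(pid, signal.SIGKILL)
        # 2. Cut off the storage path itself.
        force_detach_all_volumes()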

Additionally, when a VM is started, the storage plugin first
force-detaches its volumes from all hosts other than the one it is about
to be started on. With these precautions in place, there should be
sufficient protection against parallel writes from multiple nodes.
Writes to pmxcfs are handled by PVE’s clustering components, and we
don’t expect any problems there.
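
Roughly, that activation step works like this (again a hedged Python
sketch with placeholder function names; the actual plugin talks to the
StorPool API):

    def storpool_force_detach(volume, host):
        # Placeholder: StorPool API call that force-detaches a volume
        # from a host, rejecting any further I/O from it.
        ...

    def storpool_attach(volume, host):
        # Placeholder: StorPool API call that attaches a volume to a host.
        ...

    def activate_volume_exclusively(volume, target_host, cluster_hosts):
        # Before the VM starts on target_host, make sure no other host
        # can still write to its volume.
        for host in cluster_hosts:
            if host != target_host:
                storpool_force_detach(volume, host)
        storpool_attach(volume, target_host)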

We will also make sure that no other storages are configured, by
monitoring the Proxmox storage configuration.
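
As a rough idea of what such a check could look like (simplified
parsing, and the allow-list below is only an example; real
/etc/pve/storage.cfg entries carry per-storage options):

    def foreign_storages(cfg_path="/etc/pve/storage.cfg"):
        # Section headers in storage.cfg look like "<type>: <storage id>";
        # the option lines below each header are indented.
        allowed = {"storpool", "dir"}  # example allow-list only
        foreign = []
        with open(cfg_path) as cfg:
            for line in cfg:
                if not line.strip() or line[:1].isspace():
                    continue  # blank line or per-storage option
                storage_type = line.split(":", 1)[0].strip()
                if storage_type not in allowed:
                    foreign.append(line.strip())
        return foreign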

What we've done so far seems sufficient to achieve these goals: it
removes the possibility of the Proxmox cluster killing off a storage
node, while still effectively fencing VMs and other services. As with
any piece of software, there are things that can be done to make it even
better. A few examples we have not yet committed to:
  - support for containers, not just VMs
  - automatic recovery, so the UX is similar to the default watchdog

Please let us know your thoughts and any further concerns; we'd like to
address them, as Proxmox HCI support is important to us.

Thank you,
Ivaylo


-- 
Ivaylo Markov
Quality & Automation Engineer
StorPool Storage
https://www.storpool.com


