[pve-devel] applied: Re: [PATCH cluster] corosync.conf sync: reload after sleep
Thomas Lamprecht
t.lamprecht at proxmox.com
Thu Jul 7 11:38:58 CEST 2022
On 07/07/2022 10:21, Fabian Grünbichler wrote:
> if processing a corosync.conf update is delayed on a single node,
> reloading the config too early can have disastrous results (loss of
> token and HA fence). artificially delay the reload command by one second
> to allow update propagation in most scenarios until a proper solution
> (e.g., using broadcasting/querying of locally deployed config versions)
> has been developed and fully tested.
>
> reported on the forum:
> https://forum.proxmox.com/threads/expanding-cluster-reboots-all-vms.110903/
>
> reported issue can be reproduced by deploying a patched pmxcfs on a
> non-reloading node that sleeps before writing out a broadcasted
> corosync.conf update and adding a node to the cluster, leading to the
> following sequence of events:
>
> - corosync config reload command received
> - corosync config update written out
>
> which causes that particular node to have a different view of cluster
> topology, causing all corosync communication to fail for all nodes until
> corosync on the affected node is restarted (the on-disk config is
> correct after all, just not in effect).
>
> Signed-off-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
> ---
> tested new cluster creation from scratch, and cluster expansion (on a
> test PVE cluster with HA enabled and running guests, to simulate some
> load).
>
> data/src/dcdb.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
>
applied, thanks!
for now the simplest stopgap; any more elaborate mechanism may be better
suited for a major release anyway, upgrade-wise.
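
For readers of the archive, here is a rough sketch of the two approaches
mentioned in the commit message. It is not the applied patch (the actual
change lives in data/src/dcdb.c and is not reproduced in this mail), and
all names below (reload_after_delay, all_nodes_at_version, fake_reload)
are hypothetical. The first helper shows the fixed-delay stopgap; the
second shows the kind of check a version-querying mechanism might
perform before reloading, assuming each node can report the config
version it has written to disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Stand-in for the real reload command; hypothetical, always succeeds. */
static int fake_reload(void)
{
    return 0;
}

/*
 * Stopgap: wait a fixed number of seconds so that other nodes have a
 * chance to write out the new corosync.conf, then trigger the reload.
 * The delay does not guarantee that every node has caught up.
 */
static int reload_after_delay(unsigned delay_seconds, int (*reload)(void))
{
    sleep(delay_seconds);
    return reload();
}

/*
 * Possible building block for a more elaborate mechanism: reload only
 * once every node reports a locally deployed config version that is at
 * least as new as the target version. How the versions are collected
 * (broadcast, query) is out of scope for this sketch.
 */
static bool all_nodes_at_version(const unsigned *deployed, size_t n,
                                 unsigned target)
{
    assert(deployed != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (deployed[i] < target)
            return false; /* at least one node still has an older config */
    }
    return true;
}
```

With the stopgap, a caller would simply use reload_after_delay(1, ...)
with the real reload function. With the version check, the reload would
be skipped or retried while all_nodes_at_version() returns false, which
avoids the "reload received before config written" ordering described
above, at the cost of extra cluster communication.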