[pve-devel] applied: Re: [PATCH cluster] corosync.conf sync: reload after sleep

Thomas Lamprecht t.lamprecht at proxmox.com
Thu Jul 7 11:38:58 CEST 2022


On 07/07/2022 10:21, Fabian Grünbichler wrote:
> if processing a corosync.conf update is delayed on a single node,
> reloading the config too early can have disastrous results (loss of
> token and HA fence). artificially delay the reload command by one second
> to allow update propagation in most scenarios until a proper solution
> (e.g., using broadcasting/querying of locally deployed config versions)
> has been developed and fully tested.
> 
> reported on the forum:
> https://forum.proxmox.com/threads/expanding-cluster-reboots-all-vms.110903/
> 
> reported issue can be reproduced by deploying a patched pmxcfs on a
> non-reloading node that sleeps before writing out a broadcast
> corosync.conf update and adding a node to the cluster, leading to the
> following sequence of events:
> 
> - corosync config reload command received
> - corosync config update written out
> 
> which causes that particular node to have a different view of cluster
> topology, causing all corosync communication to fail for all nodes until
> corosync on the affected node is restarted (the on-disk config is
> correct after all, just not in effect).
> 
> Signed-off-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
> ---
> tested new cluster creation from scratch, and cluster expansion (on a
> test PVE cluster with HA enabled and running guests, to simulate some
> load).
> 
>  data/src/dcdb.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
>

applied, thanks!

for now this is the simplest stopgap; any more elaborate mechanism is
probably better suited for a major release anyway, upgrade-wise.
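
to illustrate the ordering problem described in the commit message, here is
a minimal, self-contained model in C. it is only a sketch under stated
assumptions: node_state, write_config, reload_config and
node_in_sync_after_update are hypothetical names, not pmxcfs or corosync
APIs, and time is modelled with plain integers instead of real sleeps.

```c
#include <stdbool.h>

/* Hypothetical model of one cluster node; not the pmxcfs data structures. */
typedef struct {
    int on_disk_version;  /* version of corosync.conf written to disk */
    int running_version;  /* version corosync has actually loaded */
} node_state;

/* Write a broadcast config update out to disk. */
static void write_config(node_state *n, int version) {
    n->on_disk_version = version;
}

/* Reload: the running config becomes whatever is on disk right now. */
static void reload_config(node_state *n) {
    n->running_version = n->on_disk_version;
}

/*
 * Replay one config update on a node whose write-out lags the broadcast by
 * write_lag_ms, with the reload command arriving reload_delay_ms after the
 * broadcast. A tie is treated as the bad ordering (assumption, to stay
 * conservative). Returns true if the node ends up running the new version.
 */
static bool node_in_sync_after_update(int write_lag_ms, int reload_delay_ms) {
    node_state n = { .on_disk_version = 1, .running_version = 1 };
    const int new_version = 2;

    if (write_lag_ms < reload_delay_ms) {
        /* good ordering: update written out, then reload received */
        write_config(&n, new_version);
        reload_config(&n);
    } else {
        /* bad ordering: reload received, then update written out */
        reload_config(&n);
        write_config(&n, new_version);
    }
    return n.running_version == new_version;
}
```

with a 200 ms write lag, node_in_sync_after_update(200, 0) is false (the
reload wins the race and the node keeps running the old config, although
the on-disk config is correct), while node_in_sync_after_update(200, 1000)
is true. a lag above the delay, e.g. (1500, 1000), still fails, which is why
the one second delay only covers most scenarios.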




