[pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node
Fabian Grünbichler
f.gruenbichler at proxmox.com
Fri Nov 12 13:59:32 CET 2021
On November 12, 2021 1:14 pm, Thomas Lamprecht wrote:
> On 12.11.21 12:50, Fabian Ebner wrote:
>> On 12.11.21 at 09:45, Fabian Grünbichler wrote:
>>> the recommended way is to first shutdown, then delnode, and never let it
>>> come back online, in which case corosync-cfgtool won't be able to kill
>>> the removed (offline) node.
>>>
>>> also, the order was wrong - if we first update corosync.conf to remove
>>> the node entry from the nodelist, corosync doesn't know about the nodeid
>>> anymore, so killing will fail even if the node is still online.
>>>
>>> Signed-off-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
>>> ---
>>> data/PVE/API2/ClusterConfig.pm | 8 ++++++--
>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/data/PVE/API2/ClusterConfig.pm b/data/PVE/API2/ClusterConfig.pm
>>> index 8f4a5bb..5a6a1ac 100644
>>> --- a/data/PVE/API2/ClusterConfig.pm
>>> +++ b/data/PVE/API2/ClusterConfig.pm
>>> @@ -485,9 +485,13 @@ __PACKAGE__->register_method ({
>>> delete $nodelist->{$node};
>>> - PVE::Corosync::update_nodelist($conf, $nodelist);
>>> + # allowed to fail when node is already shut down!
>>> + eval {
>>> + PVE::Tools::run_command(['corosync-cfgtool','-k', $nodeid])
>>> + if defined($nodeid);
>>> + };
>>>
>>
>> But what if it fails for a different reason than 'CS_ERR_NOT_EXIST'? Shouldn't we match the error?
>
> at least that example is like ENOENT on unlink, an OK error (user could
> have -k'illed it before that).
>
IMHO it's okay to treat all errors as warnings here. if you followed the
instructions, the node is already offline and killing is not possible
anyway. if you didn't follow them and the node is still online, but
killing fails for some reason, you still get the output, and the node is
still removed from corosync.conf on all nodes. from then on no traffic
is possible between the cluster and the separated node, since knet
rejects traffic from unknown nodes (i.e., nodes not contained in the
nodelist). no traffic means the separated node is kicked out of the
quorum, so it can't do any harm anymore ;)
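
to illustrate the "all errors are warnings" idea, here is a minimal,
standalone sketch. it is not the patch itself: run_command_stub and
try_kill_node are hypothetical names, and the stub only stands in for
PVE::Tools::run_command so that the snippet runs without a cluster.

```perl
use strict;
use warnings;

# Hypothetical stand-in for PVE::Tools::run_command (assumption, not the
# real helper): it always dies, as if the node were already shut down.
sub run_command_stub {
    my ($cmd) = @_;
    die "command '" . join(' ', @$cmd) . "' failed: exit code 1\n";
}

# Try to kill the node, but downgrade any failure to a warning.
# Returns 1 if the kill command succeeded, 0 if it failed or was skipped.
sub try_kill_node {
    my ($nodeid, $run) = @_;
    return 0 if !defined($nodeid);

    eval { $run->(['corosync-cfgtool', '-k', $nodeid]); };
    if (my $err = $@) {
        # The failure stays visible to the user, but does not abort.
        warn "could not kill node $nodeid, continuing anyway: $err";
        return 0;
    }
    return 1;
}

my $killed = try_kill_node(2, \&run_command_stub);
print "killed: $killed\n";
# The nodelist update would follow here, regardless of $killed.
```

with the stub above the kill always fails, so the warning goes to
stderr and the script continues to the point where the nodelist would
be updated.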