[pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node

Fabian Grünbichler f.gruenbichler at proxmox.com
Fri Nov 12 13:59:32 CET 2021


On November 12, 2021 1:14 pm, Thomas Lamprecht wrote:
> On 12.11.21 12:50, Fabian Ebner wrote:
>> On 12.11.21 at 09:45, Fabian Grünbichler wrote:
>>> the recommended way is to first shutdown, then delnode, and never let it
>>> come back online, in which case corosync-cfgtool won't be able to kill
>>> the removed (offline) node.
>>>
>>> also, the order was wrong - if we first update corosync.conf to remove
>>> the node entry from the nodelist, corosync doesn't know about the nodeid
>>> anymore, so killing will fail even if the node is still online.
>>>
>>> Signed-off-by: Fabian Grünbichler <f.gruenbichler at proxmox.com>
>>> ---
>>>   data/PVE/API2/ClusterConfig.pm | 8 ++++++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/data/PVE/API2/ClusterConfig.pm b/data/PVE/API2/ClusterConfig.pm
>>> index 8f4a5bb..5a6a1ac 100644
>>> --- a/data/PVE/API2/ClusterConfig.pm
>>> +++ b/data/PVE/API2/ClusterConfig.pm
>>> @@ -485,9 +485,13 @@ __PACKAGE__->register_method ({
>>>             delete $nodelist->{$node};
>>>   -        PVE::Corosync::update_nodelist($conf, $nodelist);
>>> +        # allowed to fail when node is already shut down!
>>> +        eval {
>>> +        PVE::Tools::run_command(['corosync-cfgtool','-k', $nodeid])
>>> +            if defined($nodeid);
>>> +        };
>>>   
>> 
>> But what if it fails for a different reason than 'CS_ERR_NOT_EXIST'? Shouldn't we match the error?
> 
> at least that example is like ENOENT on unlink, an OK error (user could
> have -k'illed it before that).
>

IMHO it's okay to treat all errors as warnings here - if you follow the
instructions, the node is already offline and killing is not possible.
if you didn't follow them and the node is still online, but killing
fails for some reason, you still get the output, and the node is still
removed from corosync.conf on all nodes. from then on no traffic is
possible between the cluster and the separated node (knet will reject
traffic from unknown nodes, i.e., nodes not contained in the nodelist).
no traffic means the separated node is kicked out of the quorum, so it
can't do any harm anymore ;)
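
a minimal sketch of that reasoning, not the code from the patch: the
helper name remove_node_best_effort and the two code refs are made up
for illustration, and the explicit warn is an addition (the patch only
wraps the call in eval). $kill stands in for the corosync-cfgtool call,
$update for the nodelist update.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the patch. $kill and $update are code
# refs standing in for the corosync-cfgtool call and the nodelist update.
sub remove_node_best_effort {
    my ($nodeid, $kill, $update) = @_;

    # Kill first, while corosync still knows the nodeid. Any failure
    # (node already offline, already killed, ...) is only reported.
    if (defined($nodeid)) {
        eval { $kill->($nodeid) };
        warn "killing node $nodeid failed (ignored): $@" if $@;
    }

    # Always continue: removing the node from the nodelist is what
    # actually cuts it off from the rest of the cluster.
    $update->();
    return 1;
}
```

in the API handler, $kill would wrap the run_command call shown in the
diff and $update the existing PVE::Corosync::update_nodelist call.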