[PVE-User] critical HA problem on a PVE6 cluster
Eneko Lacunza
elacunza at binovo.es
Mon May 11 10:39:10 CEST 2020
Hi Hervé,
This looks like a network issue. What is the network setup in this cluster?
What does syslog show for corosync and pve-cluster?
Don't enable HA until you have a stable cluster quorum.
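
A minimal sketch of those two checks in Python, in case it helps. It assumes that the output of `pvecm status` contains a line starting with "Quorate:" and that the relevant syslog lines mention "corosync", "pmxcfs" or "pve-cluster". The function names and the sample log lines are hypothetical, made up for illustration, and not taken from this cluster.

```python
import re


def is_quorate(pvecm_status_text):
    """Return True only if the status text has a 'Quorate: Yes' line.

    Assumption: the text is the captured output of `pvecm status`.
    """
    match = re.search(r"^Quorate:\s*(\S+)", pvecm_status_text, flags=re.MULTILINE)
    return bool(match) and match.group(1).lower() == "yes"


def cluster_log_lines(syslog_lines):
    """Keep only the lines that mention the cluster stack."""
    keywords = ("corosync", "pmxcfs", "pve-cluster")
    return [line for line in syslog_lines if any(k in line for k in keywords)]


# Made-up input, for illustration only.
sample_status = "Quorum information\nNodes:            5\nQuorate:          No\n"
sample_syslog = [
    "May  7 19:00:01 node1 corosync[1234]: hypothetical corosync message",
    "May  7 19:00:02 node1 sshd[4321]: hypothetical unrelated message",
    "May  7 19:00:03 node1 pmxcfs[1111]: hypothetical pmxcfs message",
]

for line in cluster_log_lines(sample_syslog):
    print(line)

if is_quorate(sample_status):
    print("Cluster is quorate")
else:
    print("No quorum: leave HA disabled")
```

On a real node the status text could come from `subprocess.run(["pvecm", "status"], capture_output=True, text=True).stdout`, and the log lines from the syslog file or the journal.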
Cheers
Eneko
On 11/5/20 at 10:35, Herve Ballans wrote:
> Hi everybody,
>
> I would like to take the opportunity at the beginning of this new week
> to raise my issue again.
>
> Does anyone have any idea why such a problem occurred, or is this
> problem really something new?
>
> Thanks again,
> Hervé
>
> On 07/05/2020 18:28, Herve Ballans wrote:
>> Hi all,
>>
>> *Cluster info:*
>>
>> * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
>> * Ceph rbd storage (Nautilus)
>> * In production for many years with no major issues
>> * No specific network problems at the time the problem occurred
>> * Node clocks are synchronized (configured with the same NTP server)
>>
>> *Symptoms:*
>>
>> Suddenly, last night (around 7 PM), all nodes of our cluster seem to
>> have rebooted at the same time for no apparent reason (I mean, we
>> weren't doing anything on it)!
>> During the reboot, the services "Corosync Cluster Engine" and "Proxmox VE
>> replication runner" failed. After the nodes rebooted, we had to
>> start those services manually.
>>
>> Once rebooted with all pve services running, some nodes were in HA lrm
>> status "old timestamp - dead?" while others were in "active" status or
>> in "wait_for_agent_lock" status.
>> Nodes switch states regularly, and it loops back and forth as long
>> as we don't change the configuration.
>>
>> At the same time, the pve-ha-crm service got unexpected errors, for
>> example: "Configuration file
>> 'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even though
>> the file exists, but on another node!
>> Such a message is probably a consequence of the fencing between nodes
>> due to the change of status.
>>
>> *What we have tried until now to stabilize the situation:*
>>
>> After several investigations and several operations that failed
>> to solve anything (in particular a complete upgrade to the latest PVE
>> version, 6.1-11),
>>
>> we finally removed the HA configuration of all the VMs.
>> Since then, the state seems to have stabilized although, obviously, it
>> is not nominal!
>>
>> Now, all the nodes are in HA lrm status "idle" and sometimes switch
>> to the "old timestamp - dead?" state, then come back to the "idle" state.
>> None of them is in the "active" state.
>> Obviously, quorum status is "no quorum".
>>
>> Note that as soon as we try to re-enable HA
>> on the VMs, the problem occurs again (nodes reboot!) :(
>>
>> *Question:*
>>
>> Have you ever experienced such a problem, or do you know a way to
>> restore a correct HA configuration in this case?
>> Note that the nodes are currently on version PVE 6.1-11.
>>
>> I can post some specific logs if useful.
>>
>> Thanks in advance for your help,
>> Hervé
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user at pve.proxmox.com
>> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es