[PVE-User] Cluster disaster

Thomas Lamprecht t.lamprecht at proxmox.com
Thu Nov 10 11:39:53 CET 2016


On 11/09/2016 11:46 PM, Dhaussy Alexandre wrote:
> I had yet another outage...
> BUT now everything is back online! yay!
>
> So I think I had (at least) two problems:
>
> 1 - When installing/upgrading a node.
>
> If the node sees all the SAN storage LUNs before install, the Debian
> partitioner tries to scan all LUNs..
> This causes almost all nodes to reboot (not sure why; maybe it causes
> latency in the LVM cluster, or a problem with a lock somewhere..)
>
> Same thing happens when f*$king os_prober spawns on a kernel upgrade.
> It scans all LVs and causes node reboots. So now I make sure of this in
> /etc/default/grub => GRUB_DISABLE_OS_PROBER=true

Yes, os-prober is _bad_ and may, AFAIK, even corrupt some filesystems under
certain conditions.
The Proxmox VE ISO does not include it for this reason.
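
(For completeness, and just as a sketch: after setting
GRUB_DISABLE_OS_PROBER=true in /etc/default/grub you also need to run

update-grub

so the change actually lands in the generated grub.cfg. Removing the
os-prober package entirely is an option too, if you do not multi-boot on
those nodes.)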


>
> 2 - There seems to be a bug in lrm.
>
> Tonight I have seen timeouts on qmstart tasks in /var/log/pve/tasks/active.
> Just after the timeouts, the LRM was kind of stuck, doing nothing.

If it's doing nothing, it would be interesting to see which state it is in.
If it is already online and active, the watchdog must trigger once it has
been stuck for ~60 seconds or more.
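
If it happens again, it would help to capture the state right away, for
example (just a suggestion, adapt the time range as needed):

systemctl status pve-ha-lrm                  # current LRM state and last log lines
journalctl -u pve-ha-lrm --since "-30 min"   # LRM activity around the hang
cat /etc/pve/ha/manager_status               # what the manager currently thinks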


> Services began to start again after I restarted the service; anyway, a
> few seconds later, the nodes got fenced.

Hmm, this means the watchdog was already running out.

> I think the timeouts are due to a bottleneck in our storage switches; I
> have a few messages like this:
>
> Nov  9 22:34:40 proxmoxt25 kernel: [ 5389.318716] qla2xxx
> [0000:08:00.1]-801c:2: Abort command issued nexus=2:2:28 --  1 2002.
> Nov  9 22:34:41 proxmoxt25 kernel: [ 5390.482259] qla2xxx
> [0000:08:00.1]-801c:2: Abort command issued nexus=2:1:28 --  1 2002.
>
> So when all nodes rebooted, I may have hit the bottleneck, then the LRM
> bug, and all HA services were frozen... (this happened several times.)

Yeah, I looked a bit through the logs of two of your nodes, and it looks like
the system hit quite a few bottlenecks.
The CRM/LRM often run into 'loop took too long' errors, and the cluster
filesystem is sometimes not writable.
Some of the logs also show huge retransmit lists from corosync.

Where does your cluster communication happen? Not on the storage network, I hope?
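
If the corosync traffic shares a link with the storage or VM traffic, it can
get starved exactly when the storage is busy. To check the cluster network
health, something along these lines usually helps (just a sketch, adapt the
node names; omping may need to be installed first):

corosync-cfgtool -s                       # ring/link status as seen by this node
corosync-quorumtool -s                    # quorum state and member view
omping -c 600 -i 1 -q node1 node2 node3   # multicast test between the cluster nodes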


A few general hints:

The HA stack does not like it when somebody moves the VM configs around while
a VM is in the started/migrate state.
If it is stopped, that's OK, as the stack can then fix up the VM location.
Otherwise it cannot simply fix up the location, as it does not know whether
the resource still runs on the (old) node.

Modifying the manager status does not work while a manager is currently
elected.
The manager reads it only on its transition from slave to master, to get the
last state into memory.
After that it only writes it out, so that on a master re-election the new
master has the most current state.

So if something as bad as this happens again, I'd do the following:

If no master election happens, but there is a quorate partition of nodes
and you are sure that their pve-ha-crm service is up and running (else
restart it first), you can try to trigger an instant master re-election by
deleting the old master's lock (which may not yet be invalid through
timeout):
rmdir /etc/pve/priv/lock/ha_manager_lock/

If a master election then happens, you should be fine and the HA stack
will do its work and recover.
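
You can verify that a master got elected with, for example:

ha-manager status                          # should show one node as master
journalctl -u pve-ha-crm --since "-10 min" # should show a status change to master on that node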

If you have to move the VMs, you should disable them first; 'ha-manager
disable SID' handles that quite well even in a lot of problematic situations,
as it just edits the resources.cfg.
If this does not work, you either have no quorum or pve-cluster has a problem,
and both mean HA recovery cannot take place on this node one way or the other.
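
For example, for a VM with ID 100 (vm:100 is just a placeholder for your SID):

ha-manager disable vm:100    # sets the resource to disabled in resources.cfg
ha-manager remove vm:100     # or remove it from HA management completely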



>
> Thanks again for the help.
> Alexandre.
>
> On 09/11/2016 at 20:54, Thomas Lamprecht wrote:
>>
>> On 09.11.2016 18:05, Dhaussy Alexandre wrote:
>>> I have done a cleanup of ressources with echo "" >
>>> /etc/pve/ha/resources.cfg
>>>
>>> It seems to have resolved all problems with inconsistent status of
>>> lrm/lcm in the GUI.
>>>
>> Good. The logs would be interesting, to see what went wrong, but I do not
>> know if I can skim through all of them, as your setup is not exactly small
>> and there may be a lot of noise from the outage in there.
>>
>> If you have time you may sent me the log file(s) generated by:
>>
>> journalctl --since "-2 days" -u corosync -u pve-ha-lrm -u pve-ha-crm
>> -u pve-cluster  > pve-log-$(hostname).log
>>
>> (adapt the "-2 days" accordingly; it also understands something like
>> "-1 day 3 hours")
>>
>> Send them directly to my address (the list does not accept bigger
>> attachments; the limit is something like 20-20 kB AFAIK).
>> I cannot promise any deep examination, but I can skim through them and
>> look at what happened in the HA stack; maybe I'll see something obvious.
>>
>>> A new master has been elected. The manager_status file has been
>>> cleaned up.
>>> All nodes are idle or active.
>>>
>>> I am re-starting all VMs in HA with "ha-manager add".
>>> Seems to work now... :-/
>>>
>>> On 09/11/2016 at 17:40, Dhaussy Alexandre wrote:
>>>> Sorry my old message was too big...
>>>>
>>>> Thanks for the input !...
>>>>
>>>> I have attached manager_status files.
>>>> .old is the original file, and .new is the file I have modified and put
>>>> in /etc/pve/ha.
>>>>
>>>> I know this is bad, but here's what I've done:
>>>>
>>>> - delnode on known NON-working nodes.
>>>> - rm -Rf /etc/pve/nodes/x for all NON-working nodes.
>>>> - replace all NON-working nodes with working nodes in
>>>> /etc/pve/ha/manager_status
>>>> - moved the VM .conf files into the proper node directory
>>>> (/etc/pve/nodes/x/qemu-server/), matching
>>>> /etc/pve/ha/manager_status
>>>> - restart pve-ha-crm and pve-ha-lrm on all nodes
>>>>
>>>> Now on several nodes I have those messages:
>>>>
>>>> nov. 09 17:08:19 proxmoxt34 pve-ha-crm[26200]: status change startup =>
>>>> wait_for_quorum
>>>> nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
>>>> Transport endpoint is not connected
>>>> nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
>>>> Connection refused
>>>> nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
>>>> Connection refused
>>>>
>>
>> This means that something with the cluster filesystem (pve-cluster)
>> was not OK.
>> Those messages weren't there previously?
>>
>>
>>>> nov. 09 17:08:22 proxmoxt34 pve-ha-lrm[26282]: status change startup =>
>>>> wait_for_agent_lock
>>>> nov. 09 17:12:07 proxmoxt34 pve-ha-lrm[26282]: ipcc_send_rec failed:
>>>> Transport endpoint is not connected
>>>>
>>>> We are also investigating a possible network problem..
>>>>
>> Multicast properly working?
>>
>>
>>>> On 09/11/2016 at 17:00, Thomas Lamprecht wrote:
>>>>> Hi,
>>>>>
>>>>> On 09.11.2016 16:29, Dhaussy Alexandre wrote:
>>>>>> I tried to remove them from HA in the GUI, but nothing happens.
>>>>>> There are some services in "error" or "fence" state.
>>>>>>
>>>>>> Now I tried to remove the non-working nodes from the cluster... but I
>>>>>> still see those nodes in /etc/pve/ha/manager_status.
>>>>> Can you post the manager status please?
>>>>>
>>>>> Also, are pve-ha-lrm and pve-ha-crm up and running without any errors
>>>>> on all nodes, at least on those in the quorate partition?
>>>>>
>>>>> check with:
>>>>> systemctl status pve-ha-lrm
>>>>> systemctl status pve-ha-crm
>>>>>
>>>>> If not, restart them, and if it's still problematic please post the
>>>>> output
>>>>> of the systemctl status calls (if it's the same on all nodes, one output
>>>>> should be enough).
>>>>>
>>>>>
>>>>>> On 09/11/2016 at 16:13, Dietmar Maurer wrote:
>>>>>>>> I wanted to remove vms from HA and start the vms locally, but I
>>>>>>>> can’t even do
>>>>>>>> that (nothing happens.)
>>>>> You can remove them from HA by emptying the HA resource file (this also
>>>>> deletes comments and group settings, but if you need to start them _now_
>>>>> that shouldn't be a problem).
>>>>>
>>>>> echo "" > /etc/pve/ha/resources.cfg
>>>>>
>>>>> Afterwards you should be able to start them manually.
>>>>>
>>>>>
>>>>>>> How do you do that exactly (in the GUI)? You should be able to start
>>>>>>> them
>>>>>>> manually afterwards.
>>>>>>>




