[pve-devel] Proxmox 4 feedback

Fri Oct 9 20:27:36 CEST 2015

On 09/10/2015 20:14, Gilou wrote:
> On 09/10/2015 18:36, Gilou wrote:
>> On 09/10/2015 18:21, Dietmar Maurer wrote:
>>>> So I tried again.. HA doesn't work.
>>>> Both resources are now frozen (?), and they didn't restart... Even after
>>>> 5 minutes...
>>>> service vm:102 (pve1, freeze)
>>>> service vm:303 (pve1, freeze)
>>>
>>> The question is why they are frozen. The only action which
>>> puts them into 'freeze' is shutting down a node.
>>>
>>
>> I pulled the ethernet cables out of the to-be-failing node when I
>> tested. It didn't shut down. I plugged them back in 20 minutes later.
>> They were down (so I guess the fencing worked). But still?
>>
> 
> OK, so I did a fresh reinstall of 3 nodes from the PVE 4 ISO. Each uses
> a single NIC to talk to an NFS server and to the other nodes. The
> cluster is up, and one VM is protected:
> # ha-manager status
> quorum OK
> master pve1 (active, Fri Oct  9 19:55:06 2015)
> lrm pve1 (active, Fri Oct  9 19:55:12 2015)
> lrm pve2 (active, Fri Oct  9 19:55:07 2015)
> lrm pve3 (active, Fri Oct  9 19:55:10 2015)
> service vm:100 (pve2, started)
> # pvecm status
> Quorum information
> ------------------
> Date:             Fri Oct  9 19:55:22 2015
> Quorum provider:  corosync_votequorum
> Nodes:            3
> Node ID:          0x00000001
> Ring ID:          12
> Quorate:          Yes
> 
> Votequorum information
> ----------------------
> Expected votes:   3
> Highest expected: 3
> Total votes:      3
> Quorum:           2
> Flags:            Quorate
> 
> Membership information
> ----------------------
>     Nodeid      Votes Name
> 0x00000002          1 192.168.44.129
> 0x00000003          1 192.168.44.132
> 0x00000001          1 192.168.44.143 (local)
> 
> On one of the nodes (incidentally, the one running the HA VM), I already
> get these:
> Oct 09 19:55:07 pve2 pve-ha-lrm[1211]: watchdog update failed - Broken pipe
> 
> Not good.
> I tried to migrate to pve1 to see what happens:
> Executing HA migrate for VM 100 to node pve1
> unable to open file '/etc/pve/ha/crm_commands.tmp.3377' - No such file
> or directory
> TASK ERROR: command 'ha-manager migrate vm:100 pve1' failed: exit code 2
> 
> OK... so we can't migrate running HA VMs? What did I get wrong here?
> So I removed the VM from HA and migrated it to pve1 to see what happens.
> It works. OK. I stopped the VM and enabled HA. It won't start.
> service vm:100 (pve1, freeze)
> 
> OK. And now, on pve1:
> Oct 09 19:59:16 pve1 pve-ha-crm[1202]: watchdog update failed - Broken pipe
> 
> OK... Let's try pve3: cold migrate without HA, then enable HA again...
> Interesting, now we have:
> # ha-manager status
> quorum OK
> master pve1 (active, Fri Oct  9 20:09:46 2015)
> lrm pve1 (old timestamp - dead?, Fri Oct  9 19:58:57 2015)
> lrm pve2 (active, Fri Oct  9 20:09:47 2015)
> lrm pve3 (active, Fri Oct  9 20:09:50 2015)
> service vm:100 (pve3, started)
> 
> Why is pve1 not reporting properly?
> 
> And now on 3 nodes:
> Oct 09 20:10:40 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
> Oct 09 20:10:50 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
> Oct 09 20:11:00 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
> 
> Wtf? omping reports that multicast is getting through, so I'm not sure
> what the issue would be there... It worked on 3.4 on the same physical
> setup. So?
> 
>

Well, I still wanted to see a failover, so I unplugged pve3, which had
the VM. Something happened:

Oct  9 20:18:26 pve1 pve-ha-crm[1202]: node 'pve3': state changed from
'online' => 'unknown'
Oct  9 20:19:16 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
from 'started' to 'fence'
Oct  9 20:19:16 pve1 pve-ha-crm[1202]: node 'pve3': state changed from
'unknown' => 'fence'
Oct  9 20:20:26 pve1 pve-ha-crm[1202]: successfully acquired lock
'ha_agent_pve3_lock'
Oct  9 20:20:26 pve1 pve-ha-crm[1202]: fencing: acknowleged - got agent
lock for node 'pve3'
Oct  9 20:20:26 pve1 pve-ha-crm[1202]: node 'pve3': state changed from
'fence' => 'unknown'
Oct  9 20:20:26 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
from 'fence' to 'stopped'
Oct  9 20:20:36 pve1 pve-ha-crm[1202]: watchdog update failed - Broken pipe
Oct  9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
from 'stopped' to 'started'  (node = pve1)
Oct  9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
from 'started' to 'freeze'
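
To make the sequence easier to follow, here is a minimal sketch that
pulls the state transitions for one service out of a wrapped CRM log
like the one above. It assumes the syslog format shown in this mail;
the helper names (unwrap, transitions) are made up for illustration and
are not part of any PVE tool.

```python
import re

# A few records copied from the log above, still wrapped as the mailer left them.
LOG = """\
Oct  9 20:19:16 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
from 'started' to 'fence'
Oct  9 20:20:26 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
from 'fence' to 'stopped'
Oct  9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
from 'stopped' to 'started'  (node = pve1)
Oct  9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
from 'started' to 'freeze'
"""

# A new record starts with a syslog timestamp such as "Oct  9 20:19:16 ".
STAMP_RE = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")
CHANGE_RE = re.compile(
    r"^(\w{3} +\d+ [\d:]+) .*service '([^']+)': "
    r"state changed from '(\w+)' to '(\w+)'"
)


def unwrap(text):
    """Join continuation lines back onto the record they belong to."""
    records = []
    for line in text.splitlines():
        if STAMP_RE.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def transitions(text, service):
    """Return (timestamp, old_state, new_state) tuples for one service."""
    result = []
    for record in unwrap(text):
        m = CHANGE_RE.match(record)
        if m and m.group(2) == service:
            result.append((m.group(1), m.group(3), m.group(4)))
    return result


if __name__ == "__main__":
    for stamp, old, new in transitions(LOG, "vm:100"):
        print(f"{stamp}: {old} -> {new}")
```

Read this way, vm:100 goes started, fence, stopped, started, and then
freeze within the same second as the restart on pve1.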

OK, frozen. Great.
root@pve1:~# ha-manager status
quorum OK
master pve1 (active, Fri Oct  9 20:23:26 2015)
lrm pve1 (old timestamp - dead?, Fri Oct  9 19:58:57 2015)
lrm pve2 (active, Fri Oct  9 20:23:27 2015)
lrm pve3 (old timestamp - dead?, Fri Oct  9 20:18:10 2015)
service vm:100 (pve1, freeze)

What to do?
(Starting it manually doesn't work either. The only way is to pull it
out of HA... the same circus all over again.)
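
For anyone trying to reproduce this, a rough sketch of how the
`ha-manager status` output could be checked automatically for LRMs that
are not reported as active and for services that are not started. It
only assumes the text layout pasted above; the function name summarize
and the regexes are illustrative, not something shipped with PVE.

```python
import re

# Status output copied from the last `ha-manager status` run above.
SAMPLE = """\
quorum OK
master pve1 (active, Fri Oct  9 20:23:26 2015)
lrm pve1 (old timestamp - dead?, Fri Oct  9 19:58:57 2015)
lrm pve2 (active, Fri Oct  9 20:23:27 2015)
lrm pve3 (old timestamp - dead?, Fri Oct  9 20:18:10 2015)
service vm:100 (pve1, freeze)
"""

# "lrm <node> (<state>, <timestamp>)"; the timestamp itself has no comma.
LRM_RE = re.compile(r"^lrm (\S+) \((.+), ([^,]+)\)$")
# "service <sid> (<node>, <state>)"
SERVICE_RE = re.compile(r"^service (\S+) \((\S+), (\S+)\)$")


def summarize(status_text):
    """Return (nodes whose LRM is not 'active', services not 'started')."""
    stale_lrms = []
    stuck_services = []
    for raw in status_text.splitlines():
        line = raw.strip()
        m = LRM_RE.match(line)
        if m:
            node, state, _timestamp = m.groups()
            if state != "active":
                stale_lrms.append(node)
            continue
        m = SERVICE_RE.match(line)
        if m:
            sid, node, state = m.groups()
            if state != "started":
                stuck_services.append((sid, node, state))
    return stale_lrms, stuck_services


if __name__ == "__main__":
    stale, stuck = summarize(SAMPLE)
    print("LRMs not active:", stale)
    print("services not started:", stuck)
```

On the output above this flags the LRMs on pve1 and pve3 and the frozen
vm:100. To run it against a live node, the text could be fed in from
`ha-manager status` instead of the SAMPLE string.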