[pve-devel] Proxmox 4 feedback
Gilou
contact+dev at gilouweb.com
Fri Oct 9 20:27:36 CEST 2015
On 09/10/2015 20:14, Gilou wrote:
> On 09/10/2015 18:36, Gilou wrote:
>> On 09/10/2015 18:21, Dietmar Maurer wrote:
>>>> So I tried again.. HA doesn't work.
>>>> Both resources are now frozen (?), and they didn't restart... Even after
>>>> 5 minutes...
>>>> service vm:102 (pve1, freeze)
>>>> service vm:303 (pve1, freeze)
>>>
>>> The question is why they are frozen. The only action which
>>> puts them to 'freeze' is when you shutdown a node.
>>>
>>
>> I pulled the ethernet cables out of the to-be-failing node when I
>> tested. It didn't shut down. I plugged them back in 20 minutes later.
>> They were down (so I guess the fencing worked). But still?
>>
>
> OK, so I did a fresh install from the PVE 4 ISO on 3 nodes, which use
> one single NIC to communicate with an NFS server and with each other.
> The cluster is up, and one VM is protected:
> # ha-manager status
> quorum OK
> master pve1 (active, Fri Oct 9 19:55:06 2015)
> lrm pve1 (active, Fri Oct 9 19:55:12 2015)
> lrm pve2 (active, Fri Oct 9 19:55:07 2015)
> lrm pve3 (active, Fri Oct 9 19:55:10 2015)
> service vm:100 (pve2, started)
> # pvecm status
> Quorum information
> ------------------
> Date: Fri Oct 9 19:55:22 2015
> Quorum provider: corosync_votequorum
> Nodes: 3
> Node ID: 0x00000001
> Ring ID: 12
> Quorate: Yes
>
> Votequorum information
> ----------------------
> Expected votes: 3
> Highest expected: 3
> Total votes: 3
> Quorum: 2
> Flags: Quorate
>
> Membership information
> ----------------------
> Nodeid Votes Name
> 0x00000002 1 192.168.44.129
> 0x00000003 1 192.168.44.132
> 0x00000001 1 192.168.44.143 (local)
>
> On one of the nodes, incidentally the one running the HA VM, I already
> get these:
> Oct 09 19:55:07 pve2 pve-ha-lrm[1211]: watchdog update failed - Broken pipe
>
> Not good.
> I tried to migrate to pve1 to see what happens:
> Executing HA migrate for VM 100 to node pve1
> unable to open file '/etc/pve/ha/crm_commands.tmp.3377' - No such file or directory
> TASK ERROR: command 'ha-manager migrate vm:100 pve1' failed: exit code 2
>
> OK... so we can't migrate running HA VMs? What did I get wrong here?
> So I remove the VM from HA and migrate it to pve1 to see what happens. It
> works. OK. I stop the VM. Enable HA. It won't start.
> service vm:100 (pve1, freeze)
>
> OK. And now, on pve1:
> Oct 09 19:59:16 pve1 pve-ha-crm[1202]: watchdog update failed - Broken pipe
>
> OK... Let's try pve3: cold migrate without HA, then enable HA again...
> Interesting, now we have:
> # ha-manager status
> quorum OK
> master pve1 (active, Fri Oct 9 20:09:46 2015)
> lrm pve1 (old timestamp - dead?, Fri Oct 9 19:58:57 2015)
> lrm pve2 (active, Fri Oct 9 20:09:47 2015)
> lrm pve3 (active, Fri Oct 9 20:09:50 2015)
> service vm:100 (pve3, started)
>
> Why is pve1 not reporting properly...
>
> And now on 3 nodes:
> Oct 09 20:10:40 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
> Oct 09 20:10:50 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
> Oct 09 20:11:00 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
>
> Wtf? omping reports multicast is getting through, so I'm not sure what
> the issue would be there... It worked on 3.4 on the same physical setup.
> So?
>
>
Well, I still wanted to see a failover, so I unplugged pve3, which
had the VM. Something happened:
Oct 9 20:18:26 pve1 pve-ha-crm[1202]: node 'pve3': state changed from 'online' => 'unknown'
Oct 9 20:19:16 pve1 pve-ha-crm[1202]: service 'vm:100': state changed from 'started' to 'fence'
Oct 9 20:19:16 pve1 pve-ha-crm[1202]: node 'pve3': state changed from 'unknown' => 'fence'
Oct 9 20:20:26 pve1 pve-ha-crm[1202]: successfully acquired lock 'ha_agent_pve3_lock'
Oct 9 20:20:26 pve1 pve-ha-crm[1202]: fencing: acknowleged - got agent lock for node 'pve3'
Oct 9 20:20:26 pve1 pve-ha-crm[1202]: node 'pve3': state changed from 'fence' => 'unknown'
Oct 9 20:20:26 pve1 pve-ha-crm[1202]: service 'vm:100': state changed from 'fence' to 'stopped'
Oct 9 20:20:36 pve1 pve-ha-crm[1202]: watchdog update failed - Broken pipe
Oct 9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed from 'stopped' to 'started' (node = pve1)
Oct 9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed from 'started' to 'freeze'
OK, frozen. great.
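For reference, the gaps between the CRM log entries above can be worked out with a few lines of Python. This is only a sketch: the timestamps are copied from the pve1 log, while the event labels and the helper name `elapsed_seconds` are my own shorthand, not anything PVE provides.

```python
from datetime import datetime

# Timestamps copied from the pve1 pve-ha-crm log; labels are shorthand.
EVENTS = [
    ("20:18:26", "node pve3: online => unknown"),
    ("20:19:16", "node pve3: unknown => fence, vm:100: started => fence"),
    ("20:20:26", "got ha_agent_pve3_lock, vm:100: fence => stopped"),
    ("20:20:36", "vm:100: stopped => started => freeze"),
]


def elapsed_seconds(events):
    """Return the number of seconds between each pair of consecutive events."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in events]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    gaps = elapsed_seconds(EVENTS)
    for (stamp, label), gap in zip(EVENTS[1:], gaps):
        print(f"+{gap:>3}s  {stamp}  {label}")
    print("total:", sum(gaps), "s")
```

That is 50 s from 'unknown' to 'fence', another 70 s until the agent lock is acquired, and 10 s more until the service is restarted on pve1 and immediately frozen.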
root@pve1:~# ha-manager status
quorum OK
master pve1 (active, Fri Oct 9 20:23:26 2015)
lrm pve1 (old timestamp - dead?, Fri Oct 9 19:58:57 2015)
lrm pve2 (active, Fri Oct 9 20:23:27 2015)
lrm pve3 (old timestamp - dead?, Fri Oct 9 20:18:10 2015)
service vm:100 (pve1, freeze)
What to do?
(Starting it manually doesn't work either. The only way is to pull it
out of HA... the same circus all over again.)
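In case it helps anyone reproduce this, here is a rough sketch of how the two symptoms (LRMs that are not 'active' and services stuck in 'freeze') could be picked out of `ha-manager status` output. It assumes the text format shown in this message; the regular expressions and the function name `find_problems` are my own and not part of any PVE tool.

```python
import re

# Assumed line formats, taken from the `ha-manager status` output in this thread:
#   lrm <node> (<state>, <timestamp>)
#   service <sid> (<node>, <state>)
LRM_RE = re.compile(r"^lrm\s+(\S+)\s+\((.+),\s*([^,]+)\)$")
SERVICE_RE = re.compile(r"^service\s+(\S+)\s+\((\S+),\s*(\S+)\)$")


def find_problems(status_text):
    """Return (nodes whose LRM is not 'active', [(sid, node)] of frozen services)."""
    inactive_lrms = []
    frozen_services = []
    for raw_line in status_text.splitlines():
        line = raw_line.strip()
        match = LRM_RE.match(line)
        if match:
            node, state, _timestamp = match.groups()
            if state != "active":
                inactive_lrms.append(node)
            continue
        match = SERVICE_RE.match(line)
        if match:
            sid, node, state = match.groups()
            if state == "freeze":
                frozen_services.append((sid, node))
    return inactive_lrms, frozen_services


# Sample copied from the last status output in this message.
SAMPLE = """\
quorum OK
master pve1 (active, Fri Oct 9 20:23:26 2015)
lrm pve1 (old timestamp - dead?, Fri Oct 9 19:58:57 2015)
lrm pve2 (active, Fri Oct 9 20:23:27 2015)
lrm pve3 (old timestamp - dead?, Fri Oct 9 20:18:10 2015)
service vm:100 (pve1, freeze)
"""

if __name__ == "__main__":
    inactive, frozen = find_problems(SAMPLE)
    print("LRMs not active:", inactive)
    print("frozen services:", frozen)
```

On the sample above this flags the LRMs on pve1 and pve3 and the frozen vm:100 on pve1, which matches what I see by eye.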