[pve-devel] Proxmox 4 feedback
Gilou
contact+dev at gilouweb.com
Fri Oct 9 20:44:30 CEST 2015
On 09/10/2015 20:27, Gilou wrote:
> On 09/10/2015 20:14, Gilou wrote:
>> On 09/10/2015 18:36, Gilou wrote:
>>> On 09/10/2015 18:21, Dietmar Maurer wrote:
>>>>> So I tried again... HA doesn't work.
>>>>> Both resources are now frozen (?), and they didn't restart... even
>>>>> after 5 minutes...
>>>>> service vm:102 (pve1, freeze)
>>>>> service vm:303 (pve1, freeze)
>>>>
>>>> The question is why they are frozen. The only action which
>>>> puts them to 'freeze' is when you shut down a node.
>>>>
>>>
>>> I pulled the Ethernet cables out of the to-be-failed node when I
>>> tested. It didn't shut down. I plugged them back in 20 minutes later.
>>> They were down (so I guess the fencing worked). But still?
>>>
>>
>> OK, so I reinstalled 3 nodes fresh from the PVE 4 ISO; they use a
>> single NIC to communicate with an NFS server and with each other. The
>> cluster is up, and one VM is protected:
>> # ha-manager status
>> quorum OK
>> master pve1 (active, Fri Oct 9 19:55:06 2015)
>> lrm pve1 (active, Fri Oct 9 19:55:12 2015)
>> lrm pve2 (active, Fri Oct 9 19:55:07 2015)
>> lrm pve3 (active, Fri Oct 9 19:55:10 2015)
>> service vm:100 (pve2, started)
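>> (For completeness, putting the VM under HA from the CLI is roughly
>> the following; syntax from memory, so double-check it:)
>> # ha-manager add vm:100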
>> # pvecm status
>> Quorum information
>> ------------------
>> Date: Fri Oct 9 19:55:22 2015
>> Quorum provider: corosync_votequorum
>> Nodes: 3
>> Node ID: 0x00000001
>> Ring ID: 12
>> Quorate: Yes
>>
>> Votequorum information
>> ----------------------
>> Expected votes: 3
>> Highest expected: 3
>> Total votes: 3
>> Quorum: 2
>> Flags: Quorate
>>
>> Membership information
>> ----------------------
>> Nodeid Votes Name
>> 0x00000002 1 192.168.44.129
>> 0x00000003 1 192.168.44.132
>> 0x00000001 1 192.168.44.143 (local)
>>
>> On one of the nodes, incidentally the one running the HA VM, I already
>> get these:
>> Oct 09 19:55:07 pve2 pve-ha-lrm[1211]: watchdog update failed - Broken pipe
>>
>> Not good.
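>> (Side note: the LRM keeps the watchdog alive through the watchdog-mux
>> service, so when I see "Broken pipe" my first reflex is to check that
>> mux and its socket; socket path quoted from memory, so treat it as a
>> guess:)
>> # systemctl status watchdog-mux
>> # ls -l /run/watchdog-mux.sock
>> # journalctl -u watchdog-mux --since today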
>> I tried to migrate it to pve1 to see what happens:
>> Executing HA migrate for VM 100 to node pve1
>> unable to open file '/etc/pve/ha/crm_commands.tmp.3377' - No such file or directory
>> TASK ERROR: command 'ha-manager migrate vm:100 pve1' failed: exit code 2
>>
>> OK... so we can't migrate running HA VMs? What did I get wrong here?
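>> (For what it's worth, /etc/pve/ha/ lives on the pmxcfs, so a file that
>> cannot be created there makes me want to check the cluster filesystem
>> first; just the basics:)
>> # systemctl status pve-cluster
>> # ls -l /etc/pve/ha/
>> # touch /etc/pve/ha/test && rm /etc/pve/ha/test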
>> So I remove the VM from HA, migrate it to pve1, and see what happens.
>> It works. OK. I stop the VM and enable HA. It won't start.
>> service vm:100 (pve1, freeze)
>>
>> OK. And now, on pve1:
>> Oct 09 19:59:16 pve1 pve-ha-crm[1202]: watchdog update failed - Broken pipe
>>
>> OK... Let's try pve3: cold migrate, without HA, enable HA again...
>> Interesting, now we have:
>> # ha-manager status
>> quorum OK
>> master pve1 (active, Fri Oct 9 20:09:46 2015)
>> lrm pve1 (old timestamp - dead?, Fri Oct 9 19:58:57 2015)
>> lrm pve2 (active, Fri Oct 9 20:09:47 2015)
>> lrm pve3 (active, Fri Oct 9 20:09:50 2015)
>> service vm:100 (pve3, started)
>>
>> Why is pve1 not reporting properly...
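>> (If nothing else, restarting the local resource manager on pve1 and
>> tailing its journal should show whether it comes back; nothing fancier
>> in mind than:)
>> # systemctl restart pve-ha-lrm
>> # journalctl -u pve-ha-lrm -f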
>>
>> And now on 3 nodes:
>> Oct 09 20:10:40 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
>> Oct 09 20:10:50 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
>> Oct 09 20:11:00 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
>>
>> WTF? omping reports that multicast is getting through, but I'm not
>> sure what the issue would be there... It worked on 3.4 on the same
>> physical setup. So?
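>> (Since omping itself looks fine, the next thing I want to correlate is
>> corosync's own view of the ring; the obvious checks being:)
>> # corosync-cfgtool -s
>> # journalctl -u corosync --since today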
>>
>>
>
> Well, then I still tried to see some failover, so I unplugged pve3,
> which had the VM, and something happened:
>
> Oct 9 20:18:26 pve1 pve-ha-crm[1202]: node 'pve3': state changed from 'online' => 'unknown'
> Oct 9 20:19:16 pve1 pve-ha-crm[1202]: service 'vm:100': state changed from 'started' to 'fence'
> Oct 9 20:19:16 pve1 pve-ha-crm[1202]: node 'pve3': state changed from 'unknown' => 'fence'
> Oct 9 20:20:26 pve1 pve-ha-crm[1202]: successfully acquired lock 'ha_agent_pve3_lock'
> Oct 9 20:20:26 pve1 pve-ha-crm[1202]: fencing: acknowleged - got agent lock for node 'pve3'
> Oct 9 20:20:26 pve1 pve-ha-crm[1202]: node 'pve3': state changed from 'fence' => 'unknown'
> Oct 9 20:20:26 pve1 pve-ha-crm[1202]: service 'vm:100': state changed from 'fence' to 'stopped'
> Oct 9 20:20:36 pve1 pve-ha-crm[1202]: watchdog update failed - Broken pipe
> Oct 9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed from 'stopped' to 'started' (node = pve1)
> Oct 9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed from 'started' to 'freeze'
>
> OK, frozen. Great.
> root at pve1:~# ha-manager status
> quorum OK
> master pve1 (active, Fri Oct 9 20:23:26 2015)
> lrm pve1 (old timestamp - dead?, Fri Oct 9 19:58:57 2015)
> lrm pve2 (active, Fri Oct 9 20:23:27 2015)
> lrm pve3 (old timestamp - dead?, Fri Oct 9 20:18:10 2015)
> service vm:100 (pve1, freeze)
>
> What to do?
> (Then starting it manually doesn't work... the only way is to pull it
> out of HA... and it's the same circus all over again.)
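(For reference, "pulling it out of HA" boils down to roughly "ha-manager
remove vm:100", and re-adding it later with "ha-manager add vm:100";
syntax from memory.)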
As far as multicast goes:
% ansible -a "omping -m 239.192.6.92 -c 10000 -i 0.001 -F -q pve1 pve2 pve3" -f 3 -i 'pve1,pve2,pve3' all -u root
pve3 | success | rc=0 >>
pve1 : waiting for response msg
pve2 : waiting for response msg
pve1 : joined (S,G) = (*, 239.192.6.92), pinging
pve2 : joined (S,G) = (*, 239.192.6.92), pinging
pve1 : given amount of query messages was sent
pve2 : given amount of query messages was sent
pve1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.084/0.145/0.652/0.029
pve1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.085/0.149/0.666/0.030
pve2 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.086/0.147/0.300/0.029
pve2 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.087/0.151/0.301/0.029
pve2 | success | rc=0 >>
pve1 : waiting for response msg
pve3 : waiting for response msg
pve1 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : waiting for response msg
pve3 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : waiting for response msg
pve3 : server told us to stop
pve1 : given amount of query messages was sent
pve1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.071/0.149/0.637/0.032
pve1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.090/0.154/0.638/0.034
pve3 : unicast, xmt/rcv/%loss = 8664/8664/0%, min/avg/max/std-dev = 0.087/0.149/0.947/0.033
pve3 : multicast, xmt/rcv/%loss = 8664/8664/0%, min/avg/max/std-dev = 0.092/0.154/0.948/0.033
pve1 | success | rc=0 >>
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : waiting for response msg
pve3 : server told us to stop
pve2 : waiting for response msg
pve2 : server told us to stop
pve2 : unicast, xmt/rcv/%loss = 8540/8540/0%, min/avg/max/std-dev = 0.080/0.149/0.312/0.030
pve2 : multicast, xmt/rcv/%loss = 8540/8540/0%, min/avg/max/std-dev = 0.091/0.153/0.325/0.031
pve3 : unicast, xmt/rcv/%loss = 8141/8141/0%, min/avg/max/std-dev = 0.089/0.148/0.980/0.032
pve3 : multicast, xmt/rcv/%loss = 8141/8141/0%, min/avg/max/std-dev = 0.091/0.154/0.994/0.032
And for 10 minutes:
% ansible -a "omping -c 600 -i 1 -q pve1 pve2 pve3" -f 3 -i 'pve1,pve2,pve3' all -u root
pve2 | success | rc=0 >>
pve1 : waiting for response msg
pve3 : waiting for response msg
pve3 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : given amount of query messages was sent
pve3 : given amount of query messages was sent
pve1 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.108/0.215/0.343/0.046
pve1 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.119/0.222/0.346/0.048
pve3 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.098/0.221/0.355/0.049
pve3 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.118/0.226/0.370/0.050
pve1 | success | rc=0 >>
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : joined (S,G) = (*, 232.43.211.234), pinging
pve3 : joined (S,G) = (*, 232.43.211.234), pinging
pve2 : given amount of query messages was sent
pve3 : given amount of query messages was sent
pve2 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.107/0.221/0.343/0.050
pve2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.110/0.227/0.344/0.052
pve3 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.098/0.224/0.328/0.050
pve3 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.114/0.229/0.335/0.050
pve3 | success | rc=0 >>
pve1 : waiting for response msg
pve2 : waiting for response msg
pve1 : joined (S,G) = (*, 232.43.211.234), pinging
pve2 : waiting for response msg
pve2 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : given amount of query messages was sent
pve2 : waiting for response msg
pve2 : server told us to stop
pve1 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.113/0.213/0.335/0.048
pve1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.114/0.220/0.347/0.052
pve2 : unicast, xmt/rcv/%loss = 599/599/0%, min/avg/max/std-dev = 0.111/0.210/0.320/0.048
pve2 : multicast, xmt/rcv/%loss = 599/599/0%, min/avg/max/std-dev = 0.115/0.216/0.332/0.049
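Multicast therefore looks clean on both runs. If anyone wants more data
while I'm away, the things I'd gather next (plain standard commands,
nothing fancy) are:
# corosync-quorumtool -s
# lsmod | grep softdog
# journalctl -u corosync -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm --since today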
I'm sad! And I'm leaving for the weekend. My lab should stay around for
a while, but this doesn't look good :(
Cheers
Gilles