[pve-devel] Proxmox 4 feedback

Gilou contact+dev at gilouweb.com
Fri Oct 9 20:44:30 CEST 2015


On 09/10/2015 20:27, Gilou wrote:
> On 09/10/2015 20:14, Gilou wrote:
>> On 09/10/2015 18:36, Gilou wrote:
>>> On 09/10/2015 18:21, Dietmar Maurer wrote:
>>>>> So I tried again.. HA doesn't work.
>>>>> Both resources are now frozen (?), and they didn't restart... Even after
>>>>> 5 minutes...
>>>>> service vm:102 (pve1, freeze)
>>>>> service vm:303 (pve1, freeze)
>>>>
>>>> The question is why they are frozen. The only action which 
>>>> puts them into 'freeze' is when you shut down a node.
>>>>
>>>
>>> I pulled the Ethernet cables out of the to-be-failed node when I
>>> tested. It didn't shut down. I plugged them back in 20 minutes later.
>>> They were down (so I guess the fencing worked). But still?
>>>
>>
>> OK, so I did a fresh reinstall of 3 nodes from the PVE 4 ISO; they use
>> a single NIC to communicate with an NFS server and with each other.
>> The cluster is up, and one VM is protected:
>> # ha-manager status
>> quorum OK
>> master pve1 (active, Fri Oct  9 19:55:06 2015)
>> lrm pve1 (active, Fri Oct  9 19:55:12 2015)
>> lrm pve2 (active, Fri Oct  9 19:55:07 2015)
>> lrm pve3 (active, Fri Oct  9 19:55:10 2015)
>> service vm:100 (pve2, started)
>> # pvecm status
>> Quorum information
>> ------------------
>> Date:             Fri Oct  9 19:55:22 2015
>> Quorum provider:  corosync_votequorum
>> Nodes:            3
>> Node ID:          0x00000001
>> Ring ID:          12
>> Quorate:          Yes
>>
>> Votequorum information
>> ----------------------
>> Expected votes:   3
>> Highest expected: 3
>> Total votes:      3
>> Quorum:           2
>> Flags:            Quorate
>>
>> Membership information
>> ----------------------
>>     Nodeid      Votes Name
>> 0x00000002          1 192.168.44.129
>> 0x00000003          1 192.168.44.132
>> 0x00000001          1 192.168.44.143 (local)
>>
>> On one of the nodes (incidentally, the one running the HA VM) I already
>> get these:
>> Oct 09 19:55:07 pve2 pve-ha-lrm[1211]: watchdog update failed - Broken pipe
>>
>> Not good.
>> I tried to migrate it to pve1 to see what happens:
>> Executing HA migrate for VM 100 to node pve1
>> unable to open file '/etc/pve/ha/crm_commands.tmp.3377' - No such file
>> or directory
>> TASK ERROR: command 'ha-manager migrate vm:100 pve1' failed: exit code 2
>>
>> OK... so we can't migrate running HA VMs? What did I get wrong here?
>> So I remove the VM from HA and migrate it to pve1 to see what happens.
>> It works. OK. I stop the VM and enable HA again. It won't start:
>> service vm:100 (pve1, freeze)
>>
>> OK. And now, on pve1:
>> Oct 09 19:59:16 pve1 pve-ha-crm[1202]: watchdog update failed - Broken pipe
>>
>> OK... Let's try pve3: cold migrate without HA, then enable HA again.
>> Interesting, now we have:
>> # ha-manager status
>> quorum OK
>> master pve1 (active, Fri Oct  9 20:09:46 2015)
>> lrm pve1 (old timestamp - dead?, Fri Oct  9 19:58:57 2015)
>> lrm pve2 (active, Fri Oct  9 20:09:47 2015)
>> lrm pve3 (active, Fri Oct  9 20:09:50 2015)
>> service vm:100 (pve3, started)
>>
>> Why is pve1 not reporting properly?
>>
>> And now on 3 nodes:
>> Oct 09 20:10:40 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
>> Oct 09 20:10:50 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
>> Oct 09 20:11:00 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
>>
>> WTF? omping reports multicast is getting through, but I'm not sure what
>> the issue would be there... It worked on 3.4 on the same physical setup.
>> So?
>>
>>
> 
> Well, I still tried to see some failover, so I unplugged pve3, which
> had the VM. Something happened:
> 
> Oct  9 20:18:26 pve1 pve-ha-crm[1202]: node 'pve3': state changed from
> 'online' => 'unknown'
> Oct  9 20:19:16 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
> from 'started' to 'fence'
> Oct  9 20:19:16 pve1 pve-ha-crm[1202]: node 'pve3': state changed from
> 'unknown' => 'fence'
> Oct  9 20:20:26 pve1 pve-ha-crm[1202]: successfully acquired lock
> 'ha_agent_pve3_lock'
> Oct  9 20:20:26 pve1 pve-ha-crm[1202]: fencing: acknowleged - got agent
> lock for node 'pve3'
> Oct  9 20:20:26 pve1 pve-ha-crm[1202]: node 'pve3': state changed from
> 'fence' => 'unknown'
> Oct  9 20:20:26 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
> from 'fence' to 'stopped'
> Oct  9 20:20:36 pve1 pve-ha-crm[1202]: watchdog update failed - Broken pipe
> Oct  9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
> from 'stopped' to 'started'  (node = pve1)
> Oct  9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
> from 'started' to 'freeze'
> 
> OK, frozen. Great.
> root at pve1:~# ha-manager status
> quorum OK
> master pve1 (active, Fri Oct  9 20:23:26 2015)
> lrm pve1 (old timestamp - dead?, Fri Oct  9 19:58:57 2015)
> lrm pve2 (active, Fri Oct  9 20:23:27 2015)
> lrm pve3 (old timestamp - dead?, Fri Oct  9 20:18:10 2015)
> service vm:100 (pve1, freeze)
> 
> What to do?
> (Then starting it manually doesn't work... the only way is to pull it
> out of HA... the same circus all over again.)

As far as multicast goes:
% ansible -a "omping -m 239.192.6.92 -c 10000 -i 0.001 -F -q pve1 pve2
pve3" -f 3 -i 'pve1,pve2,pve3' all -u root
pve3 | success | rc=0 >>
pve1 : waiting for response msg
pve2 : waiting for response msg
pve1 : joined (S,G) = (*, 239.192.6.92), pinging
pve2 : joined (S,G) = (*, 239.192.6.92), pinging
pve1 : given amount of query messages was sent
pve2 : given amount of query messages was sent

pve1 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev =
0.084/0.145/0.652/0.029
pve1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev =
0.085/0.149/0.666/0.030
pve2 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev =
0.086/0.147/0.300/0.029
pve2 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev =
0.087/0.151/0.301/0.029

pve2 | success | rc=0 >>
pve1 : waiting for response msg
pve3 : waiting for response msg
pve1 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : waiting for response msg
pve3 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : waiting for response msg
pve3 : server told us to stop
pve1 : given amount of query messages was sent

pve1 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev =
0.071/0.149/0.637/0.032
pve1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev =
0.090/0.154/0.638/0.034
pve3 :   unicast, xmt/rcv/%loss = 8664/8664/0%, min/avg/max/std-dev =
0.087/0.149/0.947/0.033
pve3 : multicast, xmt/rcv/%loss = 8664/8664/0%, min/avg/max/std-dev =
0.092/0.154/0.948/0.033

pve1 | success | rc=0 >>
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : waiting for response msg
pve3 : server told us to stop
pve2 : waiting for response msg
pve2 : server told us to stop

pve2 :   unicast, xmt/rcv/%loss = 8540/8540/0%, min/avg/max/std-dev =
0.080/0.149/0.312/0.030
pve2 : multicast, xmt/rcv/%loss = 8540/8540/0%, min/avg/max/std-dev =
0.091/0.153/0.325/0.031
pve3 :   unicast, xmt/rcv/%loss = 8141/8141/0%, min/avg/max/std-dev =
0.089/0.148/0.980/0.032
pve3 : multicast, xmt/rcv/%loss = 8141/8141/0%, min/avg/max/std-dev =
0.091/0.154/0.994/0.032

And for 10 minutes...
% ansible -a "omping -c 600 -i 1 -q pve1 pve2 pve3" -f 3 -i
'pve1,pve2,pve3' all -u root
pve2 | success | rc=0 >>
pve1 : waiting for response msg
pve3 : waiting for response msg
pve3 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : given amount of query messages was sent
pve3 : given amount of query messages was sent

pve1 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev =
0.108/0.215/0.343/0.046
pve1 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%),
min/avg/max/std-dev = 0.119/0.222/0.346/0.048
pve3 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev =
0.098/0.221/0.355/0.049
pve3 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%),
min/avg/max/std-dev = 0.118/0.226/0.370/0.050

pve1 | success | rc=0 >>
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : joined (S,G) = (*, 232.43.211.234), pinging
pve3 : joined (S,G) = (*, 232.43.211.234), pinging
pve2 : given amount of query messages was sent
pve3 : given amount of query messages was sent

pve2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev =
0.107/0.221/0.343/0.050
pve2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev =
0.110/0.227/0.344/0.052
pve3 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev =
0.098/0.224/0.328/0.050
pve3 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev =
0.114/0.229/0.335/0.050

pve3 | success | rc=0 >>
pve1 : waiting for response msg
pve2 : waiting for response msg
pve1 : joined (S,G) = (*, 232.43.211.234), pinging
pve2 : waiting for response msg
pve2 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : given amount of query messages was sent
pve2 : waiting for response msg
pve2 : server told us to stop

pve1 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev =
0.113/0.213/0.335/0.048
pve1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev =
0.114/0.220/0.347/0.052
pve2 :   unicast, xmt/rcv/%loss = 599/599/0%, min/avg/max/std-dev =
0.111/0.210/0.320/0.048
pve2 : multicast, xmt/rcv/%loss = 599/599/0%, min/avg/max/std-dev =
0.115/0.216/0.332/0.049

I'm sad! And I'm leaving for the weekend. My lab should stay around for
a while, but this is not looking good :(

Cheers
Gilles
