[PVE-User] VM network disconnect issue after upgrade to PVE 6.1

Eneko Lacunza elacunza at binovo.es
Thu Feb 20 14:47:50 CET 2020


Hi Gianni,

El 20/2/20 a las 13:48, Gianni Milo escribió:
> See comments below...
Thanks for the comments!
>> vmbr0 is on a 2x1Gbit bond0
>> Ceph public and private are on 2x10Gbit bond2
>> Backup network is IPv6 on 2x1Gbit bond1, to a Synology NAS.
>>
> Where's the cluster (corosync) traffic flowing ? On vmbr0 ? Would be a good
> idea to split that as well if possible (perhaps by using a different VLAN?).
Yes, it's on vmbr0 (bond0). We haven't noticed any cluster issues; the 
VMs don't have much network traffic.
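Splitting it off is still worth considering, though. If we do it, I'd 
probably try something along these lines; this is only a rough sketch, 
and the VLAN id 40 and the 10.10.40.0/24 subnet are made-up examples, 
nothing is configured yet:

    # /etc/network/interfaces on each node - dedicated VLAN for corosync on bond0
    auto bond0.40
    iface bond0.40 inet static
            address 10.10.40.11/24    # .12 / .13 on the other nodes

and then point each node's ring0_addr in /etc/pve/corosync.conf at the 
new subnet (bumping config_version), so corosync no longer shares 
vmbr0 with VM and backup traffic.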
>
>> We think that the backups may be the issue; until yesterday backups were
>> done over vmbr0 with IPv4; as they nearly saturated the 1Gbit link, we
>> changed the network and storage configuration so that backup NAS access
>> was done over bond1, as it wasn't used previously. We're using IPv6 now
>> because Synology can't configure two IPv4 on a bond from the GUI.
>>
> Using a separate network for the backup traffic should always help, that's
> a good decision.
> I'm having difficulties understanding why you had to configure 2 IPv4
> addresses on a single bond, why you need 2 of them?
The idea was to use a different subnet for backup traffic and to 
configure it on bond1 of the Proxmox nodes, so that bond1 was used 
instead of bond0 for backups. As the NAS didn't support that (it only 
has one bond), we switched to IPv6 on bond1 for the backup traffic 
(it's really a routing issue; a different subnet just makes it easy). 
The site is remote, so I didn't want to lose the NAS's current network 
config.
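For reference, the setup now looks roughly like this; the addresses 
and names below are examples rather than the real ones:

    # /etc/network/interfaces - IPv6 address for backup traffic on bond1
    iface bond1 inet6 static
            address fd00:bac::11/64

    # /etc/pve/storage.cfg - NFS backup storage on the Synology, reached via bond1
    nfs: backup-nas
            export /volume1/pve-backup
            path /mnt/pve/backup-nas
            server fd00:bac::100
            content backup

That keeps the NAS's existing IPv4 configuration untouched.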
>> But it seems the issue has happened again tonight (SQL Server connection
>> drop). The VM has network connectivity in the morning, so it isn't a
>> permanent problem.
>>
> Do the affected VMs listen to the vmbr0 network for "outside" communication
> ? Is that the interface where the SQL server is accepting the connections
> from ?
Yes, we only have vmbr0 (on bond0) in the cluster; all VMs are 
connected to it, and that is also where the outside world comes in.
>> We tried running the main VM backup yesterday morning, but couldn't
>> reproduce the issue, although during the regular backup all 3 nodes are
>> doing backups, and in the test we only backed up the single VM stored
>> on the SSD pool.
>>
>>
> How about reducing (or scheduling at different times) the backup jobs on
> each node, at least for testing if the backup is causing the problem.
I'll check with the site admin about this. I hadn't really thought 
about it, but it could help determine whether the backups are the 
issue, thanks!
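If we stagger them, my understanding is that it just means separate 
jobs with different start times in Datacenter -> Backup, which end up 
in /etc/pve/vzdump.cron roughly like this (node names, times and 
options are only examples):

    # /etc/pve/vzdump.cron - one job per node, staggered so they don't overlap
    0 22 * * 1-5  root vzdump --node pve1 --all --mode snapshot --storage backup-nas --quiet 1
    0 1  * * 2-6  root vzdump --node pve2 --all --mode snapshot --storage backup-nas --quiet 1
    0 4  * * 2-6  root vzdump --node pve3 --all --mode snapshot --storage backup-nas --quiet 1

If the disconnects then only happen while a particular node's job is 
running, that would narrow things down a lot.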
>> Backup reports:
>> INFO: status: 100% (322122547200/322122547200), sparse 22% (72698785792),
>> duration 2416, read/write 3650/0 MB/s
>> INFO: transferred 322122 MB in 2416 seconds (133 MB/s)
>>
>> And peaks like:
>> INFO: status: 70% (225552891904/322122547200), sparse 3% (12228284416),
>> duration 2065, read/write 181/104 MB/s
>>
>>
> Have you tried setting (bandwidth) limits on the backup jobs and see if
> that helps ?
Not really. I've looked through the docs, but it seems I can only 
limit write bandwidth on the NAS storage (it only holds backups). I 
guess that would affect the read side too...
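If we do try a limit, as far as I understand it would be something 
like this, either per job or as a default in /etc/vzdump.conf (the 
storage name and the 100000 KiB/s figure are just examples):

    # per-job limit, roughly 100 MB/s
    vzdump 103 --storage backup-nas --mode snapshot --bwlimit 100000

    # or as a default for all jobs, in /etc/vzdump.conf:
    #   bwlimit: 100000

Since vzdump reads and writes the same data stream, I'd expect the 
limit to throttle the reads from Ceph as well, which may actually be 
what we want here.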
>
>> Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command failed -
>> VM 103 qmp command 'query-status' failed - unable to connect to VM 103
>> qmp socket - timeout after 31 retries
>> Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command
>> 'query-status' failed - unable to connect to VM 103 qmp socket - timeout
>> after 31 retries#012
>> [...]
>>
> Looks like the host resources (where this specific VM is running on) are
> exhausted at this point, or perhaps the VM itself is overloaded somehow.
I can't see any indication of anything like that in the VM or Proxmox 
node graphs though... :(

The VM is below 20% CPU use... and the node is even lower... I think a 
BlueStore OSD should be able to use 4-6 cores before it hits its limits?
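The graphs are only averages though, so short spikes around midnight 
might not show up. I might enable sysstat on the nodes so we have 
proper numbers for the backup window next time; roughly (sketch only):

    # on each node
    apt install sysstat
    sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat

    # afterwards, CPU figures around the time of the qmp timeouts:
    sar -u -s 23:50:00 -e 00:30:00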
>> This is only seen
>> for 3 of the 4 VMs in HA, but for the other VMs it is just logged twice,
>> and not every day (they're on the HDD pool). For this VM there are lots
>> of logs every day.
>>
> Are there any scheduled (I/O intensive) jobs running within these VMs at
> the same time where host(s) is trying to back them up ?
I don't think so; at least there isn't any I/O wait at all on this VM 
(SSD pool).
>> CPU during backup is low in the physical server, about 1.5-3.5 max load
>> and 10% max use.
> How about the storage (Ceph pools in this case) I/O where these VMs are
> running on ? Are they struggling during the backup time ?
I don't have this data, but looking at CPU use I don't expect that to 
be the case; the VM's storage is on the SSD pool. If disk/Ceph were 
the issue, I'd expect much more CPU use on the physical nodes...
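To have actual numbers next time, I'm thinking of logging Ceph's view 
during the backup window, with something like this rough loop on one 
of the nodes:

    # sample cluster health and per-pool client I/O once a minute
    while true; do
        date
        ceph -s
        ceph osd pool stats
        sleep 60
    done >> /root/ceph-backup-window.log 2>&1

If the SSD pool shows slow requests or unusually low client I/O right 
when the qmp timeouts are logged, that would point back at storage 
after all.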
>> Although it has been working fine until now, maybe e1000 emulation could
>> be the issue? We'll have to schedule downtime but can try to change to
>> virtio.
>>
> If that's what all affected VMs have in common, then yes definitely that
> could be one of the reasons(even though you mentioned that they were
> working fine before the PVE upgrade?). Is there a specific reason you need
> e1000 emulation? virtio performs much better.
Most VMs were P2V, so e1000 seemed the natural choice to minimize 
hardware changes on the Windows hosts. That worked really well, so we 
didn't look into changing them, and the VMs are not very network 
intensive.
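When we get a downtime window, my understanding is that after 
installing the virtio NIC driver inside Windows the change itself is a 
one-liner per VM, e.g. (VMID and MAC below are just examples, keeping 
the existing MAC so DHCP reservations and firewall rules don't change):

    qm set 103 --net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0

We'd probably convert one VM first and watch whether the disconnects 
stop there.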

Thanks a lot for your comments. I'll check with the site managers 
about changing the backup schedule. Some nodes need 6-7 hours, so that 
won't be trivial, but we should be able to extract useful info from it.

Will report back when we have more data.

Regards
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



