[PVE-User] VM network disconnect issue after upgrade to PVE 6.1

Gianni Milo gianni.milo22 at gmail.com
Thu Feb 20 13:48:10 CET 2020


Hello,

See comments below...

> vmbr0 is on a 2x1Gbit bond0
> Ceph public and private are on 2x10Gbit bond2
> Backup network is IPv6 on 2x1Gbit bond1, to a Synology NAS.
>

Where is the cluster (corosync) traffic flowing? On vmbr0? It would be a good
idea to split that off as well if possible (perhaps onto a different VLAN?).
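As a rough sketch (the VLAN ID, address and interface names below are only
assumptions, adjust them to your setup), a dedicated corosync VLAN on top of
an existing bond could look like this in /etc/network/interfaces; you would
then point corosync at that network (ring0_addr per node in
/etc/pve/corosync.conf, or --link0 when creating the cluster with pvecm):

    # hypothetical VLAN 50 on bond0, reserved for corosync traffic
    auto bond0.50
    iface bond0.50 inet static
        address 10.10.50.11/24
        # no gateway needed, corosync only talks to the other cluster nodes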


> We think that the backups may be the issue; until yesterday backups were
> done over vmbr0 with IPv4; as they nearly saturated the 1Gbit link, we
> changed the network and storage configuration so that backup NAS access
> was done over bond1, as it wasn't used previously. We're using IPv6 now
> because Synology can't configure two IPv4 on a bond from the GUI.
>

Using a separate network for the backup traffic should always help; that's
a good decision.
I'm having difficulty understanding why you had to configure 2 IPv4
addresses on a single bond though, why do you need 2 of them?


> But it seems the issue has happened again tonight (SQL Server connection
> drop). The VM has network connectivity in the morning, so it isn't a
> permanent problem.
>

Do the affected VMs listen on the vmbr0 network for "outside" communication?
Is that the interface on which the SQL server is accepting connections?


> We tried running the main VM backup yesterday morning, but couldn't
> reproduce the issue, although during the regular backup all 3 nodes are
> doing backups, and in the test we only performed the backup of the only
> VM stored on the SSD pool.
>

How about reducing the backup jobs on each node, or scheduling them at
different times, at least to test whether the backups are causing the problem?
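Purely as an illustration (the VM ids, storage name and times are assumptions),
staggering the jobs in /etc/pve/vzdump.cron (or via Datacenter -> Backup in the
GUI) could look like this, so the nodes don't all hit Ceph and the NAS at the
same time:

    # hypothetical staggered schedule, one job per group of VMs
    # m h dom mon dow   user  command
    0 0 * * 6           root  vzdump 103     --storage nas-backup --mode snapshot --quiet 1
    0 2 * * 6           root  vzdump 104 105 --storage nas-backup --mode snapshot --quiet 1
    0 4 * * 6           root  vzdump 106 107 --storage nas-backup --mode snapshot --quiet 1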

> Backup reports:
> INFO: status: 100% (322122547200/322122547200), sparse 22% (72698785792),
> duration 2416, read/write 3650/0 MB/s
> INFO: transferred 322122 MB in 2416 seconds (133 MB/s)
>
> And peaks like:
> INFO: status: 70% (225552891904/322122547200), sparse 3% (12228284416),
> duration 2065, read/write 181/104 MB/s
>

Have you tried setting (bandwidth) limits on the backup jobs and seeing if
that helps?
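For what it's worth, a cap can be set node-wide in /etc/vzdump.conf or per job
on the vzdump command line; the value is in KiB/s, and the 100 MiB/s figure and
storage name below are just examples:

    # /etc/vzdump.conf - applies to all backup jobs started on this node
    bwlimit: 102400

    # or per invocation:
    vzdump 103 --bwlimit 102400 --storage nas-backup --mode snapshot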


> Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command failed -
> VM 103 qmp command 'query-status' failed - unable to connect to VM 103
> qmp socket - timeout after 31 retries
> Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command
> 'query-status' failed - unable to connect to VM 103 qmp socket - timeout
> after 31 retries#012
> [...]
>

It looks like the resources of the host (where this specific VM is running)
are exhausted at this point, or perhaps the VM itself is somehow overloaded.


> So it seems backup is having a big impact on the VM.


Yes, indeed...


> This is only seen
> for 3 of the 4 VMs in HA, but for the other VMs it is just logged twice,
> and not every day (they're on the HDD pool). For this VM there are lots
> of logs every day.
>

Are there any scheduled (I/O intensive) jobs running within these VMs at
the same time the host(s) are trying to back them up?


> CPU during backup is low in the physical server, about 1.5-3.5 max load
> and 10% max use.
>

How about the I/O on the storage (the Ceph pools in this case) these VMs are
running on? Is it struggling during the backup window?
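A quick way to check is to watch the cluster from any node while a backup job
is running, for example:

    # overall health plus current client read/write throughput
    ceph -s
    # per-OSD commit/apply latency; consistently high values point at
    # struggling disks (e.g. the HDD pool during backups)
    ceph osd perf
    # live per-pool I/O breakdown
    ceph osd pool stats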

> Although it has been working fine until now, maybe e1000 emulation could
> be the issue? We'll have to schedule downtime but can try to change to
> virtio.
>

If that's what all affected VMs have in common, then yes, that could
definitely be one of the reasons (even though you mentioned that they were
working fine before the PVE upgrade?). Is there a specific reason you need
e1000 emulation? virtio performs much better.
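If you decide to test it, switching the NIC model is a one-line change per VM,
picked up at the next power cycle (the VM id, bridge and MAC below are only
placeholders based on your description):

    # replace the e1000 NIC of VM 103 with virtio on vmbr0;
    # keep the current MAC so DHCP reservations / firewall rules still match
    qm set 103 -net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0
    # Linux guests have the virtio-net driver built in; Windows guests
    # need the virtio drivers installed before the switch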

G.


