[PVE-User] VM network disconnect issue after upgrade to PVE 6.1

Humberto Jose De Sousa humbertos at ifsc.edu.br
Fri Feb 21 12:42:27 CET 2020


Hi. 

I have had many problems with IPv6. Sometimes IPv6 on the VMs stops working, and sometimes IPv6 on the hosts stops too. It happens only on the Proxmox cluster (VMs and hosts); other devices are not affected.
Here the IPv6 default route is lost. This week I disabled IPv6 in DNS and I'm now using only IPv4.
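In case it helps, the kind of quick checks I run when it happens are roughly these (assuming the standard iproute2/sysctl tools; vmbr0 below is just an example, use whichever interface is affected):

# is the IPv6 default route still there?
ip -6 route show default

# neighbour/router state on the bridge
ip -6 neigh show dev vmbr0

# temporary workaround: disable IPv6 on that interface only
sysctl -w net.ipv6.conf.vmbr0.disable_ipv6=1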

Perhaps this could be your problem too. 

Humberto. 


De: "Eneko Lacunza" <elacunza at binovo.es> 
Para: "PVE User List" <pve-user at pve.proxmox.com> 
Enviadas: Quinta-feira, 20 de fevereiro de 2020 6:05:12 
Assunto: [PVE-User] VM network disconnect issue after upgrade to PVE 6.1 

Hi all, 

On february 11th we upgraded a PVE 5.3 cluster to 5.4, then to 6.1 . 

This is an hyperconverged cluster with 3 servers, redundant network, 
Ceph with two storage pools, one HDD based and the other SSD based: 

Each server consists of: 
- Dell R530 
- 1xXeon E5-2620 8c/16t 2.1Ghz 
- 64GB RAM 
- 4x1Gbit ethernet (2 bonds) 
- 2x10Gbit ethernet (1 bond) 
- 1xIntel S4500 480GB - System + Bluestore DB for HDDs 
- 1xIntel S4500 480GB - Bluestore OSD 
- 4x1TB HDD - Bluestore OSD (with 30GB db on SSD) 

There are two Dell n1224T switches, each bond has one interface to each 
switch. Bonds are Active/passive, all active interfaces are on the same 
switch. 

vmbr0 is on a 2x1Gbit bond0 
Ceph public and private are on 2x10Gbit bond2 
Backup network is IPv6 on 2x1Gbit bond1, to a Synology NAS. 

SSD disk wearout is at 0%. 

It seems that since the upgrade, were're experiencing network 
connectivity issues in the night, during the backup window. 

We think that the backups may be the issue; until yesterday backups were 
done over vmbr0 with IPv4; as they nearly saturated the 1Gbit link, we 
changed the network and storage configuration so that backup NAS access 
was done over bond1, as it wasn't used previously. We're using IPv6 now 
because Synology can't configure two IPv4 on a bond from the GUI. 

But it seems the issue has happened again tonight (SQL Server connection 
drop). VM has network connectivity on the morning, so it isn't a 
permanent problem. 

We tried running the main VM backup yesterday morning, but couldn't 
reproduce the issue, although during regular backup all 3 nodes are 
doing backups and in the test we only performed the backup of the only 
VM storaged on SSD pool. 

This VM has 8vcores, 10GB of RAM, one disk Virtio scsi0 300GB 
cache=writeback, network is e1000. 

Backup reports: 
NFO: status: 100% (322122547200/322122547200), sparse 22% (72698785792), 
duration 2416, read/write 3650/0 MB/s 
INFO: transferred 322122 MB in 2416 seconds (133 MB/s) 

And peaks like: 
INFO: status: 70% (225552891904/322122547200), sparse 3% (12228284416), 
duration 2065, read/write 181/104 MB/s 
INFO: status: 71% (228727980032/322122547200), sparse 3% (12228317184), 
duration 2091, read/write 122/122 MB/s 
INFO: status: 72% (232054063104/322122547200), sparse 3% (12228349952), 
duration 2118, read/write 123/123 MB/s 
INFO: status: 73% (235237539840/322122547200), sparse 3% (12230103040), 
duration 2147, read/write 109/109 MB/s 
INFO: status: 74% (238500708352/322122547200), sparse 3% (12237438976), 
duration 2177, read/write 108/108 MB/s 

Also, during backup we see the following messages in syslog of the 
physical node: 
Feb 20 00:00:18 sotllo pve-ha-lrm[3930696]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - got timeout 
Feb 20 00:00:18 sotllo pve-ha-lrm[3930696]: VM 103 qmp command 
'query-status' failed - got timeout#012 
Feb 20 00:00:28 sotllo pve-ha-lrm[3930759]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - unable to connect to VM 103 
qmp socket - timeout after 31 retries 
Feb 20 00:00:28 sotllo pve-ha-lrm[3930759]: VM 103 qmp command 
'query-status' failed - unable to connect to VM 103 qmp socket - timeout 
after 31 retries#012 
Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - unable to connect to VM 103 
qmp socket - timeout after 31 retries 
Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command 
'query-status' failed - unable to connect to VM 103 qmp socket - timeout 
after 31 retries#012 
[...] 
Feb 20 00:40:38 sotllo pve-ha-lrm[3948846]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - got timeout 
Feb 20 00:40:38 sotllo pve-ha-lrm[3948846]: VM 103 qmp command 
'query-status' failed - got timeout#012 
Feb 20 00:41:28 sotllo pve-ha-lrm[3949193]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - got timeout 
Feb 20 00:41:28 sotllo pve-ha-lrm[3949193]: VM 103 qmp command 
'query-status' failed - got timeout#012 

So it seems backup is having a big impact on the VM. This is only seen 
for 3 of the 4 VMs in HA, but for the other VMs it is just logged twice, 
and not everyday (there're on the HDD pool). For this VM there are lots 
of logs everyday. 

CPU during backup is low in the physical server, about 1.5-3.5 max load 
and 10% max use. 

Although it has been working fine until now, maybe e1000 emulation could 
be the issue? We'll have to schedule downtime but can try to change to 
virtio. 

Any other ideas about what could be producing the issue? 

Thanks a lot for reading through here!! 

All three nodes have the same versions: 

root at sotllo:~# pveversion -v 
proxmox-ve: 6.1-2 (running kernel: 5.3.13-3-pve) 
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e) 
pve-kernel-5.3: 6.1-3 
pve-kernel-helper: 6.1-3 
pve-kernel-4.15: 5.4-12 
pve-kernel-5.3.13-3-pve: 5.3.13-3 
pve-kernel-4.15.18-24-pve: 4.15.18-52 
pve-kernel-4.15.18-10-pve: 4.15.18-32 
pve-kernel-4.13.13-5-pve: 4.13.13-38 
pve-kernel-4.13.13-2-pve: 4.13.13-33 
ceph: 14.2.6-pve1 
ceph-fuse: 14.2.6-pve1 
corosync: 3.0.2-pve4 
criu: 3.11-3 
glusterfs-client: 5.5-3 
ifupdown: 0.8.35+pve1 
ksm-control-daemon: 1.3-1 
libjs-extjs: 6.0.1-10 
libknet1: 1.13-pve1 
libpve-access-control: 6.0-6 
libpve-apiclient-perl: 3.0-2 
libpve-common-perl: 6.0-11 
libpve-guest-common-perl: 3.0-3 
libpve-http-server-perl: 3.0-4 
libpve-storage-perl: 6.1-4 
libqb0: 1.0.5-1 
libspice-server1: 0.14.2-4~pve6+1 
lvm2: 2.03.02-pve4 
lxc-pve: 3.2.1-1 
lxcfs: 3.0.3-pve60 
novnc-pve: 1.1.0-1 
proxmox-mini-journalreader: 1.1-1 
proxmox-widget-toolkit: 2.1-3 
pve-cluster: 6.1-4 
pve-container: 3.0-19 
pve-docs: 6.1-4 
pve-edk2-firmware: 2.20191127-1 
pve-firewall: 4.0-10 
pve-firmware: 3.0-4 
pve-ha-manager: 3.0-8 
pve-i18n: 2.0-4 
pve-qemu-kvm: 4.1.1-2 
pve-xtermjs: 4.3.0-1 
qemu-server: 6.1-5 
smartmontools: 7.1-pve2 
spiceterm: 3.1-1 
vncterm: 1.6-1 
zfsutils-linux: 0.8.3-pve1 

-- 
Zuzendari Teknikoa / Director Técnico 
Binovo IT Human Project, S.L. 
Telf. 943569206 
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa) 
www.binovo.es 

_______________________________________________ 
pve-user mailing list 
pve-user at pve.proxmox.com 
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user

Humberto José de Sousa
IT Analyst - CTIC
Câmpus São José
(48) 3381-2821

Instituto Federal de Santa Catarina - Câmpus São José
R. José Lino Kretzer, 608, Praia Comprida, São José / SC - CEP: 88103-310
www.sj.ifsc.edu.br

----- Original Message -----
From: "Eneko Lacunza" <elacunza at binovo.es>
To: "PVE User List" <pve-user at pve.proxmox.com>
Sent: Thursday, February 20, 2020 6:05:12
Subject: [PVE-User] VM network disconnect issue after upgrade to PVE 6.1

Hi all, 

On February 11th we upgraded a PVE 5.3 cluster to 5.4, then to 6.1. 

This is a hyperconverged cluster with 3 servers, redundant networking, and 
Ceph with two storage pools, one HDD-based and the other SSD-based: 

Each server consists of: 
- Dell R530 
- 1x Xeon E5-2620 8c/16t 2.1GHz 
- 64GB RAM 
- 4x1Gbit ethernet (2 bonds) 
- 2x10Gbit ethernet (1 bond) 
- 1xIntel S4500 480GB - System + Bluestore DB for HDDs 
- 1xIntel S4500 480GB - Bluestore OSD 
- 4x1TB HDD - Bluestore OSD (with 30GB db on SSD) 

There are two Dell N1224T switches; each bond has one interface to each 
switch. Bonds are active/passive, with all active interfaces on the same 
switch. 

vmbr0 is on a 2x1Gbit bond0 
Ceph public and private are on 2x10Gbit bond2 
Backup network is IPv6 on 2x1Gbit bond1, to a Synology NAS. 
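
To give an idea of the layout, each bond/bridge is defined in /etc/network/interfaces roughly like this (a simplified sketch only; the NIC names and addresses below are placeholders, and the other bonds are omitted):

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode active-backup
    bond-miimon 100

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.11
    netmask 255.255.255.0
    gateway 192.0.2.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0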

SSD disk wearout is at 0%. 

It seems that since the upgrade we're experiencing network 
connectivity issues at night, during the backup window. 

We think the backups may be the issue. Until yesterday, backups were 
done over vmbr0 with IPv4; as they nearly saturated the 1Gbit link, we 
changed the network and storage configuration so that backup NAS access 
goes over bond1, which wasn't used previously. We're using IPv6 now 
because Synology can't configure two IPv4 addresses on a bond from the GUI. 
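
For reference, the backup storage entry in /etc/pve/storage.cfg now looks roughly like this (storage ID, address and export path below are placeholders, not our real values):

nfs: synology-backup
    server 2001:db8::10
    export /volume1/pve-backup
    path /mnt/pve/synology-backup
    content backup

If bandwidth turns out to be the problem again, we could also cap vzdump with a bwlimit (in KB/s) in /etc/vzdump.conf.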

But it seems the issue happened again tonight (an SQL Server connection 
drop). The VM has network connectivity in the morning, so it isn't a 
permanent problem. 

We tried running the main VM's backup yesterday morning, but couldn't 
reproduce the issue, although during the regular backup window all 3 nodes 
are doing backups, whereas in the test we only backed up the one VM 
stored on the SSD pool. 

This VM has 8 vcores, 10GB of RAM, one 300GB VirtIO SCSI disk (scsi0) 
with cache=writeback, and an e1000 network interface. 
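
In config-file terms, the relevant part of /etc/pve/qemu-server/103.conf looks roughly like this (assuming this is the VM 103 from the logs below; the storage ID, volume name and MAC are placeholders):

cores: 8
memory: 10240
scsihw: virtio-scsi-pci
scsi0: ceph-ssd:vm-103-disk-0,cache=writeback,size=300G
net0: e1000=AA:BB:CC:DD:EE:FF,bridge=vmbr0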

Backup reports: 
INFO: status: 100% (322122547200/322122547200), sparse 22% (72698785792), 
duration 2416, read/write 3650/0 MB/s 
INFO: transferred 322122 MB in 2416 seconds (133 MB/s) 

And peaks like: 
INFO: status: 70% (225552891904/322122547200), sparse 3% (12228284416), 
duration 2065, read/write 181/104 MB/s 
INFO: status: 71% (228727980032/322122547200), sparse 3% (12228317184), 
duration 2091, read/write 122/122 MB/s 
INFO: status: 72% (232054063104/322122547200), sparse 3% (12228349952), 
duration 2118, read/write 123/123 MB/s 
INFO: status: 73% (235237539840/322122547200), sparse 3% (12230103040), 
duration 2147, read/write 109/109 MB/s 
INFO: status: 74% (238500708352/322122547200), sparse 3% (12237438976), 
duration 2177, read/write 108/108 MB/s 

Also, during the backup we see the following messages in the syslog of the 
physical node: 
Feb 20 00:00:18 sotllo pve-ha-lrm[3930696]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - got timeout 
Feb 20 00:00:18 sotllo pve-ha-lrm[3930696]: VM 103 qmp command 
'query-status' failed - got timeout#012 
Feb 20 00:00:28 sotllo pve-ha-lrm[3930759]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - unable to connect to VM 103 
qmp socket - timeout after 31 retries 
Feb 20 00:00:28 sotllo pve-ha-lrm[3930759]: VM 103 qmp command 
'query-status' failed - unable to connect to VM 103 qmp socket - timeout 
after 31 retries#012 
Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - unable to connect to VM 103 
qmp socket - timeout after 31 retries 
Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command 
'query-status' failed - unable to connect to VM 103 qmp socket - timeout 
after 31 retries#012 
[...] 
Feb 20 00:40:38 sotllo pve-ha-lrm[3948846]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - got timeout 
Feb 20 00:40:38 sotllo pve-ha-lrm[3948846]: VM 103 qmp command 
'query-status' failed - got timeout#012 
Feb 20 00:41:28 sotllo pve-ha-lrm[3949193]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - got timeout 
Feb 20 00:41:28 sotllo pve-ha-lrm[3949193]: VM 103 qmp command 
'query-status' failed - got timeout#012 

So it seems the backup is having a big impact on the VM. This is only seen 
for 3 of the 4 VMs in HA, but for the other VMs it is just logged twice, 
and not every day (they're on the HDD pool). For this VM there are lots 
of these log entries every day. 
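
A simple way to watch for the same stall by hand during the backup window might be something like this on the node (again assuming VM 103 is the affected one):

# a verbose status query; this should exercise the guest's QMP socket
time qm status 103 --verbose

# the socket the LRM is timing out on
ls -l /var/run/qemu-server/103.qmp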

CPU usage during backup is low on the physical server: about 1.5-3.5 max 
load and 10% max utilization. 

Although it has been working fine until now, maybe the e1000 emulation 
could be the issue? We'd have to schedule downtime, but we can try 
changing to virtio. 
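
If we go that route, the change itself would presumably just be the following (keeping the VM's existing MAC; the one below is a placeholder), plus a guest reboot and virtio drivers inside the guest:

qm set 103 --net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0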

Any other ideas about what could be producing the issue? 

Thanks a lot for reading through here!! 

All three nodes have the same versions: 

root@sotllo:~# pveversion -v 
proxmox-ve: 6.1-2 (running kernel: 5.3.13-3-pve) 
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e) 
pve-kernel-5.3: 6.1-3 
pve-kernel-helper: 6.1-3 
pve-kernel-4.15: 5.4-12 
pve-kernel-5.3.13-3-pve: 5.3.13-3 
pve-kernel-4.15.18-24-pve: 4.15.18-52 
pve-kernel-4.15.18-10-pve: 4.15.18-32 
pve-kernel-4.13.13-5-pve: 4.13.13-38 
pve-kernel-4.13.13-2-pve: 4.13.13-33 
ceph: 14.2.6-pve1 
ceph-fuse: 14.2.6-pve1 
corosync: 3.0.2-pve4 
criu: 3.11-3 
glusterfs-client: 5.5-3 
ifupdown: 0.8.35+pve1 
ksm-control-daemon: 1.3-1 
libjs-extjs: 6.0.1-10 
libknet1: 1.13-pve1 
libpve-access-control: 6.0-6 
libpve-apiclient-perl: 3.0-2 
libpve-common-perl: 6.0-11 
libpve-guest-common-perl: 3.0-3 
libpve-http-server-perl: 3.0-4 
libpve-storage-perl: 6.1-4 
libqb0: 1.0.5-1 
libspice-server1: 0.14.2-4~pve6+1 
lvm2: 2.03.02-pve4 
lxc-pve: 3.2.1-1 
lxcfs: 3.0.3-pve60 
novnc-pve: 1.1.0-1 
proxmox-mini-journalreader: 1.1-1 
proxmox-widget-toolkit: 2.1-3 
pve-cluster: 6.1-4 
pve-container: 3.0-19 
pve-docs: 6.1-4 
pve-edk2-firmware: 2.20191127-1 
pve-firewall: 4.0-10 
pve-firmware: 3.0-4 
pve-ha-manager: 3.0-8 
pve-i18n: 2.0-4 
pve-qemu-kvm: 4.1.1-2 
pve-xtermjs: 4.3.0-1 
qemu-server: 6.1-5 
smartmontools: 7.1-pve2 
spiceterm: 3.1-1 
vncterm: 1.6-1 
zfsutils-linux: 0.8.3-pve1 

-- 
Technical Director 
Binovo IT Human Project, S.L. 
Tel. 943569206 
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa) 
www.binovo.es 

_______________________________________________ 
pve-user mailing list 
pve-user at pve.proxmox.com 
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user 



