[PVE-User] VM network disconnect issue after upgrade to PVE 6.1
Humberto Jose De Sousa
humbertos at ifsc.edu.br
Fri Feb 21 12:42:27 CET 2020
Hi.
I've had many problems with IPv6. Sometimes IPv6 on the VMs stops working; sometimes IPv6 on the host stops too. It happens only on the Proxmox cluster (VMs and hosts); other devices aren't affected.
Here the IPv6 default route is lost. This week I disabled IPv6 in DNS and I'm using only IPv4.
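When it happens, the quickest check is the routing table itself; for example:

# shows the IPv6 default route; when the problem hits, this returns nothing
ip -6 route show default

# router advertisements usually (re)install that route; rdisc6 (from the
# ndisc6 package, if you have it installed) can probe the router directly
rdisc6 vmbr0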
Perhaps this could be your problem too.
Humberto.
De: "Eneko Lacunza" <elacunza at binovo.es>
Para: "PVE User List" <pve-user at pve.proxmox.com>
Enviadas: Quinta-feira, 20 de fevereiro de 2020 6:05:12
Assunto: [PVE-User] VM network disconnect issue after upgrade to PVE 6.1
Hi all,
On february 11th we upgraded a PVE 5.3 cluster to 5.4, then to 6.1 .
This is an hyperconverged cluster with 3 servers, redundant network,
Ceph with two storage pools, one HDD based and the other SSD based:
Each server consists of:
- Dell R530
- 1xXeon E5-2620 8c/16t 2.1Ghz
- 64GB RAM
- 4x1Gbit ethernet (2 bonds)
- 2x10Gbit ethernet (1 bond)
- 1xIntel S4500 480GB - System + Bluestore DB for HDDs
- 1xIntel S4500 480GB - Bluestore OSD
- 4x1TB HDD - Bluestore OSD (with 30GB db on SSD)
There are two Dell n1224T switches, each bond has one interface to each
switch. Bonds are Active/passive, all active interfaces are on the same
switch.
vmbr0 is on a 2x1Gbit bond0
Ceph public and private are on 2x10Gbit bond2
Backup network is IPv6 on 2x1Gbit bond1, to a Synology NAS.
SSD disk wearout is at 0%.
It seems that since the upgrade, were're experiencing network
connectivity issues in the night, during the backup window.
We think that the backups may be the issue; until yesterday backups were
done over vmbr0 with IPv4; as they nearly saturated the 1Gbit link, we
changed the network and storage configuration so that backup NAS access
was done over bond1, as it wasn't used previously. We're using IPv6 now
because Synology can't configure two IPv4 on a bond from the GUI.
But it seems the issue has happened again tonight (SQL Server connection
drop). VM has network connectivity on the morning, so it isn't a
permanent problem.
We tried running the main VM backup yesterday morning, but couldn't
reproduce the issue, although during regular backup all 3 nodes are
doing backups and in the test we only performed the backup of the only
VM storaged on SSD pool.
This VM has 8vcores, 10GB of RAM, one disk Virtio scsi0 300GB
cache=writeback, network is e1000.
Backup reports:
NFO: status: 100% (322122547200/322122547200), sparse 22% (72698785792),
duration 2416, read/write 3650/0 MB/s
INFO: transferred 322122 MB in 2416 seconds (133 MB/s)
And peaks like:
INFO: status: 70% (225552891904/322122547200), sparse 3% (12228284416),
duration 2065, read/write 181/104 MB/s
INFO: status: 71% (228727980032/322122547200), sparse 3% (12228317184),
duration 2091, read/write 122/122 MB/s
INFO: status: 72% (232054063104/322122547200), sparse 3% (12228349952),
duration 2118, read/write 123/123 MB/s
INFO: status: 73% (235237539840/322122547200), sparse 3% (12230103040),
duration 2147, read/write 109/109 MB/s
INFO: status: 74% (238500708352/322122547200), sparse 3% (12237438976),
duration 2177, read/write 108/108 MB/s
Also, during backup we see the following messages in syslog of the
physical node:
Feb 20 00:00:18 sotllo pve-ha-lrm[3930696]: VM 103 qmp command failed -
VM 103 qmp command 'query-status' failed - got timeout
Feb 20 00:00:18 sotllo pve-ha-lrm[3930696]: VM 103 qmp command
'query-status' failed - got timeout#012
Feb 20 00:00:28 sotllo pve-ha-lrm[3930759]: VM 103 qmp command failed -
VM 103 qmp command 'query-status' failed - unable to connect to VM 103
qmp socket - timeout after 31 retries
Feb 20 00:00:28 sotllo pve-ha-lrm[3930759]: VM 103 qmp command
'query-status' failed - unable to connect to VM 103 qmp socket - timeout
after 31 retries#012
Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command failed -
VM 103 qmp command 'query-status' failed - unable to connect to VM 103
qmp socket - timeout after 31 retries
Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command
'query-status' failed - unable to connect to VM 103 qmp socket - timeout
after 31 retries#012
[...]
Feb 20 00:40:38 sotllo pve-ha-lrm[3948846]: VM 103 qmp command failed -
VM 103 qmp command 'query-status' failed - got timeout
Feb 20 00:40:38 sotllo pve-ha-lrm[3948846]: VM 103 qmp command
'query-status' failed - got timeout#012
Feb 20 00:41:28 sotllo pve-ha-lrm[3949193]: VM 103 qmp command failed -
VM 103 qmp command 'query-status' failed - got timeout
Feb 20 00:41:28 sotllo pve-ha-lrm[3949193]: VM 103 qmp command
'query-status' failed - got timeout#012
So it seems backup is having a big impact on the VM. This is only seen
for 3 of the 4 VMs in HA, but for the other VMs it is just logged twice,
and not everyday (there're on the HDD pool). For this VM there are lots
of logs everyday.
CPU during backup is low in the physical server, about 1.5-3.5 max load
and 10% max use.
Although it has been working fine until now, maybe e1000 emulation could
be the issue? We'll have to schedule downtime but can try to change to
virtio.
Any other ideas about what could be producing the issue?
Thanks a lot for reading through here!!
All three nodes have the same versions:
root at sotllo:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.13-3-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-5.3: 6.1-3
pve-kernel-helper: 6.1-3
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-11
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-19
pve-docs: 6.1-4
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-10
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
_______________________________________________
pve-user mailing list
pve-user at pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
Humberto José de Sousa
IT Analyst
CTIC
Câmpus São José
(48) 3381-2821
Instituto Federal de Santa Catarina - Câmpus São José
R. José Lino Kretzer, 608, Praia Comprida, São José / SC - CEP: 88103-310
www.sj.ifsc.edu.br
----- Original message -----
From: "Eneko Lacunza" <elacunza at binovo.es>
To: "PVE User List" <pve-user at pve.proxmox.com>
Sent: Thursday, February 20, 2020 6:05:12
Subject: [PVE-User] VM network disconnect issue after upgrade to PVE 6.1
Hi all,
On February 11th we upgraded a PVE 5.3 cluster to 5.4, then to 6.1.
This is a hyperconverged cluster with 3 servers, redundant networking,
and Ceph with two storage pools, one HDD-based and the other SSD-based.
Each server consists of:
- Dell R530
- 1x Xeon E5-2620, 8c/16t, 2.1GHz
- 64GB RAM
- 4x1Gbit ethernet (2 bonds)
- 2x10Gbit ethernet (1 bond)
- 1xIntel S4500 480GB - System + Bluestore DB for HDDs
- 1xIntel S4500 480GB - Bluestore OSD
- 4x1TB HDD - Bluestore OSD (with 30GB db on SSD)
There are two Dell N1224T switches; each bond has one interface on each
switch. Bonds are active/passive, and all active interfaces are on the
same switch.
vmbr0 is on a 2x1Gbit bond0
Ceph public and private are on 2x10Gbit bond2
Backup network is IPv6 on 2x1Gbit bond1, to a Synology NAS.
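For reference, the bond/bridge part of our /etc/network/interfaces looks roughly like this (NIC names and addresses are placeholders, not the real ones):

auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1

auto vmbr0
iface vmbr0 inet static
        address 192.0.2.11/24
        gateway 192.0.2.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0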
SSD disk wearout is at 0%.
It seems that since the upgrade we're experiencing network
connectivity issues at night, during the backup window.
We think that the backups may be the issue. Until yesterday, backups
were done over vmbr0 with IPv4; as they nearly saturated the 1Gbit link,
we changed the network and storage configuration so that backup NAS
access goes over bond1, which wasn't used previously. We're using IPv6
now because Synology can't configure two IPv4 addresses on a bond from
the GUI.
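The backup storage definition in /etc/pve/storage.cfg looks roughly like this (assuming an NFS export; storage name, paths and the IPv6 address below are placeholders):

nfs: synology-backup
        export /volume1/backup
        path /mnt/pve/synology-backup
        server 2001:db8::10
        content backup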
But it seems the issue happened again tonight (an SQL Server connection
drop). The VM has network connectivity in the morning, so it isn't a
permanent problem.
We tried running the main VM backup yesterday morning, but couldn't
reproduce the issue, although during the regular backup window all 3
nodes are doing backups, whereas in the test we only backed up the
single VM stored on the SSD pool.
This VM has 8 vcores, 10GB of RAM, one 300GB disk on scsi0 (VirtIO SCSI,
cache=writeback), and an e1000 network card.
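For context, its config in /etc/pve/qemu-server/103.conf looks more or less like this (MAC address and storage name are placeholders):

cores: 8
memory: 10240
net0: e1000=AA:BB:CC:DD:EE:FF,bridge=vmbr0
scsi0: ceph-ssd:vm-103-disk-0,cache=writeback,size=300G
scsihw: virtio-scsi-pci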
Backup reports:
INFO: status: 100% (322122547200/322122547200), sparse 22% (72698785792),
duration 2416, read/write 3650/0 MB/s
INFO: transferred 322122 MB in 2416 seconds (133 MB/s)
And peaks like:
INFO: status: 70% (225552891904/322122547200), sparse 3% (12228284416),
duration 2065, read/write 181/104 MB/s
INFO: status: 71% (228727980032/322122547200), sparse 3% (12228317184),
duration 2091, read/write 122/122 MB/s
INFO: status: 72% (232054063104/322122547200), sparse 3% (12228349952),
duration 2118, read/write 123/123 MB/s
INFO: status: 73% (235237539840/322122547200), sparse 3% (12230103040),
duration 2147, read/write 109/109 MB/s
INFO: status: 74% (238500708352/322122547200), sparse 3% (12237438976),
duration 2177, read/write 108/108 MB/s
Also, during the backup we see the following messages in the syslog of
the physical node:
Feb 20 00:00:18 sotllo pve-ha-lrm[3930696]: VM 103 qmp command failed -
VM 103 qmp command 'query-status' failed - got timeout
Feb 20 00:00:18 sotllo pve-ha-lrm[3930696]: VM 103 qmp command
'query-status' failed - got timeout#012
Feb 20 00:00:28 sotllo pve-ha-lrm[3930759]: VM 103 qmp command failed -
VM 103 qmp command 'query-status' failed - unable to connect to VM 103
qmp socket - timeout after 31 retries
Feb 20 00:00:28 sotllo pve-ha-lrm[3930759]: VM 103 qmp command
'query-status' failed - unable to connect to VM 103 qmp socket - timeout
after 31 retries#012
Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command failed -
VM 103 qmp command 'query-status' failed - unable to connect to VM 103
qmp socket - timeout after 31 retries
Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command
'query-status' failed - unable to connect to VM 103 qmp socket - timeout
after 31 retries#012
[...]
Feb 20 00:40:38 sotllo pve-ha-lrm[3948846]: VM 103 qmp command failed -
VM 103 qmp command 'query-status' failed - got timeout
Feb 20 00:40:38 sotllo pve-ha-lrm[3948846]: VM 103 qmp command
'query-status' failed - got timeout#012
Feb 20 00:41:28 sotllo pve-ha-lrm[3949193]: VM 103 qmp command failed -
VM 103 qmp command 'query-status' failed - got timeout
Feb 20 00:41:28 sotllo pve-ha-lrm[3949193]: VM 103 qmp command
'query-status' failed - got timeout#012
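Outside the backup window we can poll the same QMP socket by hand to confirm it responds; a quick check (VMID 103 as in the logs above):

# query-status over QMP is what pve-ha-lrm runs; this does the same on demand
qm status 103 --verbose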
So it seems the backup is having a big impact on the VM. This is only
seen for 3 of the 4 VMs in HA; for the other VMs it is just logged twice
and not every day (those are on the HDD pool). For this VM there are
lots of these messages every day.
CPU usage during backup is low on the physical server: about 1.5-3.5 max
load and 10% max utilization.
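If it's I/O pressure rather than CPU, one option we may try is rate-limiting the backup in /etc/vzdump.conf; a minimal sketch (the limit value is just an example, not tested):

# /etc/vzdump.conf
# cap backup read bandwidth, in KB/s (~100 MB/s here)
bwlimit: 100000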
Although it has been working fine until now, maybe e1000 emulation could
be the issue? We'll have to schedule downtime, but we can try changing
to virtio.
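The change itself would be a one-liner once we have the downtime (MAC is a placeholder; we'd keep the VM's existing MAC so the guest doesn't see a new NIC):

qm set 103 --net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0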
Any other ideas about what could be producing the issue?
Thanks a lot for reading through here!!
All three nodes have the same versions:
root@sotllo:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.13-3-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-5.3: 6.1-3
pve-kernel-helper: 6.1-3
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-11
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-19
pve-docs: 6.1-4
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-10
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
--
Technical Director
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
_______________________________________________
pve-user mailing list
pve-user at pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user