VMs hung after live migration - Intel CPU

Eneko Lacunza elacunza at binovo.es
Thu Nov 3 17:55:29 CET 2022


Hi all,

We have an HCI cluster, upgraded to the latest enterprise version as of this
afternoon:

# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.60-2-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-13
pve-kernel-5.15: 7.2-12
pve-kernel-5.4: 6.4-18
pve-kernel-5.3: 6.1-6
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.4.189-2-pve: 5.4.189-2
pve-kernel-5.4.174-2-pve: 5.4.174-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph: 15.2.17-pve1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-3
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-4
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

Cluster has 3 nodes:
- proxmox1: Dell 64GB RAM, Xeon E-2356G
- proxmox2: Dell 64GB RAM, Xeon E-2356G
- proxmox3: HP 24GB RAM, Xeon E5-2407
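
Since the E5-2407 is a considerably older CPU generation than the E-2356G, it may be worth comparing the CPU flags the nodes expose. The following is only a minimal sketch: the function names and the idea of saving a copy of /proc/cpuinfo from each node into a file are assumptions for illustration, and nothing here establishes that a flag mismatch is related to the hang.

```shell
# Hypothetical helper: print the CPU flags from a saved copy of /proc/cpuinfo,
# one flag per line, sorted and de-duplicated.
cpu_flags() {
    # Use the first "flags" line, drop the "flags :" prefix, split on spaces.
    grep -m1 '^flags' "$1" | cut -d: -f2 | tr -s ' ' '\n' | grep -v '^$' | sort -u
}

# Hypothetical helper: print flags present in the first dump but not the second.
flags_only_in_first() {
    a=$(mktemp)
    b=$(mktemp)
    cpu_flags "$1" > "$a"
    cpu_flags "$2" > "$b"
    # comm -23 keeps only the lines unique to the first (sorted) file.
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

With a dump from each node saved locally (file names are placeholders), `flags_only_in_first cpuinfo.proxmox1 cpuinfo.proxmox3` would list the flags that proxmox1 has and proxmox3 lacks.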

Network is 10G.

Each node is a full Ceph node, with mon, mgr and 4 OSDs (there's a mix
of BlueStore and FileStore).

Today 2 VMs hung after live migration from the "proxmox3" node to the
"proxmox1" node. We hadn't seen this before on this cluster. We upgrade it
quarterly, and the previous kernel was pve-kernel-5.15.39-1-pve.

- "erp" VM is Windows Server 2012 R2, with 24GB RAM, 4 vcores, UEFI BIOS,
virtio network and virtio-scsi disks
- "dc" VM is Windows Server 2019, with 8-14GB RAM, 2 vcores, UEFI BIOS,
virtio network and virtio-scsi disks

Both VMs became unresponsive, even from console. "erp" was using about 
75% of CPU and "dc" 100% of CPU. Only those two VMs on that node.

Both VMs survived live migration from "proxmox1" to "proxmox3" this
afternoon...

I thought this kind of issue would have been fixed in the 5.15 kernel, at
least on Intel CPUs and DC-grade hardware... :-(

Should I revert to a 5.13 kernel? It seems that 5.13 kernels are no longer
maintained by Proxmox, though?
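
To see which kernel builds are still installed on a node and could be booted instead, the `pveversion -v` output shown earlier can be filtered. This is a minimal sketch; the function name is hypothetical and it only reads text in the format pasted above.

```shell
# Hypothetical helper: read `pveversion -v` output on stdin and print the
# versioned kernel builds it lists, in the order they appear.
installed_pve_kernels() {
    # Keep only lines such as "pve-kernel-5.15.60-2-pve: 5.15.60-2".
    # Meta-packages like "pve-kernel-5.15" have no "-pve" suffix and are skipped.
    grep -E '^pve-kernel-[0-9]+\.[0-9]+\.[0-9]+-[0-9]+-pve:' \
        | cut -d: -f1 \
        | sed 's/^pve-kernel-//'
}
```

On a node it would be used as `pveversion -v | installed_pve_kernels`. For the output above, that lists the 5.15.60-2, 5.15.39-1, 5.4.189-2, 5.4.174-2, 5.3.18-3 and 5.3.18-2 builds, and no 5.13 build.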

Thanks


Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/

