[PVE-User] VMs hung after live migration - Intel CPU

Eneko Lacunza elacunza at binovo.es
Mon Apr 17 17:18:31 CEST 2023


Hi all,

Today we tested the following migrations with the latest PVE 7.4:

Ryzen 5900X 5.13.19-6-pve -> Ryzen 1700 6.2.9-1-pve: OK (Linux and Windows, kvm64 CPU)
Ryzen 5900X 5.13.19-6-pve -> Ryzen 2600X 6.2.9-1-pve: OK (Linux and Windows, kvm64 CPU)
Ryzen 2600X 6.2.9-1-pve <-> Ryzen 1700 6.2.9-1-pve: OK (Linux and Windows, kvm64 CPU)
Ryzen 5900X 6.2.9-1-pve <-> Ryzen 2600X 6.2.9-1-pve: OK (Linux and Windows, kvm64 CPU)

We had stayed on the 5.13.19 kernel because of those issues; now it seems 
there's a way to upgrade the kernel without stopping VMs in a mixed 
CPU-model cluster.
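
For the record, a rough sketch of what that rolling upgrade looks like 
(the target node "pve2" and VMID 100 are placeholders):

   # Live-migrate each running VM off the node to be rebooted:
   qm migrate 100 pve2 --online
   # Upgrade and reboot the now-empty node:
   apt update && apt dist-upgrade
   reboot
   # Then migrate the VMs back and repeat for the next node.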

Thanks


On 16/11/22 at 12:31, Eneko Lacunza wrote:
> Hi,
>
> A new kernel 5.15.74-1 is out, and I saw that a TSC bug fix prepared 
> by Fiona (thanks a lot!) was there, so I just tried it out:
>
> Ryzen 1700 5.13.19-6-pve -> Ryzen 5900X 5.15.74-1-pve: migrations OK
> Ryzen 5900X 5.15.74-1-pve -> Ryzen 1700 5.13.19-6-pve: Linux 
> migrations failed, Windows OK
>
> I noticed that in the VMs where something was logged to the console, 
> there was no mention of TSC.
>
> This time the error was (Debian 10 kernel):
> PANIC: double fault, error_code: 0x0
>
> Then came a kernel panic; I have it in a screenshot if that can help.
>
> I recall some floating point issue was reported; no idea whether that 
> has been tracked down.
>
> I think there has been progress with the issues we are seeing in this 
> Ryzen cluster, although the 5.15 kernel is still unworkable as of 5.15.74...
>
> Cheers
>
>
> On 9/11/22 at 9:21, Eneko Lacunza via pve-user wrote:
>>
>> Hi Jan,
>>
>> On 8/11/22 at 21:57, Jan Vlach wrote:
>>> thank you a million for taking the time to re-test this! It really 
>>> helps me understand what to expect to work and what doesn’t. I 
>>> had a glimpse of an idea to create a cluster with mixed EPYC gen1 
>>> and EPYC gen3 CPUs, but that really seems like a road to hell(tm). 
>>> So I’ll keep the clusters homogeneous, with the same generation of 
>>> CPU. I have two sites, but fortunately I can keep both clusters 
>>> homogeneous (with one having “more power”).
>>>
>>> Honestly, up until now, I thought I could abstract away the version 
>>> of the Linux kernel I’m running. Because, hey, it’s all KVM. I’m 
>>> setting my VMs with CPU type host to have the benefit of accelerated 
>>> AES and other instructions, but I have yet to see if EPYC v1 is 
>>> compatible with EPYC v3 (v being gen). Thanks for teaching me a new 
>>> trick, or at least a thing to be aware of! (I remember this being an 
>>> issue with VMware heterogeneous clusters (with CPUs of different 
>>> generations), but I really thought KVM64 would let you abstract away 
>>> all of this, KVM64 being a Pentium4-era CPU.)
>>
>> We hadn't found any issue with this until kernel 5.15 (we have been 
>> using Proxmox since 0.9 or something like that!). The only issue has 
>> been trouble live migrating VMs with 2+ cores between Intel and AMD 
>> processors, but no issues at all between different generations of the 
>> same brand. Until 5.15.*, that is.
>>
>> This is the whole reason for the kvm64 vCPU type to exist, and also the 
>> reason for it to be the default! So I don't understand why this is 
>> taking so long to be fixed.
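>>
>> (A quick sketch for reference; VMID 100 is a placeholder. kvm64 is 
>> what a VM gets by default, but it can also be set explicitly:
>>
>>    qm set 100 --cpu kvm64
>>
>> which shows up as "cpu: kvm64" in /etc/pve/qemu-server/100.conf.)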
>>
>>>
>>> Do you use virtio drivers for storage and the network card at all? 
>>> Can you see a pattern in why the 3 Debian/Windows machines were not 
>>> affected? Did they use virtio or not?
>>
>> Yes, virtio drivers for storage and network for all Debian and 
>> Windows 2008r2.
>>
>>>
>>> I really don’t see a reason why the migration back from 5.13 -> 5.19 
>>> should bring that 50/100% CPU load and hanging. I’ve had some 
>>> phantom load before with “Use tablet for pointer: Yes”, but that was 
>>> in the 5% ballpark per VM.
>>
>> The issue is not CPU load per se, but that the VM is hung (not able 
>> to do anything in the console).
>>
>>>
>>> I’m just a fellow Proxmox admin/user. Hope this rings a bell or 
>>> sparks interest in the core Proxmox team. I’ve had struggles with 
>>> 5.15 before, with GPU passthrough (wasn’t able to get it working) 
>>> and OpenBSD VMs taking minutes, compared to tens of seconds, to 
>>> boot on 5.15.
>>>
>>> All in all, thanks for all the hints I could test before 
>>> production, so it won’t hurt “down the road” …
>>
>> For now, we're pinning the 5.13 kernel, which is working perfectly 
>> (except for AMD<->Intel migration, but that is a years-long issue).
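>>
>> (A sketch of how the pinning can be done, assuming a 
>> proxmox-boot-tool managed setup; the kernel version is the one from 
>> our nodes:
>>
>>    proxmox-boot-tool kernel list
>>    proxmox-boot-tool kernel pin 5.13.19-6-pve
>>
>> On plain GRUB installs the equivalent is setting GRUB_DEFAULT in 
>> /etc/default/grub and running update-grub.)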
>>
>>>
>>> JV
>>> P.S. I’m trying to push my boss towards a commercial subscription 
>>> for our clusters, but at this point I’m really not sure it would 
>>> help ...
>>
>> I'm sure this must have been reported; no idea why it wasn't fixed 
>> or the official kernel downgraded to 5.13. On the forum, someone from 
>> Proxmox even commented that we shouldn't run clusters with CPUs of 
>> different generations, which was shocking to read, frankly. We have 
>> customers with commercial support that we pinned to the 5.13 kernel 
>> preventively, because we had found the issue beforehand in our "eat 
>> our own food" cluster!! :-)
>>
>> Cheers
>>
>>>
>>>
>>>> On 8. 11. 2022, at 18:18, Eneko Lacunza via 
>>>> pve-user <pve-user at lists.proxmox.com> wrote:
>>>>
>>>>
>>>> From: Eneko Lacunza <elacunza at binovo.es>
>>>> Subject: Re: [PVE-User] VMs hung after live migration - Intel CPU
>>>> Date: 8 November 2022 18:18:44 CET
>>>> To: pve-user at lists.proxmox.com
>>>>
>>>>
>>>> Hi Jan,
>>>>
>>>> I had some time to re-test this.
>>>>
>>>> I tried live migration with KVM64 CPU between 2 nodes:
>>>>
>>>> node-ryzen1700 - kernel 5.19.7-1-pve
>>>> node-ryzen5900x - kernel 5.19.7-1-pve
>>>>
>>>> I bulk-migrated 9 VMs (8 Debian 9/10/11 and 1 Windows 2008r2).
>>>> This works OK in both directions.
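>>>>
>>>> (In case someone wants to reproduce this: the per-VM equivalent of 
>>>> the bulk action, with VMID 100 as a placeholder, is
>>>>
>>>>    qm migrate 100 node-ryzen5900x --online
>>>>
>>>> run once per VM.)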
>>>>
>>>> Then I downgraded one node to 5.13:
>>>> node-ryzen1700 - kernel 5.19.7-1-pve
>>>> node-ryzen5900x - kernel 5.13.19-6-pve
>>>>
>>>> Migration of those 9 VMs worked well from node-ryzen1700 -> 
>>>> node-ryzen5900x.
>>>>
>>>> But migration of those 9 VMs back from node-ryzen5900x -> 
>>>> node-ryzen1700 was a disaster: all 8 Debian VMs hung with 50/100% 
>>>> CPU use. Windows 2008r2 seems not to be affected by the issue at all.
>>>>
>>>> 3 other Debian/Windows VMs on node-ryzen1700 were not affected.
>>>>
>>>> After moving both nodes to kernel 5.13:
>>>>
>>>> node-ryzen1700 - kernel 5.13.19-6-pve
>>>> node-ryzen5900x - kernel 5.13.19-6-pve
>>>>
>>>> Migration of those 9 VMs from node-ryzen5900x -> node-ryzen1700 
>>>> works as intended :)
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>> On 8/11/22 at 9:40, Eneko Lacunza via pve-user wrote:
>>>>> Hi Jan,
>>>>>
>>>>> Yes, there's no issue if CPUs are the same.
>>>>>
>>>>> VMs hang when the CPUs are of different enough generations, even 
>>>>> when they are of the same brand and the KVM64 vCPU is used.
>>>>>
>>>>> On 7/11/22 at 22:59, Jan Vlach wrote:
>>>>>> Hi,
>>>>>>
>>>>>> For what it’s worth, live migration of Linux VMs with various 
>>>>>> Debian versions works here just fine. I’m using virtio for 
>>>>>> networking and virtio-scsi for disks. (The only version where I 
>>>>>> had problems was Debian 6, where the kernel does not support 
>>>>>> virtio-scsi and a MegaRAID SAS 8708EM2 needs to be used. I get a 
>>>>>> kernel panic in mpt_sas on thaw after migration.)
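>>>>>>
>>>>>> (For reference, the relevant part of such a VM config, with VMID 
>>>>>> and storage names as placeholders:
>>>>>>
>>>>>>    # /etc/pve/qemu-server/100.conf
>>>>>>    scsihw: virtio-scsi-pci
>>>>>>    scsi0: local-zfs:vm-100-disk-0,size=32G
>>>>>>    net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0
>>>>>>
>>>>>> For the Debian 6 guest, scsihw is switched to megasas instead.)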
>>>>>>
>>>>>> We’re running 5.15.60-1-pve on a three-node cluster with AMD EPYC 
>>>>>> 7551P 32-core processors. These are Supermicros with the latest 
>>>>>> BIOS (latest microcode?) and BMC firmware.
>>>>>>
>>>>>> Storage is a local ZFS pool backed by SSDs in striped mirrors (4 
>>>>>> devices on each node). Migration has a dedicated 2x 10GigE LACP 
>>>>>> bond and a dedicated VLAN on the switch stack.
>>>>>>
>>>>>> I have more nodes with EPYC3/Milan on the way, so I’ll test those 
>>>>>> later as well.
>>>>>>
>>>>>> What does your cluster look like hardware-wise? What problems did 
>>>>>> you experience with VM migration on 5.13 -> 5.19?
>>>>>>
>>>>>> Thanks,
>>>>>> JV
>

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/

