[PVE-User] VMs hung after live migration - Intel CPU

Eneko Lacunza elacunza at binovo.es
Wed Nov 16 12:31:13 CET 2022


Hi,

A new kernel 5.15.74-1 is out, and I saw that a TSC bug fix prepared by 
Fiona (thanks a lot!) was there, so I just tried it out:

Ryzen 1700 5.13.19-6-pve -> Ryzen 5900X 5.15.74-1-pve: migrations OK
Ryzen 5900X 5.15.74-1-pve -> Ryzen 1700 5.15.74-1-pve: Linux migrations 
failed, Windows OK

I noticed that in the VMs where something was logged to the console, 
there was no mention of TSC.
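
In case it is useful to others chasing this, a quick thing to check is 
which clocksource the guest actually ended up on after migration; a 
minimal sketch in Python, assuming a standard Linux guest with the usual 
sysfs layout (run inside the VM):

    # Minimal sketch: report the clocksource a Linux guest is using.
    # Assumes the standard sysfs clocksource files are present.
    from pathlib import Path

    base = Path("/sys/devices/system/clocksource/clocksource0")
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()

    print("current clocksource:", current)
    print("available clocksources:", ", ".join(available))
    if "tsc" in available and current != "tsc":
        print("note: tsc is available but the guest is not using it")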

This time the error was (Debian 10 kernel):
PANIC: double fault, error_code: 0x0

Then a kernel panic; I have a screenshot of it if that helps.

I recall some floating point issue being reported, but I have no idea 
whether that has been tracked down.

I think there has been progress with the issues we are seeing in this 
Ryzen cluster, although the 5.15 kernel is still not usable for us as of 
5.15.74...

Cheers


On 9/11/22 at 9:21, Eneko Lacunza via pve-user wrote:
>
> Hi Jan,
>
> On 8/11/22 at 21:57, Jan Vlach wrote:
>> thank you a million for taking your time to re-test this! It really 
>> helps me to understand what to expect that works and what doesn’t. I 
>> had a glimpse of an idea to create a cluster with mixed CPUs of EPYC 
>> gen1 and EPYC gen3, but this really seems like a road to hell(tm). So 
>> I’ll keep the clusters homogeneous with the same gen of CPU. I have 
>> two sites, but fortunately, I can keep the clusters homogeneous (with 
>> one having “more power”).
>>
>> Honestly, up until now, I thought I could abstract from the version 
>> of the Linux kernel I’m running. Because, hey, it’s all KVM. I’m 
>> setting my VMs with CPU type “host” to have the benefit of accelerated 
>> AES and other instructions, but I have yet to see if EPYCv1 is 
>> compatible with EPYCv3 (v being gen). Thanks for teaching me a new 
>> trick, or a thing to be aware of at least! (I remember this being an 
>> issue with VMware heterogeneous clusters (with CPUs of different 
>> generations), but I really thought KVM64 would let you abstract away 
>> from all this, KVM64 being a Pentium4-era CPU.)
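
For anyone comparing setups: the vCPU type is just the "cpu:" line in 
the VM config and can be switched with the qm CLI. A minimal sketch, 
where VMID 100 is purely a placeholder:

    # Minimal sketch: switch a VM between the "host" and "kvm64" vCPU types.
    # VMID 100 is a placeholder; run on the Proxmox node hosting the VM.
    import subprocess

    def set_cpu_type(vmid: int, cpu_type: str) -> None:
        # "qm set <vmid> --cpu <type>" updates the "cpu:" line in
        # /etc/pve/qemu-server/<vmid>.conf; it takes effect on the next VM start.
        subprocess.run(["qm", "set", str(vmid), "--cpu", cpu_type], check=True)

    set_cpu_type(100, "kvm64")   # most portable for live migration between hosts
    # set_cpu_type(100, "host")  # exposes host features (e.g. AES-NI), least portable
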
>
> We hadn't found any issue with this until kernel 5.15 (we have been 
> using Proxmox since 0.9 or something like that!). The only issue has 
> been trouble live-migrating VMs with 2+ cores between Intel and AMD 
> processors, but no issues at all between different generations of the 
> same brand. Until 5.15.*, that is.
>
> This is the only reason to have the kvm64 vCPU type, and also the 
> reason for it to be the default! So I don't understand why this is 
> taking so long to be fixed.
>
>>
>> Do you use virtio drivers for storage and network card at all? Can 
>> you see a pattern there where the 3 Debian/Windows machines were not 
>> affected? Did they use virtio or not?
>
> Yes, virtio drivers for storage and network for all Debian and Windows 
> 2008r2.
>
>>
>> I really don’t see a reason why the migration back from 5.13 -> 5.19 
>> should bring that 50/100% CPU load and hanging. I’ve had some phantom 
>> load before with “Use tablet for pointer: Yes” enabled, but 
>> that was in the 5% ballpark per VM.
>
> The issue is not the CPU load per se, but that the VM is hung (it's 
> not possible to do anything on the console).
>
>>
>> I’m just a fellow Proxmox admin/user. Hope this rings a bell or 
>> sparks interest in the core Proxmox team. I’ve had struggles with 5.15 
>> before, with GPU passthrough (wasn’t able to do it) and OpenBSD VMs 
>> taking minutes instead of tens of seconds to boot.
>>
>> All in all, thanks for all the hints I could test before production, 
>> so it won’t hurt “down the road” …
>
> For now, we're pinning the 5.13 kernel, which is working perfectly 
> (except for AMD<->Intel migration, but that is a years-long issue).
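
In case it helps someone else, pinning a node to a known-good kernel can 
be done with proxmox-boot-tool; a minimal sketch, assuming a Proxmox 7 
node recent enough to have the "kernel pin" subcommand:

    # Minimal sketch: pin a Proxmox node to a known-good kernel.
    # "5.13.19-6-pve" is just the version discussed in this thread.
    import subprocess

    PINNED_KERNEL = "5.13.19-6-pve"

    # Show the kernels currently available on this node.
    subprocess.run(["proxmox-boot-tool", "kernel", "list"], check=True)

    # Make the chosen kernel the boot default until it is unpinned again.
    subprocess.run(["proxmox-boot-tool", "kernel", "pin", PINNED_KERNEL], check=True)
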
>
>>
>> JV
>> P.S. I’m trying to push my boss towards a commercial subscription for 
>> our clusters, but at this point I really am not sure it would help ...
>
> I'm sure this must have been reported; I have no idea why it hasn't 
> been fixed or the official kernel downgraded to 5.13. In the forum 
> someone from Proxmox even commented that we shouldn't run clusters 
> with different-generation CPUs, which was shocking to read, frankly. 
> We have customers with commercial support whom we pinned to the 5.13 
> kernel preventively, because we had found the issue in our "eat our 
> own food" cluster beforehand!! :-)
>
> Cheers
>
>>
>>
>>> On 8. 11. 2022, at 18:18, Eneko Lacunza via 
>>> pve-user <pve-user at lists.proxmox.com> wrote:
>>>
>>>
>>> From: Eneko Lacunza <elacunza at binovo.es>
>>> Subject: Re: [PVE-User] VMs hung after live migration - Intel CPU
>>> Date: 8 November 2022 18:18:44 CET
>>> To: pve-user at lists.proxmox.com
>>>
>>>
>>> Hi Jan,
>>>
>>> I had some time to re-test this.
>>>
>>> I tried live migration with KVM64 CPU between 2 nodes:
>>>
>>> node-ryzen1700 - kernel 5.19.7-1-pve
>>> node-ryzen5900x - kernel 5.19.7-1-pve
>>>
>>> I bulk-migrated 9 VMs (8 Debian 9/10/11 and 1 Windows 2008r2).
>>> This works OK in both directions.
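
For context, the bulk migration here just means online-migrating each VM 
in turn, roughly like the sketch below; the VMIDs and node name are 
placeholders, not the actual test VMs:

    # Minimal sketch: online-migrate a set of VMs to a target node, one at a time.
    # The VMIDs and the target node name are placeholders.
    import subprocess

    VMIDS = [101, 102, 103]
    TARGET = "node-ryzen5900x"

    for vmid in VMIDS:
        subprocess.run(["qm", "migrate", str(vmid), TARGET, "--online"], check=True)
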
>>>
>>> Then I downgraded a node to 5.13:
>>> node-ryzen1700 - kernel 5.19.7-1-pve
>>> node-ryzen5900x - kernel 5.13.19-6-pve
>>>
>>> Migration of those 9 VMs worked well from node-ryzen1700 -> 
>>> node-ryzen5900x
>>>
>>> But migration of those 9 VMs back from node-ryzen5900x -> 
>>> node-ryzen1700 was a disaster: all 8 Debian VMs hung with 50/100% CPU 
>>> use. Windows 2008r2 seems not to be affected by the issue at all.
>>>
>>> 3 other Debian/Windows VMs on node-ryzen1700 were not affected.
>>>
>>> After migrating both nodes to kernel 5.13:
>>>
>>> node-ryzen1700 - kernel 5.13.19-6-pve
>>> node-ryzen5900x - kernel 5.13.19-6-pve
>>>
>>> Migration of those 9 VMs from node-ryzen5900x -> node-ryzen1700 works 
>>> as intended :)
>>>
>>> Cheers
>>>
>>>
>>>
>>> On 8/11/22 at 9:40, Eneko Lacunza via pve-user wrote:
>>>> Hi Jan,
>>>>
>>>> Yes, there's no issue if CPUs are the same.
>>>>
>>>> VMs hang when the CPUs are of sufficiently different generations, 
>>>> even when they are of the same brand and using the KVM64 vCPU.
>>>>
>>>> On 7/11/22 at 22:59, Jan Vlach wrote:
>>>>> Hi,
>>>>>
>>>>> For what it’s worth, live VM migration with Linux VMs running various 
>>>>> Debian versions works here just fine. I’m using virtio for 
>>>>> networking and virtio scsi for disks. (The only version where I 
>>>>> had problems was Debian 6, where the kernel does not support virtio 
>>>>> scsi and a megaraid sas 8708EM2 needs to be used; I get a kernel 
>>>>> panic in mpt_sas on thaw after migration.)
>>>>>
>>>>> We're running 5.15.60-1-pve on a three-node cluster with AMD EPYC 
>>>>> 7551P 32-core processors. These are Supermicros with the latest BIOS 
>>>>> (latest microcode?) and BMC firmware.
>>>>>
>>>>> Storage is a local ZFS pool backed by SSDs in striped mirrors (4 
>>>>> devices on each node). Migration has a dedicated 2x 10GigE LACP bond 
>>>>> and a dedicated VLAN on the switch stack.
>>>>>
>>>>> I have more nodes with EPYC3/Milan on the way, so I’ll test those 
>>>>> later as well.
>>>>>
>>>>> What does your cluster look like hardware-wise? What are the problems 
>>>>> you experienced with VM migration on 5.13->5.19?
>>>>>
>>>>> Thanks,
>>>>> JV
>>> Eneko Lacunza
>>> Zuzendari teknikoa | Director técnico
>>> Binovo IT Human Project
>>>
>>> Tel. +34 943 569 206 | https://www.binovo.es
>>> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>>>
>>> https://www.youtube.com/user/CANALBINOVO
>>> https://www.linkedin.com/company/37269706/
>>>
>>>

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/

