[pve-devel] [RFC kernel] revert problematic TSC multiplier commit

Mon Sep 5 10:25:36 CEST 2022

Hi Fiona,

I just confirmed that, in addition to the issue reported in 
https://bugzilla.proxmox.com/show_bug.cgi?id=4073 (live-migrated VM hung 
using 100% CPU), we can also reproduce the issue reported in

https://forum.proxmox.com/threads/zeitspr%C3%BCnge-in-vms-seit-pve-7-2.112756/  

At least, that is what I understand from the Google translation :)

- From 5.15.39-4-pve (Ryzen 2600X) to 5.15.39-4-pve (Ryzen 1700): Problems
   * Migration of a VM hung 4 other VMs on the destination host (no 100% CPU use; consoles show time-related issues, watchdog messages, etc.).

We can test the patch if you can provide a kernel .deb package.

Thanks

On 2/9/22 at 14:08, Eneko Lacunza wrote:
> Hi,
>
> On 2/9/22 at 9:59, Eneko Lacunza wrote:
>> Hi,
>>
>> On 2/9/22 at 9:47, Fiona Ebner wrote:
>>> On 02.09.22 at 09:22, Eneko Lacunza wrote:
>>>> Hi Fiona,
>>>>
>>>> Does this patch correspond to kernels linked in this forum thread?
>>>>
>>>> https://forum.proxmox.com/threads/proxmox-7-2-3-ceph-16-2-7-migrating-vms-hangs-them-kernel-panic-on-linux-freeze-on-windows.109645/page-2#post-488479
>>>>
>>> No, there is no public build with the below patch yet.
>> Ok, thanks for the clarification.
>>
>>> Did you already test the kernel with the fpu patches that's mentioned in
>>> that forum post?
>>
>> No, I was waiting for a good time-window in our prod cluster to test 
>> it :) Seems it will be today.
>
> I have just tested, and that patch doesn't seem to help. VMs hung with 
> 100% CPU use on the live-migration destination host with that version. 
> I just updated the bugzilla entry.
>
>
>>
>>>> If so I can test them and see if that helps with bugzilla entry #4073:
>>>> https://bugzilla.proxmox.com/show_bug.cgi?id=4073
>>>>
>>> I don't think these issues are related, as there, the VM that's been
>>> migrated hangs, and here other VMs on the node were affected.
>>
>> Yes, that's true, but I have seen other VMs on the nodes being 
>> affected too (but less frequently). Maybe we are impacted by both 
>> issues :)
>
> I have easily reproduced the hang on migrated (Linux) VMs, but not 
> hangs of other VMs in today's tests.
>
> Cheers
>
>>
>>>
>>>>>> which might be responsible for several issues reported in the
>>>>>> community forum[0][1].
>>>>>>
>>>>>> In my case, loading a VM snapshot that originally was taken on
>>>>>> a CPU from a different vendor often caused problems in other VMs(!).
>>>>>> In particular, it often led to RCU stalls (with similar messages as in
>>>>>> [1]) or slowdowns, and sometimes clock jumps far into the future (like
>>>>>> in [0]). With this revert applied, everything seems to run smoothly
>>>>>> even after loading the "bad" snapshot 10 times.
>>>>>>
>>>>>> [0] https://forum.proxmox.com/threads/112756/
>>>>>> [1] https://forum.proxmox.com/threads/111494/
>>> The fix 11d39e8cc43e1c6737af19ca9372e590061b5ad2 is only for AMD/SVM, so
>>> most likely [1], where people with Intel N5105 are affected, is not
>>> related either. RCU stall messages can happen for different reasons of
>>> course ;)
>>>
>>
>> Our cluster has AMD CPUs.
>>
>> I'll report back the results of our tests if I can finally try the 
>> test kernel today.
>>
>> Thanks
>>
>

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/