[pve-devel] [RFC kernel] revert problematic TSC multiplier commit

Eneko Lacunza elacunza at binovo.es
Fri Sep 2 14:08:31 CEST 2022


Hi,

On 2/9/22 at 9:59, Eneko Lacunza wrote:
> Hi,
>
> On 2/9/22 at 9:47, Fiona Ebner wrote:
>> On 02.09.22 at 09:22, Eneko Lacunza wrote:
>>> Hi Fiona,
>>>
>>> Does this patch correspond to kernels linked in this forum thread?
>>>
>>> https://forum.proxmox.com/threads/proxmox-7-2-3-ceph-16-2-7-migrating-vms-hangs-them-kernel-panic-on-linux-freeze-on-windows.109645/page-2#post-488479
>>>
>> No, there is no public build with the below patch yet.
> Ok, thanks for the clarification.
>
>> Did you already test the kernel with the fpu patches that's mentioned in
>> that forum post?
>
> No, I was waiting for a good time-window in our prod cluster to test 
> it :) Seems it will be today.

I have just tested it, and that patch doesn't seem to help. With that
kernel version, VMs hung at 100% CPU usage on the live-migration
destination host. I have just updated the Bugzilla entry.


>
>>> If so I can test them and see if that helps with bugzilla entry #4073:
>>> https://bugzilla.proxmox.com/show_bug.cgi?id=4073
>>>
>> I don't think these issues are related, as there, the VM that's been
>> migrated hangs, and here other VMs on the node were affected.
>
> Yes, that's true, but I have also seen other VMs on the nodes
> affected (though less frequently). Maybe we are impacted by both
> issues :)

I easily reproduced the hang on migrated (Linux) VMs, but no other VMs
hung in today's tests.

Cheers

>
>>
>>>>> which might be responsible for several issues reported in the
>>>>> community forum[0][1].
>>>>>
>>>>> In my case, loading a VM snapshot that originally was taken on
>>>>> a CPU from a different vendor often caused problems in other VMs(!).
>>>>> In particular, it often led to RCU stalls (with messages similar to
>>>>> those in [1]) or slowdowns, and sometimes clock jumps far into the future (like
>>>>> in [0]). With this revert applied, everything seems to run smoothly
>>>>> even after loading the "bad" snapshot 10 times.
>>>>>
>>>>> [0] https://forum.proxmox.com/threads/112756/
>>>>> [1] https://forum.proxmox.com/threads/111494/
>> The fix 11d39e8cc43e1c6737af19ca9372e590061b5ad2 is only for AMD/SVM, so
>> most likely [1], where people with Intel N5105 are affected, is not
>> related either. RCU stall messages can happen for different reasons of
>> course ;)
>>
>
> Our cluster has AMD CPUs.
>
> I'll report back the results of our tests if I can finally try the 
> test kernel today.
>
> Thanks
>

Eneko Lacunza
Technical Director
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


