[PVE-User] Proxmox VM hard resets

Tue Jan 17 18:45:11 CET 2023

Can you reproduce this with a Debian 11 or Ubuntu 22 VM (create some load
there)? I don't think this is a Proxmox problem that can be solved at the
Proxmox/VM-guest level.

See, for example:
https://www.theregister.com/2017/11/28/stunning_antistun_vm_stun_problem_fix/

roland

On 17.01.23 at 16:04, Adam Weremczuk wrote:
> Hi all,
>
> My environment is quite unusual as I run PVE 7.2-11 as a VM on VMware
> 7.0.2. It runs several LXC containers and generally things are working
> fine.
>
> Recently the Proxmox VM (called "jaguar") started resetting itself
> (and all containers) shortly after Altaro VM Backup kicked off a
> scheduled VM backup over the network.
> Each time a hard reset was requested by the OS itself (Proxmox
> hypervisor).
>
> The duration of the "stun/unstun" operation seems to be causing the
> issue here. Normally a stun/unstun takes a very short time, but in my
> case it varies with the load on both the hypervisor and the guest VM
> (the nested hypervisor) and can take considerably longer. Here is a
> snippet from several stun/unstun operations:
>
> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for 32142467 us
> 2023-01-12T23:01:12.848Z| vcpu-0| | I005: CPT: vm was stunned for 14942070 us
> 2023-01-12T23:11:35.984Z| vcpu-0| opID=1487b0d5| I005: CPT: vm was stunned for 277986 us
> 2023-01-12T23:11:39.431Z| vcpu-0| | I005: CPT: vm was stunned for 122089 us
>
> As you can see, the stun time differs from disk to disk. What I think
> is happening is that, when the VM (the virtualized hypervisor) is
> stunned for long enough, its watchdog notices that the OS has been
> frozen for some amount of time and issues a hard reset. My guess is
> that the guest OS issues the hard reset when the stun time exceeds
> 30 sec.
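
To check that guess against a complete vmware.log, the stun durations can be pulled out and converted to seconds. The following is only a minimal Python sketch. It assumes each log entry sits on one line with the same "vm was stunned for N us" wording as in the snippet above. The function name is made up for illustration, and the 30-second threshold is just the guess from this thread, not a documented limit.

```python
import re

# Matches the stun message seen in the vmware.log excerpt. The timestamp is
# the first "|"-terminated field; the duration is given in microseconds.
STUN_RE = re.compile(
    r"^(?P<ts>\S+?)\|.*?vm was stunned for\s+(?P<us>\d+)\s+us",
    re.MULTILINE,
)


def stun_events(log_text, threshold_s=30.0):
    """Return (timestamp, seconds, over_threshold) for each stun message.

    threshold_s is an assumption taken from the guess in this thread.
    """
    events = []
    for match in STUN_RE.finditer(log_text):
        seconds = int(match.group("us")) / 1_000_000
        events.append((match.group("ts"), seconds, seconds > threshold_s))
    return events


# The four stun lines quoted in this thread, unwrapped to one line each.
SAMPLE = """\
2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for 32142467 us
2023-01-12T23:01:12.848Z| vcpu-0| | I005: CPT: vm was stunned for 14942070 us
2023-01-12T23:11:35.984Z| vcpu-0| opID=1487b0d5| I005: CPT: vm was stunned for 277986 us
2023-01-12T23:11:39.431Z| vcpu-0| | I005: CPT: vm was stunned for 122089 us
"""

if __name__ == "__main__":
    for ts, seconds, over in stun_events(SAMPLE):
        flag = "  <-- over threshold" if over else ""
        print(f"{ts}  {seconds:8.3f} s{flag}")
```

On the four quoted lines, only the first stun (about 32 s) is above the assumed threshold. That is the one followed by the hard reset request in the longer excerpt.
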
>
> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for 32142467 us
> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: SnapshotVMXTakeSnapshotWork: Transition to mode 1.
> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: SnapshotVMXTakeSnapshotComplete: Done with snapshot 'ALTAROTEMPSNAPSHOTDONOTDELETE463b73a7-f363-4daf-acf3-b0322fe84429': 95
> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: VigorTransport_ServerSendResponse opID=1487b008 seq=887616: Completed Snapshot request.
> 2023-01-12T23:00:55.409Z| vcpu-8| | I005: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/61364720-e494cfe4-6cff-b083fed97d91/jaguar/jaguar-000001.vmdk'
> 2023-01-12T23:00:55.409Z| vcpu-8| | I005: DDB: "longContentID" = "08bf301ae8e75c151d2f273571a4ea9f" (was "2a6fd4c33a60f8d724ccc100a666f0d7")
> 2023-01-12T23:00:57.906Z| vcpu-8| | I005: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0xa666f0d7, new=0x71a4ea9f (08bf301ae8e75c151d2f273571a4ea9f)
> 2023-01-12T23:00:57.906Z| vcpu-9| | I005: Chipset: The guest has requested that the virtual machine be hard reset.
>
> I'm struggling to establish how the watchdog timer (or equivalent) is
> configured :( Maybe increasing its trigger time would solve the issue?
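
One place to start looking is the kernel's own lockup-detector and panic settings inside the Proxmox guest. The following is a minimal Python sketch, assuming a Linux guest that exposes the usual entries under /proc/sys/kernel; the function name is made up for illustration. Whether any of these settings is what triggers the reset here is not established in this thread, and a separate watchdog device (/dev/watchdog) would not show up in this list.

```python
from pathlib import Path

# Standard Linux sysctl names under kernel.*; some may be absent depending
# on how the running kernel was configured.
SETTINGS = (
    "watchdog",                # lockup detectors enabled (1) or disabled (0)
    "watchdog_thresh",         # lockup detector threshold, in seconds
    "softlockup_panic",        # panic when a soft lockup is detected
    "hardlockup_panic",        # panic when a hard lockup is detected
    "hung_task_timeout_secs",  # hung-task detector timeout, in seconds
    "hung_task_panic",         # panic when a hung task is detected
    "panic",                   # seconds to wait before reboot after a panic
)


def read_kernel_settings(base="/proc/sys/kernel"):
    """Return {name: value-as-string, or None if the entry is unreadable}."""
    result = {}
    for name in SETTINGS:
        try:
            result[name] = (Path(base) / name).read_text().strip()
        except OSError:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_kernel_settings().items():
        shown = value if value is not None else "(not available)"
        print(f"kernel.{name} = {shown}")
```

If one of the *_panic entries is 1 and kernel.panic is greater than 0, a detected lockup leads to a panic followed by an automatic reboot. That would be one possible, but here unconfirmed, source of a guest-requested reset.
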
>
> Any other ideas / similar experiences?
>
> Regards,
> Adam