[pve-devel] PVE child process behavior question
Fabian Grünbichler
f.gruenbichler at proxmox.com
Mon Jun 2 10:49:34 CEST 2025
> Denis Kanchev <denis.kanchev at storpool.com> wrote on 02.06.2025 10:35 CEST:
>
>
> > I thought your storage plugin is a shared storage, so there is no storage migration at all, yet you keep talking about storage migration?
>
> It's a shared storage indeed. The issue was that the migration process on the destination host got OOM-killed and the migration failed; that's most probably why there is no log about the storage migration, but it didn't stop the storage migration on the destination host.
could you please explain what you mean by storage migration? :)
when I say "storage migration" I mean either
- the target VM exporting newly allocated volumes via NBD, and the source
VM mirroring its disks via blockjob onto those exported volumes
- PVE::Storage::storage_migrate, which exports a volume, pipes it over SSH
or a websocket tunnel and imports it on the other side
the first is what happens in a live migration for volumes currently used
by the VM. the second is what happens for other volumes, or in case of an
offline migration.
both will only happen for local volumes, as with a shared storage,
*there is nothing to migrate*.
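to make the second mechanism concrete, here is a rough sketch of the data flow only - the volume ID and target node below are made up, and the real PVE::Storage::storage_migrate code additionally handles format negotiation, snapshots and the insecure/websocket tunnel variants:

    use PVE::Tools qw(run_command);

    # hypothetical local (non-shared) volume and target node, just to show the flow
    my $volid  = 'local-lvm:vm-2421-disk-0';
    my $target = 'root@10.10.17.6';

    # export on the source, pipe over SSH, import on the destination
    run_command([
        ['pvesm', 'export', $volid, 'raw+size', '-'],
        ['ssh', $target, '--', 'pvesm', 'import', $volid, 'raw+size', '-'],
    ]);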
are you talking about something your storage does (hand-over of control?)?
there also is no "migration process on the destination host" - there is just the
target VM running there. did that VM get OOM-killed? or the `qm start`
invocation itself? or ... ? the migration task only runs on the source
node..
please really try to be specific here, it's easy to misunderstand things
or guess wrongly otherwise..
AFAIU, the sequence was:
- migration started
- target VM started
- live-migration started
- something happens on the destination node (??) that aborts the migration
- source node does migrate_cancel (which is somehow hooked to your storage and removes a flag/lock/.. on the volume?) - see the cleanup sketch after this list
- something on the destination node calls activate_volume (which checks this flag/lock and is confused because it is missing?)
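for reference, aborting phase 2 on the source node roughly amounts to the following - a simplified sketch, not the verbatim QemuMigrate.pm code; the source node name is a placeholder, as it does not appear in your logs:

    use PVE::QemuServer::Monitor qw(mon_cmd);
    use PVE::Tools;

    my $vmid       = 2421;
    my $target     = 'root@telpr01pve03';   # destination node from the 03:26 task log
    my $sourcenode = 'pveXX';               # placeholder - source node not shown in the logs

    # cancel the in-flight RAM migration on the source QEMU instance
    eval { mon_cmd($vmid, 'migrate_cancel'); };
    warn "migrate_cancel failed: $@" if $@;

    # then the half-started target VM is stopped via the destination node's qm;
    # this remote cleanup is what ends up calling into the storage plugin there
    eval {
        PVE::Tools::run_command(
            ['ssh', $target, '--', 'qm', 'stop', $vmid, '--skiplock', '--migratedfrom', $sourcenode],
        );
    };
    warn "remote cleanup failed: $@" if $@;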
> 2025-04-11T03:26:52.283913+07:00 telpr01pve03 kernel: [96031.290519] pvesh invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>
> Here is one more migration task attempt where it lived long enough to show a more detailed log:
>
> 2025-04-11 03:29:11 starting migration of VM 2421 to node 'telpr01pve06' (10.10.17.6)
> 2025-04-11 03:29:11 starting VM 2421 on remote node 'telpr01pve06'
> 2025-04-11 03:29:15 [telpr01pve06] Warning: sch_htb: quantum of class 10001 is big. Consider r2q change.
> 2025-04-11 03:29:15 [telpr01pve06] kvm: failed to find file '/usr/share/qemu-server/bootsplash.jpg'
> 2025-04-11 03:29:15 start remote tunnel
> 2025-04-11 03:29:16 ssh tunnel ver 1
> 2025-04-11 03:29:16 starting online/live migration on unix:/run/qemu-server/2421.migrate
> 2025-04-11 03:29:16 set migration capabilities
> 2025-04-11 03:29:16 migration downtime limit: 100 ms
> 2025-04-11 03:29:16 migration cachesize: 256.0 MiB
> 2025-04-11 03:29:16 set migration parameters
> 2025-04-11 03:29:16 start migrate command to unix:/run/qemu-server/2421.migrate
> 2025-04-11 03:29:17 migration active, transferred 281.0 MiB of 2.0 GiB VM-state, 340.5 MiB/s
> 2025-04-11 03:29:18 migration active, transferred 561.5 MiB of 2.0 GiB VM-state, 307.2 MiB/s
> 2025-04-11 03:29:19 migration active, transferred 849.2 MiB of 2.0 GiB VM-state, 288.5 MiB/s
> 2025-04-11 03:29:20 migration active, transferred 1.1 GiB of 2.0 GiB VM-state, 283.7 MiB/s
> 2025-04-11 03:29:21 migration active, transferred 1.4 GiB of 2.0 GiB VM-state, 302.5 MiB/s
> 2025-04-11 03:29:23 migration active, transferred 1.8 GiB of 2.0 GiB VM-state, 278.6 MiB/s
> 2025-04-11 03:29:23 migration status error: failed
> 2025-04-11 03:29:23 ERROR: online migrate failure - aborting
> 2025-04-11 03:29:23 aborting phase 2 - cleanup resources
> 2025-04-11 03:29:23 migrate_cancel
> 2025-04-11 03:29:25 ERROR: migration finished with problems (duration 00:00:14)
> TASK ERROR: migration problems
>
>
> > could you provide the full migration task log and the VM config?
> 2025-04-11 03:26:50 starting migration of VM 2421 to node 'telpr01pve03' (10.10.17.3) ### QemuMigrate::phase1() +749
> 2025-04-11 03:26:50 starting VM 2421 on remote node 'telpr01pve03' # QemuMigrate::phase2_start_local_cluster() +888
> 2025-04-11 03:26:52 ERROR: online migrate failure - remote command failed with exit code 255
> 2025-04-11 03:26:52 aborting phase 2 - cleanup resources
> 2025-04-11 03:26:52 migrate_cancel
> 2025-04-11 03:26:53 ERROR: migration finished with problems (duration 00:00:03)
> TASK ERROR: migration problems
>
>
> VM config:
> #Ubuntu-24.04-14082024
> #StorPool adjustment
> agent: 1,fstrim_cloned_disks=1
> autostart: 1
> boot: c
> bootdisk: scsi0
> cipassword: XXX
> citype: nocloud
> ciupgrade: 0
> ciuser: test
> cores: 2
> cpu: EPYC-Genoa
> cpulimit: 2
> ide0: VMDataSp:vm-2421-cloudinit.raw,media=cdrom
> ipconfig0: ipxxx
> memory: 2048
> meta: creation-qemu=8.1.5,ctime=1722917972
> name: kredibel-service
> nameserver: xxx
> net0: virtio=xxx,bridge=vmbr2,firewall=1,rate=250,tag=220
> numa: 0
> onboot: 1
> ostype: l26
> scsi0: VMDataSp:vm-2421-disk-0-sp-bj7n.b.sdj.raw,aio=native,discard=on,iops_rd=20000,iops_rd_max=40000,iops_rd_max_length=60,iops_wr=20000,iops_wr_max=40000,iops_wr_max_length=60,iothread=1,size=40G
> scsihw: virtio-scsi-single
> searchdomain: neo.internal
> serial0: socket
> smbios1: uuid=dfxxx
> sockets: 1
> sshkeys: ssh-rsa%
> vmgenid: 17b154a0-
>
> In this case the call to PVE::Storage::Plugin::activate_volume() was performed after migration cancellation:
>
> 2025-04-11T03:26:53.072206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.abe
>
> (This log line is from the sub activate_volume() in our custom storage plugin.)
>
>
>
>
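for illustration, the kind of activate_volume override described above could look roughly like this - the plugin name and the 'migrate'-lock check are assumptions on my side, the actual StorPool logic is not shown in this thread. a check along these lines would indeed get confused if migrate_cancel already dropped the lock before the destination node gets around to activating the volume:

    package PVE::Storage::Custom::ExamplePlugin;   # hypothetical plugin, not the actual StorPool one

    use strict;
    use warnings;

    use base qw(PVE::Storage::Plugin);

    # pulling in qemu-server from a storage plugin is a layering shortcut,
    # fine for illustration only
    use PVE::QemuConfig;

    sub activate_volume {
        my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;

        # derive the owning VMID from the usual vm-<id>-... volume naming
        my ($vmid) = $volname =~ /vm-(\d+)-/;

        # assumption: treat "VM config carries a 'migrate' lock" as "incoming live migration"
        my $conf = $vmid ? eval { PVE::QemuConfig->load_config($vmid) } : undef;
        my $is_live_migration = $conf && ($conf->{lock} // '') eq 'migrate';

        if (!$is_live_migration) {
            # a message like the one quoted above would come from a branch like this;
            # it fires whenever the lock is already gone, e.g. because migrate_cancel
            # ran before activate_volume was called on the destination node
            warn "ExamplePlugin: NOT a live migration of VM " . ($vmid // '?')
                . ", will force detach volume $volname\n";
            # ... force-detach / re-attach logic would go here ...
        }

        return 1;
    }

    1;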