[PVE-User] Failed live migration on Supermicro with EPYCv1 - what's going on here?

DERUMIER, Alexandre alexandre.derumier at groupe-cyllene.com
Fri Oct 20 15:56:54 CEST 2023


Hi,

What is the CPU model of the VMs?
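You can check it directly in the VM config on the source node, for example (VMID 148 taken from your log):

  # show the CPU type configured for the VM; no output usually means the default (kvm64)
  qm config 148 | grep -i '^cpu'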


-------- Original Message --------
From: Jan Vlach <janus at volny.cz>
Reply-To: Proxmox VE user list <pve-user at lists.proxmox.com>
To: Proxmox VE user list <pve-user at lists.proxmox.com>
Subject: [PVE-User] Failed live migration on Supermicro with EPYCv1 -
what's going on here?
Date: 20/10/2023 11:52:45

Hello proxmox-users,

I have a Proxmox cluster running PVE7 with a local ZFS pool (striped
mirrors), fully patched (no-subscription repo), and I’m now rebooting
into new kernels (5.15.60 -> .126).
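(For comparing nodes, the running kernel and QEMU package versions can be listed with the standard tooling, e.g.:

  # per node: running kernel plus pve-kernel/QEMU package versions
  uname -r
  pveversion -v | grep -E 'pve-kernel|qemu'
)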

The migration network is a dedicated 2x 10 GigE LACP bond on every node.
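(The cluster-wide setting for that is the migration line in /etc/pve/datacenter.cfg; roughly the sketch below, where the exact subnet is only assumed from the 10.30.24.20 address in the log:

  # /etc/pve/datacenter.cfg - send migration traffic over the dedicated network
  migration: secure,network=10.30.24.0/24
)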

These are dual-socket Supermicro boxes with 2x AMD EPYC 7281 16-core
processors. The microcode is already at 0x800126e everywhere.
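(The microcode revision can be confirmed per node with e.g.:

  # loaded microcode revision, plus any early-load message from the kernel
  grep -m1 microcode /proc/cpuinfo
  journalctl -k | grep -i microcode
)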

The VMs are Cisco IronPort appliances running FreeBSD (no qemu-guest-
agent; it is disabled in the settings). For some of them, live migration
fails while transferring the contents of RAM. The job cleans up the
remote zvol, but kills the source VM.

A couple of weeks ago, I migrated at least 24 IronPort VMs without a hiccup.

What’s going on here? Where else can I look? The log is below, with the
500G disk transfer snipped; there were no errors in that part, just time
and percent going up.
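Is grepping the journal on both nodes around the failure time the right place to start, something like:

  # on source and target node, around the failed migration (times from the log below)
  journalctl --since "2023-10-20 11:39" --until "2023-10-20 11:41" | grep -iE 'qemu|kvm|segfault|oom'

or is there a better spot?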

On a tangent: on bulk migrate, the first VM in the batch complains that
port 60001 is already in use and the job can’t bind, so the first VM
gets skipped. Probably unrelated, as it’s a different error.
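(If useful, something like this on the target node should show what is holding the port when that happens:

  # what is bound to the 60000+ migration port range on the target node
  ss -tlnp | grep ':600'
)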

Thank you for cluestick,
JV

2023-10-20 11:10:47 use dedicated network address for sending migration traffic (10.30.24.20)
2023-10-20 11:10:47 starting migration of VM 148 to node 'prox-node7' (10.30.24.20)
2023-10-20 11:10:47 found local disk 'local-zfs:vm-148-disk-0' (in current VM config)
2023-10-20 11:10:47 starting VM 148 on remote node 'prox-node7'
2023-10-20 11:10:52 volume 'local-zfs:vm-148-disk-0' is 'local-zfs:vm-148-disk-0' on the target
2023-10-20 11:10:52 start remote tunnel
2023-10-20 11:10:53 ssh tunnel ver 1
2023-10-20 11:10:53 starting storage migration
2023-10-20 11:10:53 scsi0: start migration to nbd:10.30.24.20:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 477.0 MiB of 500.0 GiB (0.09%) in 14m 17s
drive-scsi0: transferred 1.1 GiB of 500.0 GiB (0.22%) in 14m 18s
drive-scsi0: transferred 1.7 GiB of 500.0 GiB (0.34%) in 14m 19s
drive-scsi0: transferred 2.3 GiB of 500.0 GiB (0.46%) in 14m 20s
drive-scsi0: transferred 2.9 GiB of 500.0 GiB (0.57%) in 14m 21s
drive-scsi0: transferred 3.5 GiB of 500.0 GiB (0.69%) in 14m 22s
drive-scsi0: transferred 4.0 GiB of 500.0 GiB (0.81%) in 14m 23s
drive-scsi0: transferred 4.6 GiB of 500.0 GiB (0.92%) in 14m 24s
drive-scsi0: transferred 5.0 GiB of 500.0 GiB (1.01%) in 14m 25s
drive-scsi0: transferred 5.5 GiB of 500.0 GiB (1.10%) in 14m 26s
drive-scsi0: transferred 6.0 GiB of 500.0 GiB (1.20%) in 14m 27s
drive-scsi0: transferred 6.5 GiB of 500.0 GiB (1.30%) in 14m 28s
drive-scsi0: transferred 6.9 GiB of 500.0 GiB (1.38%) in 14m 29s
drive-scsi0: transferred 7.4 GiB of 500.0 GiB (1.48%) in 14m 30s
drive-scsi0: transferred 7.8 GiB of 500.0 GiB (1.56%) in 14m 31s
drive-scsi0: transferred 8.2 GiB of 500.0 GiB (1.65%) in 14m 32s
… snipped to keep it sane, no errors here …
drive-scsi0: transferred 500.2 GiB of 500.8 GiB (99.87%) in 28m 32s
drive-scsi0: transferred 500.5 GiB of 500.8 GiB (99.94%) in 28m 33s
drive-scsi0: transferred 500.5 GiB of 500.8 GiB (99.95%) in 28m 34s
drive-scsi0: transferred 500.6 GiB of 500.8 GiB (99.96%) in 28m 35s
drive-scsi0: transferred 500.7 GiB of 500.8 GiB (99.97%) in 28m 36s
drive-scsi0: transferred 500.7 GiB of 500.8 GiB (99.97%) in 28m 37s
drive-scsi0: transferred 500.7 GiB of 500.8 GiB (99.98%) in 28m 38s
drive-scsi0: transferred 500.8 GiB of 500.8 GiB (99.99%) in 28m 39s
drive-scsi0: transferred 500.8 GiB of 500.8 GiB (100.00%) in 28m 40s
drive-scsi0: transferred 500.8 GiB of 500.8 GiB (100.00%) in 28m 41s, ready
all 'mirror' jobs are ready
2023-10-20 11:39:34 starting online/live migration on tcp:10.30.24.20:60000
2023-10-20 11:39:34 set migration capabilities
2023-10-20 11:39:34 migration downtime limit: 100 ms
2023-10-20 11:39:34 migration cachesize: 1.0 GiB
2023-10-20 11:39:34 set migration parameters
2023-10-20 11:39:34 start migrate command to tcp:10.30.24.20:60000
2023-10-20 11:39:35 migration active, transferred 615.9 MiB of 8.0 GiB VM-state, 537.5 MiB/s
2023-10-20 11:39:36 migration active, transferred 1.1 GiB of 8.0 GiB VM-state, 812.6 MiB/s
2023-10-20 11:39:37 migration active, transferred 1.6 GiB of 8.0 GiB VM-state, 440.5 MiB/s
2023-10-20 11:39:38 migration active, transferred 2.1 GiB of 8.0 GiB VM-state, 495.3 MiB/s
2023-10-20 11:39:39 migration active, transferred 2.5 GiB of 8.0 GiB VM-state, 250.1 MiB/s
2023-10-20 11:39:40 migration active, transferred 2.9 GiB of 8.0 GiB VM-state, 490.4 MiB/s
2023-10-20 11:39:41 migration active, transferred 3.4 GiB of 8.0 GiB VM-state, 514.4 MiB/s
2023-10-20 11:39:42 migration active, transferred 3.9 GiB of 8.0 GiB VM-state, 485.9 MiB/s
2023-10-20 11:39:43 migration active, transferred 4.3 GiB of 8.0 GiB VM-state, 488.2 MiB/s
2023-10-20 11:39:44 migration active, transferred 4.8 GiB of 8.0 GiB VM-state, 738.3 MiB/s
2023-10-20 11:39:45 migration active, transferred 5.6 GiB of 8.0 GiB VM-state, 730.8 MiB/s
2023-10-20 11:39:46 migration active, transferred 6.2 GiB of 8.0 GiB VM-state, 492.9 MiB/s
2023-10-20 11:39:47 migration active, transferred 6.7 GiB of 8.0 GiB VM-state, 471.5 MiB/s
2023-10-20 11:39:48 migration active, transferred 7.1 GiB of 8.0 GiB VM-state, 469.4 MiB/s
2023-10-20 11:39:49 migration active, transferred 7.9 GiB of 8.0 GiB VM-state, 666.7 MiB/s
2023-10-20 11:39:50 migration active, transferred 8.6 GiB of 8.0 GiB VM-state, 771.9 MiB/s
2023-10-20 11:39:51 migration active, transferred 9.4 GiB of 8.0 GiB VM-state, 1.2 GiB/s
2023-10-20 11:39:51 xbzrle: send updates to 33286 pages in 23.2 MiB encoded memory, cache-miss 96.68%, overflow 5045
2023-10-20 11:39:52 auto-increased downtime to continue migration: 200 ms
2023-10-20 11:39:53 migration active, transferred 9.9 GiB of 8.0 GiB VM-state, 1.1 GiB/s
2023-10-20 11:39:53 xbzrle: send updates to 177238 pages in 60.2 MiB encoded memory, cache-miss 73.74%, overflow 9766
query migrate failed: VM 148 qmp command 'query-migrate' failed - client closed connection

2023-10-20 11:39:54 query migrate failed: VM 148 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 148 not running

2023-10-20 11:39:55 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running

2023-10-20 11:39:56 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running

2023-10-20 11:39:57 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running

2023-10-20 11:39:58 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running

2023-10-20 11:39:59 query migrate failed: VM 148 not running
2023-10-20 11:39:59 ERROR: online migrate failure - too many query migrate failures - aborting
2023-10-20 11:39:59 aborting phase 2 - cleanup resources
2023-10-20 11:39:59 migrate_cancel
2023-10-20 11:39:59 migrate_cancel error: VM 148 not running
2023-10-20 11:39:59 ERROR: query-status error: VM 148 not running
drive-scsi0: Cancelling block job
2023-10-20 11:39:59 ERROR: VM 148 not running
2023-10-20 11:40:04 ERROR: migration finished with problems (duration 00:29:17)
TASK ERROR: migration problems
