[PVE-User] Problem with QEMU drive-mirror after cancelling VM disk move

Wed Apr 1 09:45:20 CEST 2020

Hello Fabian!

On 4/1/20 9:38 AM, Fabian Grünbichler wrote:
>> drive-scsi0: transferred: 6571425792 bytes remaining: 4154458112 bytes
>> total: 10725883904 bytes progression: 61.27 % busy: 1 ready: 0
>> drive-scsi0: Cancelling block job
> was the target some sort of network storage that started hanging? this 
> looks rather unusual..

We were able to reproduce this issue just now on the same cluster.
The Disk move operation was moving a disk from a local "directory" type
of storage (VM disks reside as .qcow2 files) to an attached CEPH
storage pool.

We made 2 attempts on the same VM, with the same disk - the first
attempt failed (disk transfer got stuck at 10.25 % progress):

deprecated setting 'migration_unsecure' and new 'migration: type' set at
same time! Ignore 'migration_unsecure'
create full clone of drive scsi1
(nvme-local-vm:82082108/vm-82082108-disk-1.qcow2)
drive mirror is starting for drive-scsi1
drive-scsi1: transferred: 737148928 bytes remaining: 20737687552 bytes
total: 21474836480 bytes progression: 3.43 % busy: 1 ready: 0
drive-scsi1: transferred: 1512046592 bytes remaining: 19962789888 bytes
total: 21474836480 bytes progression: 7.04 % busy: 1 ready: 0
drive-scsi1: transferred: 2198994944 bytes remaining: 19260243968 bytes
total: 21459238912 bytes progression: 10.25 % busy: 1 ready: 0
[[[[ here goes 230+ lines of the same 10.25 % progress status ]]]]
drive-scsi1: Cancelling block job
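
For anyone who wants to check a task log for the same symptom, here is a
minimal Python sketch that flags a stalled mirror when the "transferred"
byte count stops changing. The function names and the `repeats`
threshold are made up for illustration; only the status line format is
taken from the log above.

```python
import re

# One status record; \s+ lets a record span a wrapped line, as in this mail.
STATUS_RE = re.compile(
    r"(?P<device>drive-\w+): transferred: (?P<done>\d+) bytes\s+"
    r"remaining: (?P<remaining>\d+) bytes\s+"
    r"total: (?P<total>\d+) bytes"
)


def parse_status(log_text):
    """Return (device, transferred, total) tuples in log order."""
    return [
        (m.group("device"), int(m.group("done")), int(m.group("total")))
        for m in STATUS_RE.finditer(log_text)
    ]


def is_stalled(records, repeats=3):
    """True if the last `repeats` records report the same transferred bytes."""
    if len(records) < repeats:
        return False
    tail = [done for _, done, _ in records[-repeats:]]
    return len(set(tail)) == 1
```

The threshold of 3 identical records is arbitrary; in our case the same
status line repeated 230+ times, so any small value would have caught it.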

After cancelling this job I looked into the VM's QM monitor to see if
the block job was still there, and of course it was:

# info block-jobs
Type mirror, device drive-scsi1: Completed 2198994944 of 21459238912
bytes, speed limit 0 bytes/s
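
The monitor output above can be turned into numbers with a small Python
sketch like the one below. The regular expression is based only on the
single line shown here, so treat the field layout as an assumption;
`parse_block_jobs` is a made-up helper name.

```python
import re

# Pattern derived from the one "info block-jobs" line quoted above.
JOB_RE = re.compile(
    r"Type (?P<type>\w+), device (?P<device>[\w-]+): "
    r"Completed (?P<done>\d+) of (?P<total>\d+)\s+bytes, "
    r"speed limit (?P<limit>\d+) bytes/s"
)


def parse_block_jobs(text):
    """Return one dict per block job found in the monitor output."""
    jobs = []
    for m in JOB_RE.finditer(text):
        done, total = int(m.group("done")), int(m.group("total"))
        jobs.append({
            "type": m.group("type"),
            "device": m.group("device"),
            "done": done,
            "total": total,
            # Percentage of the job that has been copied so far.
            "percent": round(100 * done / total, 2) if total else 0.0,
        })
    return jobs
```

Comparing `done` between two calls a few seconds apart shows whether the
job is still making progress.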

Trying to cancel this block job does nothing, so our next step was to
shut down the VM from the Proxmox GUI - this also fails with the
following in the task log:

TASK ERROR: VM quit/powerdown failed - got timeout
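
For reference, the monitor check and the cancel attempt above correspond
to the QMP commands `query-block-jobs` and `block-job-cancel`. The
Python sketch below builds those messages and sends them over a UNIX
socket. It is not what we ran; the helper names are made up, the socket
path in the usage note is an assumption about where Proxmox puts the
QMP socket, and I have not tried it against a VM in the hung state.

```python
import json
import socket


def qmp_command(name, **arguments):
    """Build one QMP command as a single JSON line."""
    msg = {"execute": name}
    if arguments:
        msg["arguments"] = arguments
    return json.dumps(msg) + "\n"


def cancel_messages(device):
    """Negotiate capabilities, list block jobs, then cancel the job."""
    return [
        qmp_command("qmp_capabilities"),
        qmp_command("query-block-jobs"),
        qmp_command("block-job-cancel", device=device),
    ]


def run(socket_path, messages, timeout=10.0):
    """Send each message and collect the replies, skipping async events."""
    replies = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        json.loads(stream.readline())  # greeting banner sent by QEMU
        for message in messages:
            stream.write(message)
            stream.flush()
            while True:
                reply = json.loads(stream.readline())
                if "event" not in reply:  # events are not command replies
                    replies.append(reply)
                    break
    return replies
```

Usage would look like
`run("/var/run/qemu-server/82082108.qmp", cancel_messages("drive-scsi1"))`,
with the path being the assumed location mentioned above. Given that the
cancel from the monitor did nothing, I would expect the same result here.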

After that, we tried the following from a root SSH console:

# qm stop 82082108
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL

After that, the QM monitor in the Proxmox GUI stopped responding, as
expected:

# info block-jobs
ERROR: VM 82082108 qmp command 'human-monitor-command' failed - unable
to connect to VM 82082108 qmp socket - timeout after 31 retries

So at this point the VM is completely stopped and the disk has not been
moved. The VM was started again and we repeated the same steps (Disk
move) exactly as above. We got identical results - the Disk move
operation got stuck:

drive-scsi1: transferred: 2187460608 bytes remaining: 19271778304 bytes
total: 21459238912 bytes progression: 10.19 % busy: 1 ready: 0

After cancelling the Disk move operation all of the above symptoms
persist - the VM won't shut down from the GUI, and the block job is
still visible in the QM monitor and won't cancel.

Our next test case was to do the Disk move offline, with the VM shut
down. And guess what - this worked without a glitch. Same storage, same
disk, but with the VM in the stopped state.

After that, with the disk on CEPH and the VM started and running again,
we attempted a Disk move from CEPH back to local storage ONLINE - this
also worked like a charm, without any hangs or other issues.

The VM disk we were moving back and forth isn't very big - only 20GB.

The problem is that this issue does not appear to be happening to every
virtual machine disk - we moved several disks before we hit this issue
again.

At the time of writing this message my colleague is doing another Disk
move on the cluster, and he said he hit the same problem with another
VM's disk - 40GB in size - with the task stuck at the very beginning:
drive-scsi1: transferred: 427819008 bytes remaining: 70243188736 bytes
total: 70671007744 bytes progression: 0.61 % busy: 1 ready: 0

Let me know if I can provide some further information or do some
debugging - as we can reproduce this problem 100% now.

regards,
Mikhail.