[PVE-User] NTFS/Windows Server corruption after successful live storage migration

Chris Murray chrismurray84 at gmail.com
Wed Nov 5 16:52:53 CET 2014


Hi all,

On a particular Proxmox VE cluster (pve-manager/3.3-5/bfebec03, running
kernel 2.6.32-33-pve), I'm experiencing corruption inside virtual
machines after they've been storage-migrated between two NFS mounts. I'm
aware there are probably several avenues of investigation, but since each
is time-consuming, I'm wondering whether anyone can suggest which would
be the most telling.

The symptom is that Windows Server VMs refuse to boot correctly after a
live storage migration, or some service on them fails to load, or
Windows starts to complain about corrupt files, etc. It seems to happen
when the machines are migrated while writing data, which is unavoidable
for some since they provide email, databases, etc. This can be a
migration to another NFS mount, or a migration into another file format
on the same mount. Migrations while VMs are powered down do not exhibit
this issue, although I'm now calculating MD5 sums before and after just
in case.
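
For the powered-down case, the before/after MD5 check could be scripted
roughly as follows. This is only a minimal sketch: the function names
are illustrative, and any image paths passed in are assumptions about
where the disk files live. It is only meaningful when the image format
stays the same (e.g. raw to raw), since a format conversion changes the
bytes of the file itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks
    so that a multi-gigabyte disk image never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def images_match(source_path, dest_path):
    """True if both image files have identical content (by MD5)."""
    return md5_of_file(source_path) == md5_of_file(dest_path)
```

Calling `images_match()` with the path of the image before the move and
the path after it (both hypothetical here) gives a simple yes/no answer
for an offline migration.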

To reproduce this intermittent issue, I've:

1. Created four VMs on 'NFS_A', each with 2 virtual CPUs, a 32GB RAW IDE
   disk (writethrough) and 2GB memory

2. Installed Windows Server 2008R2 into all four

3. Joined them to the domain and allowed the first round of Windows
   Updates to download to each machine

4. Started installing this round of Windows Updates on all servers
   simultaneously, in order to provide some load which touches system
   components, as of course Windows Updates do

5. A few minutes in, migrated storage to NFS_B, back to NFS_A, etc.

The migrations were:

VM 100000 - none

VM 100001 - none

VM 100002 -

1. To NFS_B

2. Back to NFS_A. Failed on "TASK ERROR: storage migration failed:
   mirroring error: VM 100002 qmp command 'block-job-complete' failed - The
   active block job for device 'drive-ide0' cannot be completed"

3. Tried migrating to NFS_A again. Success this time.

4. To NFS_B

VM 100003 -

1. To NFS_B

2. Back to NFS_A

3. To NFS_B

At this point, I left the machines to finish their updates.

On completion, I ran chkdsk c: on all of them:

100000: Windows has checked the file system and found no problems.

100001: Windows has checked the file system and found no problems.

100002: Windows has checked the file system and found no problems. (This
is the one that had a failed migration.)

100003:

The type of the file system is NTFS.

WARNING!  F parameter not specified.
Running CHKDSK in read-only mode.

CHKDSK is verifying files (stage 1 of 3)...
  70144 file records processed.
File verification completed.
  61 large file records processed.
  0 bad file records processed.
  0 EA records processed.
  60 reparse records processed.
CHKDSK is verifying indexes (stage 2 of 3)...
66 percent complete. (83724 of 101286 index entries processed)
Error detected in index $I30 for file 38380.
Error detected in index $I30 for file 38380.
67 percent complete. (84428 of 101286 index entries processed)
Error detected in index $I30 for file 64316.
Error detected in index $I30 for file 64316.
Error detected in index $I30 for file 64323.
Error detected in index $I30 for file 64323.
Error detected in index $I30 for file 64324.
Error detected in index $I30 for file 64324.
Error detected in index $I30 for file 64325.
Error detected in index $I30 for file 64325.
  101286 index entries processed.
Index verification completed.

Errors found.  CHKDSK cannot continue in read-only mode.
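
To compare results across VMs and repeated runs, the index errors in a
saved chkdsk log could be tallied with something like the sketch below.
The regular expression is an assumption based only on the wording of the
error lines shown above, and the function name is illustrative.

```python
import re
from collections import Counter

# Pattern inferred from the error lines in the chkdsk output above.
INDEX_ERROR = re.compile(r"Error detected in index (\S+) for file (\d+)\.")


def summarize_index_errors(chkdsk_output):
    """Count error lines per (index name, file record number)."""
    return Counter(
        (m.group(1), int(m.group(2)))
        for m in INDEX_ERROR.finditer(chkdsk_output)
    )


# The ten error lines reported for VM 100003.
sample = """\
Error detected in index $I30 for file 38380.
Error detected in index $I30 for file 38380.
Error detected in index $I30 for file 64316.
Error detected in index $I30 for file 64316.
Error detected in index $I30 for file 64323.
Error detected in index $I30 for file 64323.
Error detected in index $I30 for file 64324.
Error detected in index $I30 for file 64324.
Error detected in index $I30 for file 64325.
Error detected in index $I30 for file 64325.
"""

summary = summarize_index_errors(sample)
print(len(summary), "distinct file records affected")
for (index_name, file_no), count in sorted(summary.items()):
    print(index_name, file_no, count)
```

For the output above this reports five distinct file records, each
flagged twice.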

What I don't know at this point is how serious these errors are, but I
think it does prove that something's going wrong with the migration
process, and it helps explain why some machines would suddenly develop
serious faults after they've been migrated.


How can I troubleshoot further? I'm keen to maximise the effectiveness
of what I do next. Since the problem doesn't affect every migration,
I'd ideally like to do something that proves the issue, rather than
change something which temporarily makes it appear to go away, only to
find that I migrate a production server in the future and have to
restore from backup/snapshot again.
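
One possible way to get a more deterministic signal than Windows Updates
plus chkdsk would be a small write-and-verify tool run inside a test
guest while the storage migration is in progress. The sketch below is
only an illustration of that idea, not an established procedure: the
block size, function names and test file are all assumptions. Each block
is filled with data derived from its index and a "generation" number, so
any block that ends up stale or damaged can be identified afterwards.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size; any multiple of 32 works


def block_payload(index, generation):
    """Build a BLOCK_SIZE payload derived from index and generation."""
    seed = hashlib.sha256(f"{index}:{generation}".encode()).digest()
    return seed * (BLOCK_SIZE // len(seed))


def write_pass(path, num_blocks, generation):
    """Rewrite every block in place for this generation, then fsync."""
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        for index in range(num_blocks):
            f.seek(index * BLOCK_SIZE)
            f.write(block_payload(index, generation))
        f.flush()
        os.fsync(f.fileno())


def verify(path, num_blocks, generation):
    """Return indices of blocks that do not match the given generation."""
    bad = []
    with open(path, "rb") as f:
        for index in range(num_blocks):
            f.seek(index * BLOCK_SIZE)
            if f.read(BLOCK_SIZE) != block_payload(index, generation):
                bad.append(index)
    return bad
```

The idea would be to call `write_pass()` in a loop with an increasing
generation while the migration runs, note the last generation that
completed, and call `verify()` for that generation once the migration
has finished, ideally after a guest reboot so that the data is read back
from the migrated disk rather than from the guest's cache. A non-empty
result would point at specific blocks that were lost or overwritten.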

Thanks in advance,

Chris
