[pve-devel] Default cache mode for VM hard drives

Wolfgang Bumiller w.bumiller at proxmox.com
Thu May 28 11:27:23 CEST 2015

I was able to reproduce the problem on my local machine (kernel 3.10).
To be sure everything's correct I added some error checking to the code.

I'm attaching the changed source (and the bdiff source).
Transcript is below.

I also added an fsync() before close() due to this section in close(2)'s NOTES:

«It is not common for a filesystem to flush the buffers when the stream is
closed. If you need to be sure that the data is physically stored, use fsync(2).»
(Note that this is about filesystems, not block devices.)
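
To illustrate the fsync()-before-close() pattern with error checking, here is a
minimal sketch. It is not the code from the attached awrite.c; the helper name
write_sync_close() and its signature are made up for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path, then fsync() and close(),
 * checking every step. Returns 0 on success, -1 on error. */
int write_sync_close(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, retry */
            perror("write");
            close(fd);
            return -1;
        }
        p += n;                    /* handle short writes */
        len -= (size_t)n;
    }

    /* close() alone does not promise the data reached the disk. */
    if (fsync(fd) < 0) {
        perror("fsync");
        close(fd);
        return -1;
    }
    if (close(fd) < 0) {
        perror("close");
        return -1;
    }
    return 0;
}
```

A caller would treat any non-zero return as "data may not be on disk".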

I'd also like to point out this part of open(2)'s description of O_DIRECT:
«File I/O is done directly to/from user-space buffers.  The O_DIRECT flag on its
own makes an effort to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata are transferred.»

Specifically, the phrase "directly to/from user-space buffers" is the key: it
tells me that this behaviour should also be *possible* (though unlikely) in a
single-threaded program.
Basically, when you do a regular write() without O_DIRECT, the kernel makes a
SINGLE copy of the user-space buffer into the cache. Syncing to disks on the
lower level then happens out of this cache, which is implemented thread-safely.
However, when you do use O_DIRECT, the above words tell me that instead of
buffering the user-space data at least once, the kernel simply passes the
pointer-to-userspace down to the lower level.
Then these situations can happen: (all of these are speculation and have to be
checked in the kernel source, provided this issue still exists in newer kernel
versions, which we should check first!)
*) MDraid (and probably DMraid too (untested)): multiple threads read from the
userspace buffer simultaneously BUT independently, and thus unsynchronized,
while writing to disk. This means they're racing against data changes made by
userspace.
*) DRBD: issues a send() for each mirror. Simultaneously or not, the task of
reading from userspace is then handed over to the actual send()ing party, which
is a different thread or a different time for each mirroring host.

Basically, my theory is that what happens with O_DIRECT is simply that the
userspace buffer is read multiple times at different points in time.
Unfortunately this could very well be intended behavior. (At least the manpage
suggests so.)
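
The scenario can be sketched roughly as follows: one thread keeps rewriting a
buffer while another thread writes that same buffer to a file descriptor. This
is NOT the attached awrite.c, only an illustration of the idea. The function
name racy_write(), the 4096-byte block size and the 0x02/0x03 fill pattern are
assumptions for this example. Build with gcc -pthread.

```c
#define _GNU_SOURCE            /* makes O_DIRECT visible to callers */
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096        /* assumed I/O unit and buffer alignment */

static unsigned char *buffer;  /* shared between both threads */
static atomic_int stop_flag;

/* Keeps flipping the whole buffer between 0x02 and 0x03. */
static void *change_buffer(void *arg)
{
    (void)arg;
    unsigned char value = 0x03;
    while (!atomic_load(&stop_flag)) {
        memset(buffer, value, BLOCK_SIZE);
        value = (value == 0x02) ? 0x03 : 0x02;
    }
    return NULL;
}

/* Writes `blocks` blocks from the shared buffer while change_buffer() runs.
 * extra_flags is OR-ed into the open() flags (e.g. O_DIRECT or 0).
 * Returns 0 on success, -1 on error. */
int racy_write(const char *path, int extra_flags, int blocks)
{
    int rc = -1;
    void *mem = NULL;
    pthread_t changer;

    /* O_DIRECT needs an aligned buffer. */
    if (posix_memalign(&mem, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return -1;
    buffer = mem;
    memset(buffer, 0x02, BLOCK_SIZE);

    int fd = open(path, O_WRONLY | O_CREAT | extra_flags, 0644);
    if (fd < 0) {
        perror("open");
        goto out_free;
    }

    atomic_store(&stop_flag, 0);
    if (pthread_create(&changer, NULL, change_buffer, NULL) != 0)
        goto out_close;

    rc = 0;
    for (int i = 0; i < blocks; i++) {
        /* The buffer may change while the lower layers still read it. */
        if (pwrite(fd, buffer, BLOCK_SIZE, (off_t)i * BLOCK_SIZE) != BLOCK_SIZE) {
            perror("pwrite");
            rc = -1;
            break;
        }
    }

    atomic_store(&stop_flag, 1);
    pthread_join(changer, NULL);
    if (fsync(fd) < 0) {
        perror("fsync");
        rc = -1;
    }
out_close:
    if (close(fd) < 0) {
        perror("close");
        rc = -1;
    }
out_free:
    free(buffer);
    return rc;
}
```

Calling something like racy_write("/dev/md0", O_DIRECT, 25600) and then
comparing the two backing files is the kind of experiment meant here. Whether
any mismatch shows up depends on timing, so a clean run proves nothing.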

shell transcript:
$ for i in 1 2; do rm -f block$i; dd if=/dev/zero of=block$i bs=1M count=100; done
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.0358786 s, 2.9 GB/s
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.0360704 s, 2.9 GB/s
$ ./bdiff block{1,2}

### Two equal zeroed files named block1 and block2

$ sudo losetup /dev/loop0 block1
$ sudo losetup /dev/loop1 block2
$ sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop{0,1}
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
$ ./bdiff block{1,2}
block 1 differs

### This is where the metadata is contained

$ sudo chown $USER /dev/md0
$ gcc -pthread awrite.c
$ ./a.out /dev/md0
Waiting for change_buffer thread
Waiting for write_to_blkdev thread
$ ./bdiff block{1,2}
block 1 differs
block 4746 differs

### Here the buffer seems to have been overwritten (02 blocks replaced with 03
blocks) while the lower-level storage threads were competing to read it from
userspace.
### In another run it showed block 8369 instead of 4746.
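
For reference, a block-wise comparison in the spirit of bdiff can be sketched
like this. It is not the attached bdiff.c; the function name bdiff_files(), the
4096-byte block size and the zero-based block numbering are assumptions.

```c
#include <stdio.h>
#include <string.h>

#define BDIFF_BLOCK 4096   /* assumed comparison block size */

/* Compares two files block by block and prints every differing block.
 * Returns the number of differing blocks, or -1 on error. */
long bdiff_files(const char *path_a, const char *path_b)
{
    unsigned char buf_a[BDIFF_BLOCK], buf_b[BDIFF_BLOCK];
    long differing = 0;

    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    if (!fa || !fb) {
        fprintf(stderr, "cannot open input files\n");
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return -1;
    }

    for (long block = 0;; block++) {
        size_t na = fread(buf_a, 1, BDIFF_BLOCK, fa);
        size_t nb = fread(buf_b, 1, BDIFF_BLOCK, fb);
        if (na == 0 && nb == 0)
            break;                       /* both files exhausted */
        if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
            printf("block %ld differs\n", block);
            differing++;
        }
        if (na < BDIFF_BLOCK || nb < BDIFF_BLOCK)
            break;                       /* short read: end of a file */
    }

    if (ferror(fa) || ferror(fb))
        differing = -1;
    fclose(fa);
    fclose(fb);
    return differing;
}
```

A small main() that passes argv[1] and argv[2] to bdiff_files() would give a
command-line tool used like ./bdiff block1 block2.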
-------------- next part --------------
A non-text attachment was scrubbed...
Name: awrite.c
Type: text/x-csrc
Size: 1311 bytes
Desc: not available
URL: <http://lists.proxmox.com/pipermail/pve-devel/attachments/20150528/0ec31467/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bdiff.c
Type: text/x-csrc
Size: 894 bytes
Desc: not available
URL: <http://lists.proxmox.com/pipermail/pve-devel/attachments/20150528/0ec31467/attachment-0001.c>