[pve-devel] Default cache mode for VM hard drives

Thu May 28 13:31:58 CEST 2015

Hi,

I'm not a kernel/IO expert by any means, but I think this test program has a 
race condition of its own, so it does not help us diagnose the problem.

We're writing to buffer x while it is in use by the write() syscall. That is 
plainly a bug in the userspace program.
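
To illustrate what I mean, here is a minimal sketch of one way to avoid the
race. It is not taken from awrite.c; shared_buf, buf_lock, fill_shared,
write_snapshot and the 4096-byte block size are all hypothetical. The writer
copies the shared buffer into a private, aligned buffer while holding a mutex
and passes only that private copy to write(), so nothing modifies the memory
while the syscall is in flight.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096 /* assumed block size, not taken from awrite.c */

static unsigned char shared_buf[BLOCK_SIZE];
static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

/* Modifier side: change the shared buffer only while holding the lock. */
static void fill_shared(unsigned char value)
{
    pthread_mutex_lock(&buf_lock);
    memset(shared_buf, value, BLOCK_SIZE);
    pthread_mutex_unlock(&buf_lock);
}

/* Writer side: snapshot under the lock, then write the private copy.
 * Returns 0 on success, -1 on error. */
static int write_snapshot(int fd)
{
    void *priv = NULL;
    size_t done = 0;

    assert(fd >= 0);
    /* O_DIRECT generally wants an aligned buffer; aligning to BLOCK_SIZE
     * is an assumption that fits common setups. */
    if (posix_memalign(&priv, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return -1;

    pthread_mutex_lock(&buf_lock);
    memcpy(priv, shared_buf, BLOCK_SIZE);
    pthread_mutex_unlock(&buf_lock);

    /* From here on only this thread can touch priv. */
    while (done < BLOCK_SIZE) {
        ssize_t n = write(fd, (unsigned char *)priv + done, BLOCK_SIZE - done);
        if (n <= 0) {
            free(priv);
            return -1;
        }
        done += (size_t)n;
    }
    free(priv);
    return 0;
}
```

If corruption still showed up with a pattern like this, the test would be
telling us something about the storage stack rather than about the program.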

Cheers
Eneko

On 28/05/15 11:27, Wolfgang Bumiller wrote:
> I was able to reproduce the problem on my local machine (kernel 3.10).
> To be sure everything's correct I added some error checking to the code.
>
> I'm attaching the changed source (and the bdiff source).
> Transcript is below.
>
> I also added an fsync() before close() because of this passage in the NOTES
> section of close(2):
>
> «It is not common for a filesystem to flush the buffers when the stream is
> closed.
> If you need to be sure that the data is physically stored, use fsync(2).»
> (Note that this is about filesystems, not block devices.)
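
For reference, a minimal sketch of that fsync()-before-close() pattern with
error checking. flush_and_close is a hypothetical helper name, not something
from the attached source, and it assumes a descriptor type that supports
fsync().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Flush the descriptor to stable storage, then close it.
 * Returns 0 if both steps succeed, -1 otherwise. */
static int flush_and_close(int fd)
{
    int rc = 0;

    assert(fd >= 0);
    if (fsync(fd) != 0) {
        fprintf(stderr, "fsync: %s\n", strerror(errno));
        rc = -1;
    }
    /* Close even if fsync() failed, so the descriptor is not leaked. */
    if (close(fd) != 0) {
        fprintf(stderr, "close: %s\n", strerror(errno));
        rc = -1;
    }
    return rc;
}
```
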
>
> I'd also like to point out this part of open(2)'s description of O_DIRECT:
> «File I/O is done directly to/from user-space buffers.  The O_DIRECT flag on its
> own makes an effort to transfer data synchronously, but does not give the
> guarantees of the O_SYNC flag that data and necessary metadata are
> transferred.»
>
> Specifically, "directly to/from user-space buffers" is the key phrase: it
> tells me that this behaviour is also *possible* (though unlikely) in a
> single-threaded program.
> Basically, when you do a regular write() without O_DIRECT, the kernel makes a
> SINGLE copy of the user-space buffer into the cache. Syncing to disks on the
> lower level then happens out of this cache, which is implemented thread-safely.
> However, when you do use O_DIRECT, the above words tell me that instead of
> buffering the user-space data at least once, the kernel simply passes the
> pointer-to-userspace down to the lower level.
> Then these situations can happen: (all of these are speculation and have to be
> checked in the kernel source, provided this issue still exists in newer kernel
> versions, which we should check first!)
> *) MDraid (and probably DMraid too, untested): multiple threads read from the
> userspace buffer simultaneously BUT independently, and thus unsynchronized,
> while writing to disk. This means they race against data changes made by
> userspace.
> *) DRBD: issues a send() for each mirror. Simultaneously or not, the task of
> reading from userspace is then handed over to the actual send()ing party, which
> runs in a different thread, or at a different time, for each mirroring host.
>
> Basically, my theory is that what happens with O_DIRECT is simply that the
> userspace buffer is read multiple times at different points in time.
> Unfortunately this could very well be intended behavior. (At least the manpage
> suggests so.)
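
To make that theory concrete, here is a toy userspace model. No kernel, md or
DRBD code is involved, and mirror_write_direct, mirror_write_buffered,
flip_to_03 and BUF_SIZE are hypothetical names. In the "direct" variant each
mirror copies from the caller's buffer separately, and the buffer may change
between the two copies. In the "buffered" variant a single intermediate copy
is made first, as the page cache would do, so both mirrors always agree.

```c
#include <assert.h>
#include <string.h>

#define BUF_SIZE 512

/* Callback standing in for userspace modifying its buffer mid-write. */
typedef void (*between_fn)(unsigned char *user_buf);

/* Models the O_DIRECT theory: each mirror reads the user buffer
 * on its own, at a different point in time. */
static void mirror_write_direct(unsigned char *user_buf, unsigned char *m0,
                                unsigned char *m1, between_fn between)
{
    assert(user_buf && m0 && m1);
    memcpy(m0, user_buf, BUF_SIZE);
    if (between)
        between(user_buf);
    memcpy(m1, user_buf, BUF_SIZE);
}

/* Models a buffered write: one copy into a cache, then both
 * mirrors are filled from the cache only. */
static void mirror_write_buffered(unsigned char *user_buf, unsigned char *m0,
                                  unsigned char *m1, between_fn between)
{
    unsigned char cache[BUF_SIZE];

    assert(user_buf && m0 && m1);
    memcpy(cache, user_buf, BUF_SIZE);
    if (between)
        between(user_buf);
    memcpy(m0, cache, BUF_SIZE);
    memcpy(m1, cache, BUF_SIZE);
}

/* Example modification: overwrite the 02 pattern with 03. */
static void flip_to_03(unsigned char *user_buf)
{
    memset(user_buf, 0x03, BUF_SIZE);
}
```

Starting from a buffer full of 02 bytes and passing flip_to_03 as the
callback, the direct variant leaves the two mirrors different, while the
buffered variant leaves both holding 02. Whether md or DRBD really behave like
the direct variant still has to be checked in the kernel source.
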
>
> shell transcript:
> $ for i in 1 2; do rm -f block$i; dd if=/dev/zero of=block$i bs=1M count=100; done
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 0.0358786 s, 2.9 GB/s
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 0.0360704 s, 2.9 GB/s
> $ ./bdiff block{1,2}
>
> ### Two equal zeroed files named block1 and block2
>
> $ sudo losetup /dev/loop0 block1
> $ sudo losetup /dev/loop1 block2
> $ sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop{0,1}
> mdadm: Note: this array has metadata at the start and
>      may not be suitable as a boot device.  If you plan to
>      store '/boot' on this device please ensure that
>      your boot-loader understands md/v1.x metadata, or use
>      --metadata=0.90
> Continue creating array? y
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md0 started.
> $ ./bdiff block{1,2}
> block 1 differs
>
> ### This is where the metadata is contained
>
> $ sudo chown $USER /dev/md0
> $ gcc -pthread awrite.c
> $ ./a.out /dev/md0
> Waiting for change_buffer thread
> Waiting for write_to_blkdev thread
> $ ./bdiff block{1,2}
> block 1 differs
> block 4746 differs
> $
>
> ### Here the 02 blocks seem to have been overwritten with 03 blocks while the
> ### lower-level storage threads were competing to read the userspace buffer.
> ### In another run it showed block 8369 instead of 4746.
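
For anyone without the attachment, a block-wise comparison like the one used
in the transcript can be sketched as follows. This is not the attached bdiff
source; diff_blocks, the zero-based block numbering and the caller-supplied
block size are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Compare two files block by block and print "block N differs"
 * (N is zero-based) for each mismatch. A length difference counts
 * as a mismatch. Returns the number of differing blocks, or -1 on error. */
static long diff_blocks(const char *path_a, const char *path_b,
                        size_t block_size)
{
    assert(block_size > 0);
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    unsigned char *ba = malloc(block_size);
    unsigned char *bb = malloc(block_size);
    long differing = -1;

    if (a && b && ba && bb) {
        differing = 0;
        for (long n = 0;; n++) {
            size_t ra = fread(ba, 1, block_size, a);
            size_t rb = fread(bb, 1, block_size, b);
            if (ra == 0 && rb == 0)
                break; /* both files exhausted (or a read error) */
            if (ra != rb || memcmp(ba, bb, ra) != 0) {
                printf("block %ld differs\n", n);
                differing++;
            }
        }
        if (ferror(a) || ferror(b))
            differing = -1;
    }
    if (a)
        fclose(a);
    if (b)
        fclose(b);
    free(ba);
    free(bb);
    return differing;
}
```

Calling diff_blocks("block1", "block2", 4096) would produce output in the
same shape as the transcript, though the block numbers only match if the real
tool uses the same block size and numbering.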

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
       943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
