<div dir="ltr"><div>Alexandre,<br><br>> qemu uses librbd to access Ceph directly, so the host doesn't have any /dev/rbd device or filesystem mount.<br>Ah, I understand: this is not a normal block device but a userspace library.<br><br>> Ceph uses O_DIRECT+O_DSYNC to write to the journals of the OSDs.<br></div><div>Is this done inside the KVM process? If so, then KVM keeps a buffer for this O_DIRECT write. Therefore, if multiple threads can access (and change) this buffer at the same time, a similar issue can happen in theory.<br></div><div><br></div>Stanislav<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, May 28, 2015 at 2:44 PM, Alexandre DERUMIER <span dir="ltr"><<a href="mailto:aderumier@odiso.com" target="_blank">aderumier@odiso.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">>> qemu rbd access is userland only, so the host doesn't have any cache or buffer.<br>

>> If the RBD device does not use the host cache, then it is very likely that RBD uses O_DIRECT. I am not sure whether there are other ways to avoid the host cache.<br>

<br>

</span>qemu uses librbd to access Ceph directly, so the host doesn't have any /dev/rbd device or filesystem mount.<br>

<span class=""><br>

>> When data is written to Ceph, it's written to the journal of each OSD and its replicas before the ack is sent to the client.<br>

>> It can't be written to all destinations at exactly the same time. If the buffer changes in the meantime, the data that reaches different nodes can differ.<br>

<br>

</span>Ceph uses O_DIRECT+O_DSYNC to write to the journals of the OSDs.<br>

Reads are always served by a single primary OSD.<br>

<span class=""><br>

<br>

<br>

----- Original Message -----<br>

From: "Stanislav German-Evtushenko" <<a href="mailto:ginermail@gmail.com">ginermail@gmail.com</a>><br>

To: "aderumier" <<a href="mailto:aderumier@odiso.com">aderumier@odiso.com</a>><br>

Cc: "dietmar" <<a href="mailto:dietmar@proxmox.com">dietmar@proxmox.com</a>>, "pve-devel" <<a href="mailto:pve-devel@pve.proxmox.com">pve-devel@pve.proxmox.com</a>><br>

</span>Sent: Thursday, May 28, 2015 13:10:52<br>

<div class="HOEnZb"><div class="h5">Subject: Re: [pve-devel] Default cache mode for VM hard drives<br>

<br>

Alexandre,<br>

<br>

The important point is whether O_DIRECT is used with Ceph or not. Do you happen to know?<br>

<br>

> qemu rbd access is userland only, so the host doesn't have any cache or buffer.<br>

If the RBD device does not use the host cache, then it is very likely that RBD uses O_DIRECT. I am not sure whether there are other ways to avoid the host cache.<br>

<br>

> When data is written to Ceph, it's written to the journal of each OSD and its replicas before the ack is sent to the client.<br>

It can't be written to all destinations at exactly the same time. If the buffer changes in the meantime, the data that reaches different nodes can differ.<br>

<br>

Stanislav<br>

<br>

On Thu, May 28, 2015 at 1:58 PM, Alexandre DERUMIER < <a href="mailto:aderumier@odiso.com">aderumier@odiso.com</a> > wrote:<br>

<br>

<br>

>>BTW: can anybody test drbd_oos_test.c against Ceph? I guess we will have the same result.<br>

<br>

I think there is no problem with Ceph; the qemu cache option only enables or disables rbd_cache.<br>

qemu rbd access is userland only, so the host doesn't have any cache or buffer.<br>

When data is written to Ceph, it's written to the journal of each OSD and its replicas before the ack is sent to the client.<br>

<br>

<br>

<br>

<br>

----- Original Message -----<br>

From: "Stanislav German-Evtushenko" < <a href="mailto:ginermail@gmail.com">ginermail@gmail.com</a> ><br>

To: "aderumier" < <a href="mailto:aderumier@odiso.com">aderumier@odiso.com</a> ><br>

Cc: "dietmar" < <a href="mailto:dietmar@proxmox.com">dietmar@proxmox.com</a> >, "pve-devel" < <a href="mailto:pve-devel@pve.proxmox.com">pve-devel@pve.proxmox.com</a> ><br>

Sent: Thursday, May 28, 2015 10:27:34<br>

Subject: Re: [pve-devel] Default cache mode for VM hard drives<br>

<br>

Alexandre,<br>

<br>

> That's why we need to use barriers, or FUA with recent kernels, in the guest when using O_DIRECT, to be sure that the guest filesystem is OK and data is flushed at regular intervals.<br>

<br>

The problems are:<br>

- Linux swap: no barriers or anything similar<br>

- Windows: I have no idea what Windows does to ensure consistency, but the issue is reproducible with Windows 7.<br>

<br>

BTW: can anybody test drbd_oos_test.c against Ceph? I guess we will have the same result.<br>

<br>

Stanislav<br>

<br>

On Thu, May 28, 2015 at 11:22 AM, Stanislav German-Evtushenko < <a href="mailto:ginermail@gmail.com">ginermail@gmail.com</a> > wrote:<br>

<br>

<br>

<br>

Alexandre,<br>

<br>

> Do you see the problem with qemu cache=directsync (O_DIRECT + O_DSYNC)?<br>

Yes, it happens in fewer cases (maybe 10 times fewer), but it still happens. I have a reproducible case with Windows 7 and directsync.<br>

<br>

Stanislav<br>

<br>

On Thu, May 28, 2015 at 11:18 AM, Alexandre DERUMIER < <a href="mailto:aderumier@odiso.com">aderumier@odiso.com</a> > wrote:<br>

<br>


>> Summary: when working in O_DIRECT mode, QEMU has to wait until the "write" system call has finished before changing this buffer, OR QEMU has to create a new buffer every time, OR ... other ideas?<br>

<br>

AFAIK, only O_DSYNC can guarantee that data is really written to the last layer (disk platters).<br>

<br>

That's why we need to use barriers, or FUA with recent kernels, in the guest when using O_DIRECT, to be sure that the guest filesystem is OK and data is flushed at regular intervals.<br>

(To avoid a filesystem that is inconsistent with its data.)<br>

<br>

<br>

Do you see the problem with qemu cache=directsync (O_DIRECT + O_DSYNC)?<br>

<br>

<br>

<br>

<br>

<br>

----- Original Message -----<br>

From: "Stanislav German-Evtushenko" < <a href="mailto:ginermail@gmail.com">ginermail@gmail.com</a> ><br>

To: "dietmar" < <a href="mailto:dietmar@proxmox.com">dietmar@proxmox.com</a> ><br>

Cc: "aderumier" < <a href="mailto:aderumier@odiso.com">aderumier@odiso.com</a> >, "pve-devel" < <a href="mailto:pve-devel@pve.proxmox.com">pve-devel@pve.proxmox.com</a> ><br>

Sent: Thursday, May 28, 2015 09:54:32<br>

Subject: Re: [pve-devel] Default cache mode for VM hard drives<br>

<br>

Dietmar,<br>

<br>

fsync ensures that data reaches the underlying hardware, but it does not guarantee that the buffer stays unchanged until it is fully written.<br>

<br>

Here is my understanding of why we get this problem with O_DIRECT and don't get it without.<br>

<br>

** Without O_DIRECT **<br>

1. The application tries to write data from a buffer<br>

2. The data from the buffer goes to the host cache<br>

3. The RAID writers get the data from the host cache and write it to /dev/loop1 and /dev/loop2<br>

Even if the buffer changes, the data in the host cache will not change, so the RAID stays consistent.<br>

<br>

** With O_DIRECT **<br>

1. The application tries to write data from a buffer<br>

2. The RAID writers get the data from the application (!!!) buffer and write it to /dev/loop1 and /dev/loop2<br>

If the data in the buffer changes in the meantime (the change can be made by a different POSIX thread), then different data reaches /dev/loop1 and /dev/loop2.<br>
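<br>

The two cases above can be sketched as a small deterministic Python model. This is only an illustration under stated assumptions, not real I/O and not how the kernel or mdadm is implemented: the names mirror_write, write_buffered, write_direct and run are hypothetical, and the "other POSIX thread" is modelled as a callback that runs after the first mirror leg has been written.<br>

```python
def mirror_write(source, legs, mutate_between=None):
    """Copy `source` to each mirror leg in turn.

    `mutate_between` stands in for another thread that changes the
    application buffer after the first leg has been written.
    """
    for i, leg in enumerate(legs):
        leg[:] = bytes(source)      # a leg reads whatever `source` holds right now
        if i == 0 and mutate_between is not None:
            mutate_between()


def write_buffered(app_buf, legs, mutate_between=None):
    """Model of the path without O_DIRECT: data is copied to the host cache first."""
    host_cache = bytes(app_buf)     # immutable snapshot taken at write() time
    mirror_write(host_cache, legs, mutate_between)


def write_direct(app_buf, legs, mutate_between=None):
    """Model of the O_DIRECT path: the legs read the application buffer itself."""
    mirror_write(app_buf, legs, mutate_between)


def run(write_fn):
    app_buf = bytearray(b"AAAA")
    loop1, loop2 = bytearray(4), bytearray(4)

    def racing_thread():
        app_buf[:] = b"BBBB"        # buffer is reused while the write is in flight

    write_fn(app_buf, [loop1, loop2], racing_thread)
    return bytes(loop1), bytes(loop2)


if __name__ == "__main__":
    print("buffered:", run(write_buffered))   # both legs see the snapshot
    print("direct:  ", run(write_direct))     # legs can see different data
```

In this model the buffered path leaves both legs identical, while the direct path leaves them different, which matches the out-of-sync blocks described above.<br>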

<br>

Summary: when working in O_DIRECT mode, QEMU has to wait until the "write" system call has finished before changing this buffer, OR QEMU has to create a new buffer every time, OR ... other ideas?<br>
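<br>

A rough sketch of the "new buffer every time" option, again as a deterministic Python model and not as actual QEMU code. MirrorDevice, write_with_bounce_buffer and demo are hypothetical names, and the device is assumed to write its second leg only when complete() is called, so that a change to the buffer can land in between.<br>

```python
class MirrorDevice:
    """Hypothetical two-leg mirror that reads the submitted buffer lazily."""

    def __init__(self):
        self.legs = [b"", b""]
        self._pending = None

    def submit(self, buf):
        # O_DIRECT-like: keep a reference to the caller's buffer, no cache copy.
        self._pending = buf
        self.legs[0] = bytes(buf)           # first leg is written immediately

    def complete(self):
        # Second leg is written later, from whatever the buffer holds now.
        self.legs[1] = bytes(self._pending)
        self._pending = None


def write_with_bounce_buffer(dev, app_buf):
    """Submit a private copy so later changes to app_buf cannot reach the device."""
    bounce = bytes(app_buf)                 # fresh buffer for this write only
    dev.submit(bounce)
    return dev.complete                     # caller finishes the write later


def demo(use_bounce):
    dev = MirrorDevice()
    app_buf = bytearray(b"AAAA")
    if use_bounce:
        finish = write_with_bounce_buffer(dev, app_buf)
    else:
        dev.submit(app_buf)
        finish = dev.complete
    app_buf[:] = b"BBBB"                    # buffer reused before the write finished
    finish()
    return dev.legs[0] == dev.legs[1]       # True if both legs are in sync
```

The first option (waiting for the write to finish) corresponds to calling finish() before app_buf is touched. The bounce buffer costs one extra copy per write, so it gives up part of what O_DIRECT is meant to save.<br>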

<br>

Stanislav<br>

<br>

On Thu, May 28, 2015 at 10:31 AM, Dietmar Maurer < <a href="mailto:dietmar@proxmox.com">dietmar@proxmox.com</a> > wrote:<br>

<br>

<br>

> I have just done the same test with mdadm instead of DRBD, and I found<br>

> that this problem is reproducible on software RAID too, just as<br>

> Lars Ellenberg claimed. It means that the problem is not specific to<br>

> DRBD but relates to O_DIRECT mode in general, when we don't use the host cache and a<br>

> block device reads data directly from userspace.<br>

<br>

We simply think the behavior is correct. If you want to be sure data is<br>

on disk you have to call fsync.<br>

<br>


</div></div></blockquote></div><br></div>