[pve-devel] Default cache mode for VM hard drives

Alexandre DERUMIER aderumier at odiso.com
Thu May 28 13:44:12 CEST 2015


>> qemu rbd access is userland only, so the host doesn't have any cache or buffer. 
>>If the RBD device does not use the host cache, then it is very likely that RBD uses O_DIRECT. I am not sure if there are other ways to avoid the host cache. 

qemu uses librbd to access ceph directly, so the host doesn't have any /dev/rbd.. device or filesystem mount.

>> When data is written to ceph, it is written to the journal of each osd and its replicas before the write is acked to the client. 
>>It can't be written to all destinations at exactly the same time. If the buffer changes meanwhile, then the data that reaches different nodes can differ. 

ceph uses O_DIRECT+O_DSYNC to write to the journal of the osds.
Reads are always done from the primary osd.



----- Original Message -----
From: "Stanislav German-Evtushenko" <ginermail at gmail.com>
To: "aderumier" <aderumier at odiso.com>
Cc: "dietmar" <dietmar at proxmox.com>, "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Thursday, 28 May 2015 13:10:52
Subject: Re: [pve-devel] Default cache mode for VM hard drives

Alexandre, 

The important point is whether O_DIRECT is used with Ceph or not. Do you happen to know? 

> qemu rbd access is userland only, so the host doesn't have any cache or buffer. 
If the RBD device does not use the host cache, then it is very likely that RBD uses O_DIRECT. I am not sure if there are other ways to avoid the host cache. 

> When data is written to ceph, it is written to the journal of each osd and its replicas before the write is acked to the client. 
It can't be written to all destinations at exactly the same time. If the buffer changes meanwhile, then the data that reaches different nodes can differ. 

Stanislav 

On Thu, May 28, 2015 at 1:58 PM, Alexandre DERUMIER < aderumier at odiso.com > wrote: 


>>BTW: can anybody test drbd_oos_test.c against Ceph? I guess we will get the same result. 

I think there is no problem with ceph; the qemu cache option only enables|disables rbd_cache. 
qemu rbd access is userland only, so the host doesn't have any cache or buffer. 
When data is written to ceph, it is written to the journal of each osd and its replicas before the write is acked to the client. 




----- Original Message -----
From: "Stanislav German-Evtushenko" <ginermail at gmail.com>
To: "aderumier" <aderumier at odiso.com>
Cc: "dietmar" <dietmar at proxmox.com>, "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Thursday, 28 May 2015 10:27:34
Subject: Re: [pve-devel] Default cache mode for VM hard drives

Alexandre, 

> That's why we need to use barriers or FUA in recent guest kernels when using O_DIRECT, to be sure that the guest filesystem is consistent and data is flushed at regular intervals. 

The problems are: 
- Linux swap: no barriers or anything similar 
- Windows: I have no idea what Windows does to ensure consistency, but the issue is reproducible on Windows 7. 

BTW: can anybody test drbd_oos_test.c against Ceph? I guess we will get the same result. 

Stanislav 

On Thu, May 28, 2015 at 11:22 AM, Stanislav German-Evtushenko < ginermail at gmail.com > wrote: 



Alexandre, 

> Do you see the problem with qemu cache=directsync? (O_DIRECT + O_DSYNC) 
Yes, it happens in fewer cases (maybe 10 times fewer), but it still happens. I have a reproducible case with Windows 7 and directsync. 

Stanislav 

On Thu, May 28, 2015 at 11:18 AM, Alexandre DERUMIER < aderumier at odiso.com > wrote: 

>>To sum up: when working in O_DIRECT mode, QEMU has to wait until the "write" system call has finished before changing the buffer, OR QEMU has to create a new buffer every time, OR ... other ideas? 

AFAIK, only O_DSYNC can guarantee that data is really written to the last layer (the disk platters). 

That's why we need to use barriers or FUA in recent guest kernels when using O_DIRECT, to be sure that the guest filesystem is consistent and data is flushed at regular intervals. 
(To avoid an inconsistent filesystem.) 


Do you see the problem with qemu cache=directsync? (O_DIRECT + O_DSYNC) 





----- Original Message -----
From: "Stanislav German-Evtushenko" <ginermail at gmail.com>
To: "dietmar" <dietmar at proxmox.com>
Cc: "aderumier" <aderumier at odiso.com>, "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Thursday, 28 May 2015 09:54:32
Subject: Re: [pve-devel] Default cache mode for VM hard drives

Dietmar, 

fsync ensures that data reaches the underlying hardware, but it does not help to guarantee that the buffer stays unchanged until it has been fully written. 

I will describe here my understanding of why we get this problem with O_DIRECT and don't get it without. 

** Without O_DIRECT ** 
1. The application writes data from a buffer 
2. The data is copied from the buffer into the host cache 
3. The RAID writers take the data from the host cache and put it to /dev/loop1 and /dev/loop2 
Even if the buffer changes, the data in the host cache does not change, so the RAID stays consistent. 

** With O_DIRECT ** 
1. The application writes data from a buffer 
2. The RAID writers take the data directly from the application (!!!) buffer and put it to /dev/loop1 and /dev/loop2 
If the data in the buffer is changed meanwhile (the change can come from a different POSIX thread), then different data reaches /dev/loop1 and /dev/loop2. 

To sum up: when working in O_DIRECT mode, QEMU has to wait until the "write" system call has finished before changing the buffer, OR QEMU has to create a new buffer every time, OR ... other ideas? 

Stanislav 

On Thu, May 28, 2015 at 10:31 AM, Dietmar Maurer < dietmar at proxmox.com > wrote: 


> I have just done the same test with mdadm instead of DRBD. And what I found 
> is that this problem was reproducible on software raid too, just as 
> claimed by Lars Ellenberg. It means that the problem is not only related to 
> DRBD but to O_DIRECT mode in general, when we don't use the host cache and a 
> block device reads data directly from userspace. 

We simply think this behavior is correct. If you want to be sure data is 
on disk, you have to call fsync. 



More information about the pve-devel mailing list