[PVE-User] shared LVM on host-based mirrored iSCSI LUNs

Mon Apr 23 17:37:41 CEST 2012

Hi Dietmar,

On 23.04.2012 13:21, Dietmar Maurer wrote:
>> Since I did not configure any locking for mdadm I figured that mdadm would
>> lead to corrupting the contents of the logical volumes used as virtual hard
>> disks.
>>
>> But to my surprise fsck did not reveal any errors.
>>
>> So my question is: Is there some locking already in place and I just missed it?
>> clvm is installed but obviously not used, /etc/lvm.conf is set to file based
>> locking and the locking_dir is local to every server,
>
> Yes, we have cluster wide locking, as long as you use the pve tools to manage storage.

Well, as far as I understand, with cluster-wide locking in place there is
no problem for DRBD or iSCSI/FC targets.

The difference which I am really not sure about is the RAID setup with 
mdadm using 2 iSCSI-Targets.

The fault scenario I am thinking about is this:

Node pve1 is running a VM managed by HA when it crashes. At the moment of
the crash the VM was writing data to its hard disk. In normal operation
the data is passed to LVM, which passes it to mdadm, and mdadm writes the
data to each RAID member disk.

I suspect that there is a chance that the last write operation
succeeded on only one of the RAID members.

Now the cluster starts its work and will do two things: fence the failed
node and start the VM on another node, let's say pve2.

Since the restart of pve1 initiated by fencing will take some time and
the VM on pve2 starts booting earlier, it is likely that the RAID metadata
will still state "clean" when pve1 reconnects to the storage, so that
will not be a problem.

But looking at the physical extents used by the logical volume, the
situation is different: the last write operation may have failed, and
now the extents on the two members may hold different data. When data is
read from a RAID1 volume, mdadm is supposed to do round-robin reading in
order to speed up disk access. I believe there is a 50/50 chance as to
which RAID member the extent will be read from, so it is undefined
whether the correct data will be read. Or am I missing something here?
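
To make the scenario concrete, here is a toy sketch in Python of what I
have in mind. It is only an illustration, not a model of the real md
driver: the ToyMirror class and its methods are hypothetical names, and
the strict alternation of reads between the two members is my assumption
about read balancing, which the real driver may handle differently.

```python
class ToyMirror:
    """Toy two-way mirror (hypothetical; not how the md driver is implemented)."""

    def __init__(self, blocks):
        # Two member "disks", both starting with identical content.
        self.members = [[b"old"] * blocks, [b"old"] * blocks]
        self.next_member = 0

    def write(self, block, data, crash_after_first=False):
        # A mirrored write has to reach both members.
        self.members[0][block] = data
        if crash_after_first:
            return  # the node dies before the second member is updated
        self.members[1][block] = data

    def read(self, block):
        # Assumption: reads strictly alternate between the members.
        member = self.next_member
        self.next_member = (member + 1) % 2
        return self.members[member][block]

    def mismatches(self):
        # Count blocks whose content differs between the two members.
        return sum(a != b for a, b in zip(*self.members))


mirror = ToyMirror(blocks=4)
mirror.write(0, b"new")                          # completes on both members
mirror.write(2, b"new", crash_after_first=True)  # interrupted by the crash

first_read = mirror.read(2)   # served by member 0
second_read = mirror.read(2)  # served by member 1
print(first_read, second_read, mirror.mismatches())
```

In this toy model two consecutive reads of the same block return
different data, although nothing in the metadata marks the array as
dirty. That is exactly the situation I am worried about.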

The cluster-wide locking works at the LVM layer. But my concern this
time is one layer further down: mdadm.

Stefan