[PVE-User] Problem with LVM - lvremove command - deadlock ? - version : 2.6.32-4-pve #1 SMP Wed Dec 15 14:04:31 CET 2010 x86_64 GNU/Linux

Wed Apr 6 16:17:09 CEST 2011

Hi

I use Proxmox 1.7 and run backup jobs with LVM snapshots in order to
back up KVM virtual machines.
I have two nodes, with disk replication via DRBD.

I use Bacula for my backup jobs. A before-job script runs lvcreate to
make a new snapshot, then kpartx to detect the partitions on the
volume and create the associated block devices in /dev/mapper/…
Each partition is then mounted.

At the end of the job another script does the reverse operation: all
the partitions are unmounted, and kpartx -d removes the devices
associated with the partitions.
After that I run lvremove -f in order to delete the snapshot.
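
The before/after steps described above can be sketched as two shell
functions. This is only an illustration, not the actual Bacula scripts:
the volume names are taken from the test loop further down, only the
first partition is mounted, and the RUN variable and the dm_name helper
are hypothetical names introduced here. dm_name relies on the fact that
device-mapper doubles every hyphen inside VG and LV names, which is why
the snapshot shows up as drbd0vg-test--lvremove--snap.

```shell
#!/bin/sh
# Sketch only. RUN defaults to "echo", so the commands are printed
# instead of executed; set RUN to an empty string to really run them.
RUN=${RUN-echo}
VG=drbd0vg
ORIGIN=vm-101-disk-1
SNAP=test-lvremove-snap
MOUNTPOINT=/media/test-lvremove/

# Build the /dev/mapper name of an LV: hyphens inside the VG and LV
# names are doubled, and the two parts are joined by a single hyphen.
dm_name() {
    vg_part=$(printf '%s' "$1" | sed 's/-/--/g')
    lv_part=$(printf '%s' "$2" | sed 's/-/--/g')
    printf '%s-%s\n' "$vg_part" "$lv_part"
}

# Before the backup: snapshot, map the partitions, mount partition 1.
before_job() {
    $RUN /sbin/lvcreate -L 5G -s -n "$SNAP" "/dev/$VG/$ORIGIN"
    $RUN kpartx -a -v "/dev/$VG/$SNAP"
    $RUN mount -v "/dev/mapper/$(dm_name "$VG" "$SNAP")1" "$MOUNTPOINT"
}

# After the backup: the same steps in reverse order.
after_job() {
    $RUN umount -v "/dev/mapper/$(dm_name "$VG" "$SNAP")1"
    $RUN kpartx -d -v "/dev/$VG/$SNAP"
    $RUN lvremove -f "/dev/$VG/$SNAP"
}
```

Calling before_job and then after_job with the default RUN=echo just
prints the six commands in order, which is a safe way to check the
device names before running anything for real.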

The snapshot is deleted from /dev/mapper, but lvremove hangs, and so
does every other LVM command afterwards (lvdisplay…).
ps aux shows the following:
root     10188  0.0  0.0  26036 13760 pts/0    D<L+ 14:10   0:00 lvremove
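
In the STAT column above, D means uninterruptible sleep (usually
waiting on I/O), which is why the process cannot be killed. As a
minimal sketch, assuming the procps version of ps, all processes in
that state could be listed like this. The function names are
hypothetical, and the wchan column only gives a hint about where the
process is waiting in the kernel.

```shell
#!/bin/sh
# Keep only the lines whose second field (process state) starts with D.
filter_d_state() {
    awk '$2 ~ /^D/'
}

# List PID, state, kernel wait channel and command line of every
# process in uninterruptible sleep. The "=" after each column name
# suppresses the header line (procps ps).
list_d_state() {
    ps -eo pid=,stat=,wchan=,args= | filter_d_state
}
```

Running list_d_state while lvremove is stuck should show it together
with any other LVM command blocked behind it.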

At first I suspected that my script did not unmount the partitions
successfully. But I recently tried the following commands:

while true ; do
    /sbin/lvcreate -L 5G -s -n test-lvremove-snap /dev/drbd0vg/vm-101-disk-1
    kpartx -a -v /dev/drbd0vg/test-lvremove-snap
    mount -v /dev/mapper/drbd0vg-test--lvremove--snap1 /media/test-lvremove/
    umount -v /dev/mapper/drbd0vg-test--lvremove--snap1
    kpartx -d -v /dev/drbd0vg/test-lvremove-snap
    lvremove -f /dev/drbd0vg/test-lvremove-snap
    sleep 2
done

while true ; do
    /sbin/lvcreate -L 5G -s -n test-lvremove-snap /dev/drbd0vg/vm-101-disk-1 ; sleep 3
    kpartx -a -v /dev/drbd0vg/test-lvremove-snap ; sleep 3
    mount -v /dev/mapper/drbd0vg-test--lvremove--snap1 /media/test-lvremove/ ; sleep 3
    umount -v /dev/mapper/drbd0vg-test--lvremove--snap1 ; sleep 3
    kpartx -d -v /dev/drbd0vg/test-lvremove-snap ; sleep 3
    lvremove -f /dev/drbd0vg/test-lvremove-snap ; sleep 3
done

The first loop, which does not include the "sleep" delays between the
commands, hung after a few iterations. I also tried it without the
mount and umount, and it hung too.
This scenario fails very quickly if there is a lot of I/O on the disk.
Before implementing my scripts I never had any problem with lvremove;
there was always some time between two commands.
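
Since the timing between kpartx -d and lvremove seems to matter, one
thing a script could do instead of a fixed sleep is wait until the
partition mappings have really disappeared before calling lvremove.
The following is only a sketch under that assumption. It has not been
tested against this hang and does not address whatever happens in the
kernel. The function name and its arguments are hypothetical.

```shell
#!/bin/sh
# wait_for_partitions_gone DIR PREFIX [TRIES]
# Poll DIR once per second until no entry named PREFIX followed by a
# digit is left. Returns 0 when they are gone, 1 after TRIES attempts.
wait_for_partitions_gone() {
    dir=$1
    prefix=$2
    tries=${3:-10}
    while [ "$tries" -gt 0 ]; do
        found=0
        for node in "$dir/$prefix"[0-9]*; do
            # An unmatched glob stays literal, so test for existence.
            if [ -e "$node" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

In the test loop it would be called between kpartx -d and lvremove,
for example as
wait_for_partitions_gone /dev/mapper drbd0vg-test--lvremove--snap
and lvremove would only be run if it returns 0.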

Trace of the first test:

  Logical volume "test-lvremove-snap" created
add map drbd0vg-test--lvremove--snap1 (254:22): 0 64251904 linear /dev/drbd0vg/test-lvremove-snap 2048
add map drbd0vg-test--lvremove--snap2 (254:23): 0 2850818 linear /dev/drbd0vg/test-lvremove-snap 64255998
add map drbd0vg-test--lvremove--snap5 (254:24): 0 2850816 254:23 2
mount: you didn't specify a filesystem type for /dev/mapper/drbd0vg-test--lvremove--snap1
       I will try type ext4
/dev/mapper/drbd0vg-test--lvremove--snap1 on /media/test-lvremove type ext4 (rw)
/dev/mapper/drbd0vg-test--lvremove--snap1 umounted
del devmap : drbd0vg-test--lvremove--snap5
del devmap : drbd0vg-test--lvremove--snap2
del devmap : drbd0vg-test--lvremove--snap1

kpartx followed by lvremove seems to cause a deadlock; when it happens
the whole node becomes unusable, along with all its VMs.
I am still able to log in and run commands on the node, but everything
related to LVM and the virtual machines is impossible.
DRBD is still running and synchronizing volumes.

Do you have any ideas? Is this a known bug?
I saw some posts on forums. Someone said that it may come from kernel
version 2.6.32.

Thanks

Hugo