[PVE-User] Thin LVM showing more used space than expected

Martin Holub martin at holub.co.at
Tue Dec 27 20:39:23 CET 2022


On 27.12.2022 at 18:54, Óscar de Arriba wrote:
> Hello all,
>
>  Since about a week ago, the data LVM on one of my Proxmox nodes has been doing strange things.
>
>   For storage, I'm using a commercial Crucial MX500 SATA SSD connected directly to the motherboard controller (no PCIe HBA for the system+data disk). It is brand new, S.M.A.R.T. checks are passing, and it shows only 4% wearout. I have set up Proxmox in a cluster with LVM and make backups to an external NFS location.
>
> Last week I tried to migrate a stopped VM of ~64 GiB from one server to another, and found that *the SSD started to underperform (~5 MB/s) after roughly 55 GiB had been copied* (this pattern repeated several times).
> It was so bad that *even after cancelling the migration, the SSD stayed busy writing at that speed and I had to reboot the instance, as it was completely unusable* (it is in my homelab, not running mission-critical workloads, so that was acceptable). After the reboot, I could remove the half-copied VM disk.
>
> After that (and several retries, even making a backup to external storage and trying to restore it, just in case the bottleneck was in the migration process), I ended up creating the instance from scratch and migrating the data from one VM to the other, so the VM was created brand new and no bottleneck was hit.
>
> The problem is that *now the pve/data logical volume is showing 377 GiB used, but the total size of the stored VM disks (even if they were 100% provisioned) is 168 GiB*. I checked, and neither VM has any snapshots.
>
> I don't know whether rebooting while the disk was being written to (always having cancelled the migration first) damaged the LV in some way, but thinking about it, it does not even make sense that an SSD of this type ends up writing at 5 MB/s, even with the write cache full. It should write far faster than that even without a cache.
>
> Some information about the storage:
>
> `root@venom:~# lvs -a
>    LV              VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
>    data            pve twi-aotz-- 377.55g             96.13  1.54
>    [data_tdata]    pve Twi-ao---- 377.55g
>    [data_tmeta]    pve ewi-ao----  <3.86g
>    [lvol0_pmspare] pve ewi-------  <3.86g
>    root            pve -wi-ao----  60.00g
>    swap            pve -wi-ao----   4.00g
>    vm-150-disk-0   pve Vwi-a-tz--   4.00m data        14.06
>    vm-150-disk-1   pve Vwi-a-tz-- 128.00g data        100.00
>    vm-201-disk-0   pve Vwi-aotz--   4.00m data        14.06
>    vm-201-disk-1   pve Vwi-aotz--  40.00g data        71.51`
>
> This can also be seen in the forum post I made a couple of days ago: https://forum.proxmox.com/threads/thin-lvm-showing-more-used-space-than-expected.120051/
>
> Any ideas, aside from backing up and reinstalling from scratch?
>
> Thanks in advance!
>
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user

Hi,

I have never used lvm-thin, so beware, this is just guessing, but to me 
this looks like something filled up your pool at some point (probably 
the migration?). Consumer SSDs don't perform well once all of their 
space has been allocated (at least to my knowledge): even if there is 
still free space in the thin pool, the SSD's controller no longer knows 
of any free blocks. The low speed may come from exactly this situation, 
as the controller has to erase blocks before it can write to them 
again, due to the lack of (known) free space. Did you try running 
fstrim inside the VMs to reclaim the allocated space? At least on 
Linux, something like "fstrim -av" should do the trick. Note that the 
"discard" option needs to be enabled for every volume you want to trim, 
so check the VM config first.
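
A minimal sketch of the commands involved, assuming Linux guests and 
using the VM IDs 150 and 201 and the "pve" pool name from your lvs 
output (adjust as needed):

  # On the PVE host: check whether discard is enabled on the VM disks
  qm config 150 | grep -i discard
  qm config 201 | grep -i discard

  # Inside each Linux guest: trim all mounted filesystems that support it
  fstrim -av

  # Back on the host: the Data% column of the thin pool should drop
  lvs pve/data

If discard is not set yet, it can be enabled per disk (the "Discard" 
checkbox in the VM's hardware options, or discard=on on the disk line 
in the VM config); the guest usually needs to be restarted before the 
trim actually reaches the thin pool.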

hth
Martin



