[PVE-User] sharing zfs experience
s.ivanov at proxmox.com
Mon Jul 8 20:19:18 CEST 2019
On Mon, 8 Jul 2019 18:50:07 +0200
Tonči Stipičević <tonci at suma-informatika.hr> wrote:
> Hi to all,
> A customer of mine runs two clusters:
> 1. a 2-node cluster with an ibm v370 san as shared storage (shared lvm)
> 2. a 3-node cluster where all nodes run zfs ... no shared storage
> A couple of days ago he had a power outage, and during that time I
> was somewhat worried about how apcupsd & proxmox would handle it.
> 1. Both nodes were properly shut down, but one of the 2 died,
> independently of the power outage :) just at the same time. I booted
> up the remaining node, adjusted the "votes" and started all VMs
> residing on the shared lvm storage ... No further questions ... prox
> handled that correctly.
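[For reference, the "adjusted the votes" step on a lone surviving node of a 2-node cluster is usually done with pvecm; a minimal sketch, assuming a standard corosync setup and a hypothetical VMID 100:

```shell
# On the single surviving node, lower the expected vote count
# so the node regains quorum (use with care - only when the
# other node is known to be down):
pvecm expected 1

# Verify that the node is quorate before starting guests:
pvecm status

# Then start the VMs residing on the shared LVM storage, e.g.:
qm start 100
```

Note that `pvecm expected 1` only lasts until the next corosync restart, which is usually what you want for a temporary outage.]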
> 2. All 3 nodes started up, but the most important lxc container could
> not start.
> Reason: Job for pve-container at 104.service failed because the control
> process exited with error code. See "systemctl status
> pve-container at 104.service" and "journalctl -xe" for details. TASK
> ERROR: command 'systemctl start pve-container at 104' failed: exit code 1
> Upgrading, restarting etc. did not help at all. The problem was that
> the rootfs of this container was completely empty (it contained only
> the /dev/ and /mnt/ dirs). Fortunately the second mount point (aka
> the 2nd disk) with 2T of data was perfectly healthy and visible. So
> one option was to restore it from backup, but the zfs list command
> showed that this dataset still held as much data as it should (disk 0).
This somehow reminds me of a recent thread in the forum:
did the rpool get imported completely - or are there some errors in the
journal while the system booted?
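A quick way to check this (a sketch, assuming the default systemd-based ZFS setup that Proxmox VE ships):

```shell
# Did every dataset of rpool actually get mounted?
zfs list -o name,mounted,mountpoint -r rpool

# Any ZFS import/mount errors during the current boot?
journalctl -b | grep -iE 'zfs|rpool'

# State of the units responsible for importing and mounting the pool:
systemctl status zfs-import-cache.service zfs-mount.service
```

A dataset that shows `mounted no` while still holding its `USED` space would match the symptom you describe.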
In any case - glad you managed to resolve the issue!
> root at pve01-hrz-zm:~# ls -al /rpool/data/subvol-104-disk-0/
> total 10
> drwxr-xr-x 4 root root 4 Jul  4 14:07 .
> drwxr-xr-x 9 root root 9 Jul  4 23:17 ..
> drwxr-xr-x 2 root root 2 Jul  4 14:07 dev
> drwxr-xr-x 3 root root 3 Jul  4 14:07 mnt
> root at pve01-hrz-zm:~# zfs list
> NAME                           USED  AVAIL  REFER  MOUNTPOINT
> rpool                         2,15T  1,36T   104K  /rpool
> rpool/data                    2,15T  1,36T   128K  /rpool/data
> rpool/data/subvol-104-disk-0   751M  15,3G   751M
> rpool/data/subvol-104-disk-1  2,15T   894G  2,15T
> Interestingly, both lxc containers on this node had an "empty"
> disk-0 (though the other one was not that big, it had only disk-0),
> and neither of them could start.
> After many tries I decided to migrate the little container to another
> node, just to see what would happen: the migration was successful,
> and so was starting it up. OK (true relief, finally :). Then I tried
> to make a backup of this container, again just to see what would
> happen. No, the backup was not successful ... the backup archive was
> only 1.7KB. OK, back to the migration scenario. The final conclusion
> was that the migration itself was not the solution; the snapshot was.
> The snapshot was the step that revived this disk-0.
> So, in the end I just made a snapshot of 104-disk-0, cloned it right
> back to 1044-disk-0 and then just changed the reference in the lxc
> configuration. After that the lxc started successfully.
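[The snapshot/clone sequence you describe would look roughly like this; the snapshot name `@revive` and the `local-zfs` storage name are assumptions, the dataset and CT names are taken from your post:

```shell
# Snapshot the seemingly empty (but still space-referencing) rootfs dataset:
zfs snapshot rpool/data/subvol-104-disk-0@revive

# Clone the snapshot back to a fresh dataset:
zfs clone rpool/data/subvol-104-disk-0@revive rpool/data/subvol-1044-disk-0

# Point the container at the clone by editing its rootfs line in
# /etc/pve/lxc/104.conf (storage name is an assumption), e.g.:
#   rootfs: local-zfs:subvol-1044-disk-0,size=16G

# Then start the container:
pct start 104
```

One caveat: a clone stays dependent on its origin snapshot, so you cannot destroy subvol-104-disk-0 later unless you first run `zfs promote` on the clone.]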
> I really wonder why this happened, but I am also very happy that the
> simple steps above saved my day.
> Hopefully this information helps somebody who runs into the same
> problem, though at the same time I truly hope it won't happen :)
> Tonci Stipicevic
> pve-user mailing list
> pve-user at pve.proxmox.com