[PVE-User] sharing zfs experience
Stoiko Ivanov
s.ivanov at proxmox.com
Mon Jul 8 20:19:18 CEST 2019
hi,
On Mon, 8 Jul 2019 18:50:07 +0200
Tonči Stipičević <tonci at suma-informatika.hr> wrote:
> Hi to all,
>
> A customer of mine runs two clusters:
>
> 1. a 2-node cluster with an IBM v370 SAN as shared storage (shared LVM)
>
> 2. a 3-node cluster; all nodes run ZFS ... no shared storage
>
>
> A couple of days ago he had a power outage, and during that period I
> was rather worried about how apcupsd & Proxmox would handle the
> situation.
>
> 1. Both nodes were shut down properly, but one of the two died,
> independently of the power outage :) yet at exactly the same time. I
> booted up the remaining node, adjusted the expected "votes" (see the
> commands just below) and started all VMs residing on the shared LVM
> storage ... no further questions ... Proxmox handled that correctly.
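> For reference, the "votes" adjustment was roughly the following (just
> a sketch - pvecm expected lowers the number of votes needed for
> quorum so the single surviving node can operate on its own):
>
> # on the surviving node, expect only one vote for quorum
> pvecm expected 1
> # verify the quorum state afterwards
> pvecm status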
>
> 2. All 3 nodes started up, but the most important LXC container could
> not start.
>
> Reason: Job for pve-container@104.service failed because the control
> process exited with error code. See "systemctl status
> pve-container@104.service" and "journalctl -xe" for details. TASK
> ERROR: command 'systemctl start pve-container@104' failed: exit code 1
>
> Upgrading, restarting etc. did not help at all. The problem was that
> the rootfs of this container was completely empty (it contained only
> the /dev/ and /mnt/ dirs). Fortunately the second mount point (aka the
> 2nd disk) with 2 TB of data was perfectly healthy and visible. So one
> option was to restore it from backup, but the zfs list command showed
> that this dataset still holds as much data as it should (disk-0).
This somehow reminds me of a recent thread in the forum:
https://forum.proxmox.com/threads/reboot-of-pve-host-breaks-lxc-container-startup.55486/#post-255641
did the rpool get imported completely, or are there errors in the
journal from when the system booted?
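Something along these lines should show whether the pool and the
container dataset were actually imported and mounted at boot (just a
sketch; the unit names assume the default ZFS setup on PVE):

zpool status rpool
# was the subvolume actually mounted, and where?
zfs get mounted,mountpoint rpool/data/subvol-104-disk-0
# boot-time logs of the ZFS import/mount services
journalctl -b -u zfs-import-cache.service -u zfs-mount.service
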
In any case - glad you managed to resolve the issue!
>
> root@pve01-hrz-zm:~# ls -al /rpool/data/subvol-104-disk-0/
> total 10
> drwxr-xr-x 4 root root 4 Srp 4 14:07 .
> drwxr-xr-x 9 root root 9 Srp 4 23:17 ..
> drwxr-xr-x 2 root root 2 Srp 4 14:07 dev
> drwxr-xr-x 3 root root 3 Srp 4 14:07 mnt
>
> root@pve01-hrz-zm:~# zfs list
> NAME                           USED  AVAIL  REFER  MOUNTPOINT
> rpool                         2,15T  1,36T   104K  /rpool
> rpool/data                    2,15T  1,36T   128K  /rpool/data
> rpool/data/subvol-104-disk-0   751M  15,3G   751M  /rpool/data/subvol-104-disk-0
> rpool/data/subvol-104-disk-1  2,15T   894G  2,15T  /rpool/data/subvol-104-disk-1
>
>
> Interestingly, both LXC containers on this node had an "empty" disk-0
> (the other container was not that big, it had only disk-0) and
> neither of them could start.
>
> After many tries I decided to migrate the smaller container to
> another node just to see what would happen: the migration was
> successful, and so was starting it up. OK (true relief, finally :).
> Then I tried to make a backup of it, again just to see what would
> happen. No, the backup was not successful ... the backup archive was
> only 1.7 KB. OK, back to the migration scenario. So the final
> conclusion was that migration itself was not the solution; the
> snapshot was the right one. The snapshot was the step that revived
> the disk-0.
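> Roughly what I ran (the CT ID and the target node name here are only
> examples):
>
> pct migrate 104 pve02    # migration worked and the CT started on the target
> vzdump 104               # not really successful - archive was only ~1.7 KB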
>
> So, in the end I just made a snapshot of 104-disk-0, cloned it right
> back to 1044-disk-0 and then simply changed the reference in the LXC
> configuration. After that the container started successfully.
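> In commands that was roughly the following (the snapshot name is
> arbitrary):
>
> zfs snapshot rpool/data/subvol-104-disk-0@rescue
> zfs clone rpool/data/subvol-104-disk-0@rescue rpool/data/subvol-1044-disk-0
> # point the rootfs entry in /etc/pve/lxc/104.conf at subvol-1044-disk-0
> pct start 104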
>
>
> I am still wondering why this happened, but I am also very happy that
> the simple steps above saved my day.
>
> Hopefully this information helps somebody who runs into the same
> problem, but at the same time I truly hope that it won't happen :)
>
>
> BR
>
> Tonci Stipicevic
>
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user