[PVE-User] sharing zfs experience
Stoiko Ivanov
s.ivanov at proxmox.com
Mon Jul 8 20:19:18 CEST 2019
hi,
On Mon, 8 Jul 2019 18:50:07 +0200
Tonči Stipičević <tonci at suma-informatika.hr> wrote:
> Hi to all,
>
> A customer of mine runs two clusters:
>
> 1. a 2-node cluster with an IBM v370 SAN as shared storage (shared LVM)
>
> 2. a 3-node cluster; all nodes run ZFS ... no shared storage
>
>
> A couple of days ago he had a power outage, and during that period I
> was rather worried about how apcupsd & Proxmox would handle the
> situation.
>
> 1. Both nodes were shut down properly, but one of the two died,
> independently of the power outage :) yet at exactly the same time. I
> booted up the remaining node, adjusted the expected "votes" (see the
> commands just below) and started all VMs residing on the shared LVM
> storage ... no further questions ... Proxmox handled that correctly.
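> For reference, the "votes" adjustment was roughly the following (just
> a sketch - pvecm expected lowers the number of votes needed for
> quorum so the single surviving node can operate on its own):
>
> # on the surviving node, expect only one vote for quorum
> pvecm expected 1
> # verify the quorum state afterwards
> pvecm status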
>
> 2. All 3 nodes started up, but the most important LXC container could
> not start.
>
> Reason: Job for pve-container@104.service failed because the control
> process exited with error code. See "systemctl status
> pve-container@104.service" and "journalctl -xe" for details. TASK
> ERROR: command 'systemctl start pve-container@104' failed: exit code 1
>
> Upgrading, restarting etc. did not help at all. The problem was that
> the rootfs of this container was completely empty (it contained only
> the /dev/ and /mnt/ dirs). Fortunately the second mount point (aka the
> 2nd disk) with 2 TB of data was perfectly healthy and visible. So one
> option was to restore it from backup, but the zfs list command showed
> that this dataset still holds as much data as it should (disk-0).
This somehow reminds me of a recent thread in the forum:
https://forum.proxmox.com/threads/reboot-of-pve-host-breaks-lxc-container-startup.55486/#post-255641
did the rpool get imported completely, or are there errors in the
journal from when the system booted?
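Something along these lines should show whether the pool and the
container dataset were actually imported and mounted at boot (just a
sketch; the unit names assume the default ZFS setup on PVE):

zpool status rpool
# was the subvolume actually mounted, and where?
zfs get mounted,mountpoint rpool/data/subvol-104-disk-0
# boot-time logs of the ZFS import/mount services
journalctl -b -u zfs-import-cache.service -u zfs-mount.service
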
In any case - glad you managed to resolve the issue!
>
> root@pve01-hrz-zm:~# ls -al /rpool/data/subvol-104-disk-0/
> total 10
> drwxr-xr-x 4 root root 4 Srp 4 14:07 .
> drwxr-xr-x 9 root root 9 Srp 4 23:17 ..
> drwxr-xr-x 2 root root 2 Srp 4 14:07 dev
> drwxr-xr-x 3 root root 3 Srp 4 14:07 mnt
>
> root@pve01-hrz-zm:~# zfs list
> NAME                           USED  AVAIL  REFER  MOUNTPOINT
> rpool                         2,15T  1,36T   104K  /rpool
> rpool/data                    2,15T  1,36T   128K  /rpool/data
> rpool/data/subvol-104-disk-0   751M  15,3G   751M  /rpool/data/subvol-104-disk-0
> rpool/data/subvol-104-disk-1  2,15T   894G  2,15T  /rpool/data/subvol-104-disk-1
>
>
> Interestingly, both LXC containers on this node had an "empty" disk-0
> (the other container was not that big, it had only disk-0) and
> neither of them could start.
>
> After many tries I decided to migrate the smaller container to
> another node just to see what would happen: the migration was
> successful, and so was starting it up. OK (true relief, finally :).
> Then I tried to make a backup of it, again just to see what would
> happen. No, the backup was not successful ... the backup archive was
> only 1.7 KB. OK, back to the migration scenario. So the final
> conclusion was that migration itself was not the solution; the
> snapshot was the right one. The snapshot was the step that revived
> the disk-0.
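> Roughly what I ran (the CT ID and the target node name here are only
> examples):
>
> pct migrate 104 pve02    # migration worked and the CT started on the target
> vzdump 104               # not really successful - archive was only ~1.7 KB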
>
> So, in the end I just made a snapshot of 104-disk-0, cloned it right
> back to 1044-disk-0 and then simply changed the reference in the LXC
> configuration. After that the container started successfully.
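> In commands that was roughly the following (the snapshot name is
> arbitrary):
>
> zfs snapshot rpool/data/subvol-104-disk-0@rescue
> zfs clone rpool/data/subvol-104-disk-0@rescue rpool/data/subvol-1044-disk-0
> # point the rootfs entry in /etc/pve/lxc/104.conf at subvol-1044-disk-0
> pct start 104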
>
>
> I am still wondering why this happened, but I am also very happy that
> the simple steps above saved my day.
>
> Hopefully this information helps somebody who runs into the same
> problem, but at the same time I truly hope that it won't happen :)
>
>
> BR
>
> Tonci Stipicevic
>
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user