[PVE-User] sharing zfs experience
Tonči Stipičević
tonci at suma-informatika.hr
Mon Jul 8 18:50:07 CEST 2019
Hi all,
A customer of mine runs two clusters:
1. a 2-node cluster with an IBM V370 SAN as shared storage (shared LVM)
2. a 3-node cluster where all nodes run ZFS ... no shared storage
A couple of days ago he had a power outage, and during that time I was
somewhat worried about how apcupsd & Proxmox would handle the situation.
1. Both nodes were shut down properly, but one of the two died,
independently of the power outage :) just at the same time. I booted up
the remaining node, adjusted the quorum "votes" (see the sketch right
below) and started all VMs residing on the shared LVM storage ... no
further questions ... Proxmox handled that correctly.
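For reference, this is roughly what the vote adjustment looks like on the
surviving node; a minimal sketch, assuming the dead node stays offline
until it is repaired (the prompt/hostname here is illustrative):

root@pve01:~# pvecm expected 1

That tells corosync to expect only one vote, so the single remaining node
regains quorum and guests can be started again.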
2. All 3 nodes started up, but the most important LXC container could not
start.
Reason: Job for pve-container@104.service failed because the control
process exited with error code. See "systemctl status
pve-container@104.service" and "journalctl -xe" for details. TASK ERROR:
command 'systemctl start pve-container@104' failed: exit code 1
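When the systemd unit gives no useful detail, one way to dig deeper (a
hedged suggestion, not part of my troubleshooting at the time; the log
path is illustrative) is to start the container in the foreground with
debug logging:

root@pve01-hrz-zm:~# lxc-start -n 104 -F -l DEBUG -o /tmp/lxc-104.log

The debug log usually points at the mount or rootfs step that fails.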
Upgrading, restarting etc. did not help at all. The problem was that the
rootfs of this container was completely empty (it contained only the
/dev/ and /mnt/ dirs). Fortunately the second mount point (aka the 2nd
disk) with 2 TB of data was perfectly healthy and visible. So one option
was to restore it from backup, but the zfs list command showed that this
dataset still held as much data as it should (disk-0):
root@pve01-hrz-zm:~# ls -al /rpool/data/subvol-104-disk-0/
total 10
drwxr-xr-x 4 root root 4 Jul 4 14:07 .
drwxr-xr-x 9 root root 9 Jul 4 23:17 ..
drwxr-xr-x 2 root root 2 Jul 4 14:07 dev
drwxr-xr-x 3 root root 3 Jul 4 14:07 mnt
root@pve01-hrz-zm:~# zfs list
NAME                           USED  AVAIL  REFER  MOUNTPOINT
rpool                         2,15T  1,36T   104K  /rpool
rpool/data                    2,15T  1,36T   128K  /rpool/data
rpool/data/subvol-104-disk-0   751M  15,3G   751M  /rpool/data/subvol-104-disk-0
rpool/data/subvol-104-disk-1  2,15T   894G  2,15T  /rpool/data/subvol-104-disk-1
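In hindsight, a REFER of 751M combined with an almost empty mountpoint
could mean the dataset simply was not mounted after boot (the dev/ and
mnt/ dirs would then belong to the parent filesystem). A quick check,
offered as a hedged suggestion rather than something from my original
investigation, would be:

root@pve01-hrz-zm:~# zfs get mounted,mountpoint rpool/data/subvol-104-disk-0
root@pve01-hrz-zm:~# zfs mount rpool/data/subvol-104-disk-0

If "mounted" shows "no", mounting the dataset by hand (after clearing
whatever sits in the mountpoint directory) may already bring the data back.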
Interestingly, both LXC containers on this node had an "empty" disk-0
(the other one was not that big, it had only disk-0), and neither of them
could start.
After many tries I decided to migrate the small container to another node
just to see what would happen: the migration was successful and so was
starting it up. OK (true relief, finally :). Then I tried to make a
backup of this container, again just to see what would happen. No, the
backup was not successful ... the backup archive was only 1.7 KB. OK,
back to the migration scenario. So the final conclusion was that
migration itself was not the solution; the snapshot was. Taking a
snapshot was the step that revived disk-0.
So, in the end I just took a snapshot of 104-disk-0, cloned it right back
to 1044-disk-0, and then changed the reference in the LXC configuration.
After that the container started successfully.
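For anyone hitting the same thing, this is roughly what it looks like on
the command line; a sketch only, where the snapshot name @rescue and the
clone name are illustrative, so adjust them to your pool and storage
layout:

root@pve01-hrz-zm:~# zfs snapshot rpool/data/subvol-104-disk-0@rescue
root@pve01-hrz-zm:~# zfs clone rpool/data/subvol-104-disk-0@rescue rpool/data/subvol-1044-disk-0
# then edit /etc/pve/lxc/104.conf so the rootfs: entry points at subvol-1044-disk-0
root@pve01-hrz-zm:~# pct start 104

Keep in mind that a clone stays dependent on its origin snapshot; running
zfs promote on the clone removes that dependency once everything works.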
I'm still wondering why this happened, but I'm also very happy that the
simple steps above saved my day.
Hopefully this information helps somebody who runs into the same problem,
though at the same time I truly hope it won't happen :)
BR
Tonci Stipicevic