[PVE-User] sharing zfs experience
tonci at suma-informatika.hr
Tue Jul 9 12:34:03 CEST 2019
Hi Stoiko,
thank you for your reply.
Now I'm even more worried after reading the recent thread you sent me
:( ... I'm no longer sure what to expect after the next reboot :)
So the question is: how can I avoid such scenarios in the future?
My pools seem to be fully healthy: zpool status shows no errors at
all. I think something went wrong at the container level. How come
disk-1 survived and disk-0 did not?
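
One thing that might be worth checking here (an assumption, nothing in this thread confirms it) is whether the container datasets were actually mounted after boot. A dataset that still reports data under USED but shows a nearly empty directory can simply be unmounted, in which case ls only sees the bare mountpoint directory. The sketch below uses a hypothetical helper, unmounted_datasets, that filters the tab-separated output of zfs get for datasets whose "mounted" property is "no".

```shell
#!/bin/sh
# Hypothetical helper: print the names of datasets that are not mounted.
# Expects the tab-separated output of
#   zfs get -H -o name,value mounted
# on standard input (column 1 = dataset name, column 2 = yes/no).
unmounted_datasets() {
    awk -F '\t' '$2 == "no" { print $1 }'
}

# Typical use on the affected node (needs root and the ZFS tools):
#   zfs get -H -o name,value -r -t filesystem mounted rpool/data | unmounted_datasets
```

If subvol-104-disk-0 shows up in that list, the data is most likely still in the dataset and only the mount is missing.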
I can send some reports, such as the syslog, so please just tell me
what you need.
Thank you very much in advance.
srdačan pozdrav / best regards
Tonči Stipičević, dipl. ing. elektr.
direktor / manager
podrška / upravljanje IT sustavima za male i srednje tvrtke
Small & Medium Business IT support / management
Badalićeva 27 / 10000 Zagreb / Hrvatska - Croatia
mob: +385 91 1234003
fax: +385 1 5560007
On 08. 07. 2019. 20:19, Stoiko Ivanov wrote:
> Plus Hosting<support at plus.hr>
> On Mon, 8 Jul 2019 18:50:07 +0200
> Tonči Stipičević <tonci at suma-informatika.hr> wrote:
>> Hi to all,
>> A customer of mine runs two clusters:
>> 1. A 2-node cluster with an IBM v370 SAN as shared storage (shared LVM)
>> 2. A 3-node cluster where all nodes run ZFS, with no shared storage
>> A couple of days ago he had a power outage, and during that time
>> I was rather worried about how apcupsd and Proxmox would handle it.
>> 1. Both nodes were shut down properly, but one of the two died,
>> independently of the power outage :) but at the same time. I booted
>> the remaining node, adjusted the "votes" and started all VMs residing
>> on the shared LVM storage. No further questions: Proxmox handled
>> that correctly.
>> 2. All 3 nodes started up, but the most important LXC container could
>> not start.
>> Reason: Job for pve-container@104.service failed because the control
>> process exited with error code. See "systemctl status
>> pve-container@104.service" and "journalctl -xe" for details. TASK
>> ERROR: command 'systemctl start pve-container@104' failed: exit code 1
>> Upgrading, restarting, etc. did not help at all. The problem was
>> that the rootfs of this container was completely empty (it contained
>> only the /dev/ and /mnt/ dirs). Fortunately the second mount point
>> (aka 2nd disk) with 2T of data was healthy and visible. So one option
>> was to restore it from backup, but the zfs list command showed that
>> this dataset (disk 0) still held as much data as it should.
> This somehow reminds me of a recent thread in the forum:
> did the rpool get imported completely - or are there some errors in the
> journal while the system booted?
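
A rough way to answer that question is to look through the journal of the current boot for ZFS-related failures. The sketch below is only an illustration: zfs_boot_errors is a hypothetical helper, and the keywords it greps for (fail, error, cannot) are an assumption about what such messages usually contain, so it may miss relevant lines or match harmless ones.

```shell
#!/bin/sh
# Hypothetical helper: keep only journal lines that mention ZFS or zpool
# and also look like a failure. The keyword list is a guess, not a
# complete catalogue of ZFS error messages.
zfs_boot_errors() {
    grep -iE 'zfs|zpool' | grep -iE 'fail|error|cannot'
}

# Typical use on the affected node:
#   journalctl -b --no-pager | zfs_boot_errors
#   systemctl status zfs-import-cache.service zfs-mount.service
```

An empty result does not prove the import and mount went fine; it only means none of the guessed keywords appeared next to "zfs" or "zpool" in this boot's journal.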
> In any case - glad you managed to resolve the issue!
>> root@pve01-hrz-zm:~# ls -al /rpool/data/subvol-104-disk-0/
>> total 10
>> drwxr-xr-x 4 root root 4 Srp  4 14:07 .
>> drwxr-xr-x 9 root root 9 Srp  4 23:17 ..
>> drwxr-xr-x 2 root root 2 Srp  4 14:07 dev
>> drwxr-xr-x 3 root root 3 Srp  4 14:07 mnt
>> root@pve01-hrz-zm:~# zfs list
>> NAME                           USED  AVAIL  REFER  MOUNTPOINT
>> rpool                         2,15T  1,36T   104K  /rpool
>> rpool/data                    2,15T  1,36T   128K  /rpool/data
>> rpool/data/subvol-104-disk-0   751M  15,3G   751M
>> rpool/data/subvol-104-disk-1  2,15T   894G  2,15T
>> Interestingly, both LXC containers on this node had an "empty"
>> disk-0 (the other one was not that big and had only disk-0), and
>> neither of them could start.
>> After many tries I decided to migrate the small container to another
>> node just to see what would happen: the migration was successful, and
>> so was starting it up. OK (true relief, finally :). Then I tried to
>> make a backup of this container just to see what would happen. No,
>> the backup was not successful: the backup archive was only 1.7KB.
>> OK, let's get back to the migration scenario. The final conclusion
>> was that migration itself was not the solution; the snapshot was.
>> The snapshot was the step that revived this disk-0.
>> So, in the end I just made a snapshot of 104-disk-0, cloned it
>> right back to 1044-disk-0 and then changed the reference in the
>> LXC configuration. After that the container started successfully.
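
The snapshot-and-clone steps described above can be sketched roughly as follows. This is not the exact command history from the node: the function recovery_plan and the snapshot name "rescue" are made up for illustration, and the dataset names are derived from the zfs list output in this thread. The function only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) the snapshot-and-clone steps
# for a container root dataset.
# Arguments: parent dataset, container ID, ID used for the clone,
#            optional snapshot name (default "rescue" is an assumption).
recovery_plan() {
    parent=$1
    ctid=$2
    newid=$3
    snap=${4:-rescue}
    src="$parent/subvol-$ctid-disk-0"
    dst="$parent/subvol-$newid-disk-0"
    echo "zfs snapshot $src@$snap"
    echo "zfs clone $src@$snap $dst"
    echo "# then point rootfs in /etc/pve/lxc/$ctid.conf at subvol-$newid-disk-0"
}

# Example: recovery_plan rpool/data 104 1044
```

Keep in mind that a clone depends on its origin snapshot, so the original subvol-104-disk-0 cannot be destroyed while the clone exists unless the clone is promoted first (zfs promote).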
>> I really wonder why this happened, but I am also very happy that the
>> simple steps above saved my day.
>> Hopefully this information helps somebody who runs into the same
>> problem, but at the same time I truly hope that it won't happen :)
>> Tonci Stipicevic