[PVE-User] ZFS-8000-8A on a non-system disk. How to do?

Jan Vlach janus at volny.cz
Sun Nov 12 20:47:01 CET 2023


Hi,

Having the same number of checksum errors on all drives almost always points to bad cabling or bad RAM, not to the drives themselves.
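
A quick way to check for that pattern is to compare the CKSUM column across the leaf devices. This is only a rough sketch: the function name same_cksum is made up for illustration, and it assumes the leaf devices are listed as ata-* names, as in the pool quoted below.

```shell
#!/bin/sh
# Reads `zpool status` output on stdin and reports whether every leaf
# device shows the same non-zero CKSUM count.
# Assumption: leaf devices appear as ata-* names (true for the pool in
# this thread, not for every pool).
same_cksum() {
    awk '
        # Device rows have the columns: NAME STATE READ WRITE CKSUM
        $1 ~ /^ata-/ && NF >= 5 { n++; seen[$5]++; last = $5 }
        END {
            if (n > 1 && seen[last] == n && last + 0 > 0) {
                printf "same CKSUM count (%s) on all %d devices\n", last, n
            } else {
                print "CKSUM counts differ or are zero"
            }
        }
    '
}
```

On a live host it would be used as: zpool status rpool-hdd | same_cksum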

- If you have ECC RAM, check the IPMI system event log for errors, e.g.:
ipmitool sel elist
- You may see something in dmesg too.
- If you don't have ECC RAM, get memtest (UEFI mode) from https://www.memtest86.com/, take the host offline and let it run for a day or two.
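
As a rough sketch of the dmesg check: the helper below filters kernel log text for lines that often accompany RAM or link trouble. The function name hw_error_lines and the pattern list are my own assumptions, not an exhaustive or official set, so treat an empty result as "nothing obvious", not as a clean bill of health.

```shell
#!/bin/sh
# Filter kernel log text on stdin for lines that often accompany memory
# or SATA/SAS link problems.
# Assumption: the pattern list is illustrative only; adjust it to your
# hardware and kernel.
hw_error_lines() {
    # `|| true` keeps the exit status 0 when nothing matches
    grep -i -E 'edac|machine check|mce:|hard resetting link|i/o error' || true
}

# Usage on a live host (needs root; ipmitool only helps where a BMC exists):
#   ipmitool sel elist
#   dmesg | hw_error_lines
```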

- I've seen this on a Supermicro server where the cable for the last two slots out of 10 was bent and touching the case lid. Those two slots kept resetting the bus, which showed up as increasing errors on all drives. Each scrub just changed which files and metadata were affected, so I no longer trusted the host or the consistency of its data. I restored everything from a good backup to a different host and only then debugged.

- If at this point you want to back up and restore but you don't have backups, it's game over for you.

JV

> On 12. 11. 2023, at 19:32, Stefan <proxmox at qwertz1.com> wrote:
> 
> I assume you have already ruled out flaky hardware (bad cable, RAM)? If so, repairing is not possible. In theory you can avoid the backup/destroy/restore route, but why?
> You have three faulty drives that need to be replaced anyway. Replacing them plus identifying the failed file(s) takes much longer than just copying back from backup.
> 
> 
> 
> On 12 November 2023 16:21:17 CET, Marco Gaiarin <gaio at lilliput.linux.it> wrote:
>> 
>> I've got:
>> 
>> root@lisei:~# zpool status -xv
>> pool: rpool-hdd
>> state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>> 	corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>> 	entire pool from backup.
>>  see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
>> scan: scrub repaired 0B in 00:37:50 with 1 errors on Sun Nov 12 01:01:53 2023
>> config:
>> 
>> 	NAME                                            STATE     READ WRITE CKSUM
>> 	rpool-hdd                                       ONLINE       0     0     0
>> 	  raidz1-0                                      ONLINE       0     0     0
>> 	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D6J2LN  ONLINE       0     0     2
>> 	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D7Z60F  ONLINE       0     0     2
>> 	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D2JSHZ  ONLINE       0     0     2
>> 
>> errors: Permanent errors have been detected in the following files:
>> 
>>       rpool-hdd/vm-401-disk-0:<0x1>
>> 
>> The disk belongs to a VM used purely as a repository for rsnapshot backups, so it
>> contains many copies of the same files, with varied and abundant retention.
>> It is an add-on disk for the VM, i.e. I can safely unmount it if needed, or even
>> detach it.
>> 
>> 
>> Is there something I can do to repair the volume, possibly online? Do I really
>> have to back it up, destroy it and restore from backup?!
>> 
>> 
>> Thanks.
>> 
>> -- 
>> Anyone who does not vote YES in the referendum cannot feel worthy of being Italian
>> 				(Silvio Berlusconi, 21 June 2006)
>> 
>> 
>> 
>> _______________________________________________
>> pve-user mailing list
>> pve-user at lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>> 


