[PVE-User] Severe disk corruption: PBS, SATA

Wolf Noble wolf at wolfspyre.com
Thu May 19 06:07:05 CEST 2022


from over here in the cheap seats, another potential strangeness injector:

zfs + any sort of raid controller which plays the abstraction game between raw disk and the OS can cause any number of weird and painful scenarios.

ZFS believes it has an accurate idea of the underlying disks.

it does its voodoo wholly believing that it’s solely responsible for data durability.

with a raid controller in between playing the shell game with IO, things USUALLY work… RIGHT UNTIL THEY DON’T.

i’m sure you’re well aware of this, and have probably already mitigated this concern with a JBOD controller or something else that isn’t preventing the OS (and thus ZFS) from talking directly to the disks… but it felt worth pointing out on the off chance it got overlooked.
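a quick way to sanity-check that is to look at the MODEL column the kernel reports for each disk; a controller-backed virtual volume usually shows the controller's name instead of a drive model. rough sketch below — the vendor strings matched are just examples, not an exhaustive list:

```shell
#!/bin/sh
# Sketch: flag block devices that look like RAID-controller virtual volumes
# rather than raw drives, based on the MODEL column of `lsblk -d -o NAME,MODEL`.
# The vendor strings here are illustrative guesses, not a complete list.
looks_virtual() {
  case "$1" in
    *PERC*|*MegaRAID*|*"LOGICAL VOLUME"*|*Virtual*) echo "virtual" ;;
    *) echo "raw" ;;
  esac
}

# Real use would feed it from:  lsblk -d -n -o MODEL
looks_virtual "DELL PERC H330"    # -> virtual
looks_virtual "ST4000NM0033-9ZM"  # -> raw
```

if a disk comes up "virtual", ZFS is not seeing the real drive and all bets on its durability guarantees are off.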

hope you are well and the gremlins are promptly discovered and put back into their comfortable chairs so they can resume their harmless heckling.


🐺W


[= The contents of this message have been written, read, processed, erased, sorted, sniffed, compressed, rewritten, misspelled, overcompensated, lost, found, and most importantly delivered entirely with recycled electrons =]

> On May 18, 2022, at 11:21, nada <nada at verdnatura.es> wrote:
> 
> hi Marco
> you used some local ZFS filesystem according to your info, so you may try
> 
> zfs list
> zpool list -v
> zpool history
> zpool import ...
> zpool replace ...
> 
> all the best
> Nada
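fwiw, a scripted health check along the lines of those commands could look like this (pool names and statuses below are made-up sample data; `zpool list -H -o name,health` gives tab-separated, header-less output that's easy to parse):

```shell
#!/bin/sh
# Sketch: report any pool whose HEALTH is not ONLINE.
# Real use:  zpool list -H -o name,health | while ...
# The sample here stands in for real output; pool names are invented.
sample='rpool	ONLINE
tank	DEGRADED'

printf '%s\n' "$sample" | while IFS='	' read -r pool health; do
  if [ "$health" != "ONLINE" ]; then
    echo "pool $pool needs attention: $health"
  fi
done
```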
> 
>> On 2022-05-18 10:04, Marco Gaiarin wrote:
>> We are experiencing some very severe disk corruption on one of our
>> installations, which is admittedly a bit 'niche', but...
>> PVE 6.4 host on a Dell PowerEdge T340:
>>    root at sdpve1:~# uname -a
>>    Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux
>> Debian squeeze i386 on the guest:
>>    sdinny:~# uname -a
>>    Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux
>> boot disk defined as:
>>    sata0: local-zfs:vm-120-disk-0,discard=on,size=100G
>> After enabling PBS, every time the backup of the VM starts:
>> root at sdpve1:~# grep vzdump /var/log/syslog.1
>> May 17 20:27:17 sdpve1 pvedaemon[24825]: <root at pam> starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam:
>> May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot
>> May 17 20:36:50 sdpve1 pvedaemon[24825]: <root at pam> end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: OK
>> May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys at admin --quiet 1 --mailnotification failure --storage pbs-BP)
>> May 17 22:00:02 sdpve1 vzdump[1738]: <root at pam> starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam:
>> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys at admin
>> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu)
>> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50)
>> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu)
>> May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17)
>> May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu)
>> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52)
>> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully
>> May 17 23:31:02 sdpve1 vzdump[1738]: <root at pam> end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: OK
>> The VM then showed massive and severe IO trouble:
>> May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
>> May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.000749]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY }
>> May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.002175]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY }
>> May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.003559]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY }
>> May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.004894]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY }
>> [...]
>> May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link
>> May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>> May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100
>> May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device reported invalid CHS sector 0
>> May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete
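(for triage, counting the event types in the guest's syslog shows at a glance how widespread the timeouts were — a rough sketch, with an inline sample standing in for the real /var/log/syslog:)

```shell
#!/bin/sh
# Sketch: count ATA error events in a syslog extract. In real use, point
# `log` at the guest's /var/log/syslog; here an inline sample stands in.
log=$(mktemp)
cat > "$log" <<'EOF'
ata3.00: failed command: WRITE FPDMA QUEUED
ata3.00: failed command: WRITE FPDMA QUEUED
ata3: hard resetting link
ata3.00: device reported invalid CHS sector 0
EOF

for pat in 'failed command' 'hard resetting link' 'invalid CHS'; do
  printf '%-22s %s\n' "$pat" "$(grep -c "$pat" "$log")"
done
rm -f "$log"
```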
>> VM is still 'alive', and works.
>> But we were forced to reboot (power outage), and after that all the
>> partitions of the disk disappeared; we had to restore them with
>> tools like 'testdisk'.
>> The partitions in the backups had likewise disappeared.
>> Note that there is also a 'plain' local backup that runs on Sunday; that
>> backup task does not seem to cause trouble (though its copy also has the
>> partitions missing, since it was taken after an I/O error).
>> Have we hit a kernel/QEMU bug?
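one thing worth trying here (my assumption, not something confirmed in this thread): the timeouts are all on the emulated SATA/NCQ path, so moving the disk from sata to virtio-blk sidesteps that emulation entirely. a 2.6.32 guest kernel already ships the virtio_blk driver. the hypothetical change to /etc/pve/qemu-server/120.conf would look like:

```
# Hypothetical change: replace the emulated SATA bus with virtio-blk,
# avoiding the emulated AHCI/NCQ path where the timeouts occur.
# Old: sata0: local-zfs:vm-120-disk-0,discard=on,size=100G
virtio0: local-zfs:vm-120-disk-0,discard=on,size=100G
# (boot order and device names in the guest's /etc/fstab may need updating)
```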
> 
> _______________________________________________
> pve-user mailing list
> pve-user at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
> 


