[PVE-User] Severe disk corruption: PBS, SATA

Eneko Lacunza elacunza at binovo.es
Wed May 18 10:53:04 CEST 2022


Hi Marco,

I would try changing that sata0 disk to virtio-blk (maybe in a clone VM 
first). I think Squeeze's 2.6.32 kernel supports virtio-blk; then try the 
PBS backup again.
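
Something along these lines should do it (untested here; VMID 999 and the 
exact volume name are just examples, check with 'qm config' first):

	# 1. Full clone to a test VM first (999 is just an example VMID)
	qm clone 120 999 --name sdinny-test --full

	# 2. Detach the SATA disk; it shows up as 'unused0' in the VM config
	qm set 999 --delete sata0

	# 3. Re-attach the same volume as virtio-blk
	#    (a full clone gets its own volume, e.g. vm-999-disk-0)
	qm set 999 --virtio0 local-zfs:vm-999-disk-0,discard=on

	# 4. Keep it as the boot disk
	qm set 999 --boot c --bootdisk virtio0

Note that inside the guest the disk will move from /dev/sda to /dev/vda, so 
if /etc/fstab or grub reference the device by name rather than UUID/label, 
they will need updating before reboot.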

El 18/5/22 a las 10:04, Marco Gaiarin escribió:
> We are seeing some very severe disk corruption on one of our
> installations, which is admittedly a bit 'niche', but...
>
> PVE 6.4 host on a Dell PowerEdge T340:
> 	root at sdpve1:~# uname -a
> 	Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux
>
> Debian Squeeze i386 in the guest:
> 	sdinny:~# uname -a
> 	Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux
>
> boot disk defined as:
> 	sata0: local-zfs:vm-120-disk-0,discard=on,size=100G
>
>
> After enabling PBS, every time the backup of the VM starts:
>
>   root at sdpve1:~# grep vzdump /var/log/syslog.1
>   May 17 20:27:17 sdpve1 pvedaemon[24825]: <root at pam> starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam:
>   May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot
>   May 17 20:36:50 sdpve1 pvedaemon[24825]: <root at pam> end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root at pam: OK
>   May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys at admin --quiet 1 --mailnotification failure --storage pbs-BP)
>   May 17 22:00:02 sdpve1 vzdump[1738]: <root at pam> starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam:
>   May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys at admin
>   May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu)
>   May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50)
>   May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu)
>   May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17)
>   May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu)
>   May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52)
>   May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully
>   May 17 23:31:02 sdpve1 vzdump[1738]: <root at pam> end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root at pam: OK
>
> The VM showed some massive and severe I/O trouble:
>
>   May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
>   May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED
>   May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
>   May 17 22:40:48 sdinny kernel: [124793.000749]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>   May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY }
>   May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED
>   May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
>   May 17 22:40:48 sdinny kernel: [124793.002175]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>   May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY }
>   May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED
>   May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
>   May 17 22:40:48 sdinny kernel: [124793.003559]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>   May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY }
>   May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED
>   May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
>   May 17 22:40:48 sdinny kernel: [124793.004894]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>   May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY }
>   [...]
>   May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link
>   May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>   May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100
>   May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device reported invalid CHS sector 0
>   May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete
>
> The VM is still 'alive' and works.
> But we were forced to reboot (power outage), and after that all the
> partitions of the disk disappeared; we had to restore them with
> tools like 'testdisk'.
> The partitions in the backups were the same: disappeared.
>
>
> Note that there's also a 'plain' local backup that runs on Sunday, and this
> backup task does not seem to cause trouble (but the partitions still seem
> to have disappeared there too, as it was taken after an I/O error).
>
>
> Have we hit a kernel/QEMU bug?
>
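
Also, you can inspect the partition table of that disk from the host 
without touching the guest; assuming local-zfs maps to the usual 
rpool/data pool (check with 'zfs list'), something like:

	# Read-only look at the zvol's partition table from the PVE host
	fdisk -l /dev/zvol/rpool/data/vm-120-disk-0

	# Dump the partition table to a file, so it can be restored with
	# 'sfdisk' if it disappears again
	sfdisk -d /dev/zvol/rpool/data/vm-120-disk-0 > vm-120-parts.txt

That might help pin down whether the partition table goes away during the 
backup itself or only at reboot.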

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/


