[PVE-User] HDD errors in VMs
Michael Pöllinger
m.poellinger at wds-tech.de
Mon Jan 4 19:53:22 CET 2016
Hi Emmanuel.
Wow, these are good tips we can check for. Thank you!
What we started with is my thread from December.
[So Dez 27 05:17:44 2015] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[So Dez 27 05:17:44 2015] ata1.00: failed command: WRITE DMA
[So Dez 27 05:17:44 2015] ata1.00: cmd ca/00:80:b8:4e:ce/00:00:00:00:00/eb tag 0 dma 65536 out
[So Dez 27 05:17:44 2015]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[So Dez 27 05:17:44 2015] ata1.00: status: { DRDY }
[So Dez 27 05:17:44 2015] ata1: soft resetting link
[So Dez 27 05:17:45 2015] ata1.01: NODEV after polling detection
[So Dez 27 05:17:45 2015] ata1.00: configured for MWDMA2
[So Dez 27 05:17:45 2015] ata1.00: device reported invalid CHS sector 0
[So Dez 27 05:17:45 2015] ata1: EH complete
OR
kernel: [309438.824333] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: [309438.825198] ata1.00: failed command: FLUSH CACHE
kernel: [309438.825921] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
kernel: [309438.825921]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
kernel: [309438.827996] ata1.00: status: { DRDY }
kernel: [309443.868140] ata1: link is slow to respond, please be patient (ready=0)
kernel: [309448.852147] ata1: device not ready (errno=-16), forcing hardreset
kernel: [309448.852175] ata1: soft resetting link
kernel: [309449.009123] ata1.00: configured for MWDMA2
kernel: [309449.009129] ata1.00: retrying FLUSH 0xe7 Emask 0x4
kernel: [309449.009532] ata1.00: device reported invalid CHS sector 0
kernel: [309449.009545] ata1: EH complete
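In case someone wants to watch for these resets live inside a guest, standard tools suffice; a minimal sketch (the ata1 pattern just matches the port from the logs above, adjust as needed):

    dmesg -T | grep 'ata1'             # human-readable timestamps
    journalctl -k -f | grep -i ata1    # follow the kernel log live (systemd guests)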
The problem started with VMs simply stopping with those messages in the kernel log inside the VM.
For half a year everything worked fine, and then it started ;)
There were several VMs on these hosts, some with older kernels, some with newer ones (e.g. Debian 8.2 with kernel 3.16.x).
BUT only the new VMs with the newer kernels stopped working (a few days after the last update of the pve-kernel 2.x).
The crashing VMs are the smaller ones; the big ones with old kernels just run and run and run.
So after Dmitry's response we switched the disk bus from IDE to the default virtio.
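For the archives: the bus switch is just the disk line in the VM config (VM ID and storage name below are made up; shut the VM down first, and the guest needs virtio drivers, which Debian 8 ships):

    # /etc/pve/qemu-server/100.conf -- before
    bootdisk: ide0
    ide0: local:100/vm-100-disk-1.qcow2,size=32G

    # after
    bootdisk: virtio0
    virtio0: local:100/vm-100-disk-1.qcow2,size=32G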
A short time later we got "task blocked for more than 120 seconds" (hung_task_timeout_secs) problems.
BUT again ONLY in the Debian 8.2 VMs, sporadically, and not during backup or cronjob times (daily, weekly, etc.).
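For reference, the 120 seconds come from the kernel's hung-task watchdog; it can be inspected or tuned like this (this only changes when the warning fires, it does not fix the underlying stall):

    sysctl kernel.hung_task_timeout_secs          # default is 120
    sysctl -w kernel.hung_task_timeout_secs=300   # raise temporarily while testing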
So to check your points:
1. heavy IO
IO activity is near zero on those VMs; there are only big peaks during backup times, and during the backups the VMs run just fine.
The problems occur on an empty host with only one VM and a RAID 5 of 7200 rpm SAS drives, and also on another node with a RAID 1 of 7200 rpm SAS drives.
IO wait on the busy node peaks at about 1-2 % as logged in the Proxmox GUI (see the iostat sketch after this list).
So I don't think heavy IO is the problem.
2. RAM
The VMs have only between 4 and 8 GB of RAM (the host only 32 GB), so this should be easily handled by the RAID controller.
3. berserker
This was also my first idea, but the problem occurs across different VMs.
Yes, all VMs run Debian 8.2 and Plesk 12.5, but with different sites.
So there would have to be some identical problem in Debian or Plesk.
All Plesk servers are freshly installed and run at most 10 to 20 domains, only small sites.
Plesk 12.5 ships a reverse caching proxy (nginx) by default.
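The iostat sketch mentioned under point 1, for anyone who wants to double-check the near-zero IO on their own nodes (iostat comes from Debian's sysstat package):

    iostat -x 5    # per-device %util and await, refreshed every 5 s
    vmstat 5       # the 'wa' column shows overall IO wait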
----
So after reading a lot of threads in the Proxmox forum and on many, many other sites, the discussion largely turned to kernel problems (see e.g.
https://forum.proxmox.com/threads/linux-guest-problems-on-new-haswell-ep-processors.20372/page-4#post-124663)
Yesterday we migrated the first machine to the new Proxmox 4 with the 4.x kernel and will now see how long it stays up without errors.
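A couple of obvious checks to verify the migration took (a sketch, nothing more):

    uname -r          # on the node: should now report a 4.x kernel
    pveversion -v     # Proxmox package versions
    ls /dev/vd*       # inside the guest: virtio disks appear as vda, vdb, ...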
And again, a big big thank you for your ideas!
kind regards
Michael