[PVE-User] HDD errors in VMs
Michael Pöllinger
m.poellinger at wds-tech.de
Mon Jan 4 19:53:22 CET 2016
Hi Emmanuel.
Wow, these are good tips we can check for. Thank you!
What we started with is my thread from December.
[So Dez 27 05:17:44 2015] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[So Dez 27 05:17:44 2015] ata1.00: failed command: WRITE DMA
[So Dez 27 05:17:44 2015] ata1.00: cmd ca/00:80:b8:4e:ce/00:00:00:00:00/eb tag 0 dma 65536 out
[So Dez 27 05:17:44 2015]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[So Dez 27 05:17:44 2015] ata1.00: status: { DRDY }
[So Dez 27 05:17:44 2015] ata1: soft resetting link
[So Dez 27 05:17:45 2015] ata1.01: NODEV after polling detection
[So Dez 27 05:17:45 2015] ata1.00: configured for MWDMA2
[So Dez 27 05:17:45 2015] ata1.00: device reported invalid CHS sector 0
[So Dez 27 05:17:45 2015] ata1: EH complete
OR
kernel: [309438.824333] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: [309438.825198] ata1.00: failed command: FLUSH CACHE
kernel: [309438.825921] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
kernel: [309438.825921]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
kernel: [309438.827996] ata1.00: status: { DRDY }
kernel: [309443.868140] ata1: link is slow to respond, please be patient (ready=0)
kernel: [309448.852147] ata1: device not ready (errno=-16), forcing hardreset
kernel: [309448.852175] ata1: soft resetting link
kernel: [309449.009123] ata1.00: configured for MWDMA2
kernel: [309449.009129] ata1.00: retrying FLUSH 0xe7 Emask 0x4
kernel: [309449.009532] ata1.00: device reported invalid CHS sector 0
kernel: [309449.009545] ata1: EH complete
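In case someone wants to watch for these resets live inside a guest, standard tools suffice; a minimal sketch (the ata1 pattern just matches the port from the logs above, adjust as needed):

    dmesg -T | grep 'ata1'             # human-readable timestamps
    journalctl -k -f | grep -i ata1    # follow the kernel log live (systemd guests)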
The problem started with VMs simply stopping with those messages in the kernel log inside the VM.
For half a year everything worked fine, and then it started ;)
There were several VMs on these hosts, some with older kernels, some with newer ones (e.g. Debian 8.2 with kernel 3.16.x).
BUT only the new VMs with the newer kernels stopped working (a few days after the last update of the pve-kernel 2.x).
The crashing VMs are the smaller ones; the big ones with old kernels just run and run and run.
So after Dmitry's response we switched the disk bus from IDE to the default virtio.
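For the archives: the bus switch is just the disk line in the VM config (VM ID and storage name below are made up; shut the VM down first, and the guest needs virtio drivers, which Debian 8 ships):

    # /etc/pve/qemu-server/100.conf -- before
    bootdisk: ide0
    ide0: local:100/vm-100-disk-1.qcow2,size=32G

    # after
    bootdisk: virtio0
    virtio0: local:100/vm-100-disk-1.qcow2,size=32G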
A short time later we got "task blocked for more than 120 seconds" (hung_task_timeout_secs) problems.
BUT again ONLY in the Debian 8.2 VMs, sporadically, and not during backup or cronjob times (daily, weekly, etc.).
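For reference, the 120 seconds come from the kernel's hung-task watchdog; it can be inspected or tuned like this (this only changes when the warning fires, it does not fix the underlying stall):

    sysctl kernel.hung_task_timeout_secs          # default is 120
    sysctl -w kernel.hung_task_timeout_secs=300   # raise temporarily while testing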
So to check your points:
1. heavy IO
IO activity is near zero on those VMs; there are only big peaks during backup times, and during the backups the VMs run just fine.
The problems occur on an empty host with only one VM and a RAID 5 of 7200 rpm SAS drives, and also on another node with a RAID 1 of 7200 rpm SAS drives.
IO wait on the busy node peaks at about 1-2 % as logged in the Proxmox GUI (see the iostat sketch after this list).
So I don't think heavy IO is the problem.
2. RAM
The VMs have only between 4 and 8 GB of RAM (the host only 32 GB), so this should be easily handled by the RAID controller.
3. berserker
This was also my first idea, but the problem occurs across different VMs.
Yes, all VMs run Debian 8.2 and Plesk 12.5, but with different sites.
So there would have to be some identical problem in Debian or Plesk.
All Plesk servers are freshly installed and run at most 10 to 20 domains, only small sites.
Plesk 12.5 ships a reverse caching proxy (nginx) by default.
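The iostat sketch mentioned under point 1, for anyone who wants to double-check the near-zero IO on their own nodes (iostat comes from Debian's sysstat package):

    iostat -x 5    # per-device %util and await, refreshed every 5 s
    vmstat 5       # the 'wa' column shows overall IO wait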
----
So after reading a lot of threads in the Proxmox forum and on many, many other sites, the discussion largely turned to kernel problems (see e.g.
https://forum.proxmox.com/threads/linux-guest-problems-on-new-haswell-ep-processors.20372/page-4#post-124663)
Yesterday we migrated the first machine to the new Proxmox 4 with the 4.x kernel and will now see how long it stays up without errors.
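A couple of obvious checks to verify the migration took (a sketch, nothing more):

    uname -r          # on the node: should now report a 4.x kernel
    pveversion -v     # Proxmox package versions
    ls /dev/vd*       # inside the guest: virtio disks appear as vda, vdb, ...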
And again, a big big thank you for your ideas!
kind regards
Michael