[pve-devel] transparent huge pages support / disk passthrough corruption

Alexandre DERUMIER aderumier at odiso.com
Thu Jan 19 09:43:53 CET 2017


Hi,

I have re-enabled THP (transparent_hugepage=madvise) for around a year now (with pve-kernel 4.2-4.4), and I no longer see the problems we had in the past.

I'm hosting a lot of databases (mysql, sqlserver, redis, mongo, ...) and I haven't seen any performance impact since re-enabling THP.

So I think it's pretty safe to set it by default.
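For reference, the currently active mode can be read back from sysfs; the active value is the bracketed word. A small sketch (the sysfs path is the standard one; the helper name is my own, just for illustration):

```shell
# The active THP mode is the bracketed word in the sysfs file, e.g.
#   cat /sys/kernel/mm/transparent_hugepage/enabled
#   -> always [madvise] never

# Hypothetical helper to pull the bracketed mode out of such a line:
parse_thp_mode() {
    printf '%s\n' "$1" | grep -o '\[[a-z]*\]' | tr -d '[]'
}

parse_thp_mode 'always [madvise] never'   # prints: madvise
```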




----- Original message -----
From: "Fabian Grünbichler" <f.gruenbichler at proxmox.com>
To: "pve-devel" <pve-devel at pve.proxmox.com>
Cc: "aderumier" <aderumier at odiso.com>, "Andreas Steinel" <a.steinel at gmail.com>
Sent: Thursday, 19 January 2017 09:35:43
Subject: transparent huge pages support / disk passthrough corruption

So it seems like the recently reported problems[1] with disk passthrough 
using virtio-scsi(-single) are caused by a combination of Qemu 
since 2.7 not handling memory fragmentation (well) and our compiled-in 
default of disabling transparent huge pages on the kernel side. 

While I will investigate further and see whether this is not fixable on 
the Qemu side as well, I think it would be a good idea to revisit the 
decision to patch this default in[2]. 

@Andreas, Alexandre: you both were proponents of disabling THP support 
back then, but the current kernel docs[3] say (emphasis mine): 

-----%<----- 
Transparent Hugepage Support can be entirely disabled (*mostly for 
debugging purposes*) or only enabled inside MADV_HUGEPAGE regions (to 
avoid the risk of consuming more memory resources) or enabled system 
wide. This can be achieved with one of: 

echo always >/sys/kernel/mm/transparent_hugepage/enabled 
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled 
echo never >/sys/kernel/mm/transparent_hugepage/enabled 

It's also possible to limit defrag efforts in the VM to generate 
hugepages in case they're not immediately free to madvise regions or 
to never try to defrag memory and simply fallback to regular pages 
unless hugepages are immediately available. Clearly if we spend CPU 
time to defrag memory, we would expect to gain even more by the fact 
we use hugepages later instead of regular pages. This isn't always 
guaranteed, but it may be more likely in case the allocation is for a 
MADV_HUGEPAGE region. 

echo always >/sys/kernel/mm/transparent_hugepage/defrag 
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag 
echo never >/sys/kernel/mm/transparent_hugepage/defrag 
----->%----- 

so I think setting both enabled and defrag to "madvise" by default would 
be advisable - the admin can override it (permanently with a kernel boot 
parameter, or at run time with the sysfs interface) anyway if they 
really know it causes performance issues. 
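For what it's worth, the override would look roughly like this (a sketch assuming the standard Debian/GRUB and sysfs locations; adjust for your setup):

```shell
# Permanent override: append the parameter to the kernel command line in
# /etc/default/grub and regenerate the boot config, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet transparent_hugepage=never"
#   update-grub

# Run-time override via sysfs (needs root, lost on reboot):
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```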

if you have any hard benchmark data to back up staying at "never", 
please send it soon ;) preferably both with a non-transparent hugepages 
setup and without, and with both "always" and "madvise" for enabled and 
defrag. 

running a setup that is intended for debugging purposes (see above) as 
default seems strange to me (and this was probably the reason why we 
needed to patch in "never" as the default in the first place). while I am 
not yet convinced that this solves the passthrough data corruption issue 
entirely, it is very reliably reproducible with THP disabled, and not at 
all so far on my test setup with THP enabled - so I propose switching 
with the next kernel update, unless there are (serious) objections. 

1: https://forum.proxmox.com/threads/proxmox-4-4-virtio_scsi-regression.31471/ 
2: http://pve.proxmox.com/pipermail/pve-devel/2015-September/017079.html 
3: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/vm/transhuge.txt?h=linux-4.4.y#n95 
