[pve-devel] transparent huge pages support / disk passthrough corruption
Fabian Grünbichler
f.gruenbichler at proxmox.com
Thu Jan 19 09:35:43 CET 2017
So it seems like the recently reported problems[1] with disk
passthrough using virtio-scsi(-single) are caused by a combination of Qemu
since 2.7 not handling memory fragmentation (well) and our compiled-in
default of disabling transparent huge pages on the kernel side.
While I will investigate further and see whether this is not fixable on
the Qemu side as well, I think it would be a good idea to revisit the
decision to patch this default in[2].
@Andreas, Alexandre: you both were proponents of disabling THP support
back then, but the current kernel docs[3] say (emphasis mine):
-----%<-----
Transparent Hugepage Support can be entirely disabled (*mostly for
debugging purposes*) or only enabled inside MADV_HUGEPAGE regions (to
avoid the risk of consuming more memory resources) or enabled system
wide. This can be achieved with one of:
echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled
It's also possible to limit defrag efforts in the VM to generate
hugepages in case they're not immediately free to madvise regions or
to never try to defrag memory and simply fallback to regular pages
unless hugepages are immediately available. Clearly if we spend CPU
time to defrag memory, we would expect to gain even more by the fact
we use hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.
echo always >/sys/kernel/mm/transparent_hugepage/defrag
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
echo never >/sys/kernel/mm/transparent_hugepage/defrag
----->%-----
so I think setting both enabled and defrag to "madvise" by default would
be advisable - the admin can override it (permanently with a kernel boot
parameter, or at run time with the sysfs interface) anyway if they
really know it causes performance issues.
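
A minimal sketch of the run-time override via sysfs, for illustration only.
The helper names thp_get and thp_set and the THP_DIR variable are made up
here and are not existing tools. The only assumption about the kernel is
that it lists all choices in each knob and brackets the active one (for
example "always madvise [never]").

```shell
#!/bin/sh
# Directory holding the THP knobs. Overridable (hypothetical variable) so the
# helpers can be tried against a scratch directory instead of live sysfs.
THP_DIR="${THP_DIR:-/sys/kernel/mm/transparent_hugepage}"

# Print the active value of a knob ("enabled" or "defrag").
# The active choice is the one in brackets, e.g. "always madvise [never]".
thp_get() {
    sed -n 's/.*\[\([a-z+]*\)\].*/\1/p' "$THP_DIR/$1"
}

# Write a new value to a knob. On a real system this needs root and
# does not survive a reboot.
thp_set() {
    echo "$2" > "$THP_DIR/$1"
}

# Example (as root), switching both knobs to the proposed default:
#   thp_set enabled madvise
#   thp_set defrag madvise
#   thp_get enabled
```

For the permanent variant, the kernel command line accepts
transparent_hugepage=madvise (also "always" or "never"). As far as I know
this only covers "enabled", so "defrag" would still need a sysfs write at
boot.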
if you have any hard benchmark data to back up staying at "never",
please send it soon ;) preferably both with non-transparent hugepages
setup and without, and with both "always" and "madvise" for enabled and
defrag.
running a setup that is intended for debugging purposes (see above) as
default seems strange to me (and this was probably the reason why we
needed to patch in "never" as default in the first place). while I am
not yet convinced that this solves the passthrough data corruption issue
entirely, it is very reliably reproducible with THP disabled, and not at
all so far on my test setup with THP enabled - so I propose switching
with the next kernel update, unless there are (serious) objections.
1: https://forum.proxmox.com/threads/proxmox-4-4-virtio_scsi-regression.31471/
2: http://pve.proxmox.com/pipermail/pve-devel/2015-September/017079.html
3: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/vm/transhuge.txt?h=linux-4.4.y#n95