[pve-devel] transparent huge pages support / disk passthrough corruption

Thu Jan 19 19:26:10 CET 2017

It seems that the current implementation is much better than it was in the
RHEL-based kernel.

On Thu, Jan 19, 2017 at 9:43 AM, Alexandre DERUMIER <aderumier at odiso.com>
wrote:

> Hi,
>
> I re-enabled THP (transparent_hugepage=madvise) about a year ago
> (with pve-kernel 4.2-4.4), and I no longer have the problems I had in
> the past.
>
> I'm hosting a lot of databases (mysql, sqlserver, redis, mongo, ...) and
> I haven't seen any performance impact since I re-enabled THP.
>
> So I think it's pretty safe to set it by default.
>
>
>
>
> ----- Original Message -----
> From: "Fabian Grünbichler" <f.gruenbichler at proxmox.com>
> To: "pve-devel" <pve-devel at pve.proxmox.com>
> Cc: "aderumier" <aderumier at odiso.com>, "Andreas Steinel" <a.steinel at gmail.com>
> Sent: Thursday, 19 January 2017 09:35:43
> Subject: transparent huge pages support / disk passthrough corruption
>
> So it seems like the recently reported problems[1] with disk
> passthrough using virtio-scsi(-single) are caused by a combination of Qemu
> since 2.7 not handling memory fragmentation (well) and our compiled-in
> default of disabling transparent huge pages on the kernel side.
>
> While I will investigate further and see whether this is not fixable on
> the Qemu side as well, I think it would be a good idea to revisit the
> decision to patch this default in[2].
>
> @Andreas, Alexandre: you were both proponents of disabling THP support
> back then, but the current kernel docs[3] say (emphasis mine):
>
> -----%<-----
> Transparent Hugepage Support can be entirely disabled (*mostly for
> debugging purposes*) or only enabled inside MADV_HUGEPAGE regions (to
> avoid the risk of consuming more memory resources) or enabled system
> wide. This can be achieved with one of:
>
> echo always >/sys/kernel/mm/transparent_hugepage/enabled
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo never >/sys/kernel/mm/transparent_hugepage/enabled
>
> It's also possible to limit defrag efforts in the VM to generate
> hugepages in case they're not immediately free to madvise regions or
> to never try to defrag memory and simply fallback to regular pages
> unless hugepages are immediately available. Clearly if we spend CPU
> time to defrag memory, we would expect to gain even more by the fact
> we use hugepages later instead of regular pages. This isn't always
> guaranteed, but it may be more likely in case the allocation is for a
> MADV_HUGEPAGE region.
>
> echo always >/sys/kernel/mm/transparent_hugepage/defrag
> echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
> echo never >/sys/kernel/mm/transparent_hugepage/defrag
> ----->%-----
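
To see which of the values listed above is currently in effect, the sysfs
files can be read back: the kernel prints all choices and marks the active
one with square brackets, e.g. "always [madvise] never". A minimal sketch
of extracting it follows; the function name thp_active_mode is hypothetical
and only used for illustration.

```shell
#!/bin/sh
# Print the active THP mode from a sysfs-style file.
# The active value is the one in square brackets,
# e.g. "always [madvise] never" -> "madvise".
# thp_active_mode is a hypothetical helper name, not an existing tool.
thp_active_mode() {
    # $1: file to read; defaults to the real "enabled" knob
    file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    # Keep only the bracketed word; "+" covers values such as
    # "defer+madvise" that newer kernels offer for defrag.
    sed -n 's/.*\[\([a-z+]*\)\].*/\1/p' "$file"
}
```

Called without an argument it reads the "enabled" knob; passing
/sys/kernel/mm/transparent_hugepage/defrag reports the defrag setting
instead.
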
>
> so I think setting both enabled and defrag to "madvise" by default would
> be advisable - the admin can override it (permanently with a kernel boot
> parameter, or at run time with the sysfs interface) anyway if they
> really know it causes performance issues.
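
The run-time override mentioned above could be scripted roughly as
follows. This is only a sketch: thp_set_mode is a hypothetical helper
name, the optional directory argument exists purely so the function can
be tried against a scratch directory, and writing to the real sysfs
files requires root.

```shell
#!/bin/sh
# Write the same mode to both the "enabled" and "defrag" knobs.
# thp_set_mode is a hypothetical helper name, not an existing tool.
thp_set_mode() {
    # $1: mode to set; $2: knob directory (defaults to the real sysfs path)
    mode="$1"
    dir="${2:-/sys/kernel/mm/transparent_hugepage}"
    # Accept only the three values listed in the kernel docs quoted above.
    case "$mode" in
        always|madvise|never) ;;
        *) echo "unsupported mode: $mode" >&2; return 1 ;;
    esac
    for knob in enabled defrag; do
        echo "$mode" > "$dir/$knob" || return 1
    done
}
```

Running "thp_set_mode madvise" as root would apply the proposed default
until the next reboot; the permanent variant for "enabled" is the
transparent_hugepage=madvise kernel boot parameter.
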
>
> if you have any hard benchmark data to back up staying at "never",
> please send it soon ;) preferably both with and without a
> non-transparent hugepages setup, and with both "always" and "madvise"
> for enabled and defrag.
>
> running a setup that is intended for debugging purposes (see above) as
> default seems strange to me (and this was probably the reason why we
> needed to patch "never" as default in the first place). while I am
> not yet convinced that this solves the passthrough data corruption issue
> entirely, it is very reliably reproducible with THP disabled, and not at
> all so far on my test setup with THP enabled - so I propose switching
> with the next kernel update, unless there are (serious) objections.
>
> 1: https://forum.proxmox.com/threads/proxmox-4-4-virtio_scsi-regression.31471/
> 2: http://pve.proxmox.com/pipermail/pve-devel/2015-September/017079.html
> 3: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/vm/transhuge.txt?h=linux-4.4.y#n95
>
>