[pve-devel] [PATCH kernel] Add MCE patch for Threadripper 3000 series compatibility

Thomas Lamprecht t.lamprecht at proxmox.com
Wed Jan 15 15:22:57 CET 2020


On 1/15/20 2:54 PM, Stefan Reiter wrote:
> A forum user reported that our kernel does not boot on Threadripper 3000
> series CPUs, unless 'mce=off' is provided on the kernel commandline. [0]
> 
> This is a known issue, which has been fixed in mainline kernels and
> backported to 5.4, 4.19 and 4.14 [1]. It is not, however, included in
> 5.3, nor in the Ubuntu builds. [2]
> 
> This patch is the original one posted for 5.5, which is the same as the
> one ported to 5.4. It also applies cleanly to 5.3, and should work the
> same, seeing as the backports to older versions do not have functional
> changes either.
> 
> [0] https://forum.proxmox.com/threads/bug-pve-wont-boot-properly.63432/
> [1] https://patchwork.kernel.org/project/linux-edac/list/?q=Allow+Reserved+types+to+be+overwritten+in+smca_banks
> [2] https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/log/?qt=grep&q=Allow+Reserved+types+to+be+overwritten+in+smca_banks
> 
> Signed-off-by: Stefan Reiter <s.reiter at proxmox.com>
> ---
> 
> Not sure if we usually include fixes like that, but I feel like this could avoid
> a lot of Forum threads once TR 3000 gets more commonplace :)

I'd like to include this! It'd also be great to post it to the ubuntu-kernel
list, and/or maybe even the lkml-stable list for backporting to the next stable
release; they probably want this too.

> 
> 
>  ...w-Reserved-types-to-be-overwritten-i.patch | 88 +++++++++++++++++++
>  1 file changed, 88 insertions(+)
>  create mode 100644 patches/kernel/0006-x86-MCE-AMD-Allow-Reserved-types-to-be-overwritten-i.patch
> 
> diff --git a/patches/kernel/0006-x86-MCE-AMD-Allow-Reserved-types-to-be-overwritten-i.patch b/patches/kernel/0006-x86-MCE-AMD-Allow-Reserved-types-to-be-overwritten-i.patch
> new file mode 100644
> index 0000000..6f49ff6
> --- /dev/null
> +++ b/patches/kernel/0006-x86-MCE-AMD-Allow-Reserved-types-to-be-overwritten-i.patch
> @@ -0,0 +1,88 @@
> +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
> +From: Yazen Ghannam <yazen.ghannam at amd.com>
> +Date: Thu, 21 Nov 2019 08:15:08 -0600
> +Subject: [PATCH] x86/MCE/AMD: Allow Reserved types to be overwritten in
> + smca_banks[]
> +
> +Each logical CPU in Scalable MCA systems controls a unique set of MCA
> +banks in the system. These banks are not shared between CPUs. The bank
> +types and ordering will be the same across CPUs on currently available
> +systems.
> +
> +However, some CPUs may see a bank as Reserved/Read-as-Zero (RAZ) while
> +other CPUs do not. In this case, the bank seen as Reserved on one CPU is
> +assumed to be the same type as the bank seen as a known type on another
> +CPU.
> +
> +In general, this occurs when the hardware represented by the MCA bank
> +is disabled, e.g. disabled memory controllers on certain models, etc.
> +The MCA bank is disabled in the hardware, so there is no possibility of
> +getting an MCA/MCE from it even if it is assumed to have a known type.
> +
> +For example:
> +
> +Full system:
> +	Bank  |  Type seen on CPU0  |  Type seen on CPU1
> +	------------------------------------------------
> +	 0    |         LS          |          LS
> +	 1    |         UMC         |          UMC
> +	 2    |         CS          |          CS
> +
> +System with hardware disabled:
> +	Bank  |  Type seen on CPU0  |  Type seen on CPU1
> +	------------------------------------------------
> +	 0    |         LS          |          LS
> +	 1    |         UMC         |          RAZ
> +	 2    |         CS          |          CS
> +
> +For this reason, there is a single, global struct smca_banks[] that is
> +initialized at boot time. This array is initialized on each CPU as it
> +comes online. However, the array will not be updated if an entry already
> +exists.
> +
> +This works as expected when the first CPU (usually CPU0) has all
> +possible MCA banks enabled. But if the first CPU has a subset, then it
> +will save a "Reserved" type in smca_banks[]. Successive CPUs will then
> +not be able to update smca_banks[] even if they encounter a known bank
> +type.
> +
> +This may result in unexpected behavior. Depending on the system
> +configuration, a user may observe issues enumerating the MCA
> +thresholding sysfs interface. The issues may be as trivial as sysfs
> +entries not being available, or as severe as system hangs.
> +
> +For example:
> +
> +	Bank  |  Type seen on CPU0  |  Type seen on CPU1
> +	------------------------------------------------
> +	 0    |         LS          |          LS
> +	 1    |         RAZ         |          UMC
> +	 2    |         CS          |          CS
> +
> +Extend the smca_banks[] entry check to return if the entry is a
> +non-reserved type. Otherwise, continue so that CPUs that encounter a
> +known bank type can update smca_banks[].
> +
> +Fixes: 68627a697c19 ("x86/mce/AMD, EDAC/mce_amd: Enumerate Reserved SMCA bank type")
> +Signed-off-by: Yazen Ghannam <yazen.ghannam at amd.com>
> +Signed-off-by: Borislav Petkov <bp at suse.de>
> +---
> + arch/x86/kernel/cpu/mce/amd.c | 2 +-
> + 1 file changed, 1 insertion(+), 1 deletion(-)
> +
> +diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> +index 6ea7fdc82f3c..08e09c8c269f 100644
> +--- a/arch/x86/kernel/cpu/mce/amd.c
> ++++ b/arch/x86/kernel/cpu/mce/amd.c
> +@@ -266,7 +266,7 @@ static void smca_configure(unsigned int bank, unsigned int cpu)
> + 	smca_set_misc_banks_map(bank, cpu);
> + 
> + 	/* Return early if this bank was already initialized. */
> +-	if (smca_banks[bank].hwid)
> ++	if (smca_banks[bank].hwid && smca_banks[bank].hwid->hwid_mcatype != 0)
> + 		return;
> + 
> + 	if (rdmsr_safe_on_cpu(cpu, MSR_AMD64_SMCA_MCx_IPID(bank), &low, &high)) {
> +-- 
> +2.20.1
> +
> 