mrb's blog

Linux Kernel Patch to Support ECC Memory on AMD Ryzen 5000 APUs

Keywords: hardware sysadmin

I made a Linux kernel patch to support ECC memory on AMD Ryzen 5000 APUs (codename Cezanne). It applies cleanly and works on kernel versions 5.13 through the latest at the time of writing (5.16-rc2).

Step-by-step patching in Debian 12 (bookworm)

The instructions below for Debian 12 (bookworm) are minimally intrusive. They install only one modified kernel module (amd64_edac.ko) while keeping the kernel image and other modules unmodified.

First, install the dependencies needed to compile the kernel module (like most commands below, this needs root privileges):

$ apt install build-essential bc kmod cpio bison flex gnupg wget linux-headers-amd64 libncurses-dev libelf-dev libssl-dev rsync dwarves

Get the kernel source and extract it:

$ apt install linux-source
$ cd /usr/src/
$ tar xf linux-source-5.15.tar.xz
$ cd linux-source-5.15

Obtain the configuration of the running kernel and disable signing of the modules (otherwise compilation would fail, as we do not possess Debian’s signing key):

$ cp /usr/src/linux-headers-$(uname -r)/.config .
$ sed -r -i -e 's,^CONFIG_SYSTEM_TRUSTED_KEYS=.+,CONFIG_SYSTEM_TRUSTED_KEYS="",g' .config
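
Before starting a long build, it may be worth confirming that the substitution took effect. The following is a minimal sketch: the function name trusted_keys_cleared is hypothetical (not part of any kernel tooling), and it only looks for the exact line the sed command above is expected to leave behind.

```shell
# Hypothetical helper, not part of the kernel build system.
# Succeeds only if the given config file contains the emptied
# CONFIG_SYSTEM_TRUSTED_KEYS line that the sed command should produce.
trusted_keys_cleared() {
    grep -qx 'CONFIG_SYSTEM_TRUSTED_KEYS=""' "$1"
}
```

It could be used from the source directory like this:

$ trusted_keys_cleared .config && echo "trusted keys cleared"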

Download my patch ecc-amd-cezanne.patch and apply it:

$ curl https://blog.zorinaq.com/assets/ecc-amd-cezanne.patch | patch -Np0

Compile:

$ make -j $(nproc) bindeb-pkg

Strip the .BTF section from the module amd64_edac.ko, or else attempting to load it will fail with the error "failed to validate module [amd64_edac] BTF: -22" in dmesg. This seems to be a subtle bug due to a corner case in the BTF validation framework:

$ objcopy --remove-section .BTF debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko
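
To double-check that the section is really gone, the section table printed by readelf -S can be inspected. This is only a sketch: has_btf_section is a hypothetical helper that reads that listing on standard input and assumes the usual layout, in which the section name follows the bracketed index.

```shell
# Hypothetical helper: reads `readelf -S` output on stdin and succeeds
# if a section named exactly .BTF is listed (".BTF.ext" does not match).
has_btf_section() {
    grep -Eq '\][[:space:]]+\.BTF[[:space:]]'
}
```

For example:

$ readelf -S debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko | has_btf_section || echo ".BTF section removed"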

Install the module amd64_edac.ko in the updates directory, which depmod gives higher priority (so our patched module will load instead of the original module):

$ DEST="/lib/modules/$(uname -r)/updates"
$ mkdir -p "$DEST"
$ cp debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko "$DEST"
$ depmod -a
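
One way to check which copy of the module will be picked up is modinfo -n, which prints only the filename it resolves for a module. The helper below is a hypothetical convenience around that: it simply tests whether a path points into an updates directory.

```shell
# Hypothetical helper: succeeds if the given module path lives in an
# "updates" directory, i.e. the copy installed in the step above.
is_from_updates() {
    case "$1" in
        */updates/*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

For example:

$ is_from_updates "$(modinfo -n amd64_edac)" && echo "patched module will be loaded"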

Try loading the module. If it works, edac-util -v should show ECC statistics:

$ modprobe amd64_edac
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
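
For periodic monitoring, the per-line counters can be added up with a few lines of awk. This is a rough sketch that assumes the exact output format shown above; edac_total is a hypothetical name, and its argument is the kind of error to sum (Corrected or Uncorrected).

```shell
# Hypothetical helper: reads `edac-util -v` output on stdin and prints the
# sum of all counters of the given kind ("Corrected" or "Uncorrected").
# The counter is the field right before "<kind> Errors" on each line.
edac_total() {
    awk -v kind="$1" '
        { for (i = 2; i < NF; i++)
              if ($i == kind && $(i + 1) == "Errors") sum += $(i - 1) }
        END { print sum + 0 }
    '
}
```

For example:

$ edac-util -v | edac_total Corrected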

Origin story

I built a small Linux network file server for my home network, based on an ASRock X570M PRO4 motherboard, an AMD Ryzen 7 PRO 5750G APU, and 128 GB of Kingston DDR4-2666 ECC unbuffered memory (KSM26ED8/32ME).

Officially, AMD validates ECC memory on Ryzen PRO APUs only, while non-PRO APUs may or may not support ECC depending on the motherboard. I really wanted ECC on an officially supported hardware stack. However, PRO APUs are distributed to OEMs only, not to retail customers. So I hunted down a PRO APU, which I had to buy from Germany through an eBay reseller.

After all this trouble, to my surprise, ECC did not work out of the box. I could not get ECC statistics:

$ edac-util 
edac-util: Error: No memory controller data found.

This error happens because the kernel module amd64_edac.ko fails to load:

$ modprobe amd64_edac
modprobe: ERROR: could not insert 'amd64_edac': No such device

My kernel is very recent, 5.14.0-4-amd64 from Debian 12 (testing/bookworm), but so are Ryzen 5000 APUs.

So let’s try a bleeding edge kernel? I must admit it had been about 10 years since I last built a Linux kernel, but with the help of Daniel Wayne Armstrong’s excellent concise guide, I dived in and built 5.16-rc2, released by Torvalds 3 days ago.

Alas, no luck. Same “No such device” error.

My BIOS is up-to-date. I reached out to ASRock’s customer support, and a helpful chap even tried booting the same processor as mine, a 5750G, on that motherboard with some ECC memory, and sent me a screenshot proving ECC works, at least on Windows:

ECC works on Windows

I do not have Windows, but if only to confirm ECC can work on my hardware, I figured I could probably run that wmic memphysical get memoryerrorcorrection command from the Windows installation ISO, without installing the OS. When the installer starts, I select “Repair Windows” and open a command prompt. Indeed, the wmic command prints 6, meaning multi-bit ECC is working.

So, what is Linux doing? I locate the module’s source, drivers/edac/amd64_edac.c, add a few printk() debug messages…

Down the rabbit hole, I discover reserve_mc_sibling_devs() fails here:

pvt->F0 = pci_get_related_function(pvt->F3->vendor, pci_id1, pvt->F3);
if (!pvt->F0) {
  edac_dbg(1, "F0 not found, device 0x%x\n", pci_id1);
  return -ENODEV;
}

I convert the edac_dbg() to a printk() statement, which prints:

F0 not found, device 0x1650

0x1650 is a PCI device ID not present on my system. It is very helpful for debugging that I have another AMD machine (EPYC 7232P) with ECC memory, so I can compare the output of lspci and understand which PCI devices the code was looking for. As it turns out, they are always on bus 0 device 0x18:

On EPYC 7232P:

$ lspci -s 0:18 -n
00:18.0 0600: 1022:1490 ← function 0
00:18.1 0600: 1022:1491
00:18.2 0600: 1022:1492
00:18.3 0600: 1022:1493
00:18.4 0600: 1022:1494
00:18.5 0600: 1022:1495
00:18.6 0600: 1022:1496 ← function 6
00:18.7 0600: 1022:1497

On Ryzen 7 PRO 5750G:

$ lspci -s 0:18 -n
00:18.0 0600: 1022:166a ← function 0
00:18.1 0600: 1022:166b
00:18.2 0600: 1022:166c
00:18.3 0600: 1022:166d
00:18.4 0600: 1022:166e
00:18.5 0600: 1022:166f
00:18.6 0600: 1022:1670 ← function 6
00:18.7 0600: 1022:1671

On EPYC 7232P, the code looks for 0x1490 and finds it, but on Ryzen 7 PRO 5750G the code looks for 0x1650 and does not find it.

It is my understanding that the eight device functions above represent the northbridge / memory controller, and reserve_mc_sibling_devs() is looking for functions 0 then 6 which, on my machine, have device IDs 0x166a and 0x1670.
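
On another machine, the two IDs the driver needs can be pulled out of the same lspci listing. The sketch below assumes the three-column lspci -n format shown above (slot, class, vendor:device); df_f0_f6_ids is a hypothetical helper name.

```shell
# Hypothetical helper: reads `lspci -s 0:18 -n` output on stdin and prints
# the device IDs of functions 0 and 6 in the 0x... form used by the driver.
df_f0_f6_ids() {
    awk '
        $1 ~ /\.0$/ { split($3, id, ":"); f0 = id[2] }  # slot ends in .0
        $1 ~ /\.6$/ { split($3, id, ":"); f6 = id[2] }  # slot ends in .6
        END { printf "F0=0x%s F6=0x%s\n", f0, f6 }
    '
}
```

Fed the Ryzen 7 PRO 5750G listing above, it prints F0=0x166a F6=0x1670:

$ lspci -s 0:18 -n | df_f0_f6_ids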

So in per_family_init(), in drivers/edac/amd64_edac.c, I first change the code to recognize Ryzen 5000 APUs (family 0x19, model 0x50), then initialize data structures that contain the proper device IDs (0x166a and 0x1670):

--- drivers/edac/amd64_edac.h.orig	2021-11-23 20:53:17.777353032 -0800
+++ drivers/edac/amd64_edac.h	2021-11-23 20:55:43.346625956 -0800
@@ -126,6 +126,8 @@
 #define PCI_DEVICE_ID_AMD_17H_M70H_DF_F6 0x1446
 #define PCI_DEVICE_ID_AMD_19H_DF_F0	0x1650
 #define PCI_DEVICE_ID_AMD_19H_DF_F6	0x1656
+#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F0 0x166a
+#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F6 0x1670
 
 /*
  * Function 1 - Address Map
@@ -298,6 +300,7 @@
 	F17_M60H_CPUS,
 	F17_M70H_CPUS,
 	F19_CPUS,
+	F19_M50H_CPUS,
 	NUM_FAMILIES,
 };
 
--- drivers/edac/amd64_edac.c.orig	2021-09-30 01:11:08.000000000 -0700
+++ drivers/edac/amd64_edac.c	2021-11-23 21:10:39.766923976 -0800
@@ -2351,6 +2351,16 @@
 			.dbam_to_cs		= f17_addr_mask_to_cs_size,
 		}
 	},
+	[F19_M50H_CPUS] = {
+		.ctl_name = "F19h_M50h",
+		.f0_id = PCI_DEVICE_ID_AMD_19H_M50H_DF_F0,
+		.f6_id = PCI_DEVICE_ID_AMD_19H_M50H_DF_F6,
+		.max_mcs = 2,
+		.ops = {
+			.early_channel_count	= f17_early_channel_count,
+			.dbam_to_cs		= f17_addr_mask_to_cs_size,
+		}
+	},
 };
 
 /*
@@ -3403,6 +3413,12 @@
 			fam_type->ctl_name = "F19h_M20h";
 			break;
 		}
+		if (pvt->model == 0x50) {
+			fam_type = &family_types[F19_M50H_CPUS];
+			pvt->ops = &family_types[F19_M50H_CPUS].ops;
+			fam_type->ctl_name = "F19h_M50h";
+			break;
+		}
 		fam_type	= &family_types[F19_CPUS];
 		pvt->ops	= &family_types[F19_CPUS].ops;
 		family_types[F19_CPUS].ctl_name = "F19h";
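
To tell whether a given machine would take the new branch, the family and model can be read from /proc/cpuinfo, which reports them in decimal (0x19 is 25, 0x50 is 80). The following is a sketch only: is_f19_m50 is a hypothetical helper, and it mirrors the exact model == 0x50 comparison of the patch above.

```shell
# Hypothetical helper: reads /proc/cpuinfo-style text on stdin and succeeds
# if it describes family 0x19 (25), model 0x50 (80), the values matched by
# the patch. The "model name" line is deliberately not matched.
is_f19_m50() {
    awk -F: '
        $1 ~ /^cpu family[ \t]*$/ { fam = $2 + 0 }
        $1 ~ /^model[ \t]*$/      { model = $2 + 0 }
        END { exit !(fam == 25 && model == 80) }
    '
}
```

For example:

$ is_f19_m50 < /proc/cpuinfo && echo "the F19h_M50h branch would be taken"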

I recompile the kernel module, and lo and behold, it loads and everything works!

$ insmod amd64_edac.ko
$ dmesg
[...]
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F19h_M50h: DEV 0000:00:18.3 (INTERRUPT)
EDAC amd64: F19h_M50h detected (node 0).
EDAC MC: UMC0 chip selects:
EDAC amd64: MC: 0: 16384MB 1: 16384MB
EDAC amd64: MC: 2: 16384MB 3: 16384MB
EDAC MC: UMC1 chip selects:
EDAC amd64: MC: 0: 16384MB 1: 16384MB
EDAC amd64: MC: 2: 16384MB 3: 16384MB
EDAC amd64: using x8 syndromes.
EDAC PCI1: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
AMD64 EDAC driver v3.5.0
$ edac-util 
edac-util: No errors to report.

“No errors to report” is what you want to see and indicates no ECC errors have occurred so far.

This patch can easily be applied to kernel versions before 5.13, but you will find that you also need another patch, commit 2ade8fc65076095460e3ea1ca65a8f619d7d9a3a, or else amd64_edac.ko fails to load because amd_cache_northbridges() returns an error.
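
A quick way to tell which case applies is to compare the major and minor numbers of the kernel version string. This is a minimal sketch in plain POSIX shell; kernel_older_than_5_13 is a hypothetical helper, and it assumes version strings of the form produced by uname -r (for example 5.14.0-4-amd64) or release names such as 5.16-rc2.

```shell
# Hypothetical helper: succeeds if the given kernel version string is older
# than 5.13, i.e. when the extra amd_cache_northbridges() fix is also needed.
# Note: plain POSIX sh has no local variables, so major, rest and minor leak
# into the calling shell.
kernel_older_than_5_13() {
    major=${1%%.*}            # text before the first dot
    rest=${1#*.}              # text after the first dot
    minor=${rest%%[!0-9]*}    # leading digits of the remainder
    [ "$major" -lt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -lt 13 ]; }
}
```

For example:

$ kernel_older_than_5_13 "$(uname -r)" && echo "the amd_cache_northbridges() fix is needed too"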

Comments

alteriks@gmail.com wrote: Hi Marc,
I've used your patch against linux-5.15.7. Finally edac_util reports that everything is fine instead of ' Error: No memory controller data found.' on my Ryzen 5650G.
Thanks for your work!
Could you submit a pull request, so it will be available in mainline kernel?
11 Dec 2021 15:14 UTC

mrb wrote: Glad this helped! Yes, I plan on submitting a patch to kernel in the coming days. 12 Dec 2021 16:44 UTC

simitu@seznam.cz wrote: Thanks for patch and precise analysis. Except it is working, I like to know how. Ryzen 5650G and ASRock X470D4U here, edac now working! Please submit this to kernel dev. Best regards. 17 Dec 2021 10:47 UTC

andrewl wrote: Thanks for the patch!
I think it would be great if this could get upstreamed by you or an EDAC maintainer [yazen.ghannam a.t. amd.com].

I was able to compile the patch on Linux 5.15.8. To note you can compile just the module itself (which is much faster):

make modules_prepare
make M=drivers/edac/

and the ko file is located at:
./drivers/edac/amd64_edac.ko
19 Dec 2021 02:01 UTC

mrb wrote: For the record, I have just submitted my patch to the kernel devs: https://lore.kernel.org/linux-edac/20211219223127.71554-1-m@zorinaq.com/T/#u 19 Dec 2021 22:45 UTC

mrb wrote: The patch was accepted and committed into the ras.git repository: https://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git/commit/?h=edac-for-next&id=0b8bf9cb142da59a14622bba168ebcd6d0a54499 So it should find its way into Torvalds's repository in the near future. Yay!

By the way, thanks andrewl for the tip to compile only the necessary module. I suspected there was a way to do it, but was too lazy to do the research :)
25 Dec 2021 05:26 UTC

mrb wrote: My patch will be included in Linux 5.17-rc1. Here is the commit in Torvalds's repository:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b8bf9cb142da59a14622bba168ebcd6d0a54499
10 Jan 2022 23:44 UTC

mscdex wrote: With the 5.14.21 kernel from the Ubuntu mainline kernel PPA it seems both amd_nb.c and amd64_edac.* are already patched (seems they got backported or similar). However, trying to load the kernel module still results in "no such device". dmesg has no additional messages after attempting to load the module.

I'm testing this with a Ryzen 7 5750GE Pro APU (on Ubuntu 21.10). I would've assumed the PCI device IDs would be identical to the 5750G, but maybe they aren't? How would I go about finding what the correct IDs should be in case they aren't the same?
25 Feb 2022 20:53 UTC

mscdex wrote: Forget about my previous comment. I had the wrong kernel branch checked out somehow. After checking out the appropriate branch, the module loads and `edac-util -v` is working! Thanks for making the patch and instructions public! 25 Feb 2022 21:25 UTC

giorgio wrote: Thank you a lot for the patch! I really appreciate your work :-) 27 Mar 2022 11:21 UTC