mrb's blog

Linux Kernel Patch to Support ECC Memory on AMD Ryzen 5000 APUs

Keywords: hardware sysadmin

I made a Linux kernel patch to support ECC memory on AMD Ryzen 5000 APUs (codename Cezanne). It applies cleanly and works on kernel versions 5.13 through the latest (5.16-rc2).

Step-by-step patching in Debian 12 (bookworm)

The instructions below for Debian 12 (bookworm) are minimally intrusive. They install only one modified kernel module (amd64_edac.ko) while keeping the kernel image and other modules unmodified.

First, install dependencies to compile the kernel module:

$ apt install build-essential bc kmod cpio bison flex gnupg wget linux-headers-amd64 libncurses-dev libelf-dev libssl-dev

Get the kernel source and extract it:

$ apt install linux-source
$ cd /usr/src/
$ tar xf linux-source-5.15.tar.xz
$ cd linux-source-5.15

Obtain the configuration of the running kernel and disable signing of the modules (otherwise compilation fails, as we do not possess Debian’s signing key):

$ cp /usr/src/linux-headers-$(uname -r)/.config .
$ sed -r -i -e 's,^CONFIG_SYSTEM_TRUSTED_KEYS=.+,CONFIG_SYSTEM_TRUSTED_KEYS="",g' .config
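
To see what that sed expression does before touching the real .config, it can be tried on a scratch file. This is only a sketch: the helper name clear_trusted_keys and the sample key path are made up for illustration and are not taken from Debian's configuration.

```shell
# Hypothetical helper: blank out CONFIG_SYSTEM_TRUSTED_KEYS in the given
# config file (same sed expression as above) and print the resulting line.
clear_trusted_keys() {
    sed -r -i -e 's,^CONFIG_SYSTEM_TRUSTED_KEYS=.+,CONFIG_SYSTEM_TRUSTED_KEYS="",g' "$1"
    grep '^CONFIG_SYSTEM_TRUSTED_KEYS=' "$1"
}

# Try it on a scratch file; the key path below is a made-up sample value.
scratch=$(mktemp)
printf '%s\n' 'CONFIG_MODULES=y' \
    'CONFIG_SYSTEM_TRUSTED_KEYS="some/certs/example.pem"' > "$scratch"
clear_trusted_keys "$scratch"
rm -f "$scratch"
```

If the line printed at the end still shows a key path, the substitution did not match and the build would likely stop when it looks for the missing key file.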

Download my patch ecc-amd-cezanne.patch and apply it:

$ curl https://blog.zorinaq.com/assets/ecc-amd-cezanne.patch | patch -Np0

Compile:

$ make -j $(nproc) bindeb-pkg

Strip the .BTF section from the module amd64_edac.ko, or else attempting to load it will fail with the error "failed to validate module [amd64_edac] BTF: -22" in dmesg. This seems to be a subtle bug due to a corner case in the BTF validation framework:

$ objcopy --remove-section .BTF debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko

Install the module amd64_edac.ko in the updates directory which is given higher priority by depmod (so our patched module will load instead of the original module):

$ DEST="/lib/modules/$(uname -r)/updates"
$ mkdir -p "$DEST"
$ cp debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko "$DEST"
$ depmod -a
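
A quick way to confirm which copy of the module will be loaded is to look at the path that modinfo -n amd64_edac prints. The small classifier below is a sketch with a made-up name (module_origin); it only inspects the path string and assumes the standard updates/ and kernel/ directory layout under /lib/modules.

```shell
# Hypothetical helper: classify a module path as the patched copy
# (under updates/) or the stock one (under kernel/).
module_origin() {
    case "$1" in
        */updates/*) echo "patched (updates/)" ;;
        */kernel/*)  echo "stock (kernel/)" ;;
        *)           echo "unknown" ;;
    esac
}
```

It can be called as module_origin "$(modinfo -n amd64_edac)". If it reports the stock copy, depmod did not pick up the file placed in the updates directory.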

Try loading the module. If it works, edac-util -v should show ECC statistics:

$ modprobe amd64_edac
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
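
The same counters can also be read without edac-util, straight from the EDAC sysfs files. The sketch below assumes the usual layout, /sys/devices/system/edac/mc/mcN/ce_count and ue_count; the function name edac_totals is made up, and the optional argument exists only so it can be pointed at a different directory.

```shell
# Hypothetical helper: sum corrected (ce_count) and uncorrected (ue_count)
# error counters across all memory controllers found under the EDAC sysfs
# directory (or under the directory given as first argument).
edac_totals() {
    base="${1:-/sys/devices/system/edac/mc}"
    n=0
    ce=0
    ue=0
    for mc in "$base"/mc[0-9]*; do
        # Skip entries without readable counters (also covers an unmatched glob).
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        n=$((n + 1))
        ce=$((ce + $(cat "$mc/ce_count")))
        ue=$((ue + $(cat "$mc/ue_count")))
    done
    echo "controllers=$n corrected=$ce uncorrected=$ue"
}
```

Running edac_totals with no argument reads the live counters. A result of controllers=0 suggests the module is not loaded, which is different from having zero errors.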

Origin story

I built a small Linux network file server for my home network, based on an ASRock X570M PRO4 motherboard, an AMD Ryzen 7 PRO 5750G APU, and 128 GB of Kingston DDR4-2666 ECC unbuffered memory (KSM26ED8/32ME).

Officially, AMD validates ECC memory on Ryzen PRO APUs only, while non-PRO APUs may or may not support ECC depending on the motherboard. I really wanted ECC on an officially supported hardware stack. However, PRO APUs are distributed to OEMs only, not to retail customers. So I hunted down a PRO APU, which I had to buy from Germany through an eBay reseller.

After all this trouble, to my surprise, ECC did not work out of the box. I could not get ECC statistics:

$ edac-util 
edac-util: Error: No memory controller data found.

This error happens because the kernel module amd64_edac.ko fails to load:

$ modprobe amd64_edac
modprobe: ERROR: could not insert 'amd64_edac': No such device

My kernel is very recent, 5.14.0-4-amd64 from Debian 12 (testing/bookworm), but so are Ryzen 5000 APUs.

So let’s try a bleeding edge kernel? I must admit it had been about 10 years since I last built a Linux kernel, but with the help of Daniel Wayne Armstrong’s excellent concise guide, I dived in and built 5.16-rc2, released by Torvalds 3 days ago.

Alas, no luck. Same “No such device” error.

My BIOS is up-to-date. I reached out to ASRock’s customer support, and a helpful chap even tried booting up the same processor as mine, a 5750G, on that motherboard with some ECC memory. He sent me a screenshot proving ECC works, at least on Windows:

[Screenshot: ECC works on Windows]

I do not have Windows, but if only to confirm ECC can work on my hardware, I figured I could probably run that wmic memphysical get memoryerrorcorrection command from the Windows installation ISO, without installing the OS. When the installer starts, select “Repair Windows”, open a command prompt, and, indeed, the wmic command prints 6, meaning Multi-bit ECC is working.
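
For reference, the number printed by wmic is the MemoryErrorCorrection property of the Win32_PhysicalMemoryArray class. The lookup below is a small sketch of how the codes map to names, as I understand Microsoft's documentation of that class; the function name ecc_type_name is made up.

```shell
# Hypothetical helper: map a MemoryErrorCorrection code
# (Win32_PhysicalMemoryArray) to its documented name.
ecc_type_name() {
    case "$1" in
        0) echo "Reserved" ;;
        1) echo "Other" ;;
        2) echo "Unknown" ;;
        3) echo "None" ;;
        4) echo "Parity" ;;
        5) echo "Single-bit ECC" ;;
        6) echo "Multi-bit ECC" ;;
        7) echo "CRC" ;;
        *) echo "Unrecognized code: $1" ;;
    esac
}
```

A value of 3 would mean the firmware reports no error correction at all, which is the case worth ruling out before debugging the Linux side.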

So, what is Linux doing? I locate the module’s source, drivers/edac/amd64_edac.c, add a few printk() debug messages…

Down the rabbit hole, I discover reserve_mc_sibling_devs() fails here:

pvt->F0 = pci_get_related_function(pvt->F3->vendor, pci_id1, pvt->F3);
if (!pvt->F0) {
  edac_dbg(1, "F0 not found, device 0x%x\n", pci_id1);
  return -ENODEV;
}

I convert the edac_dbg() to a printk() statement, which prints:

F0 not found, device 0x1650

0x1650 is a PCI device ID not present on my system. It is very helpful for debugging that I have another AMD machine (EPYC 7232P) with ECC memory, so I can compare the output of lspci and understand which PCI devices the code was looking for. As it turns out, they are always on bus 0 device 0x18:

On EPYC 7232P:

$ lspci -s 0:18 -n
00:18.0 0600: 1022:1490 ← function 0
00:18.1 0600: 1022:1491
00:18.2 0600: 1022:1492
00:18.3 0600: 1022:1493
00:18.4 0600: 1022:1494
00:18.5 0600: 1022:1495
00:18.6 0600: 1022:1496 ← function 6
00:18.7 0600: 1022:1497

On Ryzen 7 PRO 5750G:

$ lspci -s 0:18 -n
00:18.0 0600: 1022:166a ← function 0
00:18.1 0600: 1022:166b
00:18.2 0600: 1022:166c
00:18.3 0600: 1022:166d
00:18.4 0600: 1022:166e
00:18.5 0600: 1022:166f
00:18.6 0600: 1022:1670 ← function 6
00:18.7 0600: 1022:1671

On EPYC 7232P, the code looks for 0x1490 and finds it, but on Ryzen 7 PRO 5750G the code looks for 0x1650 and does not find it.

It is my understanding the above 8 device functions represent the northbridge / memory controller, and reserve_mc_sibling_devs() is looking for functions 0 then 6 which, on my machine, have device IDs 0x166a and 0x1670.
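
To repeat this comparison on another machine, the two IDs of interest can be pulled out of the lspci output mechanically. This is a minimal sketch: the function name df_f0_f6_ids is made up, and it assumes the lspci -n line format shown above (slot, class, then vendor:device).

```shell
# Hypothetical helper: read `lspci -s 0:18 -n` output on stdin and print
# the PCI device IDs of function 0 and function 6.
df_f0_f6_ids() {
    awk '
        $1 ~ /\.0$/ { split($3, id, ":"); f0 = id[2] }   # slot ends in .0
        $1 ~ /\.6$/ { split($3, id, ":"); f6 = id[2] }   # slot ends in .6
        END { printf "F0=0x%s F6=0x%s\n", f0, f6 }
    '
}
```

Typical use is lspci -s 0:18 -n | df_f0_f6_ids. If the printed pair does not match any f0_id / f6_id pair known to the driver, the module will probably fail with the same "No such device" error.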

So in drivers/edac/amd64_edac.c, in the function per_family_init(), I first change the code to handle Ryzen 5000 APUs (family 0x19, model 0x50), then initialize data structures that contain the proper device IDs (0x166a and 0x1670):

--- drivers/edac/amd64_edac.h.orig	2021-11-23 20:53:17.777353032 -0800
+++ drivers/edac/amd64_edac.h	2021-11-23 20:55:43.346625956 -0800
@@ -126,6 +126,8 @@
 #define PCI_DEVICE_ID_AMD_17H_M70H_DF_F6 0x1446
 #define PCI_DEVICE_ID_AMD_19H_DF_F0	0x1650
 #define PCI_DEVICE_ID_AMD_19H_DF_F6	0x1656
+#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F0 0x166a
+#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F6 0x1670
 
 /*
  * Function 1 - Address Map
@@ -298,6 +300,7 @@
 	F17_M60H_CPUS,
 	F17_M70H_CPUS,
 	F19_CPUS,
+	F19_M50H_CPUS,
 	NUM_FAMILIES,
 };
 
--- drivers/edac/amd64_edac.c.orig	2021-09-30 01:11:08.000000000 -0700
+++ drivers/edac/amd64_edac.c	2021-11-23 21:10:39.766923976 -0800
@@ -2351,6 +2351,16 @@
 			.dbam_to_cs		= f17_addr_mask_to_cs_size,
 		}
 	},
+	[F19_M50H_CPUS] = {
+		.ctl_name = "F19h_M50h",
+		.f0_id = PCI_DEVICE_ID_AMD_19H_M50H_DF_F0,
+		.f6_id = PCI_DEVICE_ID_AMD_19H_M50H_DF_F6,
+		.max_mcs = 2,
+		.ops = {
+			.early_channel_count	= f17_early_channel_count,
+			.dbam_to_cs		= f17_addr_mask_to_cs_size,
+		}
+	},
 };
 
 /*
@@ -3403,6 +3413,12 @@
 			fam_type->ctl_name = "F19h_M20h";
 			break;
 		}
+		if (pvt->model == 0x50) {
+			fam_type = &family_types[F19_M50H_CPUS];
+			pvt->ops = &family_types[F19_M50H_CPUS].ops;
+			fam_type->ctl_name = "F19h_M50h";
+			break;
+		}
 		fam_type	= &family_types[F19_CPUS];
 		pvt->ops	= &family_types[F19_CPUS].ops;
 		family_types[F19_CPUS].ctl_name = "F19h";
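
The new branch in the patch only triggers for family 0x19, model 0x50. As a rough check of whether a given machine falls into that case, the family and model can be read from /proc/cpuinfo, which lists them in decimal. The sketch below converts them to hex; the function name cpu_family_model is made up, and the optional file argument exists only so it can be run against a saved copy.

```shell
# Hypothetical helper: print the CPU family and model of the first CPU
# listed in /proc/cpuinfo (or in the file given as argument), in hex.
cpu_family_model() {
    awk -F: '
        $1 ~ /^cpu family[ \t]*$/ && fam == "" { fam = $2 + 0 }
        $1 ~ /^model[ \t]*$/      && mod == "" { mod = $2 + 0 }
        END { printf "family=0x%x model=0x%x\n", fam, mod }
    ' "${1:-/proc/cpuinfo}"
}
```

On a Cezanne APU I would expect this to print family=0x19 model=0x50, matching the condition added to per_family_init().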

I recompile the kernel module, and lo and behold, it loads and everything works!

$ insmod amd64_edac.ko
$ dmesg
[...]
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F19h_M50h: DEV 0000:00:18.3 (INTERRUPT)
EDAC amd64: F19h_M50h detected (node 0).
EDAC MC: UMC0 chip selects:
EDAC amd64: MC: 0: 16384MB 1: 16384MB
EDAC amd64: MC: 2: 16384MB 3: 16384MB
EDAC MC: UMC1 chip selects:
EDAC amd64: MC: 0: 16384MB 1: 16384MB
EDAC amd64: MC: 2: 16384MB 3: 16384MB
EDAC amd64: using x8 syndromes.
EDAC PCI1: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
AMD64 EDAC driver v3.5.0
$ edac-util 
edac-util: No errors to report.

“No errors to report” is what you want to see and indicates no ECC errors have occurred so far.

This patch can easily be applied to kernel versions before 5.13, but you will also need commit 2ade8fc65076095460e3ea1ca65a8f619d7d9a3a; without it, amd64_edac.ko fails to load because amd_cache_northbridges() returns an error.
