I made a Linux kernel patch to support ECC memory on AMD Ryzen 5000 APUs (codename Cezanne). It applies cleanly and works on kernel versions 5.13 through 5.16-rc2, the latest at the time of writing.
Step-by-step patching in Debian 12 (bookworm)
The instructions below for Debian 12 (bookworm) are minimally intrusive. They install only one modified kernel module (amd64_edac.ko) while keeping the kernel image and other modules unmodified.
First, install dependencies to compile the kernel module:
$ apt install build-essential bc kmod cpio bison flex gnupg wget linux-headers-amd64 libncurses-dev libelf-dev libssl-dev rsync dwarves
Get the kernel source and extract it:
$ apt install linux-source
$ cd /usr/src/
$ tar xf linux-source-5.15.tar.xz
$ cd linux-source-5.15
Obtain the configuration of the running kernel and disable module signing (otherwise compilation fails, as we do not possess Debian’s signing key):
$ cp /usr/src/linux-headers-$(uname -r)/.config .
$ sed -r -i -e 's,^CONFIG_SYSTEM_TRUSTED_KEYS=.+,CONFIG_SYSTEM_TRUSTED_KEYS="",g' .config
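Before spending time on a build, it can be worth confirming that the substitution took effect. The sketch below wraps the same sed command in a function and checks the result. The name blank_trusted_keys is made up for this illustration, and it assumes GNU sed, which is what Debian ships.

```shell
# Hypothetical helper (not part of any package): blank
# CONFIG_SYSTEM_TRUSTED_KEYS in the given config file (default: ./.config)
# and succeed only if the option ends up set to an empty string.
blank_trusted_keys() {
    cfg="${1:-.config}"
    sed -r -i -e 's,^CONFIG_SYSTEM_TRUSTED_KEYS=.+,CONFIG_SYSTEM_TRUSTED_KEYS="",g' "$cfg"
    grep -q '^CONFIG_SYSTEM_TRUSTED_KEYS=""$' "$cfg"
}
```

Running blank_trusted_keys .config && echo ok from the source directory prints ok when the option is blanked. It returns a non-zero status if the option is missing from the file altogether.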
Download my patch ecc-amd-cezanne.patch and apply it:
$ curl https://blog.zorinaq.com/assets/ecc-amd-cezanne.patch | patch -Np0
Compile:
$ make -j $(nproc) bindeb-pkg
Strip the .BTF section from the module amd64_edac.ko, or else attempting to load it will fail with the error “failed to validate module [amd64_edac] BTF: -22” in dmesg. This seems to be a subtle bug due to a corner case in the BTF validation framework:
$ objcopy --remove-section .BTF debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko
Install the module amd64_edac.ko in the updates directory, which is given higher priority by depmod (so our patched module will load instead of the original module):
$ DEST="/lib/modules/$(uname -r)/updates"
$ mkdir -p "$DEST"
$ cp debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko "$DEST"
$ depmod -a
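To check that the patched copy is the one that will be picked up, the path reported by modinfo -n (which prints only the module's filename) can be tested for the updates directory. This is a minimal sketch; the function name in_updates_dir is hypothetical.

```shell
# Hypothetical helper: succeed if the given module path lies inside an
# "updates" directory, fail otherwise.
in_updates_dir() {
    case "$1" in
        */updates/*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Usage (after depmod -a):
#   in_updates_dir "$(modinfo -n amd64_edac)" && echo "patched module selected"
```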
Try loading the module. If it works, edac-util -v should show ECC statistics:
$ modprobe amd64_edac
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
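The same counters can also be read without edac-util, since the EDAC subsystem exposes them in sysfs. The sketch below assumes the usual layout, /sys/devices/system/edac/mc/mcN/ce_count and ue_count. The function name edac_totals is made up for this example, and the base directory can be overridden for testing.

```shell
# Hypothetical helper: sum corrected (ce_count) and uncorrected (ue_count)
# error counters across all memory controllers found under the base
# directory (default: the standard EDAC sysfs location).
edac_totals() {
    base="${1:-/sys/devices/system/edac/mc}"
    ce=0
    ue=0
    for mc in "$base"/mc[0-9]*; do
        # Skip entries that do not expose both counters (this also covers
        # the case where the glob matched nothing).
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$((ce + $(cat "$mc/ce_count")))
        ue=$((ue + $(cat "$mc/ue_count")))
    done
    echo "corrected=$ce uncorrected=$ue"
}
```

Note that it also prints zeros when no memory controller is registered at all, so it is only meaningful once the module has loaded.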
Origin story
I built a small Linux network file server for my home network, based on an ASRock X570M PRO4 motherboard, an AMD Ryzen 7 PRO 5750G APU, and 128 GB of Kingston DDR4-2666 ECC unbuffered memory (KSM26ED8/32ME).
Officially, AMD validates ECC memory on Ryzen PRO APUs only, while non-PRO APUs may or may not support ECC depending on the motherboard. I really wanted ECC on an officially supported hardware stack. However, PRO APUs are distributed to OEMs only, not to retail customers, so I hunted down a PRO APU, which I had to buy from Germany through an eBay reseller.
After all this trouble, to my surprise, ECC did not work out of the box. I could not get ECC statistics:
$ edac-util
edac-util: Error: No memory controller data found.
This error happens because the kernel module amd64_edac.ko fails to load:
$ modprobe amd64_edac
modprobe: ERROR: could not insert 'amd64_edac': No such device
My kernel is very recent, 5.14.0-4-amd64 from Debian 12 (testing/bookworm), but so are Ryzen 5000 APUs.
So let’s try a bleeding-edge kernel? I must admit it had been about 10 years since I last built a Linux kernel, but with the help of Daniel Wayne Armstrong’s excellent concise guide, I dived in and built 5.16-rc2, released by Torvalds 3 days ago.
Alas, no luck. Same “No such device” error.
My BIOS is up-to-date. I reached out to ASRock’s customer support, and a helpful chap even tried booting up the same processor as mine, a 5750G, on that motherboard, with some ECC memory, and sent me a screenshot proving ECC works, at least on Windows.
I do not have Windows, but if only to confirm ECC can work on my hardware, I figured I could probably run the wmic memphysical get memoryerrorcorrection command from the Windows installation ISO, without installing the OS. When the installer starts, select “Repair Windows”, open a command prompt, and, indeed, the wmic command prints 6, meaning Multi-bit ECC is working.
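For reference, the number printed by wmic is the MemoryErrorCorrection property of the Win32_PhysicalMemoryArray WMI class. The mapping below reflects my reading of Microsoft's documentation for that class, so treat it as a convenience sketch rather than an authoritative table. The function name ecc_type_name is made up for this example.

```shell
# Hypothetical helper: translate the numeric MemoryErrorCorrection value
# printed by wmic into a human-readable name.
ecc_type_name() {
    case "$1" in
        0) echo "Reserved" ;;
        1) echo "Other" ;;
        2) echo "Unknown" ;;
        3) echo "None" ;;
        4) echo "Parity" ;;
        5) echo "Single-bit ECC" ;;
        6) echo "Multi-bit ECC" ;;
        7) echo "CRC" ;;
        *) echo "Unrecognized value: $1" ;;
    esac
}
```

A value of 3 would have meant the firmware reports no error correction at all.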
So, what is Linux doing? I locate the module’s source, drivers/edac/amd64_edac.c, add a few printk() debug messages…
Down the rabbit hole, I discover that reserve_mc_sibling_devs() fails. I convert its edac_dbg() call to a printk() statement, which prints:
F0 not found, device 0x1650
0x1650 is a PCI device ID not present on my system. It is very helpful for debugging that I have another AMD machine (EPYC 7232P) with ECC memory, so I can compare the output of lspci and understand which PCI devices the code was looking for. As it turns out, they are always on bus 0 device 0x18:
On EPYC 7232P:
$ lspci -s 0:18 -n
00:18.0 0600: 1022:1490 ← function 0
00:18.1 0600: 1022:1491
00:18.2 0600: 1022:1492
00:18.3 0600: 1022:1493
00:18.4 0600: 1022:1494
00:18.5 0600: 1022:1495
00:18.6 0600: 1022:1496 ← function 6
00:18.7 0600: 1022:1497
On Ryzen 7 PRO 5750G:
$ lspci -s 0:18 -n
00:18.0 0600: 1022:166a ← function 0
00:18.1 0600: 1022:166b
00:18.2 0600: 1022:166c
00:18.3 0600: 1022:166d
00:18.4 0600: 1022:166e
00:18.5 0600: 1022:166f
00:18.6 0600: 1022:1670 ← function 6
00:18.7 0600: 1022:1671
On EPYC 7232P, the code looks for 0x1490 and finds it, but on Ryzen 7 PRO 5750G the code looks for 0x1650 and does not find it.
It is my understanding that the above 8 device functions represent the northbridge / memory controller, and reserve_mc_sibling_devs() is looking for functions 0 then 6, which, on my machine, have device IDs 0x166a and 0x1670.
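To pull out just those two IDs on another machine, the lspci output can be filtered with a few lines of awk. This is only a sketch: the function name df_f0_f6_ids is hypothetical, and it assumes the devices sit at 00:18.x as in the listings above.

```shell
# Hypothetical helper: read `lspci -s 0:18 -n` output on stdin and print the
# PCI device IDs of functions 0 and 6 as "F0=0x.... F6=0x....".
df_f0_f6_ids() {
    awk '
        # The third field is "vendor:device"; keep the device half.
        $1 == "00:18.0" { split($3, id, ":"); f0 = id[2] }
        $1 == "00:18.6" { split($3, id, ":"); f6 = id[2] }
        END { printf "F0=0x%s F6=0x%s\n", f0, f6 }
    '
}

# Usage:
#   lspci -s 0:18 -n | df_f0_f6_ids
```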
So in drivers/edac/amd64_edac.c, in the function per_family_init(), I first change the code to handle Ryzen 5000 APUs (family 0x19, model 0x50), then initialize data structures that contain the proper device IDs (0x166a and 0x1670).
I recompile the kernel module, and lo and behold, it loads and everything works!
$ insmod amd64_edac.ko
$ dmesg
[...]
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F19h_M50h: DEV 0000:00:18.3 (INTERRUPT)
EDAC amd64: F19h_M50h detected (node 0).
EDAC MC: UMC0 chip selects:
EDAC amd64: MC: 0: 16384MB 1: 16384MB
EDAC amd64: MC: 2: 16384MB 3: 16384MB
EDAC MC: UMC1 chip selects:
EDAC amd64: MC: 0: 16384MB 1: 16384MB
EDAC amd64: MC: 2: 16384MB 3: 16384MB
EDAC amd64: using x8 syndromes.
EDAC PCI1: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
AMD64 EDAC driver v3.5.0
$ edac-util
edac-util: No errors to report.
“No errors to report” is what you want to see and indicates no ECC errors have occurred so far.
This patch can easily be applied to kernel versions before 5.13, but you will find that you also need another patch, 2ade8fc65076095460e3ea1ca65a8f619d7d9a3a, or else amd64_edac.ko fails to load due to amd_cache_northbridges() returning an error.
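To tell which case applies, the running kernel version can be compared against 5.13. The sketch below relies on sort -V from GNU coreutils. The function name version_lt is made up for this example.

```shell
# Hypothetical helper: succeed if version $1 is strictly older than $2,
# using the version ordering implemented by `sort -V` (GNU coreutils).
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Strip the distribution suffix, e.g. "5.14.0-4-amd64" becomes "5.14.0".
running=$(uname -r | cut -d- -f1)
if version_lt "$running" 5.13; then
    echo "Kernel $running is older than 5.13: the extra patch is needed too"
else
    echo "Kernel $running: the Cezanne patch alone should be enough"
fi
```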