<h1>mrb's blog</h1>
<p>Computer and network security, reverse engineering, security vulnerability research and exploitation, high-performance computing, GPGPU, assembly optimization, free and open source software, Unix kernel internals, decentralized cryptocurrencies (Bitcoin), entrepreneurship, angel investing, etc.</p>
<h1>Shipping companies underestimate dim weight by 0.7% (2023-06-12)</h1>
<p>Shipping companies like FedEx or UPS sometimes determine the shipping
rate by calculating the <em>dimensional weight</em>, or dim weight, of the package.
For example, FedEx instructs their American audience to calculate the size in
cubic inches and then to <a href="https://www.fedex.com/en-us/shipping/packaging/what-is-dimensional-weight.html">divide by 139</a>.
Others may use a divisor of 166 or 194. The result is the dim weight in
pounds.</p>
<p>139, 166 and 194 are not obvious at first sight—the imperial system rarely
makes things obvious!—but the intent is to estimate, respectively, 1/5th,
1/6th, and 1/7th the density of water:</p>
<p>5 (lb) * 453.59237 (g/lb) / (2.54 (cm/in) ^ 3) = 138.39952</p>
<p>6 (lb) * 453.59237 (g/lb) / (2.54 (cm/in) ^ 3) = 166.07943</p>
<p>7 (lb) * 453.59237 (g/lb) / (2.54 (cm/in) ^ 3) = 193.75933</p>
<p>In plain words: 5 lb of water has a volume of 138.39952 cubic inches, and so
on. So a package with a volume of 138.39952 cubic inches is assumed to
have a dim weight of 1 lb, i.e. one fifth the density of water.</p>
<p>The funny thing is: all shipping companies round these figures correctly
<em>except</em> 138.39952 which is rounded up to 139. My guess is they did the same
math as above except they approximated the pound as 454 g instead of exactly
453.59237 g:</p>
<p>5 (lb) * 454 (g/lb) / (2.54 (cm/in) ^ 3) = 138.52390 …which rounds to 139.</p>
<p>Therefore all the shipping companies who use a divisor of 139 instead of
138, such as FedEx, actually <em>underestimate</em> the intended dim weight by 0.7%,
and correspondingly undercharge customers. Who wants to tell them?</p>
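<p>As a quick sanity check of the arithmetic above, here is a small Python
snippet (a sketch of my calculation, not anything the carriers publish) that
derives the three divisors using both the exact pound and the 454 g
approximation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">#!/usr/bin/python3
# Dim weight divisor = volume (in^3) of N pounds of water, i.e. a package
# of that volume, at 1/N the density of water, weighs 1 lb.
GRAMS_PER_LB = 453.59237   # exact definition of the pound
CM_PER_IN = 2.54           # exact definition of the inch

for n in (5, 6, 7):
    exact = n * GRAMS_PER_LB / CM_PER_IN**3   # water is 1 g/cm^3
    approx = n * 454 / CM_PER_IN**3           # pound approximated as 454 g
    print(f'1/{n} the density of water: {exact:.5f} exact, {approx:.5f} approx')

# With the exact pound: 138.39952, 166.07943, 193.75933 -> 138, 166, 194.
# With 454 g: the first divisor becomes 138.52390, which rounds up to 139.
</code></pre></figure>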
<h1>Linux Kernel Patch to Support ECC Memory on AMD Ryzen 5000 APUs (2021-11-24)</h1>
<p>I made a Linux kernel patch to support ECC memory on AMD Ryzen 5000 APUs (codename Cezanne.) It
applies cleanly and works on kernel versions 5.13 through the latest
(5.16-rc2).</p>
<ul id="markdown-toc">
<li><a href="#step-by-step-patching-in-debian-12-bookworm" id="markdown-toc-step-by-step-patching-in-debian-12-bookworm">Step-by-step patching in Debian 12 (bookworm)</a></li>
<li><a href="#origin-story" id="markdown-toc-origin-story">Origin story</a></li>
</ul>
<h2 id="step-by-step-patching-in-debian-12-bookworm">Step-by-step patching in Debian 12 (bookworm)</h2>
<p>The instructions below for Debian 12 (bookworm) are minimally intrusive. They
install only one modified kernel module (<code class="language-plaintext highlighter-rouge">amd64_edac.ko</code>) while keeping
the kernel image and other modules unmodified.</p>
<p>First, install dependencies to compile the kernel module:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ apt install build-essential bc kmod cpio bison flex gnupg wget linux-headers-amd64 libncurses-dev libelf-dev libssl-dev rsync dwarves
</code></pre></div></div>
<p>Get the kernel source and extract it:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ apt install linux-source
$ cd /usr/src/
$ tar xf linux-source-5.15.tar.xz
$ cd linux-source-5.15
</code></pre></div></div>
<p>Obtain the configuration of the running kernel, and disable signing of the
modules (otherwise compilation would fail, as we do not possess Debian’s signing key):</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cp /usr/src/linux-headers-$(uname -r)/.config .
$ sed -r -i -e 's,^CONFIG_SYSTEM_TRUSTED_KEYS=.+,CONFIG_SYSTEM_TRUSTED_KEYS="",g' .config
</code></pre></div></div>
<p>Download my patch <a href="../assets/ecc-amd-cezanne.patch">ecc-amd-cezanne.patch</a> and apply it:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl https://blog.zorinaq.com/assets/ecc-amd-cezanne.patch | patch -Np0
</code></pre></div></div>
<p>Compile:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make -j $(nproc) bindeb-pkg
</code></pre></div></div>
<p>Strip the <code class="language-plaintext highlighter-rouge">.BTF</code> section from the module <code class="language-plaintext highlighter-rouge">amd64_edac.ko</code>, or else attempting to load it
will fail with the error <code class="language-plaintext highlighter-rouge">failed to validate module [amd64_edac] BTF: -22</code> in dmesg.
This seems to be a <a href="https://lkml.org/lkml/2021/11/26/623">subtle bug due to a corner case in the BTF validation
framework</a>:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objcopy --remove-section .BTF debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko
</code></pre></div></div>
<p>Install the module <code class="language-plaintext highlighter-rouge">amd64_edac.ko</code> in the <code class="language-plaintext highlighter-rouge">updates</code> directory which is given
higher priority by depmod (so our patched module will load instead of the original module):</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ DEST="/lib/modules/$(uname -r)/updates"
$ mkdir -p "$DEST"
$ cp debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko "$DEST"
$ depmod -a
</code></pre></div></div>
<p>Try loading the module. If it works, <code class="language-plaintext highlighter-rouge">edac-util -v</code> should show ECC statistics:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ modprobe amd64_edac
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
</code></pre></div></div>
<h2 id="origin-story">Origin story</h2>
<p>I built a small Linux network file server for my home network, based on
an ASRock X570M PRO4 motherboard, an AMD Ryzen 7 PRO 5750G APU, and
128 GB of Kingston DDR4-2666 ECC unbuffered memory (KSM26ED8/32ME).</p>
<p>Officially, AMD validates ECC memory on Ryzen PRO APUs only, while non-PRO APUs
may or may not support ECC depending on the motherboard. I really wanted ECC on
an officially supported hardware stack. However, PRO APUs are distributed to
OEMs only, not to retail customers. So I hunted down a PRO APU, which I had to
buy from Germany through an eBay reseller.</p>
<p>After all this trouble, to my surprise, ECC did not work out of the box. I
could not get ECC statistics:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ edac-util
edac-util: Error: No memory controller data found.
</code></pre></div></div>
<p>This error happens because the kernel module <code class="language-plaintext highlighter-rouge">amd64_edac.ko</code> fails to load:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ modprobe amd64_edac
modprobe: ERROR: could not insert 'amd64_edac': No such device
</code></pre></div></div>
<p>My kernel is very recent, 5.14.0-4-amd64 from Debian 12 (testing/bookworm), but
so are Ryzen 5000 APUs.</p>
<p>So, why not try a bleeding-edge kernel? I must admit it had been about 10 years
since I last built a Linux kernel, but with the help of <a href="https://www.dwarmstrong.org/kernel/">Daniel Wayne
Armstrong’s excellent, concise guide</a>,
I dived in and built 5.16-rc2, released by Torvalds 3 days ago.</p>
<p>Alas, no luck. Same “No such device” error.</p>
<p>My BIOS is up-to-date. I reached out to ASRock’s customer support, and a
helpful chap even tried booting the same processor as mine, a 5750G, on that
motherboard with some ECC memory, and sent me a screenshot proving ECC
works, at least on Windows:</p>
<p class="no-margins"><img src="../assets/ecc_asrock_x570m_pro4.jpg" alt="ECC works on Windows" title="ECC works on Windows" class="pure-img" /></p>
<p>I do not have Windows, but if only to confirm ECC can work on <em>my</em> hardware,
I figured I could probably run that <code class="language-plaintext highlighter-rouge">wmic memphysical get memoryerrorcorrection</code>
command from the Windows installation ISO, without installing the OS. When the
installer starts, select “Repair Windows”, open a command prompt, and, indeed,
the <code class="language-plaintext highlighter-rouge">wmic</code> command prints 6, meaning Multi-bit ECC is working.</p>
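<p>For reference, the session from the installer’s command prompt looks roughly
like this (approximate, not a verbatim capture):</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X:\Sources> wmic memphysical get memoryerrorcorrection
MemoryErrorCorrection
6
</code></pre></div></div>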
<p>So, what is Linux doing? I locate the module’s source, <code class="language-plaintext highlighter-rouge">drivers/edac/amd64_edac.c</code>,
add a few printk() debug messages…</p>
<p>Down the rabbit hole, I discover <code class="language-plaintext highlighter-rouge">reserve_mc_sibling_devs()</code> fails here:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="n">pvt</span><span class="o">-></span><span class="n">F0</span> <span class="o">=</span> <span class="n">pci_get_related_function</span><span class="p">(</span><span class="n">pvt</span><span class="o">-></span><span class="n">F3</span><span class="o">-></span><span class="n">vendor</span><span class="p">,</span> <span class="n">pci_id1</span><span class="p">,</span> <span class="n">pvt</span><span class="o">-></span><span class="n">F3</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">pvt</span><span class="o">-></span><span class="n">F0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">edac_dbg</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"F0 not found, device 0x%x</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">pci_id1</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="n">ENODEV</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>I convert the edac_dbg() to a printk() statement, which prints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>F0 not found, device 0x1650
</code></pre></div></div>
<p>0x1650 is a PCI device ID not present on my system. It is very helpful
for debugging that I have another AMD machine (EPYC 7232P) with ECC memory, so I
can compare the output of <code class="language-plaintext highlighter-rouge">lspci</code> and understand which PCI devices the code was
looking for. As it turns out, they are always on bus 0 device 0x18:</p>
<p>On EPYC 7232P:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ lspci -s 0:18 -n
00:18.0 0600: 1022:1490 ← function 0
00:18.1 0600: 1022:1491
00:18.2 0600: 1022:1492
00:18.3 0600: 1022:1493
00:18.4 0600: 1022:1494
00:18.5 0600: 1022:1495
00:18.6 0600: 1022:1496 ← function 6
00:18.7 0600: 1022:1497
</code></pre></div></div>
<p>On Ryzen 7 PRO 5750G:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ lspci -s 0:18 -n
00:18.0 0600: 1022:166a ← function 0
00:18.1 0600: 1022:166b
00:18.2 0600: 1022:166c
00:18.3 0600: 1022:166d
00:18.4 0600: 1022:166e
00:18.5 0600: 1022:166f
00:18.6 0600: 1022:1670 ← function 6
00:18.7 0600: 1022:1671
</code></pre></div></div>
<p>On EPYC 7232P, the code looks for 0x1490 and finds it, but on Ryzen 7 PRO 5750G
the code looks for 0x1650 and does not find it.</p>
<p>It is my understanding that the eight device functions above represent the
northbridge / memory controller, and <code class="language-plaintext highlighter-rouge">reserve_mc_sibling_devs()</code> is looking for functions 0
and 6 which, on my machine, have device IDs 0x166a and 0x1670.</p>
<p>So in <code class="language-plaintext highlighter-rouge">drivers/edac/amd64_edac.c</code>, in <code class="language-plaintext highlighter-rouge">per_family_init()</code>, I first change
the code to handle Ryzen 5000 APUs (family 0x19, model 0x50), then initialize
data structures that contain the proper device IDs (0x166a and 0x1670):</p>
<figure class="highlight"><pre><code class="language-patch" data-lang="patch"><span class="gd">--- drivers/edac/amd64_edac.h.orig 2021-11-23 20:53:17.777353032 -0800
</span><span class="gi">+++ drivers/edac/amd64_edac.h 2021-11-23 20:55:43.346625956 -0800
</span><span class="p">@@ -126,6 +126,8 @@</span>
#define PCI_DEVICE_ID_AMD_17H_M70H_DF_F6 0x1446
#define PCI_DEVICE_ID_AMD_19H_DF_F0 0x1650
#define PCI_DEVICE_ID_AMD_19H_DF_F6 0x1656
<span class="gi">+#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F0 0x166a
+#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F6 0x1670
</span>
/*
* Function 1 - Address Map
<span class="p">@@ -298,6 +300,7 @@</span>
F17_M60H_CPUS,
F17_M70H_CPUS,
F19_CPUS,
<span class="gi">+ F19_M50H_CPUS,
</span> NUM_FAMILIES,
};
<span class="gd">--- drivers/edac/amd64_edac.c.orig 2021-09-30 01:11:08.000000000 -0700
</span><span class="gi">+++ drivers/edac/amd64_edac.c 2021-11-23 21:10:39.766923976 -0800
</span><span class="p">@@ -2351,6 +2351,16 @@</span>
.dbam_to_cs = f17_addr_mask_to_cs_size,
}
},
<span class="gi">+ [F19_M50H_CPUS] = {
+ .ctl_name = "F19h_M50h",
+ .f0_id = PCI_DEVICE_ID_AMD_19H_M50H_DF_F0,
+ .f6_id = PCI_DEVICE_ID_AMD_19H_M50H_DF_F6,
+ .max_mcs = 2,
+ .ops = {
+ .early_channel_count = f17_early_channel_count,
+ .dbam_to_cs = f17_addr_mask_to_cs_size,
+ }
+ },
</span> };
/*
<span class="p">@@ -3403,6 +3413,12 @@</span>
fam_type->ctl_name = "F19h_M20h";
break;
}
<span class="gi">+ if (pvt->model == 0x50) {
+ fam_type = &family_types[F19_M50H_CPUS];
+ pvt->ops = &family_types[F19_M50H_CPUS].ops;
+ fam_type->ctl_name = "F19h_M50h";
+ break;
+ }
</span> fam_type = &family_types[F19_CPUS];
pvt->ops = &family_types[F19_CPUS].ops;
family_types[F19_CPUS].ctl_name = "F19h";</code></pre></figure>
<p>I recompile the kernel module, and lo and behold, it loads and everything works!</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ insmod amd64_edac.ko
$ dmesg
[...]
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F19h_M50h: DEV 0000:00:18.3 (INTERRUPT)
EDAC amd64: F19h_M50h detected (node 0).
EDAC MC: UMC0 chip selects:
EDAC amd64: MC: 0: 16384MB 1: 16384MB
EDAC amd64: MC: 2: 16384MB 3: 16384MB
EDAC MC: UMC1 chip selects:
EDAC amd64: MC: 0: 16384MB 1: 16384MB
EDAC amd64: MC: 2: 16384MB 3: 16384MB
EDAC amd64: using x8 syndromes.
EDAC PCI1: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
AMD64 EDAC driver v3.5.0
$ edac-util
edac-util: No errors to report.
</code></pre></div></div>
<p>“No errors to report” is what you want to see and indicates no ECC errors have occurred so far.</p>
<p>This patch can easily be applied to kernel versions before 5.13, but you will find that you
also need another patch <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kernel/amd_nb.c?id=2ade8fc65076095460e3ea1ca65a8f619d7d9a3a">2ade8fc65076095460e3ea1ca65a8f619d7d9a3a</a>
or else <code class="language-plaintext highlighter-rouge">amd64_edac.ko</code> fails to load due to <code class="language-plaintext highlighter-rouge">amd_cache_northbridges()</code> returning an error.</p>
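<p>If you go that route, kernel.org’s cgit can serve that commit in patch form,
so it can be fetched and applied much like my patch above (a sketch; note that
git-formatted patches need <code class="language-plaintext highlighter-rouge">-p1</code> rather than <code class="language-plaintext highlighter-rouge">-p0</code>):</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl 'https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=2ade8fc65076095460e3ea1ca65a8f619d7d9a3a' | patch -Np1
</code></pre></div></div>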
<h1>Booting unmodified Windows 10 over USB (2020-05-29)</h1>
<p>When I buy a laptop, the first thing I do is netboot into a PXE Linux
environment and make a full raw disk image backup of the pre-installed
Windows OS. I pipe <code class="language-plaintext highlighter-rouge">dd if=/dev/sda</code> or <code class="language-plaintext highlighter-rouge">dd if=/dev/nvme0n1</code> into <code class="language-plaintext highlighter-rouge">lz4</code>
(much faster than <code class="language-plaintext highlighter-rouge">gzip</code>) and write the image to a fileserver. No matter
how large the disk is, the image usually shrinks down to 20-30 GB because,
well, there are usually only 20-30 GB of files on the drive. The empty NTFS
blocks compress very well.</p>
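<p>Concretely, the backup is a one-liner along these lines (a sketch; the
mount point and filename are hypothetical):</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dd if=/dev/nvme0n1 bs=1M status=progress | lz4 > /mnt/fileserver/laptop-oem.img.lz4
</code></pre></div></div>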
<p>I then wipe Windows and install a Linux distribution.</p>
<p>Years later when I donate or sell the laptop, I restore the raw disk image.
Not only does this restore the Windows OEM image, but because all sectors are
overwritten, it <em>securely wipes</em> the disk. Very good for security.</p>
<p>The image backup I made turned out very useful today. I needed to
update the firmware for the trackpoint of my ThinkPad X1 Carbon
laptop. However the firmware update is <em>only</em> possible using a <a href="https://pcsupport.lenovo.com/us/en/products/laptops-and-netbooks/thinkpad-x-series-laptops/thinkpad-x1-carbon-type-20hr-20hq/downloads/ds122148">Windows 10
utility</a>.</p>
<h2 id="booting-from-usb">Booting from USB</h2>
<p>Here is how I got Windows running on my laptop, without reimaging,
swapping, or repartitioning the internal drive.</p>
<h3 id="write-the-image-to-a-usb-drive">Write the image to a USB drive</h3>
<p>I wrote the factory Windows 10 disk image to an external USB drive
(<code class="language-plaintext highlighter-rouge">unlz4 <image.lz4 >/dev/sda</code>). Being a full disk image, it includes 4
partitions: EFI System partition, Microsoft Reserved partition, Basic data
partition, Recovery partition.</p>
<h3 id="set-bootdriverflags-to-0x14">Set BootDriverFlags to 0x14</h3>
<p>If you try to boot from the external USB drive as is, Windows 10
will error out during boot (INACCESSIBLE_BOOT_DEVICE).</p>
<p>The key to make it work is to edit the obscure registry value
BootDriverFlags to change it from 0x0 to 0x14. The value is located under
<code class="language-plaintext highlighter-rouge">HKLM\SYSTEM\HardwareConfig\{...uuid...}</code> and can be edited by
mounting the C: partition in Linux, and using the <code class="language-plaintext highlighter-rouge">chntpw</code> utility:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ chntpw -e /path/to/mounted/Windows/System32/config/SYSTEM
</code></pre></div></div>
<p>List the <code class="language-plaintext highlighter-rouge">{...uuid...}</code> subkey under <code class="language-plaintext highlighter-rouge">HardwareConfig</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> ls HardwareConfig
Node has 1 subkeys and 2 values
key name
<{ca7bc4cc-350d-11b2-a85c-95cecf0de0fa}>
size type value name [value if type DWORD]
78 1 REG_SZ <LastConfig>
4 4 REG_DWORD <LastId> 0 [0x0]
</code></pre></div></div>
<p>Edit BootDriverFlags with <code class="language-plaintext highlighter-rouge">ed</code> to set it to 0x14:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> ed HardwareConfig\{ca7bc4cc-350d-11b2-a85c-95cecf0de0fa}\BootDriverFlags
EDIT: <HardwareConfig\{ca7bc4cc-350d-11b2-a85c-95cecf0de0fa}\BootDriverFlags> of type REG_DWORD (4) with length 4 [0x4]
DWORD: Old value 0 [0x0], enter new value (prepend 0x if hex, empty to keep old value)
-> 0x14
DWORD: New value 20 [0x14],
</code></pre></div></div>
<p>Finally save the changes with <code class="language-plaintext highlighter-rouge">q</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> q
</code></pre></div></div>
<p>BootDriverFlags is so obscure that it is barely documented. Some online
resources say the value is under <code class="language-plaintext highlighter-rouge">HKLM\System\CurrentControlSet\Control</code>
or <code class="language-plaintext highlighter-rouge">HKLM\SYSTEM\HardwareConfig\</code> but they are wrong. The value is under
<code class="language-plaintext highlighter-rouge">HKLM\SYSTEM\HardwareConfig\{...uuid...}</code>.</p>
<p>That is it. Changing a single registry key is all that is needed to make
an OEM Windows 10 image boot from a USB drive.</p>
<h1>Evidence of the effectiveness of stay-at-home orders in the US (2020-05-15)</h1>
<p>A semi-popular <a href="https://twitter.com/boriquagato/status/1258780623570792453">Twitter thread by @boriquagato</a> argues that stay-at-home
orders are ineffective at reducing mortality in the US: “<em>there is just no
evidence that this incredibly expensive and harmful policy has any effect at
all.</em>”</p>
<p>This is, of course, preposterous. @boriquagato attempted to rank US states by
<em>cumulative</em> deaths per capita, and observed how the ranking changed over
~4 weeks. I will explain the flaws in their application of this methodology.
But first, I will present a superior methodology: examining the trend of
<em>daily</em> deaths per capita.</p>
<ul id="markdown-toc">
<li><a href="#trend-of-daily-deaths" id="markdown-toc-trend-of-daily-deaths">Trend of daily deaths</a></li>
<li><a href="#trend-of-daily-cases" id="markdown-toc-trend-of-daily-cases">Trend of daily cases</a></li>
<li><a href="#flaws-in-boriquagatos-methodology" id="markdown-toc-flaws-in-boriquagatos-methodology">Flaws in @boriquagato’s methodology</a></li>
<li><a href="#summary" id="markdown-toc-summary">Summary</a></li>
<li><a href="#responses-from-boriquagato" id="markdown-toc-responses-from-boriquagato">Responses from @boriquagato</a></li>
</ul>
<h2 id="trend-of-daily-deaths">Trend of daily deaths</h2>
<p>When analyzing daily deaths, be aware there is always a lag between
the moment a stay-at-home order goes into effect and the moment a reduction of
daily deaths is observed, for many reasons:</p>
<ul>
<li>delay between infection and death</li>
<li>stay-at-home orders may be initially advisory or poorly enforced, and eventually made mandatory or more strictly enforced</li>
<li>infected persons staying at home may infect other household members over time</li>
<li>etc</li>
</ul>
<p>The lag is approximately 2-4 weeks. So I charted daily deaths per million inhabitants
for each US state, marked the date of stay-at-home orders with a dashed line, and
overlaid a shaded area 2-4 weeks later:</p>
<p class="no-margins"><a href="../assets/covid19_us_deaths_per_capita.png"><img src="../assets/covid19_us_deaths_per_capita.png" alt="Daily COVID-19 deaths per million, for each US state" title="Daily COVID-19 deaths per million, for each US state" class="pure-img" /></a></p>
<p>In <strong>33 states</strong>—plus DC—that implemented stay-at-home orders, we expect to see and
do see daily deaths either peaking or reaching a plateau at some point in the
shaded area:
Alaska (8 deaths,)
California,
Colorado,
Connecticut,
District of Columbia,
Florida,
Georgia,
Hawaii (17 deaths,)
Idaho,
Kansas,
Kentucky,
Louisiana,
Maine,
Maryland,
Massachusetts,
Michigan,
Missouri,
Montana (16 deaths,)
Nevada,
New Jersey,
New York,
North Carolina,
Ohio,
Oklahoma,
Oregon,
Pennsylvania,
South Carolina,
Tennessee,
Texas,
Vermont (53 deaths,)
Virginia,
Washington,
West Virginia, and
Wisconsin.
Some states, as noted in parentheses, have recorded very few deaths
so the curve exhibits random noise in the form of peaks and valleys.</p>
<p>In <strong>3 states</strong> that implemented stay-at-home orders, we see the
curve is neither peaking nor reaching a plateau around the expected time.
However the sources of these anomalies are easily identified, and once
corrected the curve does peak or reach a plateau when expected:</p>
<ol>
<li>Arizona: the spike in early May is caused by a <a href="https://www.azcentral.com/story/news/local/arizona-health/2020/05/08/arizona-covid-19-daily-deaths-have-risen-due-lags-and-overlooked-deaths/3099135001/">delay in reporting deaths from weeks prior</a>. Other than this artifact, Arizona has reached the plateau when expected.</li>
<li>Indiana: the spike around late April is due to a reporting artifact in my <a href="https://github.com/nytimes/covid-19-data/blob/master/us-states.csv">data source for COVID-19 deaths</a>. The <a href="https://www.coronavirus.in.gov/2393.htm">state’s official COVID-19 dashboard</a> shows deaths peaking on 22 April, as expected (between 08 April and 22 April):
<img src="../assets/covid19_indiana_deaths.png" alt="Indiana deaths by day" title="Indiana deaths by day" class="pure-img" /></li>
<li>Mississippi: the spike around early May is, again, caused by a reporting artifact. The <a href="https://msdh.ms.gov/msdhsite/_static/14,0,420.html#map">state’s official COVID-19 dashboard</a> shows deaths peaking on or about 27 April, as expected (between 17 April and 01 May):
<img src="../assets/covid19_mississippi_deaths.png" alt="Mississippi deaths by day" title="Mississippi deaths by day" class="pure-img" /></li>
</ol>
<p>In <strong>4 states</strong> that implemented stay-at-home orders, we see anomalies with no
obvious explanation:</p>
<ol>
<li>Alabama: daily deaths appear to continue to rise past the shaded area, however there is not enough data to tell if the trend is significant or if it will reach a plateau.</li>
<li>Delaware reached a plateau around 23 April, which is 31 days after the stay-at-home order, beyond the expected delay of 2-4 weeks. The state borders Maryland, which implemented a stay-at-home order 7 days after Delaware did. Delaware is a geographically small state, so one possible but unverified explanation could be people commuting across the state line who continued to import new cases into Delaware for longer than expected.</li>
<li>Illinois: daily death growth slows down in the shaded area, but not significantly enough to plateau. It is unclear why. Most deaths occur in the Chicago metropolitan area. Lack of enforcement in the city?</li>
<li>New Mexico: a cursory search did not reveal any obvious reason why deaths are still climbing. Perhaps the state’s “<a href="https://www.santafenewmexican.com/news/coronavirus/nursing-home-for-covid-19-patients-run-by-firm-with-history-of-violations-lawsuits/article_d7091238-8423-11ea-bd9b-9f69a2f80402.html"><em>grossly substandard nursing care</em></a>” is to blame. <a href="https://www.santafenewmexican.com/news/coronavirus/state-confirms-first-santa-fe-death-from-covid-19/article_258b12ae-8fd1-11ea-afd3-d7bb9ee7818d.html">45% of COVID-19 deaths in the state occurred in long-term care facilities</a>.</li>
</ol>
<p>In <strong>3 states</strong> that implemented stay-at-home orders, daily deaths continue to
increase past the shaded area. However these states are the top 3 states in the nation
with the highest shares of long-term care facility deaths according to
<a href="https://www.kff.org/health-costs/issue-brief/state-data-and-policy-actions-to-address-coronavirus/#stateleveldata">Kaiser Family Foundation</a> (Minnesota: 81%, Rhode Island: 77%, New Hampshire: 77%) well
above the nationwide average of <a href="https://www.nytimes.com/interactive/2020/05/09/us/coronavirus-cases-nursing-homes-us.html">one-third</a>.
So daily death growth in these states does not reflect a failure of the stay-at-home
policy, but a failure to keep their long-term care facilities safe:</p>
<ol>
<li>Minnesota</li>
<li>New Hampshire</li>
<li>Rhode Island</li>
</ol>
<p>In <strong>7 states</strong> that never issued stay-at-home orders, deaths generally follow an
upward trend, with no signs of stopping. However these states are so small and
have so few deaths that their curves are very noisy (except the largest state,
Iowa):</p>
<ol>
<li>Arkansas (98 deaths)</li>
<li>Iowa (318 deaths)</li>
<li>Nebraska (118 deaths)</li>
<li>North Dakota (40 deaths)</li>
<li>South Dakota (43 deaths)</li>
<li>Utah (76 deaths)</li>
<li>Wyoming (7 deaths: the curve is pure random noise)</li>
</ol>
<h2 id="trend-of-daily-cases">Trend of daily cases</h2>
<p>We can produce the same charts with cases instead of deaths. I expect the lag
between stay-at-home orders and daily cases peaking or reaching a plateau
to be 2-15 days, represented by the shaded area:</p>
<p class="no-margins"><a href="../assets/covid19_us_cases_per_capita.png"><img src="../assets/covid19_us_cases_per_capita.png" alt="Daily COVID-19 cases per million, for each US state" title="Daily COVID-19 cases per million, for each US state" class="pure-img" /></a></p>
<p>For many states, we do indeed observe daily cases peaking or reaching a plateau
2-15 days after the stay-at-home order.</p>
<p>However daily cases sometimes continue to increase beyond the shaded area. A
likely explanation is that states are increasing the daily test rate over time,
which improves the case ascertainment rate. For this reason, the charts of daily
deaths in the previous section are a more reliable indicator of the effectiveness
of stay-at-home orders.</p>
<h2 id="flaws-in-boriquagatos-methodology">Flaws in @boriquagato’s methodology</h2>
<p><a href="https://twitter.com/boriquagato/status/1258780623570792453">@boriquagato</a> ranks US states by cumulative deaths per capita and
observes how the ranking changes over approximately 4 weeks, between 11 April
and 08 May. Blue represents states that implemented stay-at-home orders, and
red those that did not:</p>
<p class="no-margins"><a href="../assets/covid19_scatterplot0_boriquagato.jpg"><img src="../assets/covid19_scatterplot0_boriquagato.jpg" alt="@boriquagato's states ranking" title="@boriquagato's states ranking" class="pure-img" /></a></p>
<p>Firstly, the scatterplot is bogus. I recreated it using
<a href="https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/state/detail/SCPRC-EST2019-18+POP-RES.csv">census.gov 2019 population data</a> and
the <a href="https://github.com/nytimes/covid-19-data/blob/061015bd97a7db39bf9e8abaebff8543f69eedf6/us-states.csv">New York Times COVID-19 data repository</a>
and I ended
up with a significantly different scatterplot. Also, trying to place a line of best fit
on this scatterplot is pointless as it simply approximates <code class="language-plaintext highlighter-rouge">y=x</code> when the time span is
as short as 4 weeks. Among the states without stay-at-home orders, @boriquagato’s
version shows 4 of the 7 states seeing their rank worsen, but in reality it happened to
6 of the 7 states. This is evidence that stay-at-home orders are effective at
reducing mortality:</p>
<p class="no-margins"><a href="../assets/covid19_scatterplot1.png"><img src="../assets/covid19_scatterplot1.png" alt="Corrected ranking" title="Corrected ranking" class="pure-img" /></a></p>
<p>Secondly, @boriquagato selected a premature time period that failed to fully capture the
effectiveness of stay-at-home orders. For example look at the District of Columbia:</p>
<p class="no-margins"><img src="../assets/covid19_us_deaths_per_capita_snippet.png" alt="Daily COVID-19 deaths per million for D.C." title="Daily COVID-19 deaths per million for D.C." class="pure-img" /></p>
<p>In the case of DC the stay-at-home order was clearly effective because daily death growth peaked then
started declining. However the time period 11 April (red dashes) through 08 May
(blue dashes) is too early to capture the decline in full; it should have been shifted
by a minimum of ~9 days. So I recreated @boriquagato’s scatterplot over <del>20 April–15 May</del>
<strong>[Update 2020-06-20:</strong> I updated the chart to cover the longer time period 20 April–20 June<strong>]</strong>:</p>
<p class="no-margins"><a href="../assets/covid19_scatterplot2.png"><img src="../assets/covid19_scatterplot2.png" alt="Better ranking" title="Better ranking" class="pure-img" /></a></p>
<p>On average stay-at-home states gained 1.2 ranks, while states without the
policy lost 7.7 ranks. In fact, every red state moved above the diagonal.
Can we evaluate the statistical significance of this? I assumed the null
hypothesis that these states did not lose ranks, and computed a p-value of
0.0047! This is statistically significant evidence that the lack of a
stay-at-home policy is strongly correlated with increased deaths.</p>
<p>However a scatterplot of state rankings is a crude tool to validate the
effectiveness of stay-at-home orders. A more refined visualization is to
produce the same scatterplot but with the axes reflecting deaths/million
instead of rankings:</p>
<p class="no-margins"><a href="../assets/covid19_scatterplot3_dpm.png"><img src="../assets/covid19_scatterplot3_dpm.png" alt="Plotting deaths per million" title="Plotting deaths per million" class="pure-img" /></a></p>
<p>Again, no surprises here: the states without stay-at-home orders deviate upward from the
linear regression curve (dashed line.) Deaths in states without stay-at-home orders
universally increased faster than the national average.</p>
<p>This chart is a different representation of the same data as the scatterplot above,
showing deaths/capita evolving through time:</p>
<p class="no-margins"><a href="../assets/covid19_deathlog.png"><img src="../assets/covid19_deathlog.png" alt="Cumulative COVID-19 deaths per million, for each US state" title="Cumulative COVID-19 deaths per million, for each US state" class="pure-img" /></a></p>
<h2 id="summary">Summary</h2>
<p>The scatterplot created by @boriquagato is incorrect, probably because of
errors in their (undocumented) data sources or data input. A corrected scatterplot shows
that states not implementing such orders fared noticeably worse than states
that did. This is evidence that stay-at-home orders are effective.</p>
<p>Nonetheless, a scatterplot of states ranked by <em>cumulative</em> deaths per capita remains
an inferior tool to validate the effectiveness of stay-at-home orders, because
of the huge, roughly 100-fold difference in death rates between states, a difference that
existed even before states started implementing stay-at-home orders.</p>
<p>A superior methodology is to look at the trend of <em>daily</em> deaths per capita.
And indeed, among the 43 states—plus DC—that implemented stay-at-home orders,
36 states and DC (84%) demonstrate a decrease in daily death growth correlated with
the timing of the order, while 4 states exhibit unexplained anomalous trends, and in
3 states unexpected daily death growth can be attributed to external factors
(deaths in long-term care facilities.) Conversely, the 7 states that did not
implement stay-at-home orders exhibit daily death growth that persists to the
present day. This is further evidence that stay-at-home orders are effective.</p>
<p>This analysis could be improved if states reported deaths by date of death,
like Indiana and Mississippi. Most states do not provide such data, which
creates anomalies in daily death curves, as states sometimes belatedly report
deaths from weeks prior.</p>
<p>Another improvement would be possible if states provided separate statistics
for deaths in long-term care facilities. Deaths in such facilities account for
a significant number of total deaths, and are largely uncorrelated with whether or
not a state implemented a stay-at-home order.</p>
<p>Finally, the findings in this post are largely supported by third-party research showing
that social distancing does reduce the daily growth rate of the epidemic:</p>
<ul>
<li><a href="https://www.nature.com/articles/s41586-020-2404-8">The effect of large-scale anti-contagion policies on the COVID-19 pandemic</a></li>
<li><a href="https://www.nature.com/articles/s41586-020-2405-7">Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe</a></li>
<li><a href="https://science.sciencemag.org/content/early/2020/05/14/science.abb9789">Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions</a></li>
<li><a href="https://www.medrxiv.org/content/10.1101/2020.05.07.20092353v1">Social Distancing is Effective at Mitigating COVID-19 Transmission in the United States</a></li>
<li><a href="https://arxiv.org/abs/2004.06098">The effect of stay-at-home orders on COVID-19 cases and fatalities in the United States</a></li>
<li><a href="https://www.nber.org/papers/w27091">When Do Shelter-in-Place Orders Fight COVID-19 Best? Policy Heterogeneity Across States and Adoption Time</a></li>
<li><a href="https://www.nber.org/papers/w26906">Human Mobility Restrictions and the Spread of the Novel Coronavirus (2019-nCoV) in China</a></li>
<li><a href="http://ftp.iza.org/dp13262.pdf">Were Urban Cowboys Enough to Control COVID-19? Local Shelter-In-Place Orders and Coronavirus Case Growth</a></li>
<li><a href="https://www.medrxiv.org/content/10.1101/2020.05.22.20110460v1">Effects of Government Mandated Social Distancing Measures on Cumulative Incidence of COVID-19 in the United States and its Most Populated Cities</a></li>
</ul>
<h2 id="responses-from-boriquagato">Responses from @boriquagato</h2>
<p>@boriquagato made false counterclaims, did not understand my data, largely ignored
my replies, and abruptly withdrew from the discussion by blocking me on Twitter:</p>
<ol>
<li>He finally provided the data source for his scatterplot, and as I
suspected <a href="https://twitter.com/zorinaq/status/1261490101739065345">his data is bogus</a>.
He used data from 2020-04-17 instead of 2020-04-11. He never replied.</li>
<li>He claimed 3 of 7 states not implementing stay-at-home orders improved their rank,
but only 1 of 7 did. And again
<a href="https://twitter.com/zorinaq/status/1261755723777626113">his data is bogus</a>.</li>
<li>He wrongly claimed states not implementing lockdowns changed by 1 position in the scatterplot,
but in reality the
<a href="https://twitter.com/zorinaq/status/1261755443958837248">mean rank move is -5 and statistically significant</a>.</li>
<li>He thought my analysis of daily deaths per capita was based on raw daily death figures,
not realizing I <a href="https://twitter.com/zorinaq/status/1261775136262254597">compute a 7-day average</a>
to smooth out the spikes (the “date-of-report vs date-of-death” issue.)</li>
<li>He mistakenly thought daily deaths were supposed to decline in the entirety of the shaded areas.
I <a href="https://twitter.com/zorinaq/status/1261508447410941954">re-explained</a> they either peak or reach
a plateau, as expected. I marked the exact location of peaks and plateaus. He never replied.</li>
</ol>
<h1>The Case Fatality Ratio of the Novel Coronavirus (2019-nCoV) (2020-02-06)</h1>
<p>Estimating the case fatality ratio (CFR) of the 2019 novel coronavirus
is crucial to inform policy makers and help them make the best decisions.</p>
<p><a href="http://archive.is/raDd5">As of 7 February 2020</a> in mainland China there are 31 161 confirmed
cases, 637 deaths, and 1 540 recoveries. So some claim the case fatality ratio
is 637/31161 = 2%. But this is as naïve as claiming the recovery ratio is
1540/31161 = 5%, and thus that the fatality ratio is 95%.</p>
<p>28 984 cases (93%) are not resolved. They could either die or
recover. So in reality the CFR could be anywhere between 2% and 95%.</p>
<ul id="markdown-toc">
<li><a href="#naïve-cfr" id="markdown-toc-naïve-cfr">Naïve CFR</a></li>
<li><a href="#resolved-cfr" id="markdown-toc-resolved-cfr">Resolved CFR</a></li>
<li><a href="#lack-of-awareness" id="markdown-toc-lack-of-awareness">Lack of awareness</a></li>
<li><a href="#simulating-naïve-vs-resolved-cfr" id="markdown-toc-simulating-naïve-vs-resolved-cfr">Simulating naïve vs. resolved CFR</a></li>
<li><a href="#cfr-of-2019-ncov" id="markdown-toc-cfr-of-2019-ncov">CFR of 2019-nCoV</a></li>
<li><a href="#updates--validation" id="markdown-toc-updates--validation">Updates & Validation</a></li>
</ul>
<h2 id="naïve-cfr">Naïve CFR</h2>
<p>After the SARS epidemic, a paper was published in the American Journal of
Epidemiology by Ghani et al., <em><a href="https://academic.oup.com/aje/article/162/5/479/82647">Methods for Estimating the Case Fatality Ratio
for a Novel, Emerging Infectious Disease</a></em>, where they demonstrated that
this common method to estimate the CFR was severely flawed. The authors
call it a “<em>naïve estimate</em>:”</p>
<p><em>naïve CFR = deaths / cases</em></p>
<p>They noted this method is “<em>clearly easier to describe to policy makers and the
public</em>” however it exhibits “<em>considerable bias</em>.” In the case of SARS in Hong
Kong in 2003, between 2 April and 21 May it “<em>falsely suggested a rise in the
case fatality ratio</em>” by 5x, from 2% to 11%. See the “naïve CFR” curve labelled
“<em>simple estimate 1</em>” in their <a href="https://academic.oup.com/aje/article/162/5/479/82647">figure 3a</a>:</p>
<p class="no-margins"><img src="../assets/ghani_fig3a.svg" alt="Ghani et al. figure 3a" title="Ghani et al. figure 3a" class="pure-img" /></p>
<p>In reality the true observed CFR has been about 13% during this period of time.
The naïve CFR was severely inaccurate due to “<em>simply an artifact</em>,” a
lag: “<em>the final outcome for patients [death or recovery] lagged behind their
identification by approximately 3 weeks</em>.”</p>
<p>This false rise in the naïve CFR has been observed in other outbreaks too. For example
Ebola (source: <a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(14)61706-2/fulltext">Case fatality rate for Ebola virus disease in west
Africa</a>):</p>
<p class="no-margins"><a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(14)61706-2/fulltext"><img src="../assets/ebola_cfr.jpg" alt="Naïve case fatality ratio of Ebola" title="Naïve case fatality ratio of Ebola" class="pure-img" /></a></p>
<h2 id="resolved-cfr">Resolved CFR</h2>
<p>Ghani et al. recommend two other methods to better estimate the CFR. One
eliminates the lag between identification and death or recovery by taking into
account only cases whose outcome is known, or <em>resolved</em>. Ghani et al. refer to
it as the “<em>simple estimate 2</em>” (e₂) but for clarity I suggest calling it the
<em>resolved CFR</em>:</p>
<p><em>resolved CFR = deaths / (deaths + recoveries)</em></p>
<p>Another method is based on the Kaplan-Meier survival procedure, which I will
not describe here.</p>
<p>The authors conclude that both methods, the resolved CFR
and Kaplan-Meier, “<em>adequately estimated the case fatality ratio during the
SARS epidemic</em>.” As their <a href="https://academic.oup.com/aje/article/162/5/479/82647">figure 3a</a> shows the resolved CFR tracks
the true observed CFR much more closely than the naïve CFR:</p>
<p class="no-margins"><img src="../assets/ghani_fig3a.svg" alt="Ghani et al. figure 3a" title="Ghani et al. figure 3a" class="pure-img" /></p>
<p>Of course these methods have caveats, but it is very clear
from their data that the naïve CFR is the least accurate
estimate of all methods, severely underestimates the true CFR, and only
converges toward it near the end of the outbreak.</p>
<p>In the 15 years since its publication, Ghani et al.’s paper has been cited by at least 59 others.
For example <em><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4540071/">Estimating Absolute and Relative Case Fatality Ratios from
Infectious Disease Surveillance Data</a></em> supports it and concludes that
“<em>the naïve estimator is virtually always biased, often severely so</em>.”</p>
<p>Since the start of the 2019-nCoV outbreak, another expert has specifically
come forward to emphasize the resolved CFR is the right method: see this
<a href="https://issues.org/clarity-please-on-the-coronavirus-statistics/">article in Issues in Science and Technology</a> by medical doctor Maimuna
Majumder, PhD.</p>
<p>The epidemiological literature overwhelmingly supports the resolved CFR as
a better estimate than the naïve CFR, period.</p>
<p><strong>[Edit:</strong> 6 months later, in August 2020, the WHO published a <a href="https://apps.who.int/iris/handle/10665/333642">scientific brief</a>
that explains the issues with the naïve CFR method, and suggests the
resolved CFR as a simple solution.<strong>]</strong></p>
<h2 id="lack-of-awareness">Lack of awareness</h2>
<p>Unfortunately, 15 years after Ghani et al.’s paper, the public,
journalists, and even many in the medical field have learned nothing from
epidemiologists. They continue to use the naïve CFR. They appear unaware it
often underestimates the true CFR and is expected to rise over time (“false
rise”.)</p>
<p>For example the scientific director of the World Health Organization’s SARS
investigation (of all people!) appeared unaware. A 2003 <a href="https://www.nytimes.com/2003/04/22/world/death-rate-from-virus-more-than-doubles-varying-sharply-by-country.html">New York Times
article</a> wrote: “<em>the death rate has also steadily risen, leaving health
officials worried. Lacking a precise explanation for the rise, health
officials have generated a number of theories. In outbreaks of other new
infections, the death rate has usually fallen with time. ‘‘It’s worrying, and
we hope it is not an indication of a continuing trend,’’ said Dr. Klaus Stöhr,
scientific director of the W.H.O.’s SARS investigation.”</em></p>
<p>The WHO was subsequently criticized in <a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-985X.2004.00345.x">A comparison study of realtime fatality
rates: severe acute respiratory syndrome in Hong Kong, Singapore, Taiwan,
Toronto and Beijing, China</a> (<a href="http://chao.stat.nthu.edu.tw/wordpress/paper/2005_JRSSA_168_P233.pdf">PDF version</a>,) where the
authors write: “<em>While the outbreak was on going and there were patients still
in hospitals over the course of the epidemic, the WHO estimate assumed
implicitly that all remaining SARS in-patients would eventually recover. It
therefore led to an underestimation of the true case fatality rate. For
example, in the midst of the SARS outbreak at April 15th, 2003, the fatality
rate in Hong Kong was 4.5% according to the WHO estimate, but it hit a record
high of 17.0% at the end of the epidemic.</em>”</p>
<p>Have the WHO learned anything since 2003? No! Another <a href="https://apps.who.int/iris/handle/10665/333642">WHO
representative</a> recently still used the naïve CFR method, which yields
the 2% figure:
“<em>WHO representative to the Philippines Dr. Rabindra Abeyasinghe noted
that the 2019-nCoV’s death rate fell to about 2 percent</em>.”</p>
<h2 id="simulating-naïve-vs-resolved-cfr">Simulating naïve vs. resolved CFR</h2>
<p>To demonstrate how inaccurate the naïve CFR is compared to the resolved CFR, I
wrote a Python script that simulates an outbreak infecting a population of 100k
individuals over 200 days following a logistic growth curve which increases
gradually at first, more rapidly in the middle growth period, and slowly at the
end, eventually leveling off. The disease has a 50% probability of causing
death 21 days after infection.</p>
<p>The results are obvious: the naïve CFR underestimates the true CFR by 5x
(it starts at 9%) and only converges toward the true CFR (50%) near the end of
the epidemic. By comparison the resolved CFR is a <em>perfect</em> estimate at any
point:</p>
<p class="no-margins"><a href="../assets/outbreak1.svg"><img src="../assets/outbreak1.svg" alt="Simulated outbreak" title="Simulated outbreak" class="pure-img" /></a></p>
<p>In a second simulation, I changed the parameter <code class="language-plaintext highlighter-rouge">time_to_heal</code> to 28 days
in order to simulate recovery taking longer than death: deaths still happen
21 days after the infection, but recoveries happen 28 days after. The only
difference this creates is that at the beginning and middle of the epidemic
the resolved CFR slightly overestimates the true CFR by a factor of about 1.26x
(63%/50%):</p>
<p class="no-margins"><a href="../assets/outbreak2.svg"><img src="../assets/outbreak2.svg" alt="Simulated outbreak" title="Simulated outbreak" class="pure-img" /></a></p>
<p>Three things are very clear:</p>
<ol>
<li>The naïve CFR is always a severe underestimate in the beginning and middle phase of an outbreak.</li>
<li>The naïve CFR is bound to increase over time.</li>
<li>The resolved CFR is always much more accurate than the naïve CFR. At worst the resolved CFR is off by 1.26x while the naïve CFR is off by 5x.</li>
</ol>
<p>Want to play with my simulation? Here is the source code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1">#!/usr/bin/python3
</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="n">population</span> <span class="o">=</span> <span class="mf">100e3</span>
<span class="n">days</span> <span class="o">=</span> <span class="mi">200</span>
<span class="n">death_prob</span> <span class="o">=</span> <span class="mf">0.50</span>
<span class="n">time_to_death</span> <span class="o">=</span> <span class="mi">21</span>
<span class="n">time_to_heal</span> <span class="o">=</span> <span class="mi">21</span>
<span class="n">hist</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">deaths</span> <span class="o">=</span> <span class="n">recovs</span> <span class="o">=</span> <span class="n">naive_cfr</span> <span class="o">=</span> <span class="n">resolved_cfr</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">print</span><span class="p">(</span><span class="s">'day,cases,deaths,recoveries,naive_cfr,resolved_cfr'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">days</span><span class="p">):</span>
<span class="k">if</span> <span class="n">d</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">cases</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">cases</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="n">population</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">math</span><span class="p">.</span><span class="n">e</span><span class="o">**</span><span class="p">(</span><span class="o">-</span><span class="mf">0.08</span><span class="o">*</span><span class="p">(</span><span class="n">d</span> <span class="o">-</span> <span class="n">days</span><span class="o">/</span><span class="mi">2</span><span class="p">))))</span>
<span class="n">hist</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">cases</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">hist</span><span class="p">)</span> <span class="o">>=</span> <span class="n">time_to_death</span> <span class="o">+</span> <span class="mi">2</span><span class="p">:</span>
<span class="n">deaths</span> <span class="o">+=</span> <span class="nb">round</span><span class="p">((</span><span class="n">hist</span><span class="p">[</span><span class="n">time_to_death</span><span class="p">]</span> <span class="o">-</span> <span class="n">hist</span><span class="p">[</span><span class="n">time_to_death</span> <span class="o">+</span> <span class="mi">1</span><span class="p">])</span> <span class="o">*</span> \
<span class="n">death_prob</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">hist</span><span class="p">)</span> <span class="o">>=</span> <span class="n">time_to_heal</span> <span class="o">+</span> <span class="mi">2</span><span class="p">:</span>
<span class="n">recovs</span> <span class="o">+=</span> <span class="nb">round</span><span class="p">((</span><span class="n">hist</span><span class="p">[</span><span class="n">time_to_heal</span><span class="p">]</span> <span class="o">-</span> <span class="n">hist</span><span class="p">[</span><span class="n">time_to_heal</span> <span class="o">+</span> <span class="mi">1</span><span class="p">])</span> <span class="o">*</span> \
<span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">death_prob</span><span class="p">))</span>
<span class="k">if</span> <span class="n">d</span> <span class="o">></span> <span class="nb">max</span><span class="p">(</span><span class="n">time_to_death</span><span class="p">,</span> <span class="n">time_to_heal</span><span class="p">):</span>
<span class="k">if</span> <span class="n">cases</span><span class="p">:</span>
<span class="n">naive_cfr</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">deaths</span> <span class="o">/</span> <span class="n">cases</span>
<span class="k">if</span> <span class="n">deaths</span> <span class="o">+</span> <span class="n">recovs</span><span class="p">:</span>
<span class="n">resolved_cfr</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">deaths</span> <span class="o">/</span> <span class="p">(</span><span class="n">deaths</span> <span class="o">+</span> <span class="n">recovs</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'{day},{cases},{deaths},{recovs},{naive_cfr},{resolved_cfr}'</span><span class="p">.</span>\
<span class="nb">format</span><span class="p">(</span><span class="n">day</span><span class="o">=</span><span class="n">d</span><span class="p">,</span> <span class="n">cases</span><span class="o">=</span><span class="n">cases</span><span class="p">,</span> <span class="n">deaths</span><span class="o">=</span><span class="n">deaths</span><span class="p">,</span> <span class="n">recovs</span><span class="o">=</span><span class="n">recovs</span><span class="p">,</span>
<span class="n">naive_cfr</span><span class="o">=</span><span class="n">naive_cfr</span><span class="p">,</span> <span class="n">resolved_cfr</span><span class="o">=</span><span class="n">resolved_cfr</span><span class="p">))</span></code></pre></figure>
<h2 id="cfr-of-2019-ncov">CFR of 2019-nCoV</h2>
<p>We know the naïve CFR (2%) is inaccurate and will increase, as seen in previous
outbreaks (SARS, Ebola) and in the simulations above.</p>
<p>We know the resolved CFR is much more accurate, and as of <a href="http://archive.is/raDd5">7 February
2020</a> it stands at deaths/(deaths+recoveries) = 637/(637+1540) = 29%,
which is rather alarming.</p>
<p>But we should interpret any CFR estimate with caution:</p>
<ul>
<li>We are very early in the outbreak, with very few resolved cases (a few thousands) so the resolved CFR is fluctuating a lot from day to day</li>
<li>China’s official statistics may be underestimating the number of deaths (e.g. patients who die while suspected of, but not yet confirmed as having, 2019-nCoV are not counted toward deaths caused by the virus)</li>
<li>There may be many undetected/underreported mild cases that heal on their own and thus are not counted in the number of recoveries (these cases could be detected when/if serosurveys are performed)</li>
</ul>
<p>Nonetheless we can take a stab at guessing a best-case scenario for a low
fatality ratio. Let’s assume there are 50x more cases than reported, all mild,
so 50x more recoveries. And let’s assume there are only 2x more deaths than
reported.</p>
<p>With these parameters the CFR would be 637*2/(637*2+1540*50)
= 1.6%, which is still concerning, because if there are 50x more cases than
reported, then the virus is spreading far faster than we think. A pandemic
would then be unavoidable, and 1.6% would be 16x deadlier than the seasonal flu
(0.1%.) The flu kills half a million worldwide every year, so 2019-nCoV would
kill a few million.</p>
<p>You can play with the parameters, but either way 2019-nCoV is not looking good
at all.</p>
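<p>For convenience, here is that calculation as a snippet you can tweak (a
sketch of the back-of-the-envelope estimate above; the multipliers are my
guesses, not data):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">#!/usr/bin/python3
# Resolved CFR under assumed under-reporting of cases and deaths.
deaths, recoveries = 637, 1540   # official figures as of 7 February 2020
death_mult = 2                   # assume 2x more deaths than reported
case_mult = 50                   # assume 50x more cases, all mild...
d = deaths * death_mult
r = recoveries * case_mult       # ...so 50x more recoveries
print('resolved CFR = {:.1%}'.format(d / (d + r)))   # prints 1.6%
</code></pre></figure>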
<p>My parameters of 50x more cases than reported and 2x more deaths than reported
are equivalent to assuming the same number of deaths but <strong>25x</strong> more cases than reported
(divide both multipliers by 2; the ratio is unchanged), as of 7 February 2020.
This assumption is roughly consistent with other
estimates:</p>
<ul>
<li><a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30260-9/fulltext">Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study</a>
estimates 75 815 cases as of 25 January 2020, which is <strong>38x</strong> more than the 1 975
cases officially reported by China on that date.</li>
<li>The Imperial College London MRC Centre for Global Infectious Disease Analysis estimates
in <a href="https://www.imperial.ac.uk/media/imperial-college/medicine/mrc-gida/2020-02-10-COVID19-Report-4.pdf">Report 4: Severity of 2019-novel coronavirus</a> as of 3 February 2020
that there are <strong>19x</strong> or <strong>26x</strong> more infections than reported, according to
2 scenarios.</li>
<li><a href="https://www.mdpi.com/2077-0383/9/2/419">The Rate of Underascertainment of Novel Coronavirus (2019‐nCoV) Infection:
Estimation Using Japanese Passengers Data on Evacuation Flights</a> estimates
<strong>11x</strong> more infections than reported (“<em>the ascertainment rate of infection was estimated at 9.2%</em>.”)</li>
</ul>
<h2 id="updates--validation">Updates & Validation</h2>
<p><strong>As of <a href="http://archive.is/DCkmo">16 February 2020</a> the resolved CFR based on China’s official
statistics stands at 1863/(1863+10844) = 15%. And when accounting for
underreported mild cases, I still believe my guess of 1.6%, made on 7 February
2020, is reasonable.</strong></p>
<p>On 10 February 2020,
the Imperial College London MRC Centre for Global Infectious Disease Analysis
<a href="https://www.imperial.ac.uk/media/imperial-college/medicine/mrc-gida/2020-02-10-COVID19-Report-4.pdf">published Report 4: Severity of 2019-novel coronavirus</a>
that estimates the CFR based on China’s statistics at <strong>18%</strong>, or <strong>0.8-0.9%</strong>
when accounting for underreported mild cases. Both figures approximately match my
estimates (respectively 15% and 1.6%.)</p>
<p>On 14 February 2020, <a href="https://www.mdpi.com/2077-0383/9/2/523">Real-Time Estimation of the Risk of Death from Novel
Coronavirus (COVID-19) Infection: Inference Using Exported Cases</a>
estimated the IFR at <strong>0.5-0.8%</strong>. The Infection Fatality Ratio is the same
as a CFR estimate that corrects for underreported cases. It approximately
matches my estimate (1.6%.)</p>
<p>On 19 February 2020, the Institute for Disease Modeling <a href="https://institutefordiseasemodeling.github.io/nCoV-public/analyses/first_adjusted_mortality_estimates_and_risk_assessment/2019-nCoV-preliminary_age_and_time_adjusted_mortality_rates_and_pandemic_risk_assessment.html">estimated the
IFR</a> at <strong>0.94%</strong>. It approximately matches my estimate (1.6%.)</p>
<p>The 3 reports above are referenced by the WHO in <a href="https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200219-sitrep-30-covid-19.pdf?sfvrsn=3346b04f_2">Coronavirus disease situation
report - 30</a> (references 10, 11 and 12) and in <a href="https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200220-sitrep-31-covid-19.pdf">situation report 31</a>.</p>
<p>As of 21 February 2020, a researcher from the Institute of Social and
Preventive Medicine, University of Bern <a href="https://github.com/calthaus/ncov-cfr/blob/d30f02e1c20e06103aad6a04eb492dd156466d98/README.md">estimated the CFR</a> using a
methodology to correct for possible underreporting of mild cases, and found
<strong>1.6%</strong>. It exactly matches my estimate for this scenario (1.6%.)</p>
<p>As of 30 March 2020, a peer-reviewed paper in The Lancet, <a href="https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(20)30243-7/fulltext">Estimates of the
severity of coronavirus disease 2019: a model-based analysis</a>,
estimated the IFR at <strong>0.66%</strong>. It approximately matches my estimate (1.6%)
within a two-fold factor.</p>
Reverse Engineering and Cloning the Costco Digital Membership Cardhttp://blog.zorinaq.com/reverse-engineering-costco-digital-membership-card/2019-10-09T00:00:00-07:00<p>In this post I will reverse engineer the Costco Android app to find out how the
digital membership card works, and how to generate the time-based token encoded
in its QR code. I reimplemented it in plain HTML/JS. As a result the card can
be cloned to any number of devices, bypassing the app’s security mechanism that
attempted to tie the card to only one device.</p>
<p><strong>2022-10-11</strong> Three years later, this hack still works, only with minor changes.
The class that fetches ANDROID_ID is now MembershipCardUtilImpl. I used apktool
to decode the Costco Android app APK file version 6.7.0; I applied
<a href="../assets/costco-patch-6.7.0.txt">this new patch</a> (just make sure to edit ANDROID_ID
to a unique random value of your choosing); I rebuilt the APK and installed it.
The app still generates the same QR code as my <a href="../assets/costco_cardgen/">card generator</a>. I find
it easier to simply hardcode the ANDROID_ID, as this avoids the need to then log it
after installing the app.</p>
<ul id="markdown-toc">
<li><a href="#why" id="markdown-toc-why">Why?</a></li>
<li><a href="#costco-goes-digital" id="markdown-toc-costco-goes-digital">Costco goes digital</a></li>
<li><a href="#reverse-engineering-the-app" id="markdown-toc-reverse-engineering-the-app">Reverse engineering the app</a></li>
<li><a href="#custom-card-generator" id="markdown-toc-custom-card-generator">Custom card generator</a></li>
<li><a href="#final-thoughts" id="markdown-toc-final-thoughts">Final thoughts</a></li>
<li><a href="#disclosure-timeline" id="markdown-toc-disclosure-timeline">Disclosure timeline</a></li>
</ul>
<h2 id="why">Why?</h2>
<p>I reverse engineered the app mainly out of curiosity, but also for efficiency:
my reimplementation of the card loads instantly compared to their app, and I
can factory reset or replace my phone and the card will continue to work,
whereas the Costco app will force you to “transfer” it to a new device (and it
allows only one transfer, after which customer service must be contacted.)</p>
<p>In the end this project was trivial and serves as a nice introduction to
Android reverse engineering.</p>
<p>But first, a story:</p>
<p>In 2004 shortly after I moved to the US, I was driving around my
neighborhood and saw a big-box store that I decided to visit. It was a Costco
and I did not know it was a membership-only store. I managed to get in through
sheer luck. I did not notice shoppers were required to flash their membership
cards at the greeter at the entrance. In fact I barely
noticed her, as she was standing discreetly at the side of the oversized
entrance. She did not stop me.</p>
<p>So I got in, shopped for a bit, and when it was my turn to pay at the
checkout counter the clerk asked for my card.</p>
<p>«<em>My what?</em>»</p>
<p>«<em>Sir, you don’t have a membership card?</em>»</p>
<p>Fast forward 15 years: I love Costco. I take my family there to shop
for items in bulk that I probably do not <em>need</em> in bulk,
but it is just so convenient. However I am always looking to carry fewer things
in my wallet, and Costco has been annoying in that regard, as one must carry
and present this stinking plastic magnetic stripe membership card.</p>
<h2 id="costco-goes-digital">Costco goes digital</h2>
<p>Two months ago they finally introduced a <em>digital membership card</em> on
their Costco smartphone app:</p>
<p class="no-margins"><img src="../assets/costco-dmc.jpg" alt="Costco digital membership card" title="Costco digital membership card" class="pure-img" /></p>
<p>So I install their Android app (version: 4.2.1,
APK SHA256: 29b10840e299b31213b89788554da0038901ef878e2a076e45b221a65fc4f222.)
My first impression is that it is very heavy given how little it does:
68 MB and it is mostly a WebView shim of their online store.
It is slow to load and takes ~6 seconds just to get to the
digital membership card view which is showing:</p>
<ul>
<li>Membership type icon (standard, executive…)</li>
<li>First and last name</li>
<li>Membership number</li>
<li>Membership start and expiration</li>
<li>Photo</li>
<li>QR code</li>
</ul>
<p>Let’s try to take a screenshot… it does not work. The app does not allow it as
it sets <a href="https://developer.android.com/reference/android/view/WindowManager.LayoutParams#FLAG_SECURE">FLAG_SECURE</a> on this window.</p>
<h2 id="reverse-engineering-the-app">Reverse engineering the app</h2>
<p>I grab the APK file, run <a href="https://ibotpeaches.github.io/Apktool/"><code class="language-plaintext highlighter-rouge">apktool decode</code></a> to unpack and disassemble
it, and recursively grep for <code class="language-plaintext highlighter-rouge">setFlags</code> to look for a call passing the value
FLAG_SECURE (0x2000).</p>
<p>The app is not obfuscated so it is easy to find the relevant
<code class="language-plaintext highlighter-rouge">setFlags</code> call that needs to be modified. It is in the file <code class="language-plaintext highlighter-rouge">MembershipCardActivity.smali</code>,
method <code class="language-plaintext highlighter-rouge">onCreate()</code>. Just zero out the flags to disable FLAG_SECURE:</p>
<figure class="highlight"><pre><code class="language-diff" data-lang="diff"><span class="gh">diff -Nur com.costco.app.android_4.2.1.orig/smali/com/costco/app/android/digitalmembership/MembershipCardActivity.smali com.costco.app.android_4.2.1/smali/com/costco/app/android/digitalmembership/MembershipCardActivity.smali
</span><span class="gd">--- com.costco.app.android_4.2.1.orig/smali/com/costco/app/android/digitalmembership/MembershipCardActivity.smali 2019-10-07 21:06:54.186134497 -0700
</span><span class="gi">+++ com.costco.app.android_4.2.1/smali/com/costco/app/android/digitalmembership/MembershipCardActivity.smali 2019-10-06 15:57:21.298978426 -0700
</span><span class="p">@@ -258,7 +258,7 @@</span>
move-result-object p1
<span class="gd">- const/16 v0, 0x2000
</span><span class="gi">+ const/16 v0, 0x0000
</span>
invoke-virtual {p1, v0, v0}, Landroid/view/Window;->setFlags(II)V
</code></pre></figure>
<p>I rebuild the APK with <code class="language-plaintext highlighter-rouge">apktool build</code>, sign it, zipalign it. I uninstall Costco’s
official APK, install my patched APK, and launch the app.</p>
<p>It asks me if I want to “transfer” my digital membership card to this new
device, hmm… My rebuild caused that? (I will find out later
the technical reason why this happened.) Anyway, I allow it, open
the digital card, and taking a screenshot is now possible:</p>
<p class="no-margins"><img src="../assets/costco-screenshot.png" alt="Screenshot of Costco digital membership card" title="Screenshot of Costco digital membership card" class="pure-img" /></p>
<p>(All my data in this screenshot has been altered for privacy. My real Costco photo is a lot worse ☺)</p>
<p>The QR code contains a numeric value where MMM… is the membership number:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>96000MMMMMMMMMMMMM000114362136
</code></pre></div></div>
<p>So is cloning the digital membership card as easy as taking a screenshot? Nope
because the last 8 digits of the QR code value appear to change every once in a
while.</p>
<p>Let’s investigate how the app builds the QR code.</p>
<p>I fire up the <a href="https://github.com/skylot/jadx">jadx</a> decompiler as it produces Java code more readable
than smali code. All the relevant code
seems to be under <code class="language-plaintext highlighter-rouge">com.costco.app.android.digitalmembership</code>;
in particular <code class="language-plaintext highlighter-rouge">MembershipCardFragment.setMembershipQRCode()</code>:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">private</span> <span class="kt">void</span> <span class="nf">setMembershipQRCode</span><span class="o">()</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="na">card</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">&&</span> <span class="n">getContext</span><span class="o">()</span> <span class="o">!=</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">StringBuilder</span> <span class="n">stringBuilder</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">StringBuilder</span><span class="o">();</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="no">APPLICATION_IDENTIFIER</span><span class="o">);</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="no">SUB_TYPE</span><span class="o">);</span>
<span class="kt">int</span> <span class="n">length</span> <span class="o">=</span> <span class="k">this</span><span class="o">.</span><span class="na">card</span><span class="o">.</span><span class="na">getMemberCardNumber</span><span class="o">().</span><span class="na">length</span><span class="o">();</span>
<span class="k">if</span> <span class="o">(</span><span class="n">length</span> <span class="o"><=</span> <span class="mi">13</span><span class="o">)</span> <span class="o">{</span>
<span class="k">while</span> <span class="o">(</span><span class="n">length</span> <span class="o"><</span> <span class="mi">13</span><span class="o">)</span> <span class="o">{</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"0"</span><span class="o">);</span>
<span class="n">length</span><span class="o">++;</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">if</span> <span class="o">(!</span><span class="nc">StringUtils</span><span class="o">.</span><span class="na">isNullOrEmpty</span><span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="na">card</span><span class="o">.</span><span class="na">getMemberCardNumber</span><span class="o">()))</span> <span class="o">{</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="na">card</span><span class="o">.</span><span class="na">getMemberCardNumber</span><span class="o">());</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="nc">MembershipCardUtils</span><span class="o">.</span><span class="na">generateDynamicToken</span><span class="o">(</span><span class="n">getContext</span><span class="o">()));</span>
<span class="k">this</span><span class="o">.</span><span class="na">qrCodeImage</span><span class="o">.</span><span class="na">setImageBitmap</span><span class="o">(</span><span class="nc">MembershipCardUtils</span><span class="o">.</span><span class="na">encodeAsBitmap</span><span class="o">(</span>
<span class="n">getContext</span><span class="o">(),</span> <span class="n">stringBuilder</span><span class="o">.</span><span class="na">toString</span><span class="o">()));</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>And <code class="language-plaintext highlighter-rouge">MembershipCardUtils.generateDynamicToken()</code> generates the last digits
of the QR code:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">static</span> <span class="nc">String</span> <span class="nf">generateDynamicToken</span><span class="o">(</span><span class="nc">Context</span> <span class="n">context</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">MessageDigest</span> <span class="n">instance</span><span class="o">;</span>
<span class="nc">String</span> <span class="n">deviceId</span> <span class="o">=</span> <span class="n">getDeviceId</span><span class="o">(</span><span class="n">context</span><span class="o">);</span>
<span class="kt">long</span> <span class="n">currentTimeMillis</span> <span class="o">=</span> <span class="nc">System</span><span class="o">.</span><span class="na">currentTimeMillis</span><span class="o">()</span> <span class="o">/</span> <span class="mi">300000</span><span class="o">;</span>
<span class="nc">StringBuilder</span> <span class="n">stringBuilder</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">StringBuilder</span><span class="o">();</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">deviceId</span><span class="o">);</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="no">SALT</span><span class="o">);</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">currentTimeMillis</span><span class="o">);</span>
<span class="n">deviceId</span> <span class="o">=</span> <span class="n">stringBuilder</span><span class="o">.</span><span class="na">toString</span><span class="o">();</span>
<span class="k">try</span> <span class="o">{</span>
<span class="n">instance</span> <span class="o">=</span> <span class="nc">MessageDigest</span><span class="o">.</span><span class="na">getInstance</span><span class="o">(</span><span class="s">"SHA-256"</span><span class="o">);</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">NoSuchAlgorithmException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="n">e</span><span class="o">.</span><span class="na">printStackTrace</span><span class="o">();</span>
<span class="n">instance</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
<span class="o">}</span>
<span class="n">deviceId</span> <span class="o">=</span> <span class="nc">Integer</span><span class="o">.</span><span class="na">toString</span><span class="o">(</span><span class="nc">Integer</span><span class="o">.</span><span class="na">parseInt</span><span class="o">(</span><span class="n">bytesToHex</span><span class="o">(</span><span class="n">instance</span><span class="o">.</span><span class="na">digest</span><span class="o">(</span>
<span class="n">deviceId</span><span class="o">.</span><span class="na">getBytes</span><span class="o">(</span><span class="nc">StandardCharsets</span><span class="o">.</span><span class="na">UTF_8</span><span class="o">))).</span><span class="na">substring</span><span class="o">(</span><span class="mi">0</span><span class="o">,</span> <span class="mi">6</span><span class="o">),</span> <span class="mi">16</span><span class="o">));</span>
<span class="nc">StringBuilder</span> <span class="n">stringBuilder2</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">StringBuilder</span><span class="o">();</span>
<span class="n">stringBuilder2</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="no">RESERVED_FOR_FUTURE</span><span class="o">);</span>
<span class="n">stringBuilder2</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="no">DIGITAL_TOKEN_VERSION</span><span class="o">);</span>
<span class="k">while</span> <span class="o">(</span><span class="n">deviceId</span><span class="o">.</span><span class="na">length</span><span class="o">()</span> <span class="o"><</span> <span class="no">TOKEN_LENGTH</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">StringBuilder</span> <span class="n">stringBuilder3</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">StringBuilder</span><span class="o">();</span>
<span class="n">stringBuilder3</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="nc">ApiErrorCode</span><span class="o">.</span><span class="na">UNKNOWN</span><span class="o">);</span>
<span class="n">stringBuilder3</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">deviceId</span><span class="o">);</span>
<span class="n">deviceId</span> <span class="o">=</span> <span class="n">stringBuilder3</span><span class="o">.</span><span class="na">toString</span><span class="o">();</span>
<span class="o">}</span>
<span class="n">stringBuilder2</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">deviceId</span><span class="o">);</span>
<span class="k">return</span> <span class="n">stringBuilder2</span><span class="o">.</span><span class="na">toString</span><span class="o">();</span>
<span class="o">}</span></code></pre></figure>
<p>A device identifier is obtained from <code class="language-plaintext highlighter-rouge">getDeviceId()</code> which simply returns
the Android platform identifier <a href="https://developer.android.com/reference/android/provider/Settings.Secure#ANDROID_ID">ANDROID_ID</a>.</p>
<p><code class="language-plaintext highlighter-rouge">SALT</code> is the literal string <code class="language-plaintext highlighter-rouge">"SCOTTJOHN"</code> (if you are John Scott and wrote this code, let’s have a beer one day!)</p>
<p>From the code above it is obvious how the QR code numeric value is generated.
There are 4 constant fields and 2 variable fields:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>96000MMMMMMMMMMMMM0001TTTTTTTT
96 ........................... APPLICATION_IDENTIFIER
000 ........................ SUB_TYPE
MMMMMMMMMMMMM ........... membership number padded to 13 digits
00 ......... RESERVED_FOR_FUTURE
01 ....... DIGITAL_TOKEN_VERSION
                      TTTTTTTT dynamic token padded to 8 digits
</code></pre></div></div>
<p>The dynamic token is generated by computing the SHA-256 hash of the concatenation of 3 strings:</p>
<ol>
<li>ANDROID_ID value</li>
<li>literal string <code class="language-plaintext highlighter-rouge">"SCOTTJOHN"</code></li>
<li>decimal integer representation of <code class="language-plaintext highlighter-rouge">System.currentTimeMillis() / 300000</code></li>
</ol>
<p>The hexadecimal hash is truncated to 6 hex digits, converted to decimal, and padded to 8 digits.
The dynamic token is in essence a <strong>time-based 24-bit token with a granularity of 300 seconds or 5 minutes</strong>.</p>
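<p>Here is a condensed Python equivalent of the logic above, based on my reading
of the decompiled code (the zero-padding is my interpretation of the
TOKEN_LENGTH loop):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import hashlib
import time

def generate_dynamic_token(android_id):
    # Time window with a granularity of 300000 ms (5 minutes)
    window = int(time.time() * 1000) // 300000
    data = '{}SCOTTJOHN{}'.format(android_id, window).encode('utf-8')
    # First 6 hex digits of the SHA-256 hash, converted to decimal
    token = int(hashlib.sha256(data).hexdigest()[:6], 16)
    return str(token).zfill(8)  # left-pad to 8 digits

def qr_value(membership_number, android_id):
    # 96 + 000 + membership number padded to 13 digits + 00 + 01 + token
    return '96000{}0001{}'.format(
        membership_number.zfill(13), generate_dynamic_token(android_id))</code></pre></figure>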
<p>Using ANDROID_ID to calculate the token serves as a primitive security
mechanism to tie the card to only one device.</p>
<p>Now this explains why the app asked me earlier if I wanted to “transfer” the
card to my recompiled APK. The Android platform, since Android 8.0, generates
ANDROID_ID values unique to each combination of app-signing key, user, and
device. The patched APK is signed with my key, different from the key of the
official APK, so the two APKs see different values. When Costco’s server-side
infrastructure notices an ANDROID_ID reported by the app that is different from
what it expects, it prompts to transfer the card to “the new device.”</p>
<p>In order to generate the token myself, I need to know ANDROID_ID. So I
patch the method <code class="language-plaintext highlighter-rouge">getDeviceId()</code> to log the value (stored in register p0)
using <code class="language-plaintext highlighter-rouge">System.out.println()</code>:</p>
<figure class="highlight"><pre><code class="language-diff" data-lang="diff"><span class="gh">diff -Nur com.costco.app.android_4.2.1.orig/smali/com/costco/app/android/digitalmembership/MembershipCardUtils.smali com.costco.app.android_4.2.1/smali/com/costco/app/android/digitalmembership/MembershipCardUtils.smali
</span><span class="gd">--- com.costco.app.android_4.2.1.orig/smali/com/costco/app/android/digitalmembership/MembershipCardUtils.smali 2019-10-07 21:06:54.190134496 -0700
</span><span class="gi">+++ com.costco.app.android_4.2.1/smali/com/costco/app/android/digitalmembership/MembershipCardUtils.smali 2019-10-07 20:42:26.474302400 -0700
</span><span class="p">@@ -485,7 +485,7 @@</span>
.end method
.method public static getDeviceId(Landroid/content/Context;)Ljava/lang/String;
<span class="gd">- .locals 1
</span><span class="gi">+ .locals 3
</span> .annotation build Landroid/annotation/SuppressLint;
value = {
"HardwareIds"
<span class="p">@@ -503,6 +503,11 @@</span>
move-result-object p0
<span class="gi">+ sget-object v1, Ljava/lang/System;->out:Ljava/io/PrintStream;
+ const-string v2, "android_id="
+ invoke-virtual {v1, v2}, Ljava/io/PrintStream;->print(Ljava/lang/String;)V
+ invoke-virtual {v1, p0}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V
+
</span> return-object p0
.end method
</code></pre></figure>
<p>I build and push the APK. Now when launching the app <code class="language-plaintext highlighter-rouge">adb logcat</code> reveals ANDROID_ID:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>10-07 21:16:41.414 26112 26112 I System.out: android_id=3d2axxxxxxxxxxxx
</code></pre></div></div>
<h2 id="custom-card-generator">Custom card generator</h2>
<p>Armed with my ANDROID_ID value, and a description of the time-based token
algorithm, I was able to reimplement a digital membership card generator in
plain HTML/JS:</p>
<p><a href="../assets/costco_cardgen/">Costco digital membership card generator</a></p>
<p>To use the generator:</p>
<ol>
<li>Download the Costco app APK, either from your phone using <code class="language-plaintext highlighter-rouge">adb pull</code>, or from an <a href="https://www.apkmirror.com/apk/costco-wholesale/costco-wholesale/costco-wholesale-4-2-1-release/costco-wholesale-4-2-1-android-apk-download/">APK archive site</a></li>
<li>Decode the APK with apktool, apply my two patches, and rebuild it</li>
<li>Install the app on a phone, launch it, take a screenshot of the digital membership card</li>
<li>Run <code class="language-plaintext highlighter-rouge">adb logcat</code> and write down ANDROID_ID</li>
<li>Save my <a href="../assets/costco_cardgen/">card generator</a> and host it somewhere (1 HTML, 2 JS files, and 1 template screenshot)</li>
<li>Replace my template screenshot with your screenshot (crop it to remove the Android status bar)</li>
<li>Open the HTML file, type your ANDROID_ID and membership number when prompted (they will be saved in the URL fragment,) and a real-looking card with a valid QR code is displayed</li>
<li>Tip: scroll down by a few pixels to hide the Chrome address bar; or use “Add to Home screen” and the page will load in fullscreen mode</li>
</ol>
<p>An implementation detail to be aware of: my QR code will always decode to the
exact same numerical value as the app’s QR code, but may look different. This
is because QR codes can use 8 mask patterns for data encoding. A QR
implementation is supposed to pick the best mask pattern based on certain logic,
such as minimizing the number of consecutive same-color pixels. My QR library
and the app’s library simply have slightly differing logic in that regard. But
for the sole purpose of visually confirming whether the QR codes encode the
same data, I added a feature so that clicking the QR code cycles through the 8
possible mask patterns.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>The proper way to implement a digital membership card tied to a single
device would have been to leverage Android’s <a href="https://source.android.com/security/keystore">hardware-backed keystore</a>
with <a href="https://source.android.com/security/keystore/attestation">key attestation</a>, and to put the asymmetric cryptographic
signature of a timestamp in the QR code.</p>
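<p>For illustration only, here is a rough Python sketch of what the server-side
verification could look like under that design; every name here is
hypothetical, nothing of the sort exists in the app:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># Sketch only: the device would sign the current time window with a
# hardware-backed, attested private key; the server verifies the signature
# with the public key it has on file for that member.
import time
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

def verify_qr(device_public_key, member_number, signature):
    window = int(time.time()) // 300  # same 5-minute granularity
    message = '{}:{}'.format(member_number, window).encode()
    try:
        device_public_key.verify(signature, message, ec.ECDSA(hashes.SHA256()))
        return True
    except InvalidSignature:
        return False</code></pre></figure>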
<p>So what happened 15 years ago when I did not have a membership card? The clerk
called a security guard to escort me out. What an embarrassing moment.</p>
<p>Nowadays I present my phone to the clerk and when he scans the QR code he does
not even realize he is scanning my card generator.</p>
<h2 id="disclosure-timeline">Disclosure timeline</h2>
<p>I chose full disclosure given the nature of the vulnerability. I am also
reaching out to Costco to advise them on remediation.</p>
<p><strong>2019-10-09</strong> I send an email to the Costco app developer contact
listed on the Play Store (webservice@contactcostco.com).</p>
<p><strong>2019-10-10</strong> I receive an automated reply “we will no longer be responding to
requests via email.”</p>
<p><strong>2019-10-10</strong> Finding no security contact information whatsoever, I tweet
@Costco. I also tentatively send a message to Director of IT Security Andrew
Tuck through LinkedIn and to a handful of guesses of what might be his
corporate email address. I call the corporate headquarters and leave
a voicemail to Tuck.</p>
Reviewing Morgan Stanley's Bitcoin research reportshttp://blog.zorinaq.com/morgan-stanley-bitcoin-research-reports/2018-02-06T00:00:00-08:00<p>I reviewed the following two research reports published by analysts at
Morgan Stanley on the subject of Bitcoin electricity usage:</p>
<ol>
<li><a href="https://ny.matrix.ms.com/eqr/article/webapp/b974be48-effb-11e7-8cdb-6e28b48ebbd3">Bitcoin ASIC production substantiates electricity use; points to coming jump</a>,
by James E Faucette et al., January 3, 2018</li>
<li><a href="https://ny.matrix.ms.com/eqr/article/webapp/11d8a6c8-f081-11e7-8cdb-6e28b48ebbd3">Bitcoin demand > EV demand?</a>,
by Nicholas J Ashworth et al., January 10, 2018</li>
</ol>
<p>The first report estimates current electricity consumption at 2500 MW, agreeing
with my own estimate of <a href="../bitcoin-electricity-consumption/#summary">1620/2100/3136 MW</a> (lower bound/best guess/upper
bound) as of January 11, 2018:</p>
<p class="no-margins"><img src="../assets/bitcoin_elec_consu_ms_2018.png" alt="Chart of Bitcoin electricity consumption" title="Bitcoin electricity consumption" class="pure-img" /></p>
<p>However I spotted a few errors.</p>
<ul id="markdown-toc">
<li><a href="#1-math-error-multiplying-instead-of-dividing" id="markdown-toc-1-math-error-multiplying-instead-of-dividing">1. Math error (multiplying instead of dividing)</a></li>
<li><a href="#2-pue-of-mining-farms-as-low-as-103-133" id="markdown-toc-2-pue-of-mining-farms-as-low-as-103-133">2. PUE of mining farms as low as 1.03-1.33</a></li>
<li><a href="#3-inconsistent-pue-math" id="markdown-toc-3-inconsistent-pue-math">3. Inconsistent PUE math</a></li>
<li><a href="#4-hashrate-method-makes-optimistic-and-pessimistic-assumptions" id="markdown-toc-4-hashrate-method-makes-optimistic-and-pessimistic-assumptions">4. Hashrate method makes optimistic and pessimistic assumptions</a></li>
<li><a href="#5-only-a-fraction-of-the-ordos-farm-mines-bitcoins" id="markdown-toc-5-only-a-fraction-of-the-ordos-farm-mines-bitcoins">5. Only a fraction of the Ordos farm mines bitcoins</a></li>
<li><a href="#6-antminer-s9-dominates-the-market" id="markdown-toc-6-antminer-s9-dominates-the-market">6. Antminer S9 dominates the market</a></li>
<li><a href="#7-electricity-costs" id="markdown-toc-7-electricity-costs">7. Electricity costs</a></li>
<li><a href="#8-bitmains-direct-sales-model-one-global-price" id="markdown-toc-8-bitmains-direct-sales-model-one-global-price">8. Bitmain’s direct sales model: ONE global price</a></li>
<li><a href="#9-transaction-fees-not-accounted-for" id="markdown-toc-9-transaction-fees-not-accounted-for">9. Transaction fees not accounted for</a></li>
<li><a href="#footnotes" id="markdown-toc-footnotes">Footnotes</a></li>
</ul>
<h2 id="1-math-error-multiplying-instead-of-dividing">1. Math error (multiplying instead of dividing)</h2>
<p>The analysts attempt to forecast future consumption, 12 months from now (ca.
January 2019,) and claim it may be “<em>more than 13 500/hour [sic]
megawatts</em>.”</p>
<p>Based on TSMC production orders for 15-20k 300mm wafer-starts of Bitcoin ASICs
per month, they estimate “<em>up to 5-7.5M new rigs</em>” could be added. They claim
to calculate electricity consumption based on 6.5M, but their numbers line
up only with the upper bound 7.5M:</p>
<p>7.5M × 1300 (watts) × 1.4 (efficiency improvement) = 13 650 MW</p>
<p>The multiplication by 1.4 is meant to account for new rigs bringing a “<em>40%
efficiency improvement</em>” and this is their error: they multiply instead
of dividing.<sup id="fnref:eff" role="doc-noteref"><a href="#fn:eff" class="footnote" rel="footnote">1</a></sup> A given volume of more energy-efficient wafers/chips
consumes less, not more, per mm² of die area. When correcting this error we
arrive at an estimate of 6950 MW, about half their published number
(13 500 MW.)</p>
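<p>The arithmetic, spelled out in Python:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">rigs = 7.5e6    # upper bound of new rigs
watts = 1300.0  # per rig

print(rigs * watts * 1.4 / 1e6)  # 13650.0 MW: the analysts' multiplication
print(rigs * watts / 1.4 / 1e6)  # ~6964 MW: dividing instead, i.e. ~6950 MW</code></pre></figure>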
<h2 id="2-pue-of-mining-farms-as-low-as-103-133">2. PUE of mining farms as low as 1.03-1.33</h2>
<p>The Morgan Stanley analysts assume “<em>60% direct electricity usage (i.e. 40% of
total electricity consumption is used for non-hashing operations like cooling,
network equipment, etc.)</em>” In data center lingo this is called a <a href="https://en.wikipedia.org/wiki/Power_usage_effectiveness">PUE</a>
of 100/60 = 1.67.</p>
<p>However no study supports such terrible PUE values for the <em>mining
industry</em>.<sup id="fnref:sourceVavilov" role="doc-noteref"><a href="#fn:sourceVavilov" class="footnote" rel="footnote">2</a></sup> In reality, most mining farms aggressively optimize
their PUE:</p>
<ul>
<li>Gigawatt Mining builds <a href="https://medium.com/gigawatt">air-cooled mining farms</a> having a PUE of 1.03-1.05.<sup id="fnref:sourceGW" role="doc-noteref"><a href="#fn:sourceGW" class="footnote" rel="footnote">3</a></sup></li>
<li>Bitfury data centers are highly energy-efficient; for example
their 40 MW Norway data center has a PUE of 1.05,<sup id="fnref:moirana" role="doc-noteref"><a href="#fn:moirana" class="footnote" rel="footnote">4</a></sup> and their CEO
emphasized their Iceland data center does not have a high PUE.<sup id="fnref:sourceVavilov:1" role="doc-noteref"><a href="#fn:sourceVavilov" class="footnote" rel="footnote">2</a></sup></li>
<li>The well-known Bitmain Ordos mine reportedly has a PUE <a href="https://plus.google.com/+MarcBevand/posts/Fg8i66ATzhU">of either 1.11 or
1.33</a> (depending on which journalist’s numbers are trusted.)</li>
</ul>
<p>Google optimized their data center PUEs as low as 1.06<sup id="fnref:googlePUE" role="doc-noteref"><a href="#fn:googlePUE" class="footnote" rel="footnote">5</a></sup> and
electricity is not even one of their main costs. So it completely makes sense
to find miners, for whom electricity <em>is</em> one of their main costs, to be in the
same range.</p>
<h2 id="3-inconsistent-pue-math">3. Inconsistent PUE math</h2>
<p>According to their future consumption estimate and PUE estimate,
the resulting global consumption should be 13500 × 1.67 × 90% utilization
= 20 250 MW. But they calculate “<em>nearly 16 000 MW</em>.”</p>
<p>20250 ≠ 16000. The math is inconsistent.</p>
<p>Correcting their math and parameters gives 6950 × 1.11 (or 1.33) × 90% = 6950 (or 8300) MW.</p>
<p>In summary, Morgan Stanley’s first report forecasts the consumption ca. January
2019 will be <strong>13 500-16 000 MW (120-140 TWh/yr annualized)</strong> however
fixing multiple errors actually forecasts <strong>6950-8300 MW (60-75 TWh/yr
annualized).</strong></p>
<h2 id="4-hashrate-method-makes-optimistic-and-pessimistic-assumptions">4. Hashrate method makes optimistic and pessimistic assumptions</h2>
<p>The report claims “<em>the hash-rate methodology uses a fairly optimistic set of
efficiency assumptions</em>.” This is not true. Well, perhaps they referred to
other people’s hashrate methodology. But mine, as explained <a href="../bitcoin-electricity-consumption/#summary">in the
introduction</a>, makes optimistic <strong>and pessimistic</strong> assumptions
(miners using either the least or the most efficient ASICs.)</p>
<h2 id="5-only-a-fraction-of-the-ordos-farm-mines-bitcoins">5. Only a fraction of the Ordos farm mines bitcoins</h2>
<p>The report continues by attempting to extrapolate the global electricity
consumption from the Ordos mine:</p>
<ul>
<li>They fail to account for the fact that only 7/8th of the farm mines bitcoins.
The other 1/8th mines litecoins.</li>
<li>The media publishes slightly different power consumption numbers,
implying either <a href="https://plus.google.com/+MarcBevand/posts/Fg8i66ATzhU">29.2 or 35 MW for the Bitcoin rigs</a> (depending on
journalists.)</li>
<li>They build their calculations on a grossly rounded estimate of its hashrate
(“<em>4% of ~6M TH/s</em>”), but it can be calculated more exactly as we know
there are 21k Bitcoin rigs (~263k TH/s.)</li>
</ul>
<p>When correcting these errors, the mine’s power consumption scaled to a global
hashrate of 15.2M TH/s would imply a global power consumption of either <strong>1690
or 2020 MW</strong> (14.8 or 17.7 TWh/yr) depending on journalists. This is significantly
less than the analysts’ <strong>2700 MW</strong> (23 TWh/yr.)</p>
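<p>The corrected extrapolation, as a quick Python check:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">global_hashrate = 15.2e6  # TH/s
farm_hashrate = 263e3     # TH/s, the 21k Bitcoin rigs at Ordos

for farm_mw in (29.2, 35.0):  # depending on journalists
    mw = farm_mw * global_hashrate / farm_hashrate
    print(round(mw), round(mw * 8760 / 1e6, 1))  # MW and TWh/yr</code></pre></figure>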
<h2 id="6-antminer-s9-dominates-the-market">6. Antminer S9 dominates the market</h2>
<p>The report states “<em>the most efficient mining rigs used by Bitmain in its
facilities [Antminer S9/T9] are not yet widely available</em>” and imply that if
they are not available that the average rig must be another less efficient
model.</p>
<p>The analysts conflate <em>market availability</em> with <em>market share</em>.</p>
<p>Bitmain claimed in mid-2017 they had a <a href="https://qz.com/1053799/chinas-bitmain-dominates-bitcoin-mining-now-it-wants-to-cash-in-on-artificial-intelligence/">70% market share</a>.
Everything points to the fact it is even higher today.
The Antminer S9/T9 has been the only Bitcoin mining rig sold by Bitmain for the last 20 months.
Batches of tens of thousands sell out in minutes at shop.bitmain.com.
Bitmain is <a href="https://twitter.com/jwangARK/status/954429531678543872">buying ~20k 16nm wafers a month</a> and arguably makes up
most of the ~10k a month that the Morgan Stanley analysts claim since 3Q17.</p>
<p>~10k wafers = ~270k S9 = <strong>~3.6 EH/s</strong> manufactured per month.</p>
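<p>(Sanity check: assuming the S9’s nominal ~13.5 TH/s, 270 000 rigs × 13.5 TH/s ≈ 3.6 EH/s.)</p>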
<p>That is more than the <strong>1-3 EH/s</strong> added monthly to the global hashrate over
3Q17/4Q17 (it takes months to go from wafer production to mining.) Bitmain rigs
make up <em>virtually all</em> the hashrate being deployed to this day.</p>
<h2 id="7-electricity-costs">7. Electricity costs</h2>
<p>As to Morgan Stanley’s second report, it merely quotes the first report’s
flawed prediction of 120-140 TWh/yr ca. January 2019. But other than that it
is generally of better quality than the first. My criticism concerns
relatively minor points.</p>
<p>In it, the analysts calculate the cost of mining one bitcoin by assuming
electricity costs between 6¢ and 8¢ per kWh. Their sources are EIA numbers
grossly rounded for entire geographical regions.</p>
<p>Miners do not pay average prices. They choose the least expensive electrical
utilities of these regions.</p>
<p>For example where the analysts quote <strong>7.46¢</strong> for Washington State (see
their exhibit 5,) a mining farm located in this state, <a href="https://giga-watt.com/">Giga Watt</a>,
pays in fact <strong>2.8¢</strong>. It is my opinion that the industry average is probably
around 5¢.</p>
<h2 id="8-bitmains-direct-sales-model-one-global-price">8. Bitmain’s direct sales model: ONE global price</h2>
<p>Another assumption they make when calculating the cost of mining a bitcoin is
to assume that outside China an Antminer S9 costs $7000. In reality only
individual retail sales reach such high prices on third party sites such as
eBay. Large-scale miners representative of the average mining farm, even
outside China, all pay the same price: Bitmain’s direct sales price which
was $2320 for the batches sold around the time the report was written.</p>
<h2 id="9-transaction-fees-not-accounted-for">9. Transaction fees not accounted for</h2>
<p>Finally, they imply the cost of mining one bitcoin is a “<em>breakeven point</em>”
but that is not exactly true. For example, at the time of the report, transaction
fees collected by miners averaged more than <a href="https://blockchain.info/charts/transaction-fees?daysAverageString=7">600 BTC daily</a> and boosted
their global daily revenue by 1.33× (1800 to 2400 BTC,) hence the true
breakeven point was 1.33× lower.</p>
<p>Correcting these errors, with an electricity cost of $0.05/kWh, with the same
sale price globally, and with the (unusually) high-fee period of
December/January, the true breakeven point was $2300, significantly below the
analysts’ number ($3000 to $7000.)</p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:eff" role="doc-endnote">
<p>I would argue that it is incorrect to multiply as well as to divide.
Newer chips consume less energy per <em>silicon gate</em> not per <em>mm² of die
area</em>. The power consumption per mm² is roughly the same between two
different generations of chips, because gates consume less but more can
be packed per mm². For example a Radeon R9 390 and a
Radeon RX Vega 64 (about the same die area: 438 vs 486 mm²) are manufactured
at two very different process nodes (28nm vs 14nm,) yet they have the same
~300 W TDP. Nonetheless I went on with the division to follow the
analysts’ argument. <a href="#fnref:eff" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sourceVavilov" role="doc-endnote">
<p>A quote from Bitfury CEO Valery Vavilov is often
misinterpreted when he offhandedly <a href="https://spectrum.ieee.org/computing/networks/why-the-biggest-bitcoin-mines-are-in-china">said in an interview</a> “<em>Many data centers around the world have 30 to 40 percent of
electricity costs going to cooling</em>,” which corresponds to a PUE of 1.43 to
1.67. Obviously he was referring to traditional data centers outside the
mining industry. In fact he emphasized “<em>This [high PUE] is not an
issue in our Iceland data center</em>.” <a href="#fnref:sourceVavilov" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:sourceVavilov:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:sourceGW" role="doc-endnote">
<p>Source: personal discussion with CEO Dave Carlson. <a href="#fnref:sourceGW" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:moirana" role="doc-endnote">
<p>Bitfury’s 40 MW Mo i Rana data center in Norway has a
<a href="http://bitfury.com/content/4-press/03_20_18_bitfury_norway_datacenter_release.pdf">PUE of 1.05 or lower</a> <a href="#fnref:moirana" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:googlePUE" role="doc-endnote">
<p>Source: <a href="https://www.google.com/about/datacenters/efficiency/internal/">Efficiency: How we do it</a> <a href="#fnref:googlePUE" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Running a Bitcoin full node on $5 a monthhttp://blog.zorinaq.com/full-node-on-5-dollars/2017-10-25T00:00:00-07:00<p>In the context of the Bitcoin scaling debate, small blockers believe that if
the block size limit is doubled, running a fully validating Bitcoin node will
become expensive and only well-funded corporations will be able to run nodes,
hence increasing centralization.</p>
<p>However <em>facts and data</em> disprove this narrative.</p>
<p>Here is a fact: despite blocks currently averaging 1 MB <strong>it is possible to run
a full node on a $5/month VPS</strong>. I know because I run one. I demonstrate
below it could also (likely) handle 4 MB blocks.</p>
<p>The point I am making is not that people should run full nodes on VPS. The
point is that the cost of a VPS serves as a good indicator of the <em>amortized
cost</em> of resources consumed by a full node, and this cost is tiny. It is
irrelevant if users run the workload on a Raspberry Pi on a home connection, or
on a VPS.</p>
<ul id="markdown-toc">
<li><a href="#technical-setup" id="markdown-toc-technical-setup">Technical setup</a></li>
<li><a href="#handling-larger-blocks" id="markdown-toc-handling-larger-blocks">Handling larger blocks</a></li>
</ul>
<h2 id="technical-setup">Technical setup</h2>
<p>In my experience <strong>512 MB</strong> RAM is sufficient, even during IBD (initial
block download.) According to the requirements <a href="https://bitcoin.org/en/bitcoin-core/features/requirements">(bare minimum, with custom
settings)</a> we could do with 256 MB, but I choose 512 MB to have some
headroom.</p>
<p>The blockchain is currently approximately 145 GB, so that is the minimum
storage we need. We could run in pruned mode to reduce requirements to 5 GB,
however the goal here is to have a full node archiving the entire blockchain.
The blockchain will likely grow to <strong>200-220 GB</strong> over the next year,
assuming 1.0-1.5 MB blocks.</p>
<p>I observed the network bandwidth consumed by 30 connected peers to be in
the range <strong>100-300 GB/month</strong> (average of 300-900 kbit/s) and 95% of it is
outbound.</p>
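<p>For reference, here is the conversion from monthly volume to average rate, as
a quick Python check:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># GB per 30-day month to average kbit/s
for gb in (100, 300):
    print(round(gb * 1e9 * 8 / (30 * 86400) / 1e3))  # ~309 and ~926 kbit/s</code></pre></figure>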
<p><code class="language-plaintext highlighter-rouge">bitcoind</code> is at <1% CPU usage most of the time, so CPU performance is a
non-issue. Extremely rare pathological cases like the 1 MB transaction in block
<a href="https://blockchain.info/block-index/912565">364292</a> would cause a spike of CPU usage lasting 20-60 seconds.</p>
<p>Here is a handful of VPS plans adequate for running a full node:</p>
<ul>
<li>$3.5/mo: <a href="http://letbox.com/page/storage">LetBox</a> 512 MB RAM, 200 GB HDD, 2 TB bandwidth</li>
<li>$4/mo: <a href="https://www.hostens.com/vps-hosting/#hosting__plan__group-tab-storage">Hostens</a> 512 MB RAM, 512 GB HDD, 4 TB bandwidth</li>
<li>$5/mo: <a href="https://hosthatch.com/storage-vps">HostHatch</a> 512 MB RAM, 250 GB HDD, 1 TB bandwidth</li>
<li>$5/mo: <a href="https://www.serverhub.com/vps/ssd-cached">ServerHub</a> 512 MB RAM, 500 GB HDD, 1 TB bandwidth</li>
<li>6€/mo: <a href="https://www.scaleway.com/pricing/">Scaleway</a> 2 GB RAM, 200 GB <strong>SSD</strong>, 200 Mbps unmetered
(3€ for VPS including 50 GB + 3€ for an extra 150 GB)</li>
</ul>
<p>I chose ServerHub at $5/month, because it offered the most disk space.
<strong>[Update 2018-10-02:</strong> a new offering by Hostens is a lot more attractive
with four times the bandwidth (4 vs. 1 TB per month) and a tad more
storage.<strong>]</strong></p>
<p><a href="https://gist.github.com/laanwj/efe29c7661ce9b6620a7">These tips</a> are a great baseline for reducing memory usage. However one
advice is notably missing: run Bitcoin Core 0.15.0 or later because its <a href="https://bitcoin.org/en/release/v0.15.0">memory
usage is more predictable</a>. Also, running a 32-bit version of Bitcoin
Core may help—I have not measured by how much, but that is what I chose to run
based on experience.</p>
<p>When first setting up the node, I configured <code class="language-plaintext highlighter-rouge">bitcoin.conf</code> to optimize the
initial block download. The <code class="language-plaintext highlighter-rouge">dbcache</code> parameter (UTXO cache) should be as
large as what the machine can support:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dbcache=100
maxmempool=10
maxconnections=10
disablewallet=1
</code></pre></div></div>
<p>The peak system memory usage I recorded with these settings during IBD was 403
MB out of 512 MB, excluding Linux buffers/cache. IBD completed in 45
hours.<sup id="fnref:refB" role="doc-noteref"><a href="#fn:refB" class="footnote" rel="footnote">1</a></sup> In the initial stages it was bottlenecked by the overall network
throughput of my peers at ~30 Mbit/s, while in the later stages the bottleneck
shifted to CPU and disk I/O.</p>
<p>Once IBD completed I reconfigured <code class="language-plaintext highlighter-rouge">bitcoin.conf</code> to give more space to the
mempool,<sup id="fnref:refA" role="doc-noteref"><a href="#fn:refA" class="footnote" rel="footnote">2</a></sup> allow up to 30 peers, and limit upload to 16000 MB/day (0.5 TB/month):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dbcache=50
maxmempool=50
maxconnections=30
disablewallet=1
maxuploadtarget=16000
</code></pre></div></div>
<p>Complete specs of this ServerHub VPS:</p>
<ul>
<li>32-bit Debian 6.0.10 (old but fits the job)</li>
<li>32-bit Bitcoin Core 0.15.0.1</li>
<li>2 CPU cores Xeon E5-2630 v3 @ 2.40GHz</li>
<li>512 MB RAM</li>
<li>500 GB simfs (OpenVZ instance)</li>
</ul>
<p>That is it. We have a $5/month full node, running well, perfectly healthy,
contributing to decentralizing the network.</p>
<h2 id="handling-larger-blocks">Handling larger blocks</h2>
<p>2 MB or 4 MB blocks should <em>not</em> significantly increase RAM usage. The UTXO cache
can remain limited (dbcache=50). It is a <a href="https://zander.github.io/posts/UTXO-Growth/">widespread misconception</a> that
the cache must be big enough to hold the entire UTXO set. It is fine for a
non-mining full node to have many UTXO cache misses. The mempool can also
remain limited (maxmempool=50) even if blocks are 2× or 4× larger. An
overflowing mempool causes, at worst, occasional unnecessary network
retransmissions of transactions.</p>
<p>2 MB or 4 MB blocks should <em>proportionally</em> increase network bandwidth,
storage, and CPU usage. However this VPS has resources to spare: it uses only
10-30% of its monthly bandwidth, 31% of its storage space, and the CPU is <1%
usage most of the time. Quadratic hashing is solved by segwit, and solvable in
non-segwit transactions by limiting them to 1 MB.</p>
<p>If 4 MB blocks started being mined today, the monthly bandwidth would still
work out (I would probably double <code class="language-plaintext highlighter-rouge">maxuploadtarget</code> to allow bitcoind to use
the maximum amount of bandwidth possible, and reduce <code class="language-plaintext highlighter-rouge">maxconnections</code> from 30
to 25 if needed), and the 500 GB would fill up in 1.7 years.</p>
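<p>My back-of-the-envelope math behind that 1.7-year figure, assuming ~144
blocks per day:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">used_gb, total_gb = 145, 500     # current chain size, disk size
growth = 4e6 * 144 * 365 / 1e9   # ~210 GB/year with full 4 MB blocks
print((total_gb - used_gb) / growth)  # ~1.7 years to fill the free space</code></pre></figure>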
<p>So this $5/month VPS would have enough RAM/network/storage/CPU resources to
function as a full node with 4 MB blocks, today and for the near future. Plus,
$5/month will buy more resources 1.7 years from now, so it may be possible to
continue operating a node on such a shoestring budget even further.</p>
<p>How does $5/month fit the narrative of small blockers who claim running a full
node will become too expensive? It does not. $5/month is affordable.
$5/month is within reach of enough individuals to protect decentralization.</p>
<p>Fun comparison: the average transaction fee right now is <a href="https://bitinfocharts.com/comparison/bitcoin-transactionfees.html#1y">over $3</a>,
therefore <strong>running a full node costs less than the fees for sending 2
Bitcoin transactions a month</strong>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:refB" role="doc-endnote">
<p>For comparison, as of block 482000 (August 2017) IBD can be completed
in <a href="https://www.reddit.com/r/Bitcoin/comments/6x17t1/bitcoin_core_0150_performance_is_impressive/dmcpt4h/">3 hours</a> on a high-end machine (24 cores, 64 GB RAM), see also
chart on page 16 of <a href="https://people.xiph.org/~greg/gmaxwell-sf-015-2017.pdf">Upcoming in Bitcoin Core 0.15</a>. <a href="#fnref:refB" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:refA" role="doc-endnote">
<p>In theory IBD would have run just fine with dbcache=50 and
maxmempool=50 because unused mempool memory is automatically shared with
the UTXO cache, since version 0.14.0. I did things the way I did out of habit. <a href="#fnref:refA" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
The case for increasing Bitcoin's block weight limithttp://blog.zorinaq.com/block-increase-needed/2017-10-25T00:00:00-07:00<p>March 2017 was a significant moment for Bitcoin: the average block size bumped
into the 1MB limit, stunting the growth of the transaction rate ever since.
Many arguments are made about how to increase capacity (changing the limit,
developing off-chain solutions, etc.) However very rarely are discussions
centered around <em>facts and data</em> such as estimating the impact of the stunted
growth so far and the urgency of the situation. Also troubling is the
widespread misconception that the <a href="https://bitcoinmagazine.com/articles/segregated-witness-activates-bitcoin-what-expect/">activation of segwit</a> on August 24th
has given Bitcoin short-term relief.</p>
<p>In reality, I estimate <strong>Bitcoin missed the opportunity to process over $20
billion of transactions</strong>. Low segwit adoption keeps the network seriously
congested to this day: <strong>over 90% of blocks are over 90% full</strong>. And given the
current rate of increase of segwit adoption, <strong>the network will likely remain
congested for the foreseeable future</strong>.</p>
<p>This post is <em>not</em> about segwit2x, a grassroots change initially
<a href="https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-March/013921.html">proposed</a> by Sergio Demian Lerner on bitcoin-dev to double the block
weight, and which is supported by signatories of the <a href="https://medium.com/@DCGco/bitcoin-scaling-agreement-at-consensus-2017-133521fe9a77">New York Agreement</a>.
Although I support it myself, segwit2x is a complex topic that deserves its own
blog post. Rather, the goal of this post is only to demonstrate the impact of
congestion and the urgent need to increase the block weight limit in one way or
another (not necessarily segwit2x.)</p>
<ul id="markdown-toc">
<li><a href="#average-block-size" id="markdown-toc-average-block-size">Average block size</a></li>
<li><a href="#mempool" id="markdown-toc-mempool">Mempool</a></li>
<li><a href="#segwit" id="markdown-toc-segwit">Segwit</a></li>
<li><a href="#market-share" id="markdown-toc-market-share">Market share</a></li>
<li><a href="#altcoins" id="markdown-toc-altcoins">Altcoins</a></li>
<li><a href="#pragmatism-vs-idealism" id="markdown-toc-pragmatism-vs-idealism">Pragmatism vs. idealism</a></li>
<li><a href="#decentralization" id="markdown-toc-decentralization">Decentralization</a></li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
<h2 id="average-block-size">Average block size</h2>
<p>Let’s start with a chart showing the average block size over the last 3 years:</p>
<p class="no-margins"><img src="../assets/stunted_growth.png" alt="Stunted growth" title="stunted growth" class="pure-img" /></p>
<p>Up until March 2017 the transaction rate growth was predictable. Had there not
been a 1MB limit, it seems the market demand for transaction capacity would
have been for approximately 1.2MB blocks today. By failing to capture an extra
0 to 0.2MB of transactions per block, that is, 0 to 20% more transactions over the
period March to October (daily transaction volume of <a href="https://blockchain.info/charts/estimated-transaction-volume-usd?timespan=1year&daysAverageString=7">$300-1000 million</a>),
<strong>Bitcoin failed to capture $20 billion worth of transactions over the last 7
months</strong>.</p>
<h2 id="mempool">Mempool</h2>
<p>I have to touch the subject of the mempool because the human—not
technical—dynamics that affect its functioning are often misunderstood.</p>
<p>It is a mistake to think that “<em>if the mempool is not growing, then Bitcoin is
not congested</em>.” When congestion begins, the mempool starts growing, some
transactions in it are delayed 1 day, 2 days, 4 days… eventually some users
lose patience and avoid sending transactions (<strong>Bitcoin fails to capture
them</strong>), hence reducing pressure on the mempool, which starts shrinking until
it is back to normal or even empty. And the cycle repeats immediately: users
send more transactions, the mempool grows, they lose patience, and avoid
Bitcoin, etc.</p>
<p><strong>Under perpetual congestion conditions the mempool will never grow
indefinitely</strong>.</p>
<h2 id="segwit">Segwit</h2>
<p>I hear you say “<em>but segwit allows up to 4MB blocks, so the problem is solved,
right?</em>”</p>
<p>No! Segwit replaced the <em>size</em> limit of 1MB with a <em>weight</em> limit of 4 million
weight units (“4M”). Therefore
the size of blocks in <em>bytes</em> is no longer relevant to estimate the level of
congestion. Instead we need to look at the weight of blocks in <em>weight
units</em>. For example average blocks of 1MB/4M would indicate congestion,
while larger blocks of smaller weight such as 1.5MB/3M would indicate no congestion.</p>
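<p>For reference, here is the BIP 141 weight formula behind these numbers,
sketched in Python:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def block_weight(base_size, total_size):
    # Non-witness bytes count 4x, witness bytes count 1x:
    # weight = 4*base_size + 1*witness_size = 3*base_size + total_size
    return 3 * base_size + total_size

print(block_weight(1000000, 1000000))  # 4000000: a 1MB legacy block is full
print(block_weight(500000, 1500000))   # 3000000: 1.5MB yet not congested</code></pre></figure>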
<p>The weight of blocks is a crucial metric, yet I am not aware of <em>any</em> Bitcoin
statistics site that charts it. Not even <a href="http://segwit.party/charts/">segwit.party</a> (even though
weight data is present in their <a href="http://enyo.gk2.sk/data.json">JSON file</a>.) It is almost as if segwit
proponents who believe segwit will solve scalability wanted to hide how
congested the network really is…</p>
<p>So I charted the data for the last 1000 blocks as I am typing this (blocks 490698
through 491697, roughly October 19th-25th):</p>
<p class="no-margins"><img src="../assets/blocks_full.png" alt="Over 90% of blocks are over 90% full" title="Over 90% of blocks are over 90% full" class="pure-img" /></p>
<p>913 out of the last 1000 blocks have a weight greater than 3.6M (90% of 4M), in
other words <strong>over 90% of blocks are over 90% full</strong>. This is a clear sign that
congestion is still seriously impacting Bitcoin. And the direct cause of this
congestion is low segwit adoption which sits at only <a href="http://segwit.party/charts/">8%</a> two months
after activation:</p>
<p class="no-margins"><a href="http://segwit.party/charts/"><img src="../assets/segwit_adoption.png" alt="Segwit adoption rate" title="Segwit adoption rate" class="pure-img" /></a></p>
<p>(Opinions abound as to why it peaked at 17% and fell to 8%. Some claim
a group was faking segwit adoption by spamming the network.)</p>
<p>An argument I sometimes hear is that “<em>there is no severe congestion,
because if there was, segwit adoption would increase more sharply</em>.” This is
flawed reasoning because other factors—independent of congestion—slow down
segwit adoption: wallet software and services are not upgraded to use segwit by
default, users are unaware of how to use segwit or why they need it, etc.</p>
<p>As a result, <strong>the average block size was only able to grow by a paltry few
percent from 1MB to 1.02-1.04MB in two months</strong>, significantly
undersupplying the market demand (which seems to be for approximately 1.2MB
blocks, see above.)</p>
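<p>A rough model, under my own assumption that witness data makes up ~60% of a
typical segwit transaction, shows why 8% adoption yields blocks of barely over
1MB:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITNESS_SHARE = 0.6   # assumed fraction of a typical segwit tx that is witness data

def max_block_size_mb(adoption):
    # if a fraction w of a full block's bytes are witness bytes, then
    # weight = size * (4 - 3*w), so size maxes out at 4M / (4 - 3*w)
    w = adoption * WITNESS_SHARE
    return 4 / (4 - 3 * w)

print(f"{max_block_size_mb(0.08):.2f}MB")  # ~1.04MB at 8% adoption
print(f"{max_block_size_mb(1.00):.2f}MB")  # ~1.82MB at full adoption
</code></pre></div></div>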
<h2 id="market-share">Market share</h2>
<p>So Bitcoin blocks have been full since March. What else happened in March?</p>
<p class="no-margins"><img src="../assets/bitcoin_dominance.png" alt="Bitcoin market dominance" title="Bitcoin market dominance" class="pure-img" /></p>
<p>Bitcoin lost market share, from 85% down to 55%. In my humble opinion, the
coincidence of the drop starting <em>exactly</em> in March is a sign that the congested
network was the direct cause of Bitcoin’s loss.</p>
<p>This could be explained by one theory: users being fed up with the congested
network and high transaction fees, and moving to altcoins with less congestion
and lower fees.</p>
<h2 id="altcoins">Altcoins</h2>
<p>Interestingly, data points seem to support this theory. The transaction rate of
Ethereum, Litecoin, and Dash (and maybe others) also started increasing in or
around March 2017:</p>
<p class="no-margins"><a href="https://bitinfocharts.com/comparison/ethereum-transactions.html#1y"><img src="../assets/ethereum_txrate.png" alt="Ethereum transaction rate" title="Ethereum transaction rate" class="pure-img" /></a></p>
<p class="no-margins"><a href="https://bitinfocharts.com/comparison/litecoin-transactions.html#1y"><img src="../assets/litecoin_txrate.png" alt="Litecoin transaction rate" title="Litecoin transaction rate" class="pure-img" /></a></p>
<p class="no-margins"><a href="https://bitinfocharts.com/comparison/dash-transactions.html#1y"><img src="../assets/dash_txrate.png" alt="Dash transaction rate" title="Dash transaction rate" class="pure-img" /></a></p>
<p>If anything, this is more evidence <em>potentially</em> validating the theory that
some users are abandoning Bitcoin for altcoins due to congestion.</p>
<p>Another explanation for why Ethereum’s transaction rate went up so sharply
is the ICO boom. However ICOs are hardly responsible for all of it,
and they do not explain the rise in Litecoin and Dash, which are much less often
used to fund ICOs.</p>
<p>Notice that <strong>Ethereum already processes 450k transactions daily, which is 70%
more than the 260k processed by Bitcoin</strong>… This might be a slightly unfair
comparison because unlike Ethereum, some Bitcoin transactions are batched. To
be fairer we should compare the number of Ethereum transactions to the
number of Bitcoin transaction <em>outputs</em> (and for transactions with 2 or more
outputs, subtract 1 because one of them is likely the change output.) However
batching is not that common, so the comparison between the number of Ethereum
and Bitcoin transactions is mostly valid.</p>
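<p>A sketch of that adjusted count, run on hypothetical per-transaction output
counts:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def adjusted_payments(output_counts):
    """Count Bitcoin payments: one per output, minus one assumed
    change output for transactions with 2 or more outputs."""
    return sum(n - 1 if n >= 2 else n for n in output_counts)

sample = [2, 2, 1, 5, 2]          # hypothetical per-transaction output counts
print(adjusted_payments(sample))  # 8 payments from 5 transactions
</code></pre></div></div>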
<h2 id="pragmatism-vs-idealism">Pragmatism vs. idealism</h2>
<p>How do we resolve Bitcoin’s congestion? Instead of increasing the block weight
limit, I think we all agree more ideal solutions would be the <a href="https://lightning.network/">Lightning
Network</a>, <a href="https://bitcoinmagazine.com/articles/mimblewimble-how-a-stripped-down-version-of-bitcoin-could-improve-privacy-fungibility-and-scalability-all-at-once-1471038001/">Mimblewimble</a>, transaction batching, etc.</p>
<p>However these solutions are either not ready or not practically usable <em>at scale</em>
today.</p>
<p>In essence this is the root cause of the scaling debate: the community is
divided between big blockers who want a pragmatic quick fix, and small blockers
who are more idealistic.</p>
<h2 id="decentralization">Decentralization</h2>
<p>Will increasing the block weight limit hamper decentralization? I do not
believe so.</p>
<p>Firstly, the <em>amortized cost</em> of running a full node is only <a href="../full-node-on-5-dollars/">$5 per
month</a>, an amount small enough that running one is within reach of enough
individuals to protect decentralization.</p>
<p>Fun comparison: the average transaction fee right now is <a href="https://bitinfocharts.com/comparison/bitcoin-transactionfees.html#1y">over $3</a>,
therefore <strong>running a full node costs less than the fees for sending 2
Bitcoin transactions a month</strong>.</p>
<p>Secondly, history shows that as blocks have grown over the last 2 years from
0.5 MB to 1.0 MB the number of full nodes has in fact increased, helping
decentralization. That is because bigger blocks = more Bitcoin adoption = more
individuals interested in and able to run full nodes. Generally speaking
we can expect the number of full nodes to increase as blocks become bigger
(up to a certain point.)</p>
<p class="no-margins"><img src="../assets/larger_blk_size_more_nodes.png" alt="Increased node count" title="Increased node count" class="pure-img" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>It is easy to get distracted by all the good news Bitcoin has been receiving
lately: all-time-high price around $6000, a lot of venture capital injected
into Bitcoin companies, etc. However this rapid success could disappear just as
quickly if Bitcoin fails to increase its capacity soon.</p>
<p>Segwit is too little too late: some users not adopting it are causing
congestion for all users. We needed solutions to scale in March 2017. We needed
segwit activated a year ago so adoption would be at 50%+ today. We needed all
this before Bitcoin failed to capture $20 billion worth of
transactions, lost 30% of market share, and lost hundreds of thousands of
transactions daily to altcoins…</p>
<p><strong>Bitcoin needs a pragmatic fix, a one-time reasonable increase of the block
weight in order to relieve congestion right away and give us some breathing
room</strong>. We can do this while keeping the costs to run a full node low
($5/month) and the network decentralized.</p>
Attacks on Merkle Tree Proofhttp://blog.zorinaq.com/attacks-on-mtp/2017-08-11T00:00:00-07:00<p>The Zcoin cryptocurrency is migrating its proof-of-work from Lyra2 to <a href="https://arxiv.org/abs/1606.03588v1">Merkle
Tree Proof</a> which is a new algorithm published in a paper last year by
the same authors who previously designed <a href="https://www.cryptolux.org/index.php/Equihash">Equihash</a> and <a href="https://www.cryptolux.org/images/0/0d/Argon2.pdf">Argon2</a>. The
Zcoin folks invited me to participate in their <a href="https://zcoin.io/mtp-open-source-miner-bounty-challenge/">miner development
challenge</a>; however what really caught my interest was their <a href="https://zcoin.io/mtp-audit-and-implementation-bounty/">audit
challenge</a> in which they offer bounties to reward the discovery of
vulnerabilities in either the MTP algorithm described by the paper, or the MTP
code implemented by Zcoin.</p>
<p>Challenge accepted. I found 4 attacks. The 4th one is the most serious one.</p>
<p><strong>Edit:</strong>
The Zcoin project gave me a <a href="https://twitter.com/zcoinofficial/status/931015823367421953">9166 USD bounty</a> (6666 USD for attacks 1 and
4, plus 2500 USD for attacks 2 and 3,) thanks!
Subsequently, the MTP authors revised their paper to defend against my attacks,
see <a href="https://arxiv.org/abs/1606.03588v2">MTP 1.2</a>.</p>
<ul id="markdown-toc">
<li><a href="#prelude" id="markdown-toc-prelude">Prelude</a></li>
<li><a href="#desc" id="markdown-toc-desc">Description of MTP-Argon2</a></li>
<li><a href="#prev" id="markdown-toc-prev">Previously known attacks</a></li>
<li><a href="#new-attacks" id="markdown-toc-new-attacks">New attacks</a> <ul>
<li><a href="#segsharing" id="markdown-toc-segsharing">Attack 1: Argon2 segment sharing</a></li>
<li><a href="#loc" id="markdown-toc-loc">Attack 2: Location in merkle tree not verified</a></li>
<li><a href="#openings" id="markdown-toc-openings">Attack 3: 1/3rd of openings not verified (Zcoin implementation flaw)</a></li>
<li><a href="#tmto" id="markdown-toc-tmto">Attack 4: Time-memory trade-off with 1/16th the memory, 2.88× the time</a></li>
</ul>
</li>
<li><a href="#minor-errors-in-mtp-paper" id="markdown-toc-minor-errors-in-mtp-paper">Minor errors in MTP paper</a></li>
<li><a href="#final-words" id="markdown-toc-final-words">Final words</a></li>
</ul>
<h2 id="prelude">Prelude</h2>
<p>Proof-of-work algorithms based on Merkle trees are a relatively new
construction. I find them quite interesting. To my knowledge, before MTP,
Fabien Coelho is the only person who designed one in his 2008 paper
<a href="https://eprint.iacr.org/2007/433.pdf">An (Almost) Constant-Effort Solution-Verification Proof-of-Work Protocol
Based on Merkle Trees</a>.</p>
<h2 id="desc">Description of MTP-Argon2</h2>
<p>First, let me give a succinct description of <em>MTP-Argon2</em> which is the concrete
instance of MTP with parameters making it suitable for use as a cryptocurrency
proof-of-work.</p>
<!--
According to the Unicode standard:
ϕ is U+03D5 ("straight" greek lowercase phi, used as a technical symbol)
φ is U+03C6 ("loopy" greek lowercase phi, meant for greek text)
This web page uses U+03D5.
Note that the Merkle paper authors, in their latex source, use the macro
\phi (not \varphi). Latex maps these macros to the following Adobe Type 1 font
character names in the Latin Modern or Computer Modern fonts:
\phi maps to /phi (a straight glyph) copying from PDF maps to U+03C6
\varphi maps to /phi1 (a loopy glyph) copying from PDF maps to U+03D5
This seems to be in line with PDF spec (appendix D.4) and Adobe Glyph List:
phi should be the straight glyph should map to U+03C6
phi1 should be the loopy glyph should map to U+03D5
However the Unicode standard says:
U+03D5 should be straight, used as a technical symbol, "maps to phi1 symbol entities"
U+03C6 should be loopy, meant for greek text
Unicode 10.0 standard page 308: starting with Unicode Standard v3.0 (1999), the
glyphs for 3C6 and 3D5 were swapped compared to earlier versions. This is the
origin of all these inconsistencies.
Ultimately, the correct desired result should be that the latex \phi and
\varphi macros respectively map to U+03D5 and U+03C6 when copied from the
generated PDF. To do that, either
(1) the Adobe Glyph List needs to swap its codepoints for phi and phi1:
phi can remain the straight glyph but should map to U+03D5
phi1 can remain the loopy glyph but should map to U+03C6
(and the comment in the Unicode standard "maps to phi1 symbol entities"
should apply to U+03C6 not U+03D5),
or (2) the Latin Modern and Computer Modern fonts need to map:
\phi to /phi1
\varphi to /phi
The solution (1) seems to be more appropriate. In fact the LCDF Typetools
package patches the Adobe Glyph List do to exactly that.
-->
<p>The <em>prover</em> takes the <em>challenge</em> (eg. current block header) and derives from
it 2 GB of Argon2d memory blocks (time_cost = 1, lanes = 4). Remember
that Argon2 computes each block X[i] from 2 <em>input</em> blocks X[i - 1] and
X[ϕ(i)]. Each 1-kB Argon2 block is seen as a leaf, and the prover computes the
merkle hash tree of the leaves, ending up with the root Φ. The prover
picks a random nonce N, computes a hash Y_0 = H(Φ ‖ N), then Y_0 modulo
the number of Argon2 blocks determines the index i of some block. The prover
computes another hash, Y_1 = H(Y_0 ‖ X[i]), then Y_1 determines the index of
another block, and so on. This is repeated L = 70 times to randomly select 70
blocks. If the final hash Y_L is under a certain target, then the
proof-of-work is solved. The proof is:</p>
<ul>
<li>merkle root Φ</li>
<li>nonce N</li>
<li>2 × L = 140 blocks: two Argon2 <em>input</em> blocks for each of the L <em>selected</em> blocks</li>
<li>3 × L = 210 merkle openings: openings of 2 × L <em>input</em> blocks and of L <em>selected</em> blocks
(note: the paper makes a crucial mistake on this point, see next section)</li>
</ul>
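<p>A minimal sketch of the prover’s nonce search described above, with SHA-256
standing in for the paper’s hash functions and no attempt to model the proof
construction or the real 2 GB block array:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

L = 70   # number of selected blocks, as in the paper

def H(*parts):
    return hashlib.sha256(b"".join(parts)).digest()

def search(merkle_root, blocks, target):
    """blocks: the array X of 1-kB Argon2 blocks (toy-sized here)."""
    T = len(blocks)
    nonce = 0
    while True:
        n = nonce.to_bytes(8, "little")
        y = H(merkle_root, n)                     # Y_0 = H(Phi || N)
        for _ in range(L):
            i = int.from_bytes(y, "little") % T   # index of the next block
            y = H(y, blocks[i])                   # Y_j = H(Y_(j-1) || X[i_j])
        if y &lt; target:                         # Y_L under target: PoW solved
            return nonce
        nonce += 1
</code></pre></div></div>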
<p>The <em>verifier</em> recomputes all the hashes Y_0 through Y_L to verify that
Y_L is under the target. For each of the L steps, the verifier also uses the
2 <em>input</em> blocks and Argon2’s compression function to recompute the <em>selected</em>
block, and he verifies their merkle openings.</p>
<h2 id="prev">Previously known attacks</h2>
<p>The first crucial vulnerability in the MTP paper is that the proof contains the
openings of only the <em>input</em> blocks. But it also needs the openings of the
<em>selected</em> blocks, or else the prover can choose a constant value for all
blocks (eg. all zeroes) and can compute the Y_i hashes by recomputing the
<em>selected</em> blocks on-the-fly from the 2 constant input blocks without having to
prove the <em>selected</em> blocks are part of the tree. I discovered this but then
read <a href="https://eprint.iacr.org/2017/497.pdf">Dinur and Nadler’s review of MTP</a> and saw they already reported
this vulnerability (“<em>4.1 Simple Attacks, second attack</em>.”)</p>
<p>Dinur and Nadler also document a 2nd and 3rd vulnerability: “<em>4.1 Simple
Attacks, first attack</em>” (previous proof-of-work solutions can be reused) and
the main attack described in section 5 (time-memory tradeoff exploiting
specially crafted inconsistent blocks.) The MTP authors suggested both attacks
can be fixed by including the <em>challenge</em> in the Argon2 compression function.</p>
<p>Zcoin fixes the 1st vulnerability by including the openings of the selected
blocks in the proof, and fixes the 2nd and 3rd one by writing—via this
<a href="https://github.com/zcoinofficial/zcoin/blob/2fb7dda5d39eb97e229f8da1a22aa45c1b54c72b/src/mtp.cpp#L38">memcopy()</a> call—the first 16 bytes of the double-SHA256 of the block
header (76 bytes, excluding nonce), at offset 128 of the XOR output in the
Argon2 compression function.</p>
<h2 id="new-attacks">New attacks</h2>
<p>Despite the above fixes suggested by the MTP authors and implemented by Zcoin,
I discovered new attacks. They affect the MTP algorithm, except attack 3 which
is just the result of a bug in the custom MTP implementation in Zcoin.</p>
<h3 id="segsharing">Attack 1: Argon2 segment sharing</h3>
<p>MTP-Argon2 suggests parameters designed to force provers to use 2048 MiB of
RAM. However a flaw allows a malicious prover to reduce his RAM usage by 18.75%
down to 1664 MiB by sharing Argon2 segments amongst lanes.</p>
<p>Argon2 is parametrized with 4 lanes of 512 MiB. Each lane is divided into 4
segments of 128 MiB. The first two X[i] blocks of a lane are normally
initialized with respectively H’(H_0 ‖ 0 ‖ i) and H’(H_0 ‖ 1 ‖ i).</p>
<p>However an oddity of MTP is that verifiers never check that these first two
blocks are initialized this way. So a prover can share the same first segment
across all lanes, allowing him to store 3 fewer segments in RAM, hence saving 3
× 128 = 384 MiB of RAM. This brings his RAM usage from 2048 MiB down to
1664 MiB, and makes MTP less memory-hard than it is supposed to.</p>
<p>(Note that even though the first segments of all lanes are identical, the second
and subsequent segments will be different from other lanes because the
reference set R of blocks will be different for different lanes.)</p>
<p>One pragmatic fix for MTP is to simply allow, even encourage, the use of this
memory-saving technique, giving the same advantage to malicious and honest
provers. Argon2 could be parametrized to use 2581120 KiB (2520.625 MiB) of RAM,
so that the actual RAM usage with the memory-saving technique would be close to
the 2048 MiB originally intended.</p>
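<p>Checking the arithmetic of this fix:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SEGMENT = 128                    # MiB per segment, 4 lanes x 4 segments
print(16 * SEGMENT)              # 2048 MiB: honest prover
print(13 * SEGMENT)              # 1664 MiB: sharing the first segment (-18.75%)
print(2581120 / 1024 * 13 / 16)  # ~2048 MiB: effective usage after reparametrization
</code></pre></div></div>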
<p>However a cleaner fix is to use Argon2 with 1 lane instead of 4. This would also
fix the more serious <a href="#tmto">attack 4</a>.</p>
<p>One might think that an alternative fix would be to require provers to include
the first 2 blocks of each lane (and their openings) in the proof, and to
require verifiers to ensure that these blocks are unique. However fundamentally
MTP allows some inconsistent blocks, so the prover could supply 8 unique
blocks, and break the consistency of Argon2 by sharing the same remaining
blocks amongst lanes.</p>
<h3 id="loc">Attack 2: Location in merkle tree not verified</h3>
<p><strong>Note:</strong> the challenge judges <a href="https://github.com/zcoinofficial/zcoin/wiki/MTP-Audit-and-Implementation-Bounty-Submissions">argue</a> that they do not see this
as an issue in the paper, but the paper was merely silent on this (critical)
detail.</p>
<p>MTP as described in its paper is flawed. Algorithm 2 (verifier’s algorithm),
step 2, verifies the opening but not the <em>location</em> of the Argon2 blocks in the
tree. Their location is not part of the proof. Their location can only be
computed at the next step, step 3, when Y_j is calculated. As a result, the
prover can pretend the blocks selected through each of the L = 70 steps are
always X[3]
(MTP does not allow selecting X[1] or X[2]) and he can effectively
run algorithm 1 (prover’s algorithm) by storing only 3 kB. I will explain later
how to further reduce memory usage down to 1 kB.</p>
<p>More specifically an attack against MTP would work as follows: in the prover’s
algorithm, step 1, the prover computes and stores X[1] through X[3] in memory,
and stops here. The prover assumes the rest of the blocks X[4] through X[T] are
all zeros. In step 5, the prover does not bother computing i_j but assumes i_j
is always 3. The rest of the algorithm is run normally. Whenever a nonce N
results in a satisfying Y_L, the prover’s proof will be the merkle root, the
nonce N, and the set Z of L = 70 elements where each element is the same:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(input block X[2],
input block X[1],
opening of X[3],
opening of X[2],
opening of X[1])
</code></pre></div></div>
<p>The verifier’s algorithm blindly accepts this proof because all the openings
are valid and the algorithm does not verify whether the location (Y_j mod T)
equals 3 or not. Since the location is not verified or even known by the
verifier’s algorithm, it seems the paper implies a <em>canonical</em> merkle tree is
computed: the <em>left</em> and <em>right</em> hashes are first ordered (since it is not
known which one is which) before hashing them to compute the parent.</p>
<p>On a side note, the Zcoin implementation of MTP gets around the problem of not
knowing the location not by computing a <em>canonical</em> merkle tree, but by storing
both the left and right hashes in the proof. But it too fails to verify the
location, so it does not prevent the attack.</p>
<p>In order to fix the verifier’s algorithm, step 2 should be removed, and step 3
should be changed to:</p>
<!-- "monospace" font on Android (/system/fonts/DroidSansMono.ttf) lacks a glyph for
U+03D5 ("straight" greek lowercase phi), so we have a special CSS class to
set the <code> font-family to "monospace, sans-serif" so that glyphs will first
be searched for in the monospace font, then fall back to sans-serif. -->
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Compute from Z for 1 ≤ j ≤ L:
X[i_j] = F(X[i_j - 1], X[ϕ(i_j)])
Verify opening of X[i_j] and its location in tree: (Y_(j-1) mod T)
Verify opening of X[i_j - 1] and its location in tree: (Y_(j-1) mod T) - 1
Verify opening of X[ϕ(i_j)] and its location in tree (*)
Y_j = G(Y_j - 1, X[i_j])
(*) The location of X[ϕ(i_j)] is specified by the first two 32-bit words of
X[i_j - 1] as documented in the Argon2 specification (J_1 and J_2 words.)
</code></pre></div></div>
<p>Since an opening consists of 21 hashes and no additional information, some of
these hashes will be <em>left</em> hashes, others <em>right</em> hashes. So verifying
the location of a block is accomplished by determining which one is left and
which one is right, and by hashing them in the proper order to compute the parent.</p>
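<p>For illustration, a minimal sketch (my own code, not Zcoin’s) of an opening
verification that also checks the leaf’s claimed location, by letting the bits
of the index decide on which side each sibling hash sits:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

def verify_opening(leaf_hash, index, siblings, root):
    """siblings: the 21 opening hashes, ordered from the leaf level up."""
    h = leaf_hash
    for sibling in siblings:
        if index &amp; 1:                   # our node is a right child
            h = hashlib.sha256(sibling + h).digest()
        else:                              # our node is a left child
            h = hashlib.sha256(h + sibling).digest()
        index >>= 1                        # move up one level
    return h == root                       # opening AND location check out
</code></pre></div></div>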
<p>The Zcoin implementation stores 21 × 3 hashes per opening—21 (tree
depths) × 3 elements (parent, left, right)—so verifying the location of the
block would consist of checking that the hash is at the expected place (either left
or right) instead of freely allowing it to be at either the left or right <a href="https://github.com/zcoinofficial/zcoin/blob/dfb9a0b57ebcc706ae0278b5698792a02d1b66f6/src/libmerkletree/merkletree.cpp#L79">as
it does</a>.</p>
<p>The memory usage of this attack against MTP can be reduced from 3 kB to
1 kB. Because the verifier’s algorithm allows any value for X[1] and
X[2], the prover can actually assume they are all zeros and not even store them
in memory.</p>
<h3 id="openings">Attack 3: 1/3rd of openings not verified (Zcoin implementation flaw)</h3>
<p>When Zcoin patched one of the Dinur and Nadler attacks, by verifying 210
instead of 140 openings, they forgot to update one <a href="https://github.com/zcoinofficial/zcoin/blob/dfb9a0b57ebcc706ae0278b5698792a02d1b66f6/src/mtp.cpp#L609">loop</a>:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">L</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">i < L * 2</code> should be replaced with <code class="language-plaintext highlighter-rouge">i < L * 3</code> or else the verifier will
only verify the opening of 2/3rds of the blocks in the merkle tree. Failing
to verify the remaining 1/3rd means a malicious prover would have to spend
the expected computing resources (2 GB of memory) only once to compute 1 potential PoW
solution. Then he would simply grind one block in the remaining 1/3rd by
spending minimal resources, completely defeating the memory-hardness of
MTP. For example he would calculate Y_j = H(Y_(j-1) ‖ X[i_j]) for L-1
blocks, then would grind the last block, potentially on many machines in
parallel, by sending the Y_(L-1) value to the machines, each grinding
unique random last blocks, and each needing to allocate only 1 kB to hold
this block.</p>
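<p>A sketch of that grinding step, with SHA-256 as a stand-in for the real hash:
each machine needs only the 32-byte Y_(L-1) value and 1 kB per candidate block.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib, os

def grind(y_prev, target):
    # y_prev: the honestly computed Y_(L-1), identical on every machine
    while True:
        candidate = os.urandom(1024)       # a random, never-verified 1-kB block
        y_last = hashlib.sha256(y_prev + candidate).digest()
        if y_last &lt; target:             # Y_L under target: "solved"
            return candidate
</code></pre></div></div>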
<h3 id="tmto">Attack 4: Time-memory trade-off with 1/16th the memory, 2.88× the time</h3>
<p>I found a time-memory trade-off attack against MTP-Argon2 that requires
approximately 1/16th the memory, increases the time complexity by only ~2.88×,
while requiring negligible precomputation work. The overall cost for a cheating
prover, measured in ASIC area-time complexity, is hence reduced by a factor ~5.5×.</p>
<p>The prover computes the first 128 MB Argon2 segment of lane 0 honestly.
This segment can be reused by lanes 1-3 because verifiers don’t verify the first
2 blocks of the first segments are initialized as per the Argon2 spec, see
<a href="#segsharing">attack 1</a>.</p>
<p>For the 2nd segment of lane 0, the prover chooses an arbitrary (aka
non-consistent) 1st block that references any of the <strong>other</strong> 3 lanes (2nd
32-bit word, J_2 ≠ current_lane). He then computes the 2nd block honestly. If
J_2 in the 2nd block references the current lane (probability = 0.25), the
prover tries again by changing some bytes in the 1st block. If J_2 references
one of the other 3 lanes (probability = 0.75), the prover progresses
further and honestly computes the 3rd block. If J_2 in the 3rd block
references the current lane, the prover goes back to the beginning, changes
some bytes in the 1st block, so that both the 2nd and 3rd blocks reference any
of the other 3 lanes. The prover continues in the same fashion for N blocks:
he brute forces the 1st block until he honestly computes N subsequent blocks
that all reference any of the other 3 lanes. The probability that an arbitrary
1st block fulfills this condition is 0.75^N. I suggest N = 49 for an efficient
attack. On average the prover will have to try 1/(0.75^N) = ~1,324,336
candidate values for the 1st block.</p>
<p>So far the 2nd segment consists of:</p>
<ul>
<li>block 1: brute forced</li>
<li>blocks 2-50: honestly computed</li>
</ul>
<p>The prover stores these 50 blocks (memory cost: 50 kB.) The prover makes
block 51 non-consistent by setting it to the same content as block 1. Then
blocks 52-100, if computed, would generate the same data as blocks 2-50. There
is therefore no need to generate or store the remainder of the brute forced
segment: it consists of blocks 1-50 repeated over and over.</p>
<p>Repeat these steps to generate the 2nd segments of lanes 1-3, then the 3rd, then
the 4th segments of all lanes (12 brute forced segments in total.)</p>
<p>The crux of the attack is this: the annoying part of Argon2, from an
attacker’s viewpoint, is that if J_2 references the current lane, the
reference set R of blocks is constantly changing from block to
block. However, by forcing J_2 to reference other lanes, the sets R of the other 3
lanes remain constant for the whole segment, so a given block value X[i] will
generate the same X[i+1] regardless of the index i.</p>
<p>In total, on average, the prover will have to try 12 × 1,324,336 = ~15.9
million candidate values for the brute forced blocks. This precomputation must
be done once for every new challenge (eg. newly mined Zcoin block), and should
take just a few seconds of GPU time as it is an embarrassingly parallel task.</p>
<p>The total memory used by the prover is 128 MB (first segments) + 12 ×
50 kB (brute forced segments) = ~128.6 MB, or approximately 1/16th
the amount used by honest provers.</p>
<p>In the 1st segment, all blocks are consistent. In the other 3 segments, 49 out
of 50 blocks are consistent. So on average 197 out of 200 blocks are
consistent, which means the proof-of-work’s random selection of L = 70 blocks
has probability (197/200)^70 = ~0.347 of being valid. Therefore this attack
only increases the time complexity by a factor ~2.88.</p>
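<p>Reproducing the numbers in this attack:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tries = 1 / 0.75**49              # expected candidates per brute forced block
print(f"{tries:,.0f}")            # ~1.32 million, so 12 segments -> ~15.9M total
memory_mb = 128 + 12 * 50 / 1024  # shared 1st segments + 12 stored 50-block prefixes
print(f"{memory_mb:.1f}")         # ~128.6 MB, about 1/16th of 2048 MB
p = (197 / 200)**70               # chance all 70 selected blocks are consistent
print(f"{p:.3f} -> {1 / p:.2f}x the time")             # 0.347 -> 2.88x
print(f"~{(2048 / memory_mb) * p:.1f}x AT advantage")  # ~5.5x
</code></pre></div></div>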
<p>Overall, this time-memory trade-off attack is quite attractive: it requires
little precomputation, reduces memory use to ~1/16th, and increases runtime by
only ~2.88×, which translates to an ASIC area-time advantage of ~5.5× for a
cheating prover over an honest prover. The most efficient N value for this
attack remains to be determined but is likely around 30-50.</p>
<p>There is no easy and cheap way to detect the attack. The only viable fix that I
see is to use Argon2 with 1 lane instead of 4. This forces the reference set R
of blocks to change from block to block, so a given block X[i] will generate a
block X[i+1] that varies depending on the exact value of the index i.</p>
<h2 id="minor-errors-in-mtp-paper">Minor errors in MTP paper</h2>
<p>The authors claim their previous proof-of-work, Equihash, is not
truly <em>progress-free</em>. It “<em>is quite promising, but the reference
implementation reported is quite slow, as it takes about 30 seconds to get a
proof that certifies the memory allocation of 500 MB.</em>” However Equihash
performance was dramatically improved in October-November 2016 thanks to the
Zcash miner challenge which spurred competition on Equihash solvers.
<a href="https://github.com/mbevand/silentarmy/">SILENTARMY</a>, written by yours truly, takes only ~10 milliseconds to get a
proof on a modern GPU. More recent solvers such as OptiminerZcash or Claymore’s
Zcash Miner take only ~2 milliseconds.</p>
<h2 id="final-words">Final words</h2>
<p>The proof size of MTP is quite large at <strong>213920 bytes</strong> (2 GB of Argon2
blocks, L = 70, 16-byte merkle hashes: 70 elements × (2 × 1024-byte input blocks
+ 3 openings × 21 hashes × 16 bytes) = 70 × 3056 bytes.) By comparison Equihash
proofs are much smaller at <strong>1344 bytes</strong> (n = 200, k = 9.) Given that, and the fact Equihash
is not as problematic as the authors once thought, is actually progress-free,
and has been more reviewed than MTP, I personally think Equihash is a superior
proof-of-work. More research into MTP and smaller proofs may change my
opinion :-)</p>