<h1>mrb's blog</h1>
<p>Computer and network security, reverse engineering, security vulnerability research and exploitation, high-performance computing, GPGPU, assembly optimization, free and open source software, Unix kernel internals, decentralized cryptocurrencies (Bitcoin), entrepreneurship, angel investing, etc.</p>
<h1>Shipping companies underestimate dim weight by 0.7% (2023-06-12)</h1>
<p>Shipping companies like FedEx or UPS sometimes determine the shipping
rate by calculating the <em>dimensional weight</em>, or dim weight, of the package.
For example, FedEx instructs their American audience to calculate the size in
cubic inches and then to <a href="https://www.fedex.com/en-us/shipping/packaging/what-is-dimensional-weight.html">divide by 139</a>.
Others may use a divisor of 166 or 194. The result is the dim weight in
pounds.</p>
<p>139, 166 and 194 are not obvious at first sight—the imperial system rarely
makes things obvious!—but the intent is to estimate, respectively, 1/5th,
1/6th, and 1/7th the density of water:</p>
<p>5 (lb) * 453.59237 (g/lb) / (2.54 (cm/in) ^ 3) = 138.39952</p>
<p>6 (lb) * 453.59237 (g/lb) / (2.54 (cm/in) ^ 3) = 166.07943</p>
<p>7 (lb) * 453.59237 (g/lb) / (2.54 (cm/in) ^ 3) = 193.75933</p>
<p>In plain words: 5 lb of water has a volume of 138.39952 cubic inches, and so
on. So a package with a volume of 138.39952 cubic inches is assumed to
have a dim weight of 1 lb, i.e. one fifth the density of water.</p>
<p>The funny thing is: all shipping companies round these figures correctly
<em>except</em> 138.39952 which is rounded up to 139. My guess is they did the same
math as above except they approximated the pound as 454 g instead of exactly
453.59237 g:</p>
<p>5 (lb) * 454 (g/lb) / (2.54 (cm/in) ^ 3) = 138.52390 …which rounds to 139.</p>
<p>Therefore all the shipping companies who use a divisor of 139 instead of
138, such as FedEx, actually <em>underestimate</em> the intended dim weight by 0.7%,
and correspondingly undercharge customers. Who wants to tell them?</p>
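<p>As a quick sanity check of the arithmetic above, here is a small Python
snippet (a sketch of my calculation, not anything the carriers publish) that
derives the three divisors using both the exact pound and the 454 g
approximation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">#!/usr/bin/python3
# Dim weight divisor = volume (in^3) of N pounds of water, i.e. a package
# of that volume, at 1/N the density of water, weighs 1 lb.
GRAMS_PER_LB = 453.59237   # exact definition of the pound
CM_PER_IN = 2.54           # exact definition of the inch

for n in (5, 6, 7):
    exact = n * GRAMS_PER_LB / CM_PER_IN**3   # water is 1 g/cm^3
    approx = n * 454 / CM_PER_IN**3           # pound approximated as 454 g
    print(f'1/{n} the density of water: {exact:.5f} exact, {approx:.5f} approx')

# With the exact pound: 138.39952, 166.07943, 193.75933 -> 138, 166, 194.
# With 454 g: the first divisor becomes 138.52390, which rounds up to 139.
</code></pre></figure>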
<h1>Linux Kernel Patch to Support ECC Memory on AMD Ryzen 5000 APUs (2021-11-24)</h1>
<p>I made a Linux kernel patch to support ECC memory on AMD Ryzen 5000 APUs (codename Cezanne.) It
applies cleanly and works on kernel versions 5.13 through the latest
(5.16-rc2).</p>
<ul id="markdown-toc">
<li><a href="#step-by-step-patching-in-debian-12-bookworm" id="markdown-toc-step-by-step-patching-in-debian-12-bookworm">Step-by-step patching in Debian 12 (bookworm)</a></li>
<li><a href="#origin-story" id="markdown-toc-origin-story">Origin story</a></li>
</ul>
<h2 id="step-by-step-patching-in-debian-12-bookworm">Step-by-step patching in Debian 12 (bookworm)</h2>
<p>The instructions below for Debian 12 (bookworm) are minimally intrusive. They
install only one modified kernel module (<code class="language-plaintext highlighter-rouge">amd64_edac.ko</code>) while keeping
the kernel image and other modules unmodified.</p>
<p>First, install dependencies to compile the kernel module:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ apt install build-essential bc kmod cpio bison flex gnupg wget linux-headers-amd64 libncurses-dev libelf-dev libssl-dev rsync dwarves
</code></pre></div></div>
<p>Get the kernel source and extract it:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ apt install linux-source
$ cd /usr/src/
$ tar xf linux-source-5.15.tar.xz
$ cd linux-source-5.15
</code></pre></div></div>
<p>Obtain the configuration of the running kernel, and disable signing of the
modules (otherwise compilation would fail, as we do not possess Debian’s signing key):</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cp /usr/src/linux-headers-$(uname -r)/.config .
$ sed -r -i -e 's,^CONFIG_SYSTEM_TRUSTED_KEYS=.+,CONFIG_SYSTEM_TRUSTED_KEYS="",g' .config
</code></pre></div></div>
<p>Download my patch <a href="../assets/ecc-amd-cezanne.patch">ecc-amd-cezanne.patch</a> and apply it:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl https://blog.zorinaq.com/assets/ecc-amd-cezanne.patch | patch -Np0
</code></pre></div></div>
<p>Compile:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make -j $(nproc) bindeb-pkg
</code></pre></div></div>
<p>Strip the <code class="language-plaintext highlighter-rouge">.BTF</code> section from the module <code class="language-plaintext highlighter-rouge">amd64_edac.ko</code>, or else attempting to load it
will fail with the error <code class="language-plaintext highlighter-rouge">failed to validate module [amd64_edac] BTF: -22</code> in dmesg.
This seems to be a <a href="https://lkml.org/lkml/2021/11/26/623">subtle bug due to a corner case in the BTF validation
framework</a>:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objcopy --remove-section .BTF debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko
</code></pre></div></div>
<p>Install the module <code class="language-plaintext highlighter-rouge">amd64_edac.ko</code> in the <code class="language-plaintext highlighter-rouge">updates</code> directory which is given
higher priority by depmod (so our patched module will load instead of the original module):</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ DEST="/lib/modules/$(uname -r)/updates"
$ mkdir -p "$DEST"
$ cp debian/linux-image/lib/modules/*/kernel/drivers/edac/amd64_edac.ko "$DEST"
$ depmod -a
</code></pre></div></div>
<p>Try loading the module. If it works, <code class="language-plaintext highlighter-rouge">edac-util -v</code> should show ECC statistics:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ modprobe amd64_edac
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
</code></pre></div></div>
<h2 id="origin-story">Origin story</h2>
<p>I built a small Linux network file server for my home network, based on
an ASRock X570M PRO4 motherboard, an AMD Ryzen 7 PRO 5750G APU, and
128 GB of Kingston DDR4-2666 ECC unbuffered memory (KSM26ED8/32ME).</p>
<p>Officially, AMD validates ECC memory on Ryzen PRO APUs only, while non-PRO APUs
may or may not support ECC depending on the motherboard. I really wanted ECC on
an officially supported hardware stack. However, PRO APUs are distributed to
OEMs only, not to retail customers. So I hunted down a PRO APU, which I had to
buy from Germany through an eBay reseller.</p>
<p>After all this trouble, to my surprise, ECC did not work out of the box. I
could not get ECC statistics:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ edac-util
edac-util: Error: No memory controller data found.
</code></pre></div></div>
<p>This error happens because the kernel module <code class="language-plaintext highlighter-rouge">amd64_edac.ko</code> fails to load:</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ modprobe amd64_edac
modprobe: ERROR: could not insert 'amd64_edac': No such device
</code></pre></div></div>
<p>My kernel is very recent, 5.14.0-4-amd64 from Debian 12 (testing/bookworm), but
so are Ryzen 5000 APUs.</p>
<p>So, why not try a bleeding-edge kernel? I must admit it had been about 10 years
since I last built a Linux kernel, but with the help of <a href="https://www.dwarmstrong.org/kernel/">Daniel Wayne
Armstrong’s excellent, concise guide</a>,
I dived in and built 5.16-rc2, released by Torvalds 3 days ago.</p>
<p>Alas, no luck. Same “No such device” error.</p>
<p>My BIOS is up-to-date. I reached out to ASRock’s customer support, and a
helpful chap even tried booting the same processor as mine, a 5750G, on that
motherboard with some ECC memory, and sent me a screenshot proving ECC
works, at least on Windows:</p>
<p class="no-margins"><img src="../assets/ecc_asrock_x570m_pro4.jpg" alt="ECC works on Windows" title="ECC works on Windows" class="pure-img" /></p>
<p>I do not have Windows, but if only to confirm ECC can work on <em>my</em> hardware,
I figured I could probably run that <code class="language-plaintext highlighter-rouge">wmic memphysical get memoryerrorcorrection</code>
command from the Windows installation ISO, without installing the OS. When the
installer starts, select “Repair Windows”, open a command prompt, and, indeed,
the <code class="language-plaintext highlighter-rouge">wmic</code> command prints 6, meaning Multi-bit ECC is working.</p>
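<p>For reference, the session from the installer’s command prompt looks roughly
like this (approximate, not a verbatim capture):</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X:\Sources> wmic memphysical get memoryerrorcorrection
MemoryErrorCorrection
6
</code></pre></div></div>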
<p>So, what is Linux doing? I locate the module’s source, <code class="language-plaintext highlighter-rouge">drivers/edac/amd64_edac.c</code>,
add a few printk() debug messages…</p>
<p>Down the rabbit hole, I discover <code class="language-plaintext highlighter-rouge">reserve_mc_sibling_devs()</code> fails here:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="n">pvt</span><span class="o">-></span><span class="n">F0</span> <span class="o">=</span> <span class="n">pci_get_related_function</span><span class="p">(</span><span class="n">pvt</span><span class="o">-></span><span class="n">F3</span><span class="o">-></span><span class="n">vendor</span><span class="p">,</span> <span class="n">pci_id1</span><span class="p">,</span> <span class="n">pvt</span><span class="o">-></span><span class="n">F3</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">pvt</span><span class="o">-></span><span class="n">F0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">edac_dbg</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"F0 not found, device 0x%x</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">pci_id1</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="n">ENODEV</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>I convert the edac_dbg() to a printk() statement, which prints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>F0 not found, device 0x1650
</code></pre></div></div>
<p>0x1650 is a PCI device ID not present on my system. It is very helpful
for debugging that I have another AMD machine (EPYC 7232P) with ECC memory, so I
can compare the output of <code class="language-plaintext highlighter-rouge">lspci</code> and understand which PCI devices the code was
looking for. As it turns out, they are always on bus 0 device 0x18:</p>
<p>On EPYC 7232P:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ lspci -s 0:18 -n
00:18.0 0600: 1022:1490 ← function 0
00:18.1 0600: 1022:1491
00:18.2 0600: 1022:1492
00:18.3 0600: 1022:1493
00:18.4 0600: 1022:1494
00:18.5 0600: 1022:1495
00:18.6 0600: 1022:1496 ← function 6
00:18.7 0600: 1022:1497
</code></pre></div></div>
<p>On Ryzen 7 PRO 5750G:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ lspci -s 0:18 -n
00:18.0 0600: 1022:166a ← function 0
00:18.1 0600: 1022:166b
00:18.2 0600: 1022:166c
00:18.3 0600: 1022:166d
00:18.4 0600: 1022:166e
00:18.5 0600: 1022:166f
00:18.6 0600: 1022:1670 ← function 6
00:18.7 0600: 1022:1671
</code></pre></div></div>
<p>On EPYC 7232P, the code looks for 0x1490 and finds it, but on Ryzen 7 PRO 5750G
the code looks for 0x1650 and does not find it.</p>
<p>It is my understanding that the eight device functions above represent the
northbridge / memory controller, and <code class="language-plaintext highlighter-rouge">reserve_mc_sibling_devs()</code> is looking for functions 0
and 6 which, on my machine, have device IDs 0x166a and 0x1670.</p>
<p>So in <code class="language-plaintext highlighter-rouge">drivers/edac/amd64_edac.c</code>, in <code class="language-plaintext highlighter-rouge">per_family_init()</code>, I first change
the code to handle Ryzen 5000 APUs (family 0x19, model 0x50), then initialize
data structures that contain the proper device IDs (0x166a and 0x1670):</p>
<figure class="highlight"><pre><code class="language-patch" data-lang="patch"><span class="gd">--- drivers/edac/amd64_edac.h.orig 2021-11-23 20:53:17.777353032 -0800
</span><span class="gi">+++ drivers/edac/amd64_edac.h 2021-11-23 20:55:43.346625956 -0800
</span><span class="p">@@ -126,6 +126,8 @@</span>
#define PCI_DEVICE_ID_AMD_17H_M70H_DF_F6 0x1446
#define PCI_DEVICE_ID_AMD_19H_DF_F0 0x1650
#define PCI_DEVICE_ID_AMD_19H_DF_F6 0x1656
<span class="gi">+#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F0 0x166a
+#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F6 0x1670
</span>
/*
* Function 1 - Address Map
<span class="p">@@ -298,6 +300,7 @@</span>
F17_M60H_CPUS,
F17_M70H_CPUS,
F19_CPUS,
<span class="gi">+ F19_M50H_CPUS,
</span> NUM_FAMILIES,
};
<span class="gd">--- drivers/edac/amd64_edac.c.orig 2021-09-30 01:11:08.000000000 -0700
</span><span class="gi">+++ drivers/edac/amd64_edac.c 2021-11-23 21:10:39.766923976 -0800
</span><span class="p">@@ -2351,6 +2351,16 @@</span>
.dbam_to_cs = f17_addr_mask_to_cs_size,
}
},
<span class="gi">+ [F19_M50H_CPUS] = {
+ .ctl_name = "F19h_M50h",
+ .f0_id = PCI_DEVICE_ID_AMD_19H_M50H_DF_F0,
+ .f6_id = PCI_DEVICE_ID_AMD_19H_M50H_DF_F6,
+ .max_mcs = 2,
+ .ops = {
+ .early_channel_count = f17_early_channel_count,
+ .dbam_to_cs = f17_addr_mask_to_cs_size,
+ }
+ },
</span> };
/*
<span class="p">@@ -3403,6 +3413,12 @@</span>
fam_type->ctl_name = "F19h_M20h";
break;
}
<span class="gi">+ if (pvt->model == 0x50) {
+ fam_type = &family_types[F19_M50H_CPUS];
+ pvt->ops = &family_types[F19_M50H_CPUS].ops;
+ fam_type->ctl_name = "F19h_M50h";
+ break;
+ }
</span> fam_type = &family_types[F19_CPUS];
pvt->ops = &family_types[F19_CPUS].ops;
family_types[F19_CPUS].ctl_name = "F19h";</code></pre></figure>
<p>I recompile the kernel module, and lo and behold, it loads and everything works!</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ insmod amd64_edac.ko
$ dmesg
[...]
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F19h_M50h: DEV 0000:00:18.3 (INTERRUPT)
EDAC amd64: F19h_M50h detected (node 0).
EDAC MC: UMC0 chip selects:
EDAC amd64: MC: 0: 16384MB 1: 16384MB
EDAC amd64: MC: 2: 16384MB 3: 16384MB
EDAC MC: UMC1 chip selects:
EDAC amd64: MC: 0: 16384MB 1: 16384MB
EDAC amd64: MC: 2: 16384MB 3: 16384MB
EDAC amd64: using x8 syndromes.
EDAC PCI1: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
AMD64 EDAC driver v3.5.0
$ edac-util
edac-util: No errors to report.
</code></pre></div></div>
<p>“No errors to report” is what you want to see and indicates no ECC errors have occurred so far.</p>
<p>This patch can easily be applied to kernel versions before 5.13, but you will find that you
also need another patch <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kernel/amd_nb.c?id=2ade8fc65076095460e3ea1ca65a8f619d7d9a3a">2ade8fc65076095460e3ea1ca65a8f619d7d9a3a</a>
or else <code class="language-plaintext highlighter-rouge">amd64_edac.ko</code> fails to load due to <code class="language-plaintext highlighter-rouge">amd_cache_northbridges()</code> returning an error.</p>
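<p>If you go that route, kernel.org’s cgit can serve that commit in patch form,
so it can be fetched and applied much like my patch above (a sketch; note that
git-formatted patches need <code class="language-plaintext highlighter-rouge">-p1</code> rather than <code class="language-plaintext highlighter-rouge">-p0</code>):</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl 'https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=2ade8fc65076095460e3ea1ca65a8f619d7d9a3a' | patch -Np1
</code></pre></div></div>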
<h1>Booting unmodified Windows 10 over USB (2020-05-29)</h1>
<p>When I buy a laptop, the first thing I do is netboot into a PXE Linux
environment and make a full raw disk image backup of the pre-installed
Windows OS. I pipe <code class="language-plaintext highlighter-rouge">dd if=/dev/sda</code> or <code class="language-plaintext highlighter-rouge">dd if=/dev/nvme0n1</code> into <code class="language-plaintext highlighter-rouge">lz4</code>
(much faster than <code class="language-plaintext highlighter-rouge">gzip</code>) and write the image to a fileserver. No matter
how large the disk is, the image usually shrinks down to 20-30 GB because,
well, there are usually only 20-30 GB of files on the drive. The empty NTFS
blocks compress very well.</p>
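<p>Concretely, the backup is a one-liner along these lines (a sketch; the
mount point and filename are hypothetical):</p>
<div class="language-plaintext as-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dd if=/dev/nvme0n1 bs=1M status=progress | lz4 > /mnt/fileserver/laptop-oem.img.lz4
</code></pre></div></div>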
<p>I then wipe Windows and install a Linux distribution.</p>
<p>Years later when I donate or sell the laptop, I restore the raw disk image.
Not only does this restore the Windows OEM image, but because all sectors are
overwritten, it <em>securely wipes</em> the disk. Very good for security.</p>
<p>The image backup I made turned out very useful today. I needed to
update the firmware for the trackpoint of my ThinkPad X1 Carbon
laptop. However the firmware update is <em>only</em> possible using a <a href="https://pcsupport.lenovo.com/us/en/products/laptops-and-netbooks/thinkpad-x-series-laptops/thinkpad-x1-carbon-type-20hr-20hq/downloads/ds122148">Windows 10
utility</a>.</p>
<h2 id="booting-from-usb">Booting from USB</h2>
<p>Here is how I got Windows running on my laptop, without reimaging,
swapping, or repartitioning the internal drive.</p>
<h3 id="write-the-image-to-a-usb-drive">Write the image to a USB drive</h3>
<p>I wrote the factory Windows 10 disk image to an external USB drive
(<code class="language-plaintext highlighter-rouge">unlz4 <image.lz4 >/dev/sda</code>). Being a full disk image, it includes 4
partitions: EFI System partition, Microsoft Reserved partition, Basic data
partition, Recovery partition.</p>
<h3 id="set-bootdriverflags-to-0x14">Set BootDriverFlags to 0x14</h3>
<p>If you try to boot from the external USB drive as is, Windows 10
will error out during boot (INACCESSIBLE_BOOT_DEVICE).</p>
<p>The key to make it work is to edit the obscure registry value
BootDriverFlags to change it from 0x0 to 0x14. The value is located under
<code class="language-plaintext highlighter-rouge">HKLM\SYSTEM\HardwareConfig\{...uuid...}</code> and can be edited by
mounting the C: partition in Linux, and using the <code class="language-plaintext highlighter-rouge">chntpw</code> utility:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ chntpw -e /path/to/mounted/Windows/System32/config/SYSTEM
</code></pre></div></div>
<p>List the <code class="language-plaintext highlighter-rouge">{...uuid...}</code> subkey under <code class="language-plaintext highlighter-rouge">HardwareConfig</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> ls HardwareConfig
Node has 1 subkeys and 2 values
key name
<{ca7bc4cc-350d-11b2-a85c-95cecf0de0fa}>
size type value name [value if type DWORD]
78 1 REG_SZ <LastConfig>
4 4 REG_DWORD <LastId> 0 [0x0]
</code></pre></div></div>
<p>Edit BootDriverFlags with <code class="language-plaintext highlighter-rouge">ed</code> to set it to 0x14:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> ed HardwareConfig\{ca7bc4cc-350d-11b2-a85c-95cecf0de0fa}\BootDriverFlags
EDIT: <HardwareConfig\{ca7bc4cc-350d-11b2-a85c-95cecf0de0fa}\BootDriverFlags> of type REG_DWORD (4) with length 4 [0x4]
DWORD: Old value 0 [0x0], enter new value (prepend 0x if hex, empty to keep old value)
-> 0x14
DWORD: New value 20 [0x14],
</code></pre></div></div>
<p>Finally save the changes with <code class="language-plaintext highlighter-rouge">q</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> q
</code></pre></div></div>
<p>BootDriverFlags is so obscure that it is barely documented. Some online
resources say the value is under <code class="language-plaintext highlighter-rouge">HKLM\System\CurrentControlSet\Control</code>
or <code class="language-plaintext highlighter-rouge">HKLM\SYSTEM\HardwareConfig\</code> but they are wrong. The value is under
<code class="language-plaintext highlighter-rouge">HKLM\SYSTEM\HardwareConfig\{...uuid...}</code>.</p>
<p>That is it. Changing a single registry key is all that is needed to make
an OEM Windows 10 image boot from a USB drive.</p>
<h1>Evidence of the effectiveness of stay-at-home orders in the US (2020-05-15)</h1>
<p>A semi-popular <a href="https://twitter.com/boriquagato/status/1258780623570792453">Twitter thread by @boriquagato</a> argues that stay-at-home
orders are ineffective at reducing mortality in the US: “<em>there is just no
evidence that this incredibly expensive and harmful policy has any effect at
all.</em>”</p>
<p>This is, of course, preposterous. @boriquagato attempted to rank US states by
<em>cumulative</em> deaths per capita, and observed how the ranking changed over
~4 weeks. I will explain the flaws in their application of this methodology.
But first, I will present a superior methodology: examining the trend of
<em>daily</em> deaths per capita.</p>
<ul id="markdown-toc">
<li><a href="#trend-of-daily-deaths" id="markdown-toc-trend-of-daily-deaths">Trend of daily deaths</a></li>
<li><a href="#trend-of-daily-cases" id="markdown-toc-trend-of-daily-cases">Trend of daily cases</a></li>
<li><a href="#flaws-in-boriquagatos-methodology" id="markdown-toc-flaws-in-boriquagatos-methodology">Flaws in @boriquagato’s methodology</a></li>
<li><a href="#summary" id="markdown-toc-summary">Summary</a></li>
<li><a href="#responses-from-boriquagato" id="markdown-toc-responses-from-boriquagato">Responses from @boriquagato</a></li>
</ul>
<h2 id="trend-of-daily-deaths">Trend of daily deaths</h2>
<p>When analyzing daily deaths, be aware there is always a lag between
the moment a stay-at-home order goes into effect and the moment a reduction of
daily deaths is observed, for many reasons:</p>
<ul>
<li>delay between infection and death</li>
<li>stay-at-home orders may be initially advisory or poorly enforced, and eventually made mandatory or more strictly enforced</li>
<li>infected persons staying at home may infect other household members over time</li>
<li>etc</li>
</ul>
<p>The lag is approximately 2-4 weeks. So I charted daily deaths per million inhabitants
for each US state, marked the date of stay-at-home orders with a dashed line, and
overlaid a shaded area 2-4 weeks later:</p>
<p class="no-margins"><a href="../assets/covid19_us_deaths_per_capita.png"><img src="../assets/covid19_us_deaths_per_capita.png" alt="Daily COVID-19 deaths per million, for each US state" title="Daily COVID-19 deaths per million, for each US state" class="pure-img" /></a></p>
<p>In <strong>33 states</strong>—plus DC—that implemented stay-at-home orders, we expect to see and
do see daily deaths either peaking or reaching a plateau at some point in the
shaded area:
Alaska (8 deaths,)
California,
Colorado,
Connecticut,
District of Columbia,
Florida,
Georgia,
Hawaii (17 deaths,)
Idaho,
Kansas,
Kentucky,
Louisiana,
Maine,
Maryland,
Massachusetts,
Michigan,
Missouri,
Montana (16 deaths,)
Nevada,
New Jersey,
New York,
North Carolina,
Ohio,
Oklahoma,
Oregon,
Pennsylvania,
South Carolina,
Tennessee,
Texas,
Vermont (53 deaths,)
Virginia,
Washington,
West Virginia, and
Wisconsin.
Some states, as noted in parentheses, have recorded very few deaths
so the curve exhibits random noise in the form of peaks and valleys.</p>
<p>In <strong>3 states</strong> that implemented stay-at-home orders, we see the
curve is neither peaking nor reaching a plateau around the expected time.
However the sources of these anomalies are easily identified, and once
corrected the curve does peak or reach a plateau when expected:</p>
<ol>
<li>Arizona: the spike in early May is caused by a <a href="https://www.azcentral.com/story/news/local/arizona-health/2020/05/08/arizona-covid-19-daily-deaths-have-risen-due-lags-and-overlooked-deaths/3099135001/">delay in reporting deaths from weeks prior</a>. Other than this artifact, Arizona has reached the plateau when expected.</li>
<li>Indiana: the spike around late April is due to a reporting artifact in my <a href="https://github.com/nytimes/covid-19-data/blob/master/us-states.csv">data source for COVID-19 deaths</a>. The <a href="https://www.coronavirus.in.gov/2393.htm">state’s official COVID-19 dashboard</a> shows deaths peaking on 22 April, as expected (between 08 April and 22 April):
<img src="../assets/covid19_indiana_deaths.png" alt="Indiana deaths by day" title="Indiana deaths by day" class="pure-img" /></li>
<li>Mississippi: the spike around early May is, again, caused by a reporting artifact. The <a href="https://msdh.ms.gov/msdhsite/_static/14,0,420.html#map">state’s official COVID-19 dashboard</a> shows deaths peaking on or about 27 April, as expected (between 17 April and 01 May):
<img src="../assets/covid19_mississippi_deaths.png" alt="Mississippi deaths by day" title="Mississippi deaths by day" class="pure-img" /></li>
</ol>
<p>In <strong>4 states</strong> that implemented stay-at-home orders, we see anomalies with no
obvious explanation:</p>
<ol>
<li>Alabama: daily deaths appear to continue to rise past the shaded area, however there is not enough data to tell if the trend is significant or if it will reach a plateau.</li>
<li>Delaware reached a plateau around 23 April, which is 31 days after the stay-at-home order, beyond the expected delay of 2-4 weeks. The state borders Maryland, which implemented a stay-at-home order 7 days after Delaware did. Delaware is a geographically small state, so one possible but unverified explanation could be people commuting across the state line who continued to import new cases into Delaware for longer than expected.</li>
<li>Illinois: daily death growth slows down in the shaded area, but not significantly enough to plateau. It is unclear why. Most deaths occur in the Chicago metropolitan area. Lack of enforcement in the city?</li>
<li>New Mexico: a cursory search did not reveal any obvious reason why deaths are still climbing. Perhaps the state’s “<a href="https://www.santafenewmexican.com/news/coronavirus/nursing-home-for-covid-19-patients-run-by-firm-with-history-of-violations-lawsuits/article_d7091238-8423-11ea-bd9b-9f69a2f80402.html"><em>grossly substandard nursing care</em></a>” is to blame. <a href="https://www.santafenewmexican.com/news/coronavirus/state-confirms-first-santa-fe-death-from-covid-19/article_258b12ae-8fd1-11ea-afd3-d7bb9ee7818d.html">45% of COVID-19 deaths in the state occurred in long-term care facilities</a>.</li>
</ol>
<p>In <strong>3 states</strong> that implemented stay-at-home orders, daily deaths continue to
increase past the shaded area. However these states are the top 3 states in the nation
with the highest shares of long-term care facility deaths according to
<a href="https://www.kff.org/health-costs/issue-brief/state-data-and-policy-actions-to-address-coronavirus/#stateleveldata">Kaiser Family Foundation</a> (Minnesota: 81%, Rhode Island: 77%, New Hampshire: 77%) well
above the nationwide average of <a href="https://www.nytimes.com/interactive/2020/05/09/us/coronavirus-cases-nursing-homes-us.html">one-third</a>.
So daily death growth in these states does not reflect a failure of the stay-at-home
policy, but a failure to keep their long-term care facilities safe:</p>
<ol>
<li>Minnesota</li>
<li>New Hampshire</li>
<li>Rhode Island</li>
</ol>
<p>In <strong>7 states</strong> that never issued stay-at-home orders, deaths generally follow an
upward trend, with no signs of stopping. However these states are so small and
have so few deaths that their curves are very noisy (except the largest state,
Iowa):</p>
<ol>
<li>Arkansas (98 deaths)</li>
<li>Iowa (318 deaths)</li>
<li>Nebraska (118 deaths)</li>
<li>North Dakota (40 deaths)</li>
<li>South Dakota (43 deaths)</li>
<li>Utah (76 deaths)</li>
<li>Wyoming (7 deaths: the curve is pure random noise)</li>
</ol>
<h2 id="trend-of-daily-cases">Trend of daily cases</h2>
<p>We can produce the same charts with cases instead of deaths. I expect the lag
between stay-at-home orders and daily cases peaking or reaching a plateau
to be 2-15 days, represented by the shaded area:</p>
<p class="no-margins"><a href="../assets/covid19_us_cases_per_capita.png"><img src="../assets/covid19_us_cases_per_capita.png" alt="Daily COVID-19 cases per million, for each US state" title="Daily COVID-19 cases per million, for each US state" class="pure-img" /></a></p>
<p>For many states, we do indeed observe daily cases peaking or reaching a plateau
2-15 days after the stay-at-home order.</p>
<p>However daily cases sometimes continue to increase beyond the shaded area. A
likely explanation is that states are increasing the daily test rate over time,
which improves the case ascertainment rate. For this reason, the charts of daily
deaths in the previous section are a more reliable indicator of the effectiveness
of stay-at-home orders.</p>
<h2 id="flaws-in-boriquagatos-methodology">Flaws in @boriquagato’s methodology</h2>
<p><a href="https://twitter.com/boriquagato/status/1258780623570792453">@boriquagato</a> ranks US states by cumulative deaths per capita and
observes how the ranking changes over approximately 4 weeks, between 11 April
and 08 May. Blue represents states that implemented stay-at-home orders, and
red those that did not:</p>
<p class="no-margins"><a href="../assets/covid19_scatterplot0_boriquagato.jpg"><img src="../assets/covid19_scatterplot0_boriquagato.jpg" alt="@boriquagato's states ranking" title="@boriquagato's states ranking" class="pure-img" /></a></p>
<p>Firstly, the scatterplot is bogus. I recreated it using
<a href="https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/state/detail/SCPRC-EST2019-18+POP-RES.csv">census.gov 2019 population data</a> and
the <a href="https://github.com/nytimes/covid-19-data/blob/061015bd97a7db39bf9e8abaebff8543f69eedf6/us-states.csv">New York Times COVID-19 data repository</a>
and I ended
up with a significantly different scatterplot. Also, trying to place a line of best fit
on this scatterplot is pointless as it simply approximates <code class="language-plaintext highlighter-rouge">y=x</code> when the time span is
as short as 4 weeks. Among the states without stay-at-home orders, @boriquagato’s
version shows 4 of the 7 states seeing their rank worsen, but in reality it happened to
6 of the 7 states. This is evidence that stay-at-home orders are effective at
reducing mortality:</p>
<p class="no-margins"><a href="../assets/covid19_scatterplot1.png"><img src="../assets/covid19_scatterplot1.png" alt="Corrected ranking" title="Corrected ranking" class="pure-img" /></a></p>
<p>Secondly, @boriquagato selected a premature time period that failed to fully capture the
effectiveness of stay-at-home orders. For example look at the District of Columbia:</p>
<p class="no-margins"><img src="../assets/covid19_us_deaths_per_capita_snippet.png" alt="Daily COVID-19 deaths per million for D.C." title="Daily COVID-19 deaths per million for D.C." class="pure-img" /></p>
<p>In the case of DC the stay-at-home order was clearly effective because daily death growth peaked then
started declining. However the time period 11 April (red dashes) through 08 May
(blue dashes) is too early to capture the decline in full; it should have been shifted
by a minimum of ~9 days. So I recreated @boriquagato’s scatterplot over <del>20 April–15 May</del>
<strong>[Update 2020-06-20:</strong> I updated the chart to cover the longer time period 20 April–20 June<strong>]</strong>:</p>
<p class="no-margins"><a href="../assets/covid19_scatterplot2.png"><img src="../assets/covid19_scatterplot2.png" alt="Better ranking" title="Better ranking" class="pure-img" /></a></p>
<p>On average stay-at-home states gained 1.2 ranks, while states without the
policy lost 7.7 ranks. In fact, every red state moved above the diagonal.
Can we evaluate the statistical significance of this? I assumed the null
hypothesis that these states did not lose ranks, and computed a p-value of
0.0047! This is statistically significant evidence that the lack of a
stay-at-home policy is strongly correlated with increased deaths.</p>
<p>However a scatterplot of state rankings is a crude tool to validate the
effectiveness of stay-at-home orders. A more refined visualization is to
produce the same scatterplot but with the axes reflecting deaths/million
instead of rankings:</p>
<p class="no-margins"><a href="../assets/covid19_scatterplot3_dpm.png"><img src="../assets/covid19_scatterplot3_dpm.png" alt="Plotting deaths per million" title="Plotting deaths per million" class="pure-img" /></a></p>
<p>Again, no surprises here: the states without stay-at-home orders deviate upward from the
linear regression curve (dashed line.) Deaths in states without stay-at-home orders
universally increased faster than the national average.</p>
<p>This chart is a different representation of the same data as the scatterplot above,
showing deaths/capita evolving through time:</p>
<p class="no-margins"><a href="../assets/covid19_deathlog.png"><img src="../assets/covid19_deathlog.png" alt="Cumulative COVID-19 deaths per million, for each US state" title="Cumulative COVID-19 deaths per million, for each US state" class="pure-img" /></a></p>
<h2 id="summary">Summary</h2>
<p>The scatterplot created by @boriquagato is incorrect, probably because of
errors in their (undocumented) data sources or data input. A corrected scatterplot shows
that states not implementing such orders fared noticeably worse than states
that did. This is evidence that stay-at-home orders are effective.</p>
<p>Nonetheless, a scatterplot of states ranked by <em>cumulative</em> deaths per capita remains
an inferior tool to validate the effectiveness of stay-at-home orders, because
of the huge, roughly 100-fold difference in death rates between states, a difference that
existed even before states started implementing stay-at-home orders.</p>
<p>A superior methodology is to look at the trend of <em>daily</em> deaths per capita.
And indeed, among the 43 states—plus DC—that implemented stay-at-home orders,
36 states and DC (84%) demonstrate a decrease in daily death growth correlated with
the timing of the order, while 4 states exhibit unexplained anomalous trends, and in
3 states unexpected daily death growth can be attributed to external factors
(deaths in long-term care facilities.) Conversely, the 7 states that did not
implement stay-at-home orders exhibit daily death growth that persists to the
present day. This is further evidence that stay-at-home orders are effective.</p>
<p>This analysis could be improved if states reported deaths by date of death,
like Indiana and Mississippi. Most states do not provide such data, which
creates anomalies in daily death curves, as states sometimes belatedly report
deaths from weeks prior.</p>
<p>Another improvement would be possible if states provided separate statistics
for deaths in long-term care facilities. Deaths in such facilities account for
a significant number of total deaths, and are largely uncorrelated with whether or
not a state implemented a stay-at-home order.</p>
<p>Finally, the findings in this post are largely supported by third-party research showing
that social distancing does reduce the daily growth rate of the epidemic:</p>
<ul>
<li><a href="https://www.nature.com/articles/s41586-020-2404-8">The effect of large-scale anti-contagion policies on the COVID-19 pandemic</a></li>
<li><a href="https://www.nature.com/articles/s41586-020-2405-7">Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe</a></li>
<li><a href="https://science.sciencemag.org/content/early/2020/05/14/science.abb9789">Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions</a></li>
<li><a href="https://www.medrxiv.org/content/10.1101/2020.05.07.20092353v1">Social Distancing is Effective at Mitigating COVID-19 Transmission in the United States</a></li>
<li><a href="https://arxiv.org/abs/2004.06098">The effect of stay-at-home orders on COVID-19 cases and fatalities in the United States</a></li>
<li><a href="https://www.nber.org/papers/w27091">When Do Shelter-in-Place Orders Fight COVID-19 Best? Policy Heterogeneity Across States and Adoption Time</a></li>
<li><a href="https://www.nber.org/papers/w26906">Human Mobility Restrictions and the Spread of the Novel Coronavirus (2019-nCoV) in China</a></li>
<li><a href="http://ftp.iza.org/dp13262.pdf">Were Urban Cowboys Enough to Control COVID-19? Local Shelter-In-Place Orders and Coronavirus Case Growth</a></li>
<li><a href="https://www.medrxiv.org/content/10.1101/2020.05.22.20110460v1">Effects of Government Mandated Social Distancing Measures on Cumulative Incidence of COVID-19 in the United States and its Most Populated Cities</a></li>
</ul>
<h2 id="responses-from-boriquagato">Responses from @boriquagato</h2>
<p>@boriquagato made false counterclaims, did not understand my data, largely ignored
my replies, and abruptly withdrew from the discussion by blocking me on Twitter:</p>
<ol>
<li>He finally provided the data source for his scatterplot, and as I
suspected <a href="https://twitter.com/zorinaq/status/1261490101739065345">his data is bogus</a>.
He used data from 2020-04-17 instead of 2020-04-11. He never replied.</li>
<li>He claimed 3 of 7 states not implementing stay-at-home orders improved their rank,
but only 1 of 7 did. And again
<a href="https://twitter.com/zorinaq/status/1261755723777626113">his data is bogus</a>.</li>
<li>He wrongly claimed states not implementing lockdowns changed by 1 position in the scatterplot,
but in reality the
<a href="https://twitter.com/zorinaq/status/1261755443958837248">mean rank move is -5 and statistically significant</a>.</li>
<li>He thought my analysis of daily deaths per capita was based on raw daily death figures,
not realizing I <a href="https://twitter.com/zorinaq/status/1261775136262254597">compute a 7-day average</a>
to smooth out the spikes (the “date-of-report vs date-of-death” issue.)</li>
<li>He mistakenly thought daily deaths were supposed to decline in the entirety of the shaded areas.
I <a href="https://twitter.com/zorinaq/status/1261508447410941954">re-explained</a> they either peak or reach
a plateau, as expected. I marked the exact location of peaks and plateaus. He never replied.</li>
</ol>
<h1>The Case Fatality Ratio of the Novel Coronavirus (2019-nCoV) (2020-02-06)</h1>
<p>Estimating the case fatality ratio (CFR) of the 2019 novel coronavirus
is crucial to inform policy makers and help them make the best decisions.</p>
<p><a href="http://archive.is/raDd5">As of 7 February 2020</a> in mainland China there are 31 161 confirmed
cases, 637 deaths, and 1 540 recoveries. So some claim the case fatality ratio
is 637/31161 = 2%. But this is as naïve as claiming the recovery ratio is
1540/31161 = 5%, and thus that the fatality ratio is 95%.</p>
<p>28 984 cases (93%) are not resolved. They could either die or
recover. So in reality the CFR could be anywhere between 2% and 95%.</p>
<ul id="markdown-toc">
<li><a href="#naïve-cfr" id="markdown-toc-naïve-cfr">Naïve CFR</a></li>
<li><a href="#resolved-cfr" id="markdown-toc-resolved-cfr">Resolved CFR</a></li>
<li><a href="#lack-of-awareness" id="markdown-toc-lack-of-awareness">Lack of awareness</a></li>
<li><a href="#simulating-naïve-vs-resolved-cfr" id="markdown-toc-simulating-naïve-vs-resolved-cfr">Simulating naïve vs. resolved CFR</a></li>
<li><a href="#cfr-of-2019-ncov" id="markdown-toc-cfr-of-2019-ncov">CFR of 2019-nCoV</a></li>
<li><a href="#updates--validation" id="markdown-toc-updates--validation">Updates & Validation</a></li>
</ul>
<h2 id="naïve-cfr">Naïve CFR</h2>
<p>After the SARS epidemic, a paper was published in the American Journal of
Epidemiology by Ghani et al., <em><a href="https://academic.oup.com/aje/article/162/5/479/82647">Methods for Estimating the Case Fatality Ratio
for a Novel, Emerging Infectious Disease</a></em>, where they demonstrated that
this common method to estimate the CFR was severely flawed. The authors
call it a “<em>naïve estimate</em>:”</p>
<p><em>naïve CFR = deaths / cases</em></p>
<p>They noted this method is “<em>clearly easier to describe to policy makers and the
public</em>” however it exhibits “<em>considerable bias</em>.” In the case of SARS in Hong
Kong in 2003, between 2 April and 21 May it “<em>falsely suggested a rise in the
case fatality ratio</em>” by 5x, from 2% to 11%. See the “naïve CFR” curve labelled
“<em>simple estimate 1</em>” in their <a href="https://academic.oup.com/aje/article/162/5/479/82647">figure 3a</a>:</p>
<p class="no-margins"><img src="../assets/ghani_fig3a.svg" alt="Ghani et al. figure 3a" title="Ghani et al. figure 3a" class="pure-img" /></p>
<p>In reality the true observed CFR has been about 13% during this period of time.
The naïve CFR was severely inaccurate due to “<em>simply an artifact</em>,” a
lag: “<em>the final outcome for patients [death or recovery] lagged behind their
identification by approximately 3 weeks</em>.”</p>
<p>This false rise in the naïve CFR has been observed in other outbreaks too. For example
Ebola (source: <a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(14)61706-2/fulltext">Case fatality rate for Ebola virus disease in west
Africa</a>):</p>
<p class="no-margins"><a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(14)61706-2/fulltext"><img src="../assets/ebola_cfr.jpg" alt="Naïve case fatality ratio of Ebola" title="Naïve case fatality ratio of Ebola" class="pure-img" /></a></p>
<h2 id="resolved-cfr">Resolved CFR</h2>
<p>Ghani et al. recommend two other methods to better estimate the CFR. One
eliminates the lag between identification and death or recovery by taking into
account only cases whose outcome is known, or <em>resolved</em>. Ghani et al. refer to
it as the “<em>simple estimate 2</em>” (e₂) but for clarity I suggest calling it the
<em>resolved CFR</em>:</p>
<p><em>resolved CFR = deaths / (deaths + recoveries)</em></p>
<p>Another method is based on the Kaplan-Meier survival procedure, which I will
not describe here.</p>
<p>The authors conclude that both methods, the resolved CFR
and Kaplan-Meier, “<em>adequately estimated the case fatality ratio during the
SARS epidemic</em>.” As their <a href="https://academic.oup.com/aje/article/162/5/479/82647">figure 3a</a> shows the resolved CFR tracks
the true observed CFR much more closely than the naïve CFR:</p>
<p class="no-margins"><img src="../assets/ghani_fig3a.svg" alt="Ghani et al. figure 3a" title="Ghani et al. figure 3a" class="pure-img" /></p>
<p>Of course these methods have caveats, but it is very clear
from their data that the naïve CFR is the least accurate
estimate of all methods, severely underestimates the true CFR, and only
converges toward it near the end of the outbreak.</p>
<p>In the 15 years since its publication, Ghani et al.’s paper has been cited by at least 59 others.
For example <em><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4540071/">Estimating Absolute and Relative Case Fatality Ratios from
Infectious Disease Surveillance Data</a></em> supports it and concludes that
“<em>the naïve estimator is virtually always biased, often severely so</em>.”</p>
<p>Since the start of the 2019-nCoV outbreak, another expert has specifically
come forward to emphasize the resolved CFR is the right method: see this
<a href="https://issues.org/clarity-please-on-the-coronavirus-statistics/">article in Issues in Science and Technology</a> by medical doctor Maimuna
Majumder, PhD.</p>
<p>The epidemiological literature overwhelmingly supports the resolved CFR as
a better estimate than the naïve CFR, period.</p>
<p><strong>[Edit:</strong> 6 months later, in August 2020, the WHO published a <a href="https://apps.who.int/iris/handle/10665/333642">scientific brief</a>
that explains the issues with the naïve CFR method, and suggests the
resolved CFR as a simple solution.<strong>]</strong></p>
<h2 id="lack-of-awareness">Lack of awareness</h2>
<p>Unfortunately, 15 years after Ghani et al.’s paper, the public,
journalists, and even many in the medical field have learned nothing from
epidemiologists. They continue to use the naïve CFR. They appear unaware it
often underestimates the true CFR and is expected to rise over time (“false
rise”.)</p>
<p>For example the scientific director of the World Health Organization’s SARS
investigation (of all people!) appeared unaware. A 2003 <a href="https://www.nytimes.com/2003/04/22/world/death-rate-from-virus-more-than-doubles-varying-sharply-by-country.html">New York Times
article</a> wrote: “<em>the death rate has also steadily risen, leaving health
officials worried. Lacking a precise explanation for the rise, health
officials have generated a number of theories. In outbreaks of other new
infections, the death rate has usually fallen with time. ‘‘It’s worrying, and
we hope it is not an indication of a continuing trend,’’ said Dr. Klaus Stöhr,
scientific director of the W.H.O.’s SARS investigation.”</em></p>
<p>The WHO was subsequently criticized in <a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-985X.2004.00345.x">A comparison study of realtime fatality
rates: severe acute respiratory syndrome in Hong Kong, Singapore, Taiwan,
Toronto and Beijing, China</a> (<a href="http://chao.stat.nthu.edu.tw/wordpress/paper/2005_JRSSA_168_P233.pdf">PDF version</a>,) where the
authors write: “<em>While the outbreak was on going and there were patients still
in hospitals over the course of the epidemic, the WHO estimate assumed
implicitly that all remaining SARS in-patients would eventually recover. It
therefore led to an underestimation of the true case fatality rate. For
example, in the midst of the SARS outbreak at April 15th, 2003, the fatality
rate in Hong Kong was 4.5% according to the WHO estimate, but it hit a record
high of 17.0% at the end of the epidemic.</em>”</p>
<p>Have the WHO learned anything since 2003? No! Another <a href="https://apps.who.int/iris/handle/10665/333642">WHO
representative</a> recently still used the naïve CFR method, which yields
the 2% figure:
“<em>WHO representative to the Philippines Dr. Rabindra Abeyasinghe noted
that the 2019-nCoV’s death rate fell to about 2 percent</em>.”</p>
<h2 id="simulating-naïve-vs-resolved-cfr">Simulating naïve vs. resolved CFR</h2>
<p>To demonstrate how inaccurate the naïve CFR is compared to the resolved CFR, I
wrote a Python script that simulates an outbreak infecting a population of 100k
individuals over 200 days following a logistic growth curve which increases
gradually at first, more rapidly in the middle growth period, and slowly at the
end, eventually leveling off. The disease has a 50% probability of causing
death 21 days after infection.</p>
<p>The results are obvious: the naïve CFR underestimates the true CFR by 5x
(it starts at 9%) and only converges toward the true CFR (50%) near the end of
the epidemic. By comparison the resolved CFR is a <em>perfect</em> estimate at any
point:</p>
<p class="no-margins"><a href="../assets/outbreak1.svg"><img src="../assets/outbreak1.svg" alt="Simulated outbreak" title="Simulated outbreak" class="pure-img" /></a></p>
<p>In a second simulation, I changed the parameter <code class="language-plaintext highlighter-rouge">time_to_heal</code> to 28 days
in order to simulate recovery taking longer than death: deaths still happen
21 days after the infection, but recoveries happen 28 days after. The only
difference this creates is that at the beginning and middle of the epidemic
the resolved CFR slightly overestimates the true CFR by a factor of about 1.26x
(63%/50%):</p>
<p class="no-margins"><a href="../assets/outbreak2.svg"><img src="../assets/outbreak2.svg" alt="Simulated outbreak" title="Simulated outbreak" class="pure-img" /></a></p>
<p>Three things are very clear:</p>
<ol>
<li>The naïve CFR is always a severe underestimate in the beginning and middle phase of an outbreak.</li>
<li>The naïve CFR is bound to increase over time.</li>
<li>The resolved CFR is always much more accurate than the naïve CFR. At worst the resolved CFR is off by 1.26x while the naïve CFR is off by 5x.</li>
</ol>
<p>Want to play with my simulation? Here is the source code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1">#!/usr/bin/python3
</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="n">population</span> <span class="o">=</span> <span class="mf">100e3</span>
<span class="n">days</span> <span class="o">=</span> <span class="mi">200</span>
<span class="n">death_prob</span> <span class="o">=</span> <span class="mf">0.50</span>
<span class="n">time_to_death</span> <span class="o">=</span> <span class="mi">21</span>
<span class="n">time_to_heal</span> <span class="o">=</span> <span class="mi">21</span>
<span class="n">hist</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">deaths</span> <span class="o">=</span> <span class="n">recovs</span> <span class="o">=</span> <span class="n">naive_cfr</span> <span class="o">=</span> <span class="n">resolved_cfr</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">print</span><span class="p">(</span><span class="s">'day,cases,deaths,recoveries,naive_cfr,resolved_cfr'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">days</span><span class="p">):</span>
<span class="k">if</span> <span class="n">d</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">cases</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">cases</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="n">population</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">math</span><span class="p">.</span><span class="n">e</span><span class="o">**</span><span class="p">(</span><span class="o">-</span><span class="mf">0.08</span><span class="o">*</span><span class="p">(</span><span class="n">d</span> <span class="o">-</span> <span class="n">days</span><span class="o">/</span><span class="mi">2</span><span class="p">))))</span>
<span class="n">hist</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">cases</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">hist</span><span class="p">)</span> <span class="o">>=</span> <span class="n">time_to_death</span> <span class="o">+</span> <span class="mi">2</span><span class="p">:</span>
<span class="n">deaths</span> <span class="o">+=</span> <span class="nb">round</span><span class="p">((</span><span class="n">hist</span><span class="p">[</span><span class="n">time_to_death</span><span class="p">]</span> <span class="o">-</span> <span class="n">hist</span><span class="p">[</span><span class="n">time_to_death</span> <span class="o">+</span> <span class="mi">1</span><span class="p">])</span> <span class="o">*</span> \
<span class="n">death_prob</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">hist</span><span class="p">)</span> <span class="o">>=</span> <span class="n">time_to_heal</span> <span class="o">+</span> <span class="mi">2</span><span class="p">:</span>
<span class="n">recovs</span> <span class="o">+=</span> <span class="nb">round</span><span class="p">((</span><span class="n">hist</span><span class="p">[</span><span class="n">time_to_heal</span><span class="p">]</span> <span class="o">-</span> <span class="n">hist</span><span class="p">[</span><span class="n">time_to_heal</span> <span class="o">+</span> <span class="mi">1</span><span class="p">])</span> <span class="o">*</span> \
<span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">death_prob</span><span class="p">))</span>
<span class="k">if</span> <span class="n">d</span> <span class="o">></span> <span class="nb">max</span><span class="p">(</span><span class="n">time_to_death</span><span class="p">,</span> <span class="n">time_to_heal</span><span class="p">):</span>
<span class="k">if</span> <span class="n">cases</span><span class="p">:</span>
<span class="n">naive_cfr</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">deaths</span> <span class="o">/</span> <span class="n">cases</span>
<span class="k">if</span> <span class="n">deaths</span> <span class="o">+</span> <span class="n">recovs</span><span class="p">:</span>
<span class="n">resolved_cfr</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">deaths</span> <span class="o">/</span> <span class="p">(</span><span class="n">deaths</span> <span class="o">+</span> <span class="n">recovs</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'{day},{cases},{deaths},{recovs},{naive_cfr},{resolved_cfr}'</span><span class="p">.</span>\
<span class="nb">format</span><span class="p">(</span><span class="n">day</span><span class="o">=</span><span class="n">d</span><span class="p">,</span> <span class="n">cases</span><span class="o">=</span><span class="n">cases</span><span class="p">,</span> <span class="n">deaths</span><span class="o">=</span><span class="n">deaths</span><span class="p">,</span> <span class="n">recovs</span><span class="o">=</span><span class="n">recovs</span><span class="p">,</span>
<span class="n">naive_cfr</span><span class="o">=</span><span class="n">naive_cfr</span><span class="p">,</span> <span class="n">resolved_cfr</span><span class="o">=</span><span class="n">resolved_cfr</span><span class="p">))</span></code></pre></figure>
<h2 id="cfr-of-2019-ncov">CFR of 2019-nCoV</h2>
<p>We know the naïve CFR (2%) is inaccurate and will increase, as seen in previous
outbreaks (SARS, Ebola) and in the simulations above.</p>
<p>We know the resolved CFR is much more accurate, and as of <a href="http://archive.is/raDd5">7 February
2020</a> it stands at deaths/(deaths+recoveries) = 637/(637+1540) = 29%,
which is rather alarming.</p>
<p>But we should interpret any CFR estimate with caution:</p>
<ul>
<li>We are very early in the outbreak, with very few resolved cases (a few thousands) so the resolved CFR is fluctuating a lot from day to day</li>
<li>China’s official statistics may be underestimating the number of deaths (e.g. patients who die while suspected of, but not yet confirmed as having, 2019-nCoV are not counted toward deaths caused by the virus)</li>
<li>There may be many undetected/underreported mild cases that heal on their own and thus are not counted in the number of recoveries (these cases could be detected when/if serosurveys are performed)</li>
</ul>
<p>Nonetheless we can take a stab at guessing a best-case scenario for a low
fatality ratio. Let’s assume there are 50x more cases than reported, all mild,
so 50x more recoveries. And let’s assume there are only 2x more deaths than
reported.</p>
<p>With these parameters the CFR would be 637*2/(637*2+1540*50)
= 1.6%, which is still concerning, because if there are 50x more cases than
reported, then the virus is spreading far faster than we think. A pandemic
would then be unavoidable, and 1.6% would be 16x deadlier than the seasonal flu
(0.1%.) The flu kills half a million worldwide every year, so 2019-nCoV would
kill a few million.</p>
<p>You can play with the parameters, but either way 2019-nCoV is not looking good
at all.</p>
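<p>For convenience, here is that calculation as a snippet you can tweak (a
sketch of the back-of-the-envelope estimate above; the multipliers are my
guesses, not data):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">#!/usr/bin/python3
# Resolved CFR under assumed under-reporting of cases and deaths.
deaths, recoveries = 637, 1540   # official figures as of 7 February 2020
death_mult = 2                   # assume 2x more deaths than reported
case_mult = 50                   # assume 50x more cases, all mild...
d = deaths * death_mult
r = recoveries * case_mult       # ...so 50x more recoveries
print('resolved CFR = {:.1%}'.format(d / (d + r)))   # prints 1.6%
</code></pre></figure>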
<p>My parameters of 50x more cases than reported and 2x more deaths than reported
are equivalent to assuming the same number of deaths but <strong>25x</strong> more cases than reported
(divide both multipliers by 2; the ratio is unchanged), as of 7 February 2020.
This assumption is roughly consistent with other
estimates:</p>
<ul>
<li><a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30260-9/fulltext">Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study</a>
estimates 75 815 cases as of 25 January 2020, which is <strong>38x</strong> more than the 1 975
cases officially reported by China on that date.</li>
<li>The Imperial College London MRC Centre for Global Infectious Disease Analysis estimates
in <a href="https://www.imperial.ac.uk/media/imperial-college/medicine/mrc-gida/2020-02-10-COVID19-Report-4.pdf">Report 4: Severity of 2019-novel coronavirus</a> as of 3 February 2020
that there are <strong>19x</strong> or <strong>26x</strong> more infections than reported, according to
2 scenarios.</li>
<li><a href="https://www.mdpi.com/2077-0383/9/2/419">The Rate of Underascertainment of Novel Coronavirus (2019‐nCoV) Infection:
Estimation Using Japanese Passengers Data on Evacuation Flights</a> estimates
<strong>11x</strong> more infections than reported (“<em>the ascertainment rate of infection was estimated at 9.2%</em>.”)</li>
</ul>
<h2 id="updates--validation">Updates & Validation</h2>
<p><strong>As of <a href="http://archive.is/DCkmo">16 February 2020</a> the resolved CFR based on China’s official
statistics stands at 1863/(1863+10844) = 15%. And when accounting for
underreported mild cases, I still believe my guess of 1.6%, made on 7 February
2020, is reasonable.</strong></p>
<p>On 10 February 2020,
the Imperial College London MRC Centre for Global Infectious Disease Analysis
<a href="https://www.imperial.ac.uk/media/imperial-college/medicine/mrc-gida/2020-02-10-COVID19-Report-4.pdf">published Report 4: Severity of 2019-novel coronavirus</a>
that estimates the CFR based on China’s statistics at <strong>18%</strong>, or <strong>0.8-0.9%</strong>
when accounting for underreported mild cases. Both figures approximately match my
estimates (respectively 15% and 1.6%.)</p>
<p>On 14 February 2020, <a href="https://www.mdpi.com/2077-0383/9/2/523">Real-Time Estimation of the Risk of Death from Novel
Coronavirus (COVID-19) Infection: Inference Using Exported Cases</a>
estimated the IFR at <strong>0.5-0.8%</strong>. The Infection Fatality Ratio is the same
as a CFR estimate that corrects for underreported cases. It approximately
matches my estimate (1.6%.)</p>
<p>On 19 February 2020, the Institute for Disease Modeling <a href="https://institutefordiseasemodeling.github.io/nCoV-public/analyses/first_adjusted_mortality_estimates_and_risk_assessment/2019-nCoV-preliminary_age_and_time_adjusted_mortality_rates_and_pandemic_risk_assessment.html">estimated the
IFR</a> at <strong>0.94%</strong>. It approximately matches my estimate (1.6%.)</p>
<p>The 3 reports above are referenced by the WHO in <a href="https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200219-sitrep-30-covid-19.pdf?sfvrsn=3346b04f_2">Coronavirus disease situation
report - 30</a> (references 10, 11 and 12) and in <a href="https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200220-sitrep-31-covid-19.pdf">situation report 31</a>.</p>
<p>As of 21 February 2020, a researcher from the Institute of Social and
Preventive Medicine, University of Bern <a href="https://github.com/calthaus/ncov-cfr/blob/d30f02e1c20e06103aad6a04eb492dd156466d98/README.md">estimated the CFR</a> using a
methodology to correct for possible underreporting of mild cases, and found
<strong>1.6%</strong>. It exactly matches my estimate for this scenario (1.6%.)</p>
<p>As of 30 March 2020, a peer-reviewed paper in The Lancet, <a href="https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(20)30243-7/fulltext">Estimates of the
severity of coronavirus disease 2019: a model-based analysis</a>,
estimated the IFR at <strong>0.66%</strong>. It approximately matches my estimate (1.6%)
within a two-fold factor.</p>
Reverse Engineering and Cloning the Costco Digital Membership Cardhttp://blog.zorinaq.com/reverse-engineering-costco-digital-membership-card/2019-10-09T00:00:00-07:00<p>In this post I will reverse engineer the Costco Android app to find out how the
digital membership card works, and how to generate the time-based token encoded
in its QR code. I reimplemented it in plain HTML/JS. As a result the card can
be cloned to any number of devices, bypassing the app’s security mechanism that
attempted to tie the card to only one device.</p>
<p><strong>2022-10-11</strong> Three years later, this hack still works, only with minor changes.
The class that fetches ANDROID_ID is now MembershipCardUtilImpl. I used apktool
to decode the Costco Android app APK file version 6.7.0; I applied
<a href="../assets/costco-patch-6.7.0.txt">this new patch</a> (just make sure to edit ANDROID_ID
to a unique random value of your choosing); I rebuilt the APK and installed it.
The app still generates the same QR code as my <a href="../assets/costco_cardgen/">card generator</a>. I find
it easier to simply hardcode the ANDROID_ID, as this avoids the need to then log it
after installing the app.</p>
<ul id="markdown-toc">
<li><a href="#why" id="markdown-toc-why">Why?</a></li>
<li><a href="#costco-goes-digital" id="markdown-toc-costco-goes-digital">Costco goes digital</a></li>
<li><a href="#reverse-engineering-the-app" id="markdown-toc-reverse-engineering-the-app">Reverse engineering the app</a></li>
<li><a href="#custom-card-generator" id="markdown-toc-custom-card-generator">Custom card generator</a></li>
<li><a href="#final-thoughts" id="markdown-toc-final-thoughts">Final thoughts</a></li>
<li><a href="#disclosure-timeline" id="markdown-toc-disclosure-timeline">Disclosure timeline</a></li>
</ul>
<h2 id="why">Why?</h2>
<p>I reverse engineered the app mainly out of curiosity, but also for efficiency:
my reimplementation of the card loads instantly compared to their app, and I
can factory reset or replace my phone and the card will continue to work,
whereas the Costco app will force you to “transfer” it to a new device (and it
allows only one transfer, after which customer service must be contacted.)</p>
<p>In the end this project was trivial and serves as a nice introduction to
Android reverse engineering.</p>
<p>But first, a story:</p>
<p>In 2004 shortly after I moved to the US, I was driving around my
neighborhood and saw a big-box store that I decided to visit. It was a Costco
and I did not know it was a membership-only store. I managed to get in through
sheer luck. I did not notice shoppers were required to flash their membership
cards at the greeter at the entrance. In fact I barely
noticed her, as she was standing discreetly at the side of the oversized
entrance. She did not stop me.</p>
<p>So I got in, shopped for a bit, and when it was my turn to pay at the
checkout counter the clerk asked for my card.</p>
<p>«<em>My what?</em>»</p>
<p>«<em>Sir, you don’t have a membership card?</em>»</p>
<p>Fast forward 15 years: I love Costco. I take my family there to shop
for items in bulk that I probably do not <em>need</em> in bulk,
but it is just so convenient. However I am always looking to carry fewer things
in my wallet, and Costco has been annoying in that regard, as one must carry
and present this stinking plastic magnetic stripe membership card.</p>
<h2 id="costco-goes-digital">Costco goes digital</h2>
<p>Two months ago they finally introduced a <em>digital membership card</em> on
their Costco smartphone app:</p>
<p class="no-margins"><img src="../assets/costco-dmc.jpg" alt="Costco digital membership card" title="Costco digital membership card" class="pure-img" /></p>
<p>So I install their Android app (version: 4.2.1,
APK SHA256: 29b10840e299b31213b89788554da0038901ef878e2a076e45b221a65fc4f222.)
My first impression is that it is very heavy given how little it does:
68 MB and it is mostly a WebView shim of their online store.
It is slow to load and takes ~6 seconds just to get to the
digital membership card view which is showing:</p>
<ul>
<li>Membership type icon (standard, executive…)</li>
<li>First and last name</li>
<li>Membership number</li>
<li>Membership start and expiration</li>
<li>Photo</li>
<li>QR code</li>
</ul>
<p>Let’s try to take a screenshot… it does not work. The app does not allow it as
it sets <a href="https://developer.android.com/reference/android/view/WindowManager.LayoutParams#FLAG_SECURE">FLAG_SECURE</a> on this window.</p>
<h2 id="reverse-engineering-the-app">Reverse engineering the app</h2>
<p>I grab the APK file, run <a href="https://ibotpeaches.github.io/Apktool/"><code class="language-plaintext highlighter-rouge">apktool decode</code></a> to unpack and disassemble
it, and recursively grep for <code class="language-plaintext highlighter-rouge">setFlags</code> to look for a call passing the value
FLAG_SECURE (0x2000).</p>
<p>The app is not obfuscated so it is easy to find the relevant
<code class="language-plaintext highlighter-rouge">setFlags</code> call that needs to be modified. It is in the file <code class="language-plaintext highlighter-rouge">MembershipCardActivity.smali</code>,
method <code class="language-plaintext highlighter-rouge">onCreate()</code>. Just zero out the flags to disable FLAG_SECURE:</p>
<figure class="highlight"><pre><code class="language-diff" data-lang="diff"><span class="gh">diff -Nur com.costco.app.android_4.2.1.orig/smali/com/costco/app/android/digitalmembership/MembershipCardActivity.smali com.costco.app.android_4.2.1/smali/com/costco/app/android/digitalmembership/MembershipCardActivity.smali
</span><span class="gd">--- com.costco.app.android_4.2.1.orig/smali/com/costco/app/android/digitalmembership/MembershipCardActivity.smali 2019-10-07 21:06:54.186134497 -0700
</span><span class="gi">+++ com.costco.app.android_4.2.1/smali/com/costco/app/android/digitalmembership/MembershipCardActivity.smali 2019-10-06 15:57:21.298978426 -0700
</span><span class="p">@@ -258,7 +258,7 @@</span>
move-result-object p1
<span class="gd">- const/16 v0, 0x2000
</span><span class="gi">+ const/16 v0, 0x0000
</span>
invoke-virtual {p1, v0, v0}, Landroid/view/Window;->setFlags(II)V
</code></pre></figure>
<p>I rebuild the APK with <code class="language-plaintext highlighter-rouge">apktool build</code>, sign it, zipalign it. I uninstall Costco’s
official APK, install my patched APK, and launch the app.</p>
<p>It asks me if I want to “transfer” my digital membership card to this new
device, hmm… My rebuild caused that? (I will find out later
the technical reason why this happened.) Anyway, I allow it, open
the digital card, and taking a screenshot is now possible:</p>
<p class="no-margins"><img src="../assets/costco-screenshot.png" alt="Screenshot of Costco digital membership card" title="Screenshot of Costco digital membership card" class="pure-img" /></p>
<p>(All my data in this screenshot has been altered for privacy. My real Costco photo is a lot worse ☺)</p>
<p>The QR code contains a numeric value where MMM… is the membership number:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>96000MMMMMMMMMMMMM000114362136
</code></pre></div></div>
<p>So is cloning the digital membership card as easy as taking a screenshot? Nope
because the last 8 digits of the QR code value appear to change every once in a
while.</p>
<p>Let’s investigate how the app builds the QR code.</p>
<p>I fire up the <a href="https://github.com/skylot/jadx">jadx</a> decompiler as it produces Java code more readable
than smali code. All the relevant code
seems to be under <code class="language-plaintext highlighter-rouge">com.costco.app.android.digitalmembership</code>;
in particular <code class="language-plaintext highlighter-rouge">MembershipCardFragment.setMembershipQRCode()</code>:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">private</span> <span class="kt">void</span> <span class="nf">setMembershipQRCode</span><span class="o">()</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="na">card</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">&&</span> <span class="n">getContext</span><span class="o">()</span> <span class="o">!=</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">StringBuilder</span> <span class="n">stringBuilder</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">StringBuilder</span><span class="o">();</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="no">APPLICATION_IDENTIFIER</span><span class="o">);</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="no">SUB_TYPE</span><span class="o">);</span>
<span class="kt">int</span> <span class="n">length</span> <span class="o">=</span> <span class="k">this</span><span class="o">.</span><span class="na">card</span><span class="o">.</span><span class="na">getMemberCardNumber</span><span class="o">().</span><span class="na">length</span><span class="o">();</span>
<span class="k">if</span> <span class="o">(</span><span class="n">length</span> <span class="o"><=</span> <span class="mi">13</span><span class="o">)</span> <span class="o">{</span>
<span class="k">while</span> <span class="o">(</span><span class="n">length</span> <span class="o"><</span> <span class="mi">13</span><span class="o">)</span> <span class="o">{</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="s">"0"</span><span class="o">);</span>
<span class="n">length</span><span class="o">++;</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">if</span> <span class="o">(!</span><span class="nc">StringUtils</span><span class="o">.</span><span class="na">isNullOrEmpty</span><span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="na">card</span><span class="o">.</span><span class="na">getMemberCardNumber</span><span class="o">()))</span> <span class="o">{</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="na">card</span><span class="o">.</span><span class="na">getMemberCardNumber</span><span class="o">());</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="nc">MembershipCardUtils</span><span class="o">.</span><span class="na">generateDynamicToken</span><span class="o">(</span><span class="n">getContext</span><span class="o">()));</span>
<span class="k">this</span><span class="o">.</span><span class="na">qrCodeImage</span><span class="o">.</span><span class="na">setImageBitmap</span><span class="o">(</span><span class="nc">MembershipCardUtils</span><span class="o">.</span><span class="na">encodeAsBitmap</span><span class="o">(</span>
<span class="n">getContext</span><span class="o">(),</span> <span class="n">stringBuilder</span><span class="o">.</span><span class="na">toString</span><span class="o">()));</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>And <code class="language-plaintext highlighter-rouge">MembershipCardUtils.generateDynamicToken()</code> generates the last digits
of the QR code:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">static</span> <span class="nc">String</span> <span class="nf">generateDynamicToken</span><span class="o">(</span><span class="nc">Context</span> <span class="n">context</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">MessageDigest</span> <span class="n">instance</span><span class="o">;</span>
<span class="nc">String</span> <span class="n">deviceId</span> <span class="o">=</span> <span class="n">getDeviceId</span><span class="o">(</span><span class="n">context</span><span class="o">);</span>
<span class="kt">long</span> <span class="n">currentTimeMillis</span> <span class="o">=</span> <span class="nc">System</span><span class="o">.</span><span class="na">currentTimeMillis</span><span class="o">()</span> <span class="o">/</span> <span class="mi">300000</span><span class="o">;</span>
<span class="nc">StringBuilder</span> <span class="n">stringBuilder</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">StringBuilder</span><span class="o">();</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">deviceId</span><span class="o">);</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="no">SALT</span><span class="o">);</span>
<span class="n">stringBuilder</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">currentTimeMillis</span><span class="o">);</span>
<span class="n">deviceId</span> <span class="o">=</span> <span class="n">stringBuilder</span><span class="o">.</span><span class="na">toString</span><span class="o">();</span>
<span class="k">try</span> <span class="o">{</span>
<span class="n">instance</span> <span class="o">=</span> <span class="nc">MessageDigest</span><span class="o">.</span><span class="na">getInstance</span><span class="o">(</span><span class="s">"SHA-256"</span><span class="o">);</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">NoSuchAlgorithmException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="n">e</span><span class="o">.</span><span class="na">printStackTrace</span><span class="o">();</span>
<span class="n">instance</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
<span class="o">}</span>
<span class="n">deviceId</span> <span class="o">=</span> <span class="nc">Integer</span><span class="o">.</span><span class="na">toString</span><span class="o">(</span><span class="nc">Integer</span><span class="o">.</span><span class="na">parseInt</span><span class="o">(</span><span class="n">bytesToHex</span><span class="o">(</span><span class="n">instance</span><span class="o">.</span><span class="na">digest</span><span class="o">(</span>
<span class="n">deviceId</span><span class="o">.</span><span class="na">getBytes</span><span class="o">(</span><span class="nc">StandardCharsets</span><span class="o">.</span><span class="na">UTF_8</span><span class="o">))).</span><span class="na">substring</span><span class="o">(</span><span class="mi">0</span><span class="o">,</span> <span class="mi">6</span><span class="o">),</span> <span class="mi">16</span><span class="o">));</span>
<span class="nc">StringBuilder</span> <span class="n">stringBuilder2</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">StringBuilder</span><span class="o">();</span>
<span class="n">stringBuilder2</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="no">RESERVED_FOR_FUTURE</span><span class="o">);</span>
<span class="n">stringBuilder2</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="no">DIGITAL_TOKEN_VERSION</span><span class="o">);</span>
<span class="k">while</span> <span class="o">(</span><span class="n">deviceId</span><span class="o">.</span><span class="na">length</span><span class="o">()</span> <span class="o"><</span> <span class="no">TOKEN_LENGTH</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">StringBuilder</span> <span class="n">stringBuilder3</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">StringBuilder</span><span class="o">();</span>
<span class="n">stringBuilder3</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="nc">ApiErrorCode</span><span class="o">.</span><span class="na">UNKNOWN</span><span class="o">);</span>
<span class="n">stringBuilder3</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">deviceId</span><span class="o">);</span>
<span class="n">deviceId</span> <span class="o">=</span> <span class="n">stringBuilder3</span><span class="o">.</span><span class="na">toString</span><span class="o">();</span>
<span class="o">}</span>
<span class="n">stringBuilder2</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">deviceId</span><span class="o">);</span>
<span class="k">return</span> <span class="n">stringBuilder2</span><span class="o">.</span><span class="na">toString</span><span class="o">();</span>
<span class="o">}</span></code></pre></figure>
<p>A device identifier is obtained from <code class="language-plaintext highlighter-rouge">getDeviceId()</code> which simply returns
the Android platform identifier <a href="https://developer.android.com/reference/android/provider/Settings.Secure#ANDROID_ID">ANDROID_ID</a>.</p>
<p><code class="language-plaintext highlighter-rouge">SALT</code> is the literal string <code class="language-plaintext highlighter-rouge">"SCOTTJOHN"</code> (if you are John Scott and wrote this code, let’s have a beer one day!)</p>
<p>From the code above it is obvious how the QR code numeric value is generated.
There are 4 constant fields and 2 variable fields:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>96000MMMMMMMMMMMMM0001TTTTTTTT
96 ........................... APPLICATION_IDENTIFIER
000 ........................ SUB_TYPE
MMMMMMMMMMMMM ........... membership number padded to 13 digits
00 ......... RESERVED_FOR_FUTURE
01 ....... DIGITAL_TOKEN_VERSION
                      TTTTTTTT dynamic token padded to 8 digits
</code></pre></div></div>
<p>The dynamic token is generated by computing the SHA-256 hash of the concatenation of 3 strings:</p>
<ol>
<li>ANDROID_ID value</li>
<li>literal string <code class="language-plaintext highlighter-rouge">"SCOTTJOHN"</code></li>
<li>decimal integer representation of <code class="language-plaintext highlighter-rouge">System.currentTimeMillis() / 300000</code></li>
</ol>
<p>The hexadecimal hash is truncated to 6 hex digits, converted to decimal, and padded to 8 digits.
The dynamic token is in essence a <strong>time-based 24-bit token with a granularity of 300 seconds or 5 minutes</strong>.</p>
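<p>Here is a condensed Python equivalent of the logic above, based on my reading
of the decompiled code (the zero-padding is my interpretation of the
TOKEN_LENGTH loop):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import hashlib
import time

def generate_dynamic_token(android_id):
    # Time window with a granularity of 300000 ms (5 minutes)
    window = int(time.time() * 1000) // 300000
    data = '{}SCOTTJOHN{}'.format(android_id, window).encode('utf-8')
    # First 6 hex digits of the SHA-256 hash, converted to decimal
    token = int(hashlib.sha256(data).hexdigest()[:6], 16)
    return str(token).zfill(8)  # left-pad to 8 digits

def qr_value(membership_number, android_id):
    # 96 + 000 + membership number padded to 13 digits + 00 + 01 + token
    return '96000{}0001{}'.format(
        membership_number.zfill(13), generate_dynamic_token(android_id))</code></pre></figure>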
<p>Using ANDROID_ID to calculate the token serves as a primitive security
mechanism to tie the card to only one device.</p>
<p>Now this explains why the app asked me earlier if I wanted to “transfer” the
card to my recompiled APK. The Android platform, since Android 8.0, generates
ANDROID_ID values unique to each combination of app-signing key, user, and
device. The patched APK is signed with my key, different from the key of the
official APK, so the two APKs see different values. When Costco’s server-side
infrastructure notices an ANDROID_ID reported by the app that is different from
what it expects, it prompts to transfer the card to “the new device.”</p>
<p>In order to generate the token myself, I need to know ANDROID_ID. So I
patch the method <code class="language-plaintext highlighter-rouge">getDeviceId()</code> to log the value (stored in register p0)
using <code class="language-plaintext highlighter-rouge">System.out.println()</code>:</p>
<figure class="highlight"><pre><code class="language-diff" data-lang="diff"><span class="gh">diff -Nur com.costco.app.android_4.2.1.orig/smali/com/costco/app/android/digitalmembership/MembershipCardUtils.smali com.costco.app.android_4.2.1/smali/com/costco/app/android/digitalmembership/MembershipCardUtils.smali
</span><span class="gd">--- com.costco.app.android_4.2.1.orig/smali/com/costco/app/android/digitalmembership/MembershipCardUtils.smali 2019-10-07 21:06:54.190134496 -0700
</span><span class="gi">+++ com.costco.app.android_4.2.1/smali/com/costco/app/android/digitalmembership/MembershipCardUtils.smali 2019-10-07 20:42:26.474302400 -0700
</span><span class="p">@@ -485,7 +485,7 @@</span>
.end method
.method public static getDeviceId(Landroid/content/Context;)Ljava/lang/String;
<span class="gd">- .locals 1
</span><span class="gi">+ .locals 3
</span> .annotation build Landroid/annotation/SuppressLint;
value = {
"HardwareIds"
<span class="p">@@ -503,6 +503,11 @@</span>
move-result-object p0
<span class="gi">+ sget-object v1, Ljava/lang/System;->out:Ljava/io/PrintStream;
+ const-string v2, "android_id="
+ invoke-virtual {v1, v2}, Ljava/io/PrintStream;->print(Ljava/lang/String;)V
+ invoke-virtual {v1, p0}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V
+
</span> return-object p0
.end method
</code></pre></figure>
<p>I build and push the APK. Now when launching the app <code class="language-plaintext highlighter-rouge">adb logcat</code> reveals ANDROID_ID:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>10-07 21:16:41.414 26112 26112 I System.out: android_id=3d2axxxxxxxxxxxx
</code></pre></div></div>
<h2 id="custom-card-generator">Custom card generator</h2>
<p>Armed with my ANDROID_ID value, and a description of the time-based token
algorithm, I was able to reimplement a digital membership card generator in
plain HTML/JS:</p>
<p><a href="../assets/costco_cardgen/">Costco digital membership card generator</a></p>
<p>To use the generator:</p>
<ol>
<li>Download the Costco app APK, either from your phone using <code class="language-plaintext highlighter-rouge">adb pull</code>, or from an <a href="https://www.apkmirror.com/apk/costco-wholesale/costco-wholesale/costco-wholesale-4-2-1-release/costco-wholesale-4-2-1-android-apk-download/">APK archive site</a></li>
<li>Decode the APK with apktool, apply my two patches, and rebuild it</li>
<li>Install the app on a phone, launch it, take a screenshot of the digital membership card</li>
<li>Run <code class="language-plaintext highlighter-rouge">adb logcat</code> and write down ANDROID_ID</li>
<li>Save my <a href="../assets/costco_cardgen/">card generator</a> and host it somewhere (1 HTML, 2 JS files, and 1 template screenshot)</li>
<li>Replace my template screenshot with your screenshot (crop it to remove the Android status bar)</li>
<li>Open the HTML file, type your ANDROID_ID and membership number when prompted (they will be saved in the URL fragment,) and a real-looking card with a valid QR code is displayed</li>
<li>Tip: scroll down by a few pixels to hide the Chrome address bar; or use “Add to Home screen” and the page will load in fullscreen mode</li>
</ol>
<p>An implementation detail to be aware of: my QR code will always decode to the
exact same numerical value as the app’s QR code, but may look different. This
is because QR codes can use 8 mask patterns for data encoding. A QR
implementation is supposed to pick the best mask pattern based on certain logic,
such as minimizing the number of consecutive same-color pixels. My QR library
and the app’s library simply have slightly differing logic in that regard. But
for the sole purpose of visually confirming whether the QR codes encode the
same data, I added a feature so that clicking the QR code cycles through the 8
possible mask patterns.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>The proper way to implement a digital membership card tied to a single
device would have been to leverage Android’s <a href="https://source.android.com/security/keystore">hardware-backed keystore</a>
with <a href="https://source.android.com/security/keystore/attestation">key attestation</a>, and to put the asymmetric cryptographic
signature of a timestamp in the QR code.</p>
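<p>For illustration only, here is a rough Python sketch of what the server-side
verification could look like under that design; every name here is
hypothetical, nothing of the sort exists in the app:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># Sketch only: the device would sign the current time window with a
# hardware-backed, attested private key; the server verifies the signature
# with the public key it has on file for that member.
import time
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

def verify_qr(device_public_key, member_number, signature):
    window = int(time.time()) // 300  # same 5-minute granularity
    message = '{}:{}'.format(member_number, window).encode()
    try:
        device_public_key.verify(signature, message, ec.ECDSA(hashes.SHA256()))
        return True
    except InvalidSignature:
        return False</code></pre></figure>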
<p>So what happened 15 years ago when I did not have a membership card? The clerk
called a security guard to escort me out. What an embarrassing moment.</p>
<p>Nowadays I present my phone to the clerk and when he scans the QR code he does
not even realize he is scanning my card generator.</p>
<h2 id="disclosure-timeline">Disclosure timeline</h2>
<p>I chose full disclosure given the nature of the vulnerability. I am also
reaching out to Costco to advise them on remediation.</p>
<p><strong>2019-10-09</strong> I send an email to the Costco app developer contact
listed on the Play Store (webservice@contactcostco.com).</p>
<p><strong>2019-10-10</strong> I receive an automated reply “we will no longer be responding to
requests via email.”</p>
<p><strong>2019-10-10</strong> Finding no security contact information whatsoever, I tweet
@Costco. I also tentatively send a message to Director of IT Security Andrew
Tuck through LinkedIn and to a handful of guesses of what might be his
corporate email address. I call the corporate headquarters and leave
a voicemail to Tuck.</p>
Reviewing Morgan Stanley's Bitcoin research reportshttp://blog.zorinaq.com/morgan-stanley-bitcoin-research-reports/2018-02-06T00:00:00-08:00<p>I reviewed the following two research reports published by analysts at
Morgan Stanley on the subject of Bitcoin electricity usage:</p>
<ol>
<li><a href="https://ny.matrix.ms.com/eqr/article/webapp/b974be48-effb-11e7-8cdb-6e28b48ebbd3">Bitcoin ASIC production substantiates electricity use; points to coming jump</a>,
by James E Faucette et al., January 3, 2018</li>
<li><a href="https://ny.matrix.ms.com/eqr/article/webapp/11d8a6c8-f081-11e7-8cdb-6e28b48ebbd3">Bitcoin demand > EV demand?</a>,
by Nicholas J Ashworth et al., January 10, 2018</li>
</ol>
<p>The first report estimates current electricity consumption at 2500 MW, agreeing
with my own estimate of <a href="../bitcoin-electricity-consumption/#summary">1620/2100/3136 MW</a> (lower bound/best guess/upper
bound) as of January 11, 2018:</p>
<p class="no-margins"><img src="../assets/bitcoin_elec_consu_ms_2018.png" alt="Chart of Bitcoin electricity consumption" title="Bitcoin electricity consumption" class="pure-img" /></p>
<p>However I spotted a few errors.</p>
<ul id="markdown-toc">
<li><a href="#1-math-error-multiplying-instead-of-dividing" id="markdown-toc-1-math-error-multiplying-instead-of-dividing">1. Math error (multiplying instead of dividing)</a></li>
<li><a href="#2-pue-of-mining-farms-as-low-as-103-133" id="markdown-toc-2-pue-of-mining-farms-as-low-as-103-133">2. PUE of mining farms as low as 1.03-1.33</a></li>
<li><a href="#3-inconsistent-pue-math" id="markdown-toc-3-inconsistent-pue-math">3. Inconsistent PUE math</a></li>
<li><a href="#4-hashrate-method-makes-optimistic-and-pessimistic-assumptions" id="markdown-toc-4-hashrate-method-makes-optimistic-and-pessimistic-assumptions">4. Hashrate method makes optimistic and pessimistic assumptions</a></li>
<li><a href="#5-only-a-fraction-of-the-ordos-farm-mines-bitcoins" id="markdown-toc-5-only-a-fraction-of-the-ordos-farm-mines-bitcoins">5. Only a fraction of the Ordos farm mines bitcoins</a></li>
<li><a href="#6-antminer-s9-dominates-the-market" id="markdown-toc-6-antminer-s9-dominates-the-market">6. Antminer S9 dominates the market</a></li>
<li><a href="#7-electricity-costs" id="markdown-toc-7-electricity-costs">7. Electricity costs</a></li>
<li><a href="#8-bitmains-direct-sales-model-one-global-price" id="markdown-toc-8-bitmains-direct-sales-model-one-global-price">8. Bitmain’s direct sales model: ONE global price</a></li>
<li><a href="#9-transaction-fees-not-accounted-for" id="markdown-toc-9-transaction-fees-not-accounted-for">9. Transaction fees not accounted for</a></li>
<li><a href="#footnotes" id="markdown-toc-footnotes">Footnotes</a></li>
</ul>
<h2 id="1-math-error-multiplying-instead-of-dividing">1. Math error (multiplying instead of dividing)</h2>
<p>The analysts attempt to forecast future consumption, 12 months from now (ca.
January 2019,) and claim it may be “<em>more than 13 500/hour [sic]
megawatts</em>.”</p>
<p>Based on TSMC production orders for 15-20k 300mm wafer-starts of Bitcoin ASICs
per month, they estimate “<em>up to 5-7.5M new rigs</em>” could be added. They claim
to calculate electricity consumption based on 6.5M, but their numbers line
up only with the upper bound 7.5M:</p>
<p>7.5M × 1300 (watts) × 1.4 (efficiency improvement) = 13 650 MW</p>
<p>The multiplication by 1.4 is meant to account for new rigs bringing a “<em>40%
efficiency improvement</em>” and this is their error: they multiply instead
of dividing.<sup id="fnref:eff" role="doc-noteref"><a href="#fn:eff" class="footnote" rel="footnote">1</a></sup> A given volume of more energy-efficient wafers/chips
consumes less, not more, per mm² of die area. When correcting this error we
arrive at an estimate of 6950 MW, about half their published number
(13 500 MW.)</p>
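<p>The arithmetic, spelled out in Python:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">rigs = 7.5e6    # upper bound of new rigs
watts = 1300.0  # per rig

print(rigs * watts * 1.4 / 1e6)  # 13650.0 MW: the analysts' multiplication
print(rigs * watts / 1.4 / 1e6)  # ~6964 MW: dividing instead, i.e. ~6950 MW</code></pre></figure>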
<h2 id="2-pue-of-mining-farms-as-low-as-103-133">2. PUE of mining farms as low as 1.03-1.33</h2>
<p>The Morgan Stanley analysts assume “<em>60% direct electricity usage (i.e. 40% of
total electricity consumption is used for non-hashing operations like cooling,
network equipment, etc.)</em>” In data center lingo this is called a <a href="https://en.wikipedia.org/wiki/Power_usage_effectiveness">PUE</a>
of 100/60 = 1.67.</p>
<p>However no study supports such terrible PUE values for the <em>mining
industry</em>.<sup id="fnref:sourceVavilov" role="doc-noteref"><a href="#fn:sourceVavilov" class="footnote" rel="footnote">2</a></sup> In reality, most mining farms aggressively optimize
their PUE:</p>
<ul>
<li>Gigawatt Mining builds <a href="https://medium.com/gigawatt">air-cooled mining farms</a> having a PUE of 1.03-1.05.<sup id="fnref:sourceGW" role="doc-noteref"><a href="#fn:sourceGW" class="footnote" rel="footnote">3</a></sup></li>
<li>Bitfury data centers are highly energy-efficient; for example
their 40 MW Norway data center has a PUE of 1.05,<sup id="fnref:moirana" role="doc-noteref"><a href="#fn:moirana" class="footnote" rel="footnote">4</a></sup> and their CEO
emphasized their Iceland data center does not have a high PUE.<sup id="fnref:sourceVavilov:1" role="doc-noteref"><a href="#fn:sourceVavilov" class="footnote" rel="footnote">2</a></sup></li>
<li>The well-known Bitmain Ordos mine reportedly has a PUE <a href="https://plus.google.com/+MarcBevand/posts/Fg8i66ATzhU">of either 1.11 or
1.33</a> (depending on which journalist’s numbers are trusted.)</li>
</ul>
<p>Google optimized their data center PUEs as low as 1.06<sup id="fnref:googlePUE" role="doc-noteref"><a href="#fn:googlePUE" class="footnote" rel="footnote">5</a></sup> and
electricity is not even one of their main costs. So it completely makes sense
to find miners, for whom electricity <em>is</em> one of their main costs, to be in the
same range.</p>
<h2 id="3-inconsistent-pue-math">3. Inconsistent PUE math</h2>
<p>According to their future consumption estimate and PUE estimate,
the resulting global consumption should be 13500 × 1.67 × 90% utilization
= 20 250 MW. But they calculate “<em>nearly 16 000 MW</em>.”</p>
<p>20250 ≠ 16000. The math is inconsistent.</p>
<p>Correcting their math and parameters gives 6950 × 1.11 (or 1.33) × 90% = 6950 (or 8300) MW.</p>
<p>In summary, Morgan Stanley’s first report forecasts the consumption ca. January
2019 will be <strong>13 500-16 000 MW (120-140 TWh/yr annualized)</strong> however
fixing multiple errors actually forecasts <strong>6950-8300 MW (60-75 TWh/yr
annualized).</strong></p>
<h2 id="4-hashrate-method-makes-optimistic-and-pessimistic-assumptions">4. Hashrate method makes optimistic and pessimistic assumptions</h2>
<p>The report claims “<em>the hash-rate methodology uses a fairly optimistic set of
efficiency assumptions</em>.” This is not true. Well, perhaps they referred to
other people’s hashrate methodology. But mine, as explained <a href="../bitcoin-electricity-consumption/#summary">in the
introduction</a>, makes optimistic <strong>and pessimistic</strong> assumptions
(miners using either the least or the most efficient ASICs.)</p>
<h2 id="5-only-a-fraction-of-the-ordos-farm-mines-bitcoins">5. Only a fraction of the Ordos farm mines bitcoins</h2>
<p>The report continues by attempting to extrapolate the global electricity
consumption from the Ordos mine:</p>
<ul>
<li>They fail to account for the fact that only 7/8th of the farm mines bitcoins.
The other 1/8th mines litecoins.</li>
<li>The media publishes slightly different power consumption numbers,
implying either <a href="https://plus.google.com/+MarcBevand/posts/Fg8i66ATzhU">29.2 or 35 MW for the Bitcoin rigs</a> (depending on
journalists.)</li>
<li>They build their calculations on a grossly rounded estimate of its hashrate
(“<em>4% of ~6M TH/s</em>”), but it can be calculated more exactly as we know
there are 21k Bitcoin rigs (~263k TH/s.)</li>
</ul>
<p>When correcting these errors, the mine’s power consumption scaled to a global
hashrate of 15.2M TH/s would imply a global power consumption of either <strong>1690
or 2020 MW</strong> (14.8 or 17.7 TWh/yr) depending on journalists. This is significantly
less than the analysts’ <strong>2700 MW</strong> (23 TWh/yr.)</p>
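<p>The corrected extrapolation, as a quick Python check:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">global_hashrate = 15.2e6  # TH/s
farm_hashrate = 263e3     # TH/s, the 21k Bitcoin rigs at Ordos

for farm_mw in (29.2, 35.0):  # depending on journalists
    mw = farm_mw * global_hashrate / farm_hashrate
    print(round(mw), round(mw * 8760 / 1e6, 1))  # MW and TWh/yr</code></pre></figure>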
<h2 id="6-antminer-s9-dominates-the-market">6. Antminer S9 dominates the market</h2>
<p>The report states “<em>the most efficient mining rigs used by Bitmain in its
facilities [Antminer S9/T9] are not yet widely available</em>” and imply that if
they are not available that the average rig must be another less efficient
model.</p>
<p>The analysts conflate <em>market availability</em> with <em>market share</em>.</p>
<p>Bitmain claimed in mid-2017 they had a <a href="https://qz.com/1053799/chinas-bitmain-dominates-bitcoin-mining-now-it-wants-to-cash-in-on-artificial-intelligence/">70% market share</a>.
Everything points to the fact it is even higher today.
The Antminer S9/T9 has been the only Bitcoin mining rig sold by Bitmain for the last 20 months.
Batches of tens of thousands sell out in minutes at shop.bitmain.com.
Bitmain is <a href="https://twitter.com/jwangARK/status/954429531678543872">buying ~20k 16nm wafers a month</a> and arguably makes up
most of the ~10k a month that the Morgan Stanley analysts claim since 3Q17.</p>
<p>~10k wafers = ~270k S9 = <strong>~3.6 EH/s</strong> manufactured per month.</p>
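<p>(Sanity check: assuming the S9’s nominal ~13.5 TH/s, 270 000 rigs × 13.5 TH/s ≈ 3.6 EH/s.)</p>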
<p>That is more than the <strong>1-3 EH/s</strong> added monthly to the global hashrate over
3Q17/4Q17 (it takes months to go from wafer production to mining.) Bitmain rigs
make up <em>virtually all</em> the hashrate being deployed to this day.</p>
<h2 id="7-electricity-costs">7. Electricity costs</h2>
<p>As to Morgan Stanley’s second report, it merely quotes the first report’s
flawed prediction of 120-140 TWh/yr ca. January 2019. But other than that it
is generally of better quality than the first. My criticism concerns
relatively minor points.</p>
<p>In it, the analysts calculate the cost of mining one bitcoin by assuming
electricity costs between 6¢ and 8¢ per kWh. Their sources are EIA numbers
grossly rounded for entire geographical regions.</p>
<p>Miners do not pay average prices. They choose the least expensive electrical
utilities of these regions.</p>
<p>For example where the analysts quote <strong>7.46¢</strong> for Washington State (see
their exhibit 5,) a mining farm located in this state, <a href="https://giga-watt.com/">Giga Watt</a>,
pays in fact <strong>2.8¢</strong>. It is my opinion that the industry average is probably
around 5¢.</p>
<h2 id="8-bitmains-direct-sales-model-one-global-price">8. Bitmain’s direct sales model: ONE global price</h2>
<p>Another assumption they make when calculating the cost of mining a bitcoin is
to assume that outside China an Antminer S9 costs $7000. In reality only
individual retail sales reach such high prices on third party sites such as
eBay. Large-scale miners representative of the average mining farm, even
outside China, all pay the same price: Bitmain’s direct sales price which
was $2320 for the batches sold around the time the report was written.</p>
<h2 id="9-transaction-fees-not-accounted-for">9. Transaction fees not accounted for</h2>
<p>Finally, they imply the cost of mining one bitcoin is a “<em>breakeven point</em>”
but that is not exactly true. For example, at the time of the report, transaction
fees collected by miners averaged more than <a href="https://blockchain.info/charts/transaction-fees?daysAverageString=7">600 BTC daily</a> and boosted
their global daily revenue by 1.33× (1800 to 2400 BTC,) hence the true
breakeven point was 1.33× lower.</p>
<p>Correcting these errors, with an electricity cost of $0.05/kWh, with the same
sale price globally, and with the (unusually) high-fee period of
December/January, the true breakeven point was $2300, significantly below the
analysts’ number ($3000 to $7000.)</p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:eff" role="doc-endnote">
<p>I would argue that it is incorrect to multiply as well as to divide.
Newer chips consume less energy per <em>silicon gate</em> not per <em>mm² of die
area</em>. The power consumption per mm² is roughly the same between two
different generations of chips, because gates consume less but more can
be packed per mm². For example a Radeon R9 390 and a
Radeon RX Vega 64 (about the same die area: 438 vs 486 mm²) are manufactured
at two very different process nodes (28nm vs 14nm,) yet they have the same
~300 W TDP. Nonetheless I went on with the division to follow the
analysts’ argument. <a href="#fnref:eff" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sourceVavilov" role="doc-endnote">
<p>A quote from Bitfury CEO Valery Vavilov is often
misinterpreted when he offhandedly <a href="https://spectrum.ieee.org/computing/networks/why-the-biggest-bitcoin-mines-are-in-china">said in an interview</a> “<em>Many data centers around the world have 30 to 40 percent of
electricity costs going to cooling</em>,” which corresponds to a PUE of 1.43 to
1.67. Obviously he was referring to traditional data centers outside the
mining industry. In fact he emphasized “<em>This [high PUE] is not an
issue in our Iceland data center</em>.” <a href="#fnref:sourceVavilov" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:sourceVavilov:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:sourceGW" role="doc-endnote">
<p>Source: personal discussion with CEO Dave Carlson. <a href="#fnref:sourceGW" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:moirana" role="doc-endnote">
<p>Bitfury’s 40 MW Mo i Rana data center in Norway has a
<a href="http://bitfury.com/content/4-press/03_20_18_bitfury_norway_datacenter_release.pdf">PUE of 1.05 or lower</a> <a href="#fnref:moirana" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:googlePUE" role="doc-endnote">
<p>Source: <a href="https://www.google.com/about/datacenters/efficiency/internal/">Efficiency: How we do it</a> <a href="#fnref:googlePUE" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Running a Bitcoin full node on $5 a monthhttp://blog.zorinaq.com/full-node-on-5-dollars/2017-10-25T00:00:00-07:00<p>In the context of the Bitcoin scaling debate, small blockers believe that if
the block size limit is doubled, running a fully validating Bitcoin node will
become expensive and only well-funded corporations will be able to run nodes,
hence increasing centralization.</p>
<p>However <em>facts and data</em> disprove this narrative.</p>
<p>Here is a fact: despite blocks currently averaging 1 MB <strong>it is possible to run
a full node on a $5/month VPS</strong>. I know because I run one. I demonstrate
below it could also (likely) handle 4 MB blocks.</p>
<p>The point I am making is not that people should run full nodes on VPS. The
point is that the cost of a VPS serves as a good indicator of the <em>amortized
cost</em> of resources consumed by a full node, and this cost is tiny. It is
irrelevant if users run the workload on a Raspberry Pi on a home connection, or
on a VPS.</p>
<ul id="markdown-toc">
<li><a href="#technical-setup" id="markdown-toc-technical-setup">Technical setup</a></li>
<li><a href="#handling-larger-blocks" id="markdown-toc-handling-larger-blocks">Handling larger blocks</a></li>
</ul>
<h2 id="technical-setup">Technical setup</h2>
<p>In my experience <strong>512 MB</strong> RAM is sufficient, even during IBD (initial
block download.) According to the requirements <a href="https://bitcoin.org/en/bitcoin-core/features/requirements">(bare minimum, with custom
settings)</a> we could do with 256 MB, but I choose 512 MB to have some
headroom.</p>
<p>The blockchain is currently approximately 145 GB, so that is the minimum
storage we need. We could run in pruned mode to reduce requirements to 5 GB,
however the goal here is to have a full node archiving the entire blockchain.
The blockchain will likely grow to <strong>200-220 GB</strong> over the next year,
assuming 1.0-1.5 MB blocks.</p>
<p>I observed the network bandwidth consumed by 30 connected peers to be in
the range <strong>100-300 GB/month</strong> (average of 300-900 kbit/s) and 95% of it is
outbound.</p>
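<p>For reference, here is the conversion from monthly volume to average rate, as
a quick Python check:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># GB per 30-day month to average kbit/s
for gb in (100, 300):
    print(round(gb * 1e9 * 8 / (30 * 86400) / 1e3))  # ~309 and ~926 kbit/s</code></pre></figure>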
<p><code class="language-plaintext highlighter-rouge">bitcoind</code> is at <1% CPU usage most of the time, so CPU performance is a
non-issue. Extremely rare pathological cases like the 1 MB transaction in block
<a href="https://blockchain.info/block-index/912565">364292</a> would cause a spike of CPU usage lasting 20-60 seconds.</p>
<p>Here is a handful of VPS plans adequate for running a full node:</p>
<ul>
<li>$3.5/mo: <a href="http://letbox.com/page/storage">LetBox</a> 512 MB RAM, 200 GB HDD, 2 TB bandwidth</li>
<li>$4/mo: <a href="https://www.hostens.com/vps-hosting/#hosting__plan__group-tab-storage">Hostens</a> 512 MB RAM, 512 GB HDD, 4 TB bandwidth</li>
<li>$5/mo: <a href="https://hosthatch.com/storage-vps">HostHatch</a> 512 MB RAM, 250 GB HDD, 1 TB bandwidth</li>
<li>$5/mo: <a href="https://www.serverhub.com/vps/ssd-cached">ServerHub</a> 512 MB RAM, 500 GB HDD, 1 TB bandwidth</li>
<li>6€/mo: <a href="https://www.scaleway.com/pricing/">Scaleway</a> 2 GB RAM, 200 GB <strong>SSD</strong>, 200 Mbps unmetered
(3€ for VPS including 50 GB + 3€ for an extra 150 GB)</li>
</ul>
<p>I chose ServerHub at $5/month, because it offered the most disk space.
<strong>[Update 2018-10-02:</strong> a new offering by Hostens is a lot more attractive
with four times the bandwidth (4 vs. 1 TB per month) and a tad more
storage.<strong>]</strong></p>
<p><a href="https://gist.github.com/laanwj/efe29c7661ce9b6620a7">These tips</a> are a great baseline for reducing memory usage. However one
advice is notably missing: run Bitcoin Core 0.15.0 or later because its <a href="https://bitcoin.org/en/release/v0.15.0">memory
usage is more predictable</a>. Also, running a 32-bit version of Bitcoin
Core may help—I have not measured by how much, but that is what I chose to run
based on experience.</p>
<p>When first setting up the node, I configured <code class="language-plaintext highlighter-rouge">bitcoin.conf</code> to optimize the
initial block download. The <code class="language-plaintext highlighter-rouge">dbcache</code> parameter (UTXO cache) should be as
large as what the machine can support:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dbcache=100
maxmempool=10
maxconnections=10
disablewallet=1
</code></pre></div></div>
<p>The peak system memory usage I recorded with these settings during IBD was 403
MB out of 512 MB, excluding Linux buffers/cache. IBD completed in 45
hours.<sup id="fnref:refB" role="doc-noteref"><a href="#fn:refB" class="footnote" rel="footnote">1</a></sup> In the initial stages it was bottlenecked by the overall network
throughput of my peers at ~30 Mbit/s, while in the later stages the bottleneck
shifted to CPU and disk I/O.</p>
<p>Once IBD completed I reconfigured <code class="language-plaintext highlighter-rouge">bitcoin.conf</code> to give more space to the
mempool,<sup id="fnref:refA" role="doc-noteref"><a href="#fn:refA" class="footnote" rel="footnote">2</a></sup> allow up to 30 peers, and limit upload to 16000 MB/day (0.5 TB/month):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dbcache=50
maxmempool=50
maxconnections=30
disablewallet=1
maxuploadtarget=16000
</code></pre></div></div>
<p>Complete specs of this ServerHub VPS:</p>
<ul>
<li>32-bit Debian 6.0.10 (old but fits the job)</li>
<li>32-bit Bitcoin Core 0.15.0.1</li>
<li>2 CPU cores Xeon E5-2630 v3 @ 2.40GHz</li>
<li>512 MB RAM</li>
<li>500 GB simfs (OpenVZ instance)</li>
</ul>
<p>That is it. We have a $5/month full node, running well, perfectly healthy,
contributing to decentralizing the network.</p>
<h2 id="handling-larger-blocks">Handling larger blocks</h2>
<p>2 MB or 4 MB blocks should <em>not</em> significantly increase RAM usage. The UTXO cache
can remain limited (dbcache=50). It is a <a href="https://zander.github.io/posts/UTXO-Growth/">widespread misconception</a> that
the cache must be big enough to hold the entire UTXO set. It is fine for a
non-mining full node to have many UTXO cache misses. The mempool can also
remain limited (maxmempool=50) even if blocks are 2× or 4× larger. An
overflowing mempool causes, at worst, occasional unnecessary network
retransmissions of transactions.</p>
<p>2 MB or 4 MB blocks should <em>proportionally</em> increase network bandwidth,
storage, and CPU usage. However this VPS has resources to spare: it uses only
10-30% of its monthly bandwidth, 31% of its storage space, and the CPU is <1%
usage most of the time. Quadratic hashing is solved by segwit, and solvable in
non-segwit transactions by limiting them to 1 MB.</p>
<p>If 4 MB blocks started being mined today, the monthly bandwidth would still
work out (I would probably double <code class="language-plaintext highlighter-rouge">maxuploadtarget</code> to allow bitcoind to use
the maximum amount of bandwidth possible, and reduce <code class="language-plaintext highlighter-rouge">maxconnections</code> from 30
to 25 if needed), and the 500 GB would fill up in 1.7 years.</p>
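<p>My back-of-the-envelope math behind that 1.7-year figure, assuming ~144
blocks per day:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">used_gb, total_gb = 145, 500     # current chain size, disk size
growth = 4e6 * 144 * 365 / 1e9   # ~210 GB/year with full 4 MB blocks
print((total_gb - used_gb) / growth)  # ~1.7 years to fill the free space</code></pre></figure>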
<p>So this $5/month VPS would have enough RAM/network/storage/CPU resources to
function as a full node with 4 MB blocks, today and for the near future. Plus,
$5/month will buy more resources 1.7 years from now, so it may be possible to
continue operating a node on such a shoestring budget even further.</p>
<p>How does $5/month fit the narrative of small blockers who claim running a full
node will become too expensive? It does not. $5/month is affordable.
$5/month is within reach of enough individuals to protect decentralization.</p>
<p>Fun comparison: the average transaction fee right now is <a href="https://bitinfocharts.com/comparison/bitcoin-transactionfees.html#1y">over $3</a>,
therefore <strong>running a full node costs less than the fees for sending 2
Bitcoin transactions a month</strong>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:refB" role="doc-endnote">
<p>For comparison, as of block 482000 (August 2017) IBD can be completed
in <a href="https://www.reddit.com/r/Bitcoin/comments/6x17t1/bitcoin_core_0150_performance_is_impressive/dmcpt4h/">3 hours</a> on a high-end machine (24 cores, 64 GB RAM), see also
chart on page 16 of <a href="https://people.xiph.org/~greg/gmaxwell-sf-015-2017.pdf">Upcoming in Bitcoin Core 0.15</a>. <a href="#fnref:refB" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:refA" role="doc-endnote">
<p>In theory IBD would have run just fine with dbcache=50 and
maxmempool=50 because unused mempool memory is automatically shared with
the UTXO cache, since version 0.14.0. I did things the way I did out of habit. <a href="#fnref:refA" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
The case for increasing Bitcoin's block weight limithttp://blog.zorinaq.com/block-increase-needed/2017-10-25T00:00:00-07:00<p>March 2017 was a significant moment for Bitcoin: the average block size bumped
into the 1MB limit, stunting the growth of the transaction rate ever since.
Many arguments are made about how to increase capacity (changing the limit,
developing off-chain solutions, etc.) However very rarely are discussions
centered around <em>facts and data</em> such as estimating the impact of the stunted
growth so far and the urgency of the situation. Also troubling is the
widespread misconception that the <a href="https://bitcoinmagazine.com/articles/segregated-witness-activates-bitcoin-what-expect/">activation of segwit</a> on August 24th
has given Bitcoin short-term relief.</p>
<p>In reality, I estimate <strong>Bitcoin missed the opportunity to process over $20
billion of transactions</strong>. Low segwit adoption keeps the network seriously
congested to this day: <strong>over 90% of blocks are over 90% full</strong>. And given the
current rate of increase of segwit adoption, <strong>the network will likely remain
congested for the foreseeable future</strong>.</p>
<p>This post is <em>not</em> about segwit2x, a grassroots change initially
<a href="https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-March/013921.html">proposed</a> by Sergio Demian Lerner on bitcoin-dev to double the block
weight, and which is supported by signatories of the <a href="https://medium.com/@DCGco/bitcoin-scaling-agreement-at-consensus-2017-133521fe9a77">New York Agreement</a>.
Although I support it myself, segwit2x is a complex topic that deserves its own
blog post. Rather, the goal of this post is only to demonstrate the impact of
congestion and the urgent need to increase the block weight limit in one way or
another (not necessarily segwit2x.)</p>
<ul id="markdown-toc">
<li><a href="#average-block-size" id="markdown-toc-average-block-size">Average block size</a></li>
<li><a href="#mempool" id="markdown-toc-mempool">Mempool</a></li>
<li><a href="#segwit" id="markdown-toc-segwit">Segwit</a></li>
<li><a href="#market-share" id="markdown-toc-market-share">Market share</a></li>
<li><a href="#altcoins" id="markdown-toc-altcoins">Altcoins</a></li>
<li><a href="#pragmatism-vs-idealism" id="markdown-toc-pragmatism-vs-idealism">Pragmatism vs. idealism</a></li>
<li><a href="#decentralization" id="markdown-toc-decentralization">Decentralization</a></li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ul>
<h2 id="average-block-size">Average block size</h2>
<p>Let’s start with a chart showing the average block size over the last 3 years:</p>
<p class="no-margins"><img src="../assets/stunted_growth.png" alt="Stunted growth" title="stunted growth" class="pure-img" /></p>
<p>Up until March 2017 the transaction rate growth was predictable. Had there not
been a 1MB limit, it seems the market demand for transaction capacity would
have been for approximately 1.2MB blocks today. By failing to capture an extra
0 to 0.2MB of transactions per block, that is, 0 to 20% more transactions over the
period March to October (daily transaction volume of <a href="https://blockchain.info/charts/estimated-transaction-volume-usd?timespan=1year&daysAverageString=7">$300-1000 million</a>),
<strong>Bitcoin failed to capture $20 billion worth of transactions over the last 7
months</strong>.</p>
<h2 id="mempool">Mempool</h2>
<p>I have to touch the subject of the mempool because the human—not
technical—dynamics that affect its functioning are often misunderstood.</p>
<p>It is a mistake to think that “<em>if the mempool is not growing, then Bitcoin is
not congested</em>.” When congestion begins, the mempool starts growing, some
transactions in it are delayed 1 day, 2 days, 4 days… eventually some users
lose patience and avoid sending transactions (<strong>Bitcoin fails to capture
them</strong>), hence reducing pressure on the mempool, which starts shrinking until
it is back to normal or even empty. And the cycle repeats immediately: users
send more transactions, the mempool grows, they lose patience, and avoid
Bitcoin, etc.</p>
<p><strong>Under perpetual congestion conditions the mempool will never grow
indefinitely</strong>.</p>
<h2 id="segwit">Segwit</h2>
<p>I hear you say “<em>but segwit allows up to 4MB blocks, so the problem is solved,
right?</em>”</p>
<p>No! Segwit replaced the <em>size</em> limit of 1MB with a <em>weight</em> limit of 4 million
weight units (“4M”). Therefore
the size of blocks in <em>bytes</em> is no longer relevant to estimate the level of
congestion. Instead we need to look at the weight of blocks in <em>weight
units</em>. For example average blocks of 1MB/4M would indicate congestion,
while larger blocks of smaller weight such as 1.5MB/3M would indicate no congestion.</p>
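<p>For reference, here is the BIP 141 weight formula behind these numbers,
sketched in Python:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def block_weight(base_size, total_size):
    # Non-witness bytes count 4x, witness bytes count 1x:
    # weight = 4*base_size + 1*witness_size = 3*base_size + total_size
    return 3 * base_size + total_size

print(block_weight(1000000, 1000000))  # 4000000: a 1MB legacy block is full
print(block_weight(500000, 1500000))   # 3000000: 1.5MB yet not congested</code></pre></figure>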
<p>The weight of blocks is a crucial metric, yet I am not aware of <em>any</em> Bitcoin
statistics site that charts it. Not even <a href="http://segwit.party/charts/">segwit.party</a> (even though
weight data is present in their <a href="http://enyo.gk2.sk/data.json">JSON file</a>.) It is almost as if segwit
proponents who believe segwit will solve scalability wanted to hide how
congested the network really is…</p>
<p>So I charted the data for the last 1000 blocks as I am typing this (blocks 490698
through 491697, roughly October 19th-25th):</p>
<p class="no-margins"><img src="../assets/blocks_full.png" alt="Over 90% of blocks are over 90% full" title="Over 90% of blocks are over 90% full" class="pure-img" /></p>
<p>913 out of the last 1000 blocks have a weight greater than 3.6M (90% of 4M), in
other words <strong>over 90% of blocks are over 90% full</strong>. This is a clear sign that
congestion is still seriously impacting Bitcoin. And the direct cause of this
congestion is low segwit adoption which sits at only <a href="http://segwit.party/charts/">8%</a> two months
after activation:</p>
<p class="no-margins"><a href="http://segwit.party/charts/"><img src="../assets/segwit_adoption.png" alt="Segwit adoption rate" title="Segwit adoption rate" class="pure-img" /></a></p>
<p>(Opinions abound as to why it peaked at 17% and fell to 8%. Some claim
a group was faking segwit adoption by spamming the network.)</p>
<p>An argument I sometimes hear is that “<em>there is no severe congestion,
because if there was, segwit adoption would increase more sharply</em>.” This is
flawed reasoning because other factors—independent of congestion—slow down
segwit adoption: wallet software and services are not upgraded to use segwit by
default, users are unaware of how to use segwit or why they need it, etc.</p>
<p>As a result, <strong>the average block size was only able to grow by a paltry few
percent from 1MB to 1.02-1.04MB in two months</strong>, significantly
undersupplying the market demand (which seems to be for approximately 1.2MB
blocks, see above.)</p>
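<p>A rough model, under my own assumption that witness data makes up ~60% of a
typical segwit transaction, shows why 8% adoption yields blocks of barely over
1MB:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITNESS_SHARE = 0.6   # assumed fraction of a typical segwit tx that is witness data

def max_block_size_mb(adoption):
    # if a fraction w of a full block's bytes are witness bytes, then
    # weight = size * (4 - 3*w), so size maxes out at 4M / (4 - 3*w)
    w = adoption * WITNESS_SHARE
    return 4 / (4 - 3 * w)

print(f"{max_block_size_mb(0.08):.2f}MB")  # ~1.04MB at 8% adoption
print(f"{max_block_size_mb(1.00):.2f}MB")  # ~1.82MB at full adoption
</code></pre></div></div>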
<h2 id="market-share">Market share</h2>
<p>So Bitcoin blocks have been full since March. What else happened in March?</p>
<p class="no-margins"><img src="../assets/bitcoin_dominance.png" alt="Bitcoin market dominance" title="Bitcoin market dominance" class="pure-img" /></p>
<p>Bitcoin lost market share, from 85% down to 55%. In my humble opinion, the
coincidence of the drop starting <em>exactly</em> in March is a sign that the congested
network was the direct cause of Bitcoin’s loss.</p>
<p>This could be explained by one theory: users being fed up with the congested
network and high transaction fees, and moving to altcoins with less congestion
and lower fees.</p>
<h2 id="altcoins">Altcoins</h2>
<p>Interestingly, data points seem to support this theory. The transaction rate of
Ethereum, Litecoin, and Dash (and maybe others) also started increasing in or
around March 2017:</p>
<p class="no-margins"><a href="https://bitinfocharts.com/comparison/ethereum-transactions.html#1y"><img src="../assets/ethereum_txrate.png" alt="Ethereum transaction rate" title="Ethereum transaction rate" class="pure-img" /></a></p>
<p class="no-margins"><a href="https://bitinfocharts.com/comparison/litecoin-transactions.html#1y"><img src="../assets/litecoin_txrate.png" alt="Litecoin transaction rate" title="Litecoin transaction rate" class="pure-img" /></a></p>
<p class="no-margins"><a href="https://bitinfocharts.com/comparison/dash-transactions.html#1y"><img src="../assets/dash_txrate.png" alt="Dash transaction rate" title="Dash transaction rate" class="pure-img" /></a></p>
<p>If anything, this is more evidence <em>potentially</em> validating the theory that
some users are abandoning Bitcoin for altcoins due to congestion.</p>
<p>Another explanation for why Ethereum’s transaction rate went up so sharply
is the ICO boom. However ICOs are hardly responsible for all of it,
and they do not explain the rise in Litecoin and Dash, which are much less often
used to fund ICOs.</p>
<p>Notice that <strong>Ethereum already processes 450k transactions daily, which is 70%
more than the 260k processed by Bitcoin</strong>… This might be a slightly unfair
comparison because unlike Ethereum, some Bitcoin transactions are batched. To
be fairer we should compare the number of Ethereum transactions to the
number of Bitcoin transaction <em>outputs</em> (and for transactions with 2 or more
outputs, subtract 1 because one of them is likely the change output.) However
batching is not that common, so the comparison between the number of Ethereum
and Bitcoin transactions is mostly valid.</p>
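<p>A sketch of that adjusted count, run on hypothetical per-transaction output
counts:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def adjusted_payments(output_counts):
    """Count Bitcoin payments: one per output, minus one assumed
    change output for transactions with 2 or more outputs."""
    return sum(n - 1 if n >= 2 else n for n in output_counts)

sample = [2, 2, 1, 5, 2]          # hypothetical per-transaction output counts
print(adjusted_payments(sample))  # 8 payments from 5 transactions
</code></pre></div></div>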
<h2 id="pragmatism-vs-idealism">Pragmatism vs. idealism</h2>
<p>How do we resolve Bitcoin’s congestion? Instead of increasing the block weight
limit, I think we all agree more ideal solutions would be the <a href="https://lightning.network/">Lightning
Network</a>, <a href="https://bitcoinmagazine.com/articles/mimblewimble-how-a-stripped-down-version-of-bitcoin-could-improve-privacy-fungibility-and-scalability-all-at-once-1471038001/">Mimblewimble</a>, transaction batching, etc.</p>
<p>However these solutions are either not ready or not practically usable <em>at scale</em>
today.</p>
<p>In essence this is the root cause of the scaling debate: the community is
divided between big blockers who want a pragmatic quick fix, and small blockers
who are more idealistic.</p>
<h2 id="decentralization">Decentralization</h2>
<p>Will increasing the block weight limit hamper decentralization? I do not
believe so.</p>
<p>Firstly, the <em>amortized cost</em> of running a full node is only <a href="../full-node-on-5-dollars/">$5 per
month</a>, an amount small enough that running one is within reach of enough
individuals to protect decentralization.</p>
<p>Fun comparison: the average transaction fee right now is <a href="https://bitinfocharts.com/comparison/bitcoin-transactionfees.html#1y">over $3</a>,
therefore <strong>running a full node costs less than the fees for sending 2
Bitcoin transactions a month</strong>.</p>
<p>Secondly, history shows that as blocks have grown over the last 2 years from
0.5 MB to 1.0 MB the number of full nodes has in fact increased, helping
decentralization. That is because bigger blocks = more Bitcoin adoption = more
individuals interested in and able to run full nodes. Generally speaking
we can expect the number of full nodes to increase as blocks become bigger
(up to a certain point.)</p>
<p class="no-margins"><img src="../assets/larger_blk_size_more_nodes.png" alt="Increased node count" title="Increased node count" class="pure-img" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>It is easy to get distracted by all the good news Bitcoin has been receiving
lately: all-time-high price around $6000, a lot of venture capital injected
into Bitcoin companies, etc. However this rapid success could disappear just as
quickly if Bitcoin fails to increase its capacity soon.</p>
<p>Segwit is too little too late: some users not adopting it are causing
congestion for all users. We needed solutions to scale in March 2017. We needed
segwit activated a year ago so adoption would be at 50%+ today. We needed all
this before Bitcoin failed to capture $20 billion worth of
transactions, lost 30% of market share, and lost hundreds of thousands of
transactions daily to altcoins…</p>
<p><strong>Bitcoin needs a pragmatic fix, a one-time reasonable increase of the block
weight in order to relieve congestion right away and give us some breathing
room</strong>. We can do this while keeping the costs to run a full node low
($5/month) and the network decentralized.</p>
Attacks on Merkle Tree Proofhttp://blog.zorinaq.com/attacks-on-mtp/2017-08-11T00:00:00-07:00<p>The Zcoin cryptocurrency is migrating its proof-of-work from Lyra2 to <a href="https://arxiv.org/abs/1606.03588v1">Merkle
Tree Proof</a> which is a new algorithm published in a paper last year by
the same authors who previously designed <a href="https://www.cryptolux.org/index.php/Equihash">Equihash</a> and <a href="https://www.cryptolux.org/images/0/0d/Argon2.pdf">Argon2</a>. The
Zcoin folks invited me to participate in their <a href="https://zcoin.io/mtp-open-source-miner-bounty-challenge/">miner development
challenge</a>; however what really caught my interest was their <a href="https://zcoin.io/mtp-audit-and-implementation-bounty/">audit
challenge</a> in which they offer bounties to reward the discovery of
vulnerabilities in either the MTP algorithm described by the paper, or the MTP
code implemented by Zcoin.</p>
<p>Challenge accepted. I found 4 attacks. The 4th one is the most serious one.</p>
<p><strong>Edit:</strong>
The Zcoin project gave me a <a href="https://twitter.com/zcoinofficial/status/931015823367421953">9166 USD bounty</a> (6666 USD for attacks 1 and
4, plus 2500 USD for attacks 2 and 3,) thanks!
Subsequently, the MTP authors revised their paper to defend against my attacks,
see <a href="https://arxiv.org/abs/1606.03588v2">MTP 1.2</a>.</p>
<ul id="markdown-toc">
<li><a href="#prelude" id="markdown-toc-prelude">Prelude</a></li>
<li><a href="#desc" id="markdown-toc-desc">Description of MTP-Argon2</a></li>
<li><a href="#prev" id="markdown-toc-prev">Previously known attacks</a></li>
<li><a href="#new-attacks" id="markdown-toc-new-attacks">New attacks</a> <ul>
<li><a href="#segsharing" id="markdown-toc-segsharing">Attack 1: Argon2 segment sharing</a></li>
<li><a href="#loc" id="markdown-toc-loc">Attack 2: Location in merkle tree not verified</a></li>
<li><a href="#openings" id="markdown-toc-openings">Attack 3: 1/3rd of openings not verified (Zcoin implementation flaw)</a></li>
<li><a href="#tmto" id="markdown-toc-tmto">Attack 4: Time-memory trade-off with 1/16th the memory, 2.88× the time</a></li>
</ul>
</li>
<li><a href="#minor-errors-in-mtp-paper" id="markdown-toc-minor-errors-in-mtp-paper">Minor errors in MTP paper</a></li>
<li><a href="#final-words" id="markdown-toc-final-words">Final words</a></li>
</ul>
<h2 id="prelude">Prelude</h2>
<p>Proof-of-work algorithms based on Merkle trees are a relatively new
construction. I find them quite interesting. To my knowledge, before MTP,
Fabien Coelho is the only person who designed one in his 2008 paper
<a href="https://eprint.iacr.org/2007/433.pdf">An (Almost) Constant-Effort Solution-Verification Proof-of-Work Protocol
Based on Merkle Trees</a>.</p>
<h2 id="desc">Description of MTP-Argon2</h2>
<p>First, let me give a succinct description of <em>MTP-Argon2</em> which is the concrete
instance of MTP with parameters making it suitable for use as a cryptocurrency
proof-of-work.</p>
<!--
According to the Unicode standard:
ϕ is U+03D5 ("straight" greek lowercase phi, used as a technical symbol)
φ is U+03C6 ("loopy" greek lowercase phi, meant for greek text)
This web page uses U+03D5.
Note that the Merkle paper authors, in their latex source, use the macro
\phi (not \varphi). Latex maps these macros to the following Adobe Type 1 font
character names in the Latin Modern or Computer Modern fonts:
\phi maps to /phi (a straight glyph) copying from PDF maps to U+03C6
\varphi maps to /phi1 (a loopy glyph) copying from PDF maps to U+03D5
This seems to be in line with PDF spec (appendix D.4) and Adobe Glyph List:
phi should be the straight glyph should map to U+03C6
phi1 should be the loopy glyph should map to U+03D5
However the Unicode standard says:
U+03D5 should be straight, used as a technical symbol, "maps to phi1 symbol entities"
U+03C6 should be loopy, meant for greek text
Unicode 10.0 standard page 308: starting with Unicode Standard v3.0 (1999), the
glyphs for 3C6 and 3D5 were swapped compared to earlier versions. This is the
origin of all these inconsistencies.
Ultimately, the correct desired result should be that the latex \phi and
\varphi macros respectively map to U+03D5 and U+03C6 when copied from the
generated PDF. To do that, either
(1) the Adobe Glyph List needs to swap its codepoints for phi and phi1:
phi can remain the straight glyph but should map to U+03D5
phi1 can remain the loopy glyph but should map to U+03C6
(and the comment in the Unicode standard "maps to phi1 symbol entities"
should apply to U+03C6 not U+03D5),
or (2) the Latin Modern and Computer Modern fonts need to map:
\phi to /phi1
\varphi to /phi
The solution (1) seems to be more appropriate. In fact the LCDF Typetools
package patches the Adobe Glyph List do to exactly that.
-->
<p>The <em>prover</em> takes the <em>challenge</em> (eg. current block header) and derives from
it 2 GB of Argon2d memory blocks (time_cost = 1, lanes = 4). Remember
that Argon2 computes each block X[i] from 2 <em>input</em> blocks X[i - 1] and
X[ϕ(i)]. Each 1-kB Argon2 block is seen as a leaf, and the prover computes the
merkle hash tree of the leaves, ending up with the root Φ. The prover
picks a random nonce N, computes a hash Y_0 = H(Φ ‖ N), then Y_0 modulo
the number of Argon2 blocks determines the index i of some block. The prover
computes another hash, Y_1 = H(Y_0 ‖ X[i]), then Y_1 determines the index of
another block, and so on. This is repeated L = 70 times to randomly select 70
blocks. If the final hash Y_L is under a certain target, then the
proof-of-work is solved. The proof is:</p>
<ul>
<li>merkle root Φ</li>
<li>nonce N</li>
<li>2 × L = 140 blocks: two Argon2 <em>input</em> blocks for each of the L <em>selected</em> blocks</li>
<li>3 × L = 210 merkle openings: openings of 2 × L <em>input</em> blocks and of L <em>selected</em> blocks
(note: the paper makes a crucial mistake on this point, see next section)</li>
</ul>
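<p>A minimal sketch of the prover’s nonce search described above, with SHA-256
standing in for the paper’s hash functions and no attempt to model the proof
construction or the real 2 GB block array:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

L = 70   # number of selected blocks, as in the paper

def H(*parts):
    return hashlib.sha256(b"".join(parts)).digest()

def search(merkle_root, blocks, target):
    """blocks: the array X of 1-kB Argon2 blocks (toy-sized here)."""
    T = len(blocks)
    nonce = 0
    while True:
        n = nonce.to_bytes(8, "little")
        y = H(merkle_root, n)                     # Y_0 = H(Phi || N)
        for _ in range(L):
            i = int.from_bytes(y, "little") % T   # index of the next block
            y = H(y, blocks[i])                   # Y_j = H(Y_(j-1) || X[i_j])
        if y &lt; target:                         # Y_L under target: PoW solved
            return nonce
        nonce += 1
</code></pre></div></div>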
<p>The <em>verifier</em> recomputes all the hashes Y_0 through Y_L to verify that
Y_L is under the target. For each of the L steps, the verifier also uses the
2 <em>input</em> blocks and Argon2’s compression function to recompute the <em>selected</em>
block, and he verifies their merkle openings.</p>
<h2 id="prev">Previously known attacks</h2>
<p>The first crucial vulnerability in the MTP paper is that the proof contains the
openings of only the <em>input</em> blocks. But it also needs the openings of the
<em>selected</em> blocks, or else the prover can choose a constant value for all
blocks (eg. all zeroes) and can compute the Y_i hashes by recomputing the
<em>selected</em> blocks on-the-fly from the 2 constant input blocks without having to
prove the <em>selected</em> blocks are part of the tree. I discovered this but then
read <a href="https://eprint.iacr.org/2017/497.pdf">Dinur and Nadler’s review of MTP</a> and saw they already reported
this vulnerability (“<em>4.1 Simple Attacks, second attack</em>.”)</p>
<p>Dinur and Nadler also document a 2nd and 3rd vulnerability: “<em>4.1 Simple
Attacks, first attack</em>” (previous proof-of-work solutions can be reused) and
the main attack described in section 5 (time-memory tradeoff exploiting
specially crafted inconsistent blocks.) The MTP authors suggested both attacks
can be fixed by including the <em>challenge</em> in the Argon2 compression function.</p>
<p>Zcoin fixes the 1st vulnerability by including the openings of the selected
blocks in the proof, and fixes the 2nd and 3rd one by writing—via this
<a href="https://github.com/zcoinofficial/zcoin/blob/2fb7dda5d39eb97e229f8da1a22aa45c1b54c72b/src/mtp.cpp#L38">memcopy()</a> call—the first 16 bytes of the double-SHA256 of the block
header (76 bytes, excluding nonce), at offset 128 of the XOR output in the
Argon2 compression function.</p>
<h2 id="new-attacks">New attacks</h2>
<p>Despite the above fixes suggested by the MTP authors and implemented by Zcoin,
I discovered new attacks. They affect the MTP algorithm, except attack 3 which
is just the result of a bug in the custom MTP implementation in Zcoin.</p>
<h3 id="segsharing">Attack 1: Argon2 segment sharing</h3>
<p>MTP-Argon2 suggests parameters designed to force provers to use 2048 MiB of
RAM. However a flaw allows a malicious prover to reduce his RAM usage by 18.75%
down to 1664 MiB by sharing Argon2 segments amongst lanes.</p>
<p>Argon2 is parametrized with 4 lanes of 512 MiB. Each lane is divided into 4
segments of 128 MiB. The first two X[i] blocks of a lane are normally
initialized with respectively H’(H_0 ‖ 0 ‖ i) and H’(H_0 ‖ 1 ‖ i).</p>
<p>However an oddity of MTP is that verifiers never check that these first two
blocks are initialized this way. So a prover can share the same first segment
across all lanes, allowing him to store 3 fewer segments in RAM, hence saving 3
× 128 = 384 MiB of RAM. This brings his RAM usage from 2048 MiB down to
1664 MiB, and makes MTP less memory-hard than it is supposed to.</p>
<p>(Note that even though the first segments of all lanes are identical, the second
and subsequent segments will be different from other lanes because the
reference set R of blocks will be different for different lanes.)</p>
<p>One pragmatic fix for MTP is to simply allow, even encourage, the use of this
memory-saving technique, giving the same advantage to malicious and honest
provers. Argon2 could be parametrized to use 2581120 KiB (2520.625 MiB) of RAM,
so that the actual RAM usage with the memory-saving technique would be close to
the 2048 MiB originally intended.</p>
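<p>Checking the arithmetic of this fix:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SEGMENT = 128                    # MiB per segment, 4 lanes x 4 segments
print(16 * SEGMENT)              # 2048 MiB: honest prover
print(13 * SEGMENT)              # 1664 MiB: sharing the first segment (-18.75%)
print(2581120 / 1024 * 13 / 16)  # ~2048 MiB: effective usage after reparametrization
</code></pre></div></div>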
<p>However a cleaner fix is to use Argon2 with 1 lane instead of 4. This would also
fix the more serious <a href="#tmto">attack 4</a>.</p>
<p>One might think that an alternative fix would be to require provers to include
the first 2 blocks of each lane (and their openings) in the proof, and to
require verifiers to ensure that these blocks are unique. However fundamentally
MTP allows some inconsistent blocks, so the prover could supply 8 unique
blocks, and break the consistency of Argon2 by sharing the same remaining
blocks amongst lanes.</p>
<h3 id="loc">Attack 2: Location in merkle tree not verified</h3>
<p><strong>Note:</strong> the challenge judges <a href="https://github.com/zcoinofficial/zcoin/wiki/MTP-Audit-and-Implementation-Bounty-Submissions">argue</a> that they do not see this
as an issue in the paper, but the paper was merely silent on this (critical)
detail.</p>
<p>MTP as described in its paper is flawed. Algorithm 2 (verifier’s algorithm),
step 2, verifies the opening but not the <em>location</em> of the Argon2 blocks in the
tree. Their location is not part of the proof. Their location can only be
computed at the next step, step 3, when Y_j is calculated. As a result, the
prover can pretend the blocks selected through each of the L = 70 steps are
always X[3]
(MTP does not allow selecting X[1] or X[2]) and he can effectively
run algorithm 1 (prover’s algorithm) by storing only 3 kB. I will explain later
how to further reduce memory usage down to 1 kB.</p>
<p>More specifically an attack against MTP would work as follows: in the prover’s
algorithm, step 1, the prover computes and stores X[1] through X[3] in memory,
and stops here. The prover assumes the rest of the blocks X[4] through X[T] are
all zeros. In step 5, the prover does not bother computing i_j but assumes i_j
is always 3. The rest of the algorithm is run normally. Whenever a nonce N
results in a satisfying Y_L, the prover’s proof will be the merkle root, the
nonce N, and the set Z of L = 70 elements where each element is the same:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(input block X[2],
input block X[1],
opening of X[3],
opening of X[2],
opening of X[1])
</code></pre></div></div>
<p>The verifier’s algorithm blindly accepts this proof because all the openings
are valid and the algorithm does not verify whether the location (Y_j mod T)
equals 3 or not. Since the location is not verified or even known by the
verifier’s algorithm, it seems the paper implies a <em>canonical</em> merkle tree is
computed: the <em>left</em> and <em>right</em> hashes are first ordered (since it is not
known which one is which) before hashing them to compute the parent.</p>
<p>On a side note, the Zcoin implementation of MTP gets around the problem of not
knowing the location not by computing a <em>canonical</em> merkle tree, but by storing
both the left and right hashes in the proof. But it too fails to verify the
location, so it does not prevent the attack.</p>
<p>In order to fix the verifier’s algorithm, step 2 should be removed, and step 3
should be changed to:</p>
<!-- "monospace" font on Android (/system/fonts/DroidSansMono.ttf) lacks a glyph for
U+03D5 ("straight" greek lowercase phi), so we have a special CSS class to
set the <code> font-family to "monospace, sans-serif" so that glyphs will first
be searched for in the monospace font, then fall back to sans-serif. -->
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Compute from Z for 1 ≤ j ≤ L:
X[i_j] = F(X[i_j - 1], X[ϕ(i_j)])
Verify opening of X[i_j] and its location in tree: (Y_(j-1) mod T)
Verify opening of X[i_j - 1] and its location in tree: (Y_(j-1) mod T) - 1
Verify opening of X[ϕ(i_j)] and its location in tree (*)
Y_j = G(Y_j - 1, X[i_j])
(*) The location of X[ϕ(i_j)] is specified by the first two 32-bit words of
X[i_j - 1] as documented in the Argon2 specification (J_1 and J_2 words.)
</code></pre></div></div>
<p>Since an opening consists of 21 hashes and no additional information, some of
these hashes will be <em>left</em> hashes, others <em>right</em> hashes. So verifying
the location of a block is accomplished by determining which one is left and
which one is right, and by hashing them in the proper order to compute the parent.</p>
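<p>For illustration, a minimal sketch (my own code, not Zcoin’s) of an opening
verification that also checks the leaf’s claimed location, by letting the bits
of the index decide on which side each sibling hash sits:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

def verify_opening(leaf_hash, index, siblings, root):
    """siblings: the 21 opening hashes, ordered from the leaf level up."""
    h = leaf_hash
    for sibling in siblings:
        if index &amp; 1:                   # our node is a right child
            h = hashlib.sha256(sibling + h).digest()
        else:                              # our node is a left child
            h = hashlib.sha256(h + sibling).digest()
        index >>= 1                        # move up one level
    return h == root                       # opening AND location check out
</code></pre></div></div>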
<p>The Zcoin implementation stores 21 × 3 hashes per opening—21 (tree
depths) × 3 elements (parent, left, right)—so verifying the location of the
block would consist of checking that the hash is at the expected place (either left
or right) instead of freely allowing it to be at either the left or right <a href="https://github.com/zcoinofficial/zcoin/blob/dfb9a0b57ebcc706ae0278b5698792a02d1b66f6/src/libmerkletree/merkletree.cpp#L79">as
it does</a>.</p>
<p>The memory usage of this attack against MTP can be reduced from 3 kB to
1 kB. Because the verifier’s algorithm allows any value for X[1] and
X[2], the prover can actually assume they are all zeros and not even store them
in memory.</p>
<h3 id="openings">Attack 3: 1/3rd of openings not verified (Zcoin implementation flaw)</h3>
<p>When Zcoin patched one of the Dinur and Nadler attacks, by verifying 210
instead of 140 openings, they forgot to update one <a href="https://github.com/zcoinofficial/zcoin/blob/dfb9a0b57ebcc706ae0278b5698792a02d1b66f6/src/mtp.cpp#L609">loop</a>:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">L</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">i < L * 2</code> should be replaced with <code class="language-plaintext highlighter-rouge">i < L * 3</code> or else the verifier will
only verify the opening of 2/3rds of the blocks in the merkle tree. Failing
to verify the remaining 1/3rd means a malicious prover would have to spend
the expected computing resources (2 GB of memory) only once to compute 1 potential PoW
solution. Then he would simply grind one block in the remaining 1/3rd by
spending minimal resources, completely defeating the memory-hardness of
MTP. For example he would calculate Y_j = H(Y_(j-1) ‖ X[i_j]) for L-1
blocks, then would grind the last block, potentially on many machines in
parallel, by sending the Y_(L-1) value to the machines, each grinding
unique random last blocks, and each needing to allocate only 1 kB to hold
this block.</p>
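<p>A sketch of that grinding step, with SHA-256 as a stand-in for the real hash:
each machine needs only the 32-byte Y_(L-1) value and 1 kB per candidate block.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib, os

def grind(y_prev, target):
    # y_prev: the honestly computed Y_(L-1), identical on every machine
    while True:
        candidate = os.urandom(1024)       # a random, never-verified 1-kB block
        y_last = hashlib.sha256(y_prev + candidate).digest()
        if y_last &lt; target:             # Y_L under target: "solved"
            return candidate
</code></pre></div></div>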
<h3 id="tmto">Attack 4: Time-memory trade-off with 1/16th the memory, 2.88× the time</h3>
<p>I found a time-memory trade-off attack against MTP-Argon2 that requires
approximately 1/16th the memory, increases the time complexity by only ~2.88×,
while requiring negligible precomputation work. The overall cost for a cheating
prover, measured in ASIC area-time complexity, is hence reduced by a factor ~5.5×.</p>
<p>The prover computes the first 128 MB Argon2 segment of lane 0 honestly.
This segment can be reused by lanes 1-3 because verifiers don’t verify the first
2 blocks of the first segments are initialized as per the Argon2 spec, see
<a href="#segsharing">attack 1</a>.</p>
<p>For the 2nd segment of lane 0, the prover chooses an arbitrary (aka
non-consistent) 1st block that references any of the <strong>other</strong> 3 lanes (2nd
32-bit word, J_2 ≠ current_lane). He then computes the 2nd block honestly. If
J_2 in the 2nd block references the current lane (probability = 0.25), the
prover tries again by changing some bytes in the 1st block. If J_2 references
one of the other 3 lanes (probability = 0.75), the prover progresses
further and honestly computes the 3rd block. If J_2 in the 3rd block
references the current lane, the prover goes back to the beginning, changes
some bytes in the 1st block, so that both the 2nd and 3rd blocks reference any
of the other 3 lanes. The prover continues in the same fashion for N blocks:
he brute forces the 1st block until he honestly computes N subsequent blocks
that all reference any of the other 3 lanes. The probability that an arbitrary
1st block fulfills this condition is 0.75^N. I suggest N = 49 for an efficient
attack. On average the prover will have to try 1/(0.75^N) = ~1,324,336
candidate values for the 1st block.</p>
<p>So far the 2nd segment consists of:</p>
<ul>
<li>block 1: brute forced</li>
<li>blocks 2-50: honestly computed</li>
</ul>
<p>The prover stores these 50 blocks (memory cost: 50 kB.) The prover makes
block 51 non-consistent by setting it to the same content as block 1. Then
blocks 52-100, if computed, would generate the same data as blocks 2-50. There
is therefore no need to generate or store the remainder of the brute forced
segment: it consists of blocks 1-50 repeated over and over.</p>
<p>Repeat these steps to generate the 2nd segments of lanes 1-3, then the 3rd, then
the 4th segments of all lanes (12 brute forced segments in total.)</p>
<p>The crux of the attack is this: the annoying part of Argon2, from an
attacker’s viewpoint, is that if J_2 references the current lane, the
reference set R of blocks is constantly changing from block to
block. However, by forcing J_2 to reference other lanes, the sets R of the other 3
lanes remain constant for the whole segment, so a given block value X[i] will
generate the same X[i+1] regardless of the index i.</p>
<p>In total, on average, the prover will have to try 12 × 1,324,336 = ~15.9
million candidate values for the brute forced blocks. This precomputation must
be done once for every new challenge (eg. newly mined Zcoin block), and should
take just a few seconds of GPU time as it is an embarrassingly parallel task.</p>
<p>The total memory used by the prover is 128 MB (first segments) + 12 ×
50 kB (brute forced segments) = ~128.6 MB, or approximately 1/16th
the amount used by honest provers.</p>
<p>In the 1st segment, all blocks are consistent. In the other 3 segments, 49 out
of 50 blocks are consistent. So on average 197 out of 200 blocks are
consistent, which means the proof-of-work’s random selection of L = 70 blocks
has probability (197/200)^70 = ~0.347 of being valid. Therefore this attack
only increases the time complexity by a factor ~2.88.</p>
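<p>Reproducing the numbers in this attack:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tries = 1 / 0.75**49              # expected candidates per brute forced block
print(f"{tries:,.0f}")            # ~1.32 million, so 12 segments -> ~15.9M total
memory_mb = 128 + 12 * 50 / 1024  # shared 1st segments + 12 stored 50-block prefixes
print(f"{memory_mb:.1f}")         # ~128.6 MB, about 1/16th of 2048 MB
p = (197 / 200)**70               # chance all 70 selected blocks are consistent
print(f"{p:.3f} -> {1 / p:.2f}x the time")             # 0.347 -> 2.88x
print(f"~{(2048 / memory_mb) * p:.1f}x AT advantage")  # ~5.5x
</code></pre></div></div>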
<p>Overall, this time-memory trade-off attack is quite attractive: it requires
little precomputation, reduces memory use to ~1/16th, and increases runtime by
only ~2.88×, which translates to an ASIC area-time advantage of ~5.5× for a
cheating prover over an honest prover. The most efficient N value for this
attack remains to be determined but is likely around 30-50.</p>
<p>There is no easy and cheap way to detect the attack. The only viable fix that I
see is to use Argon2 with 1 lane instead of 4. This forces the reference set R
of blocks to change from block to block, so a given block X[i] will generate a
block X[i+1] that varies depending on the exact value of the index i.</p>
<h2 id="minor-errors-in-mtp-paper">Minor errors in MTP paper</h2>
<p>The authors claim their previous proof-of-work, Equihash, is not
truly <em>progress-free</em>. It “<em>is quite promising, but the reference
implementation reported is quite slow, as it takes about 30 seconds to get a
proof that certifies the memory allocation of 500 MB.</em>” However Equihash
performance was dramatically improved in October-November 2016 thanks to the
Zcash miner challenge which spurred competition on Equihash solvers.
<a href="https://github.com/mbevand/silentarmy/">SILENTARMY</a>, written by yours truly, takes only ~10 milliseconds to get a
proof on a modern GPU. More recent solvers such as OptiminerZcash or Claymore’s
Zcash Miner take only ~2 milliseconds.</p>
<h2 id="final-words">Final words</h2>
<p>The proof size of MTP is quite large at <strong>213920 bytes</strong> (2 GB of Argon2
blocks, L = 70, 16-byte merkle hashes: 70 elements × (2 × 1024-byte input blocks
+ 3 openings × 21 hashes × 16 bytes) = 70 × 3056 bytes.) By comparison Equihash
proofs are much smaller at <strong>1344 bytes</strong> (n = 200, k = 9.) Given that, and the fact Equihash
is not as problematic as the authors once thought, is actually progress-free,
and has been more reviewed than MTP, I personally think Equihash is a superior
proof-of-work. More research into MTP and smaller proofs may change my
opinion :-)</p>