Attacks on Merkle Tree Proof

The Zcoin cryptocurrency is migrating its proof-of-work from Lyra2 to Merkle Tree Proof which is a new algorithm published in a paper last year by the same authors who previously designed Equihash and Argon2. The Zcoin folks invited me to participate in their miner development challenge; however what really caught my interest was their audit challenge in which they offer bounties to reward the discovery of vulnerabilities in either the MTP algorithm described by the paper, or the MTP code implemented by Zcoin.

Challenge accepted. I found 4 attacks. The 4th one is the most serious one.

Edit: The Zcoin project gave me a 9166 USD bounty (6666 USD for attacks 1 and 4, plus 2500 USD for attacks 2 and 3,) thanks! Subsequently, the MTP authors revised their paper to defend against my attacks, see MTP 1.2.

Prelude
Description of MTP-Argon2
Previously known attacks
New attacks
Minor errors in MTP paper
Final words

Prelude

Proof-of-work algorithms based on Merkle trees are a relatively new construction. I find them quite interesting. To my knowledge, before MTP, Fabien Coelho is the only person who designed one in his 2008 paper An (Almost) Constant-Effort Solution-Verification Proof-of-Work Protocol Based on Merkle Trees.

Description of MTP-Argon2

First, let me give a succinct description of MTP-Argon2 which is the concrete instance of MTP with parameters making it suitable for use as a cryptocurrency proof-of-work.

The prover takes the challenge (eg. current block header) and derives from it 2 GB of Argon2d memory blocks (time_cost = 1, lanes = 4). Remember that Argon2 computes each block X[i] from 2 input blocks X[i - 1] and X[ϕ(i)]. Each 1-kB Argon2 block is seen as a leaf, and the prover computes the merkle hash tree of the leaves, ending up with the root Φ. The prover picks a random nonce N, computes a hash Y_0 = H(Φ ‖ N), then Y_0 modulo the number of Argon2 blocks determines the index i of some block. The prover computes another hash, Y_1 = H(Y_0 ‖ X[i]), then Y_1 determines the index of another block, and so on. This is repeated L = 70 times to randomly selects 70 blocks. If the final hash Y_L is under a certain target, then the proof-of-work is solved. The proof is:

merkle root Φ
nonce N
2 × L = 140 blocks: two Argon2 input blocks for each of the L selected blocks
3 × L = 210 merkle openings: openings of 2 × L input blocks and of L selected blocks (note: the paper makes a crucial mistake on this point, see next section)

The verifier recomputes all the hashes Y_0 through Y_L to verify that Y_L is under the target. For each of the L steps, the verifier also uses the 2 input blocks and Argon2’s compression function to recompute the selected block, and he verifies their merkle openings.

Previously known attacks

The first crucial vulnerability in the MTP paper is that the proof contains the openings of only the input blocks. But it also needs the openings of the selected blocks, or else the prover can choose a constant value for all blocks (eg. all zeroes) and can compute the Y_i hashes by recomputing the selected blocks on-the-fly from the 2 constant input blocks without having to prove the selected blocks are part of the tree. I discovered this but then read Dinur and Nadler’s review of MTP and saw they already reported this vulnerability (“4.1 Simple Attacks, second attack.”)

Dinur and Nadler also document a 2nd and 3rd vulnerability: “4.1 Simple Attacks, first attack” (previous proof-of-work solutions can be reused) and the main attack described in section 5 (time-memory tradeoff exploiting specially crafted inconsistent blocks.) The MTP authors suggested both attacks can be fixed by including the challenge in the Argon2 compression function.

Zcoin fixes the 1st vulnerability by including the openings of the selected blocks in the proof, and fixes the 2nd and 3rd one by writing—via this memcopy() call—the first 16 bytes of the double-SHA256 of the block header (76 bytes, excluding nonce), at offset 128 of the XOR output in the Argon2 compression function.

New attacks

Despite the above fixes suggested by the MTP authors and implemented by Zcoin, I discovered new attacks. They affect the MTP algorithm, except attack 3 which is just the result of a bug in the custom MTP implementation in Zcoin.

Attack 1: Argon2 segment sharing

MTP-Argon2 suggests parameters designed to force provers to use 2048 MiB of RAM. However a flaw allows a malicious prover to reduce his RAM usage by 18.75% down to 1664 MiB by sharing Argon2 segments amongst lanes.

Argon2 is parametrized with 4 lanes of 512 MiB. Each lane is divided in 4 segments of 128 MiB. The first two X[i] blocks of a lane are normally initialized with respectively H’(H_0 ‖ 0 ‖ i) and H’(H_0 ‖ 1 ‖ i).

However an oddity of MTP is that verifiers never check that these first two blocks are initialized this way. So a prover can share the same first segment across all lanes, allowing him to store 3 fewer segments in RAM, hence saving 3 × 128 = 384 MiB of RAM. This brings his RAM usage from 2048 MiB down to 1664 MiB, and makes MTP less memory-hard than it is supposed to.

(Note that even though the first segment of each lane are identical, the second and subsequent segments will be different from other lanes because the reference set R of blocks will be different for different lanes.)

One pragmatic fix for MTP is to simply allow, even encourage, the use of this memory-saving technique, giving the same advantage to malicious and honest provers. Argon2 could be parametrized to use 2581120 KiB (2520.625 MiB) of RAM, so that the actual RAM usage with the memory-saving technique would be close to the 2048 MiB originally intended.

However a cleaner fix is to use Argon2 with 1 lane instead of 4. This would also fix the more serious attack 4.

One might think that an alternative fix would be to require provers to include the first 2 blocks of each lane (and their openings) in the proof, and to require verifiers to ensure that these blocks are unique. However fundamentally MTP allows some inconsistent blocks, so the prover could supply 8 unique blocks, and break the consistency of Argon2 by sharing the same remaining blocks amongst lanes.

Attack 2: Location in merkle tree not verified

Note: the challenge judges argue that they do not see this as an issue in the paper, but the paper was merely silent on this (critical) detail.

MTP as described in its paper is flawed. Algorithm 2 (verifier’s algorithm), step 2, verifies the opening but not the location of the Argon2 blocks in the tree. Their location is not part of the proof. Their location can only be computed at the next step, step 3, when Y_j is calculated. As a result, the prover can pretend the blocks selected through each of the L = 70 steps are always X[3] (MTP does not allow to select X[1] or X[2]) and he can effectively run algorithm 1 (prover’s algorithm) by storing only 3 kB. I will explain later how to further reduce memory usage down to 1 kB.

More specifically an attack against MTP would work as follow: in the prover’s algorithm, step 1, the prover computes and stores X[1] through X[3] in memory, and stops here. The prover assumes the rest of the blocks X[4] through X[T] are all zeros. In step 5, the prover does not bother computing i_j but assumes i_j is always 3. The rest of the algorithm is run normally. Whenever a nonce N results in a satisfying Y_L, the prover’s proof will be the merkle root, the nonce N, and the set Z of L = 70 elements where each element is the same:

(input block X[2],
input block X[1],
opening of X[3],
opening of X[2],
opening of X[1])

The verifier’s algorithm blindly accepts this proof because all the openings are valid and the algorithm does not verify whether the location (Y_j mod T) equals 3 or not. Since the location is not verified or even known by the verifier’s algorithm, it seems the paper implies a canonical merkle tree is computed: the left and right hashes are first ordered (since it is not known which one is which) before hashing them to compute the parent.

On a side note, the Zcoin implementation of MTP gets around the problem of not knowing the location not by computing a canonical merkle tree, but by storing both the left and right hashes in the proof. But it too fails to verify the location, so it does not prevent the attack.

In order to fix the verifier’s algorithm, step 2 should be removed, and step 3 should be changed to:

Compute from Z for 1 ≤ j ≤ L:
  X[i_j] = F(X[i_j - 1], X[ϕ(i_j)])
  Verify opening of X[i_j] and its location in tree: (Y_(j-1) mod T)
  Verify opening of X[i_j - 1] and its location in tree: (Y_(j-1) mod T) - 1
  Verify opening of X[ϕ(i_j)] and its location in tree (*)
  Y_j = G(Y_j - 1, X[i_j])

(*) The location of X[ϕ(i_j)] is specified by the first two 32-bit words of
X[i_j - 1] as documented in the Argon2 specification (J_1 and J_2 words.)

Since an opening consists of 21 hashes and no additional information, some of these hashes will be left hashes, others right hashes. So verifying the the location of a block is accomplished by determining which one is left and which one is right, and by hashing them in the proper order to compute the parent.

In the Zcoin implementation, it stores 21 × 3 hashes per opening—21 (tree depths) × 3 elements (parent, left, right)—so verifying the location of the block would consist of verifying the hash is at the expected place (either left or right) instead of freely allowing it to be at either the left or right as it does.

The memory usage of this attack against MTP can be reduced from 3 kB to 1 kB. Because the verifier’s algorithm allows any value for X[1] and X[2], the prover can actually assume they are all zeros and not even store them in memory.

Attack 3: 1/3rd of openings not verified (Zcoin implementation flaw)

When Zcoin patched one of the Dinur and Nadler attacks, by verifying 210 instead of 140 openings, they forgot to update one loop:

for (i = 0; i < L * 2; ++i) {

i < L * 2 should be replaced with i < L * 3 or else the verifier will only verify the opening of 2/3rds of the blocks in the merkle tree. Failing to verify the remaining 1/3rd means a malicious prover would have to spend the expected computing resources (2 GB) only once to compute 1 potential PoW solution. Then he would simply grind one block in the remaining 1/3rd by spending minimal resources, completely defeating the memory-hardness of MTP. For example he would calculate Y_j = H(Y_(j-1) ‖ X[i_j]) for L-1 blocks, then would grind the last block, potentially on many machines in parallel, by sending the Y_(L-1) value to the machines, each grinding unique random last blocks, and each needing to allocate only 1 kB to hold this block.

Attack 4: Time-memory trade-off with 1/16th the memory, 2.88× the time

I found a time-memory trade-off attack against MTP-Argon2 that requires approximately 1/16th the memory, increases the time complexity by only ~2.88×, while requiring negligible precomputation work. The overall cost for a cheating prover, measured in ASIC area-time complexity, is hence reduced by a factor ~5.5×.

The prover computes the first 128 MB Argon2 segment of lane 0 honestly. This segment can be reused by lanes 1-3 because verifiers don’t verify the first 2 blocks of the first segments are initialized as per the Argon2 spec, see attack 1.

For the 2nd segment of lane 0, the prover chooses an arbitrary (aka non-consistent) 1st block that references any of the other 3 lanes (2nd 32-bit word, J_2 ≠ current_lane). He then computes the 2nd block honestly. If J_2 in the 2nd block references the current lane (probability = 0.25), the prover tries again by changing some bytes in the 1st block. If J_2 references one of the other 3 lanes (probability = 0.75), the prover progresses further and honestly computes the 3rd block. If J_2 in the 3rd block references the current lane, the provers goes back to the beginning, changes some bytes in the 1st block, so that both the 2nd and 3rd blocks reference any of the other 3 lanes. The prover continues in the same fashion for N blocks: he brute forces the 1st block until he honestly computes N subsequent blocks that all reference any of the other 3 lanes. The probability that an arbitrary 1st block fulfills this condition is 0.75^N. I suggest N = 49 for an efficient attack. On average the prover will have to try 1/(0.75^N) = ~1,324,336 candidate values for the 1st block.

So far the 2nd segment consists of:

block 1: brute forced
blocks 2-50: honestly computed

The prover stores these 50 blocks (memory cost: 50 kB.) The prover makes block 51 non-consistent by setting it to the same content as block 1. Then blocks 52-100, if computed, would generate the same data as blocks 2-50. There is therefore no need to generate or store the remainder of the brute forced segment: it consists of blocks 1-50 repeated over and over.

Repeat these steps to generate the 2nd segments of lanes 1-3, then the 3rd, then the 4th segments of all lanes (12 brute forced segments in total.)

The crux of the attack that is important to understand is that the annoying part of Argon2, from an attacker’s viewpoint, is that if J_2 references the current lane, the reference set R of blocks is constantly changing from block to block. However by forcing J_2 to reference other lanes, the sets R of the other 3 lanes remain constant for the whole segment, so a given block value X[i] will generate the same X[i+1] regardless of the index i.

In total, on average, the prover will have to try 12 × 1,324,336 = ~15.9 million candidate values for the brute forced blocks. This precomputation must be done once for every new challenge (eg. newly mined Zcoin block), and should take just a few seconds of GPU time as it is an embarrassingly parallel task.

The total memory used by the prover is 128 MB (first segments) + 12 * 50 kB (brute forced segments) = ~128.6 MB, or approximately 1/16th the amount used by honest provers.

In the 1st segment, all blocks are consistent. In the other 3 segments, 49 out of 50 blocks are consistent. So on average 197 out of 200 blocks are consistent, which means the proof-of-work’s random selection of L = 70 blocks has probability (197/200)^70 = ~0.347 of being valid. Therefore this attack only increases the time complexity by a factor ~2.88.

Overall, this time-memory trade-off attack is quite attractive: it requires little precomputation, reduces memory use to ~1/16th, and increases runtime by only ~2.88×, which translates to an ASIC area-time advantage of ~5.5× for a cheating prover over an honest prover. The most efficient N value for this attack remains to be determined but is likely around 30-50.

There is no easy and cheap way to detect the attack. The only viable fix that I see is to use Argon2 with 1 lane instead of 4. This forces the reference set R of blocks to change from block to block, so a given block X[i] will generate a block X[i+1] that varies depending on the exact value of the index i.

Minor errors in MTP paper

The authors say about their previous proof-of-work Equihash that it is not truly progress-free. It “is quite promising, but the reference implementation reported is quite slow, as it takes about 30 seconds to get a proof that certifies the memory allocation of 500 MB.” However Equihash performance was dramatically improved in October-November 2016 thanks to the Zcash miner challenge which spurred competition on Equihash solvers. SILENTARMY, written by yours truly, takes only ~10 milliseconds to get a proof on a modern GPU. More recent solvers such as OptiminerZcash or Claymore’s Zcash Miner take only ~2 milliseconds.

Final words

The proof size of MTP is quite large at 213920 bytes (2 GB of Argon2 blocks, L = 70, 16-byte merkle hashes.) By comparison Equihash proofs are much smaller at 1344 bytes (n = 200, k = 9.) Given that, and the fact Equihash is not as problematic as the authors once thought, is actually progress-free, and has been more reviewed than MTP, I personally think Equihash is a superior proof-of-work. More research into MTP and smaller proofs may change my opinion :-)