mrb's blog

MD5 Chosen-Prefix Collisions on ATI GPUs: "bday" Finally Released

Keywords: amd bruteforcing cryptography gpu md5

More than a year ago, in July 2009, I presented a talk entitled MD5 Chosen-Prefix Collisions on GPUs at Black Hat USA. I made the slides and whitepaper available on my homepage shortly after the conference, but the source code was not available at the time. I wanted to fine-tuned 1 or 2 final details and write slightly better documentation before releasing the code. Well, believe me or not but it took me more than a year to do this. I am glad to finally release "bday" (source code): an implementation of the improved birthday search described in On Collisions for MD5 (Marc Stevens' Master's Thesis), using a hand-optimized MD5 compression function written in AMD CAL IL (Compute Abstract Layer Intermediate Language) targeting ATI Radeon HD 4000 series and higher GPUs.

This code provides a full implementation of the birthday search, which is the most computing intensive step in creating a chosen-prefix MD5 collision. I was originally hoping to integrate my code with hashclash (which implements other steps necessary to create a chosen-prefix collision) to make it more user-friendly, but I doubt I will take the time to do so.

Performance

Most of the run time is spent executing the MD5 compression function, with some overhead introduced by the Pollard-Rho algorithm. On an HD 4850 with 800 ALUs at 625 MHz, this code executes 725 million iterations of the full (64 steps) MD5 compression function per second. MD5 steps in rounds 1, 2 and 4 require 9 IL instructions each. Steps in round 3 require 8 instructions. So approximately 81.2% of the ALU resources are dedicated to executing the MD5 steps since 1 IL instruction most often maps to 1 native hardware instruction:

725e6*(9+9+8+9)*16/(800*625e6) = 81.2%

Features

  • automatically scales to multiple GPUs
  • launches 192 threads per SIMD, assumes 10 SIMDs, total of 1920 threads per GPU
  • ratio of 12 threads per VLIW
  • each thread runs 4 MD5 compression functions in parallel by using the x, y, z, and w IL components
  • each thread uses 76 GPRs, so 89% of the 16384 available hardware GPRs per SIMD are used (76*192/16384 = .89)
  • 2 shifting strategies implemented for the rotate operation in each step (see gen-kernel.pl:execute_shift_strategy):
    • "A" using ishl+ushr+ior
    • "B" using ushr+umad -> default, faster
  • constant buffer for sine values to use fewer literal registers

More Info

See the USAGE file in the source distribution for how to run an MD5 birthday search. As a side note, "bday" is the first open source optimized MD5 implementation for ATI GPUs.

Comments

Marc Stevens wrote: If you are able to provide a patch versus the current SVN source of HashClash, I can include it and give the ATI code credits to you.
As a side note, HashClash includes GPU support for NVIDIA graphic cards (actually already since Nov 2008).
Moreover, you can speed up your code even more by about 20%, since you can precompute the first 13 (out of 64) steps of MD5.
27 Nov 2010 15:34 UTC

mrb wrote: Hey Marc, glad to see your comment here! I am aware of this possible optimization (I described it in the README). I believe you can precompute the first 14, not 13, steps, as there are only 2 unknown words (x and y) at the end of the message. 27 Nov 2010 22:39 UTC

Yuhong Bao wrote: VBA digital signatures seems to use MD5:
http://msdn.microsoft.com/en-us/library/dd908782(v=office.12).aspx
29 Aug 2013 06:27 UTC