MD5 Chosen-Prefix Collisions on ATI GPUs: "bday" Finally Released

More than a year ago, in July 2009, I presented a talk entitled MD5 Chosen-Prefix Collisions on GPUs at Black Hat USA. I made the slides and whitepaper available on my homepage shortly after the conference, but the source code was not available at the time. I wanted to fine-tuned 1 or 2 final details and write slightly better documentation before releasing the code. Well, believe me or not but it took me more than a year to do this. I am glad to finally release "bday" (source code): an implementation of the improved birthday search described in On Collisions for MD5 (Marc Stevens' Master's Thesis), using a hand-optimized MD5 compression function written in AMD CAL IL (Compute Abstract Layer Intermediate Language) targeting ATI Radeon HD 4000 series and higher GPUs.

This code provides a full implementation of the birthday search, which is the most computing intensive step in creating a chosen-prefix MD5 collision. I was originally hoping to integrate my code with hashclash (which implements other steps necessary to create a chosen-prefix collision) to make it more user-friendly, but I doubt I will take the time to do so.

Performance

Most of the run time is spent executing the MD5 compression function, with some overhead introduced by the Pollard-Rho algorithm. On an HD 4850 with 800 ALUs at 625 MHz, this code executes 725 million iterations of the full (64 steps) MD5 compression function per second. MD5 steps in rounds 1, 2 and 4 require 9 IL instructions each. Steps in round 3 require 8 instructions. So approximately 81.2% of the ALU resources are dedicated to executing the MD5 steps since 1 IL instruction most often maps to 1 native hardware instruction:

725e6*(9+9+8+9)*16/(800*625e6) = 81.2%

Features

automatically scales to multiple GPUs
launches 192 threads per SIMD, assumes 10 SIMDs, total of 1920 threads per GPU
ratio of 12 threads per VLIW
each thread runs 4 MD5 compression functions in parallel by using the x, y, z, and w IL components
each thread uses 76 GPRs, so 89% of the 16384 available hardware GPRs per SIMD are used (76*192/16384 = .89)
2 shifting strategies implemented for the rotate operation in each step (see gen-kernel.pl:execute_shift_strategy):
- "A" using ishl+ushr+ior
- "B" using ushr+umad -> default, faster
constant buffer for sine values to use fewer literal registers

More Info

See the USAGE file in the source distribution for how to run an MD5 birthday search. As a side note, "bday" is the first open source optimized MD5 implementation for ATI GPUs.