Debunking Intel's Attempt to Debunk the GPU Performance Myth

In a technical paper titled "Debunking the 100X GPU vs. CPU myth", published on June 23, 2010, at the International Symposium on Computer Architecture (ISCA) in Saint-Malo, France, Intel (who lacks a high-performing GPU because their Larrabee GPU project has been delayed multiple times) attempts to debunk the claim that "GPUs deliver substantial speedups (between 10x and 1000x) over multi-core CPUs".

Intel's research paper is seriously flawed.

Firstly, Intel compares the Nvidia GTX280 (240 ALUs at 1296 MHz), an obsolete 2-year-old GPU released on June 17, 2008, against a recent Intel Core i7-960 (4 cores at 3.2 GHz). Not only is this Nvidia GPU two generations behind, but the authors also plainly ignore AMD's GPUs, which are superior to Nvidia chips in the GPGPU arena. Nonetheless, Intel admits a 2.5x average performance advantage of this GPU over their CPU.

Secondly, the study completely ignores hardware costs. In the HPC market, no one cares about absolute performance. The only thing that matters is performance per dollar (and these days people are starting to care about performance per Watt too).

Thirdly, the authors misrepresent the GTX280's theoretical single-precision FLOPS performance. The theoretical Core i7-960 performance is:

2 (mul+add operation) * 4 (operations per 128-bit XMM register) * 4 (cores) * 3.2 (GHz) = 102.4 SP GFLOPS

Similarly, the theoretical performance of the GTX280 is:

2 (mul+add operation) * 240 (ALUs) * 1.296 (GHz) = 622.08 SP GFLOPS

With fused multiply-add and multiply operations, the GTX280 performance is:

3 (mul-add + mul operation) * 240 (ALUs) * 1.296 (GHz) = 933.12 SP GFLOPS

However, the authors compare the CPU's 102.4 GFLOPS to a "311.1 GFLOPS" number for the GPU, which is what you get by counting only one operation per ALU per cycle (240 * 1.296 GHz). In other words, they compare mul+add on the CPU to mul (or add) on the GPU, which is an apples vs. oranges comparison. (Note that I am not talking about the highly architecture-specific fused multiply-add and multiply operation mode of the GTX280, which allows it to reach an even higher theoretical 933.12 GFLOPS.)
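The peak numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic already shown; the helper name `peak_sp_gflops` and its parameters are made up for illustration and do not come from Intel's paper.

```python
def peak_sp_gflops(flops_per_lane_per_cycle, lanes, clock_ghz):
    """Theoretical single-precision GFLOPS: operations per lane per cycle * lanes * GHz."""
    return flops_per_lane_per_cycle * lanes * clock_ghz

# Core i7-960: mul+add (2 ops) on 4 SP lanes per XMM register, 4 cores, 3.2 GHz
core_i7_960 = peak_sp_gflops(2, 4 * 4, 3.2)

# GTX280: 240 ALUs at 1.296 GHz, counted three different ways
gtx280_single_op = peak_sp_gflops(1, 240, 1.296)  # mul (or add) only: the paper's "311.1"
gtx280_mul_add = peak_sp_gflops(2, 240, 1.296)    # mul+add, same counting as the CPU
gtx280_mad_mul = peak_sp_gflops(3, 240, 1.296)    # mul-add + mul mode

print(f"Core i7-960:          {core_i7_960:.2f} SP GFLOPS")
print(f"GTX280 (single op):   {gtx280_single_op:.2f} SP GFLOPS")
print(f"GTX280 (mul+add):     {gtx280_mul_add:.2f} SP GFLOPS")
print(f"GTX280 (mul-add+mul): {gtx280_mad_mul:.2f} SP GFLOPS")
```

Counting the same mul+add pair on both sides gives 102.4 against 622.08 GFLOPS, not 102.4 against 311.04.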

Let's fix these errors in Intel's research paper, shall we? The AMD HD 5870 GPU (1600 ALUs at 850 MHz) provides 4.37x more theoretical SP performance than the GTX280:

2 (mul+add operation) * 1600 (ALUs) * 0.85 (GHz) = 2720 SP GFLOPS

Also, an HD 5870 costs about 390 USD, whereas the Core i7-960 costs 570 USD.

Assuming the "2.5x performance advantage of the GTX280" claim can be trusted, and assuming real performance scales with theoretical peak, the HD 5870, being 4.37x faster than a GTX280, should have a 10.93x advantage over the CPU. Also, the HD 5870 costs 68.4% of the price of the CPU, therefore its perf/$ ratio is 16.0x higher than that of the Core i7-960.
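For readers who want to check the extrapolation, here is the same chain of ratios as a small Python sketch. It assumes, as the paragraph above does, that measured performance scales linearly with theoretical peak GFLOPS, and it uses the prices quoted in this post; the variable names are mine.

```python
# Theoretical mul+add peaks, in SP GFLOPS
gtx280_gflops = 2 * 240 * 1.296   # 622.08
hd5870_gflops = 2 * 1600 * 0.85   # 2720

# Intel's own measured average advantage of the GTX280 over the Core i7-960
gtx280_vs_cpu = 2.5

# Prices in USD, as quoted above
hd5870_price = 390
core_i7_960_price = 570

# Assumption: real-world speedup scales linearly with theoretical peak
hd5870_vs_gtx280 = hd5870_gflops / gtx280_gflops
hd5870_vs_cpu = gtx280_vs_cpu * hd5870_vs_gtx280
price_ratio = hd5870_price / core_i7_960_price
perf_per_dollar_ratio = hd5870_vs_cpu / price_ratio

print(f"HD 5870 vs GTX280:       {hd5870_vs_gtx280:.2f}x")
print(f"HD 5870 vs Core i7-960:  {hd5870_vs_cpu:.2f}x")
print(f"Price ratio (GPU / CPU): {price_ratio:.1%}")
print(f"Perf/$ advantage:        {perf_per_dollar_ratio:.1f}x")
```

Changing the prices or the 2.5x figure shows how sensitive the final ratio is to each input.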

In conclusion, it can be deduced from this Intel paper that a modern GPU such as the HD 5870 can provide a performance/dollar ratio about 16 times higher than the Core i7-960, on average, on the workloads selected by Intel itself in this paper. And this conclusion is reached despite the authors most likely having favored workloads that could be better optimized on their CPU (for example with the help of the large L2 cache), and despite the authors having most likely spent more effort optimizing the CPU implementations than the GPU implementations.

The moral of the story is: always adopt a questioning attitude, even when reading what appear to be quality academic research papers. I admire the authors' skill in closing the performance gap to only 2.5x when comparing against a GTX280, but their conclusion is flawed.

[Update: Nvidia replied to Intel in this blog post and raised the same concerns as I just did: comparison made against an outdated GTX280, other types of workloads exhibiting an even bigger gap between GPU and CPU, etc.]