mrb's blog

A Look at the 4640-GPU Nebulae Supercomputer

Keywords: amd gpu hardware nvidia performance

The TOP500 list for June 2010 has just been published. A second GPU-based supercomputer has appeared in the top 10: Nebulae, the first one being Tianhe-1. Designed by the supercomputer manufacturer Dawning Information Industry and installed at the National Supercomputing Center in Shenzhen (NSCS), Nebulae features Intel Xeon X5650 2.66 GHz 6-core "Westmere" processors and Nvidia Tesla C2050 448-ALU 1150 MHz "Fermi" graphics processing units. TOP500 describes it as having a total of "120640 cores", a vague figure that I will explain later. These are the only pieces of information available right now. Stories and anecdotes sometimes leak to the public indicating that multiple large companies around the world are experimenting with GPU-based supercomputers that are not in the TOP500 list, but the fact that the 2 fastest public ones, Nebulae and Tianhe-1, are designed and operated by Chinese companies and universities shows that China is becoming a leader in the domain of GPU supercomputing.

[Update: Do not miss the pictures of Nebulae I posted in a follow-up write-up.]

Nebulae

The information page for Nebulae gives no information about the interesting part: how many C2050 GPUs Nebulae might have. The only technical detail available is the X5650 performance, listed as 10.64 GFLOPS (double precision), which is misleading and slightly inaccurate because it is the approximate performance of only 1 of the 6 cores: 4 double precision floating point operations per clock (one 128-bit SSE multiply plus one 128-bit SSE add, each operating on 2 doubles) * 2.666 GHz = 10.664 GFLOPS. As to the C2050, its double precision floating point performance is: 2 (a multiply-add instruction counts as 2 operations) / 2 (double precision executes at half the single precision rate) * 448 ALUs (or shaders) * 1150 MHz = 515.2 GFLOPS. The X5650 is designed for dual-socket systems, so it is logical to assume that the supercomputer is built on dual-socket nodes and has a certain number of C2050 GPUs per node. Based on all this information, it is almost certain that Nebulae is built on 4640 nodes, where each node has two X5650 processors and one C2050 GPU, for a total of 9280 processors and 4640 GPUs:

4640 nodes * (10.664 GFLOPS per core * 6 cores * 2 sockets + 515.2 GFLOPS per GPU) = 2984.30 double precision TFLOPS

This 2984.30 TFLOPS number matches exactly the Rpeak value published in the TOP500 list. Note that the more correct number is 2984.45 TFLOPS, because the unrounded X5650 performance is 10.6666... GFLOPS per core, but who cares about a discrepancy of 0.15 TFLOPS? :-) This also agrees with the rumored "4700 nodes" according to this article from The Register. [2010-05-31 update: the EETimes confirms the figure of 4640 GPUs, which also supports the rest of my numbers.] Finally, this also explains the "120640 cores" figure, which combines processors and GPUs: the NSCS defined one SIMD unit (in Nvidia's terminology: streaming multiprocessor) as one core. The C2050 GPU has 14 SIMD units, and the two X5650 processors provide 12 cores:

4640 nodes * (12 processor cores + 14 SIMD units) = 120640 cores

So there you go, all the numbers published by TOP500 are now explained, despite the lack of publicly available detailed specs. From a micro-architectural point of view, it makes sense to count a SIMD unit as a core, because each of them contains 32 ALUs (shaders) executing the same instructions, just from different thread contexts. However, from a computing point of view, a SIMD unit provides more theoretical computing power than a traditional processor core: 36.80 GFLOPS (C2050) compared to 10.664 GFLOPS (X5650).
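
For readers who want to reproduce the arithmetic, here is a minimal Python sketch of the Nebulae numbers above. Keep in mind that the node count and per-node configuration are my deduction, not an official spec:

    # Nebulae (deduced): 4640 nodes, each with 2x X5650 and 1x C2050
    NODES = 4640
    CPU_CORE_DP = 4 * 2.666        # 4 DP FLOP/clock * 2.666 GHz ~= 10.664 GFLOPS per core
    GPU_DP = 2 / 2 * 448 * 1.150   # multiply-add, half-rate DP, 448 ALUs, 1.15 GHz = 515.2 GFLOPS
    rpeak_tflops = NODES * (CPU_CORE_DP * 6 * 2 + GPU_DP) / 1000
    cores = NODES * (6 * 2 + 14)   # 12 CPU cores + 14 SIMD units per node
    dp_per_sm = GPU_DP / 14        # per-SIMD-unit double precision throughput
    print(rpeak_tflops, cores, dp_per_sm)  # ~2984.3 TFLOPS, 120640 "cores", ~36.8 GFLOPS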

Given that Nvidia was 2 or 3 quarters late in delivering Fermi, there is no doubt that the delay had a direct impact on this supercomputer. The NSCS probably had to wait months before taking delivery of their 4.6 thousand C2050 GPUs. I would wager that the fraction of Nvidia's initial C2050 production allocated to this supercomputer alone was pretty large, given that there were rumors that, up until last month, Nvidia was only able to manufacture "thousands" of Fermi GPUs due to low yields. Nvidia is also in dire need of communicating good news about Fermi (which has been bashed by the press lately), and this supercomputer is an opportunity to do so. Expect a big press release soon about how Nebulae represents a success of Fermi in the GPGPU world.

Tianhe-1

By contrast, the previously fastest GPU-based supercomputer, Tianhe-1, operated by the National SuperComputer Center in Tianjin (NSCC-TJ), is built on 3072 nodes, where each node has either two Xeon E5450 or two E5540 processors, and some nodes are equipped with an AMD HD 4870 X2 graphics card, for a total of 6144 processors and 5120 GPUs (2560 dual-GPU cards):

  • 2560 compute nodes provide an average 10.507 GFLOPS per processor core (2048 of them are based on the E5540 at 10.133 GFLOPS per core, and 512 of them on the E5450 at 12 GFLOPS per core), plus 368 GFLOPS from a downclocked HD 4870 X2:
    2560 compute nodes * (10.507 GFLOPS per core * 4 cores * 2 sockets + 368 GFLOPS per GPU) = 1157.26 TFLOPS
  • 512 operation nodes based on the E5450 provide 12 GFLOPS per core, and have no GPU:
    512 operation nodes * (12.0 GFLOPS per core * 4 cores * 2 sockets) = 49.15 TFLOPS
  • Total: 1157.26 + 49.15 = 1206.41 double precision TFLOPS (again, the TOP500 list suffers from rounding errors and lists Rpeak = 1206.19 TFLOPS; see the sketch below)
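
The breakdown above, as a minimal Python sketch using the same per-core and per-GPU figures:

    # Tianhe-1 Rpeak breakdown (same figures as the list above)
    compute_nodes = 2560 * (10.507 * 4 * 2 + 368.0)  # avg 10.507 GFLOPS/core, 8 cores, 1 GPU
    operation_nodes = 512 * (12.0 * 4 * 2)           # E5450 only, no GPU
    print(compute_nodes / 1000, operation_nodes / 1000,
          (compute_nodes + operation_nodes) / 1000)  # ~1157.26, ~49.15, ~1206.4 TFLOPS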

At the nominal clock of 750 MHz, an HD 4870 X2 provides 2400 single precision GFLOPS, because a single VLIW unit (in AMD's terminology: thread processor; it contains 5 ALUs) can execute 5 single precision multiply-add instructions per clock. But such a VLIW unit can only execute 1 double precision multiply-add per clock, so an HD 4870 X2 provides only 480 double precision GFLOPS, and downclocking it from 750 to 575 MHz brings this down to 368 GFLOPS. When I first tried to break down Tianhe-1's FLOPS numbers, I could not come up with anything that made sense, unless the GPUs were downclocked to a nice round number, 575 MHz... which I then confirmed after finding this TOP500 article.
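
Here is how those HD 4870 X2 figures break down, assuming the standard RV770 configuration of 160 VLIW units (800 ALUs) per GPU and counting a multiply-add as 2 floating point operations:

    # HD 4870 X2: 2 GPUs, each with 160 VLIW units (800 ALUs)
    def hd4870x2_gflops(clock_ghz):
        sp = 2 * 160 * 5 * 2 * clock_ghz  # 5 SP multiply-adds per VLIW unit per clock
        dp = 2 * 160 * 1 * 2 * clock_ghz  # 1 DP multiply-add per VLIW unit per clock
        return sp, dp

    print(hd4870x2_gflops(0.750))  # (2400.0, 480.0) at the nominal clock
    print(hd4870x2_gflops(0.575))  # (1840.0, 368.0) at Tianhe-1's downclocked frequency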

As for the number of cores, NSCC-TJ is, for some reason, not consistent with how Rpeak is calculated: the published core count does not include the cores of the 512 operation nodes. Also, similarly to Nebulae, one SIMD unit (in AMD's terminology: SIMD core or engine) is counted as one core. The HD 4870 X2 has 20 of them, and the two Xeon processors provide 8 cores, therefore:

2560 compute nodes * (8 processor cores + 20 SIMD units) = 71680 cores

IMHO, NSCC-TJ should be consistent and include the operation nodes' 4096 cores (512 nodes * 8 cores) in the total, which would bring it to 75776 cores.
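
And the core-count arithmetic, again counting one SIMD engine as one core:

    # Tianhe-1 "cores": as published (compute nodes only), and including operation nodes
    published = 2560 * (8 + 20)       # 8 CPU cores + 20 SIMD engines per compute node
    consistent = published + 512 * 8  # add the operation nodes' CPU cores
    print(published, consistent)      # 71680 and 75776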

Single Precision and Integer Workloads

The TOP500 list focuses on double precision floating point LINPACK benchmarks only. But how would these 2 supercomputers fare on single precision and integer workloads? AMD's architecture is very strong on these workloads, whereas Nvidia's is better at double precision. Despite Nebulae being roughly twice as fast as Tianhe-1 in theoretical and practical double precision performance, despite running a GPU generation ahead of Tianhe-1 (which uses R700 GPUs rather than the more recent R800 generation), and despite having twice as many GPU cards (4640 vs. 2560), Nebulae would still not meaningfully surpass Tianhe-1. In fact the theoretical single precision computing performance provided by the GPUs of these 2 supercomputers is, surprisingly, about the same, roughly 4750 TFLOPS (a quick verification sketch follows the list):

  • Nebulae (Nvidia C2050): 4640 nodes * 1030.40 GFLOPS = 4781.06 single precision TFLOPS (not counting CPUs)
  • Tianhe-1 (AMD HD 4870 X2): 2560 nodes * 1840.00 GFLOPS = 4710.40 single precision TFLOPS (not counting CPUs)

(HD 4870 X2's theoretical 2400 GFLOPS scaled down to 1840 GFLOPS to account for the downclocking from 750 MHz to 575 MHz).
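
The sketch mentioned above, reproducing the GPU-only totals:

    # GPU-only theoretical single precision totals
    nebulae_sp = 4640 * (448 * 2 * 1.150)          # C2050: 448 ALUs, multiply-add, 1.15 GHz
    tianhe1_sp = 2560 * (2 * 160 * 5 * 2 * 0.575)  # HD 4870 X2 downclocked to 575 MHz
    print(nebulae_sp / 1000, tianhe1_sp / 1000)    # ~4781 vs ~4710 TFLOPS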

Theoretical Nebulae with HD 5970 GPUs

Given that AMD GPUs are more powerful per Watt and per unit of price, it is interesting to look at how powerful Nebulae could have been with them... Had Nebulae been built on AMD HD 5970 GPUs (4640 single precision GFLOPS, 928 double precision GFLOPS each), with each single-GPU C2050 card replaced with a dual-GPU HD 5970, it would have been 1.6x faster in double precision and 3.8x faster in single precision, providing respectively a bewildering 4900 double precision TFLOPS and 22717 single precision TFLOPS of theoretical performance (including CPUs). It would have been technically feasible to use these AMD GPUs, as the power envelope of the HD 5970 is only slightly higher than that of the C2050 (294W vs. 247W). Note that Nebulae's "4640 nodes" coincide strangely with the 4640 single precision GFLOPS of an HD 5970. Coincidence? :-) My guess as to why NSCS chose Nvidia instead of AMD: perhaps they already had a large CUDA code base or set of CUDA applications they wanted to run on the supercomputer, or they wanted ECC GDDR RAM (AMD GPUs do not support ECC), or perhaps Nvidia, in need of proving that Fermi is a solid GPGPU choice despite its technical flaws, decided to practically give them those C2050 cards for free to effectively buy the number 2 spot on the TOP500 list... Sleazy, but possible.
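
A sketch of that hypothetical HD 5970 configuration, keeping the same 4640 nodes and dual X5650 processors; the CPU single precision rate of 8 FLOP/clock per core (twice the double precision rate) is my assumption:

    # Hypothetical Nebulae: one HD 5970 per node instead of one C2050
    NODES = 4640
    cpu_dp = 12 * 4 * 2.6666  # 12 X5650 cores per node, 4 DP FLOP/clock
    cpu_sp = 12 * 8 * 2.6666  # assumed 8 SP FLOP/clock per core
    dp_tflops = NODES * (cpu_dp + 928.0) / 1000   # HD 5970: 928 DP GFLOPS
    sp_tflops = NODES * (cpu_sp + 4640.0) / 1000  # HD 5970: 4640 SP GFLOPS
    print(dp_tflops, sp_tflops)  # ~4900 and ~22717 TFLOPS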

Comments

Ceearem wrote: My guess why they used NVIDIA instead of ATI would be a combination of more onboard memory on the Fermi cards (PCIe transfers are a really bad bottleneck for many GPU compute tasks) and the software ecosystem. ATI has a long way to go to provide a development platform as mature as what CUDA constitutes for NVIDIA GPUs. And this is obviously the reason why more software exists (and is in development) for NVIDIA GPUs than for ATI's cards. Despite that, Fermi does offer some GPGPU-specific features which might increase its usability compared to previous GPU generations. 01 Jun 2010 19:34 UTC

mrb wrote: Good point. Compute tasks on the C2050 can access 3GB, but only 2GB on the HD 5970 Eyefinity's 2x2GB (split across 2 GPUs). That said, if the tradeoff is 1 extra GB (Nvidia) versus 1.6x/3.8x higher double/single precision perf (AMD), is Nvidia decidedly the right choice for the very diverse set of GPGPU workloads that this research supercomputer is likely to execute? 01 Jun 2010 20:33 UTC

Ceearem wrote: True, but with dual GPUs there are other tradeoffs. I.e. you should really count the 5970 as two GPUs with 1GB each; that's how a GPGPU program sees it. So basically each instance of the program only has access to 1GB of memory, and each GPU has to share the PCIe bottleneck with the other. I make this distinction because often you have to partition a workload according to available memory, and communication costs can become a dominant factor. One example: 3D FFT. Splitting a 3D FFT over multiple GPUs often does not decrease walltime, since you have to communicate the full set of data after doing each dimension. So in that case, if my dataset fits into the 3GB of the Tesla card but would need to be spread over both halves of the 5970, the Tesla would probably outperform the ATI card by a wide margin. 01 Jun 2010 22:59 UTC

mrb wrote: Re: same PCIe link shared by the 2 GPUs of the HD 5970: I agree.

Note that when you say only 1GB is usable on the HD 5970, you are referring to the original HD 5970 edition. I was specifically referring to the HD 5970 Eyefinity which has 2x2GB total, so 2GB per GPU.
01 Jun 2010 23:21 UTC

RecessionCone wrote: I'm not sure it's possible to use the Tianhe system for any real computing, because it doesn't have ECC memory, which is pretty much required in big HPC. Otherwise, you don't know if you have a soft memory error during a large computation, and the entire simulation is potentially invalid.

AMD doesn't have ECC on any of their GPUs. The Tesla C2050 does.
04 Jun 2010 01:40 UTC

Carsten wrote: @mrb:
Those, on the other hand, do have a "slightly" higher TDP.

@RecessionCone
It's not desirable, but can't you do ECC in software? It costs performance, true, but at least you'd ensure your calculations aren't wasted.
04 Jun 2010 07:20 UTC

mrb wrote: @RecessionCone
Neither can you be certain that you have no errors on Fermi, because Nvidia implemented SECDED ECC, which is unable to detect many multi-bit errors. ECC is certainly desirable, but at a certain scale, with or without ECC, errors are inevitable and applications have to deal with them.
04 Jun 2010 09:38 UTC

fermion wrote: The problem is that people who ported their hydro codes to ATI cards in OpenCL and to Nvidia cards in CUDA see 3x faster execution on Nvidia, belying the theoretical calculations of equal or higher FLOPS from ATI.
Memory organization inside the GPU seems to be better.
I don't know about bandwidth to RAM... for me it's not even worth checking, since no matter what causes it, ATI is a bad idea in 2010. AMD also produces buggy drivers.

I'm thinking of building a 70GPU cluster with GTX480s.
25 Oct 2010 00:36 UTC

fermion wrote: One more thing, of course, is programming: much easier in CUDA, and all the great parallel libraries for free. I wish AMD offered anything remotely as useful for scientific applications. I think AMD is easily 2-3 years behind Nvidia in that department, and the distance is growing. Just my opinion.
10 years from now it might be the opposite. Until then...
25 Oct 2010 00:44 UTC