mrb's blog

A Look at the 4640-GPU Nebulae Supercomputer

Keywords: amd gpu hardware nvidia performance

The TOP500 list for June 2010 has just been published. A second GPU-based supercomputer has appeared in the top 10: Nebulae, the first one being Tianhe-1. Designed by the supercomputer manufacturer Dawning Information Industry and installed at the National Supercomputing Center in Shenzhen (NSCS), Nebulae features Intel Xeon X5650 2.66 GHz 6-core "Westmere" processors and Nvidia Tesla C2050 448-ALU 1150 MHz "Fermi" graphics processing units. TOP500 describes it as having a total of "120640 cores", a vague figure that I will explain later. These are the only pieces of information available right now. Stories and anecdotes sometimes leak to the public indicating that multiple large companies around the world are experimenting with GPU-based supercomputers that are not in the TOP500 list, but the fact that the 2 fastest public ones, Nebulae and Tianhe-1, are designed and operated by Chinese companies and universities shows that China is becoming a leader in the domain of GPU supercomputing.

[Update: Do not miss the pictures of Nebulae I posted in a follow-up write-up.]

Nebulae

The information page for Nebulae gives no information about the interesting part: how many C2050 GPUs Nebulae might have. The only technical detail available is the X5650 performance, listed as 10.64 GFLOPS (double precision), which is misleading and slightly inaccurate because it is the approximate performance of only 1 of the 6 cores: 4 double precision floating point operations per clock (one 128-bit SSE multiply plus one 128-bit SSE add, each operating on 2 doubles) * 2.666 GHz = 10.664 GFLOPS. As to the C2050, its double precision floating point performance is: 2 (a multiply-add instruction counts as 2 operations) / 2 (double precision executes at half the single precision rate) * 448 ALUs (or shaders) * 1150 MHz = 515.2 GFLOPS. The X5650 is designed for dual-socket systems, so it is logical to assume that the supercomputer is built on dual-socket nodes and has a certain number of C2050 GPUs per node. Based on all this information, it is almost certain that Nebulae is built on 4640 nodes, where each node has two X5650 processors and one C2050 GPU, for a total of 9280 processors and 4640 GPUs:

4640 nodes * (10.664 GFLOPS per core * 6 cores * 2 sockets + 515.2 GFLOPS per GPU) = 2984.30 double precision TFLOPS

This 2984.30 TFLOPS number matches exactly the Rpeak value published in the TOP500 list. Note that the more correct number is 2984.45 TFLOPS, because the unrounded X5650 performance is 10.6666... GFLOPS per core, but who cares about a discrepancy of 0.15 TFLOPS? :-) This also agrees with the rumored "4700 nodes" according to this article from The Register. [2010-05-31 update: the EETimes confirms the figure of 4640 GPUs, which also supports the rest of my numbers.] Finally, this also explains the "120640 cores" figure, which combines processors and GPUs: the NSCS defined one SIMD unit (in Nvidia's terminology: streaming multiprocessor) as one core. The C2050 GPU has 14 SIMD units, and the two X5650 processors provide 12 cores:

4640 nodes * (12 processor cores + 14 SIMD units) = 120640 cores

So there you go, all the numbers published by TOP500 are now explained, despite the lack of publicly available detailed specs. From a micro-architectural point of view, it makes sense to count a SIMD unit as a core, because each of them contains 32 ALUs (shaders) executing the same instructions, just from different thread contexts. However, from a computing point of view, a SIMD unit provides more theoretical computing power than a traditional processor core: 36.80 GFLOPS (C2050) compared to 10.664 GFLOPS (X5650).
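
For readers who want to reproduce the arithmetic, here is a minimal Python sketch of the Nebulae numbers above. Keep in mind that the node count and per-node configuration are my deduction, not an official spec:

    # Nebulae (deduced): 4640 nodes, each with 2x X5650 and 1x C2050
    NODES = 4640
    CPU_CORE_DP = 4 * 2.666        # 4 DP FLOP/clock * 2.666 GHz ~= 10.664 GFLOPS per core
    GPU_DP = 2 / 2 * 448 * 1.150   # multiply-add, half-rate DP, 448 ALUs, 1.15 GHz = 515.2 GFLOPS
    rpeak_tflops = NODES * (CPU_CORE_DP * 6 * 2 + GPU_DP) / 1000
    cores = NODES * (6 * 2 + 14)   # 12 CPU cores + 14 SIMD units per node
    dp_per_sm = GPU_DP / 14        # per-SIMD-unit double precision throughput
    print(rpeak_tflops, cores, dp_per_sm)  # ~2984.3 TFLOPS, 120640 "cores", ~36.8 GFLOPS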

Given that Nvidia was 2 or 3 quarters late in delivering Fermi, there is no doubt that the delay had a direct impact on this supercomputer. The NSCS probably had to wait months before taking delivery of their 4.6 thousand C2050 GPUs. I would wager that the fraction of Nvidia's initial C2050 production allocated to this supercomputer alone was pretty large, given that there were rumors that, up until last month, Nvidia was only able to manufacture "thousands" of Fermi GPUs due to low yields. Nvidia is also in dire need of communicating good news about Fermi (which has been bashed by the press lately), and this supercomputer is an opportunity to do so. Expect a big press release soon about how Nebulae represents a success of Fermi in the GPGPU world.

Tianhe-1

By contrast, the previously fastest GPU-based supercomputer, Tianhe-1, operated by the National SuperComputer Center in Tianjin (NSCC-TJ), is built on 3072 nodes, where each node has either two Xeon E5450 or two E5540 processors, and some nodes are equipped with an AMD HD 4870 X2 graphics card, for a total of 6144 processors and 5120 GPUs (2560 dual-GPU cards):

  • 2560 compute nodes provide an average 10.507 GFLOPS per processor core (2048 of them are based on the E5540 at 10.133 GFLOPS per core, and 512 of them on the E5450 at 12 GFLOPS per core), plus 368 GFLOPS from a downclocked HD 4870 X2:
    2560 compute nodes * (10.507 GFLOPS per core * 4 cores * 2 sockets + 368 GFLOPS per GPU) = 1157.26 TFLOPS
  • 512 operation nodes based on the E5450 provide 12 GFLOPS per core, and have no GPU:
    512 operation nodes * (12.0 GFLOPS per core * 4 cores * 2 sockets) = 49.15 TFLOPS
  • Total: 1157.26 + 49.15 = 1206.41 double precision TFLOPS (again, the TOP500 list suffers from rounding errors and lists Rpeak = 1206.19 TFLOPS; see the sketch below)
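
The breakdown above, as a minimal Python sketch using the same per-core and per-GPU figures:

    # Tianhe-1 Rpeak breakdown (same figures as the list above)
    compute_nodes = 2560 * (10.507 * 4 * 2 + 368.0)  # avg 10.507 GFLOPS/core, 8 cores, 1 GPU
    operation_nodes = 512 * (12.0 * 4 * 2)           # E5450 only, no GPU
    print(compute_nodes / 1000, operation_nodes / 1000,
          (compute_nodes + operation_nodes) / 1000)  # ~1157.26, ~49.15, ~1206.4 TFLOPS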

At the nominal clock of 750 MHz, an HD 4870 X2 provides 2400 single precision GFLOPS, because a single VLIW unit (in AMD's terminology: thread processor; it contains 5 ALUs) can execute 5 single precision multiply-add instructions per clock. But such a VLIW unit can only execute 1 double precision multiply-add per clock, so an HD 4870 X2 provides only 480 double precision GFLOPS, and downclocking it from 750 to 575 MHz brings this down to 368 GFLOPS. When I first tried to break down Tianhe-1's FLOPS numbers, I could not come up with anything that made sense, unless the GPUs were downclocked to a nice round number, 575 MHz... which I then confirmed after finding this TOP500 article.
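
Here is how those HD 4870 X2 figures break down, assuming the standard RV770 configuration of 160 VLIW units (800 ALUs) per GPU and counting a multiply-add as 2 floating point operations:

    # HD 4870 X2: 2 GPUs, each with 160 VLIW units (800 ALUs)
    def hd4870x2_gflops(clock_ghz):
        sp = 2 * 160 * 5 * 2 * clock_ghz  # 5 SP multiply-adds per VLIW unit per clock
        dp = 2 * 160 * 1 * 2 * clock_ghz  # 1 DP multiply-add per VLIW unit per clock
        return sp, dp

    print(hd4870x2_gflops(0.750))  # (2400.0, 480.0) at the nominal clock
    print(hd4870x2_gflops(0.575))  # (1840.0, 368.0) at Tianhe-1's downclocked frequency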

As for the number of cores, NSCC-TJ is, for some reason, not consistent with how Rpeak is calculated: the published core count does not include the cores of the 512 operation nodes. Also, similarly to Nebulae, one SIMD unit (in AMD's terminology: SIMD core or engine) is counted as one core. The HD 4870 X2 has 20 of them, and the two Xeon processors provide 8 cores, therefore:

2560 compute nodes * (8 processor cores + 20 SIMD units) = 71680 cores

IMHO, NSCC-TJ should be consistent and include the operation nodes' 4096 cores (512 nodes * 8 cores) in the total, which would bring it to 75776 cores.
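
And the core-count arithmetic, again counting one SIMD engine as one core:

    # Tianhe-1 "cores": as published (compute nodes only), and including operation nodes
    published = 2560 * (8 + 20)       # 8 CPU cores + 20 SIMD engines per compute node
    consistent = published + 512 * 8  # add the operation nodes' CPU cores
    print(published, consistent)      # 71680 and 75776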

Single Precision and Integer Workloads

The TOP500 list focuses on double precision floating point LINPACK benchmarks only. But how would these 2 supercomputers fare on single precision and integer workloads? AMD's architecture is very strong on these workloads, whereas Nvidia's is better at double precision. Despite Nebulae being roughly twice as fast as Tianhe-1 in theoretical and practical double precision performance, despite running a GPU generation ahead of Tianhe-1 (which uses R700 GPUs rather than the more recent R800 generation), and despite having twice as many GPU cards (4640 vs. 2560), Nebulae would still not meaningfully surpass Tianhe-1. In fact the theoretical single precision computing performance provided by the GPUs of these 2 supercomputers is, surprisingly, about the same, roughly 4750 TFLOPS (a quick verification sketch follows the list):

  • Nebulae (Nvidia C2050): 4640 nodes * 1030.40 GFLOPS = 4781.06 single precision TFLOPS (not counting CPUs)
  • Tianhe-1 (AMD HD 4870 X2): 2560 nodes * 1840.00 GFLOPS = 4710.40 single precision TFLOPS (not counting CPUs)

(HD 4870 X2's theoretical 2400 GFLOPS scaled down to 1840 GFLOPS to account for the downclocking from 750 MHz to 575 MHz).
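
The sketch mentioned above, reproducing the GPU-only totals:

    # GPU-only theoretical single precision totals
    nebulae_sp = 4640 * (448 * 2 * 1.150)          # C2050: 448 ALUs, multiply-add, 1.15 GHz
    tianhe1_sp = 2560 * (2 * 160 * 5 * 2 * 0.575)  # HD 4870 X2 downclocked to 575 MHz
    print(nebulae_sp / 1000, tianhe1_sp / 1000)    # ~4781 vs ~4710 TFLOPS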

Theoretical Nebulae with HD 5970 GPUs

Given that AMD GPUs are more powerful per Watt and per unit of price, it is interesting to look at how powerful Nebulae could have been with them... Had Nebulae been built on AMD HD 5970 GPUs (4640 single precision GFLOPS, 928 double precision GFLOPS each), with each single-GPU C2050 card replaced with a dual-GPU HD 5970, it would have been 1.6x faster in double precision and 3.8x faster in single precision, providing respectively a bewildering 4900 double precision TFLOPS and 22717 single precision TFLOPS of theoretical performance (including CPUs). It would have been technically feasible to use these AMD GPUs, as the power envelope of the HD 5970 is only slightly higher than that of the C2050 (294W vs. 247W). Note that Nebulae's "4640 nodes" coincide strangely with the 4640 single precision GFLOPS of an HD 5970. Coincidence? :-) My guess as to why NSCS chose Nvidia instead of AMD: perhaps they already had a large CUDA code base or set of CUDA applications they wanted to run on the supercomputer, or they wanted ECC GDDR RAM (AMD GPUs do not support ECC), or perhaps Nvidia, in need of proving that Fermi is a solid GPGPU choice despite its technical flaws, decided to practically give them those C2050 cards for free to effectively buy the number 2 spot on the TOP500 list... Sleazy, but possible.
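
A sketch of that hypothetical HD 5970 configuration, keeping the same 4640 nodes and dual X5650 processors; the CPU single precision rate of 8 FLOP/clock per core (twice the double precision rate) is my assumption:

    # Hypothetical Nebulae: one HD 5970 per node instead of one C2050
    NODES = 4640
    cpu_dp = 12 * 4 * 2.6666  # 12 X5650 cores per node, 4 DP FLOP/clock
    cpu_sp = 12 * 8 * 2.6666  # assumed 8 SP FLOP/clock per core
    dp_tflops = NODES * (cpu_dp + 928.0) / 1000   # HD 5970: 928 DP GFLOPS
    sp_tflops = NODES * (cpu_sp + 4640.0) / 1000  # HD 5970: 4640 SP GFLOPS
    print(dp_tflops, sp_tflops)  # ~4900 and ~22717 TFLOPS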

Comments

Ceearem wrote: My guess why they used NVIDIA instead of ATI would be a combination of more onboard memory on the Fermi cards (PCIe transfers are a really bad bottleneck for many GPU compute tasks) and the software ecosystem. ATI has a long way to go to provide a development platform as mature as what CUDA constitutes for NVIDIA GPUs. And this is obviously the reason why more software exists (and is in development) for NVIDIA GPUs than for ATI's cards. Despite that, Fermi does offer some GPGPU-specific features which might increase its usability compared to previous GPU generations. 01 Jun 2010 19:34 UTC

mrb wrote: Good point. Compute tasks on the C2050 can access 3GB, but only 2GB on the HD 5970 Eyefinity's 2x2GB (split across 2 GPUs). That said, if the tradeoff is 1 extra GB (Nvidia) versus 1.6x/3.8x higher double/single precision perf (AMD), is Nvidia decidedly the right choice for the very diverse set of GPGPU workloads that this research supercomputer is likely to execute? 01 Jun 2010 20:33 UTC

Ceearem wrote: True, but with dual GPUs there are other tradeoffs. I.e. you should really count the 5970 as two GPUs with 1GB each; that's how a GPGPU program sees it. So basically each instance of the program only has access to 1GB of memory, and each GPU has to share the PCIe bottleneck with the other. I make this distinction because often you have to partition a workload according to available memory, and communication costs can become a dominant factor. One example: 3D FFT. Splitting a 3D FFT over multiple GPUs often does not decrease walltime, since you have to communicate the full set of data after doing each dimension. So in that case, if my dataset fits into the 3GB of the Tesla card but would need to be spread over both halves of the 5970, the Tesla would probably outperform the ATI card by a wide margin. 01 Jun 2010 22:59 UTC

mrb wrote: Re: same PCIe link shared by the 2 GPUs of the HD 5970: I agree.

Note that when you say only 1GB is usable on the HD 5970, you are referring to the original HD 5970 edition. I was specifically referring to the HD 5970 Eyefinity which has 2x2GB total, so 2GB per GPU.
01 Jun 2010 23:21 UTC

RecessionCone wrote: I'm not sure it's possible to use the Tianhe system for any real computing, because it doesn't have ECC memory, which is pretty much required in big HPC. Otherwise, you don't know if you have a soft memory error during a large computation, and the entire simulation is potentially invalid.

AMD doesn't have ECC on any of their GPUs. The Tesla C2050 does.
04 Jun 2010 01:40 UTC

Carsten wrote: @mrb:
Those, on the other hand, do have a "slightly" higher TDP.

@RecessionCone
It's not desirable, but can't you do ECC in software? It costs performance, true, but at least you'd ensure your calculations aren't wasted.
04 Jun 2010 07:20 UTC

mrb wrote: @RecessionCone
Neither can you be certain that you have no errors on Fermi, because Nvidia implemented SECDED ECC, which is unable to detect many multi-bit errors. ECC is certainly desirable, but at a certain scale, with or without ECC, errors are inevitable and applications have to deal with them.
04 Jun 2010 09:38 UTC

fermion wrote: The problem is that people who ported their hydro codes to ATI cards in OpenCL and to Nvidia cards in CUDA see 3x faster execution on Nvidia, belying the theoretical calculations of equal or higher FLOPS from ATI.
Memory organization inside the GPU seems to be better.
I don't know about bandwidth to RAM... for me it's not even worth checking, since no matter what causes it, ATI is a bad idea in 2010. AMD also produces buggy drivers.

I'm thinking of building a 70GPU cluster with GTX480s.
25 Oct 2010 00:36 UTC

fermion wrote: One more thing, of course, is programming: much easier in CUDA, and all the great parallel libraries for free. I wish AMD offered anything remotely as useful for scientific applications. I think AMD is easily 2-3 years behind Nvidia in that department, and the distance is growing. Just my opinion.
10 years from now it might be the opposite. Until then...
25 Oct 2010 00:44 UTC