AMD Bulldozer and Oracle SPARC T3: Same Beefed-up SMT Microarchitectures

Here is an insightful technical post from Joerg Moellenkamp on the new SPARC T3 processor (16 cores and 128 threads on 1 socket). Oracle just announced 1-, 2-, and 4-socket systems built on this processor, which gives up to 512 threads per system. I remember Oracle/Sun planning 8-socket T3 systems months ago, so I presume such beasts will be announced later.

I find it interesting that both Oracle, with the T3 processor, and AMD, with the upcoming Bulldozer processor, adopted similar designs.

A Bulldozer module, as AMD calls it, consists of two integer units, and one floating point unit (see picture). AMD sometimes labels these units as "cores" but this nomenclature is confusing. Instead a whole Bulldozer module should be seen as a 1-core 2-thread piece of x86-64 technology. When one of the threads executes integer-only instructions, the second integer unit effectively doubles the performance compared to a classic 1-core 2-thread design like Intel's SMT technology (a Nehalem core only has one integer unit).

A SPARC T3 core, like the previous generation UltraSPARC T2/T2+, also has two integer units, and one floating point unit. However a T3 core exposes 8 threads to the OS.

So, effectively a Bulldozer module (2 integer units, 1 floating point unit, 2 threads) is microarchitecturally equivalent to a T3 core (2 integer units, 1 floating point unit, 8 threads). This is the interesting story of Bulldozer: AMD finally adopted SMT, but a beefed-up version of it where not 1, but 2, integer units are present in a "core" to counterbalance the increased number of threads exposed by it. Of course no one in the technology press picked this up and reported it this way, because AMD is using words carefully to market a Bulldozer module as "2 cores not supporting SMT" as it sounds better than "1 core supporting a better version of SMT".

Now, I think a smart move for AMD would be to expose an even higher number of threads per Bulldozer module, as it could be relatively inexpensive to implement in terms of die area (Oracle showed they could expose 8 threads without too much difficulty). For example, if 4 threads were exposed, the ratio of threads per integer unit would be the same as Intel Nehalem's. Who knows, perhaps AMD will do it in future revisions of Bulldozer?
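
To make the comparison concrete, here is a small Python sketch of the threads-per-integer-unit ratios, using only the figures quoted in this post. The 4-thread Bulldozer entry is the hypothetical configuration suggested above, not an announced product, and the dictionary layout and names are mine.

```python
# Threads and integer units per core (or per Bulldozer module), as quoted in this post.
designs = {
    "Intel Nehalem core": {"threads": 2, "integer_units": 1},
    "AMD Bulldozer module": {"threads": 2, "integer_units": 2},
    "Oracle SPARC T3 core": {"threads": 8, "integer_units": 2},
    # Hypothetical: a Bulldozer module exposing 4 threads (not an AMD product).
    "Bulldozer module, 4 threads (hypothetical)": {"threads": 4, "integer_units": 2},
}

def threads_per_integer_unit(design):
    """Return how many hardware threads share each integer unit."""
    return design["threads"] / design["integer_units"]

ratios = {name: threads_per_integer_unit(d) for name, d in designs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:g} thread(s) per integer unit")

# Thread count of the largest announced T3 system: 4 sockets x 16 cores x 8 threads.
t3_system_threads = 4 * 16 * 8
print(f"4-socket T3 system: {t3_system_threads} threads")
```

Counted this way, today's Bulldozer module is the least oversubscribed of the three designs, which is why exposing more threads per module looks plausible.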

mrb Wednesday 22 September 2010 at 10:56 pm | | Default | No comments

Adobe's Primordial Flash Example is Incorrect

One of the pages on adobe.com most visited by developers looking for a quick reference to the official and peculiar syntax to embed a Flash application in a Web page, "OBJECT and EMBED syntax | Flash", is wrong about the syntax.

Not wrong in a pedantic actually-it-should-be-done-this-way-for-better-compliance way, but wrong in the sense that it will not load Flash applications under Firefox and probably other browsers.

<embed href="/support/flash/ts/documents/myFlashMovie.swf" ...

It should be src instead of href!

<embed src="/support/flash/ts/documents/myFlashMovie.swf" ...

One of my pet peeves is lack of attention to detail, but I just cannot imagine how Adobe can screw up so badly that they get this primordial document wrong.

For my inquisitive readers who want more details, I was testing with Firefox 3.0.19 under Linux, and Flash Player 10.1.82.76.

mrb Tuesday 21 September 2010 at 12:06 am | | Default | No comments

Intel's Sandy Bridge to Deliver 2% of AMD's Top Graphics Performance

Pardon the obnoxious title of this post, but here is something I want to share, and I feel it is necessary to disperse some absurd ideas going around.

Intel's next generation Sandy Bridge microarchitecture features an integrated graphics core: the CPU and GPU share the same die. The GPU will have up to 12 dual-issue execution units (EUs) [Anandtech]. In terms of maximum theoretical computing performance, an EU is equivalent to two stream cores (AMD) or two streaming processors (Nvidia), because contrary to AMD's and Nvidia's units, it is dual-issue and can therefore execute 2 instructions per cycle. It is unclear what the clock frequency of the EUs will be: the same as the CPU cores (~3GHz), closer to AMD's and Nvidia's clocks (~1GHz), or somewhere in between. Let us assume 2GHz.

  • 2 instructions per clock (dual-issue EU)
  • times 12 (number of EUs)
  • times 2 billion (2GHz)
  • equal 48 billion instructions per second

This number of instructions/sec gives an idea of the level of performance of a GPU (graphics operations are translated to GPU instructions, and execution units occupy most of the area of a GPU die). For reference, a low-end AMD Radeon HD 5450 can execute 52 billion instructions/sec. Very close. In fact, this exclusive preview of a Sandy Bridge Core i5-2400 3.1GHz confirms that the performance of the i5-2400 roughly matches the performance of the HD 5450, which suggests the math is right.

Ready for the kicker?

The highest-end AMD Radeon HD 5970 can execute 2320 billion instructions/sec. In other words a ~3GHz Sandy Bridge processor's integrated GPU will deliver only ~2% of the graphics performance of the highest-end AMD Radeon video card.
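
The arithmetic behind that figure can be reproduced in a few lines of Python. This is only a sketch: the 2GHz EU clock is the assumption made earlier in this post, the two Radeon figures are the ones quoted above, and the function name is mine.

```python
def instructions_per_second(issue_width, units, clock_hz):
    """Peak theoretical instruction throughput of a GPU."""
    return issue_width * units * clock_hz

# Sandy Bridge GPU: 12 dual-issue EUs, at an ASSUMED clock of 2GHz.
sandy_bridge = instructions_per_second(issue_width=2, units=12, clock_hz=2e9)

# Figures quoted in this post for the two Radeon cards.
radeon_hd_5450 = 52e9
radeon_hd_5970 = 2320e9

print(f"Sandy Bridge GPU: {sandy_bridge / 1e9:.0f} billion instructions/sec")
print(f"vs. HD 5450: {sandy_bridge / radeon_hd_5450:.0%}")
print(f"vs. HD 5970: {sandy_bridge / radeon_hd_5970:.1%}")
```

Changing clock_hz shows how sensitive the conclusion is to the clock assumption: even at 3GHz the ratio against the HD 5970 stays in the low single digits.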

Do not misinterpret me. 2% may sound bad, but it is good enough for entry-level graphics performance (or else AMD would not be selling the HD 5450). My point is that by their own design, Intel obviously do not intend to, and will not, compete with top-of-the-line discrete GPUs. Nonetheless, 12 EUs seems really low; this number was probably carefully chosen so as to not unnecessarily waste die space and power.

That said, perhaps Intel were originally hoping their integrated GPU would be fast enough for some high-definition video transcoding, and after realizing it would not be the case, set out to design the Media Engine, aka Display Engine. This is a separate block on the die, neither part of the GPU, nor part of the CPU cores, but part of the "System Agent", and is made of fixed functions to implement video encoding and decoding as efficiently as possible.

In conclusion, I am looking forward to playing with Sandy Bridge, but not for high-end GPGPU or gaming :-)

mrb Wednesday 15 September 2010 at 02:34 am | | Default | Three comments

OpenSolaris: How to Build CPAN Perl Modules with GCC Instead of the Sun Compiler

When building Perl modules with CPAN, the system assumes that the same compiler arguments that were used to compile Perl (indicated in the output of "perl -V") should be used to compile modules. However, on OpenSolaris, Perl was compiled with the Sun C compiler, whereas the OS distributes GCC by default. This translates to an annoying situation: out of the box, when attempting to build a CPAN module, GCC will fail when it encounters arguments passed by CPAN that it does not recognize (the most prevalent error is "unrecognized option `-KPIC'"). The right solution is of course to install the Sun C compiler ("pkg install ss-dev"), but this is 200MB+ of packages with tons of dependencies. A quicker, hackish workaround is to write a cc(1) wrapper that translates or ignores the 4 arguments that GCC does not support (-KPIC -xO3 -xspace -xildoff). I wrote such a wrapper. Save it as an executable file named cc in a temporary PATH location (e.g. /root/bin) and run CPAN like this:

$ env PATH="/root/bin:$PATH" /usr/perl5/5.8.4/bin/cpan Crypt::SSLeay

Here is the code:

#!/usr/bin/python
# cc(1) wrapper to build CPAN Perl modules with GCC on OpenSolaris. -mrb
import os, sys

path = '/usr/gnu/bin/cc'
# Sun C compiler options mapped to their GCC equivalents (None = drop it).
translate = {
    '-KPIC': '-fPIC',
    '-xO3': '-O3',
    '-xspace': None,
    '-xildoff': None,
}
# argv[0] becomes the real compiler path; every other argument is
# translated, dropped, or passed through unchanged.
args = [path]
for arg in sys.argv[1:]:
    arg = translate.get(arg, arg)
    if arg is not None:
        args.append(arg)
os.execv(path, args)

mrb Monday 13 September 2010 at 10:11 pm | | Default | Five comments

Many SSD Benchmark Reviews Contain Flaws

I find flaws in almost every benchmark review I read about solid state drives, an area I know well. Whether it is the tester degrading performance with poor hardware or OS settings, or forgetting to mention crucial details that can greatly impact results, or not using his benchmarking tools correctly, or even widely-used tools that are themselves poorly written(!), these errors make many published benchmark results simply incorrect, and sometimes deceptive.

I am currently doing in-depth research of SSDs providing good random write IOPS, with a focus on those based on SandForce controllers, and here are a few examples of flaws in SSD benchmark reviews from major tech sites:

  • Out of the hundreds of reviews of the popular Intel X18-M/X25-M drive series, not a single one mentions that random write IOPS depend heavily on the span of the LBA space being tested. There is a difference of almost 10x between random write IOPS measured on the full LBA space as opposed to an 8GB fraction of it: respectively 350 IOPS and 3300 IOPS. Intel themselves do not make the information easily accessible. They published it for only 1 model (first generation, 50nm): you have to browse a page for SSD resellers and click on the link named "Intel X18-M/X25-M SATA Solid State Drive - Enterprise Server/Storage Applications product manual addendum" (direct link) to access a PDF that documents the difference between measuring random write IOPS on an 8GB span as opposed to a 100% span. Also, Intel generally publish 100% span numbers for enterprise-class SSD models, but 8GB span numbers for consumer-class SSD models. This spec is the only source documenting both performance numbers for the same SSD model (Intel publish no equivalent spec for the second generation 34nm X18-M/X25-M drive series).
  • Two benchmarking tools, CrystalDiskMark and AS SSD, are popular despite a flaw that many reviewers noticed: they report sequential read/write throughput results consistently inferior to other benchmarking tools (especially for SF-1200-based SSDs.) For example Benchmark Reviews tested the OCZ Vertex 2 120GB and these tools report 210-215MB/s while all other tools report 270-280MB/s as expected. [Update: one explanation is that this could be due to CrystalDiskMark and AS SSD being set up to write and read random data, whereas other tools use constant data, inadvertently allowing the SF-1200 controller to aggressively optimize I/O with its transparent data deduplication and compression features. If this is the case, then I shift my objection to all reviewers, including Benchmark Reviews, who fail to mention how CrystalDiskMark is configured —random or constant data— and thereby provide results that readers cannot interpret.]
  • In this 14-page Legit Reviews article on the OCZ Vertex 2 100GB drive, the test system is preventing the drive from showing its full potential in sequential reads: there is a bottleneck at 230MB/s when this SSD is known to reach 270MB/s as confirmed by many other reviews: Anandtech, Techspot, etc. This indicates either a problem with the test system, or aging of the SSD which engaged the wear-leveling algorithms, degrading performance.
  • Benchmark Reviews admits here they made the mistake of publishing several SSD benchmarks with IOPS numbers measured with an I/O queue depth of 1, instead of 32 which is the maximum allowed by NCQ and provides the maximum random IOPS performance. They were effectively measuring latency instead of true IOPS performance.
  • Here is a terrible use of IOmeter from Benchmark Reviews, which reports about 1k IOPS (latency: 1ms) for a top-of-the-line Crucial RealSSD C300 SSD. Even the most basic random I/O tests with a queue depth of 1 on any SSD should yield at least 10k IOPS (latency: 100us). As a matter of fact, all the IOPS results for the 11 SSDs tested on this page are suspiciously low and reflect a poor configuration of IOmeter, about which the author gives no information whatsoever.
  • A similar mistake is made by Anandtech in this Crucial RealSSD C300 review where the random 4kB read IOPS performance is measured with a queue depth that is too small. For example the 256GB model reports 20k IOPS (79.5MB/s) on SATA 3Gbps, when in fact it is known to be capable of 50k IOPS at the maximum queue depth of 32. The reviewer is aware of the queue depth issue, but presents results for both a short and long queue only for random write tests (not random reads).
  • ServeTheHome measured a sequential read throughput of 502.6MB/s [sic] over a SATA 300MB/s link. It is simply impossible to surpass the throughput of the underlying SATA link —probably another bug in CrystalDiskMark.
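
Several of the flaws above come down to simple relationships between queue depth, latency, IOPS, and throughput. Here is a rough Python sketch of those sanity checks. It assumes an idealized drive where IOPS = queue depth / average latency (Little's law) and decimal megabytes (1MB = 10^6 bytes); the function names are mine, and real drives do not scale perfectly with queue depth.

```python
def iops_from_latency(latency_s, queue_depth=1):
    """Little's law: outstanding I/Os divided by the average latency of each."""
    return queue_depth / latency_s

def throughput_mb_s(iops, block_bytes=4096):
    """Throughput in MB/s (1MB = 10**6 bytes) for a given IOPS and block size."""
    return iops * block_bytes / 1e6

def exceeds_link(measured_mb_s, link_mb_s):
    """True if a measured throughput is above what the link can carry."""
    return measured_mb_s > link_mb_s

# Queue depth 1: 1ms per I/O gives about 1k IOPS, 100us gives about 10k IOPS.
print(round(iops_from_latency(1e-3)))
print(round(iops_from_latency(100e-6)))

# 20k random 4kB reads per second is roughly 80MB/s.
print(round(throughput_mb_s(20000), 1))

# 502.6MB/s cannot come through a 300MB/s SATA link.
print(exceeds_link(502.6, 300))
```

A review whose numbers fail checks like these is not necessarily wrong, but it should at least explain the queue depth, block size, and link speed used.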

Benchmarking is hard. Most people get it wrong. The above flaws demonstrate that even major online tech sites do not always provide quality results. Too many of them hire young writers passionate about technology, but otherwise with little in-depth knowledge of what they are testing, and barely able to run benchmarking tools and correctly interpret results that come out of them.

mrb Sunday 12 September 2010 at 11:37 pm | | Default | Seventeen comments

Specs of the Best Android Phones

If I were to rank the best Android phones —I favor CPU speed, unit weight, and have 512MB RAM minimum as a requirement— my list would be:

  1. Samsung Galaxy S, 1GHz Hummingbird, 480x800 resolution, 512MB RAM, 122x64x10mm, 118g
  2. HTC Droid Incredible, 1GHz Snapdragon, 480x800, 512MB, 118x59x12mm, 130g
  3. HTC Nexus One, 1GHz Snapdragon, 480x800, 512MB, 119x60x12mm, 130g
  4. Motorola Droid 2, 1GHz OMAP3620, 480x854, 512MB, 116x61x14mm, 169g
  5. Motorola Droid X, 1GHz OMAP3630, 480x854, 512MB, 128x66x10mm, 155g
  6. HTC Evo, 1GHz Snapdragon, 480x800, 512MB, 122x66x13mm, 170g

The Galaxy S does have the fastest processor. Out of all the processors used by these units, Hummingbird and OMAP36xx are based on the same ARM Cortex A8 core, but the former has a better GPU (PowerVR SGX 540 vs. SGX 530) with double the pipelines and higher clock speeds. And both the Hummingbird and OMAP36xx are generally recognized as faster than the Snapdragon. The Galaxy S even outperforms the iPhone 4 hardware.

That said, feature-wise, I have to admit that the HDMI output of the Droid X and Evo is pretty slick.

mrb Thursday 09 September 2010 at 12:06 am | | Default | No comments

Panasonic R3 Power Jack Solder Joint Reflowing

Tonight, I had to reflow the solder joint of the power jack on my laptop. In more than 5.5 years of abusing^Husing this Panasonic R3 every single day as my main computer, this was only the second hardware failure I have experienced.

The internal Toshiba 2.5" HDD did fail after 4 years —based on the sounds it started emitting weeks before dying, I suspect mechanical failure and blame myself for roughly handling the machine :-)— and I replaced it with an SSD. But other than that I think these laptops sure do deserve their "Toughbook" brand name...

Back to my problem: with the power plug inserted, I noticed that wiggling the cable caused the laptop to intermittently lose power. I verified the continuity of the cable with a voltmeter while the adaptor was plugged into the wall, and it looked okay. So I closely inspected the power jack and noticed the central pin could move a little bit, a sign that a solder joint had become weak. I opened the laptop thanks to the same guide I used to replace the HDD with an SSD last year. Fifteen minutes later, after removing 20+ screws, latches, and ribbon cables, I gained access to the power jack and visually confirmed the weak joint. Even when moving the pin to the best position possible, the electrical resistance between the pin and the cable coming out of the jack was still above 40 Ω. Very minor burn marks from undesired arcing could even be seen. Fortunately the whole back of the jack was protected with electrical tape, and the arcing did not seem to have caused any other damage. A quick intervention with my soldering iron to reflow the joint and add a bit more solder easily fixed the problem. I am now typing this blog entry on my rejuvenated R3!

mrb Wednesday 08 September 2010 at 12:38 am | | Default | No comments