Here is an insightful technical post by Joerg Moellenkamp on the new SPARC T3 processor
(16 cores and 128 threads on 1 socket). Oracle just announced 1-, 2-, and 4-socket systems
built on this processor, which gives up to 512 threads per system. I remember Oracle/Sun
planning 8-socket T3 systems months ago, so I presume such beasts will be announced later.
I find it interesting that both Oracle, with the T3 processor, and AMD, with the
upcoming Bulldozer processor, adopted similar designs.
A Bulldozer module, as AMD calls it, consists of two integer units
and one floating point unit (see picture). AMD sometimes labels these units as
"cores", but this nomenclature is confusing. Instead, a whole Bulldozer module should be
seen as a 1-core 2-thread piece of x86-64 technology. When both threads execute
integer-only instructions, the second integer unit effectively doubles the
performance compared to a classic 1-core 2-thread design like Intel's SMT technology
(a Nehalem core only has one integer unit).
A SPARC T3 core, like the previous generation UltraSPARC T2/T2+, also has two
integer units and one floating point unit. However, a T3 core
exposes 8 threads to the OS.
So, effectively a Bulldozer module (2 integer units, 1 floating point unit, 2 threads)
is microarchitecturally equivalent to a T3 core (2 integer units, 1 floating point unit,
8 threads). This is the interesting story of Bulldozer: AMD finally adopted SMT,
but a beefed-up version of it where not 1, but 2, integer units are present in a "core"
to counterbalance the increased number of threads exposed by it.
Of course no one in the technology press picked this up and reported it this way,
because AMD is using words carefully to market a Bulldozer module as
"2 cores not supporting SMT" as it sounds better than "1 core supporting a better
version of SMT".
Now, I think a smart move for AMD would be to expose an even higher number of threads per
Bulldozer module, as it could be relatively inexpensive to implement in terms of die area
(Oracle showed they could expose 8 threads without too much difficulty).
For example, if 4 threads were exposed, the ratio of threads per integer unit would be
the same as Intel Nehalem. Who knows, perhaps AMD will do it in future revisions of
Bulldozer?
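To make the comparison concrete, here is a small sketch that computes the threads-per-integer-unit ratio for each design, using only the thread and integer unit counts quoted above. The 4-thread Bulldozer entry is the hypothetical configuration I speculate about, not an announced product, and the names in the code are mine.

```python
# (threads exposed to the OS, integer units) per core or module,
# using the counts quoted in this post.
designs = {
    "Intel Nehalem core": (2, 1),
    "AMD Bulldozer module": (2, 2),
    "Oracle SPARC T3 core": (8, 2),
    # Hypothetical: a Bulldozer module exposing 4 threads instead of 2.
    "Bulldozer module, 4 threads (hypothetical)": (4, 2),
}


def threads_per_integer_unit(threads, integer_units):
    """Return how many hardware threads share each integer unit."""
    return threads / integer_units


ratios = {name: threads_per_integer_unit(*cfg) for name, cfg in designs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:g} threads per integer unit")
```

With these counts, the hypothetical 4-thread module lands on the same ratio as a Nehalem core, while a T3 core shares each integer unit among twice as many threads.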
One of the adobe.com pages most visited by developers looking for a quick reference to the
official and peculiar syntax for embedding a Flash application in a Web page,
"OBJECT and EMBED syntax | Flash", is wrong about the syntax.
Not wrong in a pedantic actually-it-should-be-done-this-way-for-better-compliance way,
but wrong in the sense that it will not load Flash applications under Firefox
and probably other browsers.
<embed href="/support/flash/ts/documents/myFlashMovie.swf" ...
It should be src instead of href!
<embed src="/support/flash/ts/documents/myFlashMovie.swf" ...
One of my pet peeves is lack of attention to detail, but I just cannot imagine how Adobe
can screw up so badly that they get this essential document wrong.
For my inquisitive readers who want more details, I was testing with Firefox 3.0.19
under Linux, and Flash Player 10.1.82.76.
Pardon the obnoxious title of this post, but here is something I want to share,
and I feel it is necessary to disperse some absurd ideas going around.
Intel's next generation Sandy Bridge microarchitecture
features an integrated graphics core: the CPU and GPU share the same die.
The GPU will have up to 12 dual-issue execution units (EUs) [Anandtech].
In terms of maximum theoretical computing performance,
an EU is equivalent to two stream cores (AMD) or two
streaming processors (Nvidia) because, contrary to AMD's and Nvidia's units, it is
dual-issue and can therefore execute 2 instructions per cycle.
It is unclear what the clock frequency of the EUs will be:
the same as the CPU cores (~3GHz), closer to AMD's and Nvidia's
clocks (~1GHz), or somewhere in between.
Let us assume 2GHz.
- 2 instructions per clock (dual-issue EU)
- times 12 (number of EUs)
- times 2 billion (2GHz)
- equals 48 billion instructions per second
This number of instructions/sec gives an idea of the level of performance
of a GPU (graphics operations are translated to GPU instructions, and
execution units occupy most of the area of a GPU die).
For reference a low-end AMD Radeon HD 5450 can execute
52 billion instructions/sec. Very close. In fact, this
exclusive preview of a
Sandy Bridge Core i5-2400 3.1GHz confirms that the performance of the
i5-2400 matches roughly the performance of the HD 5450, which shows
the math is right.
Ready for the kicker?
The highest-end AMD Radeon HD 5970 can execute
2320 billion instructions/sec. In other words
a ~3GHz Sandy Bridge processor's integrated GPU will deliver only ~2% of the
graphics performance of the highest-end AMD Radeon video card.
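The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the back-of-the-envelope estimate: the 2GHz EU clock is my assumption, the two Radeon figures are the ones quoted in this post, and the function and variable names are made up for illustration.

```python
def instructions_per_second(issue_width, units, clock_hz):
    """Peak theoretical rate: instructions per clock per unit x units x clock."""
    return issue_width * units * clock_hz


# Sandy Bridge GPU: dual-issue EUs, 12 EUs, assumed 2 GHz clock.
sandy_bridge = instructions_per_second(issue_width=2, units=12, clock_hz=2e9)

# Figures quoted in the post for the two AMD cards.
radeon_hd_5450 = 52e9
radeon_hd_5970 = 2320e9

vs_low_end = sandy_bridge / radeon_hd_5450
vs_high_end = sandy_bridge / radeon_hd_5970

print(f"Sandy Bridge GPU: {sandy_bridge / 1e9:g} billion instructions/sec")
print(f"Relative to HD 5450: {vs_low_end:.0%}")
print(f"Relative to HD 5970: {vs_high_end:.1%}")
```

Changing `clock_hz` shows how sensitive the estimate is to the assumed EU clock: even at 3GHz the result stays a small fraction of the HD 5970 figure.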
Do not misinterpret me. 2% may sound bad, but it is good enough for
entry-level graphics performance (or else AMD would not be selling
the HD 5450). My point is that by their own design, Intel obviously do not intend
to, and will not, compete with top-of-the-line discrete GPUs. Nonetheless,
12 EUs seems really low; this number was probably carefully chosen
so as to not unnecessarily waste die space and power.
That said, perhaps Intel were originally hoping their integrated GPU
would be fast enough for some high-definition video transcoding, and
after realizing it would not be the case, set out to design the Media Engine,
aka Display Engine. This is a separate block on the die, neither part of
the GPU, nor part of the CPU cores, but part of the "System Agent",
and is made of fixed functions to implement video encoding and decoding
as efficiently as possible.
In conclusion, I am looking forward to playing with Sandy Bridge, but not for
high-end GPGPU or gaming.
When building Perl modules with CPAN, the system assumes that the same
compiler arguments that were used to compile Perl (indicated in the output
of "perl -V") should be used to compile
modules. However on OpenSolaris, Perl was compiled with the Sun C compiler,
whereas the OS distributes GCC by default. This translates to an annoying
situation: out of the box, when attempting to build a CPAN module,
GCC will fail when encountering arguments CPAN passed to it that it does
not recognize (the most prevalent error is "unrecognized option `-KPIC'").
The right solution is of course to install the Sun C compiler
("pkg install ss-dev") but this is 200MB+ of packages with tons of dependencies.
A quicker and hackish workaround is to write a cc(1) wrapper that translates or
ignores the 4 arguments that GCC does not support (-KPIC -xO3 -xspace -xildoff).
I wrote such a wrapper. Save it as an executable file named cc in a temporary
PATH location (e.g. /root/bin) and run CPAN like this:
$ env PATH="/root/bin:$PATH" /usr/perl5/5.8.4/bin/cpan Crypt::SSLeay
Here is the code:
#!/usr/bin/python
# cc(1) wrapper to build CPAN Perl modules with GCC on OpenSolaris. -mrb
import os, sys

path = '/usr/gnu/bin/cc'
args = []
i = 0
while i < len(sys.argv):
    if i == 0:
        # argv[0] becomes the real compiler
        args.append(path)
    elif sys.argv[i] == '-KPIC':
        # Sun C position-independent code flag -> GCC equivalent
        args.append('-fPIC')
    elif sys.argv[i] == '-xO3':
        # Sun C optimization level -> GCC equivalent
        args.append('-O3')
    elif sys.argv[i] == '-xspace':
        # no GCC equivalent needed: ignore
        pass
    elif sys.argv[i] == '-xildoff':
        # no GCC equivalent needed: ignore
        pass
    else:
        args.append(sys.argv[i])
    i += 1
os.execv(path, args)
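The same translation can also be written as a lookup table, which keeps the option mapping in one place and lets the argument rewriting be checked without actually invoking the compiler. This is a sketch of an alternative layout, not a replacement for the wrapper above: `translate_args` and `GCC_EQUIVALENT` are names made up for illustration, and the GCC path is the same one used above.

```python
GCC = '/usr/gnu/bin/cc'

# Sun C option -> GCC equivalent; None means "drop the option".
GCC_EQUIVALENT = {
    '-KPIC': '-fPIC',
    '-xO3': '-O3',
    '-xspace': None,
    '-xildoff': None,
}


def translate_args(argv):
    """Return the GCC command line corresponding to a Sun C command line."""
    args = [GCC]  # argv[0] is replaced by the real compiler
    for arg in argv[1:]:
        translated = GCC_EQUIVALENT.get(arg, arg)  # unknown options pass through
        if translated is not None:
            args.append(translated)
    return args


# In an actual wrapper, the last line would be:
#     os.execv(GCC, translate_args(sys.argv))
example = ['cc', '-KPIC', '-xO3', '-xspace', '-xildoff', '-c', 'foo.c']
print(' '.join(translate_args(example)))
```

If another unsupported option turns up in the output of "perl -V", it only needs one more entry in the table.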
I find flaws in almost every benchmark
review I read about solid state drives, an area I know well.
Whether it is the tester
degrading performance with poor hardware or OS settings,
forgetting to mention crucial details that can greatly impact results, or
not using their benchmarking tools correctly, or
even widely-used tools that are themselves poorly written(!),
these errors make many published benchmark results
simply incorrect, and sometimes deceptive.
I am currently doing in-depth research of SSDs providing good random
write IOPS, with a focus on those based on SandForce controllers,
and here are a few examples of flaws in SSD benchmark reviews from major tech sites:
-
Out of the hundreds of reviews of the popular Intel
X18-M/X25-M drive series, not a single one mentions that random write
IOPS depend heavily on the span of the LBA space being tested.
There is a difference of almost 10x between random write IOPS measured on
the full LBA space as opposed to an 8GB fraction of it:
respectively 350 IOPS and 3300 IOPS.
Intel themselves do not make the information
easily accessible. They published it for only 1 model (first generation, 50nm): you have to
browse a page for SSD resellers and click on the link named "Intel X18-M/X25-M SATA Solid State
Drive - Enterprise Server/Storage Applications product manual addendum"
(direct link)
to access a PDF that documents the difference between measuring random
write IOPS on an 8GB span as opposed to 100% span! Also, Intel generally publish
100% span numbers for enterprise-class SSD models, but 8GB span numbers for consumer-class SSD models.
This spec is the only source documenting the 2 performance numbers for the
same SSD model (Intel publish no equivalent spec for the second generation
34nm X18-M/X25-M drive series).
-
Two benchmarking tools, CrystalDiskMark and AS SSD,
are popular despite a flaw that many reviewers noticed: they
report sequential read/write throughput results consistently inferior
to other benchmarking tools (especially for SF-1200-based SSDs). For example,
Benchmark Reviews tested the OCZ Vertex 2 120GB and these tools report
210-215MB/s while all other tools report 270-280MB/s as expected.
[Update: one explanation is that this could be due to CrystalDiskMark and
AS SSD being set up to
write and read random data,
whereas other tools use constant data, inadvertently allowing the SF-1200 controller
to aggressively optimize I/O with its transparent data deduplication and compression
features. If this is the case, then I shift my objection to all reviewers, including
Benchmark Reviews, who fail to mention how CrystalDiskMark is configured
—random or constant data— and thereby provide results that readers
cannot interpret.]
-
In this
14-page Legit Reviews article
on the OCZ Vertex 2 100GB drive, the test system is preventing the drive from showing
its full potential in sequential reads: there is a bottleneck at 230MB/s when this
SSD is known to reach 270MB/s as confirmed by many other reviews:
Anandtech,
Techspot, etc.
This indicates either a problem with the test system, or aging of the SSD which engaged the
wear-leveling algorithms, degrading performance.
-
Benchmark
Reviews admits here they made the mistake of publishing several SSD benchmarks
with IOPS numbers measured with an I/O queue depth of 1, instead of 32
which is the maximum allowed by NCQ and provides the maximum random IOPS performance.
They were effectively measuring latency instead of true IOPS performance.
-
Here is a
terrible use of IOmeter
from Benchmark Reviews, which measures about 1k IOPS (latency: 1ms) for a
top-of-the-line Crucial RealSSD C300 SSD. Even the most basic random I/O tests
with a queue depth of 1 on any SSD should yield at least 10k IOPS (latency: 100us).
As a matter of fact, all the IOPS results for the 11 SSDs tested on this page
are suspiciously low and reflect a poor configuration of IOmeter, about which
the author gives no information whatsoever.
-
A similar mistake is made by
Anandtech in this Crucial RealSSD C300 review where the random 4kB read IOPS
performance is measured with a queue depth that is too small. For example
the 256GB model reports 20k IOPS (79.5MB/s) on SATA 3Gbps, when in fact it is
known to be capable of
50k IOPS
at the maximum queue depth of 32.
The reviewer is aware of the queue depth issue, but presents results for both a short
and long queue only for random write tests (not random reads).
-
ServeTheHome measured a sequential read throughput of
502.6MB/s [sic] over a SATA 300MB/s link. It is
simply impossible to surpass
the throughput of the underlying SATA link
—probably another bug in CrystalDiskMark.
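Several of the flaws above come down to the relation between queue depth, latency, and IOPS. As a rough sketch, Little's law gives IOPS as the number of outstanding I/Os divided by the average latency per I/O. This is an idealized model (real drives do not scale linearly with queue depth), and the function names are mine; the input numbers are the ones quoted in the list above.

```python
def iops(queue_depth, latency_s):
    """Idealized steady-state IOPS: outstanding I/Os / average latency (Little's law)."""
    return queue_depth / latency_s


def throughput_mb_s(iops_value, block_bytes=4096):
    """Throughput in MB/s (10^6 bytes) for a given IOPS rate and block size."""
    return iops_value * block_bytes / 1e6


def exceeds_link(measured_mb_s, link_mb_s):
    """True if a measured throughput is higher than the link can carry."""
    return measured_mb_s > link_mb_s


# Queue depth 1: the IOPS figure is just the inverse of the latency.
print(iops(1, 1e-3))      # 1ms per I/O
print(iops(1, 100e-6))    # 100us per I/O

# 20k random 4kB reads per second, expressed as throughput.
print(throughput_mb_s(20000))

# A 502.6MB/s result over a 300MB/s SATA link is not physically possible.
print(exceeds_link(502.6, 300))
```

At a queue depth of 1 the result is purely a latency measurement, which is why such numbers say little about what a drive can sustain with 32 outstanding commands.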
Benchmarking is hard. Most people get it wrong. The above flaws demonstrate
that even major online tech sites do not always provide quality results. Too
many of them hire young writers passionate about technology, but otherwise with
little in-depth
knowledge of what they are testing, and barely able to run benchmarking tools
and correctly interpret results that come out of them.
If I were to rank the best Android phones —I favor CPU speed, unit weight, and
have 512MB RAM minimum as a requirement— my list would be:
- Samsung Galaxy S, 1GHz Hummingbird, 480x800 resolution, 512MB RAM, 122x64x10mm, 118g
- HTC Droid Incredible, 1GHz Snapdragon, 480x800, 512MB, 118x59x12mm, 130g
- HTC Nexus One, 1GHz Snapdragon, 480x800, 512MB, 119x60x12mm, 130g
- Motorola Droid 2, 1GHz OMAP3620, 480x854, 512MB, 116x61x14mm, 169g
- Motorola Droid X, 1GHz OMAP3630, 480x854, 512MB, 128x66x10mm, 155g
- HTC Evo, 1GHz Snapdragon, 480x800, 512MB, 122x66x13mm, 170g
The Galaxy S does have the fastest processor. Out of all the processors used by these
units, Hummingbird and OMAP36xx are based on the same ARM Cortex A8 core, but
the former has a better GPU (PowerVR SGX 540 vs. SGX 530) with
double the pipelines
and higher clock speeds. And both the Hummingbird and OMAP36xx are generally
recognized as faster than the Snapdragon. The Galaxy S even
outperforms the iPhone 4 hardware.
That said, feature-wise, I have to admit that the HDMI output of the Droid X and Evo is pretty slick.
Tonight, I had to reflow the solder joint of the power jack
on my laptop. In more than 5.5 years of abusing^Husing
this Panasonic R3 every single day as my main computer, this was
only the second hardware failure I have experienced.
The internal Toshiba 2.5" HDD did fail after 4 years —based
on the sounds it started emitting weeks before dying, I suspect
mechanical failure and blame myself for roughly handling
the machine
— and I replaced it with an SSD. But other than that I
think these laptops sure do deserve their "Toughbook" brand name...
Back to my problem: when plugging in the power plug, I noticed that
wiggling the cable caused the laptop to intermittently lose power. I checked
the cable with a voltmeter while the adaptor was plugged into
the wall, and it looked okay. So I closely inspected the power jack and noticed
the central pin could move a little bit, a sign that a solder joint had become
weak. I opened the laptop thanks to the same guide I used to
replace the HDD with an SSD last year. Fifteen minutes later, after
removing 20+ screws, latches, and ribbon cables, I gained access to the
power jack
and visually confirmed the weak joint. Even when moving the pin to the best
position possible, the electrical resistance between the pin and the cable
coming out of the jack was still above 40 Ω.
Very minor burn marks from undesired
arcing could even be seen. Fortunately the whole back of the jack was protected
with electrical tape, and the arcing did not seem to have caused any other
damage. A quick intervention with my soldering iron to reflow the joint and
add a bit more solder easily fixed the problem. I am now typing this blog entry
on my rejuvenated R3!