mrb's blog

Many SSD Benchmark Reviews Contain Flaws

Keywords: bug performance sata ssd storage

I find flaws in almost every benchmark review I read about solid state drives, an area I know well. Whether it is the tester degrading performance with poor hardware or OS settings, forgetting to mention crucial details that can greatly impact results, not using his benchmarking tools correctly, or even relying on widely-used tools that are themselves poorly written(!), these errors make many published benchmark results simply incorrect, and sometimes deceptive.

I am currently doing in-depth research on SSDs that provide good random write IOPS, with a focus on those based on SandForce controllers. Here are a few examples of flaws in SSD benchmark reviews from major tech sites:

  • Out of the hundreds of reviews of the popular Intel X18-M/X25-M drive series, not a single one mentions that random write IOPS depends heavily on the span of the LBA space being tested. There is a difference of almost 10x between random write IOPS measured on the full LBA space as opposed to an 8GB fraction of it: respectively 350 IOPS and 3300 IOPS (a minimal measurement sketch illustrating the effect of the tested span, and of the queue depth discussed below, follows this list). Intel themselves do not make this information easily accessible. They published it for only one model (first generation, 50nm): you have to browse a page for SSD resellers and click on the link named "Intel X18-M/X25-M SATA Solid State Drive - Enterprise Server/Storage Applications product manual addendum" (direct link) to access a PDF that documents the difference between measuring random write IOPS on an 8GB span as opposed to a 100% span! Also, Intel generally publish 100% span numbers for enterprise-class SSD models, but 8GB span numbers for consumer-class SSD models. This addendum is the only source documenting the two performance numbers for the same SSD model (Intel publish no equivalent spec for the second generation 34nm X18-M/X25-M drive series.)
  • Two benchmarking tools, CrystalDiskMark and AS SSD, are popular despite a flaw that many reviewers have noticed: they report sequential read/write throughput results consistently lower than other benchmarking tools (especially for SF-1200-based SSDs.) For example Benchmark Reviews tested the OCZ Vertex 2 120GB: these tools report 210-215MB/s while all other tools report 270-280MB/s as expected. [Update: one explanation is that CrystalDiskMark and AS SSD may be set up to write and read random data, whereas other tools use constant data, which inadvertently allows the SF-1200 controller to aggressively optimize I/O with its transparent data deduplication and compression features. If this is the case, then I shift my objection to all reviewers, including Benchmark Reviews, who fail to mention how CrystalDiskMark is configured (random or constant data) and thereby provide results that readers cannot interpret.]
  • In this 14-page Legit Reviews article on the OCZ Vertex 2 100GB drive, the test system prevents the drive from showing its full potential in sequential reads: results plateau at 230MB/s, while this SSD is known to reach 270MB/s, as confirmed by many other reviews (Anandtech, Techspot, etc.). This indicates either a problem with the test system, or an aged SSD whose wear-leveling algorithms have engaged and degraded performance.
  • Benchmark Reviews admits here that they made the mistake of publishing several SSD benchmarks with IOPS numbers measured at an I/O queue depth of 1, instead of 32, which is the maximum allowed by NCQ and the depth that yields maximum random IOPS. They were effectively measuring latency instead of true IOPS performance.
  • Here is a terrible use of IOmeter from Benchmark Reviews, who measure about 1k IOPS (latency: 1ms) for a top-of-the-line Crucial RealSSD C300 SSD. Even the most basic random I/O tests with a queue depth of 1 on any SSD should yield at least 10k IOPS (latency: 100us). As a matter of fact, all the IOPS results for the 11 SSDs tested on this page are suspiciously low and reflect a poor IOmeter configuration, about which the author gives no information whatsoever.
  • A similar mistake is made by Anandtech in this Crucial RealSSD C300 review, where random 4kB read IOPS performance is measured with a queue depth that is too small. For example the 256GB model reports 20k IOPS (79.5MB/s) on SATA 3Gbps, when in fact it is known to be capable of 50k IOPS at the maximum queue depth of 32. The reviewer is aware of the queue depth issue, but presents results for both a short and a long queue only for random write tests (not random reads).
  • ServeTheHome measured a sequential read throughput of 502.6MB/s [sic] over a SATA 3Gbps link, which cannot carry more than 300MB/s after 8b/10b encoding overhead. It is simply impossible to surpass the throughput of the underlying SATA link, so this is probably another bug in CrystalDiskMark.
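
To make the two variables discussed above concrete (the LBA span being tested and the I/O queue depth), here is a minimal sketch of how random write IOPS could be measured on Linux. It is not one of the tools discussed in this article; it assumes Python 3 and a hypothetical scratch device /dev/sdb whose contents may be destroyed, and it approximates a deep queue with threads where a real tool would use asynchronous I/O (libaio, IOmeter workers, etc.):

    import mmap, os, random, threading, time

    DEVICE = "/dev/sdb"       # assumption: a scratch SSD -- this test DESTROYS its data
    BLOCK = 4096              # 4 KiB random writes
    SPAN = 8 * 1024**3        # tested span: 8 GiB; set to the device size for a 100% span test
    QUEUE_DEPTH = 32          # outstanding I/Os; a depth of 1 effectively measures latency
    DURATION = 30             # seconds

    def worker(results, i):
        # O_DIRECT bypasses the page cache so the drive, not RAM, is measured.
        fd = os.open(DEVICE, os.O_WRONLY | os.O_DIRECT)
        buf = mmap.mmap(-1, BLOCK)     # mmap memory is page-aligned, as O_DIRECT requires
        buf.write(os.urandom(BLOCK))   # random payload defeats transparent compression/dedup
        blocks = SPAN // BLOCK
        ios = 0
        deadline = time.monotonic() + DURATION
        while time.monotonic() < deadline:
            os.pwrite(fd, buf, random.randrange(blocks) * BLOCK)
            ios += 1
        os.close(fd)
        results[i] = ios

    results = [0] * QUEUE_DEPTH
    threads = [threading.Thread(target=worker, args=(results, i)) for i in range(QUEUE_DEPTH)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("%d IOPS (QD %d, %d GiB span)" % (sum(results) / DURATION, QUEUE_DEPTH, SPAN >> 30))

Run it once with SPAN set to 8 GiB and once with it set to the drive's full capacity, and the X25-M-style gap between the two numbers becomes obvious; likewise, dropping QUEUE_DEPTH to 1 reproduces the latency-instead-of-IOPS mistake described above.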

Benchmarking is hard. Most people get it wrong. The above flaws demonstrate that even major online tech sites do not always provide quality results. Too many of them hire young writers who are passionate about technology but otherwise have little in-depth knowledge of what they are testing, and who are barely able to run benchmarking tools and correctly interpret the results that come out of them.

Comments

Ron Talman wrote: Thank you, this was very helpful to me. 13 Sep 2010 08:19 UTC

Jon V wrote: Can you point to anyone getting SSD benchmarking right? Which benchmarking tools do you consider reliable? 13 Sep 2010 10:28 UTC

mrb wrote: I tend to benchmark under Linux with my own custom tools and basic system utilities (dd, iostat).

Anandtech stands out from the pack. For example, they are the only ones who benchmark SandForce-based SSDs (OCZ, Crucial, ADATA...) correctly, by taking care to defeat the transparent compression and deduplication features of this controller: they patch IOMeter to use blocks of random bytes instead of constant bytes.
13 Sep 2010 10:48 UTC

Wes Brown wrote: You hit the nail on the head.
After years of testing spinning disks, where high disk queue lengths meant high latency, most people test and bench at very low queue depths. I always test up to 128 to make sure that I hit the drive's peak and then put NCQ under pressure. Drives like Fusion-io in the enterprise space chug right along at very high queue depths. I am trying to get people away from obsessing over queue depth numbers and to focus on the number of IOs and latency. SSDs don't generally suffer the hockey stick effect at high queue depths: they may get slower, but generally just level out if they don't hit write amplification issues. I've tried to get all the major sites to share their methodology and IOmeter scripts but they aren't interested in outside verification of their results...
-wes
13 Sep 2010 13:41 UTC

jamie dalgetty wrote: i guess ill hold off on getting one of these 13 Sep 2010 14:10 UTC

Edgar Gharibian wrote: @ mrb

So you have yet to adopt an entry-level SSD into your personal system?

Even with the first generation SSD drives, I noticed a big enough difference in the user experience to justify including one as standard equipment (Windows user, generally against "first adopter" status in technological goods).

I don't see why any of the drawbacks you listed (about the drives, not the reviewers) should stop you from adopting (and loving!) an SSD drive.
13 Sep 2010 16:29 UTC

Edgar wrote: I apologize, I was looking at the icons and not the names. I didn't realize the comment "i guess ill hold off on getting one of these" was not from mrb. 13 Sep 2010 16:31 UTC

Suresh wrote: You pointed out where Benchmark Reviews once used a queue depth of 1 in Iometer, but they quickly switched to QD 30 for subsequent tests on the entire SandForce SSD product line. They also published an article similar to yours, albeit in more depth, here: http://benchmarkreviews.com/index.php?option=com_content&task=view&id=270&Itemid=38

As far as SSD testing goes, they've got one of the best sites on the topic. Their reviews prior to 2010 didn't use the best settings, but for the past year they've had solid results.
13 Sep 2010 18:50 UTC

mrb wrote: Suresh, I know, I mentioned they switched to deeper queues.

However, Benchmark Reviews still make many mistakes (see "Here is a terrible use of IOmeter from Benchmark Reviews...", and "I shift my objection to all reviewers, including Benchmark Reviews, who fail to mention how CrystalDiskMark is configured..."). They would have to demonstrate sustained article quality to regain my confidence.
15 Sep 2010 06:28 UTC

Nik.K wrote: So let's say someone puts 12 SSD drives to the test with the same exact settings and with, let's say, 5-6 different benchmarking suites. No matter the settings, shouldn't the fastest drive be determined by the one with the most wins in all tests? In the end does it matter if it's 1k or 32k when testing all drives with the same settings? Not to mention that most suites test SEQ Read/Writes. 15 Sep 2010 10:47 UTC

mrb wrote: Nik: using identical settings, without trying to understand how they impact performance, and declaring the drive with the best numbers as the immediate winner, is futile. Benchmarking is more than filling a table with numbers. It is about validating these numbers, by verifying and explaining the bottlenecks that limit performance.

For example, some benchmarking tools have bugs (CrystalDiskMark reporting a throughput higher than the theoretical maximum SATA throughput), and some tools favor certain drives too much (e.g. those writing the same constant block, which gets deduplicated 100% of the time). Would you trust these buggy tools? Would you trust unrealistic performance numbers? Of course not!
16 Sep 2010 07:26 UTC

Nik.K wrote: No one drive can always come out on top in all benchmark suites, that's a rule. True, Crystal does not favor Sandforce SSDs, nor does AS SSD, and Sandra Pro does a poor job as well. Sandforce drives work well with HDTune/Tach/Everest.
On the other hand, that means that one can't ever have a conclusive result, since most programs are not optimised for Sandforce drives. So how does one test 12 different drives (Indilinx/Samsung/Sandforce/Intel) to find which is the best? How would you, for example?
18 Sep 2010 13:00 UTC

mrb wrote: I would benchmark the two extreme cases: reading/writing random data and constant data, with the understanding that real-world performance will fall in between, because real-world data is neither completely random nor invariant (notable exception: encrypted data, which looks random to the drive.)

AFAIK both CrystalDiskMark and AS SSD are able to benchmark these two cases (e.g. see the "random" vs. "0/1fill" option in CrystalDiskMark), but I see no reviewers testing these different options...
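
For what it's worth, here is a minimal sketch of these two extreme cases on Linux, assuming Python 3 and a hypothetical scratch device /dev/sdb whose contents may be destroyed (not a production benchmark). On a compressing/deduplicating controller such as the SF-1200, the two numbers should differ widely, and real workloads fall somewhere in between:

    import mmap, os, time

    TARGET = "/dev/sdb"        # assumption: a scratch device; these writes destroy its data
    BLOCK = 1024 * 1024        # 1 MiB sequential writes
    TOTAL = 4 * 1024**3        # 4 GiB written per pass

    def seq_write_mb_s(payload):
        fd = os.open(TARGET, os.O_WRONLY | os.O_DIRECT)   # bypass the page cache
        buf = mmap.mmap(-1, BLOCK)                         # page-aligned buffer, as O_DIRECT requires
        buf.write(payload)
        start = time.monotonic()
        for offset in range(0, TOTAL, BLOCK):
            os.pwrite(fd, buf, offset)
        os.close(fd)
        return TOTAL / (time.monotonic() - start) / 1e6    # MB/s

    print("constant data: %.0f MB/s" % seq_write_mb_s(b"\x00" * BLOCK))   # best case: fully compressible
    print("random data:   %.0f MB/s" % seq_write_mb_s(os.urandom(BLOCK))) # worst case: incompressible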
19 Sep 2010 06:55 UTC

Nik.K wrote: Well, it's better and more understandable to use many different suites than one suite with different options. Not to mention that most people would get confused by something like that.

However, even if you do test drives the way you say, you still would not have a valid conclusion.
19 Sep 2010 10:06 UTC

mrb wrote: You have to test these 2 extreme cases to provide valuable information. Testing a bunch of random benchmarking suites without documenting which of these cases they exercise is pointless.

I do agree with you that it is hard to reach a conclusion about which drives are the best, because real-world performance will fall in between (the best you can do is test real-world applications). But really the essence of my point, for the Nth time, is that you have to test these 2 extreme cases. This is what Anandtech does (sometimes), and this is why I respect them more than others.
19 Sep 2010 21:03 UTC

Nik.K wrote: So what you would like to see is AS SSD and CDM with the two options you mention... I have to say that I only use Iometer(4k)/Everest Ultimate/HDTunePro/HDTachRW/ATTO/CDMx64/SandraPro/ASSSD and I really thought that all these covered all the bases. I never thought of the settings you mentioned, although I think they are covered by these tests. However I will try to check them out. 20 Sep 2010 10:09 UTC

Richard wrote: This was really helpful, I agree with you totally. As for Intel, I can speak from experience: I was doing some research concerning their processors, and my God is it freaking difficult or near impossible to get information. They have everything all over the place. You actually find what you are looking for by accident, well that's how I felt!!!

Keep writing articles like this. It actually shows that manufacturers can be very misleading by publishing limited information, and that the idiots who do the reviews can't do their jobs. It also shows us, the people who want to understand what it is we are looking for, where to look and how to interpret the information, so we can make an informed decision when purchasing (hardware or software).
23 Sep 2011 05:09 UTC

Kevin wrote: A common problem that I don't see mentioned here is the length of the test being run. We find that most SSDs we test show a significant performance drop after ~12-48 hours. Most consumer and some enterprise SSDs need to be over-provisioned by 30% to benchmark well. 24 Jul 2016 16:51 UTC