Comments About the FPGA Platform Used in the Data Acquisition System of the OPERA Experiment
I spent my weekend reading comments from theoretical physicists, geodesists, and other scientists speculating on the news of the OPERA experiment, which appears to measure muon neutrinos traveling faster than light (see presentation, preprint).
The neutrino beam travels 730 km through the Earth's crust, from CERN's particle accelerator in Geneva to the Gran Sasso Laboratory in Italy, in about 2.43 ms. Measurements show that the particles arrive 60.7 ns too early (with a statistical uncertainty of ± 6.9 ns and a systematic uncertainty of ± 7.4 ns). In terms of distance, this amounts to 18 meters ahead of their expected position. It corresponds to a relative difference from the speed of light of 2.48e-5 ± 0.28e-5 (stat) ± 0.30e-5 (sys).
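These headline numbers are mutually consistent; a quick back-of-the-envelope check (the round 730 km baseline is used here, so results differ very slightly from the paper's exact values):

```python
# Back-of-the-envelope check of the OPERA numbers (values from the preprint).
C = 299_792_458.0        # speed of light, m/s
BASELINE_M = 730_000.0   # CERN -> Gran Sasso distance, ~730 km (rounded)

flight_time_s = BASELINE_M / C
print(f"light travel time: {flight_time_s * 1e3:.3f} ms")  # ~2.435 ms

early_ns = 60.7
lead_m = early_ns * 1e-9 * C
print(f"distance lead: {lead_m:.1f} m")                    # ~18.2 m

rel_diff = early_ns * 1e-9 / flight_time_s
print(f"relative speed excess: {rel_diff:.2e}")            # ~2.5e-05, close to the quoted 2.48e-5
```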
After devouring everything I could read about the experiment, I speculate ("gut feeling") that the explanation of these unexplainable numbers is variable timing delays introduced by the FPGA-based data acquisition system (DAQ), for the reasons stated below.
In the presentation, Dario Autiero describes the impressive scrutiny they have subjected their numerous timing instruments to. However, if I must be dubious about one element, it would be this FPGA-based platform, which sits at the Gran Sasso site processing the trigger and clock signals. Given the information released publicly: (1) it is the most complex device in the timing chain; (2) contrary to other timing equipment, which is off-the-shelf, this system appears to be a custom design of which no precise details were given; (3) all time sources are generally described, calibrated, and double-checked, except a crucial 100 MHz time counter in this FPGA whose source is unknown (a "black box"); and (4) as Dario Autiero said himself, it is rare that particle physicists need such accurate time, which makes me think they may have overlooked certain details when designing it.
Firstly (this is the least likely of the scenarios I will describe, but bear with me): if this FPGA-based system uses DRAM (e.g. to store and manipulate large quantities of timestamps or event data that do not fit in SRAM) and implements caching, results may vary depending on whether a given variable or data structure happens to be in a cache line, which may or may not delay a code path by up to 10-100 ns (typical DRAM latency). This discrepancy may never be discovered in tests, because the access patterns by which an FPGA (or CPU) decides to cache data depend heavily on the state of the system.
For example, while calibrating the system's internal delay (with a digital oscilloscope, as explained in the preprint), perhaps the engineer runs a series of tests close together in time, causing consistent cache hits, whereas under normal operation cache misses are the norm (because the system is idling or its cache is polluted by background tasks). The reverse is also possible: during calibration, perhaps the engineer reboots the FPGA between tests, flushing the cache each time, whereas normal operation leads to consistent cache hits.
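The failure mode can be captured by a toy model (purely hypothetical; the latency figures and the cache behavior are assumptions for illustration, not details of the actual OPERA firmware):

```python
# Toy model of a cache-dependent timestamping path (hypothetical figures).
BASE_LATENCY_NS = 25       # baseline internal delay, per the calibration
DRAM_MISS_PENALTY_NS = 60  # assumed extra DRAM latency on a cache miss

def timestamp_latency_ns(cache_hit: bool) -> int:
    """Latency of a (hypothetical) code path that time-tags an event."""
    return BASE_LATENCY_NS + (0 if cache_hit else DRAM_MISS_PENALTY_NS)

# Calibration with warm caches measures the expected 25 ns...
calibrated = timestamp_latency_ns(cache_hit=True)
# ...but a cold-cache event during a real run takes 85 ns,
# silently skewing every corrected timestamp.
in_run = timestamp_latency_ns(cache_hit=False)
print(in_run - calibrated)  # unaccounted-for delay in ns -> 60
```

The point is that the discrepancy never appears during calibration, because the calibration procedure itself puts the system into the "fast" state.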
Either way, latencies on the order of 10-100 ns unexpectedly added to or subtracted from a baseline thought to be constant could completely or partially explain the OPERA results. If a latency undetected during calibration unexpectedly shows up during experiments in code paths manipulating timestamps (on the blue side of the timing chain on page 8), it could accidentally time-tag an event with a "60.7 ns old" timestamp. If a latency detected during calibration unexpectedly disappears during experiments in code paths detecting neutrino arrival events (on the green side of the timing chain on page 8), the FPGA would calculate the event as occurring too early. (However, this second case could not account for the full 60.7 ns early arrival time, because the total FPGA latency was measured as 25 ns, so the worst overestimation cannot be greater than 25 ns. But it would still reduce the significance of the OPERA result from 6.0-sigma to 3.0-sigma or less.)
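To make the sigma arithmetic concrete, here is a rough estimate. Combining the statistical and systematic uncertainties in quadrature is my own simplification, not the paper's procedure; by this naive estimate, removing a 25 ns overestimation leaves roughly 3.5 sigma, but the conclusion (significance roughly halved) is the same:

```python
import math

# Rough significance estimate; quadrature combination of stat and sys
# uncertainties is my own simplification, not the paper's method.
early_ns, stat_ns, sys_ns = 60.7, 6.9, 7.4
sigma_ns = math.hypot(stat_ns, sys_ns)                       # ~10.1 ns combined
print(f"reported:  {early_ns / sigma_ns:.1f} sigma")         # ~6.0 sigma

# If a 25 ns FPGA latency were mistakenly dropped, only
# 60.7 - 25 = 35.7 ns of genuine early arrival would remain:
print(f"corrected: {(early_ns - 25) / sigma_ns:.1f} sigma")  # ~3.5 sigma
```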
Secondly, this FPGA increments a time counter at a frequency of 100 MHz, which sounds like the counter is simply based on the crystal oscillator of the FPGA platform. It seems strange: the entire timing chain is described in great detail as using high-tech gear (cesium clocks, GPS devices able to detect continental drift!), but one link in this chain, the final one that ties timestamps to neutrino arrival events, is some unspecified FPGA incrementing a counter at a coarse precision of 10 ns, based on a crystal oscillator of unknown type (temperature and aging can cause an effect as big as about 1 part in 1e6, depending on the type). I can almost picture the engineer coming into the underground Gran Sasso server room that hosts the FPGA platform to calibrate it, inattentively leaving the door open, and changing the usual room temperature by as little as ±5 °C. That would affect the crystal's accuracy and stability by 1e-6 while he measures the internal system latency, invalidating any future timestamping result taken during experiments with the door shut. According to the paper, this counter is reset every 0.6 s from the OPERA master clock, but even a smaller 1e-7 effect would be sufficient to shift it by up to 60 ns at the end of this 0.6 s cycle. Different types of crystal oscillators offer different accuracies. I would like to think that they did not overlook such a detail. Nonetheless, I find it strange that zero details were given about this FPGA platform or about the accuracy and stability of this counter.
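The drift arithmetic is easy to verify: a fractional frequency error sustained over the 0.6 s reset cycle shifts the counter by 0.6 s times that error:

```python
# Counter drift accumulated over one 0.6 s reset cycle of the master clock.
CYCLE_S = 0.6  # counter reset interval, per the paper

for frac_error in (1e-6, 1e-7):  # typical crystal error vs. a 10x better one
    drift_ns = CYCLE_S * frac_error * 1e9
    print(f"fractional error {frac_error:.0e}: up to {drift_ns:.0f} ns drift")
# -> 1e-06 gives up to 600 ns, and even 1e-07 gives up to 60 ns
```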
[Update 2012-04-03: Six months later, my "gut feeling" has been confirmed! The drift from the master clock shifted the counter by 74 ns at the end of the 0.6 s cycle. However the drift was in the wrong direction (it made the neutrinos appear too slow) and was made irrelevant by the much bigger source of errors that was eventually identified: a fiber optic cable was not screwed in correctly. ]
Thirdly (this is my biggest problem with the device): the 100 MHz frequency of the counter normally implies a systematic uncertainty of ± 10 ns, but the paper claims ± 1 ns (see page 8: "FPGA latency ... ± 1 ns"). Why the discrepancy? The paper does mention this 10 ns quantization effect in the text, but does not include it in the table summarizing all systematic uncertainties. This alone would reduce the significance of the OPERA result to less than 6.0-sigma.
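As an illustration (the quadrature combination and the resulting figure are my own arithmetic, not the paper's), folding a ± 10 ns quantization term into the error budget would look like this:

```python
import math

# Folding the 10 ns counter quantization into the error budget.
# Quadrature combination is my own assumption, not the paper's method.
early_ns, stat_ns, sys_ns, quant_ns = 60.7, 6.9, 7.4, 10.0
sigma_ns = math.sqrt(stat_ns**2 + sys_ns**2 + quant_ns**2)  # ~14.2 ns
print(f"{early_ns / sigma_ns:.1f} sigma")                   # ~4.3 sigma, down from ~6.0
```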
It is not my intention to sound overly negative; the OPERA experiment is an example of extremely well-conducted scientific research that required an incredible combination of skills from at least the 174 authors listed in the paper. There are just a few details about this FPGA platform that need to be cleared up. After all, we must assume and check for engineering errors before asserting that neutrinos are traveling at 1.0000248 c.
So, I would like to ask the OPERA team to release more information about this FPGA-based Data Acquisition System. If the 100 MHz counter is truly incremented from a crystal oscillator source, what type of crystal is used? Better yet, would you release the complete schematic and source code of the system, since it is custom-designed? Or is it not? Who designed it? Who calibrated it? Is it sensitive to temperature? If so, how is the temperature controlled in the server room hosting the system? Perhaps have a second system built by another engineering team, with an identical feature set but a different design. How about using a faster 500+ MHz counter?