I spent my weekend reading comments from theoretical physicists,
geodesists, and other scientists speculating on the news of the OPERA experiment that appears to
measure muon neutrinos traveling faster than light (see presentation, preprint).
The neutrino beam travels 730 km through the Earth's crust, from the CERN
particle accelerator in Geneva to the Gran Sasso Laboratory in Italy, in
about 2.43 ms. Measurements show that the particles arrive 60.7 ns too early
(with a statistical uncertainty of ± 6.9 ns, and a systematic uncertainty of ±
7.4 ns). In terms of distance, this puts them about 18 meters ahead of their
expected position, and corresponds to a relative difference from the speed of
light of 2.48e-5 ± 0.28e-5 (stat) ± 0.30e-5 (sys).
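As a quick sanity check, the published figures are mutually consistent; here is my own back-of-the-envelope arithmetic, using a rounded 730 km baseline rather than OPERA's precisely surveyed distance, so the last digits differ slightly from the paper:

    c = 299_792_458.0        # speed of light, m/s
    baseline_m = 730e3       # approximate CERN -> Gran Sasso distance, m
    early_ns = 60.7          # early arrival reported by OPERA, ns

    tof_ms = baseline_m / c * 1e3
    print(f"light time of flight  ~ {tof_ms:.3f} ms")                 # ~2.435 ms
    print(f"distance in 60.7 ns   ~ {c * early_ns * 1e-9:.1f} m")     # ~18.2 m
    print(f"relative speed excess ~ {early_ns / (tof_ms * 1e6):.2e}") # ~2.5e-5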
After devouring everything I could read about the experiment,
I speculate ("gut feeling") that the explanation of these unexplainable numbers
are variable timing delays introduced by the FPGA-based data acquisition system
(DAQ), for the reasons stated below.
In the presentation, Dario Autiero describes the impressive scrutiny they have
subjected their numerous timing instruments to. However, if I had to be dubious
about one element, it would be this FPGA-based platform, which sits at the
Gran Sasso site, processing the trigger and clock signals. Given the information
released publicly, (1) it is the most complex device in the timing chain,
(2) unlike the other timing equipment, which is off-the-shelf, this system appears to
be a custom design of which no precise details were given, (3)
all time sources are generally described, calibrated, and double-checked, except
a crucial 100 MHz time counter in this FPGA whose source is unknown (a "black box"),
and (4) as Dario Autiero
said himself, it is rare for particle physicists to need such accurate timing, which
makes me think they may have overlooked certain details when designing it.
Firstly (this is the least likely of the scenarios I will describe,
but bear with me):
if this FPGA-based system uses DRAM (e.g. to store and manipulate
large quantities of timestamps or event data that do not fit in SRAM)
and implements caching, results may vary depending on whether a single
variable or data structure happens to be in a cache line or not, which may or may not
delay a code path by 10-100 ns (typical DRAM latency).
This discrepancy may never be discovered in tests because the access patterns by
which an FPGA (or CPU) decides to cache data are very dependent on the state of the
system.
For example, while under calibration to measure the system's internal delay (with
a digital oscilloscope as explained in the preprint), perhaps the engineer runs
a series of tests close in time, causing consistent cache hits, whereas under normal
operation cache misses are the norm (because the system is idling or its cache
is polluted by background tasks).
The reverse is also possible: while under calibration, perhaps the
engineer reboots the FPGA between tests, causing the cache to be flushed each time,
whereas normal operation leads to consistent cache hits.
Either way, latencies on the order of 10-100 ns
unexpectedly added to or subtracted from a baseline thought
to be constant could completely or partially explain the OPERA results.
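To put toy numbers on this scenario (the latencies below are entirely hypothetical; real figures depend on the platform and its memory system):

    # Calibration vs. normal operation with a cache-dependent code path
    path_latency_cache_hit_ns = 5         # hypothetical fast path
    path_latency_cache_miss_ns = 5 + 60   # same path plus one DRAM access

    calibrated_baseline_ns = path_latency_cache_hit_ns   # tests run back to back
    operational_latency_ns = path_latency_cache_miss_ns  # cold or polluted cache

    # latency that calibration never saw and that no correction accounts for
    print(operational_latency_ns - calibrated_baseline_ns)  # 60 ns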
If a latency undetected during calibration unexpectedly shows up during
experiments in code paths manipulating timestamps
(on the blue side of the
timing chain on page 8), it could accidentally time-tag an event with a
"60.7 ns old" timestamp.
If a latency detected during calibration unexpectedly disappears during
experiments in code paths detecting neutrino arrival events
(on the green side of the
timing chain on page 8), the FPGA would calculate the event
as occurring too early. (However, this second case could not account
for the full 60.7 ns early arrival time because the total FPGA latency was
measured as 25 ns, so the worst overestimation cannot be greater than 25 ns.
But it would still reduce the significance of the OPERA
result from 6.0-sigma to 3.0-sigma or less).
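To make the two cases concrete, here is a toy model of how I picture such a discrepancy propagating to the measured arrival time (my own simplification of the timing chain sketched in the preprint, not OPERA's actual correction procedure; all names are mine):

    def apparent_early_ns(timestamp_staleness_ns,
                          calibrated_detection_latency_ns,
                          actual_detection_latency_ns):
        # Case 1 (blue side): a stale timestamp makes the event look older,
        # i.e. earlier, by however long the timestamp path was delayed.
        # Case 2 (green side): subtracting a calibrated detection latency that
        # is not actually incurred during experiments also shifts the event
        # earlier, by the amount of the over-correction.
        return timestamp_staleness_ns + (
            calibrated_detection_latency_ns - actual_detection_latency_ns)

    # Case 1: 60.7 ns of delay in the timestamp path, absent during calibration
    print(apparent_early_ns(60.7, 25.0, 25.0))  # 60.7 ns "too early"

    # Case 2: the 25 ns latency measured during calibration vanishes at run time
    print(apparent_early_ns(0.0, 25.0, 0.0))    # 25.0 ns "too early" (the cap)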
Secondly, this FPGA increments a time counter at a frequency of 100 MHz,
which suggests the counter is simply driven by the crystal oscillator of
the FPGA platform. This seems strange: the entire timing chain is described
in great detail as using high-tech gear (cesium clocks, GPS devices able
to detect continental drift!), but one link in this chain, the final one
that ties timestamps to neutrino arrival events, is some unspecified
FPGA incrementing a counter at a coarse granularity of 10 ns, driven by an unknown
type of crystal oscillator (temperature and aging can incur an effect as large as
about 1 part in 1e6, depending on the type).
I can almost picture the engineer coming into the underground Gran Sasso
server room that hosts the FPGA platform to calibrate it, inattentively leaving
the door open, changing the usual room temperature by as little as ±5 °C,
affecting the crystal's accuracy and stability at the 1e-6 level while he measures the
internal system latency, invalidating any future timestamping result taken
during experiments with the door shut. According to the paper, this
counter is reset every 0.6 seconds from the OPERA master clock, but
even a smaller 1e-7 effect would be enough to shift this counter by up to
60 ns at the end of this 0.6 s cycle. Different types of crystal oscillators
offer different accuracies. I would like to think that they did not overlook such
a detail. Nonetheless, I find it strange that zero details were given
about this FPGA platform or the accuracy and stability of this counter.
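For a sense of scale, this is the drift a given fractional frequency error would accumulate over the 0.6 s interval between resets (my own arithmetic; the oscillator type and its actual stability are precisely the details the paper does not give):

    reset_interval_s = 0.6   # counter is reset from the OPERA master clock

    for fractional_error in (1e-6, 1e-7, 1e-8):
        drift_ns = fractional_error * reset_interval_s * 1e9
        print(f"{fractional_error:.0e} -> up to {drift_ns:.0f} ns per 0.6 s cycle")
    # 1e-06 -> up to 600 ns per 0.6 s cycle
    # 1e-07 -> up to 60 ns per 0.6 s cycle
    # 1e-08 -> up to 6 ns per 0.6 s cycle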
[Update 2012-04-03:
Six months later, my "gut feeling" has been confirmed! Drift relative to the master clock
shifted the counter
by 74 ns by the end of the 0.6 s cycle. However, the drift was in the wrong
direction (it made the neutrinos appear too slow) and was made irrelevant by the
much bigger source of errors that was eventually identified: a fiber optic cable
was not screwed in correctly.
]
Thirdly (this is my biggest problem with the device),
the 100 MHz frequency of the counter normally implies a systematic uncertainty
of ± 10 ns, but the paper
claims ± 1 ns (see page 8:
"FPGA latency ... ± 1 ns"). Why the discrepancy? The paper does mention this
10 ns quantization effect in the text, but does not include it in the table
summarizing all systematic uncertainties. This alone would reduce the significance
of the OPERA result to less than 6.0-sigma.
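To illustrate what this means for the significance, here is a naive combination in quadrature of a 10 ns counter-granularity term with the published uncertainties (my own back-of-the-envelope treatment; the collaboration may well account for this effect differently):

    from math import sqrt

    early_ns, stat_ns, sys_ns = 60.7, 6.9, 7.4

    # published uncertainties only
    print(early_ns / sqrt(stat_ns**2 + sys_ns**2))                     # ~6.0 sigma

    # counter granularity as a uniform error, sigma = 10/sqrt(12) ~ 2.9 ns
    print(early_ns / sqrt(stat_ns**2 + sys_ns**2 + (10/sqrt(12))**2))  # ~5.8 sigma

    # counter granularity as a flat +/- 10 ns systematic term
    print(early_ns / sqrt(stat_ns**2 + sys_ns**2 + 10.0**2))           # ~4.3 sigma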
It is not my intention to sound overly negative; the OPERA experiment is an
example of extremely well conducted scientific research that required
an incredible combination of skills from at least the 174 authors listed in the
paper. There are just a few details about this FPGA platform that need to be
cleared up. After all, we must assume and check for engineering errors
before asserting that neutrinos are speeding at 1.0000248 c.
So, I would like to ask the OPERA team to release more information about this
FPGA-based Data Acquisition System.
If the 100 MHz counter is truly incremented based on a crystal
oscillator source, what type of crystal is used?
Or better, would you release the complete schematic and source code of the system,
since it is custom-designed? Or is it not? Who designed it? Who calibrated it?
Is it sensitive to temperature? If yes, how is the temperature controlled in the
server room hosting the system?
Perhaps a second system could be built by another engineering team,
with an identical feature set but a different design.
How about using a faster 500+ MHz counter?