mrb's blog

Neutrinos Faster Than Light, or Artifacts of the FPGA?

Comments About the FPGA Platform Used in the Data Acquisition System of the OPERA Experiment

Keywords: hardware physics

I spent my weekend reading comments from theoretical physicists, geodesists, and other scientists speculating on the news of the OPERA experiment, which appears to measure muon neutrinos traveling faster than light (see presentation, preprint).

The neutrino beam travels 730 km through the Earth's crust, from the CERN particle accelerator in Geneva to the Gran Sasso Laboratory in Italy, in about 2.43 ms. Measurements show that the particles arrive 60.7 ns too early (with a statistical uncertainty of ± 6.9 ns and a systematic uncertainty of ± 7.4 ns). In terms of distance, this puts them 18 meters ahead of their expected position. This corresponds to a relative difference from the speed of light of 2.48e-5 ± 0.28e-5 (stat) ± 0.30e-5 (sys).
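As a quick sanity check of these figures, here is a minimal Python sketch. It assumes the rounded 730 km baseline quoted above (not the precisely surveyed distance used in the paper) and approximates the fractional speed excess as the early arrival divided by the light travel time, so it reproduces the quoted numbers only to within rounding.

```python
# Speed of light in vacuum (exact SI value), in metres per second.
C = 299_792_458.0

early_ns = 60.7        # early arrival quoted above, in nanoseconds
baseline_km = 730.0    # rounded baseline quoted above (assumption: not the surveyed value)

# Distance light covers in 60.7 ns: how far "ahead" the neutrinos appear to be.
lead_m = C * early_ns * 1e-9

# Light travel time over the rounded baseline.
tof_s = baseline_km * 1e3 / C

# Fractional speed excess, approximated as (early arrival) / (time of flight).
rel_diff = early_ns * 1e-9 / tof_s

print(f"lead distance:       {lead_m:.1f} m")
print(f"time of flight:      {tof_s * 1e3:.3f} ms")
print(f"relative difference: {rel_diff:.2e}")
```

With the rounded baseline the relative difference comes out close to, but not exactly at, the 2.48e-5 quoted from the paper.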

After devouring everything I could read about the experiment, I speculate ("gut feeling") that the explanation for these unexplainable numbers is variable timing delays introduced by the FPGA-based data acquisition system (DAQ), for the reasons stated below.

In the presentation, Dario Autiero describes the impressive scrutiny they have submitted their numerous timing instruments to. However if I must be dubious about one element, it would be this FPGA-based platform, which sits at the Gran Sasso site, processing the trigger and clock signals. Given the information released publicly, (1) it is the most complex device in the timing chain, (2) contrary to other timing equipment which is off-the-shelf, this system appears to be a custom design of which no precise details were given, (3) all time sources are generally described, calibrated, and double-checked, except a crucial 100 MHz time counter in this FPGA whose source is unknown (a "black box"), and (4) as Dario Autiero said himself, it is rare that particle physicists need such accurate time, which makes me think they may have overlooked certain details when designing it.

Firstly (this is the least likely of the scenarios I will describe, but bear with me), if this FPGA-based system is using DRAM (e.g. to store and manipulate large quantities of timestamps or event data that do not fit in SRAM) and implements caching, results may vary depending on whether a single variable or data structure is in a cache line or not, which may or may not delay a code path by 10-100 ns (typical DRAM latency). This discrepancy may never be discovered in tests because the access patterns by which an FPGA (or CPU) decides to cache data depend heavily on the state of the system.

For example, while under calibration to measure the system's internal delay (with a digital oscilloscope, as explained in the preprint), perhaps the engineer runs a series of tests close together in time, causing consistent cache hits, whereas under normal operation cache misses are the norm (because the system is idling or its cache is polluted by background tasks). The reverse is also possible: while under calibration, perhaps the engineer reboots the FPGA between tests, causing the cache to be flushed each time, whereas normal operation leads to consistent cache hits.

Either way, latencies of the order of 10-100 ns unexpectedly added to or subtracted from a baseline thought to be constant could completely or partially explain the OPERA results. If a latency undetected during calibration unexpectedly shows up during experiments in code paths manipulating timestamps (on the blue side of the timing chain on page 8), it could accidentally time-tag an event with a "60.7 ns old" timestamp. If a latency detected during calibration unexpectedly disappears during experiments in code paths detecting neutrino arrival events (on the green side of the timing chain on page 8), the FPGA would calculate the event as occurring too early. (However, this second case could not account for the full 60.7 ns early arrival time because the total FPGA latency was measured as 25 ns, so the worst overestimation cannot be greater than 25 ns. But it would still reduce the significance of the OPERA result from 6.0-sigma to 3.0-sigma or less.)

Secondly, this FPGA increments a time counter at a frequency of 100 MHz, which sounds like the counter is simply based on the crystal oscillator of the FPGA platform. It seems strange: the entire timing chain is described in great detail as using high-tech gear (cesium clocks, GPS devices able to detect continental drift!), but one link in this chain, the final one that ties timestamps to neutrino arrival events, is some unspecified FPGA incrementing a counter at a coarse resolution of 10 ns, based on an unknown crystal oscillator type (temperature and aging can have an effect as big as about 1 part in 1e6 depending on its type). I can almost picture the engineer coming into the underground Gran Sasso server room that hosts the FPGA platform to calibrate it, inattentively leaving the door open, changing the usual room temperature by as little as ±5 °C, affecting the crystal accuracy and stability by 1e-6 while he measures the internal system latency, invalidating any future timestamping result taken during experiments with the door shut. According to the paper, this counter is reset every 0.6 s from the OPERA master clock, but even a smaller 1e-7 effect would be sufficient to shift this counter by up to 60 ns at the end of this 0.6 s cycle. Different types of crystal oscillators offer different accuracies. I would like to think that they did not overlook such a detail. Nonetheless, I find it strange that zero details were given about this FPGA platform or the accuracy and stability of this counter.
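The drift arithmetic is simple enough to sketch in a few lines of Python. This assumes a free-running counter whose frequency is off by a constant fractional error for the whole 0.6 s between resets; the function name `drift_ns` is just for illustration, and a real oscillator's error would vary with temperature and age rather than stay constant.

```python
RESET_PERIOD_S = 0.6  # counter reset interval quoted from the paper


def drift_ns(fractional_error: float, elapsed_s: float) -> float:
    """Time error (in ns) accumulated by a counter whose clock frequency
    is off by a constant fractional_error, after elapsed_s seconds."""
    return fractional_error * elapsed_s * 1e9


# Worst case is reached just before the next reset, 0.6 s after the last one.
for frac in (1e-6, 1e-7):
    print(f"fractional error {frac:.0e}: "
          f"{drift_ns(frac, RESET_PERIOD_S):.0f} ns after {RESET_PERIOD_S} s")
```

A 1e-6 error accumulates 600 ns by the end of the cycle, and a 1e-7 error accumulates 60 ns, which is the same order as the 60.7 ns anomaly.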

[Update 2012-04-03: Six months later, my "gut feeling" has been confirmed! The drift from the master clock shifted the counter by 74 ns at the end of the 0.6 s cycle. However the drift was in the wrong direction (it made the neutrinos appear too slow) and was made irrelevant by the much bigger source of errors that was eventually identified: a fiber optic cable was not screwed in correctly. ]

Thirdly —this is my biggest problem with the device— the 100 MHz frequency of the counter normally implies a systematic uncertainty of ± 10 ns, but the paper claims ± 1 ns (see page 8: "FPGA latency ... ± 1 ns"). Why the discrepancy? The paper does mention this 10 ns quantization effect in the text, but does not include it in the table summarizing all systematic uncertainties. This alone would reduce the significance of the OPERA result to less than 6.0-sigma.
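To put rough numbers on this, here is a small Python sketch. It assumes the statistical and systematic uncertainties are independent and combine in quadrature, and it tries two hypothetical ways of accounting for the 10 ns counter period: as a full ± 10 ns systematic term (the pessimistic reading above), and as a uniformly distributed quantization error, whose standard deviation is the period divided by √12. Neither is claimed to be how the OPERA team treats it.

```python
import math

EARLY_NS = 60.7   # measured early arrival
STAT_NS = 6.9     # statistical uncertainty
SYS_NS = 7.4      # systematic uncertainty
PERIOD_NS = 10.0  # period of the 100 MHz counter


def combine_in_quadrature(*uncertainties_ns: float) -> float:
    """Root-sum-square of independent uncertainties."""
    return math.sqrt(sum(u * u for u in uncertainties_ns))


def significance(total_uncertainty_ns: float) -> float:
    """Early arrival expressed in units of the total uncertainty (sigma)."""
    return EARLY_NS / total_uncertainty_ns


quoted = combine_in_quadrature(STAT_NS, SYS_NS)
# Assumption 1: the full counter period is an extra independent systematic.
pessimistic = combine_in_quadrature(STAT_NS, SYS_NS, PERIOD_NS)
# Assumption 2: quantization error is uniform over one period (sigma = T/sqrt(12)).
uniform = combine_in_quadrature(STAT_NS, SYS_NS, PERIOD_NS / math.sqrt(12))

for label, total in (("as quoted", quoted),
                     ("+ full 10 ns", pessimistic),
                     ("+ uniform quantization", uniform)):
    print(f"{label:24s} total = {total:5.2f} ns, "
          f"significance = {significance(total):.2f} sigma")
```

Under either assumption the significance drops below the 6.0-sigma obtained from the quoted uncertainties alone, though by quite different amounts.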

It is not my intention to sound overly negative; the OPERA experiment is an example of an extremely well conducted scientific research that necessitated an incredible combination of skills from at least the 174 authors listed in the paper. There are just a few details about this FPGA platform that need to be cleared up. After all, we must assume and check for engineering errors before asserting that neutrinos are speeding at 1.0000248 c.

So, I would like to ask the OPERA team to release more information about this FPGA-based Data Acquisition System. If the 100 MHz counter is truly incremented based on a crystal oscillator source, what type of crystal is used? Or better, would you release the complete schematic and source code of the system, since it is custom-designed? Or is it not? Who designed it? Who calibrated it? Is it sensitive to temperature? If yes, how is the temperature controlled in the server room hosting the system? Perhaps have a second system built by another engineering team, with an identical feature set, but a different design. How about using a faster 500+ MHz counter?


Andy wrote: Interesting observation. I am thinking that they probably use the 100 MHz counter to synchronize with the GPS time every 0.6 s and then keep a 10 ns resolution time based on that. The photocathode interface to the FPGA is most likely connected to a high-speed SerDes port running at many gigabits/second. Say, for instance, it is running at 32x the system clock, at 3.2 Gbit/second. Then every 100 MHz clock cycle you would get a 32-bit word indicating the photo detection event within that 10 ns block. This would give you a resolution of about 312 ps. So you would then add to the 100 MHz counter value based on the bit position of the detection with 312 ps resolution.

I would not expect any DRAM access during the data acquisition portion of the system as this would all be pipelined inside the FPGA and have very deterministic behavior.
I hope that makes some sense!
26 Sep 2011 16:14 UTC
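The scheme Andy describes can be sketched in Python as follows. This is purely illustrative: whether OPERA's hardware does anything like it is unknown, the function names are made up, and the convention that the least significant bit of each 32-bit word is the earliest sample is an assumption.

```python
CLOCK_PERIOD_PS = 10_000   # one cycle of the 100 MHz system clock, in picoseconds
BITS_PER_CYCLE = 32        # hypothetical SerDes ratio (3.2 Gbit/s line rate)
BIT_PERIOD_PS = CLOCK_PERIOD_PS / BITS_PER_CYCLE   # 312.5 ps per sampled bit


def first_hit_bit(word: int) -> int:
    """Index of the lowest set bit in a 32-bit sample word.

    Assumption: bit 0 is the earliest sample within the clock cycle.
    """
    if word == 0:
        raise ValueError("no detection in this clock cycle")
    return (word & -word).bit_length() - 1


def fine_timestamp_ps(coarse_count: int, word: int) -> float:
    """Coarse 100 MHz counter value refined by the bit position of the hit."""
    return coarse_count * CLOCK_PERIOD_PS + first_hit_bit(word) * BIT_PERIOD_PS


# A hit first seen in bit 3 of the word captured during coarse cycle 5.
print(fine_timestamp_ps(5, 0b1000), "ps")
```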

mnk wrote: It is quite easy to spot an error like this one. Note that uncertainty manifests itself in the result as fluctuation, so if there really was a ±10 ns fluctuation they would know.

The FPGA system can be designed in a very deterministic way. In fact I think that would be the first instinct of any engineer who works with FPGAs for accuracy. It is extremely easy to test deterministic systems, and even if it is not so deterministic it would have shown up as a fluke over 3 years.

It is a very very longshot to assume something went wrong in the FPGA side - it is both easy to catch and easy to test.
26 Sep 2011 16:28 UTC

mnk wrote: Also, the problem is not a "longer than expected timespan", the problem is "shorter than expected timespan". 26 Sep 2011 16:32 UTC

Nick wrote: mnk: How can we be sure all 3 of these potential sources would fluctuate? If the server room was at issue, it could well be systematic... 26 Sep 2011 17:21 UTC

mrb wrote: I agree that the FPGA can and should be designed in a deterministic way (it should not be hard). The OPERA team needs to confirm it, hence my question about it. I do agree that this 1st point I raise is the least likely of the 3. 26 Sep 2011 18:05 UTC

Maxwell's Daemon wrote: If they take hardware engineering anywhere near as seriously as CERN do (which they inevitably will have, this being a CERN operation), they'll have addressed this, somehow.

Additionally, 100 MHz doesn't necessarily correlate to 10 ns temporal resolution - there are more than a few statistical methods you can apply to improve resolution beyond the immediately available single-measurement resolution, and physicists (I am one) are adept at employing them.

That said, I'd be interested to learn exactly what the spec of the FPGA is, and how they've allowed for their timing resolution.
26 Sep 2011 19:07 UTC
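One such method, averaging many coarsely quantized measurements, can be illustrated with a small Python simulation. It is a sketch under a strong assumption: the start of each measured interval has a uniformly random phase relative to the 10 ns counter clock. It says nothing about what OPERA actually does, and the function names are hypothetical.

```python
import random

CLOCK_PERIOD_NS = 10.0  # one tick of a 100 MHz counter


def quantized_interval_ns(true_interval_ns: float, rng: random.Random) -> float:
    """One measurement: count whole clock ticks elapsed during the interval,
    with the interval starting at a random phase within a clock period."""
    phase = rng.random() * CLOCK_PERIOD_NS            # in [0, 10) ns
    ticks = int((phase + true_interval_ns) // CLOCK_PERIOD_NS)
    return ticks * CLOCK_PERIOD_NS


def averaged_estimate_ns(true_interval_ns: float, n: int, seed: int = 0) -> float:
    """Mean of n quantized measurements of the same interval."""
    rng = random.Random(seed)
    total = sum(quantized_interval_ns(true_interval_ns, rng) for _ in range(n))
    return total / n


# Each single reading is a multiple of 10 ns, but the average
# converges towards the true interval as n grows.
for n in (1, 100, 10_000):
    print(n, averaged_estimate_ns(60.7, n))
```

Averaging of this kind reduces random scatter; it does not remove a constant offset such as a miscalibrated delay or a consistently fast or slow oscillator.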

Einstein number 2 wrote: It is unlikely that an experimental error exists. We will have to accept the results soon. This is how science works, after all. 26 Sep 2011 20:46 UTC

Andrew wrote: It’s plausible that the FPGA is the source of their timing problems (especially since there aren’t many details of the FPGA), but it’s extremely unlikely that it’s due to any of the issues this author raises.

1) The chances that the FPGA implements “caching” in a manner similar to a CPU are next to zero. Caches make sense when you’re accessing data somewhat randomly (within a limited window) and repeatedly. The data access patterns for most signal processing algorithms are usually highly predictable (e.g. streaming), so there is no need for a cache. Even if DRAM is used for storing timestamps, that doesn’t mean you’d get variation in the input to timestamp part of the processing. One can easily tolerate DRAM access variation by using a queue at the DRAM controller. Both the Xilinx and Altera DRAM controllers have built-in queues, so I’d be really surprised if that were an issue.
2) All modern FPGAs have a clock manager which can increase or decrease the frequency of the clock actually running in the FPGA. (The Xilinx FPGAs call these MMCM – Mixed-mode clock managers). You can increase the internal frequency of the clock to 500 MHz or more depending on the FPGA with minimal jitter or skew. So even without better sampling techniques, there is likely less variation than a 100 MHz sample rate would imply. You can also sample on both clock edges if desired.

It’s a good idea to look into the FPGA, but caching and a 100 MHz clock are almost certainly not the issues. Still, the authors need to provide some more detail so we can look into it.

Thanks for the post!
26 Sep 2011 20:48 UTC

Tim wrote: Interesting. You're speculating that they're using the FPGA for the time measurement. However, CERN has developed their own silicon for time measurements:
I would bet they are using that chip interfaced to an FPGA. I've evaluated this chip for my own purposes and I can attest that it is indeed very accurate.
26 Sep 2011 21:26 UTC

Robert wrote: My question is:

Does CERN take into account the time (clock cycles) lost by the AD converters during start-up at the detectors of the "emission" side? This is a fixed time for all ADCs in their system and is about the size of the discrepancy.

I have seen that they are using Acqiris digitizer boards for the detection. On the CERN site this is probably not a problem, as I assume that their digitizers in a single instrument are identical and the relation between "trigger" and measurement is fixed (hard). In other words, the triggers and the measurement traces are postponed but their relation is kept.

But unless they are using the same digitizers for the OPERA instrument, a difference is imminent (and will always have the same fixed length).

Worst case would be that the OPERA instrument timing detection has been compensated for this and the CERN side has not.

FYI I designed the system architecture and electronics of a cosmic particle detector (a sensor network where nodes are at least 5 km apart) in the Netherlands (HiSparc), which is also based on GPS, and an airborne SAR radar system (RAMSES) with extreme sync (1.5 ps for clock, trigger and analog signal) between 1.5 GSPS ADC boards, where for both projects I had to take into account the delay caused by the ADCs.

I am available for personal discussion; send me a mail and I will respond.
26 Sep 2011 21:44 UTC

mnk wrote: Nick, I was talking about uncertainty. Systematic errors are generally the opposite of uncertainty - they are always there, if not it will fluke.
1. Systematic errors in deterministic systems are easy to catch and measure in tests
2. Uncertainty is easy to spot in the results as fluctuations.

Knowing 5-6 CERN engineers, I am 99.99% sure the FPGA is fine with 0.01 certainty :)
26 Sep 2011 22:35 UTC

Robert wrote: I think you are looking at the wrong side of the track. You are writing:

"latencies of the order of 10-100 ns unexpectedly added or subtracted to a baseline thought to be constant could completely or partially explain the OPERA results"

For several reasons:

1. If they added the latency, then the time measured for the stream would definitely be slower than light speed, because it would be detected much later.

2. I cannot imagine any possible design where you could subtract a latency, the expression "wait" state says enough.

If there is latency, they can use a fifo and add the used fifo depth (works as a delay line) to add to the result which they probably do.

3. If what you say is true, then there would be a huge spread of the results, and they would be dismissed as faulty. The experiment would not be repeatable.

4. The problem is that things are not delayed but come earlier than expected.

In order to reconstruct this, the mistake (if there is one) must be on the primary transmitting side... If you measure at CERN, you get a measuring delay but the neutrino stream does not listen to this. This measuring delay will shorten the measured time between the primary and secondary part of the experiment.

I also agree with Nick. I do not think that CERN engineers would release the FPGA without extensive testing. And as I said before, the delay of the ADCs does not pose a problem for CERN, as they are probably all the same.
26 Sep 2011 23:18 UTC

mrb wrote: Tim: correct, I am speculating based on the lack of information about this 100 MHz source. All other time sources are extensively described and double or triple-checked, but this 100 MHz source is a black box that they don't explain.

Robert: I thought I was clear enough but apparently not :) Look at the green part of the schema on page 8. This is an estimation of the propagation delay between a target tracker strip and the FPGA. It is estimated to be 59.6 + 25 = 84.6 ns. When the FPGA detects an event at time T, it subtracts 84.6 ns to calculate the actual neutrino arrival time. If an extra 40 ns accidental latency is measured on this green section during system calibration, then the system would subtract 124.6 ns instead of 84.6 ns while doing this computation during experiments, and neutrinos would appear to arrive 40 ns too early. The paper also explains pretty well that delays in this section make things appear to occur too early.
27 Sep 2011 02:36 UTC
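The arithmetic in this reply can be written out as a short Python sketch. The 59.6 ns and 25 ns delays and the 40 ns example come from the comment above; the function name and the detection time of 1000 ns are made up for illustration.

```python
STRIP_TO_FPGA_NS = 59.6   # propagation delay from the green section of the diagram
FPGA_LATENCY_NS = 25.0    # FPGA latency from the same section


def reconstructed_arrival_ns(detect_time_ns: float, calibrated_delay_ns: float) -> float:
    """Arrival time inferred by subtracting the calibrated delay
    from the time at which the FPGA saw the event."""
    return detect_time_ns - calibrated_delay_ns


true_delay_ns = STRIP_TO_FPGA_NS + FPGA_LATENCY_NS   # 84.6 ns
miscalibrated_delay_ns = true_delay_ns + 40.0        # extra latency seen only during calibration

detect_time_ns = 1000.0  # arbitrary detection time T, for illustration only

correct = reconstructed_arrival_ns(detect_time_ns, true_delay_ns)
biased = reconstructed_arrival_ns(detect_time_ns, miscalibrated_delay_ns)

# A positive value means the event is tagged earlier than it really happened.
print(f"apparent early arrival: {correct - biased:.1f} ns")
```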

Ryan wrote: lol @ all the posturing, wannabe physicists here. Keep it up losers! 27 Sep 2011 05:59 UTC

OzoneJunkie wrote: Some more obscure links that may be of interest:
27 Sep 2011 06:00 UTC

OzoneJunkie wrote: Oh, and also:
27 Sep 2011 06:01 UTC

mrb wrote: OzoneJunkie: thanks! I quickly looked through the docs, at first I don't see anything pertaining to the FPGA platform, but I am going to continue reading... 27 Sep 2011 06:17 UTC

Helen wrote: Will need to replicate! Using another technique 27 Sep 2011 12:48 UTC

Kannan wrote: Haven't they also tested a light beam through the same FPGA? If light still comes out slower, then it wouldn't be the FPGA, and the neutrinos really are fast. 29 Sep 2011 01:09 UTC

mullerpaulm wrote: Yes indeed, good exploring lads (except for Ryan...hey, are YOU a physicist?). And of course, this needs to be replicated.

Perhaps CERN could figure out how to send a radio signal released at the same time up through GPS down to OPERA's control room, work out the GPS delay each time using GPS to calibrate itself, and with that offset, compare actual arrival times as an independent check. Of course it was no big problem to compute the effective light-distance accurately enough using GPS etc. but this alternative path is also deterministic (I understand) to useful precision with the right equipment. That might have the benefit of eliminating many of the potential errors in the detection and timing systems. In any case, others will repeat the experiment with different equipment, baselines, and technical equipment.

Meanwhile, let's talk a bit around the general philosophy of Physics, and at that basic level, what this may mean if verified.

If it turns out that neutrinos fly at 1.000025c(light), we should remember that they are different, in a unique class as particles. The journalistic buzz about time travel and all the rest does not really measure the difficulty for physics. The science might relatively easily be able to accommodate a discrete, different class of particle that travels at this speed.

Neutrinos have a minuscule interaction profile with other matter (witness passing through 700+ km of solid rock). Light slows down in glass and air, why? Perhaps light in a vacuum is not quite the last word. Particles with Higgs-mass (e.g. protons) are in one class, and can never reach c(light) per Relativity. Photons (carrying mass in the form of energy relating to wavelength) all travel at c(light) in a vacuum. For all of these, c(light) is the limit, and Relativity applies.

But perhaps uniquely, neutrinos (essentially free of any interactions and so able to fly at full speed even through solid rock) are just slightly different, in a special class, and in the end merely show us that c(light) is the speed limit for everything else. Some particles cannot fly at c, photons can and do (always), neutrinos in a completely different class run a bit faster. The presumptive neutrino 'mass' may be a different kind of mass, non-Higgs carrying 'apparent mass' arising from this excess speed, to wit, even that apparent mass might be an artefact of their traveling at 1.000025c(light).

Sure, if verified, this will open up new horizons in Physics, but I would be surprised if it massively overturned the basics relating to the classical, mass-carrying wave/particles of physics.

Paul M Muller (PhD physics).
29 Sep 2011 02:39 UTC

Aule wrote: I think Mr. Muller has a point. This sounds like an index of refraction problem, and will probably cause no more than a minor sensation where physics would be forced to restate that the maximum speed possible for any information would be that of neutrinos in vacuo, rather than light in vacuo. The fact that the difference in speed is so small seems to be a dead giveaway. 29 Sep 2011 02:53 UTC

Andrew Casper wrote: I completely agree that this could very well be explained by an error with the FPGA-based DAQ, specifically the 100 MHz clock. I've built a few high-speed data acquisition boards, which were, interestingly enough, based on a 100 MHz system clock. When I would run two of these units off their own clocks, I would develop a time difference of over 60 ns well within 0.6 s. If I was able to run both systems off the same clock, I could achieve alignment on the order of picoseconds. It's just very difficult to believe that multiple FPGAs could remain aligned to within tens of nanoseconds based on a 1.66 Hz pulse from a master clock, while relying on a local clock for timing between pulses. 29 Sep 2011 05:27 UTC

Matt wrote: Err no. They don't need this type of accuracy he said - but it's not that he pulled it out of his sleeve. They worked together with the timing specialists of course, which is what he said. And I hope you don't really believe you can get that type of kit 'off the shelf' - of course it is custom built! Who else on the planet needs stuff like that? 29 Sep 2011 06:58 UTC

Thijs wrote: @Mr. Muller: I'm not a physicist, but I did a course on relativity during my math degree, and what baffled me during this course is that the speed of light has nothing to do with light as such. I was shown (and can probably retrieve) a deduction of special relativity based on the following two assumptions:
- Inertial systems moving at constant relative speed must see the same local physics (principle of relativity).
- There is a maximum speed in the universe (let's call it c ;-) )

Special relativity follows from describing what you see happening in the other inertial system. The speed limit does not mention any kind of particle. It turns out that light 'accidentally' travels with exactly this maximum speed (which is calculated as a product of electrical properties of the vacuum).

So, yes, it would be a problem if something turns out to be travelling faster than light (1). This would lead to the conclusion that one or both of these assumptions is an inaccurate approximation. This would be the first indication that relativity is an approximation of reality, just like the photoelectric effect demonstrated that Newtonian physics was an approximation of reality. It may not be a disaster in the sense that physicists will always be doomed to be working with approximations of reality, but it would turn physics as we know it on its head, just like relativity did.

(1) As far as I know, physicists are still debating whether the collapse of the wave function constitutes 'something', but apparently in relativistic terms it does not constitute 'something' otherwise relativity would have been shaken up before.

PS. No, it cannot be a dispersion effect, because dispersion only slows things down relative to vacuum. No reliable observations exist of dispersion speeding signals up relative to vacuum since this would have caused the same upheaval as the CERN results are doing now.
29 Sep 2011 11:38 UTC

Colin wrote: I'm a total layman, but sometimes see relationships.
It seems to me that all of the current "Big" questions in Physics and Cosmology have some relationship to mass. The recent Higgs results (or lack of) from CERN, dark matter/energy, and now this result.
If Higgs is disproved, there would still have to be some real mechanism that performed its "task", even if we have no idea where to start looking for it. Isn't it just possible that the only neutrinos we CAN see are ones that are low probability special cases with "negative" mass (for want of a vocabulary)?
Which opens the door for "special" exceptions to mass force carrier rules, Higgs or whatever replaces it?
Just idle speculation on a rainy Thursday in the Great White North.
Be gentle...
29 Sep 2011 12:59 UTC

mrb wrote: Matt: many scientists do need that. In fact, off-the-shelf devices using a standard protocol are being developed exactly for this case of applications that cannot have a GPS receiver at each node and where the precision of NTP is insufficient:

Also, the IPN Lyon itself, who I recently learned developed the timing devices for OPERA, said they are looking to use PTP for future projects instead of implementing custom proprietary solutions like the one at OPERA.
29 Sep 2011 18:05 UTC

Andrew Casper wrote: I ran a quick test to look at the accuracy of the experiment’s described clocking scheme. I created a 100 MHz counter that is reset every 600 ms by an external 10 ns pulse; the value of the counter at reset was offloaded to a computer for analysis. To gain some insight into how sensitive this setup is to environment variables, I placed a small desktop fan near the FPGA and turned it on partway through the data collection. You can look at the results here:
You can see that the setup had an initial error of over 60 clock cycles and changed by over 50 clock cycles when the fan started blowing. Of course this is all dependent on the type of crystal you use to generate your clock...
It would be relatively easy to have the FPGA constantly count the number of clock cycles between pulses and adjust some external heater/cooler to constantly keep the oscillator at the correct frequency. Also, the magnitude of any error will be highly dependent on where in the 600 ms period between resets the experiment took place. They mention the total time of flight was less than 3 ms; if the generation of the neutrinos was initiated by the master reset pulse, then the system would have very little time to accumulate significant error.
I think the point of all this is that without more information on how the timing was controlled on the FPGA it’s impossible to know if the timing scheme was correct. It’s certainly possible to use the described setup, with appropriate controls, to realize the needed accuracy. But it’s also possible to mess it up. I would, however, be willing to give them the benefit of the doubt...
29 Sep 2011 23:29 UTC
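The figures in this test can be converted into a fractional frequency error with a few lines of Python. This is a minimal sketch assuming the error accumulates linearly between resets; the function names are hypothetical, and the 2.43 ms window is the time of flight quoted in the post.

```python
CLOCK_HZ = 100_000_000       # nominal counter frequency
RESET_PERIOD_S = 0.6         # interval between reset pulses
NOMINAL_CYCLES = 60_000_000  # CLOCK_HZ * RESET_PERIOD_S, cycles expected per period


def fractional_error(cycle_error: float) -> float:
    """Fractional frequency error implied by a miscount of cycle_error
    clock cycles over one full reset period."""
    return cycle_error / NOMINAL_CYCLES


def error_over_window_ns(cycle_error: float, window_s: float) -> float:
    """Timing error (ns) accumulated window_s seconds after a reset,
    assuming the error grows linearly between resets."""
    return fractional_error(cycle_error) * window_s * 1e9


# 60 cycles of error per 600 ms period, as in the test above.
print(f"fractional error:       {fractional_error(60):.1e}")
print(f"after full 0.6 s:       {error_over_window_ns(60, RESET_PERIOD_S):.0f} ns")
print(f"after 2.43 ms (flight): {error_over_window_ns(60, 2.43e-3):.2f} ns")
```

A 60-cycle miscount corresponds to about 1 part in 1e6. Over a full period that is 600 ns, but only a few nanoseconds if the measurement happens within the first 2.43 ms after a reset, which is the point made above about where in the period the experiment takes place.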

mrb wrote: Andrew: cool demonstration of the "1 part in 1e6" accuracy of an FPGA's internal oscillator, thanks. 30 Sep 2011 02:29 UTC

Dr Peter Gangli wrote: It would make sense to consider the effects of Heisenberg Uncertainty. Careful measurements can be made, numbers will be (have been) obtained, yet the location [x], and the momentum [p] of any particle can only be determined with the uncertainty defined by Heisenberg Uncertainty relationship.

Δp Δq ≥ ℏ/2

Relativity and quantum mechanics describe the same world but from the viewpoint of two different universes. The inherent theoretical uncertainty of the EXACT location [at a given time] and the EXACT velocity of any measured neutrino could have and should have been treated and considered.

Why not? The paper shows only time of flight scatter charts!
01 Oct 2011 05:07 UTC

David wrote: I'd guess, if the FPGA designers weren't complete idiots, they'd timestamp data *before* they put them into a latency-suffering RAM, so no, caching and RAM latencies shouldn't lead to observable timestamp offsets. The 100 MHz clock also isn't likely a source of error. Either the 100MHz clock is locked to a much more accurate reference clock, or they have different clock domains for sampling and data processing, where the timestamping is done in the clock domain that is synchronous to the accurate reference clock.

Maybe the FPGA designers just didn't account for (all) their data processing pipeline's latencies, and so have a constant N*clock offset WRT timestamp. At a 100MHz clock-rate, 60ns is 6 cycles, that might be possible. Are they really only clocking at 100MHz? But no, those errors would be easily spotted when comparing different detectors' signals.
17 Oct 2011 20:30 UTC

BozoQed wrote: "OPERA's observation of a similar time delay with a different beam structure only indicates no problem with the batch structure of the beam, it doesn't help to understand whether there is a systematic delay which has been overlooked," said Jenny Thomas, co-spokesman for the Chicago-based lab's own neutrino experiment, MINOS. 18 Nov 2011 10:57 UTC

John wrote: I guess this is the experiment Michio Kaku was talking about:
05 Jan 2013 11:43 UTC