Extreme Troubleshooting and Problem Resolution
Customer: An Air Force Test Range
Project: Telemetry Processing System
Challenge: Hard-to-Replicate Data Gap During Long-Running Testing
Introduction
Sometimes a company is defined as much by what happens when something goes wrong as by what happens when things go right. This is the story of a rare data gap and the giant steps we took to resolve it.
In 2012, NetAcquire shipped seven telemetry-processing systems for our customer’s next-generation range safety system. The system used advanced NetAcquire data flow processing to determine, among other things, real-time vehicle data quality by extracting and checking cyclic redundancy check (CRC) codes in the data stream received by antennas at multiple locations.
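To illustrate the general idea of CRC-based data quality checking, here is a minimal sketch. It is not NetAcquire's implementation: the frame layout (payload followed by a two-byte big-endian CRC), the choice of CRC-16-CCITT with an initial value of 0xFFFF, and the function names are all assumptions made for the example.

```python
import binascii
import struct

CRC_INIT = 0xFFFF  # assumed initial value; real telemetry formats vary


def make_frame(payload: bytes) -> bytes:
    """Append a 16-bit CRC-CCITT trailer to a payload (hypothetical layout)."""
    crc = binascii.crc_hqx(payload, CRC_INIT)
    return payload + struct.pack(">H", crc)


def frame_is_valid(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    if len(frame) < 3:
        return False
    payload, trailer = frame[:-2], frame[-2:]
    (expected,) = struct.unpack(">H", trailer)
    return binascii.crc_hqx(payload, CRC_INIT) == expected


def data_quality(frames) -> float:
    """Fraction of frames whose CRC checks out (0.0 for an empty batch)."""
    frames = list(frames)
    if not frames:
        return 0.0
    good = sum(1 for frame in frames if frame_is_valid(frame))
    return good / len(frames)
```

A quality figure of this kind could be computed per antenna site and compared across sites, although how the real system combines its sources is not described here.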
All seven systems passed every NetAcquire manufacturing test. However, during a long-running test at a later point in the acceptance testing, the range’s system integrator noticed the system exhibited a single, one-minute gap in data processing. They immediately reported this occurrence to NetAcquire.
A key NetAcquire philosophy is to never treat even a single occurrence of a problem as unimportant, no matter how isolated it first appears.
Our first step was to determine if we could reproduce the problem at the NetAcquire factory. NetAcquire maintains an extensive, dedicated QA lab with a large number of available NetAcquire product configurations that represent systems shipped to customers. Automated NetAcquire test software performs 1,548 individual tests of the hardware, firmware, and software in each NetAcquire system. When NetAcquire used a matching in-house system and ran the extensive suite of tests, no problems were detected. NetAcquire also performed various manual tests without seeing a recurrence of the problem.
Meanwhile, the integrator continued testing on the remainder of the seven systems. Three of the systems each operated flawlessly for more than 100 hours of execution. The infrequent failure appeared on the fourth system after an extended run.
At this point, NetAcquire added more personnel to help solve the problem. Since NetAcquire could not reproduce the problem on its in-house hardware, the failing customer system was returned to the factory. It turned out that the customer’s operating configuration was very sophisticated, with more than 100 threads of execution performing complex, numerically intensive processing across 12 simultaneous PCM input channels. Furthermore, when the system’s configuration was slightly simplified, the problem disappeared completely. Finding the problem would be like looking for a needle in an acre of haystacks.
With an infrequently occurring system problem, elapsed time is the enemy of diagnostics efforts. For each diagnostics change to narrow down the problem, up to a week of system execution time could elapse before engineers could determine if the problem still occurred.
While expensive, diagnostics speed can be increased by adding parallel activities. NetAcquire proceeded to manufacture and deploy multiple copies of the customer’s hardware configuration in its QA lab to allow simultaneous long-running testing of different scenarios. NetAcquire also engaged multiple teams of engineers, each of which looked at different possibilities for the cause of the problem.
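The trade-off between elapsed time and parallel test systems can be made concrete with a little arithmetic. The sketch below is illustrative only. It assumes failures arrive at random with a constant average rate (a Poisson model) and that every parallel copy is equally likely to fail, neither of which is stated in this case study. The 100-hour mean time between failures and the count of four systems are hypothetical numbers, not NetAcquire data.

```python
import math


def confidence_fixed(clean_hours: float, mtbf_hours: float) -> float:
    """Probability that at least one failure would have appeared during
    clean_hours of total exposure if the fault were still present."""
    return 1.0 - math.exp(-clean_hours / mtbf_hours)


def hours_needed(mtbf_hours: float, confidence: float) -> float:
    """Total failure-free exposure needed to reach the given confidence."""
    return -mtbf_hours * math.log(1.0 - confidence)


def wall_clock_hours(mtbf_hours: float, confidence: float,
                     parallel_systems: int) -> float:
    """Elapsed time when the exposure is shared by identical systems."""
    return hours_needed(mtbf_hours, confidence) / parallel_systems


if __name__ == "__main__":
    # Hypothetical inputs: 100 h mean time between failures, 95% confidence.
    print(round(hours_needed(100.0, 0.95), 1))         # about 299.6 hours
    print(round(wall_clock_hours(100.0, 0.95, 4), 1))  # about 74.9 hours
```

Under these assumptions, a single system needs roughly 300 failure-free hours before a fix can be trusted at the 95% level, while four identical systems reach the same total exposure in about 75 hours of elapsed time.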
One team focused on the theoretical possibility of a latent bug in the NetAcquire software. NetAcquire products have a large and sophisticated software base. A software defect was not considered likely because NetAcquire software is built on a clear philosophy: software quality must be designed into the product rather than tested into it. This philosophy stems from the well-known limitation of using testing to find problems: testing can miss significant issues that are infrequent or that require a unique or transient combination of runtime events, and such issues may never surface during factory testing. Over two decades of software development, this philosophy has resulted in an extremely stable software code base. NetAcquire had even submitted the source code for its data flow engine software to a third-party software validation company selected by the Air Force for a detailed code analysis as part of obtaining approval for use in range safety-critical systems. Nonetheless, the software team developed an approach for minimizing the “footprint” of the source code executing as part of the customer’s configuration and then began reviewing both the source code and other customer use cases that might share this same code base.
The problem continued to appear infrequently without resolution, so NetAcquire added more staff to the effort. At its peak, two-thirds of the entire NetAcquire engineering department was working on diagnostics.
Resolution
The breakthrough came from a team that was swapping individual hardware components between systems to see whether the problem might “follow” a particular piece of hardware. Based on detailed test results, the team suspected the problem only occurred on certain processor boards. Since many processor boards worked fine, one shortcut would be to declare a few processor boards to be bad and just replace them. However, NetAcquire’s mission-critical engineering methodology emphasizes root cause analysis to ensure that a problem is truly solved. This meant taking diagnostics to an even lower level and swapping individual components on problematic processor boards. Based on this work, the problem appeared to actually follow specific Intel processor chips. The team’s anticipation grew as Intel processor serial number and lot manufacturing records were pulled for each of the customer systems and compared (NetAcquire maintains full serial number traceability on every system shipped).
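A comparison of traceability records of this kind can be sketched in a few lines. The example is hypothetical throughout: the record fields (`board_rev`, `cpu_grade`), the values, and the function name are invented for illustration and do not reflect NetAcquire's actual records or tooling. It looks for any attribute value that is present on every failing system and absent from every passing one.

```python
def perfectly_correlated(records, fields):
    """Return (field, value) pairs where every system having that value
    failed and every system lacking it passed."""
    hits = []
    for field in fields:
        for value in sorted({r[field] for r in records}):
            with_value = [r["failed"] for r in records if r[field] == value]
            without = [r["failed"] for r in records if r[field] != value]
            if all(with_value) and not any(without):
                hits.append((field, value))
    return hits


# Hypothetical traceability data for four systems.
records = [
    {"system": "A", "board_rev": "r1", "cpu_grade": "grade-X", "failed": True},
    {"system": "B", "board_rev": "r1", "cpu_grade": "grade-Y", "failed": False},
    {"system": "C", "board_rev": "r2", "cpu_grade": "grade-X", "failed": True},
    {"system": "D", "board_rev": "r2", "cpu_grade": "grade-Y", "failed": False},
]

suspects = perfectly_correlated(records, ["board_rev", "cpu_grade"])
print(suspects)
```

With this made-up data, only the processor grade separates failing from passing systems; the board revision does not. A perfect correlation on a handful of systems is a lead to confirm with further swaps and long-running tests, not proof of root cause.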
Once the glint of “a needle in the haystack” appeared, progress was rapid. A particular grade of Intel processor chips was identified as being 100% correlated with the problem. Substituting a different Intel processor grade resolved the problem on a previously failing system, which then passed the ultimate continuous test that ran for an entire month. The test ran this long because one NetAcquire engineer estimated that, on average, 10^17 processor cycles elapsed before the problem occurred.
All of the customer’s systems were returned to the NetAcquire factory for expedited processor replacement and QA, after which the systems were quickly returned to the integrator.
All-Customer Proactive Response
Manufacturing traceability records indicated that four other NetAcquire customers had recently received systems with the problematic Intel processor. Even though these customers were seeing absolutely no issues, NetAcquire proactively contacted each customer and arranged for a hardware repair of their systems at the customer’s convenience.
No customer was charged any costs associated with finding and resolving the Intel processor problem. The original range safety customer resumed their system qualification testing and is now multiple years into highly successful system operation across many missions.