Case Studies

Delta II 7925 launch with GPS IIR-16

Extreme Troubleshooting and Problem Resolution

Customer: An Air Force Test Range
Project: Telemetry Processing System
Challenge: Hard-to-Replicate Data Gap During Long-Running Testing

Introduction

Sometimes a company is defined as much by what happens when something goes wrong as by what happens when things go right. This is the story of a rare data gap and the giant steps we took to resolve it.

In 2012, NetAcquire shipped seven telemetry-processing systems for our customer’s next-generation range safety system. The system leveraged advanced NetAcquire data flow processing to determine, among other things, real-time vehicle data quality by extracting and checking CRC codes on the data stream received by antennas at multiple locations.
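As a rough illustration of CRC-based data quality checking, the sketch below validates frames that carry a trailing 16-bit CRC and reports the fraction that pass. The frame layout, the CRC-CCITT polynomial, the 0xFFFF initial value, and the function names are all assumptions made for illustration; the actual telemetry format and NetAcquire processing are not described here.

```python
import binascii

CRC_INIT = 0xFFFF  # assumed initial value; real telemetry formats may differ


def append_crc(payload: bytes) -> bytes:
    """Build a hypothetical frame: payload followed by a big-endian 16-bit CRC."""
    crc = binascii.crc_hqx(payload, CRC_INIT)  # CRC-CCITT from the standard library
    return payload + crc.to_bytes(2, "big")


def frame_is_valid(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the received CRC."""
    if len(frame) < 3:  # need at least one payload byte plus two CRC bytes
        return False
    payload = frame[:-2]
    received = int.from_bytes(frame[-2:], "big")
    return binascii.crc_hqx(payload, CRC_INIT) == received


def data_quality(frames) -> float:
    """Return the fraction of frames whose CRC checks out (0.0 for no frames)."""
    frames = list(frames)
    if not frames:
        return 0.0
    good = sum(1 for f in frames if frame_is_valid(f))
    return good / len(frames)


if __name__ == "__main__":
    good = append_crc(b"telemetry sample")
    bad = bytes([good[0] ^ 0x01]) + good[1:]  # flip one bit to simulate a link error
    print(data_quality([good, bad]))
```

A quality figure like this could be computed per antenna, which is one plausible way to compare the same stream received at multiple locations.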

All seven systems passed every NetAcquire manufacturing test. However, during a long-running test at a later point in the acceptance testing, the range’s system integrator noticed the system exhibited a single, one-minute gap in data processing. They immediately reported this occurrence to NetAcquire.

A key NetAcquire philosophy is to never treat even a single occurrence of a problem as unimportant, no matter how isolated it first appears.

Our first step was to determine if we could reproduce the problem at the NetAcquire factory. NetAcquire maintains an extensive, dedicated QA lab with a large number of available NetAcquire product configurations that represent systems shipped to customers. Automated NetAcquire test software performs 1,548 individual tests of the hardware, firmware, and software in each NetAcquire system. When NetAcquire used a matching in-house system and ran the extensive suite of tests, no problems were detected. NetAcquire also performed various manual tests without seeing a recurrence of the problem.

Meanwhile, the integrator continued testing on the remainder of the seven systems. Three of the systems each operated flawlessly for more than 100 hours of execution. The infrequent failure appeared on the fourth system after an extended run.

At this point, NetAcquire added more personnel to help solve the problem. Since NetAcquire could not reproduce the problem on its in-house hardware, the failing customer system was returned to the factory. It turned out that the customer’s operating configuration was very sophisticated, with more than 100 threads of execution performing complex, numerically intensive processing across 12 simultaneous PCM input channels. Furthermore, when the system’s configuration was slightly simplified, the problem disappeared completely. Finding the problem would be like looking for a needle in an acre of haystacks.

With an infrequently occurring system problem, elapsed time is the enemy of diagnostics efforts. For each diagnostics change to narrow down the problem, up to a week of system execution time could elapse before engineers could determine if the problem still occurred.
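The arithmetic behind that waiting time can be sketched as follows. This is a simplified model, not NetAcquire's method: it assumes failures arrive as a Poisson process with a known mean time between failures, so the chance of seeing no failure in t hours while the bug is still present is exp(-t / mean). The 50-hour mean used in the example is a hypothetical figure chosen only for illustration.

```python
import math


def failure_free_hours_needed(mean_hours_between_failures: float,
                              confidence: float) -> float:
    """Hours of failure-free running needed before concluding, at the given
    confidence, that a change removed the problem.

    Assumes (hypothetically) Poisson failures: P(no failure in t) = exp(-t / mean).
    Solving exp(-t / mean) <= 1 - confidence for t gives t >= -mean * ln(1 - confidence).
    """
    if mean_hours_between_failures <= 0:
        raise ValueError("mean time between failures must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -mean_hours_between_failures * math.log(1.0 - confidence)


if __name__ == "__main__":
    hours = failure_free_hours_needed(50.0, 0.95)  # hypothetical 50-hour mean
    print(f"{hours:.1f} hours, about {hours / 24:.1f} days")
```

Under these assumed numbers, a single experiment needs roughly 150 failure-free hours, a little over six days, which shows how quickly rare failures stretch each diagnostic iteration.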

While expensive, diagnostics speed can be increased by adding parallel activities. NetAcquire proceeded to manufacture and deploy multiple copies of the customer’s hardware configuration in its QA lab to allow simultaneous long-running testing of different scenarios. NetAcquire also engaged multiple teams of engineers who each looked at different possibilities for the cause of the problem.

One team focused on a theoretical possibility of a latent bug in the NetAcquire software. NetAcquire products have a large and sophisticated software base. A software defect was not considered likely because NetAcquire software is built on a clear philosophy: software quality must be designed in rather than tested into the product. This philosophy stems from the well-known limitation of using testing to find problems; testing can miss significant issues that might be infrequent or that require a unique/transient combination of runtime events (i.e., are never found during factory testing). Over two decades of software development, this philosophy has resulted in an extremely stable software code base. NetAcquire had even submitted the source code for its data flow engine software to a third-party software validation company selected by the Air Force for a detailed code analysis as part of obtaining approval for use in range safety-critical systems. Nonetheless, the software team developed an approach for minimizing the “footprint” of the source code executing as part of the customer’s configuration and then began reviewing both the source code and other customer use cases that might share this same code base.

The problem continued to appear infrequently without resolution, so NetAcquire added more staff to the effort. At its peak, two-thirds of the entire NetAcquire engineering department was working on diagnostics.

Resolution

The breakthrough came from a team that was swapping individual hardware components between systems to see whether the problem might “follow” a particular piece of hardware. Based on detailed test results, the team suspected the problem only occurred on certain processor boards. Since many processor boards worked fine, one shortcut would be to declare a few processor boards to be bad and just replace them. However, NetAcquire’s mission-critical engineering methodology emphasizes root cause analysis to ensure that a problem is truly solved. This meant taking diagnostics to an even lower level and swapping individual components on problematic processor boards. Based on this work, the problem appeared to actually follow specific Intel processor chips. The team’s anticipation grew as Intel processor serial number and lot manufacturing records were pulled for each of the customer systems and compared (NetAcquire maintains full serial number traceability on every system shipped).

Once the glint of “a needle in the haystack” appeared, progress was rapid. A particular grade of Intel processor chips was identified as being 100% correlated with the problem. Replacing it with a different Intel processor grade resolved the problem on a previously failing system, which then passed the ultimate continuous test that ran for an entire month. The extended test duration was necessary because one NetAcquire engineer estimated that, on average, 10^17 processor cycles elapsed before the problem occurred.

All the customer’s systems were returned to the NetAcquire factory for expedited processor replacement and QA, after which the systems were quickly returned to the integrator.

All-Customer Proactive Response

Manufacturing traceability records indicated that four other NetAcquire customers had recently received systems with the problematic Intel processor. Even though these customers were seeing absolutely no issues, NetAcquire proactively contacted each customer and arranged for a hardware repair of their systems at the customer’s convenience.

No customer was charged any costs associated with finding and resolving the Intel processor problem. The original range safety customer resumed their system qualification testing and is now multiple years into highly successful system operation across many missions.

Is NetAcquire a good fit for your project?

Our applications engineers will discuss your needs and offer advice and pricing for the solutions we can provide.
NetAcquire provides quick responses to phone and email queries during Pacific Time business hours.

Call us toll free: 888-675-1122 or email [email protected]

Preston Hauck

President, Chief Technology Officer

Preston Hauck founded NetAcquire in 1993 with the goal of providing computer communications and processing systems that seamlessly operate with real-time performance over distributed networks.

During the early years of NetAcquire, Preston took a hands-on role in developing new products. He remains enthusiastically involved in helping customers solve complex real-world challenges.

As the company grew, Preston leveraged his information systems background to create tools that could automate business processes, increasing company efficiency, responsiveness, and scalability. In his current role, Preston oversees new product direction, manages company growth, and works directly with clients and the engineering team on projects that call for his applications design expertise.

Prior to founding NetAcquire, Preston served as vice president of software engineering for Microstar Laboratories, a provider of PC-based test and measurement products. During his tenure, he helped the company grow tenfold in both revenue and employees. Preston developed an innovative real-time operating system from the ground up, plus a data processing engine that powered all of Microstar's products for more than two decades.

Preston holds a bachelor's degree in computer science, graduating first in his class in 1984 from the University of British Columbia. He was granted a U.S. patent for developing a novel approach to initialize one processor from another in a multi-processor system. He has published a number of papers on real-time computer systems and regularly teaches networked telemetry processing classes.

Preston was born in Canada. After many years living in Washington State, he proudly became a naturalized U.S. citizen in 2003. He enjoys traveling, reading, photography, basketball, skiing, and movies with his wife Colleen and their two children.

John Bono

Engineering Vice President

John Bono, a Seattle area native, graduated summa cum laude with a Bachelor of Science in Electrical Engineering from the University of Washington. John started his career with Boeing Aerospace Company, working in a variety of roles. He received personal recognition from Boeing’s CEO for his contributions to 767 avionics. John designed hardware and firmware for a Mil-Std-1553 bus interface and was the hardware lead and system designer for a B-1B flight simulator’s computer-generated imagery display. After Boeing, John joined Advanced Technology Laboratories (ATL) as a signal processing engineer. He was promoted to the role of ultrasound system architect before becoming the software and systems lead for an image storage and network transport system. During John’s tenure at ATL, he received a Technical Fellow award, became a member of the Senior Technical Staff, and had a role in ATL’s achievement of ISO 9001 certification.

When John joined the rapidly growing NetAcquire Corporation in 2001, his “big company” experience was leveraged to enhance configuration management and quality systems and processes. He soon became the Management Representative for the company’s AS 9100 certification efforts.  

Today, John is involved in the system, hardware, and firmware design of virtually every system NetAcquire ships. While the majority of these systems are customer-specified configurations of NetAcquire COTS hardware and software modules, many other systems are rack-sized, challenging configurations involving complex interchanges of information in all forms. The NetAcquire motto of “connect anything to anything” is John’s daily mission. He knows that, when applied to real-time systems, it brings a continuous stream of interesting problems to solve and challenges to conquer in meeting the needs of his customers and their mission-critical projects around the world.

Steve Proudlock

Sales Vice President

For many NetAcquire customers, Steve is the initial face and voice of the company. While he has a desk at the company headquarters in Kirkland, Washington, quite often he’s in the air, flying to meet customers, discuss projects and set solutions into motion.

Steve joined NetAcquire in the summer of 2001. He sees his principal role as being the customer’s advocate. Steve thrives on opportunities to work on-site with NetAcquire customers to see first-hand the tasks and challenges they face.

In 2002, Steve joined the volunteer organizing committee for the International Telemetering Conference; he serves on the committee to this day. Steve also supports and participates in the meetings and industry standards efforts of several Range Commanders Council (RCC) committees.

Steve is an Army veteran, having served in locations around the world in military law enforcement and personal protection roles. After graduating in pre-law from Central Washington University, he began a sales career in retail consumer electronics. The career choice was inspired by his favorite uncle, Ken Cartwright, a career tire salesman at Sears. Ken taught Steve the foundation principles of professional salesmanship: “Say what you are going to do, and then do it,” and “Get to know someone by showing sincere interest and earning their trust.”

Steve’s Microsoft and Cisco certifications help him better understand the technical aspects of test range telemetry. While Steve is well versed in all aspects of NetAcquire technology today, he still learns something new on a daily basis from his colleagues.

Steve is a Seattle area native who enthusiastically gives his time and resources to a local charity dedicated to supporting foster children and their families.

Mark Roseberry

President

Mark Roseberry has been with NetAcquire since its inception. He holds a Bachelor's Degree in Computer Science and Mathematics, graduating with honors from the University of British Columbia. Mark has almost 40 years of software development and business management expertise. He has been instrumental in the Company's business development and growth while running the NetAcquire Canada engineering office in Vancouver since 2001, and he has been involved in the design and implementation of many key NetAcquire software components.

Mark shares our founder Preston Hauck's hands-on philosophy, and his user-centric approach makes him a frequent participant in and advocate for strong technical solutions, ranging from customer-driven requirements development to robust implementation to top-notch technical support and responsiveness.