
LAX Crippled by a $10 part
Posted 16-Aug-2007 by Robby Slaughter (@robbyslaughter)

Los Angeles International Airport (LAX) held some 17,000 passengers in limbo last weekend because of a computer outage. You might think that a disaster of this scale could only be caused by terrorism, hackers, or an earthquake, but the truth is much simpler. As reported by the Los Angeles Times, the problem was one little network card in one computer.

The card that (prevented) the launch of a thousand planes, so none could head home to Ilium. (A twist on Helen of Troy)

The precise technical details of how a $10 part could have led to 17,000 irate foreign nationals are not yet clear, but the moral is as old as systems engineering itself: "a chain is only as strong as its weakest link". Even without access to the complex software and computer networks that make up the international departure system at LAX, we can tell that the entire monstrosity depends completely on the correct operation of at least one commodity piece of hardware. In a life-critical, multi-billion dollar industry like air travel, this is worse than an ordinary failure: it is a single point of failure built into the design. No tiny component at the edge of a system should be able to knock out all operations for eight hours. That is an eternity of downtime for a pittance of hardware.

There are two standard approaches to preventing this problem. The first is not to trust a single chain and risk breaking its weakest link. Always have a second chain in place: a redundant system, ready to carry the load in the event of an unexpected failure. The designers of this system should have had a secondary network fabric ready to take over critical operations once the first began behaving erratically. Even if the same network card eventually poisoned the auxiliary infrastructure, the extra time bought by the switchover would have enabled the response team to minimize the damage.
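To make the "second chain" idea concrete, here is a minimal Python sketch of failover between two links. Nothing here reflects the actual LAX system: the names `Link`, `RedundantChannel`, and `LinkDown` are hypothetical, and a real network fabric would also need health checks and monitoring.

```python
class LinkDown(Exception):
    """Raised by a link that can no longer carry traffic."""


class Link:
    """Hypothetical stand-in for one network path (one 'chain')."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def send(self, message):
        if not self.healthy:
            raise LinkDown(self.name)
        return f"{self.name} delivered {message}"


class RedundantChannel:
    """Tries the primary link first and falls back to the standby."""

    def __init__(self, primary, standby):
        self.links = [primary, standby]

    def send(self, message):
        failed = []
        for link in self.links:
            try:
                return link.send(message)
            except LinkDown as exc:
                failed.append(str(exc))
        # Both chains are broken: report it loudly instead of hanging.
        raise LinkDown("all links down: " + ", ".join(failed))


primary = Link("primary")
standby = Link("standby")
channel = RedundantChannel(primary, standby)

first = channel.send("flight manifest")
primary.healthy = False  # simulate the failed network card
second = channel.send("flight manifest")

print(first)   # primary delivered flight manifest
print(second)  # standby delivered flight manifest
```

The caller never needs to know which link carried the message, which is the point: the failure of one cheap part becomes an event to log and repair, not an outage.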

The second, and more important, approach is comprehensive, periodic failure testing, which is the key to confirming the robustness of a system. One should expect components to fail, and there is no better way to find out whether your system can withstand common problems than to force those problems to occur. A good failure test involves powering down components, overloading network segments, restarting machines, disconnecting key pathways, and even injecting known problems into the system. Only by constantly preparing and testing for challenging scenarios can you keep them from turning into outages.
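A failure test of the "power down each component" kind can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a real test harness: the `Component` and `System` classes and the tier names are hypothetical, and the model assumes the system keeps serving as long as every tier has at least one working component.

```python
class Component:
    """Hypothetical piece of hardware that is either up or down."""

    def __init__(self, name):
        self.name = name
        self.up = True


class System:
    """Serves requests while every tier has at least one component up."""

    def __init__(self, tiers):
        self.tiers = tiers  # maps tier name -> list of Component

    def is_serving(self):
        return all(any(c.up for c in comps) for comps in self.tiers.values())


def single_points_of_failure(system):
    """Power down each component in turn; report those whose loss stops service."""
    found = []
    for tier, comps in system.tiers.items():
        for comp in comps:
            comp.up = False  # inject the failure
            try:
                if not system.is_serving():
                    found.append(f"{tier}/{comp.name}")
            finally:
                comp.up = True  # always restore before the next trial
    return found


system = System({
    "database": [Component("db-a"), Component("db-b")],
    "network": [Component("nic-0")],
})

weak_links = single_points_of_failure(system)
print(weak_links)  # ['network/nic-0']
```

Here the redundant database tier survives the loss of either machine, while the lone network card shows up as the weak link. Run periodically against a real system, the same loop finds the $10 part before the passengers do.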

Redundancy and rigorous testing are not a perfect solution, but they are the best tools we have to ensure the reliability of systems. We must not only design systems to survive individual failures, but also actively test them by forcing those failures to occur and monitoring the results. Otherwise, like the passengers at LAX, you may get stuck for hours at a tremendous financial loss.
