LAX Crippled by a $10 part
The international airport LAX held some 17,000 passengers in limbo last weekend due to a computer
outage. You might think that this scale of disaster could only occur due to terrorism, hackers,
or an earthquake, but the truth is much simpler. As reported by the
Los Angeles Times, the problem was one little network card in one computer.
The card that (prevented) the launch of a thousand planes,
so none could head home to Ilium. (A twist on Helen of Troy)
The precise technical details of how a $10 part could have led to 17,000 irate foreign nationals are not yet
clear, but the moral is as old as systems engineering itself: "a chain is only as strong as its weakest link".
Even without access to the complex software system and computer networks that comprise the international
departure system at LAX, we can tell that the entire monstrosity is completely dependent on the correct operation of
at least one commodity piece of hardware. In a life-critical, multibillion-dollar industry like air travel, this
is worse than failure. No tiny little component at the edge of a system should be able to knock out all operations
for eight hours. This is an eternity for a pittance.
There are two standard approaches to preventing this problem. The first is not to trust a single chain and risk
breaking its weakest link. Always have a second chain in place: a redundant system, ready to carry the load
in the event of an unexpected failure. The designers of this system should have had a secondary network fabric ready
to take over critical operations once the first began behaving erratically. Even if the same network card eventually
poisoned the auxiliary infrastructure, the extra time bought by the switchover would have enabled the response team
to minimize the damage.
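To make the "second chain" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the Link and RedundantLink classes, the LinkDown exception and the "manifest-001" message are stand-ins for illustration and have nothing to do with the actual LAX software. It only shows the pattern of trying a primary path and falling through to a secondary one when the first fails.

```python
class LinkDown(Exception):
    """Raised by a hypothetical network link when it cannot deliver."""


class Link:
    """Stand-in for one network path; `healthy` lets us simulate a fault."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def send(self, message):
        if not self.healthy:
            raise LinkDown(self.name)
        return f"{self.name} delivered {message}"


class RedundantLink:
    """Try each link in order and fall through to the next on failure."""

    def __init__(self, *links):
        self.links = list(links)

    def send(self, message):
        failures = []
        for link in self.links:
            try:
                return link.send(message)
            except LinkDown as exc:
                failures.append(str(exc))
        # Only reached when every link has failed.
        raise LinkDown("all links failed: " + ", ".join(failures))


# Simulate the bad network card: the primary path is down.
primary = Link("primary", healthy=False)
secondary = Link("secondary")
fabric = RedundantLink(primary, secondary)
result = fabric.send("manifest-001")
print(result)  # secondary delivered manifest-001
```

A real switchover would also need health checks and alerting, so that someone learns the primary has failed before the secondary does too.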
The second, and more important, is comprehensive, periodic failure testing, which is the key to confirming the robustness of a system.
One should expect components to fail, and there's no better way of finding out whether or not you have designed your
system to withstand common problems than by forcing those problems to occur. A good failure test involves powering down
components, overloading network traffic segments, restarting machines, disconnecting key pathways and even injecting known
problems into the system. Only by constantly preparing and testing for challenging scenarios are you able to effectively
prevent them from ever occurring.
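A toy version of such a failure test can be sketched in a few lines of Python. The component names and the health rule below are invented for illustration; a real test would power down real machines and watch real monitoring. The idea is simply to take out each component in turn and record which single outages bring the whole system down.

```python
def find_single_points_of_failure(components, is_operational):
    """Disable each component in turn and record which outages stop the system."""
    critical = []
    for name in sorted(components):
        survivors = components - {name}
        if not is_operational(survivors):
            critical.append(name)
    return critical


# Hypothetical inventory: two database replicas, but only one network card.
components = {"db-a", "db-b", "nic-1", "app-1", "app-2"}


def is_operational(up):
    """Toy health rule: need one database, one app server, and one NIC."""
    has_db = bool(up & {"db-a", "db-b"})
    has_app = bool(up & {"app-1", "app-2"})
    has_nic = bool(up & {"nic-1"})
    return has_db and has_app and has_nic


spof = find_single_points_of_failure(components, is_operational)
print(spof)  # ['nic-1']
```

In this made-up inventory the databases and application servers are duplicated, so the only single point of failure the test reports is the lone network card.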
Redundancy and rigorous testing are not a perfect solution, but they are the best tools we have to ensure the reliability of
systems. We must not only design them to be able to survive individual failures, but we must actively test them by forcing
the failures to occur and monitoring the results. Otherwise, like the passengers at LAX, you may get stuck for hours at
a tremendous financial loss.