Excerpts from the book
X-Events, Resilience, and Human Progress
John L. Casti
Roger D. Jones
Michael J. Pennock
MIRROR VISION AND COMPLEXITY OVERLOAD
On March 27, 1977, the deadliest airline disaster in history took place at Los Rodeos Airport on the island of Tenerife in the Canary Islands. Two Boeing 747 aircraft, KLM Flight 4805 and Pan Am Flight 1736, collided in fog. The crash killed 583 people, with just 61 survivors. The accident has been analyzed in depth, and the conclusion is that it was caused by a cascade of errors in both communication and human nature.
The communication aspect was probably exacerbated by the fact that much traffic, including the two airliners involved in the crash, had been diverted from the major airport in the islands, Gran Canaria Airport, due to the explosion there of a terrorist bomb. This overload of traffic certainly increased the volume of communication between the airport tower at Los Rodeos and the flight crews, which many felt left the aircraft crews far more confused than usual as they tried to follow the tower’s instructions.
On the human nature side, KLM pilot Jacob van Zanten was eager to get his plane off the ground, since he and his crew were nearly at the limit of their on-duty time. If they didn’t take off soon, they would have to stay in Tenerife overnight and the flight would be postponed until the next day. Van Zanten made a fatal mistake by beginning his takeoff when he heard the message “You are clear” from the tower. His wish to get into the air likely caused him to overlook the fact that he needed a second clearance before he could take off. But van Zanten’s failure to wait for that clearance still required one more confusion to give rise to the devastating accident.
As the KLM flight was beginning its takeoff, the Pan Am flight was attempting to find its assigned taxiway. But in the heavy fog, the Pan Am crew overshot that taxiway and as they hunted for the right path they ended up directly on the same runway that the KLM flight was using for takeoff. As the Pan Am plane emerged from the fog, van Zanten tried to get his plane into the air but his attempt was just a bit too late. The Dutch plane bounced off the top of the 747 upper cabin, ripping it from the Pan Am plane and sending the KLM plane more than 100 feet into the air before it crashed down and exploded into a ball of fire. The Pan Am plane was also sliced to pieces and caught fire.
So what we have here is a case of communication complexity overload, aided and abetted by the sense of urgency on the part of the KLM crew to get their plane into the air. If any one of these factors had not been present, there would have been no accident. Almost the very same statement can be made for the overwhelming majority of airplane accidents. The details, of course, are different. But the structure of the story is always the same. A constellation of individual mistakes and misunderstandings combines to create a horrific accident that could otherwise have been avoided. So let’s distill this pattern for airline accidents into some general principles for complex systems.
Airplanes, corporations, banks, and many other organizations have detailed plans for how to address what might be termed “single-point failures,” that is, failures that arise from a single problem. For dealing with such “simple” failures, systems have redundant controls, backup systems, manual overrides, warning systems, and the like to compensate for and address them.
The real problem with complex systems, though, is that the very complexity of the system generally means that failures occur when many things go wrong at once. The Tenerife air disaster is a textbook example of this kind of failure. In that case, no single safety measure, compensating redundancy, or fail-safe mechanism put in place to address just one of the contributing failures would have sufficed. As we saw, several different types of failure all had to occur for the crash to take place. Can we do better than simply protecting against single-point failures? James Reason and Dante Orlandella of the University of Manchester in the UK thought so and developed what’s now termed the Swiss Cheese model for accident analysis and prevention in several areas, including aviation.
The Swiss Cheese model can be thought of in the following terms. First, complex systems have several ways to defend against failure. Each of these defenses can be thought of as a slice of Swiss cheese, with each hole in a slice corresponding to a point-failure within the system, that is, a type of failure that the slice cannot defend against. To see the entire defense, you stack up all the slices. For a total systemic failure to occur, a hole in each slice must line up with holes in all the other slices. Therefore, as long as the positioning of the holes on each slice is more or less independent of the others, a failure will most likely be averted. So the basic idea is to have many defenses, each protecting against a particular point-failure. Unless the holes are strongly dependent, i.e., the slices only look different and are actually protecting against roughly the same failures, the totality of the stacked slices will protect against cascades of the sort that brought down the two planes in Tenerife.
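To put rough numbers on why independence matters, here is a minimal Python sketch. It is an illustration only, not part of the Swiss Cheese model as originally formulated. It assumes each slice can be summarized by a single probability that a hazard finds a hole in it. The function names and the 10 percent figures are hypothetical. If the holes are independent, the chance of a breach is the product of those probabilities. If the holes overlap as much as they possibly can, the chance is set by the tightest slice alone.

```python
from math import prod


def breach_independent(hole_probs):
    """Chance that a hazard slips through every slice when the holes
    in each slice are positioned independently of the others."""
    return prod(hole_probs)


def breach_fully_dependent(hole_probs):
    """Chance of a breach in the worst case, where the holes are nested:
    whatever gets past the tightest slice gets past all the others too."""
    return min(hole_probs)


# Hypothetical numbers: four defenses, each missing 10% of hazards.
slices = [0.1, 0.1, 0.1, 0.1]
print(f"independent holes:     {breach_independent(slices):.4f}")
print(f"fully dependent holes: {breach_fully_dependent(slices):.4f}")
```

Under these assumed numbers, four independent slices let through about one hazard in ten thousand, while four slices that share the same holes do no better than one slice alone, at one in ten.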
To close this section, let’s now return to the question with which we began, namely, how can a corporation continue to prosper in the face of a continually changing environment? Perhaps the best way to see the answer is to look at a couple of examples of firms that dominated their niche until they collapsed, and analyze why they went into terminal failure.