Are you prepared for the wrong disaster?

19/11/2015

By Greg Jotham, Chief Quality Auditor at AQuA

Engineers are very good at dealing with things they can measure. Things that aren’t easy to measure can be harder to develop good processes for.

Herein lies a potential trap: you can devote all your effort to preventing measurable problems while neglecting the non-measurable, human-factors issues that can have just as great an adverse effect. If you fail to prepare for both kinds of problem, you’re preparing to fail.

Here are two well-known historical disasters that illustrate these two areas of risk.

The Tay Bridge

In June 1878 the Tay Bridge – designed by one of Britain’s most highly regarded engineers, Thomas Bouch – opened for rail services across the Firth of Tay, which up to that time could only be crossed by a ferry. This made a significant reduction in journey times and was regarded as a major advance. Bouch was given a knighthood.

Eighteen months later, in December 1879, the bridge collapsed into the water at the height of a storm, taking a train and the lives of around 75 people with it. The subsequent inquiry identified multiple failures in research, design, manufacture and maintenance as the direct causes of the collapse. Bouch’s reputation was destroyed, and he died less than a year later.

Takeaways from the Tay Bridge disaster:

1. Do the research, but do it thoroughly

Thomas Bouch already had advice to hand on the likely maximum wind force to factor into the design, but it had been given for a different type of bridge (a suspension bridge) in a different location (the Firth of Forth), and he re-used it for the Tay Bridge. The advice came from the Astronomer Royal, then the primary source of meteorological information, who said afterwards that if he had been asked about a girder bridge across the Firth of Tay (which is what Bouch delivered) he would have advised a figure four times higher: 120 pounds per square foot rather than 30, after applying a safety factor.

2. Identify best practice and use it, but be open to all inputs

William Rankine, Professor of Civil Engineering and Mechanics at Glasgow University, had some years previously written a book of rules and tables for engineers and architects. Noting that the highest wind pressure recorded by meteorologists in Britain was 55 lbs/sq ft, and that the commonly accepted safety factor was 4, he advised designing to withstand a wind force of 200 lbs/sq ft, saying: “in important structures, I think that the greatest possible margin should be taken. It does not do to speculate upon whether it is a fair estimate or not”.

Bouch and some other engineers doubted the accuracy of these figures, seemingly on the grounds that meteorologists were “experimentalists” and not engineers, and the same engineers preferred to work with a safety factor of 3 rather than 4.
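
To get a sense of the size of the gap, here is a rough comparison using only the figures quoted above (the exact historical values are still debated, so treat this purely as an illustrative sketch of the safety-factor arithmetic, not as established engineering data):

\[
\text{design wind load} = \text{safety factor} \times \text{assumed maximum wind pressure}
\]
\[
\frac{\text{Rankine: } 200\ \text{lbs/sq ft}}{\text{Bouch: } 30\ \text{lbs/sq ft}} \approx 6.7
\]

On those figures, the wind loading the Tay Bridge was designed to withstand was less than a sixth of what the most cautious contemporary guidance recommended.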

This sounds like it could have been an example of “not invented here” syndrome, a danger that arises when you only take input from people with the same technical background or commercial imperatives. “This is how we’ve always done it” is a profoundly dangerous approach unless it is backed up by continuous research showing it is still valid.

3. Multiple small errors can be as dangerous as one large one

The design had many under-specified components and materials. Individually, the failure of any one of them would not always have posed a serious risk, but taken together they made a much larger failure possible. As is sometimes said of this kind of multiple failure in modern engineering incident analysis, “the holes in the Swiss cheese all lined up”.

4. Implement a quality control system appropriate to the risk

Components were not properly inspected for quality of manufacture, and assembly was not checked against any defined standard.

The foundry making the bridge castings was in financial difficulties (it went into liquidation before the bridge collapsed), and local supervisors, unknown to the owners, had implemented unsafe workarounds to avoid having to remake failed components. Where the critical bolt-down lugs that held the uprights to the foundations were incomplete, they used a crude form of welding to attach fill-in pieces that had none of the strength needed; where air-holes or voids were found in the castings, they filled them with a mixture of iron filings, wax and other materials that looked like iron and would deceive a magnet, but had no strength. Component inspection by the engineers was also inadequate, so none of these failings were identified.

Other failings in manufacture meant that parts could not be fitted accurately, so packing pieces had to be used, which reduced strength and rigidity.

No proper provision was made for inspection after completion, so maintenance staff simply kept adding packing pieces where components shifted or distorted, without reporting the problems back. The resulting structure was neither strong enough nor rigid enough to cope with the full force of a gale, and it collapsed. Better measurement, quality control and reporting procedures might have prevented this disaster.

The Titanic and the human disaster

1. Always have a Plan B and question your assumptions, even if following best practice

Best practice at the time was “the ship is the best lifeboat”: it was far easier to make a ship difficult to sink and keep the passengers on it than to consign them to open boats on the often stormy North Atlantic, so keeping them aboard was considered safer. From this came the assumption that you didn’t need lifeboats for everyone, because the ship could be kept afloat until another vessel could reach it to carry out a rescue; lifeboats were only for transferring passengers to that other ship. No-one asked “what happens if the ship is going to sink before rescue can reach it?” There was no Plan B. Best practice is not always enough, and “we’ve always done it this way” is not sufficient.

Engineers knew that a ship of that size could only turn slowly, so the most likely collision would be head-on, a configuration in which all ships are at their strongest. Five years earlier another liner had rammed an iceberg head-on and, despite a crushed bow, had been able to complete its voyage; in the same year the man who was to become Titanic’s captain had said in an interview that he could not “imagine any condition which would cause a ship to founder. Modern shipbuilding has gone beyond that”. From these assumptions it was easy to calculate that the worst possible impact at full speed would be unlikely to breach more than two of the sixteen watertight compartments, and it was known that the ship would stay afloat indefinitely with two compartments flooded, and should still be safe even with three or four. The assumption that any collision would be head-on was never questioned.

By the same token, it was assumed to be unimportant that the watertight compartments were in fact only separated by bulkheads and were open at the top. With twelve or thirteen compartments unbreached, the reasoning went, the ship would always float high and level enough that water could not spill over the tops of the bulkheads. This assumption was also never questioned.

2. Understand human factors – how people will use your product

No-one allowed for the fact that humans, with all of their most vulnerable structures at the front of the body, will instinctively try to turn away from any collision, and if moving quickly will try to stop. The same behaviour occurs whether someone is running solo or piloting thousands of tons of steel. When the iceberg was sighted, the instant command was to reverse the engines and put the helm full over. Yet if either one of those actions had been taken alone, the ship would probably not have sunk:

  • If nothing had been done and the ship struck the iceberg head-on at full speed, then-current experience suggested that the structure would survive the impact without a fatal number of compartments flooding.
  • If just the engines had been reversed with no course change, the same survivable collision scenario would apply, but with less damage.
  • If just the helm had been put hard over with no reversal of the engines, the best available evidence, supported by modern calculations, suggests that the ship might have narrowly missed the iceberg, or struck it only lightly and turned clear, instead of scraping along it for several hundred feet.

However, because the engines were reversed and the helm put hard over at the same time (a natural human reaction), the ship’s design meant that the effectiveness of the rudder was greatly reduced. The ship therefore turned more slowly, leading to a “sideswipe” collision that opened a series of breaches in the hull along a total length of some 300 feet. This made it certain that five compartments would flood, and that the ship could not remain afloat long enough for rescue to arrive.

If the human factors involved, in this instance the natural behaviour of people in an emergency, had been better understood, then both the outdated lifeboat regulations and the design assumptions that made the ship less steerable when the engines were reversed might have been challenged. Improvements to either could have reduced the loss of life, and perhaps even prevented the disaster.

Common factors in both disasters

Each of these instances shows the dangers of a closed mindset. When research, design and development are reduced to box-ticking exercises that copy what has been done before, people become unwilling to listen to alternative viewpoints. Anything that might undermine the assumptions behind existing processes comes to be viewed as a weakness rather than a source of improvement. People draw a false sense of security from the unchanging nature of the processes, instead of testing and verifying those assumptions to establish genuine security, and continuous improvement is stifled.

Conclusion

As I said at the beginning, it’s easy to conceive ways of guarding against things that can be inspected and quantified, and these techniques are important for minimising risk. They stand at the heart of much of the testing and QA that is done around the world. However, unless you also strive to understand how people will use your product, and the problems that could arise from incorrect assumptions about usage, you’re leaving yourself open to risks that can be as serious as those from measurable failures. With some products, that’s perhaps a risk of customer dissatisfaction, or maybe reputational damage at worst – potentially serious matters. But in safety-critical environments the risks from not understanding human factors are far greater.
