The Cascade Nobody Saw Coming

28 June 2026 · 10 min read

If you have missed the earlier articles in this series, glance through these for a quick recap:

The Green Dashboard Trap: Why ERP and MES Projects Fail Long Before Anyone Reports Red

What a Steering Committee Actually Is, and Why You Are Using It Wrong

Why Scale Changes Everything and Why Your Dashboard Lies at a Different Altitude

There is a particular kind of programme post-mortem that experienced delivery leaders find the most painful.

Not the ones where the failure was unpredictable. Those are hard, but they are clean. Nobody saw it coming because nobody could have seen it coming. The team did what it could. The circumstances were genuinely exceptional.

The painful post-mortems are the ones where someone, at some point, in some workstream, knew. Not everything. Not the full shape of the eventual disaster. But enough. Enough to have said something. Enough to have raised a flag, changed a colour on a dashboard, picked up the phone to the programme director.

And did not.

These post-mortems have a specific texture. The questions are not “how did this happen?” They are “when did we first know this was possible, and what did we do with that knowledge?”

The answers are almost always the same. The first knowledge came early. What was done with it was nothing. Or something that looked like nothing: a RAID entry, a note in a workstream log, a bilateral conversation that produced a commitment that was not followed up.

And between that first knowledge and the moment the programme finally felt the impact, four things happened in sequence. Each one made the next one worse. Each one consumed options that would have been available if the sequence had been interrupted at any earlier point.

The Four Components of a Cascade

A cascade is not a sudden event. It is a process. It has a structure that repeats itself across programmes with enough consistency to be named, described, and designed against.

Understanding the structure precisely is not an academic exercise. It is the precondition for being able to interrupt the sequence before it reaches the stage where interruption is no longer possible.

[Figure: The Four Components of a Cascade (AI generated)]

Component One: The Origin Event

The origin event is the moment when a workstream lead identifies information that has implications beyond her own workstream.

It might be a data quality issue in a source system. It might be a vendor resource change that affects a delivery commitment. It might be a configuration decision that has been made in one workstream and has not yet been communicated to the workstream whose interface specification it affects. It might be a technical constraint that has become visible during development that was not visible during design.

At the origin event, several things are true simultaneously that will not remain true for long.

The issue is at its smallest. It has not yet propagated through the dependency network. The workstreams downstream of the origin have not yet built their plans against the incorrect assumption. The options available to the programme are at their widest: redirect, redesign, resequence, escalate, procure, negotiate. The cost of any of these options is at its lowest.

The origin event is the best possible moment to act. It is also, almost always, the moment at which nothing happens. Because at the origin event, the problem is newly identified, its full implications are not yet understood, and the instinct of the workstream lead is to assess before escalating. The assessment takes time. The time has a cost. And the cost is paid not by the workstream lead who is doing the assessing, but by every workstream downstream of her.

Component Two: The Suppression Period

The suppression period is the interval between the origin event and the moment the issue surfaces through the governance structure.

The word “suppression” is deliberately chosen and occasionally unfair. Not every workstream lead who does not escalate immediately is consciously suppressing information. Many are genuinely assessing, genuinely working toward a resolution, genuinely believing the problem is manageable within the workstream’s own capacity.

But the effect is the same regardless of the intent. During the suppression period, the issue is growing. It is not static while the assessment continues.

The downstream workstreams are making decisions that depend on an assumption the suppressed information has already invalidated. The project plan is advancing on a foundation that has a crack in it. The steering committee is making decisions, approving plans, committing resources, and managing stakeholder expectations against a picture of the programme that is no longer accurate.

Every week of suppression is a week in which the programme’s options narrow and the cost of every remaining option increases. The relationship between suppression duration and eventual cost is not linear. It is multiplicative, because the cost propagates through an interdependent system, and interdependent systems amplify inputs.

A four-week suppression period in a single project produces a four-week recovery challenge. A four-week suppression period in a programme of eight interdependent workstreams can produce a five-month recovery challenge, because the four weeks of suppression allowed eight workstreams to advance their plans against an incorrect assumption, and unwinding eight workstreams’ plans is not eight times the work of unwinding one. It is more.

Component Three: The Trigger

The trigger is the moment the suppressed issue surfaces. And here is the thing about the trigger that makes cascades so disorienting for programme leadership: the trigger almost never occurs at the origin workstream.

It occurs downstream. At the point where a workstream that was depending on something that no longer exists tries to use it and discovers the problem.

The integration test that tries to run a data flow and gets results that are physically impossible. The UAT session that tries to use customer records and finds that a third of them have missing fields. The go-live cutover that tries to confirm production orders and finds that the MES and the ERP are using different material codes.

At the trigger moment, the programme director and the steering committee experience the problem for the first time, in a form that is several workstream boundaries removed from its origin. They see the integration test failure. They do not immediately see the weighbridge sensor. They see the UAT data problem. They do not immediately see the data migration scope decision taken three months earlier.

The investigation that connects the trigger to the origin takes time. During that time, the programme is in reactive mode, running a diagnosis that should have been unnecessary, at the moment when it has the least time and the fewest resources to spare.

And when the origin is finally identified, the question that hangs over the room is always the same: why was this not surfaced when it was first known?

The workstream lead at the origin has an answer. It is always some version of: we were assessing, we thought we could manage it, we did not realise the downstream implications.

These answers are not lies. They are the honest description of a workstream lead who did not understand that her technical problem was the programme’s structural problem. That is a governance failure, but it is also a training failure, a communication failure, and a programme architecture failure. It is rarely a character failure.

But it is always expensive.

Component Four: The Amplification

The amplification is what turns a workstream-level problem into a programme-level crisis.

In a single project, the cost of a problem is bounded by the project. There is one plan, one team, one budget, one timeline. The problem affects all of these, but it affects them as a single unit. The recovery is correspondingly bounded.

In a programme, the cost of a problem propagates through the dependency network. Every workstream that has built a plan against the suppressed information has to revise that plan. Every testing window that was sequenced on the assumption of a dependency being available has to move. Every external commitment, client communication, and resource allocation that was made on the basis of the programme’s reported state has to be revisited.

The amplification factor is determined by the density of the dependency network and the duration of the suppression period. A problem that is suppressed for three months in a programme with eight interdependent workstreams has had three months to embed itself in the assumptions of every workstream downstream of the origin. Unwinding those assumptions takes longer than the suppression period because the unwinding itself creates new dependencies and new sequencing constraints.

This is why the maths of cascade failure is so counterintuitive. A three-month suppression does not produce a three-month delay. It produces a delay that is a multiple of three months, because the recovery has to sequence through the same dependency network that the suppression corrupted, in the reverse direction, under conditions of time pressure and reduced confidence.

The Weighbridge Sensor: A Complete Cascade in One Example

The weighbridge sensor story is worth examining in full detail because it illustrates all four components with unusual clarity, and because it is the kind of problem that repeats itself in steel plant implementations in different forms with enough regularity to constitute a pattern.

The Origin: Month Ten

An integrated steel plant is running a full production chain MES implementation. Seven workstreams, thirty months planned, external implementation partner alongside an internal IT team.

The raw material intake workstream, which is workstream one in the programme, is responsible for instrumenting and integrating the plant’s incoming material receipt and storage systems into the MES. This includes the weighbridge sensors that measure the weight of incoming iron ore, coal, and flux material at the plant gate.

In month ten, the workstream lead discovers that the weighbridge sensors are producing inconsistent readings. The error rate is approximately 8%, measured against reference weighments taken manually. This is within the acceptable tolerance for the legacy reporting system, which has been using these sensors for fifteen years and has calibrated its reporting practices around the known inaccuracy.

It is not within the acceptable tolerance for the new MES.

The MES’s blast furnace scheduling module, which is workstream two in the programme, has been designed around an optimisation algorithm that calculates optimal burden composition for the blast furnace based on the chemical and physical properties of the incoming raw materials. This algorithm requires material weight data with an accuracy tolerance of plus or minus 2%. The weighbridge sensors are delivering accuracy of plus or minus 8%.

The workstream lead knows about the sensor accuracy issue. She does not know about the 2% accuracy requirement in workstream two’s algorithm, because the interface specification between the raw material intake system and the blast furnace scheduling module describes the data format and frequency, but not the accuracy threshold required by the consuming algorithm.

She assesses the sensor situation. Replacing the sensors is a capital expenditure item: new sensors, installation, calibration, and recommissioning. It requires plant operations approval and a capital expenditure sanction. She estimates this will take time that she does not have budget for in her current workstream plan.

She logs the sensor accuracy as a technical risk in her workstream RAID register. She marks it as being assessed. She does not cross-reference it against the interface specification. She does not contact workstream two. She does not raise it at the programme level.

She reports Green.

The Suppression Period: Months Ten to Thirteen

For three months, the sensor accuracy issue sits in workstream one’s RAID register while workstream two builds its optimisation algorithm against a simulated data feed that meets the 2% accuracy specification. The simulation results are excellent. The algorithm is performing exactly as designed.

Workstream three, which handles the sinter plant and coke oven integration, has a secondary dependency on raw material intake data for its burden calculation models. Workstream three’s team has designed its burden calculation interface on the same accuracy assumptions as workstream two, for the same reasons: the interface specification does not specify accuracy thresholds explicitly, and the team has assumed the sensors will deliver data of sufficient quality.

Workstream five, which handles the production reporting and management information system, is building its daily production report against raw material intake data. The report template has been signed off by the plant management team. The data quality assumption embedded in the report design is consistent with what the weighbridge sensors will deliver: approximately plus or minus 8%. Nobody has questioned whether this is sufficient for the management information the report is supposed to provide.

The programme’s steering committee receives Green status reports from workstream one in months ten, eleven, and twelve. The sensor accuracy risk does not appear in the programme-level RAID report because it has not been escalated from workstream one’s local register.

The steering committee includes the plant operations director, who has the authority to expedite a capital expenditure sanction through the plant’s approval process. The committee does not know the sensor issue exists.

The Trigger: Month Thirteen

In month thirteen, the integration testing workstream runs the first end-to-end data flow test. Live data flows from the raw material intake system through the programme’s integration layer into the blast furnace scheduling module.

The algorithm runs. The output is a burden recommendation that, in the words of the integration testing lead, “makes no physical sense.” The recommended iron ore blend ratio for the next shift exceeds the maximum possible volume that can be processed through the blast furnace’s burden system in a single tapping cycle.

The testing team flags the result as a system error. The assumption is that the algorithm has a software bug.

It does not have a software bug. The algorithm is working exactly as designed. It is producing an absurd recommendation because it has received input data with an 8% error rate and is attempting to optimise against a material composition picture that does not reflect the actual raw material inventory in the plant’s stockyard.

The investigation takes three weeks. When the root cause is identified as the sensor accuracy, the cascade assessment is immediate and alarming.

Workstream two’s algorithm needs a data validation and error correction layer that was not in its original design. This layer will need to identify sensor readings that fall outside acceptable accuracy bounds, apply a correction factor based on the sensor’s known calibration drift, and flag readings that cannot be corrected with sufficient confidence for manual review. The design and build of this layer takes four weeks.
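The three jobs of that layer (detect, correct, flag) can be sketched in a few lines of Python. This is an illustrative sketch only, not the programme's actual design. The function name `validate_reading`, the `drift_factor` parameter, and the idea of checking each reading against a reference weight (for example a manual reference weighment) are assumptions made for the example. Only the plus or minus 2% tolerance comes from the story above.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACCEPTED = "accepted"            # raw reading already within tolerance
    CORRECTED = "corrected"          # within tolerance after drift correction
    MANUAL_REVIEW = "manual_review"  # cannot be corrected with confidence


@dataclass(frozen=True)
class Result:
    weight_t: float
    status: Status


def validate_reading(raw_t, reference_t, drift_factor, tolerance=0.02):
    """Classify one weighbridge reading (tonnes) against a reference weight.

    reference_t and drift_factor are hypothetical inputs: a trusted
    reference weight and a multiplicative correction for known drift.
    """
    def relative_error(value):
        return abs(value - reference_t) / reference_t

    if relative_error(raw_t) <= tolerance:
        return Result(raw_t, Status.ACCEPTED)

    corrected = raw_t * drift_factor
    if relative_error(corrected) <= tolerance:
        return Result(corrected, Status.CORRECTED)

    # Neither the raw nor the corrected value is trustworthy.
    return Result(raw_t, Status.MANUAL_REVIEW)


ok = validate_reading(100.5, 100.0, drift_factor=0.93)
fixed = validate_reading(108.0, 100.0, drift_factor=0.93)
review = validate_reading(92.0, 100.0, drift_factor=0.93)
```

In this sketch a reading that over-reads by 8% is pulled back inside tolerance by the drift correction, while a reading that under-reads by 8% is made worse by the same correction and goes to manual review. That asymmetry is one reason a single correction factor is unlikely to be the whole answer.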

Workstream three’s burden calculation models have the same accuracy dependency. A review of workstream three’s design reveals that the sinter plant burden optimisation function is sensitive to iron ore grade data in a way that the 8% sensor error will corrupt. Workstream three needs to implement the same validation and correction approach as workstream two. This adds three weeks to workstream three’s remaining delivery plan.

Workstream five’s management information reporting has been designed against the assumption of plus or minus 8% sensor accuracy, an assumption the plant management team did not realise they were accepting when they signed off the report template. A review session with the plant management team reveals that three of the seven key metrics in the daily production report are not meaningful at that accuracy. The report template needs to be redesigned. This takes two weeks.

The sensor replacement, which in month ten was a planned capital expenditure item that could have been processed through the normal approval cycle in six to eight weeks, is now an emergency procurement. The sensors required are specialist industrial weighing instruments with a twelve-week manufacturing lead time. Emergency expediting reduces this to ten weeks, at a 40% cost premium over the standard procurement price.

The integration testing window, planned for months thirteen to fifteen, cannot proceed while the sensor issue is unresolved and the algorithm modifications are incomplete. It moves to months seventeen to nineteen.

The go-live date, planned for month thirty, moves to month thirty-six.

The total additional cost across the programme (sensor emergency procurement, algorithm redesign and rebuild across two workstreams, management reporting redesign, integration testing window rescheduling, extended implementation partner engagement for six additional months, and internal resource deployment costs) amounts to 28% of the original programme budget.

One sensor accuracy issue. Known in month ten. Suppressed for three months. A six-month delay and 28% of budget.

The Governance Interpretation

The thing that makes this example important is not the sensor. Sensor drift in industrial environments is routine. Weighbridge sensors in steel plants operate in harsh conditions, they drift out of calibration over time, and they require periodic replacement as a matter of normal plant maintenance. This is not an unexpected event. It is a normal event that was identified, assessed, and incorrectly classified as a workstream-level technical issue rather than a programme-level structural risk.

[Figure: The Three Altitudes of Governance (AI generated)]

The misclassification happened because the workstream lead was looking at the sensor from inside her workstream. From inside workstream one, the sensor is a raw material intake instrumentation issue. It is the kind of technical problem that the workstream resolves as part of its normal scope of work.

From the programme level, the sensor is the data source for an interface that two other workstreams have built their core algorithms against, with accuracy requirements that the sensor currently does not meet and cannot meet without a capital expenditure that requires plant operations approval.

From the steering committee level, which includes the plant operations director, the sensor is a capital expenditure approval that can be expedited in a matter of days, followed by a procurement that can be planned and sequenced within the programme’s existing timeline if actioned in month ten.

The same object. Three different views from three different altitudes. The governance failure was not that the workstream lead had the wrong view. It was that only her view was ever brought to the table.

If the programme’s governance structure had surfaced the sensor issue at the programme level in month ten, the plant operations director in the steering committee would have expedited the capital expenditure approval. The sensors would have been on order in month eleven. Allowing for the twelve-week manufacturing lead time, they would have been installed and commissioned around month fourteen, in time for live-data testing within the integration testing window planned for months thirteen to fifteen. The algorithm would have been developed against a correctly specified data feed and would not have needed a redesign. The integration testing window would have proceeded as planned. The go-live would have been in month thirty.

The steering committee had the authority to solve this problem in a single conversation. The workstream lead had the information that should have triggered that conversation.

The information and the authority were never in the same room at the same time.

The Pattern Behind the Cascade: What Steel Plant Programmes Do Repeatedly

The weighbridge sensor story is a specific instance of a pattern that occurs in steel plant MES, ERP, and APS implementations with enough regularity to be considered a systemic risk rather than an individual failure.

The pattern has three characteristics that make it particularly persistent.

The interface specification describes format, not quality.

In most steel plant implementations, the interface specification between workstreams documents the data format, the frequency of transfer, the system of record, and the owner of each data element. It does not document the quality threshold required by the consuming system.

This gap is rational in origin. The team that wrote the interface specification knew what data would flow. They did not necessarily know what quality level each consuming algorithm would require. The algorithm requirements were being developed in parallel with the interface specification and were not yet fully defined when the specification was written.

But the gap creates a structural blind spot: a workstream lead who is assessing a data quality issue cannot cross-reference it against consuming algorithm requirements that are not in the interface specification. She checks the specification. The specification says the data format is correct. She does not know that the accuracy is wrong for the consuming algorithm, because the specification does not specify accuracy.

The fix is straightforward: interface specifications in steel plant implementation programmes should include quality thresholds, not just format specifications. This requires the consuming workstream to specify its quality requirements during the design phase, which requires the producing workstream to know those requirements before finalising its instrumentation design. This is a programme architecture conversation that most programmes do not have.
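One way to picture an interface specification that carries quality thresholds as well as format is the short Python sketch below. It is a hypothetical structure, not a standard or the programme's real template. The class and field names, the `data_format` and `frequency` values, and the 2% figure for workstream three (the story only says it shared workstream two's assumption) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Consumer:
    workstream: str
    max_error_pct: float  # accuracy the consuming algorithm needs


@dataclass(frozen=True)
class InterfaceSpec:
    name: str
    producer: str
    data_format: str   # what most specifications already record
    frequency: str     # what most specifications already record
    consumers: Tuple[Consumer, ...]  # the part that is usually missing


def unmet_requirements(spec, measured_error_pct):
    """Return the consumers whose accuracy requirement the producer misses."""
    return [c for c in spec.consumers if measured_error_pct > c.max_error_pct]


weighbridge_feed = InterfaceSpec(
    name="raw material weights",
    producer="workstream one",
    data_format="one record per weighment",  # hypothetical value
    frequency="per truck",                   # hypothetical value
    consumers=(
        Consumer("workstream two", 2.0),
        Consumer("workstream three", 2.0),   # assumed equal to workstream two
        Consumer("workstream five", 8.0),
    ),
)

affected = [c.workstream for c in unmet_requirements(weighbridge_feed, 8.0)]
```

With the thresholds written into the specification, a producing workstream lead who measures an 8% error can see at once which consumers are affected, instead of confirming only that the format is correct.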

Capital expenditure items are treated as outside the programme’s scope.

In steel plant implementations, there is a persistent tendency to treat capital expenditure items (physical instrumentation, hardware, infrastructure upgrades) as belonging to the plant’s normal capital programme rather than the implementation programme. This means they follow the plant’s standard capital approval process rather than the programme’s governance process, which creates a disconnect in authority and timeline.

The workstream lead in the sensor example did not escalate to the steering committee partly because she saw the sensor replacement as a plant capital expenditure item, not a programme issue. It was outside the scope of what the programme governed.

This is the wrong classification. If the programme’s delivery depends on a capital expenditure item being completed within a specific timeline, that item is a programme dependency. It should be in the programme RAID register at the programme level, with the steering committee’s awareness, regardless of which approval process it goes through.

The programme’s dependency mapping does not extend to the physical layer.

In digital transformation programmes, the dependency mapping tends to focus on software and data dependencies: which system feeds which system, which configuration depends on which data model, which testing depends on which build. It rarely extends systematically to the physical layer: which software function depends on which physical instrument performing to a specific specification.

In a steel plant context, this is a significant gap. The MES’s optimisation algorithms are deeply dependent on the quality of data from physical instruments: weighbridges, level sensors, temperature probes, flow meters, chemical analysers. The accuracy, availability, and calibration status of these instruments directly affects the quality of the MES’s outputs.

A programme that does not include physical instrument specifications in its dependency mapping is a programme that has a systematic blind spot in its risk identification. The weighbridge sensor will not appear in the programme’s risk register unless someone has explicitly asked: “what physical instruments does this programme depend on performing to specification, and what is their current calibration status?”

Most programmes do not ask this question. It should be in the programme’s design phase checklist.
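As a rough sketch of what a documented answer to that checklist question might look like, the Python below records each physical dependency with its required and measured accuracy. The class name, field names, and the temperature probe entry are hypothetical. Only the weighbridge figures come from the example. The point of the sketch is that "never verified" is treated as its own status rather than as a silent pass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstrumentDependency:
    instrument: str
    software_function: str
    required_error_pct: float
    measured_error_pct: Optional[float] = None  # None means never verified


def verification_status(dep):
    """Classify one physical dependency for the programme risk register."""
    if dep.measured_error_pct is None:
        return "unverified"
    if dep.measured_error_pct > dep.required_error_pct:
        return "out of specification"
    return "within specification"


register = [
    InstrumentDependency("weighbridge", "blast furnace scheduling", 2.0, 8.0),
    # Hypothetical second entry, included to show the unverified case.
    InstrumentDependency("temperature probe", "example function", 1.0),
]

report = {d.instrument: verification_status(d) for d in register}
```

Anything that is not "within specification" is a candidate for the programme-level RAID register, whichever organisational process owns the fix.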

Tips for Project Managers and Workstream Leads: How to Recognise a Cascade Before It Starts

One: When you find a risk, immediately ask who else depends on the thing that is at risk.

This is the single most important habit change that prevents cascades. Before assessing whether a risk is manageable within your workstream, ask one question: is the output that this risk affects also an input to any other workstream?

If the answer is yes, the risk is not a workstream-level risk. It is a programme-level risk. It belongs in the programme RAID register and in the next programme governance conversation, regardless of whether you believe it is manageable within your workstream.

This is not about losing ownership of the risk. It is about giving the programme the information it needs to understand the risk correctly from the programme level, not just from the workstream level.
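The "who else depends on this?" question is a walk through the dependency network, and can be sketched as a breadth-first traversal in Python. The `feeds` map below is a simplified, illustrative version of the weighbridge example, and the function names are hypothetical. A real programme would build this map from its interface register.

```python
from collections import deque


def downstream_of(origin, feeds):
    """Return every workstream that directly or indirectly consumes origin's output."""
    seen = set()
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for consumer in feeds.get(current, ()):
            if consumer != origin and consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


def classify_risk(origin, feeds):
    """Apply the rule above: any downstream consumer makes it a programme-level risk."""
    return "programme-level" if downstream_of(origin, feeds) else "workstream-level"


# Illustrative producer -> consumers map (an assumption, not a real register).
feeds = {
    "raw material intake": [
        "blast furnace scheduling",
        "sinter and coke oven",
        "production reporting",
    ],
    "blast furnace scheduling": ["integration testing"],
}

impacted = downstream_of("raw material intake", feeds)
level = classify_risk("raw material intake", feeds)
```

The classification deliberately ignores whether the risk feels manageable. It depends only on whether anyone else consumes the output.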

Two: Read the interface specification from the other direction.

Most workstream leads read the interface specification to understand what they need to send and what they will receive. Few read it from the perspective of the consuming workstream: what quality, accuracy, completeness, and timeliness does the consuming system actually need from the data I am providing?

Spend thirty minutes in the design phase reviewing your outgoing interfaces from the consuming workstream’s perspective. If you do not know what quality threshold the consuming algorithm requires, ask. The answer may change how you assess the risks in your own workstream’s delivery.

Three: Treat capital expenditure dependencies as programme risks, not plant operational items.

If your workstream’s delivery depends on a piece of physical infrastructure, instrumentation, or equipment being available, calibrated, and performing to specification, that dependency belongs in the programme’s risk framework, not just the plant’s maintenance schedule.

Identify these physical dependencies explicitly in the design phase. Confirm the current performance of any instrumentation your workstream depends on. If there is a gap between current performance and the required specification, raise it at the programme level immediately, regardless of which organisational process owns the resolution.

Four: Do not confuse “assessing” with “managing.”

Assessment is the activity of understanding a risk. Managing is the activity of taking action to reduce the risk’s probability or impact. These are different activities, and confusing them is one of the most common sources of cascade failure.

A risk that is “being assessed” for three consecutive weeks is not being managed. It is being studied. If assessment has not produced a management action within two weeks, the risk should be escalated to the programme level, with the results of the assessment to date, and a clear statement of what is needed from the programme to move from assessment to management.
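The two-week rule is simple enough to automate against a RAID register export. The Python below is a minimal sketch under assumptions: the `RaidItem` fields, the status string, and the dates are all hypothetical, and a real register will have its own schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RaidItem:
    title: str
    status: str
    status_since: date  # date the item entered its current status


def needs_escalation(item, today, limit=timedelta(weeks=2)):
    """True when an item has sat in assessment longer than the limit."""
    return item.status == "being assessed" and today - item.status_since > limit


today = date(2026, 1, 26)  # hypothetical review date
items = [
    RaidItem("Weighbridge sensor accuracy", "being assessed", date(2026, 1, 5)),
    RaidItem("Vendor resource change", "being assessed", date(2026, 1, 19)),
    RaidItem("Interface field rename", "mitigation in progress", date(2025, 12, 1)),
]

to_escalate = [i.title for i in items if needs_escalation(i, today)]
```

The check does not judge the quality of the assessment. It only makes sure that an item cannot stay in the "being assessed" state unnoticed.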

Five: When the simulation works perfectly, ask what the simulation is not simulating.

The blast furnace scheduling algorithm performed perfectly in simulation. That performance created confidence that delayed the question: does the live data environment meet the accuracy assumptions of the simulation?

In steel plant implementations, the simulation environment is almost always cleaner than the live environment. This is not a flaw in the simulation. It is a feature: the simulation is designed to test the algorithm’s logic, not the data quality of the production environment. But it creates a false confidence if the team does not explicitly ask: what are the ways in which the live environment differs from the simulation, and have we tested our algorithm against those differences?

Ask this question in every review of simulation testing results. The answer will sometimes be reassuring. Occasionally it will surface the weighbridge sensor before the integration test does.

Tips for Steering Committee Members and Programme Directors: How to See a Cascade Before the Trigger

One: Create a standing agenda item called “Cross-Workstream Dependencies at Risk.”

Not workstream risks. Cross-workstream dependencies. The specific interfaces, data flows, and output-to-input connections between workstreams that are currently in the active planning horizon and that have any uncertainty attached to them.

This agenda item forces workstream leads to think about their dependencies before the governance session, and forces the programme director to maintain a consolidated view of the dependency network’s health. It surfaces the kind of concerns that are below the formal escalation threshold but above the “nothing to report” level, which is precisely the territory where cascade origins live.

Two: When a workstream RAID item has been “In Progress” for more than six weeks, ask to see it.

A RAID item that has not moved from “In Progress” in six weeks is either a genuinely complex issue that is receiving substantive attention, or a parked item that is receiving the appearance of attention. The committee cannot tell which from a status field.

Ask to see the assessment. What has been done in the six weeks? What has been learned? What are the options being considered? What is the remaining assessment timeline? This question does not need to be adversarial. It is simply the governance-level version of the question the workstream lead should be asking herself: is this item moving, or is it being stored?

Three: Commission an interface register review at the midpoint of the programme.

By the midpoint of a large implementation programme, the actual state of interfaces between workstreams has almost always diverged from the interface specifications written during the design phase. Configuration decisions, vendor deliveries, and design refinements have each made changes that were handled bilaterally or technically without being reflected in the master interface documentation.

A structured interface register review at the midpoint of the programme, conducted by the programme architect against input from all workstream leads, will surface these divergences while there is still time to address them. The cost of the review is small. The cost of discovering the divergences during integration testing is large.
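At its core, the review is a comparison between what the specification says and what the workstreams report today. A hedged Python sketch of that comparison follows. The dictionary layout, interface names, and field values are hypothetical. In practice the "actual" side comes from conversations with workstream leads, not from a file.

```python
def interface_divergences(documented, actual):
    """Return, per interface, the fields whose documented and actual values differ."""
    divergences = {}
    for interface in sorted(set(documented) | set(actual)):
        doc = documented.get(interface, {})
        act = actual.get(interface, {})
        changed = {
            field: (doc.get(field), act.get(field))
            for field in sorted(set(doc) | set(act))
            if doc.get(field) != act.get(field)
        }
        if changed:
            divergences[interface] = changed
    return divergences


# Hypothetical design-phase specification and midpoint reality.
documented = {
    "intake to scheduling": {"format": "weighment record", "frequency": "per truck"},
    "scheduling to reporting": {"format": "shift summary"},
}
actual = {
    "intake to scheduling": {"format": "weighment record", "frequency": "hourly batch"},
    "scheduling to reporting": {"format": "shift summary"},
}

found = interface_divergences(documented, actual)
```

Each entry pairs the documented value with the actual one, which gives the programme architect a concrete list to walk through with the two workstreams concerned.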

Four: Ask specifically about physical instrument dependencies in steel plant programmes.

In the quarterly programme health review, include a standing question about physical instrumentation dependencies: which instruments, sensors, and measurement systems does this programme’s software depend on performing to specification, and what is the current verification status of those specifications?

This question should be directed at the programme director and should produce a documented answer. If the programme director cannot answer it, the answer does not exist, which means the risk has not been assessed.

In the weighbridge example, this question alone could have surfaced the sensor issue in month ten rather than month thirteen.

The Final Point

A cascade is not an accident. It is a sequence. And sequences can be interrupted.

The sequence begins at the origin event, when a workstream lead identifies a risk that has programme-level implications and classifies it as a workstream-level issue. The sequence is interrupted if the risk is escalated at that moment, with sufficient information for the steering committee to understand its cross-workstream implications and deploy the authority needed to resolve it.

If the sequence is not interrupted at the origin event, it continues through the suppression period, growing in cost and narrowing in options with every week that passes. It can still be interrupted during the suppression period, with decreasing effectiveness as time passes.

If the sequence is not interrupted during the suppression period, it reaches the trigger, which is always downstream, always disorienting, and always more expensive than the origin event would have been to address. At the trigger, the cascade is no longer preventable. The only question is how far the amplification will travel before the programme can contain it.

The steering committee had the authority to prevent the cascade in the sensor example. The workstream lead had the information. The information and the authority never met, because the governance structure did not create the conditions for them to meet.

That is the cascade nobody saw coming. And it is, almost always, the cascade that was most visible to the person who chose not to look too closely.

Aankh band karna andhera nahi hai. Andhera toh tab aata hai jab sab dekhte hain aur bolte nahi. Closing your eyes is not darkness. Darkness comes when everyone can see and nobody speaks.

Have you been the integration lead who ran the test that failed and spent three weeks tracing it back to an origin that was months old? Or the steering committee member who asked “why didn’t we know in month ten?” and received an answer that was technically accurate and completely unsatisfying? Tell me what the post-mortem felt like. The comments are open.


Disclaimer: The incidents, characters, projects, and organisations referenced in this article are fictionalised composites drawn from recurring patterns observed across complex transformation programmes. Their purpose is to illustrate leadership and governance lessons rather than describe any specific organisation, project, customer, or implementation. The lessons, however, are very real.