Fix Forward

When a deploy goes wrong, the team has two basic recovery options. Roll back means returning the system to the previous known-good state, undoing the bad deploy. Fix forward means shipping a corrective change that moves the system toward the intended working state without reverting.

For most of the history of software, rollback was the default and fix-forward was the exception. Modern continuous-delivery practice has largely reversed this. Fixing forward is now the default in mature engineering organizations, and rollback is the emergency tool that gets used when fix-forward will not work in time.¹ ²

Why Fix-Forward Has Become the Default

A few factors have shifted the calculus:

Deploys are routine. A team that deploys many times a day can fix forward in minutes. Rolling back in the same environment often takes longer, because redeploying an older version has to be coordinated against code and system state that keep moving.
Schema and state evolve forward. A deploy that introduced a database migration, a new external integration, or a new flag state often cannot be cleanly rolled back. The schema is already changed; the third-party API has already been called; the flag system already remembers. Rolling back the code without rolling back the side effects can produce a worse state than fixing forward.
Rollbacks lose context. The bad deploy might have been ninety-percent right. Rolling back loses the ninety as well as the ten. Fix-forward keeps the good and corrects the bad.
Feature flags make targeted recovery easy. When a specific feature is misbehaving, flipping its flag off is faster, safer, and more targeted than rolling back the entire deploy.⁴
Smaller deploys make fix-forward small. A team that ships a small change can ship a small fix immediately. The change set under investigation is narrow enough to diagnose quickly.

The pattern works because the underlying engineering practices (small batches, automated pipelines, fast feedback) are all in place to make a forward fix shippable in minutes. Without those practices, fix-forward is harder than rollback. With them, the reverse holds.³
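The feature-flag point above can be sketched in a few lines. This is a minimal illustration, not any particular flag product's API: the FlagStore class, the flag name new_discount_engine, and the checkout_total function are all hypothetical. It shows the shape of the idea: the new code path stays deployed, and recovery is a flag flip rather than a redeploy.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory flag store; a real system would call a flag service."""
    flags: dict = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code paths stay dark until enabled.
        return self.flags.get(name, False)

    def disable(self, name: str) -> None:
        self.flags[name] = False


def checkout_total(prices: list, flags: FlagStore) -> float:
    """Return the order total, using the new discount path only when flagged on."""
    total = sum(prices)
    if flags.is_enabled("new_discount_engine"):
        # New code path shipped in the latest deploy (assumed to be the suspect).
        return round(total * 0.9, 2)
    # The previous known-good behaviour is still deployed alongside the new path.
    return round(total, 2)


flags = FlagStore({"new_discount_engine": True})
before = checkout_total([10.0, 20.0], flags)  # new path active
flags.disable("new_discount_engine")          # targeted recovery, no redeploy
after = checkout_total([10.0, 20.0], flags)   # old path restored
```

The design choice worth noting is the default-off lookup: a flag that is missing or unreadable falls back to the known-good path.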

When Rollback Is Still Right

Fix-forward is the default. Rollback remains the right answer in several scenarios:

The fix is not yet known. Diagnosing the problem will take longer than the user impact can wait. Rolling back buys time to investigate without bleeding.
The bad code is corrupting state. Every minute the bad deploy stays live, the consequences compound. The fastest stop is the right stop, even if recovery work continues afterwards.
The deploy was structurally invalid. A deploy that takes the system into an unrecoverable state (broken migrations, infinite loops, runaway costs) needs an emergency exit, not an iteration.
The team cannot ship the fix safely in the available time. When the deploy pipeline itself is degraded, or when CI is failing, fix-forward depends on infrastructure that may not be reliable in the moment.

The right framing is: "is the fix in hand, and can we ship it confidently in the next few minutes?" If yes, fix forward. If no, roll back and fix afterwards.
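The framing question and the rollback scenarios above can be written down as an explicit decision function. The sketch below is one possible encoding, not a prescribed procedure: the IncidentState fields, the choose_recovery name, and the idea of a numeric time budget are assumptions introduced for illustration, and a real incident call involves judgment that a boolean cannot capture.

```python
from dataclasses import dataclass
from enum import Enum


class Recovery(Enum):
    FIX_FORWARD = "fix forward"
    ROLL_BACK = "roll back"


@dataclass
class IncidentState:
    fix_in_hand: bool           # the corrective change is known and ready
    corrupting_state: bool      # the bad deploy is actively damaging data
    unrecoverable: bool         # broken migrations, runaway costs, and similar
    pipeline_healthy: bool      # CI and the deploy pipeline can be trusted now
    minutes_to_ship_fix: float  # estimated time to ship the fix
    minutes_budget: float       # how long the user impact can wait (assumed input)


def choose_recovery(s: IncidentState) -> Recovery:
    # Compounding damage or an unrecoverable state: the fastest stop wins.
    if s.corrupting_state or s.unrecoverable:
        return Recovery.ROLL_BACK
    # No known fix, or no reliable way to ship one: roll back and fix afterwards.
    if not s.fix_in_hand or not s.pipeline_healthy:
        return Recovery.ROLL_BACK
    # A known fix that cannot ship within the budget is not "in hand" in practice.
    if s.minutes_to_ship_fix > s.minutes_budget:
        return Recovery.ROLL_BACK
    return Recovery.FIX_FORWARD


routine = IncidentState(True, False, False, True, minutes_to_ship_fix=5, minutes_budget=15)
undiagnosed = IncidentState(False, False, False, True, minutes_to_ship_fix=60, minutes_budget=15)
decisions = (choose_recovery(routine), choose_recovery(undiagnosed))
```

Every branch that is not a clear yes falls through to rollback, which mirrors the framing: fix forward only when the answer is an unambiguous yes.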

The Capability Underneath

Fix-forward is not really a deployment strategy. It is a property of a team that has the underlying capabilities to ship a small, focused change quickly and confidently:

Fast CI and a reliable deploy pipeline. The fix has to be shippable in minutes.
Production observability. The team has to know what is broken before it can fix it.
Feature flags. Targeted disablement is part of the toolkit.
Small, reversible deployments. A small original deploy is much easier to fix forward than a large one.
Practice. The team has done this before. The first fix-forward of any incident is not the time to learn the pattern.

A team that has these capabilities can fix forward almost reflexively. A team that does not should not pretend it can.

Common Anti-Patterns

Treating fix-forward as a substitute for rollback ability. Rollback should still work, even if it is rarely used. A team that has lost the ability to roll back has lost a tool, not gained one.
Fixing forward through panic. A "fix" pushed under pressure, without review, often introduces a second problem on top of the first. The fix has to be deliberate, not heroic.
Skipping the postmortem because "we fixed it." A fast fix is not a substitute for understanding why the original deploy went wrong. The investigation still has to happen.
Fix-forward as a forward-forward-forward death march. Each successive fix introduces new issues. At some point the right call is to stop, roll back, and start over with a coherent plan.
Conflating fix-forward with not noticing. Continuously shipping changes that quietly degrade the system is not fix-forward. It is just an unhealthy deploy pipeline.
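One way to guard against the death-march pattern above is to agree on a limit for failed forward fixes before the incident starts. The sketch below assumes a hypothetical FixForwardBudget helper and an arbitrary default of two attempts; the right threshold is a team decision, not something this page prescribes.

```python
from dataclasses import dataclass


@dataclass
class FixForwardBudget:
    """Tracks failed forward fixes within a single incident (hypothetical helper)."""
    max_attempts: int = 2   # assumed threshold; tune per team and per system
    failed_attempts: int = 0

    def record_failed_fix(self) -> None:
        # Call when a shipped fix did not resolve the incident or made it worse.
        self.failed_attempts += 1

    def should_stop_and_roll_back(self) -> bool:
        # Once the budget is spent, stop iterating and return to known-good.
        return self.failed_attempts >= self.max_attempts


budget = FixForwardBudget()
budget.record_failed_fix()
still_iterating = not budget.should_stop_and_roll_back()
budget.record_failed_fix()
time_to_roll_back = budget.should_stop_and_roll_back()
```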

What This Looks Like in Practice

Default to fix-forward when the fix is in hand and shippable in minutes. Treat rollback as the emergency option, not the first option.
Decide explicitly. Every incident should have a clear answer to "fix forward, or roll back, and why." Defaulting either way without thinking produces predictable mistakes.
Keep rollback paths in working order. A rollback that has not been exercised is not a rollback. Periodically verify that the system can be returned to a known-good state.
Use feature flags as targeted rollback. When the problem is a specific feature, disable it. The bad code is still deployed but inactive.
Postmortem the same way regardless of recovery path. Whether the team fixed forward or rolled back, the question is the same: why did the problem reach production, and what would prevent the next one in this class?
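The advice to keep rollback paths exercised can be made checkable with a simple staleness test. This is a sketch under assumptions: the rollback_drill_is_stale function and the 30-day default are hypothetical, and where the last drill date comes from (a runbook, a deploy log, a scheduled job) is left to the team.

```python
from datetime import date, timedelta
from typing import Optional


def rollback_drill_is_stale(
    last_drill: Optional[date], today: date, max_age_days: int = 30
) -> bool:
    """Return True when the rollback path has not been exercised recently enough."""
    if last_drill is None:
        # A rollback that has never been exercised is treated as not working.
        return True
    return today - last_drill > timedelta(days=max_age_days)


never_drilled = rollback_drill_is_stale(None, date(2024, 1, 31))
recent_drill = rollback_drill_is_stale(date(2024, 1, 1), date(2024, 1, 31))
old_drill = rollback_drill_is_stale(date(2024, 1, 1), date(2024, 2, 1))
```

A check like this could run on a schedule and alert when it returns True, so that a lapsed rollback path is noticed before an incident rather than during one.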

Important caveat

Rollback remains an emergency tool when the situation demands it. The default in mature continuous-delivery environments is fix-forward, but the team that can no longer roll back has lost a capability rather than outgrown it.

¹ Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (eds.), Site Reliability Engineering: How Google Runs Production Systems (O'Reilly, 2016). Among the operational arguments: that rapid recovery, not perfect prevention, is the realistic strategy for reliability, and that the engineering practices supporting rapid recovery (small changes, fast pipelines, observability) are the things actually worth investing in.
² Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and DevOps (IT Revolution, 2018), and DORA, State of DevOps Report (annual): https://dora.dev/research/. Mean time to restore (MTTR) is one of the four key metrics the DORA research uses to distinguish high-performing teams from low-performing ones. The empirical finding is that elite teams are not the ones that avoid failure; they are the ones that recover from failure in minutes rather than days. Fix-forward, when it works, is the operational expression of that capability.
³ Richard I. Cook, How Complex Systems Fail (1998, revised 2000): https://how.complexsystems.fail/. Eighteen short observations on failure in complex systems. The relevant ones for fix-forward: failures are normal, recovery rather than prevention is the practical goal, and the people closest to the system at the moment of failure are the ones who can recover it. Cook's framing underpins much of the resilience-engineering and SRE literature on operating under uncertainty.
⁴ See Feature Flags for the longer treatment.