12 Apr 2026 · 12 min read

Error budgets as a moral instrument

What happens when you stop treating SLOs as numbers and start treating them as agreements.

The standard introduction to error budgets goes something like this: your service has a 99.9% availability SLO, which means you have roughly 43 minutes of downtime per 30-day month to spend. When you have budget remaining, you can move fast. When the budget is exhausted, you slow down and focus on reliability. Simple.
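For readers who want to check the arithmetic, here is a minimal sketch. It assumes a 30-day window and a purely time-based availability SLO; the function name `error_budget_minutes` is ours, not part of any library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    Assumes availability is measured as a fraction of wall-clock time.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    minutes_in_window = window_days * 24 * 60
    return (1.0 - slo) * minutes_in_window


budget = error_budget_minutes(0.999)
print(f"{budget:.1f} minutes")  # 43.2 minutes
```

A request-based SLO (good requests divided by total requests) would yield a budget counted in failed requests rather than minutes, but the shape of the calculation is the same.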

This framing is useful. It is also incomplete. It treats the error budget as a resource: something to be managed, optimised, perhaps gamed. What it misses is that the budget is also an agreement, and that raises the more important question: an agreement with whom?

The agreement underneath the number

Every SLO is a promise. Not to a dashboard. Not to a monitoring system. To the people who depend on your service.

When you define 99.9% availability, you are telling your users: we expect this service to be unavailable for up to 43 minutes each month. We think that is acceptable for what you are doing with it. We have thought about what you lose when it is down, and we believe this number reflects a fair balance between the cost of reliability and the cost of failure.

If you have not thought about it that way, you have not written an SLO. You have written a number.

What makes an SLO honest

An honest SLO has three components that are rarely all present at once. It has a clear definition of what is being measured — not just uptime, but latency, error rate, correctness. It has a threshold grounded in user impact, not in what is technically achievable or what looks impressive in a board deck. And it has an owner: a person or team who has genuinely agreed that this is the right number and will be accountable when it is breached.
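One way to keep the three components visible is to record them next to the number itself. The sketch below is illustrative only: the `SLO` class, its field names, and the example values are hypothetical, and a check for empty fields obviously cannot tell you whether the owner has genuinely agreed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    indicator: str    # what is being measured, stated from the user's side
    target: float     # the threshold, e.g. 0.999
    user_impact: str  # why this threshold is acceptable to the people affected
    owner: str        # person or team that agreed to the number

    def missing_components(self) -> list[str]:
        """Return the names of components that were left blank."""
        missing = []
        if not self.indicator.strip():
            missing.append("definition")
        if not self.user_impact.strip():
            missing.append("user-impact rationale")
        if not self.owner.strip():
            missing.append("owner")
        return missing


# Hypothetical example: a target with an owner but no stated rationale.
checkout = SLO(
    indicator="checkout requests that succeed in under 2 s",
    target=0.999,
    user_impact="",
    owner="payments team",
)
print(checkout.missing_components())  # ['user-impact rationale']
```

An SLO that fails this check is, in the terms used above, still just a number.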

The third component is the hardest. We have sat in many meetings where an SLO is presented to a team rather than negotiated with them. The number is handed down from a product manager or a CTO. The engineers nod. The SLO goes into the monitoring system. Six months later, no one can remember why it is 99.95% and not 99.9%, and the error budget is used primarily to justify or deny feature work in ways that feel arbitrary to everyone involved.

Error budgets as moral clarity

When an error budget is treated as an agreement rather than a resource, something changes in how teams talk to each other. The reliability engineers and the feature engineers are no longer on opposite sides of a negotiation. They share a commitment. The budget is not the reliability team’s property to protect against the feature team’s raids. It is a shared account that both teams have agreed to steward.

This changes the conversation when the budget is running low. Instead of “we cannot deploy because reliability says no,” it becomes “we made a promise to our users about how much disruption they would experience this month, and we are close to the edge of it. What do we want to do?”
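To make "close to the edge" concrete, a team might track what fraction of the month's budget is left. This is a sketch under stated assumptions: a 30-day window, downtime measured in minutes, and a 25% warning threshold that we picked purely for illustration. The names `budget_remaining` and `WARN_BELOW` are hypothetical.

```python
WARN_BELOW = 0.25  # assumed threshold; each team should agree on its own


def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once breached)."""
    budget = (1.0 - slo) * window_days * 24 * 60
    return 1.0 - bad_minutes / budget


# Hypothetical month: 36 minutes of user-visible downtime against a 99.9% SLO.
remaining = budget_remaining(0.999, bad_minutes=36.0)
near_edge = remaining < WARN_BELOW

print(f"{remaining:.0%} of the budget left")  # 17% of the budget left
if near_edge:
    print("Time to ask: what do we want to do?")
```

The code only flags the moment. The decision about what to do next still belongs to the people who made the promise.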

That is a harder conversation. It is also an honest one.

Practical steps

Start with your most critical user journey. Define failure from the user’s perspective, not the system’s. Agree on the threshold with the people who will be held to it, not just the people who will measure it. Review it quarterly, not annually. And when the budget is breached, treat the postmortem not as a failure investigation but as a renegotiation: was the SLO right? Was the incident foreseeable? What does the budget need to be to make this promise keepable?

The error budget is, in the end, a way of making your reliability commitments legible. Used well, it is one of the most powerful tools in the discipline — not because of what it measures, but because of the conversation it forces you to have.
