SLOs Changed How We Ship Software — Error Budgets, Burn Rates, and Why 99.99% Uptime Is a Lie
AWS US-EAST-1 went down for 15 hours in October 2025, affecting over 4 million users. That single region hosts 30-40% of all AWS workloads. AWS's overall 2025 uptime? 99.95%. Not 99.99%. Not five nines. Not even four nines. Three nines and a five.
And here's the part nobody talks about: AWS still delivered exceptional value that year. Their customers still shipped billions of requests. The business kept running. Because uptime isn't the metric that matters — what matters is whether your users noticed.
That's the fundamental insight behind Service Level Objectives, and it's why SLOs have quietly become the most important concept in modern software operations. Not because they promise perfection, but because they finally gave engineering teams a framework for deciding how much imperfection is acceptable.
The Nines Are Lying to You
Let's start with what everyone gets wrong about uptime. Here's what those nines actually mean in downtime per year:
| Availability | Annual Downtime | Monthly Downtime | Daily Downtime |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.31 hours | 14.4 minutes |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes | 1.44 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 8.64 seconds |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 0.86 seconds |
52 minutes of downtime per year for four nines. That sounds generous until you realize a single bad deployment can eat that in ten minutes.
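The arithmetic behind the table is a single multiplication, which makes it easy to sanity-check any uptime claim yourself:

```python
def allowed_downtime_minutes(availability_pct: float, window_days: float) -> float:
    """Minutes of downtime permitted over the window at the given availability."""
    return window_days * 24 * 60 * (100 - availability_pct) / 100

# Four nines over a 365-day year: ~52.6 minutes
print(round(allowed_downtime_minutes(99.99, 365), 1))  # 52.6

# Three nines over a 30-day month: 43.2 minutes
print(round(allowed_downtime_minutes(99.9, 30), 1))  # 43.2
```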
But here's the real lie: each additional nine costs exponentially more. Going from 99.9% to 99.95% roughly doubles your engineering investment. Going from 99.95% to 99.99% costs roughly 5x on top of that. And going from 99.99% to 99.999%? Another 10x. You're spending ten times more money to save 47 minutes of downtime per year.
Most teams promising 99.99% uptime in their SLAs have never actually done the math on what it costs to deliver. And the cloud providers themselves can't even do it consistently — between August 2024 and August 2025, AWS, Azure, and Google Cloud together experienced more than 100 service outages.
The actual numbers from 2025:
| Provider | Effective Uptime | Major Incidents | Avg. Recovery Time |
|---|---|---|---|
| AWS | 99.95% | 6 | 2.8 hours |
| Azure | 99.97% | 4 | 4.2 hours |
| Google Cloud | 99.98% | 3 | 1.9 hours |
If the three largest infrastructure providers on the planet can't hit four nines, what makes you think your startup can?
SLA vs. SLO vs. SLI: The Vocabulary That Actually Matters
Before SLOs, we had SLAs. And SLAs are terrible for engineering decisions.
An SLA (Service Level Agreement) is a contract between provider and customer with financial penalties for breaches. It's a legal document. It tells you the minimum you're allowed to deliver before you start writing refund checks. No engineering team has ever been inspired by the phrase "meet the contractual minimum."
An SLO (Service Level Objective) is an internal performance target that your team sets for itself. It's what you actually aim for. And critically, it should always be stricter than your SLA. If your SLA promises 99.5% uptime, your internal SLO might be 99.9%. That gap between 99.5% and 99.9% is your safety margin.
An SLI (Service Level Indicator) is the actual measurement — the number that tells you if you're meeting your SLO. It's the query, the metric, the ground truth.
Here's the relationship:
```yaml
# SLI: What you measure
sli: successful_requests / total_requests

# SLO: What you aim for
slo: 99.9% of requests succeed over 30 days

# SLA: What you promise externally
sla: 99.5% availability, 10% credit if breached

# Error budget: The gap between perfection and your SLO
error_budget: 100% - 99.9% = 0.1% = 43.2 minutes per 30 days
```
This distinction matters because it changes the conversation. An SLA conversation is: "Are we in breach of contract?" An SLO conversation is: "Can we ship this feature today, or should we fix that flaky endpoint first?"
Error Budgets: The Idea That Changed Everything
The error budget is the single most impactful concept to come out of Google's SRE practice. It's elegant in its simplicity: if your SLO is 99.9% availability over 30 days, then you're allowed 0.1% unreliability — that's your error budget.
0.1% of a 30-day window is 43.2 minutes. That's 43.2 minutes where your service can be down, slow, or returning errors before you've violated your own objective. Every deployment, every experiment, every infrastructure change eats into that budget.
Here's why this changes everything: the error budget turns reliability from a moral obligation into a resource to be spent.
Before error budgets, the conversation between product teams and reliability teams was adversarial. Product wanted to ship fast. SRE wanted stability. Neither had a framework for resolving the tension. It was vibes and politics.
With error budgets, the conversation becomes data-driven: "We have 38 minutes of budget remaining this month. This deployment has historically caused 2 minutes of elevated error rates during rollout. We can afford it." Or: "We've burned 90% of our budget in the first week. We need to freeze features and fix the database connection pool issue."
The error budget gave engineers objective authority to say "not now" when reliability was at risk. And it gave product teams permission to ship when there was budget to spend. No more arguments. Just math.
The Error Budget Policy
An error budget without a policy is just a dashboard nobody looks at. Google's SRE Workbook defines escalating actions based on remaining budget:
| Budget Remaining | Action |
|---|---|
| More than 50% | Ship freely. Deploy new features. Run experiments. |
| 25% to 50% | Slow down. Extra review on risky changes. Reduce deployment frequency. |
| 1% to 25% | Feature freeze. Only reliability improvements and critical bug fixes. |
| 0% (exhausted) | Complete freeze. Escalate to leadership. All hands on reliability. |
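The tiered policy above is simple enough to encode directly; here's a sketch where the thresholds mirror the table and the action strings are illustrative:

```python
def error_budget_action(budget_remaining_pct: float) -> str:
    """Map remaining error budget to the escalation tier from the policy table."""
    if budget_remaining_pct > 50:
        return "ship-freely"       # deploy features, run experiments
    if budget_remaining_pct >= 25:
        return "slow-down"         # extra review, reduced deploy frequency
    if budget_remaining_pct >= 1:
        return "feature-freeze"    # reliability work and critical fixes only
    return "complete-freeze"       # escalate to leadership

print(error_budget_action(38))  # slow-down
```

Wiring a function like this into your CI pipeline — blocking deploys when the tier says freeze — is what turns the table from advice into policy.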
There's also what Google calls the "silver bullet" exception — a very small number of overrides for truly business-critical launches that can't wait. But these are rare, tracked, and require executive approval. If you're using silver bullets every sprint, your SLO is wrong.
Burn Rates: The Speed of Failure
Knowing that 40% of your error budget remains is useful. Knowing you're burning through it at 10x the sustainable rate is actionable. That's what burn rate gives you.
A burn rate of 1 means you're consuming your error budget at exactly the rate that would exhaust it at the end of the window. Sustainable. A burn rate of 2 means you'll run out halfway through. Concerning. A burn rate of 36 means you'll burn 5% of your monthly budget in a single hour. That's a page-the-on-call situation.
Google recommends two alerting thresholds as reasonable starting points:
- 2% budget consumption in one hour — fast burn, something is actively broken
- 5% budget consumption in six hours — slow burn, something is degrading
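Those "X% of budget in Y hours" thresholds convert to burn-rate multipliers with one formula. A sketch, assuming a 30-day SLO window:

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_hours: float = 30 * 24) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent
    within `alert_window_hours`."""
    return budget_fraction * slo_window_hours / alert_window_hours

print(burn_rate_threshold(0.02, 1))  # fast burn: 14.4
print(burn_rate_threshold(0.05, 6))  # slow burn: 6.0
```

These two multipliers, 14.4 and 6, are exactly the constants that appear in the alerting rules below.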
Multi-Window Burn Rate Alerting
The sophisticated approach uses two windows for each alert — a long window to detect the sustained burn, and a short window to confirm it's still happening (not a brief spike that already recovered):
```yaml
# Prometheus alerting rules for SLO burn rate (99.9% SLO -> 0.001 budget)
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: 2% budget in 1 hour (burn rate 14.4)
      - alert: SLOBurnRateFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "High SLO burn rate - 2% budget in 1 hour"

      # Slow burn: 5% budget in 6 hours (burn rate 6)
      - alert: SLOBurnRateSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) > (6 * 0.001)
        labels:
          severity: warning
        annotations:
          summary: "Elevated SLO burn rate - 5% budget in 6 hours"
```
This two-window approach dramatically reduces false alerts. The long window catches the trend. The short window confirms it's not already resolved. You stop waking people up for 30-second blips that self-healed.
Choosing SLIs: Measure What Users Feel, Not What Servers Report
The most common SLO mistake is measuring the wrong thing. CPU utilization is not an SLI. Memory usage is not an SLI. Pod restart counts are not SLIs. These are infrastructure metrics. They tell you what your servers are experiencing, not what your users are experiencing.
Good SLIs fall into four categories defined by Google's SRE framework:
| SLI Type | What It Measures | Example |
|---|---|---|
| Availability | Can users reach the service? | Successful requests / total requests |
| Latency | How fast does it respond? | % of requests under 200ms |
| Quality | Is the response correct and complete? | % of responses with full data (not degraded) |
| Freshness | Is the data up to date? | % of queries returning data under 1 minute old |
Here's a practical example for an e-commerce checkout API:
```yaml
slos:
  checkout-availability:
    # Excludes client errors (4xx) - those aren't our fault
    sli: successful_checkout_requests / total_checkout_requests
    target: 99.95%
    window: 30d

  checkout-latency:
    # p99 latency matters more than p50 for checkouts
    sli: checkout_requests_under_500ms / total_checkout_requests
    target: 99%
    window: 30d

  checkout-correctness:
    # Getting the price wrong is worse than being slow
    sli: orders_with_correct_totals / total_orders
    target: 99.99%
    window: 30d
```
Notice the different targets. Availability at 99.95% because brief outages are recoverable — users retry. Latency at 99% because some slow requests are tolerable. But correctness at 99.99% because charging someone the wrong amount is a trust-destroying event.
Your SLO targets should reflect the actual impact of failure on your users, not some aspirational number your VP put on a slide.
The Five SLO Anti-Patterns That Will Wreck Your Team
I've seen SLO initiatives fail more often than they succeed. Not because the concept is flawed, but because teams make the same mistakes repeatedly.
Anti-Pattern 1: Setting SLOs Too High
The most common mistake. A team sets 99.99% availability because it sounds professional. Then they blow their entire monthly error budget on the first deployment. Now they're in a perpetual feature freeze, the error budget policy becomes meaningless, and everyone stops looking at the dashboard.
Aggressive SLOs leave no margin for experimentation. If your budget is 4 minutes per month, a single canary deployment's elevated error rate could exhaust it. That's not a reliability framework — it's a ship-nothing framework.
Fix: Start with 99.9% or even 99.5%. Tighten only when you're consistently under-spending your budget.
Anti-Pattern 2: SLOs That Nobody Enforces
You defined the SLOs. You built the dashboards. And then you burned through the error budget three months in a row and nothing happened. No freeze. No escalation. No consequences.
An error budget policy only works if leadership backs it. When the VP of Product overrides the feature freeze because "this launch is critical," you don't have an SLO practice. You have expensive monitoring.
Fix: Get error budget policy sign-off from engineering leadership AND product leadership before you start. Both sides need skin in the game.
Anti-Pattern 3: Treating Error Budget as a Failure Target
Some teams interpret the error budget as a target to hit — "we have 43 minutes of allowed downtime, so it's fine if we use all of it." The error budget is not a failure target. It's a safety margin. Spending all of it every month means you have no buffer for unexpected incidents.
Fix: If you're consistently consuming more than 50% of your error budget, either your SLO is too aggressive or your system needs reliability investment.
Anti-Pattern 4: Too Many SLOs
A team defines 47 SLOs across 12 services. Nobody can track them. Dashboards become noise. Alert fatigue sets in.
Fix: Start with 1-3 SLOs per service. One for availability, one for latency. Maybe one for correctness if it's a financial system. Add more only when the existing ones prove insufficient.
Anti-Pattern 5: SLIs That Don't Reflect User Experience
Your backend returns 200 OK, but the response body contains an error message. Your health check passes, but the login flow is broken. Your API responds in 50ms, but it's returning cached stale data from three hours ago.
If a service consistently operates with a full error budget, your SLI might be measuring the wrong thing. The dashboard says green while users are filing support tickets.
Fix: Define SLIs from the user's perspective, not the server's. Instrument at the edge, not at the application. Use synthetic monitoring to validate what real users experience.
DORA Metrics and the Speed-Stability Myth
There's a persistent belief that shipping fast and staying reliable are opposing forces. DORA's research has repeatedly demonstrated the opposite: speed and stability are correlated, not at odds.
The DORA framework now includes Reliability — assessed through SLOs and SLIs — as a fifth metric alongside the four software delivery metrics. Elite-performing teams deploy multiple times per day while maintaining low change failure rates. They don't sacrifice stability for velocity. They use SLOs and error budgets to find the optimal balance.
Here's the mechanism: when you have error budget remaining, you ship aggressively. When the budget is low, you slow down and fix things. This natural rhythm means you're never in a sustained period of "move fast and break things" or a sustained period of "change nothing." You oscillate between innovation and hardening based on real data.
That's why SLOs are decision-making drivers, not just monitoring targets. They tell you when to push and when to pull back. They replace the gut-feel "are we shipping too fast?" with a quantitative answer.
Building Your SLO Practice: A Practical Playbook
If you're starting from zero, here's the sequence that works. I've seen teams try to skip steps and it always backfires.
Step 1: Pick One Service
Don't try to SLO everything at once. Pick your most important user-facing service — the one where downtime directly impacts revenue or user trust.
Step 2: Define 1-2 SLIs
Start with availability (success rate) and latency (p99 response time). Instrument them at the load balancer or API gateway level, not at the application level. You want to measure what users actually experience.
```python
# Example: calculating availability SLI from logs
from datetime import datetime


def calculate_availability_sli(
    start: datetime,
    end: datetime,
    total_requests: int,
    failed_requests: int,
) -> dict:
    """Summarize an availability SLI against a 99.9% SLO over [start, end]."""
    success_rate = (total_requests - failed_requests) / total_requests * 100
    window_minutes = (end - start).total_seconds() / 60

    slo_target = 99.9
    error_budget_pct = 100 - slo_target  # 0.1%
    error_budget_minutes = window_minutes * (error_budget_pct / 100)

    actual_error_pct = failed_requests / total_requests * 100
    budget_consumed = (actual_error_pct / error_budget_pct) * 100

    if budget_consumed < 75:
        status = "healthy"
    elif budget_consumed < 100:
        status = "at_risk"
    else:
        status = "exhausted"

    return {
        "sli_value": round(success_rate, 4),
        "slo_target": slo_target,
        "error_budget_total_min": round(error_budget_minutes, 2),
        "budget_consumed_pct": round(budget_consumed, 2),
        "budget_remaining_pct": round(100 - budget_consumed, 2),
        "status": status,
    }
```
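Plugging in concrete numbers (all hypothetical) makes the budget math tangible — this is the same arithmetic the function performs:

```python
# Hypothetical 30-day window: 1,000,000 requests, 500 failures
total, failed = 1_000_000, 500
slo_target = 99.9
error_budget_pct = 100 - slo_target            # 0.1% of requests may fail

success_rate = (total - failed) / total * 100  # observed SLI: 99.95%
actual_error_pct = failed / total * 100        # 0.05% actually failed
budget_consumed = actual_error_pct / error_budget_pct * 100

print(f"SLI {success_rate:.2f}%, budget consumed {budget_consumed:.0f}%")
# SLI 99.95%, budget consumed 50%
```

Note the asymmetry: the service is beating its SLO (99.95% against 99.9%), yet half the month's budget is already gone. That's the kind of signal a raw uptime number hides.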
Step 3: Set a Conservative SLO
Your first SLO should be achievable. Look at your last 30 days of data. If your actual availability was 99.85%, set your SLO at 99.5%. Give yourself room. You can always tighten it later.
Step 4: Write the Error Budget Policy
This is a document — signed by engineering and product leadership — that defines what happens at each budget threshold. Use the tiered model from earlier. Make the consequences real but reasonable.
Step 5: Build the Dashboard
You need three things visible at all times:
- Current SLI value vs. SLO target
- Error budget remaining (percentage and absolute time)
- Burn rate over the last 1 hour and 6 hours
Step 6: Set Up Burn Rate Alerts
Configure fast-burn (critical, pages on-call) and slow-burn (warning, notifies the team channel) alerts. Start conservative — you can tune thresholds after a few weeks of data.
Step 7: Run It for 30 Days Before Enforcing
Let the SLO run in observation mode for a full cycle. Watch how the budget is consumed. Adjust the target if it's too aggressive or too loose. Only start enforcing the error budget policy after you trust the measurement.
The Tooling Landscape
You don't need to build SLO infrastructure from scratch. The tooling has matured significantly:
| Tool | Best For | Approach |
|---|---|---|
| Nobl9 | Dedicated SLO management | Multi-source, 40+ integrations, vendor-agnostic |
| Datadog SLOs | Teams already on Datadog | Native burn rate alerts, 900+ integrations |
| Prometheus + Sloth | Open-source, self-hosted | Generates Prometheus alerting rules from SLO specs |
| Google Cloud SLO Monitoring | GCP-native services | Built into Cloud Monitoring |
| Dynatrace | Enterprise, full-stack | AI-powered SLO analysis |
Nobl9 is the top choice for dedicated SLO management because it's purpose-built for the problem, integrating with dozens of telemetry sources including Prometheus, Datadog, New Relic, and the major cloud providers. If you're already paying for Datadog, its built-in SLO features with burn rate alerting are good enough for most teams.
For teams on a budget, Prometheus with Sloth generates multi-window burn rate alerting rules from simple SLO definitions. It's free, battle-tested, and exactly what most startups need.
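For a sense of what Sloth asks of you, here's a minimal spec along the lines of its `prometheus/v1` format — the service and metric names are placeholders, and Sloth expands `{{.window}}` into each burn-rate window when generating the rules:

```yaml
version: "prometheus/v1"
service: "checkout"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "Availability of checkout requests"
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="checkout",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutHighErrorRate
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
```

A dozen lines of spec in, a full set of multi-window burn-rate recording and alerting rules out — which is exactly the tedious part you don't want to hand-maintain.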
What I Actually Think
SLOs are the best framework we have for shipping software responsibly. But I think the industry has gotten some things wrong about them.
First: most teams set their SLOs too high. I'd rather see a team rigorously enforce a 99.5% SLO than have a 99.99% SLO that everyone ignores. The value of SLOs isn't in the number — it's in the decision framework. A 99.5% SLO with a real error budget policy will produce more reliable software than a 99.99% SLO that exists only on a dashboard.
Second: the error budget policy is more important than the SLO itself. I've seen teams with beautiful SLO dashboards that never trigger any action. The budget turns red, nobody cares, and the next sprint's feature work continues unchanged. That's monitoring theater. The policy — the escalation plan, the feature freeze threshold, the leadership buy-in — is what makes the entire system work.
Third: 99.99% uptime is genuinely a lie for 95% of companies. Not because they can't occasionally achieve it, but because they can't sustain it while also shipping features. AWS, with its essentially unlimited engineering budget, delivered 99.95% in 2025. If you're a 50-person startup claiming four nines in your SLA, you're either not measuring correctly or you're not deploying often enough.
And finally: SLOs work because they turn reliability into a business conversation. Before SLOs, reliability was a technical concern that product managers didn't understand and couldn't reason about. Error budgets translated "the system is flaky" into "we can't ship for the next two weeks because we spent our reliability budget." That's a language product and engineering can share.
The teams that get this right don't treat SLOs as a monitoring feature. They treat them as the operating system for how they ship software. Every deployment decision, every architecture choice, every on-call escalation flows through the error budget. It's not a dashboard. It's a decision-making framework.
And that's why SLOs changed how we ship software. Not because they made things more reliable — though they did. But because they finally gave us a shared vocabulary for the oldest tension in engineering: the desire to ship fast and the need to not break things.
References:
- Cherry Servers — How Long Do Cloud Outages Really Last?
- Uptime.is — SLA & Uptime Calculator
- WebHostMost — Hosting Uptime 2026: How To Calculate 99.9% Uptime
- Google SRE Workbook — Error Budget Policy
- Google SRE Workbook — Alerting on SLOs
- Google SRE Workbook — Implementing SLOs
- Google Cloud Blog — SRE Fundamentals: SLIs, SLAs, and SLOs
- Atlassian — What is a Service-Level Objective?
- Alex Ewerlof — SLA vs SLO
- Airbrake — Balancing Speed and Reliability With Error Budgets
- DEV Community — SRE in Action: Real Teams Use SLOs
- Motadata — SRE Error Budget: Balancing Reliability and Innovation
- DORA — DORA's Software Delivery Performance Metrics
- Octopus Deploy — Understanding the 4 DORA Metrics
- Datadog — Burn Rate Alerts
- Nobl9 — A Complete Guide to Error Budgets
- Gitnux — Top 10 SLO Software of 2026