Track error budget allocation and consumption with burn rate analysis. Part of the DevTools Surf developer suite.
Use Cases
Track SLO compliance and error budget consumption in real time
Determine when to freeze feature development in favor of reliability work
Communicate reliability status to stakeholders using the error budget metaphor
Enter your SLO target (e.g., 99.9% availability) and the measurement window (typically 30 days) — the tracker calculates total allowed error minutes and tracks consumed minutes
Use the burn rate alert panel to configure notifications when you are burning through your error budget 2x or 6x faster than sustainable
Toggle between time-based (uptime minutes) and event-based (successful request percentage) SLOs — they require different input data
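The budget arithmetic behind the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tracker's actual implementation: the function names time_budget_minutes and event_budget_remaining are hypothetical, and a fixed (non-rolling) window is assumed.

```python
def time_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total allowed downtime, in minutes, for a time-based SLO.

    Hypothetical helper: assumes a fixed window of `window_days` days.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def event_budget_remaining(slo_percent: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of an event-based error budget still unspent.

    Returns a negative value when the budget is overspent.
    """
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures == 0:
        raise ValueError("budget is zero: no requests observed or SLO is 100%")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Time-based: 99.9% over 30 days allows about 43.2 minutes of downtime.
    print(round(time_budget_minutes(99.9), 1))
    # Event-based: 250 failures out of 1,000,000 requests at 99.9%
    # leaves about 75% of the budget.
    print(round(event_budget_remaining(99.9, 1_000_000, 250), 2))
```

The two functions take different inputs on purpose: a time-based SLO needs only the window length, while an event-based SLO needs request counts.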
Fun Facts
The concept of error budgets was popularized by Google's Site Reliability Engineering (SRE) book, published in 2016. It reframed reliability from a binary (up/down) to a quantitative metric — allowing engineering teams to make data-driven decisions about risk.
At 99.9% availability (three nines), you have an error budget of about 43.8 minutes per average calendar month (43.2 minutes over a 30-day window). At 99.99% (four nines), that drops to about 4.38 minutes. Moving from three to four nines is often said to require roughly 10x more investment in reliability engineering, and it is not cost-effective for every service.
Burn rate measures how fast you consume your error budget relative to a steady, linear pace. At a 1x burn rate you use exactly your budget over the 30-day window. At a 2x burn rate you exhaust it in 15 days. Alerts typically escalate as the burn rate increases.
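As a rough sketch of the burn rate arithmetic, the Python below computes a burn rate, the days a full budget would last at that rate, and an alert level. The function names and the "warn"/"page" labels are assumptions made for illustration. The 2x and 6x thresholds are the ones mentioned for the alert panel.

```python
def burn_rate(budget_consumed_fraction: float,
              window_elapsed_fraction: float) -> float:
    """Share of budget spent divided by share of window elapsed.

    A value of 1.0 means the budget runs out exactly at the end of the window.
    """
    if window_elapsed_fraction <= 0:
        raise ValueError("window_elapsed_fraction must be positive")
    return budget_consumed_fraction / window_elapsed_fraction


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days a full budget lasts at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days / rate


def alert_level(rate: float, warn_at: float = 2.0, page_at: float = 6.0) -> str:
    """Map a burn rate to a hypothetical alert level."""
    if rate >= page_at:
        return "page"
    if rate >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Half the budget gone after a quarter of the window is a 2x burn rate.
    rate = burn_rate(0.5, 0.25)
    print(rate, days_to_exhaustion(rate), alert_level(rate))
```

In practice, burn rate is usually measured over a short lookback period (for example the last hour) rather than the whole window, so that a sudden spike triggers an alert quickly.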
FAQ
What is an SLO and how is it different from an SLA?
An SLO (Service Level Objective) is an internal reliability target set by the engineering team: the goal to aim for. An SLA (Service Level Agreement) is an external contractual commitment to customers, usually with financial penalties for breach. SLOs should be stricter than SLAs so that you have a buffer before a contractual breach.
How do I decide on an SLO target?
Start with user research: what reliability level do users actually need? Then measure your current reliability. Set the SLO slightly below current performance (if it is good) to give room for planned maintenance. Avoid aspirational SLOs you cannot currently meet.
What should happen when the error budget is exhausted?
Per Google's SRE framework: stop deploying new features until the budget recovers or the next measurement window begins. Use the remaining time for reliability improvements and root cause analysis. This creates a self-regulating cycle in which reliability work is prioritized by the budget itself rather than by management mandate.