So you want to write an SLO
I used to work in Ruby on Rails. It’s not bad to work in if you’re looking to crank some code out quickly, you’re okay with a few runtime errors, you love throwing money at compute resources, and the overall level of service doesn’t need to be that high. It was an okay time. But how did I know the level of service wasn’t as good as it could’ve been? Am I just biased because I love Rust?
While I am biased, I know the level of service was lacking because I set up service-level objectives (SLOs) for both my Ruby and my Rust services. Across the board, my Rust services had stricter SLOs, and consistently met them. In fact, it was frequent SLO error budget violations that prompted me to move away from Ruby in the first place. I wasn’t content to set an SLO based on the tech I was using—rather, I wanted to set an SLO based on the business context and then use the tech I needed to use to reach that SLO.
However, this article isn’t about language choice. It’s about SLOs, so let me start by acknowledging that writing good SLOs is a tricky business. To construct a good one, we need to understand why we need SLOs and what they do, then we’ll dive into how to set them for your service.
So first, why do we need SLOs?
We live in a broken world
One unfortunate reality of software services is that things go wrong.
Sometimes things go wrong simply because of a defect in the code. But other times, you end up dealing with issues that don’t have an obvious cause or a simple fix. You might experience a transient network outage. A server might run out of CPU or memory unexpectedly due to a surge in traffic. Multi-threaded server code could have a subtle flaw that allows for a race condition.
This is all exacerbated in modern, distributed systems where problems tend to propagate. Unless an engineer has done some extensive and graceful error handling, an exception in a server can derail every downstream server. How can we hope to build these systems when every dependency presents a potential failure mode?
We promise to be good
SLO as a promise to your customer
An SLO is a promise to your client (whether that be an end user or another service) that your service will do what you’ve said it will do. Often this is phrased as “the service will be good X percent of the time.” But given this SLO “promise” format, we can immediately ask some questions. First of all, what do we mean by “be good”? Second, how do we pick a value for X? And what kind of time window are we calculating X over?
The best way to approach these questions is by looking at them from your customer’s point of view. Your customer might be an end user of your app, or your customer might be another team consuming your API, or you might be your own customer (for example, you might have a service powering analytics for your own team to look at). Your customer will have some expectations for your system, both functional and non-functional, and each one that you’re able to enumerate and measure could potentially become an SLO.
Enumerating SLOs using customer expectations
Let’s run through an example to see how to make SLOs based on customer expectations. We can start by considering a backend service serving a REST API. Pretend the API is being consumed by a frontend client running in a user’s browser. The frontend has some business requirements, such as displaying the correct data for a given user, and loading the page within 3 seconds for optimal conversion rates. Also, while this may not have been explicitly recorded in a JIRA ticket anywhere, everyone has an implicit expectation that the page will work when a user goes to it. Errors fetching data are only tolerable some tiny percentage of the time.
The first thing we need to notice is that since this frontend has a hard dependency on your service, all of these business requirements are hard upper limits for your service as well. The frontend expects your service to return the correct data, to respond within 3 seconds (and preferably faster than that, so the page still has some leeway to render and make other calls), and it expects calls to fail very rarely. This gives us three SLO skeletons to start with:
- return correct data some percent of the time
- latency < 3 seconds some percent of the time
- error rate on REST endpoints is low
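As a hypothetical sketch (the field names here are my own, not anything the frontend team handed us), these skeletons could be captured as structured records so the missing targets stay explicit until we pick them:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SloSkeleton:
    """An SLO whose target percentage (X) has not been chosen yet."""
    name: str
    good_event: str                          # what counts as "good"
    target_percent: Optional[float] = None   # the X we still need to pick

# The three skeletons from the frontend's business requirements:
skeletons = [
    SloSkeleton("correctness", "response contains the correct data"),
    SloSkeleton("latency", "request completes in under 3 seconds"),
    SloSkeleton("availability", "request does not return an error"),
]
```

Writing them down like this makes it obvious what work remains: every `target_percent` is still `None`.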
Make sure you’ve got SLIs for your SLOs
How do we take those skeletons and flesh them out into finalized, trackable, and enforceable SLOs? The next step is to take stock of the SLIs we have on hand. SLI stands for service-level indicator and is basically just a data point you’re collecting for your service. For instance, if you’re tracing each request coming in to your REST API, you can determine how many requests have happened overall, the duration for each request, and whether or not the request ended with an error. These sorts of SLIs are crucial for creating an SLO since they help us define and track whether our service is being “good”.
For our example, let’s assume we are tracing each request and therefore have access to a few SLIs (aka metrics) called request.hits, request.duration, and request.errors. With these metrics we know we can handle two of our three proposed SLOs: request.duration can be used for our latency SLO, and with request.errors we can track our error rate.
What about returning correct data though? How do we measure that? Well first off, let me say that services should always have tests that run before deployment to verify correctness. We one hundred percent want to keep those tests and we still want to work to improve them. But as an added layer of protection and visibility into production, we can also run “tests” on the code once it reaches production. This is done with something commonly called a synthetic test. It’s basically a script that hits your endpoint on some cadence and asserts whatever you tell it to assert (for instance, you might check that the JSON your endpoint returns is correct). If this synthetic test success rate is being tracked, we can use that success rate for our correctness SLO.
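The assertion half of a synthetic test can be sketched in a few lines. This is a hedged illustration, not a particular vendor’s API: the field names (`user_id`, `items`) are hypothetical stand-ins for whatever your endpoint actually returns, and a real synthetic test runner would call something like this on a cadence and record the pass/fail result as an SLI.

```python
import json

def check_payload(body: str) -> bool:
    """Return True if the endpoint's JSON response has the shape we expect.

    The fields checked here ("user_id", "items") are hypothetical --
    substitute whatever your endpoint actually promises to return.
    """
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False  # not even valid JSON counts as a correctness failure
    # Correctness assertions: the fields exist and have sane types.
    return isinstance(data.get("user_id"), int) and isinstance(data.get("items"), list)

# A real synthetic test would fetch the endpoint hourly (or on whatever
# cadence you choose) and record check_payload(response_body) as an SLI.
```

The success rate of that recorded pass/fail stream is exactly the SLI our correctness SLO needs.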
Now that we’ve found an SLI for each of our SLOs, we need to dial in what “good” means for each.
For our correctness SLO, “good” is simple: the synthetic tests must pass. Figuring out X is a little trickier, since that’s the percentage of tests that we want to pass, but we’ll worry about X in a minute.
For the latency SLO, we have a bit more freedom. The upper limit is 3 seconds, since that’s a business requirement imposed on us by the frontend. We likely want to target a lower latency though, because the frontend still needs time to make other calls and render. This is where setting an SLO becomes more of an art than a science. You’ll need to pick a value that seems attainable, but is also fast enough for your client. Keep in mind that you may have a hard lower limit on latency if your service has dependencies of its own. For example, if your service calls a third-party API with a latency SLO of 500ms, you know your service can’t promise anything better than that. In this scenario, “good” might mean a latency of 1 second or less (i.e. request.duration ≤ 1s). If that turns out to be too hard to attain, you can adjust the threshold upward to something like 1.5 or 2 seconds. If it’s too easy, decrease the threshold.
Lastly, we have the error rate SLO. “Good” is simple for this SLO as well: an error is bad, and a successful request is good. Since the other SLOs are framed as “the service will be good X percent of the time,” we will actually want to flip this SLO around and call it an “availability” SLO rather than an “error rate” SLO. This way X will represent the percentage of requests that were successful (i.e. the service was available) rather than the percentage that errored out. Using the SLIs we have, we’d measure this as (request.hits - request.errors) / request.hits.
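That formula is simple enough to sketch directly (the guard for zero traffic is my own convention; you could just as reasonably treat a trafficless window as undefined):

```python
def availability(hits: int, errors: int) -> float:
    """Fraction of successful requests: (request.hits - request.errors) / request.hits."""
    if hits == 0:
        return 1.0  # convention: no traffic means nothing failed
    return (hits - errors) / hits

# 5 errors out of 1000 requests is 99.5% availability.
rate = availability(1000, 5)  # 0.995
```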
Finally, we’ve come to the hardest part of creating an SLO: figuring out an actual X value, a real threshold, for the percentage of time your service needs to be good.
Let’s start with the correctness SLO. We’ve determined that “good” means the synthetic tests pass. Therefore, X is the percentage of tests that pass (successful tests / total tests). In an ideal world, that percentage would be 100%. I mean, if the tests failed during the deploy process, we’d halt the deploy, right? So why would we tolerate anything less than 100% correctness in production?
The trouble is that our correctness SLO isn’t just verifying that our logic is correct. It acts more like an in-depth error rate monitor: it checks results more thoroughly, but probably runs less frequently, and it’s still subject to the whims of transient network issues and other outages. With that in mind, we need to accept the fact that some of the tests will fail. Let’s throw out 95% as a starting value for X and adjust from there. In practice, that means that if this synthetic test runs once an hour, you’d run 720 tests in a 30-day month, and 36 of those 720 would be allowed to fail.
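That error budget arithmetic generalizes to any SLO: the budget is just the allowed-failure fraction times the number of events in the window.

```python
def error_budget(target: float, total_events: int) -> float:
    """Number of 'bad' events the SLO tolerates in a window.

    target is the SLO percentage as a fraction (e.g. 0.95 for 95%)."""
    return (1 - target) * total_events

# One synthetic test per hour over a 30-day month:
runs_per_month = 24 * 30                                  # 720 runs
allowed_failures = error_budget(0.95, runs_per_month)     # ~36 failures allowed
```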
Next we have our latency SLO. For this one we determined that “good” means requests take 1 second or less to complete. We can measure the proportion of requests that meet that threshold and compare it to our X. In this case, since we’ve picked a threshold and have some freedom to move it around, I’d set X to 99% to start. That means 99% of all requests need to complete in 1 second or less. If you run into issues with this SLO, I’d adjust the latency threshold first (to the 1.5 or 2 seconds we discussed earlier). If you still can’t reliably attain the SLO, then you probably need to improve your service!
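Measuring that proportion is a one-liner over your request.duration SLI. A minimal sketch, assuming you can pull the raw durations (in practice your monitoring platform computes this for you):

```python
def latency_attainment(durations_s, threshold_s=1.0):
    """Fraction of requests that completed within the latency threshold."""
    if not durations_s:
        return 1.0  # convention: no traffic means no slow requests
    within = sum(1 for d in durations_s if d <= threshold_s)
    return within / len(durations_s)

# 990 of 1000 requests under 1s would be 0.99 -- right at a 99% target.
# If that's unattainable, loosen the threshold before loosening X:
#   latency_attainment(durations, threshold_s=1.5)
```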
Last we come to the SLO where we have the most freedom: the availability SLO. We have freedom because no one left any notes in JIRA about how many errors are tolerable. At the same time, everyone implicitly agrees that errors are bad and that preventing them is important. An error on the backend could cause the frontend to show the user an error page, which undoubtedly leads to a bad user experience (which then has business ramifications). So how do we pick a target availability?
My advice is to make the availability SLO as tight as you can make it, commensurate with the importance of the experience you’re serving. By “importance” I’m referring mainly to the impact this experience has on your business goals. Let’s say you’re a subscription business, and this is the page where users put in their card info to pay for the subscription. An error here could be devastating. You’ve gotten the user to the point of sale, then failed them at the last moment, causing them to reconsider giving you their money. Further, an error on a page dealing with their credit card information will likely shake their trust. It is important to get this experience right. More time should go into engineering the services providing this experience to make sure it’s available.
So how tight can you make an availability SLO without investing undue engineering effort? It depends on what your service is doing and what the SLOs of its dependencies look like, so let’s narrow down our example a little bit. Let’s say your backend service serving a REST API is simply returning JSON payloads to its clients. Those payloads are filled out by querying a Postgres database that has tables with hundreds of thousands of rows (not millions or billions). Let’s also assume there aren’t any data integrity issues. In my experience:
- a 90% availability is just plain lazy. With no data integrity issues, the database shouldn’t be a problem. If you’re exceeding a 10% error rate, either you’ve got bugs in your code, there’s a config issue (perhaps making too many database connections, or too few and timing out), or there’s some other fixable problem. Perhaps you’ve run into transient network issues, but with database failover and server health checks this kind of problem is often solved automatically. A service like the one we described simply shouldn’t have an error on 1 out of every 10 requests.
- a 99% availability is much better. Personally I think this is a great starting place for a new SLO, and you can dial it in as you collect more data. This availability target is achievable with good tests (to make sure you haven’t got bugs in your API), and with basic infrastructure automations like autoscaling, database failover, and health checks. Usually by the time a team is considering SLOs the team has a decent automated test setup, so issues aren’t coming from code defects or blatant config issues. Instead you can get subtle issues, like a failing dependency.
- a 99.9% availability is also achievable, though you may need to dig deeper to hit this target. Budgets might need to get bigger to allow for more servers, more resources, and more redundancy. Autoscaling policies should be aggressive. Applications could also benefit from retries, backoff logic, circuit breakers, async processing of longer-lived work, caching, and other strategies. It goes without saying that test coverage should also be excellent, and exploratory testing should be performed to root out any edge cases.
From there, every order of magnitude improvement gets harder and harder. In my experience, going from 90% to 99% requires some process changes. Going from 99% to 99.9% requires some more rigorous engineering. Getting from 99.9% to 99.99% is harder still (and honestly might not be achievable without multi-region redundancy and other costly engineering efforts).
A real world comparison
Let’s look at Google’s Pub/Sub service to get a frame of reference. Google says they try to provide a 99.95% monthly uptime percentage for their regional topics, which they define as the “total number of minutes in a month, minus the number of minutes of Downtime suffered from all Downtime Periods in a month, divided by the total number of minutes in a month”. A Downtime Period is basically a period of 60 seconds or more where all requests are failing.
First, let’s note that Google measures their SLO in minutes, rather than requests. Their X value is computed by dividing the minutes where they were “up” by the total number of minutes in that month. With a 99.95% uptime target, they’re only allowed about 21 and a half minutes of downtime a month before they start paying credits back to their users.
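The downtime figure falls straight out of the SLA arithmetic: a 30-day month has 43,200 minutes, and 0.05% of that is about 21.6 minutes.

```python
def downtime_budget_minutes(uptime_target: float, days_in_month: int = 30) -> float:
    """Minutes of downtime allowed per month under a minutes-based uptime SLO."""
    total_minutes = days_in_month * 24 * 60   # 43,200 for a 30-day month
    return (1 - uptime_target) * total_minutes

# At 99.95% over a 30-day month: about 21.6 minutes of allowed downtime.
budget = downtime_budget_minutes(0.9995)
```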
This is an interesting way to phrase an SLO because it frees Google from counting all failed requests against their SLO. Instead, requests must be failing for a full 60 seconds or more before they start to count. So if there was an issue where requests were returning 500s, but it resolved before 60 seconds elapsed, Google would count that minute as having been 100% available. It comes off as some tricky legal language to me, so I prefer to be honest and just divide the number of successful requests by the total number of requests to calculate uptime.
The last thing to note is that Google evaluates their SLO over a one month time window. This makes sense for their use case, since billing happens monthly. From my own experience, this is still a reasonable time window even if your service isn’t beholden to a billing cycle. Traffic tends to fluctuate over the course of a week, so looking over one month gives you at least four weeks to amortize your errors over. This prevents occasional traffic bursts (say from marketing blasts or other non-organic sources) from exhausting your SLO error budget. You’ll still see how the traffic spikes can impact your level of service, but it will be in light of the larger pattern of user behavior, which can help you make more informed decisions about the level of service you’re providing. Another good approach is to track your SLO over multiple time windows. Datadog has 7 days, 30 days, and 90 days as the defaults if you make an SLO on their platform.
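Tracking one SLI over several trailing windows is cheap if you keep per-day counts around. A rough sketch (the per-day `(hits, errors)` tuples are a stand-in for however your platform stores rollups):

```python
def windowed_availability(daily, window_days):
    """Availability over the trailing window, given per-day (hits, errors) pairs,
    oldest first."""
    recent = daily[-window_days:]
    hits = sum(h for h, _ in recent)
    errors = sum(e for _, e in recent)
    return (hits - errors) / hits if hits else 1.0

# Evaluate the same SLI over multiple windows, like Datadog's 7/30/90-day defaults:
#   {w: windowed_availability(daily, w) for w in (7, 30, 90)}
```

A burst of errors on a single day drags the 7-day number down hard while barely denting the 30-day one, which is exactly the amortizing effect described above.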
Overall, SLOs are tricky and will need to be adjusted a few times after you set them, but they’re incredibly helpful for creating a more stable software ecosystem and for detecting issues over different time horizons. If you approach your level of service from the mindset of a user of your service, you’ll be able to figure out which SLOs you want to set and also roughly what they should be set to. Keep in mind that you will need to measure SLIs to actually set and track SLOs, so think ahead and instrument your services well!
After looking at your SLOs from the customer perspective, remember that business requirements are transitive, and therefore the requirements on the service furthest downstream apply to all of the upstream services. In our example, the frontend needed to load in 3 seconds, which meant that our backend service had to respond in less time than that!
One last point when it comes to setting SLOs: upstream services can’t always adjust to meet the needs of downstream services. Occasionally there will be a hard limit in a dependency and your service simply won’t be able to do better than that limit. Take Google Pub/Sub as an example: we know they promise 99.95% uptime, so if our backend service has a hard dependency on Pub/Sub, we can’t promise better than 99.95% uptime ourselves (unless perhaps we build in retries and redundancy and take other steps to self-heal if Pub/Sub fails).
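One way to reason about that ceiling: if a request must traverse every dependency in sequence, and failures are independent, availabilities multiply. This is a simplifying model (real failures correlate, and retries can claw some availability back), but it gives a useful upper bound:

```python
def serial_availability(*availabilities):
    """Rough upper bound on availability for a request that must pass through
    every listed dependency in sequence, assuming independent failures."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# A service that is itself 99.9% reliable, with a hard dependency on a
# 99.95% one, can't promise better than roughly 99.85% end to end.
combined = serial_availability(0.999, 0.9995)
```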
Last but not least, keep an eye on your SLO error budgets. If you’re burning through your error budget every month, you want a process for dealing with it. I like to have the on-call person on my team triage it and come up with some ideas to fix it. The rest of the team helps pick an approach, and then we put a story on the backlog to prioritize. Don’t just ignore error budget violations! Objectives don’t mean a thing if you’re not accountable to them.
Or, if you're curious why this kind of work matters, I'd also highly recommend giving Accelerate a read. This book completely changed how I think about software engineering.
Please note that the link above is an affiliate link, so I’ll receive a small commission if you click through and purchase the book. That being said, I’ll only ever recommend books I’ve read and found to be useful! Happy coding!