Ownership and on-call
At many modern tech companies, teams build services and run them too. I watched this shift take place at the first company I worked at as the DevOps movement was taking off. Rather than writing code on one team, then throwing it over a wall to another team to operate, companies started combining "development" and "operations" into one team, hence "DevOps". This approach enables teams to build both product knowledge and technical leverage, since the team gets to see how its product and technical decisions fare in production. Every time a team sees user frustration, high latencies, or unexpected failure modes, it sees new opportunities to improve the UI, the UX, or the level of service going forward. This learning is crucial for success, and it can only come when a team is responsible for its code in production.
Production responsibility places demands on the team, however. One such demand is on-call: the practice of equipping an engineer with a pager (typically an app on a computer or phone) that notifies them when something breaks. The engineer must respond to any production issues they get paged for. While on-call isn't the most pleasant part of the job, it is an important part of providing a consistent level of service to customers. In my own experience, it's also a great way to learn rapidly from technical mistakes (which lets you build more technical leverage). Problems that page at 3am tend to get patched up pretty quickly!
Despite its importance, I have seen on-call get screwed up and neglected. Sometimes teams fail to create enough monitors and alerts, so they aren’t paged for real incidents. Other times, teams have too many noisy alerts and they start to ignore them altogether. From what I’ve seen, these are symptoms of the same underlying problem: many companies do not understand service ownership properly, and responsibilities like on-call suffer as a result. To make on-call successful, we need to sort out the more fundamental problem of service ownership.
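To make the "too few alerts vs. too many noisy alerts" trade-off concrete, here is a sketch of a symptom-based paging alert in Prometheus alerting-rule syntax. The service name, metric names, and thresholds are all hypothetical; the point is that the alert fires only on a sustained, user-visible error rate rather than on every transient blip, which keeps the pager high-signal:

```yaml
groups:
  - name: checkout-api-alerts
    rules:
      - alert: CheckoutHighErrorRate
        # Page on sustained, user-visible symptoms, not transient blips.
        expr: |
          sum(rate(http_requests_total{service="checkout", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="checkout"}[5m])) > 0.05
        for: 10m          # must hold for 10 minutes before anyone is paged
        labels:
          severity: page
          team: checkout
        annotations:
          summary: "Checkout API 5xx rate above 5% for 10 minutes"
```

A team that owns its service end-to-end is the only one positioned to pick that threshold well, because it knows what error rate customers actually notice.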
How NOT to own a service
One of the easiest ways to screw up service ownership is to put it in the hands of multiple teams. Ownership diffused across more than one team is as good as no ownership. As groups get larger, people tend to contribute less (the Ringelmann effect), so we can expect people to respond to fewer pages, or to defer to others to fix problems ("oh, someone else with more knowledge will get that"). Taking a step back, it also tends to mean that no one is a real domain expert capable of identifying the right service-level indicators (SLIs), let alone monitoring them. You might be flying totally blind, with no one who knows or cares, because there's no real accountability for the product metrics. When asked why a product isn't performing as expected, teams have a large surface area to spread the blame over. Ownership in this case feels like Bilbo Baggins: "thin, sort of stretched, like butter scraped over too much bread".
On the flip side, companies screw up service ownership by neglecting certain areas of their app or product. Some services aren't owned by any team at all, so the technical and product knowledge associated with them slowly fades into the ether as employees come and go. Without that knowledge, responding to pages becomes very challenging (assuming a team is even receiving pages for the service). I acknowledge that some services stop needing frequent feature updates after a while. However, those services should at least be assigned an owning team in GitHub, and someone should be updating libraries and applying security patches periodically.
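Assigning an owning team in GitHub can be as lightweight as a CODEOWNERS file. This is a sketch with hypothetical org and team names; the idea is that every path maps to exactly one owning team, so review requests (and accountability) never fall through the cracks:

```
# .github/CODEOWNERS
# Last matching pattern wins; every path should resolve to one owning team.
*                   @acme/platform-team
/services/billing/  @acme/billing-team
/services/search/   @acme/search-team
```

Even for a service in maintenance mode, this keeps dependency updates and security patches routed to someone who is answerable for them.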
Now that we’ve seen how ownership can go wrong—either by having too many teams or too few—the ideal option we’re left with is to have one team own a service. But now we need to answer: which team?
The team providing the API is the team that owns it
This is where the principles of service-oriented architecture (SOA) can help. In effect, SOA tells us that teams should only be communicating through APIs. Imagine you're using a vendor's API: you can't see their trace metrics or their cluster health or any of that. Also, you're their client, and probably paying them money for their service. Who is responsible for the uptime of that API in this case? Clearly it's the vendor! The same is true for two teams: if you're using another team's API, you are the client and they are the vendor. They are responsible for that API, even if it's a hard dependency in the product experience you're building, just as a vendor remains responsible for their API even when you build your product on top of it.
Now let me say, you are free to monitor an API you’re using. You can still collect trace metrics from your side to see the latency, error rate, client-side errors, and so on. This is probably wise as a diagnostic tool. However, ultimate responsibility for understanding and fixing the service providing the API falls with the owner. Let me reiterate: consuming an API does not make you the owner. The owner is the one providing the API.
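As a minimal sketch of what that client-side diagnostic monitoring might look like, here is a small Python wrapper that records latency and error counts for calls into another team's API. Everything here (the class, the endpoint labels, the stand-in call) is hypothetical illustration, not a real library:

```python
import time
from collections import defaultdict

class ClientSideMonitor:
    """Records latency and error counts for calls to a dependency's API.

    This is diagnostic telemetry for the *consumer* side: it shows how the
    vendor's API looks from where you sit, but fixing that API is still the
    owning team's job.
    """

    def __init__(self):
        self.latencies = defaultdict(list)   # endpoint -> list of seconds
        self.errors = defaultdict(int)       # endpoint -> error count

    def call(self, endpoint, fn, *args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.errors[endpoint] += 1       # count client-visible failures
            raise
        finally:
            self.latencies[endpoint].append(time.monotonic() - start)

# Usage: wrap calls to the other team's (hypothetical) client library.
monitor = ClientSideMonitor()
result = monitor.call("GET /users", lambda: {"id": 42})  # stands in for a real HTTP call
print(result)                                 # {'id': 42}
print(len(monitor.latencies["GET /users"]))   # 1
```

Data like this is great for conversations with the owning team ("we're seeing p99 of 800ms from our side"), but it never transfers the pager to you.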
The domain team should own the domain API
Let’s go on a tangent for a second, since this is also something companies screw up: who should be providing the API in the first place?
Companies sometimes make the mistake of splitting up teams by mission. This seems great in theory (“I have a team making customers happy, and a team acquiring more customers, and a team getting customers to spend more!”), but in practice it leads to bad software and inefficiency. Mission-based teams have messy architecture because missions are not tangible. You cannot build software for an ambiguous mission statement. Software needs clear inputs and outputs, and it needs to perform well-defined operations on that data. Also, architecture cannot change on a whim. Upper management might change their objectives like the wind, but because it’s not straightforward to change the software systems, you end up having to juggle services between teams as their missions change. Speaking from experience, it’s a recipe for disaster.
Instead, companies should be splitting teams up by product or domain. This is part of what’s called an inverse Conway maneuver; essentially, Conway’s Law tells us that systems will look like the orgs they’re built in, so if you shape your org like the architecture you want then you encourage that architecture to emerge. Good architecture has clear domains that interact through well-specified interfaces. Accordingly, a good team layout should have clear domains and interfaces too. Teams should own their product or domain in its entirety, and ownership should be long-lived. Then, rather than reforming teams around the work that needs to be done, the work should be broken down according to which product or domain needs to change to accomplish the mission. Those changes should then be brought to the respective teams. This is how you create stable, well-architected systems that actually delight customers.
Now circling back to API ownership, we can see that each domain should have its own API, as the API is the interface into that domain. Since one team owns the entirety of a domain, that means they own the domain API as well. As a result, that domain team should be on-call for that domain API. It should not be ambiguous! If it is, you probably need to spend more time mapping domains.
Ownership beyond APIs
One final consideration: how do ownership and on-call work if a team isn't providing something as clear-cut as an API? For instance, what if your team provides a set of standards, patterns, or some underlying infrastructure for an application to run on? Who is the owner now?
In this situation, we can borrow from the AWS shared responsibility model. The idea is, in short, that AWS is responsible for the infrastructure running all of their services and offerings, while the customer is responsible for using those services and offerings correctly. This includes configuring them correctly. So if you’re a team owning some infrastructure (the “AWS” team in this situation), you must monitor that infrastructure and ensure it is operational at all times. Meanwhile, if you’re the client who’s deploying code to that infrastructure, it is your responsibility to make sure your code is working as expected. If the service goes down due to a bug in an endpoint, the application team (client) must respond to it. If the service goes down because the worker nodes keep tipping over, then the infrastructure team (vendor) must respond to it.
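That split can even be encoded in alert routing. As a sketch in Prometheus Alertmanager configuration syntax (the team names and the `layer` label are assumptions, not a prescribed convention), each symptom pages the team that owns that layer of the stack:

```yaml
# alertmanager.yml (fragment): route alerts to the team owning that layer.
route:
  receiver: app-team-pager           # default: application-level symptoms
  routes:
    - matchers:
        - layer = "infrastructure"   # node health, cluster capacity, etc.
      receiver: infra-team-pager
receivers:
  - name: app-team-pager
  - name: infra-team-pager
```

Tagging alerts by layer at the source is what makes this routing possible, and agreeing on those tags is part of the shared responsibility conversation.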
Service ownership in a nutshell
Overall, the team providing a service is the owner and should be on-call. A team consuming a service is not the owner, even if the service is a hard dependency for the experience being built. This applies generally, whether the “service” in question is an API, or an infrastructure platform, or some other kind of dependency.