World Without Eng

Five More Years Until I Can Automate My Job

Published: 2025-06-27
Updated: 2025-06-27

Earlier this month, I went to Datadog’s DASH conference in New York. It was a typical conference: overwhelming in-person product marketing and upsells disguised as support sessions. And happily oblivious to it all were the hundreds of developers hanging out and enjoying a couple days away from the everyday work grind. I was no exception, of course.

I won’t cover everything that went on, since most of the sessions I attended weren’t great. I tried to go to ones with fascinating titles, like “Faster Pipelines Without The Flakiness: Building a CI System Developers Actually Like.” It’s a mouthful, but it’s relevant for my team. However, it turned out to be a pitch for Datadog’s CI Observability tool with a customer testimonial tacked on. Note for next time: go to some of the fireside chats or case-study presentations from other companies instead.

Those sessions aside, the keynote was interesting. It was also a barrage of product marketing, and some of it felt like managers had volunteered to present so they could add it as a line item on their promo packets. It was a little goofy. Yet, some of the products showed promise.


AI SRE and On-Call on a call

Before I get into it, let me caveat: I’m not an AI hype person. I’m sick of hearing about it, actually. Similar to the way things have played out with smartphones, I’m betting that this tool, which should expand our horizons, will actually end up shrinking our participation in the real world, and that will ultimately hurt us more than it will help us. But I digress.

The keynote was bursting with AI. Almost all of the products, except for a brief segment on cost-efficient log storage, were applications of AI to various SRE-esque tasks that Datadog powers.

In fact, the very first product they showcased was Bits AI SRE. (Bits is Datadog’s cutesy little name for their AI agent). The SRE product seemed genuinely helpful—it could automatically triage issues by generating multiple hypotheses about what caused a particular alert. It detailed its attempts to validate each hypothesis, and as it found leads, it generated more hypotheses about those particular causes and continued to investigate. It essentially asked, “Why did this happen?” until it reached the root cause. That’s exactly what humans do when faced with an alert or an incident, and since most of our triage is done in Datadog, why not have Bits take over? That’s one stressful part of my job that I’m happy to give away, even if it only works in a small percentage of cases.
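Just to make the shape of that loop concrete, here’s a toy sketch in Python of a hypothesis-driven triage loop. To be clear, this is my own illustration, not how Bits is actually built; the generate and validate callables stand in for whatever the agent does under the hood.

```python
# Toy sketch of the hypothesis-driven triage loop described above. This is
# my own illustration, NOT how Bits AI SRE is actually implemented; it only
# shows the "keep asking why until you hit a root cause" shape of the work.
from typing import Callable

def investigate(
    alert: str,
    generate: Callable[[str], list[str]],   # propose possible causes for a symptom
    validate: Callable[[str], bool],         # check a cause against telemetry
    max_depth: int = 5,
) -> list[str]:
    """Walk the causal chain from an alert toward a root cause."""
    chain: list[str] = []
    symptom = alert
    for _ in range(max_depth):
        hypotheses = generate(symptom)                    # "why did this happen?"
        confirmed = [h for h in hypotheses if validate(h)]
        if not confirmed:
            break                                         # no validated lead; stop here
        symptom = confirmed[0]                            # follow the strongest lead
        chain.append(symptom)                             # and ask "why" about it next
    return chain  # the last entry is the closest thing to a root cause
```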

Immediately after that, they presented On-Call, which was not quite as impressive. It was a slow lady voice that read alert details over the phone. You could respond by voice to ask for the next step in the runbook linked to that alert (you did write a runbook for every alert, right?), and it could even hook into automations to perform the steps in the runbook for you. That all seems great, except the AI voice speaks slower than some of the professors whose lectures I used to watch at 4x speed, and there’s no way I’ve written runbooks for every alert, much less created automations based on them. In reality, On-Call is a slower version of the text alerts that pop up in Slack. At 3am, I would be lulled back to sleep if the On-Call lady started talking. This product would be way more helpful if it called Bits AI SRE and droned to it instead.

That got me wondering how useful Bits SRE would actually be. Where the On-Call product requires alert hygiene and discipline to create runbooks and write automations for them, the SRE product demands monitoring and APM metadata integrity. You’d need to have excellent SLI coverage, correct tagging for your SLIs, and thorough tagging throughout your span hierarchy to ensure Datadog can identify service dependencies correctly. You’d also need database spans to be labeled as ‘database,’ web spans labeled as ‘web,’ and so on.

The ddtrace library should do much of that out of the box, but there are many places where we’ve abused APM or built up cruft. We have custom spans with questionable data integrity and many areas where our service tag is the same across independently deployable components when it should probably be unique. We retain all top-level request spans for an accurate request count, but then we sample all sub-spans, making traces appear wildly inconsistent. And if Bits SRE is responding to monitors, I already know many of our monitors are not fine-grained enough to easily get to a root cause. Rather than alerting on an individual endpoint’s error rate or latency, we alert at the service level in some cases. What kinds of hallucinations might that cause?
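If we ever clean that up, the fixes are mostly configuration rather than code. As a rough sketch, assuming ddtrace’s documented environment variables, it might look something like this; the service name, version, and sample rate are hypothetical, and in practice these would live in the deployment manifest rather than in application code.

```python
# Rough sketch of the configuration cleanup our setup would need before an
# AI SRE could reason about it. All values here are hypothetical, and in a
# real deployment they would be set in the service's manifest, not in code.
import os

# 1) Give each independently deployable component its own service name
#    instead of reusing one shared tag across all of them.
os.environ.setdefault("DD_SERVICE", "billing-worker")
os.environ.setdefault("DD_ENV", "production")
os.environ.setdefault("DD_VERSION", "2025.06.27")

# 2) Sample whole traces with an explicit rule rather than keeping root
#    spans while dropping their children, which is what makes our traces
#    look wildly inconsistent today.
os.environ.setdefault(
    "DD_TRACE_SAMPLING_RULES",
    '[{"service": "billing-worker", "sample_rate": 0.2}]',
)

# Configure before the tracer starts so it picks these settings up
# (ddtrace reads its configuration from the environment at import time).
import ddtrace.auto  # noqa: E402,F401
```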

Me at the Datadog DASH conference

Moar AI (to help you write bad code)

The products kept coming—as did the AI hype and my concerns about how well it would navigate our poor metadata integrity and hygiene. First, there was AI Dev Agent, a tool that would write PRs to fix issues detected in Datadog. Cool on the surface, except you need to give Datadog access to all of your source code and then wonder whether it can figure out the real offender when your service tag isn’t correct half the time.

Next up was APM Investigator, another hypothesis generator that attempts to automatically identify the root cause of latency increases. How does that work when half your traces are missing all of their spans except the root? What does it think the problem is when the distributed tracing setup is so opaque you need a palantir to discern the network hops your app made?

Last in this block of products was App Recommendations, which was like the AI Dev Agent, except it just chatted changes to you instead of making a PR directly. The verbatim note I took about this one was “tells your engineers things they already know or should’ve known or should’ve done or can’t do because the AI is hallucinating or doesn’t have enough context.”

All three of those products have compelling value props, except that they require thorough span coverage, correct span metadata, and accurate distributed tracing. They probably can’t tap into the Gnostic secret knowledge that we employees possess, like knowing that when service A fails on an innocuous span called “foo,” it means it tried to make a request to service B but was rejected at some uninstrumented part of our networking stack. Even the best value prop doesn’t mean much if you can’t get the tool to work.

Another issue is that these products benefit from broader Datadog adoption. If you have RUM and logs set up and tagged correctly, each tool gets dramatically more information and context to make good suggestions and hypotheses. We are, of course, not using those products.

This is a good business model, though. It lures customers in because the AI tooling increases in value as Datadog use increases, which creates a moat that makes it very hard to leave. Other AI agents won’t be as effective because they lack context. Using disparate tools for logs, spans, and metrics doesn’t make sense if you want the AI agents to be maximally effective. Datadog is drawing everything observability-related into its atmosphere and charging a pretty penny for each addition.

IDP and MCP oh my

A few other products were mentioned, including one non-AI tool that also posed problems. That was the Internal Developer Portal (IDP). It’s basically Service Catalog on steroids, trying to woo customers away from Backstage and other developer portals. The downside is that it’s clearly designed for microservices. We’d have a hard time setting up nice one-to-one service-to-team mappings with runbooks, monitors, and SLOs neatly contained therein. Our monolith app would lead to more of a one-to-N structure, with a pile of monitors and garbage and still no clear ownership. Have I mentioned I’m not a big fan of monoliths? I have blabbed about my preferred code structure and ownership in the past.

After that, there was MCP Server, which allows Datadog to connect with any AI agent you’d like to use for development, so it can nag you about writing unit tests right in your IDE.

There was also Codex CLI, which I didn’t totally understand; it was like a Datadog TUI that hooks into ChatGPT, I think. There must be some licensing money involved in exchange for Datadog’s hoards of customer data. Somewhere, Sam Altman is cackling in monospace.

Flex Logs: the least hyped product that I’m most hyped about

The best part of the whole presentation was a small segment near the end on Flex Logs. Customers were ticked that log retention was so expensive, so Datadog introduced a new tiered storage solution to bring prices down. They also added Flex Frozen to get affordable 7-year retention for auditing purposes. There was also the nifty Archive Search feature, which lets you dig through archived logs in Datadog or your own S3 bucket using Datadog’s search UI. It seemed reasonably fast, too, even for logs that were being indexed on demand. I loved it—simple, solves a real customer problem, and selfishly gets me closer to convincing my boss that we should use Datadog logs.

Then we rounded out the talk with a few bizarre case studies, including Toyota showing off SOS buttons, which have been in cars since like 2008, and elevator music playing over Okta execs talking about how much they like looking at logs in Flex Logs. Right on, I guess?

Takeaways

Sadly, my conclusion is that I don’t think I can automate my job away yet. Bits SRE could do some incident triage, but that’s a relatively small percentage of my work. If I want Datadog to detect issues, identify root causes, talk to me slowly and therapeutically about the automations it’s going to run, and then submit PRs to remove all of those naughty bugs, it’s going to need a lot more information and much better metadata integrity and hygiene. I’ve got at least five years left of trying to convince folks to use Datadog logs, to say nothing of breaking up the monolith into clear domains, assigning team ownership over those bounded contexts, fixing our APM abuse, and tagging everything correctly.

Perhaps that’s what we need some AI for: fixing all of our messy bureaucratic and organizational problems. I’d rather write code than have to convince some directors that the monolith is unideal. It’s so frustrating explaining issues to people with no skin in the game and hoping they hear you. But you can’t raise $40 billion by doing something sensible like getting rid of the stuff people don’t want to do. You’ve got to automate the things people wish they had more leisure time to focus on instead, like programming, art, writing, and music. Did I mention I’m not that hype about AI?

Maybe we should step back from the craziness and get back to doing what engineers do: solving real problems. If Datadog wants to save me 10 to 20 hours a week at work, all they need to do is mail me a cardboard cutout.

This is the de facto book to read if you want to learn more about SRE: Site Reliability Engineering. Please note that's an affiliate link and I'll earn a small commission if you make a purchase.

I've also written about related topics, like sorting out on-call responsibilities and writing SLOs. Thanks for reading!

Support me on Substack or Patreon!