phillipcarter 2 days ago | next |

The issue I see is that this is pretty much the final boss for AI systems. Not because the tasks to do are inherently too difficult or whatever, but because the integration and quality of the data are so variable that you just can't get things done reliably.

Compare this to codebase AI, where much of the data you need lies in your codebase or repo. Even then, most of these coding tools aren't even close to automating meaningful coding tasks in practice, and while that doesn't mean they can't in the future, it's a long ways off!

Now in the ops world, there's little to no guarantee that you'll have relevant diagnostic data coming out of a system that you need to diagnose it. That weird way you're using kafka right now? The reason for it is told via oral tradition on the team. Runbooks? Oh, those things that we don't bother looking at since they're out of date? ...and so on.

The challenge here is in effective collection of quality data and context, not the AI models, and that's precisely what's so hard about operations engineering in the first place.

ern 2 days ago | root | parent | next |

> Even then, most of these coding tools aren't even close to automating meaningful coding tasks in practice, and while that doesn't mean they can't in the future, it's a long ways off!

Not related to your main point, but I've introduced GitHub Copilot to my teams and, surprisingly, two of our strongest developers reached out to me independently to tell me it's been a huge boost to their productivity: one in refactoring legacy code, the other in writing some non-trivial components. I thought the primary use would be as a crutch for less capable developers, so I was surprised by this.

As a middle-manager whose day job previously robbed me of the opportunity to write code, I've used ChatGPT 4o to write complex log queries on legacy systems that would have been nearly impossible for me otherwise (and would have taken a lot of effort from my teams), and to turn out small but meaningful tasks, including learning Android dev from scratch to unblock another group, and other worthwhile things that keep my team from being distracted and free to deliver.

I guess there's a "no true Scotsman" fallacy hiding there, about what constitutes "meaningful coding tasks in practice", but to me, investing in these tools has been money well spent.

phillipcarter 2 days ago | root | parent | next |

Oh, I completely agree with using tools like this. For example, the latest models are very good at being passed a description of a problem and its inputs, expected outputs, and some sample test cases, and then generating a very diverse set of additional cases that likely account for some edge cases you might have missed. Hugely productive for things like that.

However, these same coding assistants lack so much! For example, I can't have a CSV sitting in the same directory as a Jupyter notebook and just start prompting and coding; I first have to call df.head() so the results get burned into the notebook file before the assistant knows anything about the data. The CSV is sitting right there! These tools should be able to detect that kind of context, but they can't right now. That's the sort of thing I mean when I say we have a long way to go.
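
Concretely, the ritual today looks something like this (a rough sketch; data.csv is a made-up file name):

    import pandas as pd

    # Load the CSV that is sitting right next to the notebook
    df = pd.read_csv("data.csv")

    # The assistant only "sees" the schema once this output is saved
    # into the notebook as a cell result
    df.head()

Only after that cell output exists in the .ipynb does the assistant have any idea what columns it's working with.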

salomonk_mur a day ago | root | parent |

But still, a huge productivity boost. I think we can say that, as of the latest models, AI pair programmers are pretty great and save a ton of time.

fumeux_fume a day ago | root | parent | prev |

My experience with Copilot has been the opposite of your devs'. It frequently screws up routine tasks, has a poor UI, and writes buggy code I would expect from someone brand new to programming. Sometimes it'll chance on a nice solution without requiring modifications, but not often enough to fool me into any glowing reviews! I think I have higher standards than most, though.

clvx 2 days ago | root | parent | prev | next |

One thing I don't trust about this approach: when using coding assistants, the generated code might not be what you need at first, so you keep iterating or take what's useful from the output. In ops, that same approach can make things worse, burning more money and trust.

shombaboor a day ago | root | parent |

I've definitely gotten into prompt cycles where I ask myself whether it would have been quicker to just write it myself. So far it can't do anything I can't do myself; it's a time saver for the most common boilerplate blocks, function definitions, etc.

beoberha a day ago | root | parent | prev | next |

I agree completely. SRE/Ops/Livesite is an incredibly hard problem and very easy to make shiny demos for products that will not reproduce those results when you need them most.

The article talks about moving past “copilots” and going right to “agents”. There's probably some semantics to decipher there, but we haven't even gotten copilots to work well! At their core they're essentially the same problem, but I feel a lot safer with a chatbot suggesting a mitigation than with an agent just going and performing it.

arminiusreturns a day ago | root | parent | prev |

Ops dude here. I agree. I recently did some digging on the real future of AI in dev/ops, and I found that the higher the complexity, the less capable the AI (oh god, I've turned into one of those people who now says AI instead of ML/DL). Operations is the height of big-picture complexity, exactly what AI is not good at. That said, I think it could do a lot to assist with finding anomalies that get missed in the flood of data. I've done some fun log stuff on this using z-scores before, but it took mental effort! So I do think it could help a lot with the queries/searches, but it's unlikely to be able to do "whole system" observability across DCs/stacks very well in its current iteration.
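
The kind of z-score check I mean, as a minimal sketch (assumes you've already bucketed error counts per minute; the numbers and threshold are made up):

    import statistics

    def anomalies(counts, threshold=3.0):
        # Flag per-minute counts whose z-score exceeds the threshold
        mean = statistics.mean(counts)
        stdev = statistics.stdev(counts)
        if stdev == 0:
            return []
        return [(i, c) for i, c in enumerate(counts)
                if abs(c - mean) / stdev > threshold]

    # e.g. error counts per minute from a log pipeline; the spike gets flagged
    print(anomalies([4, 5, 3, 6, 4, 5, 4, 5, 3, 6, 4, 5, 47, 4, 5, 3]))

Nothing fancy, but it flags the spike without me having to eyeball the whole flood.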

PS: I hate how many "agents" already have to run on systems. Especially when the prod stuff is core-starved already. I can't tell you how many times I've found an agent (like CrowdStrike) causing some strange cascade issue!

RodgerTheGreat a day ago | prev | next |

I think the past few years have amply demonstrated that they can be both snake oil and the future of SRE. A dim future indeed.

grugagag 2 days ago | prev | next |

I feel AI will get in the way, the same as other products have in the past. Sure, it'll fit some areas and we'll hear a happy story here and there, but businesses need to focus on their core competencies and do that job well before hoping for a magic solution. We, the workers, will be the ones cleaning up the mess…

ppsreejith a day ago | prev | next |

> If every major APM vendor and dozens of startups release agents in the next year, it will be difficult for customers to tell what’s snake oil or what’s actually useful. One approach, also seen in the financial space, is having open benchmarks for assessing how well agents can answer questions and show domain-specific knowledge.

IME benchmarks, though valuable, don't fully reflect the real world; they often capture only what's easily quantifiable. The best approach is being able to quickly try out an agent and see how it performs in your own environment. Sort of like having a private test set you can run different agents against to see how they do in the real world.

Disclaimer: I'm building MinusX, a data science agent (github.com/minusxai/minusx)

zug_zug 2 days ago | prev | next |

I cannot for the life of me understand why SRE, of all roles, would be the one to attempt to use agents for. IMO it's one of the last roles they would apply to, long after core development.

I mean, is the AI going to read your source code, read all your Slack messages for context, log in to all your observability tools, run repeated queries, come up with a hypothesis, test it against prod? Then run a blameless retrospective, institute new logging, modify the relevant processes with PRs, and create new alerts to proactively catch the problem?

As an aside - this is a garbage attempt at an article, kinda saying nothing.

sgarland 2 days ago | root | parent | next |

Because the people creating and selling these solutions are charlatans who have probably never debugged a gnarly Linux issue; more importantly, they’re selling it to people who also don’t know any better.

SRE as a term is effectively meaningless at this point. I can count on one hand the number of SRE coworkers I’ve had who were worth a damn. Most only know how to make Grafana dashboards and glue TF modules together.

tayo42 2 days ago | root | parent |

SRE shouldn't be a job. If you look at what an SRE aspires to be, it's just good software engineering.

It's glorified sysadmin work, and tbh the role should have stayed that way in the industry.

If your senior SWEs can't monitor, add metrics, design fault-tolerant systems, or debug the system and environment their stuff runs on, and need to be babysat, there's a problem.

sgarland 2 days ago | root | parent |

What you’re describing is mostly in the application side of things, though. You still need Ops to run all the infra, which is what most places’ SRE role really is.

Most devs I’ve worked with didn’t have any interest in dealing with infra. I have no problem with that; while I think anyone working with computers should learn at least some basics about them, I completely understand and agree with specialization.

tayo42 2 days ago | root | parent |

I haven't seen an ops team run infra in a while. They're all software engineers writing abstractions over cloud. Infra is "platform" now, and it's SWEs running it.

stackskipton a day ago | root | parent |

I mean, I'm SRE/Ops, but at every company I've joined, I've been brought in to manage the infrastructure from whatever ball of wax some dev got operational enough to limp into prod. And when I say limp, they are truly limping and couldn't run to save their life.

You need someone who is disconnected from getting the birthday to display on the settings page and can make sure Bingo, Papaya, Raccoon and Omega Star are just up and running. Esp since Omega Star can't get their shit together.

blackjack_ 2 days ago | root | parent | prev | next |

Exactly, because the major thing every company needs is an opaque agent messing with all the production settings and releases that impact revenue and uptime? This is one of the silliest pitches I've heard in my life.

I'm all for a k8s / Terraform / etc focused GPT trained on my workspace to help me remember that one weird custom Terraform module we wrote three years ago, but I don't want it implementing system wide changes without feedback.

bradly 2 days ago | root | parent | prev | next |

Noticing trends in logs, especially across different systems, could be useful for identifying a lot of problems before things break.

clvx 2 days ago | root | parent | next |

I think this is the ideal spot. I wouldn't trust an agent doing remediation, but I would gladly take input from one regarding services that are acting up or trending toward disaster. That would let me spend my time on what matters instead of cleaning data to figure out what matters. In a perfect world I'd like to give the agent my thresholds, or what success means, and get information back without having to tweak dashboards, configure alerts, etc.

bradly 2 days ago | root | parent |

I've worked on large FAANG systems that used ML models to identify all sorts of things, from hair in pictures to fraud and account takeovers. Tasks were then queued up for human review. AI wasn't a buzzword at the time so we didn't call it that, but I'm guessing this would be similar.

sgarland 2 days ago | root | parent | prev | next |

That already exists, though? Log anomaly detection is a thing.

For all the grief people give Datadog, it is an incredibly good product. Unfortunately, they know it, and charge accordingly.

threeseed a day ago | root | parent | prev | next |

Splunk has had this for close to two decades.

And I’ve worked on some of the world’s largest systems, and in most cases simply looking for the words "error", "exception", etc. is enough when parsing through the logs.
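
Something like this (a trivial sketch; app.log and the keyword list are placeholders) covers a surprising amount of ground:

    KEYWORDS = ("error", "exception", "fatal")

    # Print any log line containing one of the usual suspect keywords
    with open("app.log") as f:
        for lineno, line in enumerate(f, start=1):
            if any(k in line.lower() for k in KEYWORDS):
                print(f"{lineno}: {line.rstrip()}")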

For everything else you need systems like Datadog to visually show you issues, e.g. service connection failures.

stackskipton a day ago | root | parent |

Even then, you generally need someone smart enough to figure out what is causing the error.

Node is throwing EADDRINFO. Why? Well, it's likely DNS, but the cause of that DNS failure varies a lot. I've seen the DNS server being offline, a firewall blocking TCP port 53, a bad hostname in the config, a team taking the service offline, and so forth.
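
The first check is usually trivial; it's the why behind the failure that takes a human. A quick probe, roughly (the hostname and port are made up):

    import socket

    # Can this box even resolve the name the service is configured with?
    try:
        print(socket.getaddrinfo("db.internal.example", 5432))
    except socket.gaierror as e:
        # Resolution failed -- now work out whether it's the DNS server,
        # the firewall, or a bad hostname in the config
        print("DNS lookup failed:", e)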

phillipcarter 2 days ago | root | parent | prev |

FWIW this is part of the promise of AIOps of the 2010s and it's still only barely starting to happen. I'm glad it has, but there's so, so far to go here.

wilson090 2 days ago | root | parent | prev | next |

Agents do not necessarily need to do the entire job of an SRE. They can deliver tremendous value by doing portions of the job, e.g. bringing in relevant data for a given alert or writing post-mortems. There are aspects of the role that are ripe for LLM-powered tools.

threeseed a day ago | root | parent |

I really can’t think of anything more counter-productive than AI post-mortems.

The whole point of them is for the team to understand what went wrong and how processes can be improved to prevent the same issues happening again. The idea of throwing away all the detail and nuance for some AI-generated summary will only make the SRE field worse.

Also, I really don’t understand the benefit of an LLM for bringing in relevant data. I would much prefer that be statically defined, e.g. when a database query takes 10x longer, bring me the OS and hardware dashboards as well.
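
A rough sketch of what I mean by statically defined (the metric and dashboard names are invented for illustration):

    # Hypothetical static rule: if a query runs 10x slower than its baseline,
    # surface the related OS and hardware dashboards alongside the alert.
    RELATED_DASHBOARDS = {
        "db_query_latency": ["os_metrics", "hardware_health"],
    }

    def dashboards_for_alert(metric, value, baseline):
        if metric in RELATED_DASHBOARDS and value > 10 * baseline:
            return RELATED_DASHBOARDS[metric]
        return []

    # e.g. a query that normally takes 40ms just took 900ms
    print(dashboards_for_alert("db_query_latency", 900.0, 40.0))

No model in the loop, no surprises: the rule does the same thing every time.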

mullingitover 2 days ago | root | parent | prev | next |

> As an aside - this is a garbage attempt at an article, kinda saying nothing.

Disagree. It was a good survey of the current AI SRE agents landscape. I don't have my head in the sand, but there are new startups coming up that I hadn't heard about.

>I mean is the AI going to...

Yes. Not one AI but multiple AI agents, all working on different aspects of the problem autonomously and then coming together where needed to collaborate. Single LLMs doing things by themselves are like concrete; creating swarms of agents is like adding rebar to the mix.

gmuslera 2 days ago | root | parent | prev | next |

As with Asimov's The Last Question, there will be insufficient data for a meaningful course of action. But we tend to cut corners, let ourselves be convinced by a few synthetic use cases, and take the risk without being aware that that's what we're doing. The mistakes will be dismissed as misuse or weird installations, and the successes will fuel the hype.

moandcompany 2 days ago | root | parent | prev | next |

They're going to need SRE for the AI :)

Cyph0n 2 days ago | root | parent |

Nah it’s just AI SRE agents all the way down.

wruza 2 days ago | root | parent |

Except you don’t need to go down, you can simply loop them. The quality of such a system would be measured by how much it can do within a particular compute budget, and theoretically it could even self-stabilize via ROI feedback.

6510 2 days ago | root | parent |

Even the most novice coder with a poor brain can write all the software if you give them a million years.

ramesh31 2 days ago | root | parent | prev | next |

>I mean, is the AI going to read your source code, read all your Slack messages for context, log in to all your observability tools, run repeated queries, come up with a hypothesis, test it against prod? Then run a blameless retrospective, institute new logging, modify the relevant processes with PRs, and create new alerts to proactively catch the problem?

Yes. That's exactly where this stuff is going right now.

more_corn 2 days ago | root | parent | prev |

I mean, doesn’t all that sound like something fine-tuned LLMs would be great at? Well-constrained tasks, if/else statements, runbook -> automation.

ddmma a day ago | prev | next |

I believe that expertise combined with automation can build a strong foundation that can then be delegated to AI agents. Presumably not all of them will deliver the greater good, but let’s not lose hope; it’s still in its infancy.