Every few years, the technology world takes a familiar idea, rebrands it with a sharper title, and pushes it back into the limelight. The enthusiasm around Agentic AI is a perfect example. Many organisations are rushing to adopt frameworks they barely understand, often without asking whether the underlying models are appropriate for the job. That’s usually where fragile architecture takes root. Predictive AI belongs to one family of problem-solvers; Generative AI to another. Conflating them and treating them as interchangeable creates conditions for systems that behave unpredictably.
The goal here isn’t to pour cold water on Agentic AI either. Used well, these systems can be incredibly effective. But enterprises, especially those in regulated sectors, need a more grounded understanding of how these architectures work and where the real risks lie. Because when organisations misjudge the ingredients, the end result becomes much harder to trust, monitor or defend, as Kit Ruparel, Chief Technology Officer, TCC Group – Recordsure, explains…
Agents are not new, but the stakes are higher now
Software Agents have been around since the 1950s: small autonomous modules capable of responding to requests and carrying out tasks. They became mainstream decades ago through distributed systems and microservice architectures. For most of that time, they weren’t ‘intelligent’; they were designed to perform tasks autonomously and follow predefined rules, i.e. code that engineers had set down.
Crucially, those agents communicated with one another through tightly defined interfaces. APIs were explicit. Inputs and outputs were structured. If something failed, it failed loudly. Engineers could trace it, interrogate it, and correct it.
But even early examples had a familiar flavour. Anyone who remembers sitting at a ’90s-era desktop, typing their grandmother’s spaghetti carbonara recipe into Microsoft Word and being interrupted by Clippy offering help with a ‘formal business letter’, has met an early AI-enhanced Agent. Clippy parsed intent (often badly), offered guidance, and assumed you needed its opinion.
Then came the Alexa Skills Kit from Amazon in 2015, which effectively launched a consumer-grade Agentic framework. You could ask your device how to make Boeuf Bourguignon or to play a local radio station, and under the hood, it would interpret your request and route it to whichever skill – whichever agent – seemed most capable or appropriate.
So, the idea isn’t new. What is new, however, is the sheer scale of LLM (Large Language Model) adoption. What is also new is how those LLMs allow agentic components to communicate with one another. Instead of relying solely on rigid, formally defined APIs, agents can now pass instructions between each other in plain English. That flexibility is powerful, but it also introduces ambiguity. Natural language is expressive, not deterministic. Subtle differences in phrasing can therefore alter intent, and interpretation becomes probabilistic rather than exact.
According to Gartner, Generative AI (Gen AI) is now the most frequently deployed type of AI in organisations. And a more recent McKinsey survey shows the trend moving even faster: 23% of organisations are already scaling an agent-based AI system in at least one business function, while another 39% have begun experimenting with AI agents.
That acceleration, combined with agents effectively ‘talking’ to one another in natural language, is why the boundary between stable engineering and experimental behaviour has become harder to see, and why clarity now matters more than hype.
Where agentic accuracy starts to crumble
The simplest way to understand the problem is through the example of a pub quiz night.
You’ve been selected as captain of a team of strangers. Mo knows music. Sally knows football. Tabby knows electronics. A question about an Arsenal player’s pre-game eating habits lands; you hand it to Sally because you recognise the club, and she answers correctly. In that moment you’ve played three roles: a ‘Parser’, interpreting the question; a ‘Reasoner’, deciding that your own limited knowledge couldn’t answer it directly but that a teammate’s might; and an ‘Orchestrator’, deferring the question to the right agent. Sally was the right agent because she had ‘registered her skill’ with you: UK football knowledge. You asked her the question using a common ‘communications contract’ between you (the English language), and took her response via the same contracted language. Then, finally, your ‘Reasoning’ decided to trust her response based on your ‘confidence’ in it, rather than simply answering “we don’t know” or making up a different answer yourself.
Then comes a question on the origins of the UK electronic band ‘808 State’. You ask Mo because it’s a band. He offers a story about a U.S. police penal code for ‘disturbing the peace’. But you also hear ‘electronic’, so you check with Tabby, who recalls the Roland TR-808 drum machine. Two plausible answers. You weigh confidence, context and instinct, and choose one, even if your next sip of beer is slightly tentative.
That’s how agent systems work. Parse, reason, orchestrate, respond. Now imagine replacing each step with a large language model:
- Model A interprets the question
- Model B decides who should answer
- Model C writes the request
- Model D generates the answer
- Model E evaluates the response
Even if each model is 90% accurate, the combined reliability falls sharply. Five LLM-driven steps drop the final accuracy to around 59%. Extend to ten agents and statistically, two out of three final answers become wrong – yet still delivered with total confidence.
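The arithmetic behind those figures is simple compound probability. A minimal sketch, assuming each step in the chain succeeds independently with the same per-step accuracy (the function name is illustrative):

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that every step in an independent chain succeeds."""
    return per_step ** steps

# Five LLM-driven steps at 90% accuracy each:
print(f"{chain_accuracy(0.9, 5):.0%}")   # 0.9^5  -> 59%
# Ten steps: roughly one in three answers survives the chain intact.
print(f"{chain_accuracy(0.9, 10):.0%}")  # 0.9^10 -> 35%
```

The lesson is that per-step accuracy is not the number that matters; chain length multiplies every small error into a large one.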
It becomes the grown-up equivalent of the childhood whisper game (or telephone game in the U.S.): each hop adds distortion, and because LLMs don’t expose a calibrated confidence score, every answer sounds polished and authoritative regardless of whether it is true.
This behaviour is already showing up in enterprise data. A recent survey of more than a thousand organisations found that the proportion abandoning most of their AI initiatives before production jumped from 17% to 42% in a single year. The gap between proof-of-concept performance and real-world reliability is widening, often because organisations underestimate the compounding-error risk inside agent chains.
Cue: Predictive AI. Predictive AI models work differently. They return confidence scores. You know when they’re unsure. And because of this, you can tune thresholds and manage risk meaningfully. That difference becomes essential when you’re building systems that need to justify their behaviour.
Designing agentic AI that doesn’t collapse under real-world conditions
Agentic AI is arriving whether organisations plan for it or not. The tools people already use are moving in that direction. The real question is whether the architecture is strong enough to withstand the inevitable scrutiny.
Three principles determine the answer:
Design communication deliberately
If two components are under your control, don’t default to natural language. Clear APIs reduce ambiguity and allow for the audit trails that regulated sectors require. Generative models should sit where they add genuine value, not where they create unnecessary interpretative layers.
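What a deliberate, explicit contract looks like in practice can be sketched as follows. The field names and task verbs here are illustrative assumptions, not a real framework’s schema; the point is that every field is named, typed and traceable, with nothing left to interpretation.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class AgentRequest:
    task: str        # an explicit verb, e.g. "lookup_customer", not free prose
    payload: dict    # structured inputs, validated at the boundary
    request_id: str  # supports the audit trail regulated sectors require

def serialise(req: AgentRequest) -> str:
    """Explicit wire format: no natural-language interpretation step."""
    return json.dumps(asdict(req))

msg = serialise(AgentRequest("lookup_customer", {"customer_id": "C-1024"}, "req-001"))
print(msg)
```

Contrast this with passing the sentence “can you look up that customer for me?” between agents: the structured form fails loudly on a malformed field, while the sentence fails silently on a misreading.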
Ensure every component can say “I don’t know”
Predictive models already offer probability. Use that. Generative models need structured guardrails that allow them to decline gracefully rather than fabricate their way forward. This single act prevents entire cascades of error.
Build governance and traceability into the core
Modern cloud ecosystems now provide identity boundaries, content-safety checks, audit trails and monitoring tools built specifically for AI. Anything developed outside these guardrails must meet the same standard. Otherwise, every clever agent becomes a governance blind spot.
And human oversight still matters. Not heavy-handed supervision, but informed judgement: the kind that spots when behaviour deviates from expectation. In regulated environments, such as wealth management and financial services, that intuition remains a critical line of defence.
This is where the real “so what” emerges. Building AI is easy now. Building AI that can be monitored, justified and trusted is not. And it’s that gap that many early adopters haven’t yet grasped but will feel soon enough.
Responsible design will matter more than generative hype
Agentic AI will reshape plenty of workflows, but not because it thinks like us. The value lies in the architecture wrapped around it: separating predictive and generative logic, limiting compounded error, controlling uncertainty, and designing systems that remain stable long after the novelty fades.
The AI-tech firms that successfully fulfil ‘the brief’ won’t necessarily be the ones with the flashiest demos promising you the world. They will be the ones that deliver responsible AI: reliable frameworks that can admit uncertainty, maintain traceability and satisfy governance expectations from the outset.
Agentic AI isn’t inherently risky because it’s powerful. It’s risky when adopted without the discipline it demands. In the end, it’s responsible AI design, not hype, that keeps these systems safe, reliable and worthy of trust.



