The first version worked perfectly in the demo. It crashed on day three in production, taking down a customer-facing workflow that forty people relied on every morning.
That was my introduction to the real cost of getting agentic AI system design wrong. Not a graceful degradation, not a warning in the logs — a silent loop that consumed token quota until the billing alarm fired at 2 AM.
Over the following eight months, I rebuilt the same system three times. Each iteration taught me something the previous one did not. What follows is the honest account of all three — the architecture diagrams, the failure modes, the decisions I wish I had made on day one, and the four-component model that finally made it to production and stayed there.
Whether you are an engineer responsible for building the plumbing or an architect deciding what components to commission, this is the article I needed before I started.
Version 1: The Single Agent That Tried to Do Everything
V1 — The Naive Build

The first system was simple by design. One agent, powered by GPT-4, with access to every tool the task required: web search, code execution, a file system interface, a database client, an email sender, and three internal API connectors. Seven tools total.
Diagram 1: Version 1 — one agent, unrestricted access to all tools, no memory, no observability
It looked clean on paper. In practice, the agent's context window was bloated with tool schemas before the first real work began. With seven tools registered, approximately 4,000 tokens of schema description landed in every prompt — before any task content, before any conversation history.
On long tasks, we hit the context ceiling mid-execution. The agent had no recovery strategy. It would either hallucinate the remaining steps or enter a reasoning loop, calling the same tool repeatedly with slightly different arguments each time, burning tokens until the run timed out.
The second problem was state. Every run began from zero. If a task was interrupted and restarted, the agent had no memory of what it had already completed. Our workaround was to write longer and longer system prompts that summarised prior context — which made the context problem significantly worse.
The third problem was observability. When something failed, the only artifact was a wall of unstructured LLM output. We had no trace of which tool was called, in what order, with what arguments, or what it returned. Debugging a production failure meant reconstructing the agent's decision path from log fragments. It was archaeology, not engineering.
Version 2: The Multi-Agent Mess
V2 — The Over-Engineered Rebuild

I overcorrected hard. Inspired by microservices thinking, I decomposed the monolithic agent into six specialised agents: a Research Agent, a Synthesis Agent, a Code Agent, a Writer Agent, a Review Agent, and a Formatter Agent. A central Planner Agent coordinated them all.
Diagram 2: Version 2 — seven agents with a rejection loop, no circuit breaker, 45-second latency
On paper, this was elegant. In production, it was the most expensive debugging experience of my career.
The first problem was latency. What previously took eight seconds now took forty-five. Every agent handoff involved a fresh LLM call, a new context assembly, and an additional round of network overhead. We were paying the API tax seven times for work that needed it twice.
The second problem was cost. Each agent required its own full context, which meant task-relevant information was duplicated across every prompt in the chain. Our per-task token spend tripled within the first week of the rollout.
The third problem was coordination failure. When the Review Agent rejected an output, it passed the rejection signal back to the Planner, which re-queued the original task, which triggered the same six-agent sequence again. Without a circuit breaker, a single malformed task could loop indefinitely. It did. On a Friday evening.
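In hindsight, even a minimal retry cap at the Planner would have broken that loop. A sketch of the guard V2 lacked, with illustrative names (`CircuitBreaker`, `allow_retry` are not from the real system):

```python
from collections import Counter

class CircuitBreaker:
    """Refuses to re-queue a task once it has been rejected too many times."""

    def __init__(self, max_retries: int = 3):
        self.max_retries = max_retries
        self.rejections = Counter()  # task_id -> rejection count

    def allow_retry(self, task_id: str) -> bool:
        # Count this rejection, then decide whether another attempt is allowed.
        self.rejections[task_id] += 1
        return self.rejections[task_id] <= self.max_retries
```

With this in place, the Planner checks `allow_retry(task.id)` before re-queuing a rejected task and routes to a dead-letter queue or a human when it returns `False`.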
The fourth problem was debugging. A production failure now required tracing through seven separate agent logs to reconstruct what happened. The system was technically observable but practically incomprehensible at 11 PM with a customer on the phone.
Version 3: The Architecture That Ships
V3 — Production-Ready

The third design started with a different question. Instead of asking how to split the system up, I asked what the system actually needed in order to be reliable. The answer was four things: a smart orchestrator, a scoped tool registry, a deliberate memory layer, and an evaluation gate before any output reached the user.
Diagram 3: Version 3 — production architecture with orchestrator, scoped workers, memory layer, guardrails, and confidence-based human routing
Version 3 runs with one Orchestrator and a maximum of three Worker Agents, each scoped to a distinct capability domain. The Orchestrator handles task decomposition, worker assignment, state management, retry logic, and the decision about whether a completed output requires human approval before delivery. Worker Agents are stateless — all state lives in the Memory Layer, which the Orchestrator manages centrally.
This design is not glamorous. It does not look impressive in a system diagram. It has run without a critical production failure for six months. That is the metric that matters.
The Four Components Every Production Agentic System Needs
The difference between V2 and V3 was not the number of agents. It was the four infrastructure components that V1 and V2 both lacked. Here is each one in detail.
1. The Orchestrator
The orchestrator is the central nervous system of the entire system. Its responsibilities are: receive a task, decompose it into sub-tasks, assign sub-tasks to the appropriate workers, collect results, manage retries, maintain execution state, and decide when output is ready to be evaluated.
The critical design principle is that the orchestrator should be the only component that holds the full task context. Worker agents receive only what they need for their specific sub-task. This keeps individual context windows tight, makes token costs predictable, and gives you a single place to instrument, monitor, and debug the execution path.
Below is a simplified implementation of this pattern in Python:
```python
CONFIDENCE_THRESHOLD = 0.85

class Orchestrator:
    # Task, Result, and the injected collaborators are defined elsewhere;
    # all dependencies are passed in so there is one place to swap or mock them.
    def __init__(self, planner, tool_registry, memory, evaluator, audit_log):
        self.planner = planner
        self.tools = tool_registry
        self.memory = memory
        self.evaluator = evaluator
        self.audit_log = audit_log

    def run(self, task: Task) -> Result:
        # Decompose task into ordered sub-tasks
        plan = self.planner.decompose(task)
        results = []
        for sub_task in plan.steps:
            worker = self.select_worker(sub_task)
            context = self.memory.load(sub_task.context_key)
            output = worker.execute(sub_task, context)
            self.memory.save(sub_task.context_key, output)
            results.append(output)
        final_output = self.synthesize(results)
        # Gate output through the evaluator before delivery
        eval_result = self.evaluator.score(final_output, task)
        if eval_result.score < CONFIDENCE_THRESHOLD:
            return self.route_to_human_review(final_output, eval_result)
        self.audit_log.write(task, final_output, eval_result)
        return final_output
```
Notice that the orchestrator owns the full execution loop. No worker agent calls another worker agent directly. All state transitions go through one place.
2. The Tool Registry
In V1, tools were imported directly into the agent. Any agent could call any tool. This is the agentic equivalent of giving every employee in a company unrestricted root access — technically functional, operationally reckless.
A tool registry is a centralised wrapper that sits between agents and tools. It handles three things: permission scoping (which agents are authorised to call which tools), input schema validation (arguments are verified before execution), and audit logging (every tool invocation is recorded with its arguments and output).
Diagram 4: Tool Registry — permission check, schema validation, and audit trail on every tool call before execution
In practice, permission scoping looks like this: the Research Worker can call web search and the vector store retrieval tool. It cannot call the email sender or the database write tool. The Code Worker can execute sandboxed code. It cannot write to the file system in production paths. These boundaries are not just good security hygiene — they dramatically reduce the blast radius when an agent misbehaves.
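A minimal sketch of that registry makes the three responsibilities concrete. Everything here is illustrative (the class shape, the `grant`/`call` names, the in-memory audit log) rather than the real implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolRegistry:
    tools: dict = field(default_factory=dict)        # name -> (callable, required arg names)
    permissions: dict = field(default_factory=dict)  # agent name -> set of allowed tools
    audit_log: list = field(default_factory=list)    # one entry per invocation

    def register(self, name: str, fn: Callable, required_args: set):
        self.tools[name] = (fn, required_args)

    def grant(self, agent: str, tool_name: str):
        self.permissions.setdefault(agent, set()).add(tool_name)

    def call(self, agent: str, tool_name: str, **kwargs) -> Any:
        # 1. Permission scoping: is this agent allowed to touch this tool?
        if tool_name not in self.permissions.get(agent, set()):
            raise PermissionError(f"{agent} is not authorised to call {tool_name}")
        # 2. Schema validation: are the required arguments present?
        fn, required = self.tools[tool_name]
        missing = required - kwargs.keys()
        if missing:
            raise ValueError(f"missing arguments for {tool_name}: {missing}")
        # 3. Audit logging: record every invocation with args and output.
        result = fn(**kwargs)
        self.audit_log.append({"agent": agent, "tool": tool_name,
                               "args": kwargs, "result": result})
        return result
```

Agents never import tools directly; they hold a reference to the registry and go through `call`, which is what makes the permission boundary enforceable rather than advisory.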
3. The Memory Layer
The most common architecture mistake I see in production agentic systems is treating the context window as the only form of memory. The context window is short-term memory. It is ephemeral, expensive, and limited. Production systems need all four memory types working together.
Diagram 5: Four memory types a production agentic system needs — most teams only implement the first one
- Short-term memory is the active context window. Everything in the current turn lives here. It is fast but expensive and bounded. When a task exceeds this boundary, you need the other three types to pick up the slack.
- Episodic memory stores summaries of past sessions. When a user returns after a week and references a decision made in a previous run, the orchestrator retrieves the relevant episode and injects a summary into the current context. This is where PostgreSQL or Redis comes in.
- Semantic memory is your vector store. It holds domain knowledge, product documentation, internal policies — anything the agent needs to reason accurately over a corpus that is too large to fit in context. Retrieval-augmented generation (RAG) is the pattern for accessing it.
- Procedural memory is static configuration: the tool registry schemas, the agent role definitions, the guardrail policies. It does not change at runtime, but it shapes every single agent decision. Most teams define this ad hoc in system prompts. Externalising it as configuration makes it auditable, versioned, and changeable without a code deploy.
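The four types can be sketched behind one facade. In this toy version, plain Python structures stand in for the real backends (PostgreSQL/Redis for episodic, a vector store for semantic), and the method names are assumptions of the sketch:

```python
class MemoryLayer:
    def __init__(self, procedural_config: dict):
        self.short_term = []                  # active context for the current run
        self.episodic = {}                    # session_id -> summary (DB in production)
        self.semantic = []                    # documents (vector store in production)
        self.procedural = procedural_config   # static, versioned configuration

    def remember_turn(self, message: str):
        # Short-term: accumulates within a run, discarded when the run ends.
        self.short_term.append(message)

    def save_episode(self, session_id: str, summary: str):
        # Episodic: a compact summary survives across sessions.
        self.episodic[session_id] = summary

    def recall_episode(self, session_id: str):
        # The orchestrator injects this summary into a returning user's context.
        return self.episodic.get(session_id)
```

The point of the facade is that worker agents never decide where state lives; the orchestrator reads and writes through one interface, which is what keeps the workers stateless.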
4. Guardrails and Evaluation
In V1, agent output went directly to the user. In V3, it passes through an evaluation gate first. This is the component that prevented the most production incidents in our system — and the one most teams skip because it feels like overhead.
Diagram 6: Evaluation loop — four-dimension scoring before delivery, failed outputs retried with evaluator feedback injected into prompt
The evaluator is itself an LLM call — a separate model instance that acts as a judge and scores the agent output across four dimensions: faithfulness to the source material, relevance to the original task, task completion, and safety. If the aggregate score falls below a configurable threshold, the output is not delivered. Instead, the evaluator's feedback is injected into a retry prompt and the agent tries again.
The key implementation detail is that the evaluator must be a different model instance from the one that generated the output, and it must be given the original task specification alongside the output to score. An evaluator judging its own work in a shared context is not an evaluator — it is confirmation bias at scale.
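The control flow of the gate can be sketched independently of any model. Here `judge` stands in for the separate judge-model call, stubbed so the scoring and feedback plumbing is testable; the dimension names come from the article, everything else is illustrative:

```python
from dataclasses import dataclass

DIMENSIONS = ("faithfulness", "relevance", "completion", "safety")

@dataclass
class EvalResult:
    scores: dict    # dimension -> score in [0, 1]
    feedback: str   # concatenated judge notes, injected into retry prompts

    @property
    def aggregate(self) -> float:
        return sum(self.scores.values()) / len(self.scores)

def evaluate(output: str, task_spec: str, judge) -> EvalResult:
    """judge(output, task_spec, dimension) -> (score, note); one call per dimension.

    Crucially, `judge` receives the original task spec alongside the output,
    and in production it is a different model instance from the generator.
    """
    scores, notes = {}, []
    for dim in DIMENSIONS:
        score, note = judge(output, task_spec, dim)
        scores[dim] = score
        if note:
            notes.append(f"{dim}: {note}")
    return EvalResult(scores, "; ".join(notes))
```

On a below-threshold aggregate, the orchestrator appends `result.feedback` to the retry prompt rather than retrying blind, so the second attempt knows what failed.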
Choosing Your Orchestration Pattern
Not every agentic system needs the same coordination model. The three patterns you will encounter in the wild each have distinct cost, latency, and reliability characteristics. Here is the honest comparison.
Diagram 7: Three orchestration patterns with their trade-offs — start with Supervisor-Worker for any enterprise system
| Pattern | Best For | Avoid When | Latency | Cost |
|---|---|---|---|---|
| Sequential Pipeline | Linear workflows where each step depends on the previous output | Tasks that can run in parallel, or where one step failing stalls everything | Medium | Low |
| Supervisor-Worker (recommended) | Most enterprise use cases; tasks that can be parallelised with a central state manager | Task graphs so dynamic that the orchestrator itself becomes a bottleneck | Low to Medium | Medium |
| Peer-to-Peer Mesh | Adversarial review workflows, multi-perspective reasoning, debate-style synthesis | Any context requiring predictable latency, controlled cost, or clear audit trails | High | High |
The supervisor-worker pattern is my default recommendation for enterprise systems. The orchestrator as a single coordination point gives you one place to add retries, circuit breakers, cost controls, and observability hooks. When something goes wrong — and it will — you know exactly where to look.
The sequential pipeline is appropriate for simple, well-defined workflows where the steps are known in advance and each one truly depends on the last. Report generation pipelines and structured data extraction workflows are good examples. The peer-to-peer mesh is powerful for specific use cases like document review or adversarial red-teaming, but it is expensive and hard to control. Do not start there.
Human-in-the-Loop Is Not a Nice-to-Have
Every agentic system I have audited in the past year has the same gap: human review is either absent entirely or implemented as a binary flag that gets turned off the moment someone complains that the system is too slow.
The correct approach is not binary. It is confidence-based routing.
Diagram 8: Confidence-based HITL routing — above 0.85 auto-executes, 0.60 to 0.85 queues for human review, below 0.60 rejects with explanation
The confidence score that drives this routing comes from the evaluator. A high-confidence output — well above threshold, all four evaluation dimensions passing cleanly — is auto-executed. A medium-confidence output queues for a human reviewer who can approve, edit, or reject it. A low-confidence output is rejected immediately, with an explanation returned to the user rather than a hallucinated answer delivered with false authority.
Two rules matter here. First, any action that is irreversible — sending an email, writing to a production database, triggering a payment — should face a stricter bar for auto-execution than a reversible one, whatever its confidence score. The cost of a false positive on an irreversible action is categorically different from the cost of a false positive on a text summary. Second, every decision in this routing graph, whether automated or human, must be written to the audit log. This is not optional in enterprise deployments. Your compliance team will ask for it eventually. Build it from the start.
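The routing itself is a few lines. The 0.85 and 0.60 thresholds come from the diagram above; the stricter bar for irreversible actions (0.95 here) is an illustrative value, not a prescription:

```python
AUTO_EXECUTE_THRESHOLD = 0.85
HUMAN_REVIEW_THRESHOLD = 0.60

def route(confidence: float, irreversible: bool = False) -> str:
    # Irreversible actions (emails, DB writes, payments) get a stricter bar.
    auto_bar = 0.95 if irreversible else AUTO_EXECUTE_THRESHOLD
    if confidence >= auto_bar:
        return "auto_execute"
    if confidence >= HUMAN_REVIEW_THRESHOLD:
        return "human_review"
    return "reject_with_explanation"
```

In production, every return value here is also written to the audit log before the decision takes effect.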
What I Would Do Differently From Day One
If I were starting this system today, here is the sequence I would follow:
- Start with a single agent and a tool registry from day one. Do not wait until you have a production incident to add permission scoping and audit logging. The tool registry costs one afternoon to build and saves weeks of debugging later.
- Design the memory layer before writing a single agent. Decide upfront which database holds episodic memory, which vector store handles semantic retrieval, and how procedural configuration is versioned. These decisions are architectural — changing them later requires rebuilding the agents that depend on them.
- Add the evaluator before going to production, not after. An agent without an evaluation gate is not a production system. It is a demo that has not failed yet.
- Instrument everything from the first deploy. Log every tool call, every agent decision, every memory read and write. The cost of generating these logs is trivial. The cost of reconstructing them after a failure is not.
- Add agents only when a single agent demonstrably cannot do the job. The threshold is not "this would be cleaner with two agents". The threshold is "this agent is consistently hitting context limits or producing quality degradation that a second specialised agent would solve". Set that bar high.
Frequently Asked Questions
What is the difference between an AI agent and a regular LLM API call?
A regular LLM API call is stateless. You send a prompt and receive a response. An AI agent is a system that plans multi-step tasks, calls external tools, retains memory across steps, and makes decisions about what to do next based on intermediate results. The defining characteristic is agency: the system determines its own execution path rather than following a fixed prompt-response cycle.
When should I use a multi-agent system instead of a single agent?
Use a single agent when tasks are sequential, context fits comfortably in one window, and failures are straightforward to debug. Move to multi-agent when tasks are genuinely parallelisable, when distinct specialisation produces measurably better output quality, or when a single agent consistently degrades over long runs. The mistake most teams make is splitting too early, before they have evidence that a single agent cannot handle the load.
What is the best orchestration pattern for enterprise agentic AI?
The supervisor-worker pattern is the most reliable choice for enterprise systems. A single orchestrator decomposes tasks and manages state while specialised worker agents execute sub-tasks in parallel. This gives you observability and retry logic concentrated in one component, clear accountability when something fails, and a single surface to attach cost controls and security policies.
How do I stop an agentic system from going off the rails in production?
Four controls work in combination. A tool registry that scopes which agents can call which tools limits the blast radius of misbehaviour. Confidence-based human-in-the-loop routing intercepts low-confidence and irreversible actions before they execute. An LLM-as-judge evaluation layer scores outputs before delivery. Hard token budget caps per task run prevent runaway cost from a looping agent. Any one of these in isolation is insufficient. Together, they create defence in depth.
What memory architecture should a production agentic system use?
Production systems need all four memory types: short-term (the active context window), episodic (past session summaries in PostgreSQL or Redis), semantic (a vector store for retrieval-augmented generation), and procedural (tool schemas and policy rules as static configuration). Most teams only implement short-term memory and wonder why their agents have no continuity across sessions. The other three types are what separates a system that scales from one that requires constant human intervention.
Final Thought
The hardest part of agentic AI system design is not the LLM. The LLM is the easiest component in the stack. The hard parts are the infrastructure decisions you make in the first week — how memory is structured, where state lives, what the tool boundaries are, and when to involve a human — because those decisions shape every single component you build on top of them.
Version 3 is not perfect. No production system is. But it has a memory layer that keeps context from exploding, a tool registry that contains failures before they propagate, an evaluation gate that catches bad outputs before they reach users, and an audit log that makes every incident a tractable debugging exercise rather than a guessing game.
That is the bar worth building toward. Start there, not with the number of agents.
If you are designing an agentic AI system for your organisation and want a second opinion on the architecture, the consulting page has details on how we can work together.