The first version worked perfectly in the demo. It crashed on day three in production, taking down a customer-facing workflow that forty people relied on every morning.

That was my introduction to the real cost of getting agentic AI system design wrong. Not a graceful degradation, not a warning in the logs — a silent loop that consumed token quota until the billing alarm fired at 2 AM.

Over the following eight months, I rebuilt the same system three times. Each iteration taught me something the previous one did not. What follows is the honest account of all three — the architecture diagrams, the failure modes, the decisions I wish I had made on day one, and the four-component model that finally made it to production and stayed there.

Whether you are an engineer responsible for building the plumbing or an architect deciding what components to commission, this is the article I needed before I started.


Version 1: The Single Agent That Tried to Do Everything

V1 — The Naive Build

The first system was simple by design. One agent, powered by GPT-4, with access to every tool the task required: web search, code execution, a file system interface, a database client, an email sender, and three internal API connectors. Seven tools total.

graph TD
    U([User Request]) --> A[Single LLM Agent]
    A --> T1[Web Search]
    A --> T2[File System]
    A --> T3[Database Client]
    A --> T4[Email Sender]
    A --> T5[Code Executor]
    A --> T6[External APIs]
    T1 & T2 & T3 & T4 & T5 & T6 --> A
    A --> R([Response])

Diagram 1: Version 1 — one agent, unrestricted access to all tools, no memory, no observability

It looked clean on paper. In practice, the agent's context window was bloated with tool schemas before the first real work began. With seven tools registered, approximately 4,000 tokens of schema description landed in every prompt — before any task content, before any conversation history.

On long tasks, we hit the context ceiling mid-execution. The agent had no recovery strategy. It would either hallucinate the remaining steps or enter a reasoning loop, calling the same tool repeatedly with slightly different arguments each time, burning tokens until the run timed out.
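In hindsight, even a crude guard would have converted that silent loop into a loud, cheap failure. Here is an illustrative sketch of the kind of per-run budget we later added; all names are hypothetical, not from any framework:

```python
class RunBudget:
    """Abort a run when it exhausts its token budget or repeats itself.

    Illustrative sketch: caps total tokens per run and flags the failure
    mode described above, the same tool called again and again with
    near-identical arguments.
    """

    def __init__(self, max_tokens=50_000, max_repeat_calls=3):
        self.max_tokens = max_tokens
        self.max_repeat_calls = max_repeat_calls
        self.tokens_used = 0
        self.call_counts = {}

    def charge(self, tokens):
        # Called after every LLM request with the tokens it consumed
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise RuntimeError("Token budget exhausted, aborting run")

    def record_tool_call(self, tool_name, args):
        # Identical (tool, arguments) pairs beyond the cap indicate a loop
        key = (tool_name, repr(sorted(args.items())))
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        if self.call_counts[key] > self.max_repeat_calls:
            raise RuntimeError(
                f"Loop detected: {tool_name} called "
                f"{self.call_counts[key]} times with the same arguments"
            )
```

A guard like this does not fix the agent's reasoning, but it turns an open-ended billing incident into a bounded, observable error.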

The second problem was state. Every run began from zero. If a task was interrupted and restarted, the agent had no memory of what it had already completed. Our workaround was to write longer and longer system prompts that summarised prior context — which made the context problem significantly worse.

The third problem was observability. When something failed, the only artifact was a wall of unstructured LLM output. We had no trace of which tool was called, in what order, with what arguments, or what it returned. Debugging a production failure meant reconstructing the agent's decision path from log fragments. It was archaeology, not engineering.

Lesson from V1: An agent is not a function. Treating it like one — feed it inputs, expect outputs — ignores the state, memory, and observability infrastructure it requires to survive in production. These are not optional add-ons. They are the system.

Version 2: The Multi-Agent Mess

V2 — The Over-Engineered Rebuild

I overcorrected hard. Inspired by microservices thinking, I decomposed the monolithic agent into six specialised agents: a Research Agent, a Synthesis Agent, a Code Agent, a Writer Agent, a Review Agent, and a Formatter Agent. A central Planner Agent coordinated them all.

graph TD
    U([User]) --> PL[Planner Agent]
    PL --> RA[Research Agent]
    PL --> CA[Code Agent]
    PL --> WA[Writer Agent]
    RA --> SY[Synthesis Agent]
    CA --> SY
    WA --> FM[Formatter Agent]
    SY --> RV[Review Agent]
    FM --> RV
    RV -->|Rejection Loop| PL
    PL --> R([Response])

Diagram 2: Version 2 — seven agents with a rejection loop, no circuit breaker, 45-second latency

On paper, this was elegant. In production, it was the most expensive debugging experience of my career.

The first problem was latency. What previously took eight seconds now took forty-five. Every agent handoff involved a fresh LLM call, a new context assembly, and an additional round of network overhead. We were paying the API tax seven times for work that needed it twice.

The second problem was cost. Each agent required its own full context, which meant task-relevant information was duplicated across every prompt in the chain. Our per-task token spend tripled within the first week of the rollout.

The third problem was coordination failure. When the Review Agent rejected an output, it passed the rejection signal back to the Planner, which re-queued the original task, which triggered the same six-agent sequence again. Without a circuit breaker, a single malformed task could loop indefinitely. It did. On a Friday evening.
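The fix is embarrassingly small. A circuit breaker of the kind V2 lacked can be a dozen lines; this sketch uses illustrative names, not code from the actual system:

```python
class RejectionCircuitBreaker:
    """Stop re-queuing a task after it has been rejected too many times."""

    def __init__(self, max_rejections=3):
        self.max_rejections = max_rejections
        self.rejections = {}

    def on_rejection(self, task_id):
        """Return True if the task may be retried, False if it must escalate."""
        count = self.rejections.get(task_id, 0) + 1
        self.rejections[task_id] = count
        return count < self.max_rejections
```

The planner consults the breaker before re-queuing: a True result allows one more pass through the agent chain, a False result routes the task to a human instead of looping through six agents forever.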

The fourth problem was debugging. A production failure now required tracing through seven separate agent logs to reconstruct what happened. The system was technically observable but practically incomprehensible at 11 PM with a customer on the phone.

Lesson from V2: More agents is not better architecture. Every additional agent adds coordination overhead, cost, latency, and a new failure surface. The right number of agents is the minimum required for each one to stay focused and context-bounded — usually far fewer than you think on day one.

Version 3: The Architecture That Ships

V3 — Production-Ready

The third design started with a different question. Instead of asking how to split the system up, I asked what the system actually needed in order to be reliable. The answer was four things: a smart orchestrator, a scoped tool registry, a deliberate memory layer, and an evaluation gate before any output reached the user.

graph TD
    U([User Request]) --> GW[API Gateway]
    GW --> OC[Orchestrator]
    OC --> PL[Task Planner]
    PL --> Q[Task Queue]
    Q --> W1[Research Worker]
    Q --> W2[Code Worker]
    Q --> W3[Writer Worker]
    W1 --> ML[Memory Layer]
    W2 --> ML
    W3 --> ML
    ML --> OC
    OC --> GD[Guardrails + Evaluator]
    GD --> HI{Human Approval?}
    HI -->|No| RS([Final Response])
    HI -->|Yes| HR[Human Review Queue]
    HR --> AP{Approved?}
    AP -->|Yes| RS
    AP -->|No| OC
    OC --> AL[(Audit Log)]

Diagram 3: Version 3 — production architecture with orchestrator, scoped workers, memory layer, guardrails, and confidence-based human routing

Version 3 runs with one Orchestrator and a maximum of three Worker Agents, each scoped to a distinct capability domain. The Orchestrator handles task decomposition, worker assignment, state management, retry logic, and the decision about whether a completed output requires human approval before delivery. Worker Agents are stateless — all state lives in the Memory Layer, which the Orchestrator manages centrally.

This design is not glamorous. It does not look impressive in a system diagram. It has run without a critical production failure for six months. That is the metric that matters.


The Four Components Every Production Agentic System Needs

The difference between V2 and V3 was not the number of agents. It was the four infrastructure components that V1 and V2 both lacked. Here is each one in detail.

1. The Orchestrator

The orchestrator is the central nervous system of the architecture. Its responsibilities are: receive a task, decompose it into sub-tasks, assign sub-tasks to the appropriate workers, collect results, manage retries, maintain execution state, and decide when output is ready to be evaluated.

The critical design principle is that the orchestrator should be the only component that holds the full task context. Worker agents receive only what they need for their specific sub-task. This keeps individual context windows tight, makes token costs predictable, and gives you a single place to instrument, monitor, and debug the execution path.

Below is a simplified implementation of this pattern in Python:

CONFIDENCE_THRESHOLD = 0.8  # configurable per deployment

class Orchestrator:
    def __init__(self, planner, tool_registry, memory, evaluator, audit_log):
        self.planner = planner
        self.tools = tool_registry
        self.memory = memory
        self.evaluator = evaluator
        self.audit_log = audit_log

    def run(self, task: Task) -> Result:
        # Decompose task into ordered sub-tasks
        plan = self.planner.decompose(task)
        results = []

        for sub_task in plan.steps:
            worker = self.select_worker(sub_task)
            context = self.memory.load(sub_task.context_key)
            output = worker.execute(sub_task, context)
            self.memory.save(sub_task.context_key, output)
            results.append(output)

        final_output = self.synthesize(results)

        # Gate output through the evaluator before delivery
        eval_result = self.evaluator.score(final_output, task)
        if eval_result.score < CONFIDENCE_THRESHOLD:
            return self.route_to_human_review(final_output, eval_result)

        self.audit_log.write(task, final_output, eval_result)
        return final_output

Notice that the orchestrator owns the full execution loop. No worker agent calls another worker agent directly. All state transitions go through one place.

2. The Tool Registry

In V1, tools were imported directly into the agent. Any agent could call any tool. This is the agentic equivalent of giving every employee in a company unrestricted root access — technically functional, operationally reckless.

A tool registry is a centralised wrapper that sits between agents and tools. It handles three things: permission scoping (which agents are authorised to call which tools), input schema validation (arguments are verified before execution), and audit logging (every tool invocation is recorded with its arguments and output).

graph LR
    A[Agent] --> TR[Tool Registry]
    TR --> SC{Permission Check}
    SC -->|Allowed| VS[Schema Validator]
    SC -->|Denied| ER[Error + Log]
    VS -->|Valid| EX[Execute Tool]
    VS -->|Invalid| ER
    EX --> AL[(Audit Trail)]
    EX --> RES[Result to Agent]

Diagram 4: Tool Registry — permission check, schema validation, and audit trail on every tool call before execution

In practice, permission scoping looks like this: the Research Worker can call web search and the vector store retrieval tool. It cannot call the email sender or the database write tool. The Code Worker can execute sandboxed code. It cannot write to the file system in production paths. These boundaries are not just good security hygiene — they dramatically reduce the blast radius when an agent misbehaves.
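The registry itself does not need to be complicated. Below is a minimal sketch of the pattern, with a required-keys check standing in for real schema validation (a production system would use JSON Schema or Pydantic); all class and tool names are illustrative:

```python
class ToolRegistry:
    """Permission check, schema validation, and audit trail on every call."""

    def __init__(self):
        self._tools = {}        # tool name -> (callable, required argument keys)
        self._permissions = {}  # agent name -> set of allowed tool names
        self.audit_trail = []   # every invocation attempt, allowed or not

    def register(self, name, fn, required_args):
        self._tools[name] = (fn, set(required_args))

    def grant(self, agent_name, tool_name):
        self._permissions.setdefault(agent_name, set()).add(tool_name)

    def invoke(self, agent_name, tool_name, args):
        # 1. Permission scoping: is this agent allowed to call this tool?
        if tool_name not in self._permissions.get(agent_name, set()):
            self.audit_trail.append((agent_name, tool_name, args, "DENIED"))
            raise PermissionError(f"{agent_name} may not call {tool_name}")
        # 2. Schema validation: are the required arguments present?
        fn, required = self._tools[tool_name]
        missing = required - args.keys()
        if missing:
            self.audit_trail.append((agent_name, tool_name, args, "INVALID"))
            raise ValueError(f"Missing arguments: {missing}")
        # 3. Execute and record the successful call
        result = fn(**args)
        self.audit_trail.append((agent_name, tool_name, args, "OK"))
        return result
```

The useful property is that denied and malformed calls are logged before they raise, so the audit trail captures attempts, not just successes.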

3. The Memory Layer

The most common architecture mistake I see in production agentic systems is treating the context window as the only form of memory. The context window is short-term memory. It is ephemeral, expensive, and limited. Production systems need all four memory types working together.

The agent reads from and writes to all four layers:

  • Short-Term (Context Window): active working memory for the current turn. Everything in flight lives here. Ephemeral and bounded. Storage: in-memory buffer, up to 128K tokens.
  • Episodic (Session Summaries): compressed records of past runs, retrieved when users reference prior decisions or context. Storage: PostgreSQL or Redis.
  • Semantic (Domain Knowledge): product docs, policies, knowledge bases, accessed via retrieval-augmented generation at query time. Storage: Pinecone, Qdrant, or Weaviate.
  • Procedural (Tool Schemas & Policies): agent role definitions, tool permission rules, guardrail policies. Static at runtime but versioned. Storage: config files under version control.

Diagram 5: Four memory types a production agentic system needs — most teams only implement the first one

  • Short-term memory is the active context window. Everything in the current turn lives here. It is fast but expensive and bounded. When a task exceeds this boundary, you need the other three types to pick up the slack.
  • Episodic memory stores summaries of past sessions. When a user returns after a week and references a decision made in a previous run, the orchestrator retrieves the relevant episode and injects a summary into the current context. This is where PostgreSQL or Redis comes in.
  • Semantic memory is your vector store. It holds domain knowledge, product documentation, internal policies — anything the agent needs to reason accurately over a corpus that is too large to fit in context. Retrieval-augmented generation (RAG) is the pattern for accessing it.
  • Procedural memory is static configuration: the tool registry schemas, the agent role definitions, the guardrail policies. It does not change at runtime, but it shapes every single agent decision. Most teams define this ad hoc in system prompts. Externalising it as configuration makes it auditable, versioned, and changeable without a code deploy.
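To make the separation concrete, here is a minimal sketch of a memory-layer facade the orchestrator could own. The in-memory structures stand in for the real backends named above (Redis or PostgreSQL for episodic, a vector store for semantic); every name here is illustrative:

```python
class MemoryLayer:
    """Facade over the four memory types; the orchestrator owns one instance."""

    def __init__(self, procedural_config):
        self.short_term = []                  # current-turn messages, bounded
        self.episodic = {}                    # session_id -> compressed summary
        self.semantic = []                    # stand-in for a vector store
        self.procedural = procedural_config   # static, versioned at deploy time

    def remember_turn(self, message, max_messages=50):
        self.short_term.append(message)
        # Evict the oldest messages instead of silently overflowing the context
        self.short_term = self.short_term[-max_messages:]

    def save_episode(self, session_id, summary):
        self.episodic[session_id] = summary

    def recall_episode(self, session_id):
        # Returns an empty string when no prior session exists,
        # so the orchestrator can inject it into context unconditionally
        return self.episodic.get(session_id, "")
```

The point of the facade is not the storage code, which is trivial here, but that each memory type gets an explicit interface instead of everything being smuggled through the system prompt.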

4. Guardrails and Evaluation

In V1, agent output went directly to the user. In V3, it passes through an evaluation gate first. This is the component that prevented the most production incidents in our system — and the one most teams skip because it feels like overhead.

graph LR
    AO[Agent Output] --> JD[LLM Evaluator]
    JD --> S1[Faithfulness Score]
    JD --> S2[Relevance Score]
    JD --> S3[Task Completion]
    JD --> S4[Safety Check]
    S1 & S2 & S3 & S4 --> AG[Aggregate Score]
    AG --> TH{Pass Threshold?}
    TH -->|Yes| DL[Deliver to User]
    TH -->|No| RT[Retry with Feedback]
    RT --> AO

Diagram 6: Evaluation loop — four-dimension scoring before delivery, failed outputs retried with evaluator feedback injected into prompt

The evaluator is itself an LLM call — a separate model instance that acts as a judge and scores the agent output across four dimensions: faithfulness to the source material, relevance to the original task, task completion, and safety. If the aggregate score falls below a configurable threshold, the output is not delivered. Instead, the evaluator's feedback is injected into a retry prompt and the agent tries again.

The key implementation detail is that the evaluator must be a different model instance from the one that generated the output, and it must be given the original task specification alongside the output to score. An evaluator judging its own work in a shared context is not an evaluator — it is confirmation bias at scale.
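In code, the gate reduces to a small loop. This is a hedged sketch, not our production implementation: `judge_llm` stands in for a call to the separate judge model, and `regenerate` for a retry call that folds the evaluator's feedback into the prompt.

```python
# Dimension names follow the article's four-dimension evaluation gate
DIMENSIONS = ("faithfulness", "relevance", "task_completion", "safety")

def evaluate(output, task, judge_llm):
    """Score output per dimension and aggregate. judge_llm must be a
    separate model instance and must see the original task spec."""
    scores = {dim: judge_llm(dim, task, output) for dim in DIMENSIONS}
    aggregate = sum(scores.values()) / len(scores)
    return aggregate, scores

def gate(output, task, judge_llm, threshold=0.8, max_retries=2, regenerate=None):
    """Deliver only outputs that pass; retry with feedback, else give up."""
    aggregate = 0.0
    for attempt in range(max_retries + 1):
        aggregate, scores = evaluate(output, task, judge_llm)
        if aggregate >= threshold:
            return output, aggregate          # deliverable
        if regenerate is None or attempt == max_retries:
            break
        # Retry with the evaluator's per-dimension feedback injected
        output = regenerate(task, output, scores)
    return None, aggregate                    # undeliverable: human review
```

Returning None rather than the best failing attempt is deliberate: a below-threshold output should reach the human review queue, never the user.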


Choosing Your Orchestration Pattern

Not every agentic system needs the same coordination model. The three patterns you will encounter in the wild each have distinct cost, latency, and reliability characteristics. Here is the honest comparison.

  • Sequential Pipeline (Agent A → Agent B → Agent C): best for linear, step-dependent workflows where each output feeds directly into the next. One failure stalls the entire pipeline.
  • Supervisor-Worker (recommended): a central orchestrator manages state, retries, and routing. Workers are stateless and parallelisable.
  • Peer-to-Peer Mesh (agents communicate directly): good for adversarial review and debate-style reasoning. High cost, high latency, hard to debug in production.
Diagram 7: Three orchestration patterns with their trade-offs — start with Supervisor-Worker for any enterprise system

Pattern | Best For | Avoid When | Latency | Cost
Sequential Pipeline | Linear workflows where each step depends on the previous output | Tasks that can run in parallel, or where one step failing stalls everything | Medium | Low
Supervisor-Worker (recommended) | Most enterprise use cases; tasks that can be parallelised with a central state manager | Task graphs so dynamic that the orchestrator itself becomes a bottleneck | Low to Medium | Medium
Peer-to-Peer Mesh | Adversarial review workflows, multi-perspective reasoning, debate-style synthesis | Any context requiring predictable latency, controlled cost, or clear audit trails | High | High

The supervisor-worker pattern is my default recommendation for enterprise systems. The orchestrator as a single coordination point gives you one place to add retries, circuit breakers, cost controls, and observability hooks. When something goes wrong — and it will — you know exactly where to look.

The sequential pipeline is appropriate for simple, well-defined workflows where the steps are known in advance and each one truly depends on the last. Report generation pipelines and structured data extraction workflows are good examples. The peer-to-peer mesh is powerful for specific use cases like document review or adversarial red-teaming, but it is expensive and hard to control. Do not start there.


Human-in-the-Loop Is Not a Nice-to-Have

Every agentic system I have audited in the past year has the same gap: human review is either absent entirely or implemented as a binary flag that gets turned off the moment someone complains that the system is too slow.

The correct approach is not binary. It is confidence-based routing.

graph TD
    T[Task Ready] --> CS{Confidence Score}
    CS -->|Above 0.85| AE[Auto Execute]
    CS -->|0.60 to 0.85| HQ[Human Review Queue]
    CS -->|Below 0.60| RJ[Reject with Explanation]
    HQ --> HD{Human Decision}
    HD -->|Approve| AE
    HD -->|Edit| RV[Revise and Re-score]
    HD -->|Reject| RJ
    AE --> AL[(Audit Log)]
    RJ --> AL
    RV --> CS

Diagram 8: Confidence-based HITL routing — above 0.85 auto-executes, 0.60 to 0.85 queues for human review, below 0.60 rejects with explanation

The confidence score that drives this routing comes from the evaluator. A high-confidence output — well above threshold, all four evaluation dimensions passing cleanly — is auto-executed. A medium-confidence output queues for a human reviewer who can approve, edit, or reject it. A low-confidence output is rejected immediately, with an explanation returned to the user rather than a hallucinated answer delivered with false authority.

Two rules matter here. First, any action that is irreversible — sending an email, writing to a production database, triggering a payment — should be held to a higher auto-execute threshold than a reversible one, or should skip auto-execution entirely. The cost of a false positive on an irreversible action is categorically different from the cost of a false positive on a text summary. Second, every decision in this routing graph, whether automated or human, must be written to the audit log. This is not optional in enterprise deployments. Your compliance team will ask for it eventually. Build it from the start.
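The routing rule itself is small enough to show in full. This sketch follows the thresholds in the diagram (0.85 and 0.60); the irreversible-action override is one way to implement the first rule, not the only one:

```python
def route(confidence, irreversible=False,
          auto_threshold=0.85, review_threshold=0.60):
    """Confidence-based HITL routing with an irreversible-action override."""
    # Irreversible actions never auto-execute, regardless of confidence
    if irreversible and confidence >= review_threshold:
        return "human_review"
    if confidence >= auto_threshold:
        return "auto_execute"
    if confidence >= review_threshold:
        return "human_review"
    return "reject_with_explanation"
```

Whatever the function returns, the decision and the confidence that drove it should be written to the audit log at the call site.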


What I Would Do Differently From Day One

If I were starting this system today, here is the sequence I would follow:

  1. Start with a single agent and a tool registry from day one. Do not wait until you have a production incident to add permission scoping and audit logging. The tool registry costs one afternoon to build and saves weeks of debugging later.
  2. Design the memory layer before writing a single agent. Decide upfront which database holds episodic memory, which vector store handles semantic retrieval, and how procedural configuration is versioned. These decisions are architectural — changing them later requires rebuilding the agents that depend on them.
  3. Add the evaluator before going to production, not after. An agent without an evaluation gate is not a production system. It is a demo that has not failed yet.
  4. Instrument everything from the first deploy. Log every tool call, every agent decision, every memory read and write. The cost of generating these logs is trivial. The cost of reconstructing them after a failure is not.
  5. Add agents only when a single agent demonstrably cannot do the job. The threshold is not "this would be cleaner with two agents". The threshold is "this agent is consistently hitting context limits or producing quality degradation that a second specialised agent would solve". Set that bar high.

Frequently Asked Questions

What is the difference between an AI agent and a regular LLM API call?

A regular LLM API call is stateless. You send a prompt and receive a response. An AI agent is a system that plans multi-step tasks, calls external tools, retains memory across steps, and makes decisions about what to do next based on intermediate results. The defining characteristic is agency: the system determines its own execution path rather than following a fixed prompt-response cycle.

When should I use a multi-agent system instead of a single agent?

Use a single agent when tasks are sequential, context fits comfortably in one window, and failures are straightforward to debug. Move to multi-agent when tasks are genuinely parallelisable, when distinct specialisation produces measurably better output quality, or when a single agent consistently degrades over long runs. The mistake most teams make is splitting too early, before they have evidence that a single agent cannot handle the load.

What is the best orchestration pattern for enterprise agentic AI?

The supervisor-worker pattern is the most reliable choice for enterprise systems. A single orchestrator decomposes tasks and manages state while specialised worker agents execute sub-tasks in parallel. This gives you observability and retry logic concentrated in one component, clear accountability when something fails, and a single surface to attach cost controls and security policies.

How do I stop an agentic system from going off the rails in production?

Four controls work in combination. A tool registry that scopes which agents can call which tools limits the blast radius of misbehaviour. Confidence-based human-in-the-loop routing intercepts low-confidence and irreversible actions before they execute. An LLM-as-judge evaluation layer scores outputs before delivery. Hard token budget caps per task run prevent runaway cost from a looping agent. Any one of these in isolation is insufficient. Together, they create defence in depth.

What memory architecture should a production agentic system use?

Production systems need all four memory types: short-term (the active context window), episodic (past session summaries in PostgreSQL or Redis), semantic (a vector store for retrieval-augmented generation), and procedural (tool schemas and policy rules as static configuration). Most teams only implement short-term memory and wonder why their agents have no continuity across sessions. The other three types are what separates a system that scales from one that requires constant human intervention.


Final Thought

The hardest part of agentic AI system design is not the LLM. The LLM is the easiest component in the stack. The hard parts are the infrastructure decisions you make in the first week — how memory is structured, where state lives, what the tool boundaries are, and when to involve a human — because those decisions shape every single component you build on top of them.

Version 3 is not perfect. No production system is. But it has a memory layer that keeps context from exploding, a tool registry that contains failures before they propagate, an evaluation gate that catches bad outputs before they reach users, and an audit log that makes every incident a tractable debugging exercise rather than a guessing game.

That is the bar worth building toward. Start there, not with the number of agents.

If you are designing an agentic AI system for your organisation and want a second opinion on the architecture, the consulting page has details on how we can work together.