By most industry estimates, nearly nine in ten AI projects never make it to production. Generative AI has made that number worse, not better, because the gap between a compelling GPT-4 demo and a reliable, auditable, cost-controlled production system is wider than most engineering leaders expect.

This isn't a failure of ambition. It's a failure of architecture.

This playbook covers what actually changes between a generative AI pilot and a production system: the infrastructure decisions, the governance requirements, the team structures, and the three traps that kill most enterprise AI deployments before they ship.

Why Generative AI Pilots Stall Before Production

A pilot is optimised for one thing: demonstrating potential. Production is optimised for something entirely different — reliably delivering value at scale, on time, within cost, with accountability.

The mismatch is structural. Pilots typically run on:

  • A single LLM provider with no fallback
  • Hardcoded prompts with no versioning
  • No latency or cost monitoring
  • Manual evaluation ("it looked good in the demo")
  • No consideration of data residency, PII handling, or audit trails

When the business asks to scale that pilot, the engineering team discovers these aren't configuration problems — they're architectural ones. Re-engineering from scratch is often faster than retrofitting.

The Three Production Killers

1. Prompt brittleness. Prompts that work reliably in testing degrade under real user input distributions. Without a prompt versioning and regression-testing system, every model update is a production incident waiting to happen.
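A minimal sketch of what prompt versioning with regression testing can look like. All names here are illustrative, not a specific framework: prompts live in an immutable registry keyed by version, and golden-case checks run before any version is promoted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    """An immutable, versioned prompt template."""
    name: str
    version: int
    template: str

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

# Registry keyed by (name, version) so older versions remain available for rollback.
REGISTRY: dict[tuple[str, int], PromptVersion] = {}

def register(prompt: PromptVersion) -> None:
    REGISTRY[(prompt.name, prompt.version)] = prompt

def regression_check(prompt: PromptVersion, golden_cases: list[dict]) -> list[str]:
    """Return failures: golden cases whose rendered prompt drops a required phrase."""
    failures = []
    for case in golden_cases:
        rendered = prompt.render(**case["inputs"])
        for phrase in case["must_contain"]:
            if phrase not in rendered:
                failures.append(f"{prompt.name} v{prompt.version}: missing {phrase!r}")
    return failures

register(PromptVersion("summarise", 2, "Summarise the following ticket:\n{ticket}"))
cases = [{"inputs": {"ticket": "Printer on fire"},
          "must_contain": ["Summarise", "Printer on fire"]}]
failures = regression_check(REGISTRY[("summarise", 2)], cases)
```

In practice the golden cases would cover real user input distributions, and the check would gate deployment in CI rather than run ad hoc.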

2. Cost at scale. A pilot consuming 500K tokens/day costs roughly $10/day. At 50M tokens/day (the load of a modest internal tool) that's $1,000/day before optimisation. LLM cost modelling is not optional; it's a blocker.

3. Late compliance discovery. Legal and security review at the pilot stage is superficial. At the production stage, questions about data residency, model output logging, PII in prompts, and liability for hallucinated outputs arrive all at once. Late discovery is always expensive.

The Four Architectural Decisions That Define Production Readiness

Before writing a single line of production infrastructure, engineering leaders need four decisions locked in.

1. Build vs. Orchestrate vs. Fine-tune

Most enterprise use cases sit in one of three categories:

  • Prompt engineering + RAG. When to use: retrieval over your own data, controlled output. Trade-off: lowest cost, fastest to ship, fragile at the edges.
  • Orchestrated agents. When to use: multi-step workflows, tool use, decision trees. Trade-off: higher reliability, more complex eval, significant latency.
  • Fine-tuning / distillation. When to use: consistent tone, domain specialisation, cost reduction at scale. Trade-off: upfront cost, requires labelled data, needs MLOps infra.

The wrong choice is almost always fine-tuning too early. Start with RAG + prompt engineering. Fine-tune only when you have production data proving the baseline isn't good enough.

2. Model Provider Strategy

Vendor lock-in with LLM providers is a real risk. A single-provider strategy means:

  • You absorb every price change without leverage
  • A provider outage is your outage
  • Model deprecations force emergency migrations

The production architecture should abstract the model layer behind a routing interface — tools like LiteLLM allow switching between OpenAI, Anthropic, Google, and open-weight models without application code changes.
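As an illustration of the routing-interface idea, here is a hedged sketch of a provider-agnostic router with fallback. The provider callables are stubs standing in for real SDK or gateway calls (LiteLLM's unified completion API, for instance); the class and function names are hypothetical.

```python
from typing import Callable

# A provider is any callable: prompt -> completion text. In production these would
# wrap real SDK or gateway calls; stubs keep the sketch self-contained and runnable.
Provider = Callable[[str], str]

class ModelRouter:
    """Try providers in priority order; fall through to the next on failure."""

    def __init__(self, providers: list[tuple[str, Provider]]):
        self.providers = providers

    def complete(self, prompt: str) -> tuple[str, str]:
        errors = []
        for name, call in self.providers:
            try:
                return name, call(prompt)
            except Exception as exc:
                errors.append((name, repr(exc)))  # record and try the next provider
        raise RuntimeError(f"all providers failed: {errors}")

def flaky_primary(prompt: str) -> str:
    raise TimeoutError("simulated provider outage")

def stable_fallback(prompt: str) -> str:
    return f"echo: {prompt}"

router = ModelRouter([("primary", flaky_primary), ("fallback", stable_fallback)])
provider, text = router.complete("ping")
```

Because the application only ever talks to the router, swapping or reordering providers is a configuration change, not a code change.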

3. Evaluation Infrastructure

You cannot operate what you cannot measure. Production generative AI requires:

  • Automated evals: LLM-as-judge pipelines for quality regression testing on every deployment
  • Human-in-the-loop review queues: sampled human review of compliance-sensitive outputs
  • Latency and cost dashboards: Per-endpoint token consumption, cost per user session, p95 latency
  • Hallucination detection: Grounding checks against source documents for RAG outputs
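For the grounding checks in the last bullet, a naive lexical version can be sketched as follows. This is an illustrative baseline only: it scores each answer sentence by token overlap with the retrieved source chunks, whereas production systems typically use embedding similarity or an NLI model. The function names are ours, not a library API.

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding_score(answer: str, sources: list[str], threshold: float = 0.5) -> float:
    """Fraction of answer sentences whose tokens are mostly covered by some source chunk."""
    if not sources:
        return 0.0
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    source_tokens = [_tokens(chunk) for chunk in sources]
    grounded = 0
    for sentence in sentences:
        toks = _tokens(sentence)
        if not toks:
            continue
        # Best coverage of this sentence's tokens by any single source chunk.
        coverage = max(len(toks & src) / len(toks) for src in source_tokens)
        if coverage >= threshold:
            grounded += 1
    return grounded / len(sentences) if sentences else 0.0
```

A sentence the sources support counts as grounded; an invented claim does not, dragging the score down and flagging the response for review.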

4. Data Architecture for Retrieval

RAG quality is only as good as your retrieval layer. The common failure mode is treating the vector database as a simple keyword search replacement. Production RAG requires:

  • Chunking strategy tuned to your document structure (legal docs chunk differently to product docs)
  • Metadata filtering to scope retrieval by user role, recency, or document type
  • Hybrid search (dense + sparse) for recall across diverse query types
  • Index refresh pipelines with latency SLAs, not manual re-indexing
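The first two bullets can be made concrete with a small sketch: fixed-size chunking with overlap, each chunk carrying the metadata that later scopes retrieval. Real chunkers split on document structure (headings, clauses) rather than raw characters; this character-window version is deliberately simplified, and the field names are illustrative.

```python
def chunk_document(text: str, doc_meta: dict,
                   chunk_size: int = 200, overlap: int = 50) -> list[dict]:
    """Split text into overlapping character windows, attaching metadata to each
    chunk so the retrieval layer can filter by role, recency, or document type."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + chunk_size]
        if not piece:
            break
        chunks.append({"text": piece, "start": start, **doc_meta})
        if start + chunk_size >= len(text):
            break
    return chunks

def filter_chunks(chunks: list[dict], **conditions) -> list[dict]:
    """Metadata pre-filter applied before the vector search runs."""
    return [c for c in chunks if all(c.get(k) == v for k, v in conditions.items())]

chunks = chunk_document("a" * 450, {"doc_type": "legal", "role": "counsel"})
```

The metadata filter runs before similarity search, which both scopes access and shrinks the candidate set the index has to score.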

Building the Team for Production AI

A generative AI pilot can be run by two engineers and a product manager. A production system requires a different staffing model.

The capabilities you need — regardless of how you staff them:

  • LLM Engineering: Prompt design, eval pipelines, context window management, fine-tuning where appropriate
  • MLOps / LLMOps: Model deployment, cost monitoring, A/B testing infrastructure, rollback procedures
  • AI Product Management: Translating business requirements into system behaviours, owning the eval criteria, managing stakeholder expectations around model limitations
  • AI Governance: Responsible AI review, compliance sign-off, audit trail ownership

For most enterprises, the fastest path to production is bringing in an external AI solution architect to design the production architecture, then transitioning ownership to an internal team with clear documentation and runbooks.

A Production Readiness Checklist

Before declaring a generative AI system production-ready, engineering leads should be able to answer yes to all of the following:

Infrastructure

  • Model provider abstraction layer in place (no hard-coded provider calls)
  • Prompt versioning and rollback capability
  • Cost monitoring with per-request attribution and budget alerts
  • Latency SLAs defined and instrumented (p50, p95, p99)
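To make the latency bullet concrete, a minimal in-process tracker using a nearest-rank percentile. This is a sketch: a real deployment would emit histograms to a metrics backend and use a streaming sketch such as t-digest rather than storing raw samples.

```python
import math

class LatencyTracker:
    """Record per-request latencies and report the percentiles named in the SLA."""

    def __init__(self):
        self.samples: list[float] = []

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile over all recorded samples."""
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        rank = math.ceil(p / 100 * len(ordered)) - 1
        return ordered[max(rank, 0)]

tracker = LatencyTracker()
for ms in [120, 95, 300, 110, 105, 2500, 130, 115, 98, 102]:
    tracker.record(ms)
p50, p95, p99 = (tracker.percentile(p) for p in (50, 95, 99))
```

Note how a single slow LLM call dominates the tail: the p95 and p99 here are an order of magnitude above the median, which is exactly why SLAs must name the tail percentiles and not just the average.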

Evaluation

  • Automated regression test suite running on every deployment
  • Human review queue for sampled outputs in place
  • Baseline quality metrics established from pilot data

Compliance & Governance

  • PII detection and redaction on inputs and outputs
  • Output logging with configurable retention policy
  • Data residency requirements documented and met
  • Legal sign-off on AI-generated content liability

Operations

  • On-call runbook for model degradation or provider outage
  • Graceful degradation path (what happens when the LLM call fails?)
  • User feedback loop for quality signals
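The graceful-degradation bullet can be sketched as a wrapper, assuming a hypothetical llm_call callable: retry with exponential backoff, then return a flagged fallback response the UI can render honestly instead of surfacing an error.

```python
import time

def call_with_degradation(llm_call, prompt: str, retries: int = 2,
                          fallback_message: str = "The assistant is temporarily unavailable.",
                          backoff_s: float = 0.0) -> dict:
    """Retry the LLM call, then degrade gracefully instead of raising to the user."""
    for attempt in range(retries + 1):
        try:
            return {"text": llm_call(prompt), "degraded": False}
        except Exception:
            if attempt < retries and backoff_s:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between retries
    return {"text": fallback_message, "degraded": True}

def always_fails(prompt: str) -> str:
    raise TimeoutError("provider down")

result = call_with_degradation(always_fails, "hello", retries=1)
```

The degraded flag matters: downstream code and analytics need to distinguish a genuine answer from a fallback, or the quality dashboards will quietly lie.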

From 0 to Production: A Realistic Timeline

  • Architecture review (2 weeks): documented system design, provider strategy, eval criteria
  • Infrastructure build (4–6 weeks): LLM gateway, RAG pipeline, eval harness, cost dashboard
  • Closed beta (4 weeks): production data, baseline metrics, first eval report
  • Compliance review (2–3 weeks): legal sign-off, security review, audit trail validation
  • Staged rollout (2 weeks): 5% → 25% → 100% traffic with rollback gates

Run sequentially, those phases total fourteen to seventeen weeks, which is an honest timeline for a well-resourced team. Teams attempting to compress below eight weeks almost always pay the time back as production incidents.

FAQ

What is the biggest mistake CTOs make when scaling generative AI?

Treating the pilot as a foundation rather than a throwaway. Pilots are designed to answer "can this work?" — not "how do we operate this at scale?" The architecture that answers both questions is usually different, and that's fine. Budget for the rebuild.

Should we build on GPT-4 or use open-weight models in production?

It depends on your data residency requirements, cost sensitivity, and latency SLAs. GPT-4 class models win on quality; open-weight models (Llama 3, Mistral) win on cost, control, and data privacy. Most production systems use a tiered approach — a capable open-weight model for high-volume, lower-stakes tasks, and a frontier model for complex or high-stakes ones.

How do we handle hallucinations in production?

You don't eliminate them — you manage them. This means grounding all responses in retrieved source documents, surfacing citations to users, running automated grounding checks, and routing high-stakes queries to human review. Designing your UX to set appropriate user expectations is as important as the technical controls.

When does it make sense to fine-tune vs. use RAG?

RAG first, always. Fine-tuning adds significant cost and operational complexity. Move to fine-tuning only when you have clear evidence from production data that the base model's style, tone, or domain knowledge is materially limiting quality — and when you have the labelled dataset and MLOps infrastructure to support it.

How do we estimate LLM costs before going to production?

Instrument your pilot to log token counts per request. Calculate average tokens per session × expected daily active users × model cost per token. Add a 30% buffer for prompt engineering overhead. Build a cost model in a spreadsheet before you build cost monitoring in code — the spreadsheet will tell you whether the business model works at scale.
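That spreadsheet can be expressed in a few lines. All the input numbers below are placeholder assumptions to swap for your own pilot telemetry; the 30% buffer matches the rule of thumb above.

```python
def daily_llm_cost(avg_tokens_per_session: int, daily_active_users: int,
                   sessions_per_user: float, cost_per_1m_tokens: float,
                   overhead: float = 0.30) -> float:
    """Spreadsheet-style estimate: sessions x tokens x unit price, plus a buffer
    for prompt engineering overhead (system prompts, retries, retrieved context)."""
    daily_tokens = avg_tokens_per_session * daily_active_users * sessions_per_user
    base = daily_tokens / 1_000_000 * cost_per_1m_tokens
    return base * (1 + overhead)

# Placeholder inputs: 5K tokens/session, 2K DAU, 2 sessions/user,
# blended $20 per 1M tokens across the model tiers in use.
est = daily_llm_cost(5_000, 2_000, 2, 20.0)
```

Multiply the daily figure out to a year before approving the roadmap; a number that looks tolerable per day often fails the annual budget test.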

Conclusion

Scaling generative AI from pilot to production is an architectural problem, a governance problem, and a team problem — not a model problem. The models are good enough. The gap is in the systems built around them.

The CTOs who ship production AI successfully treat the pilot as evidence and the production build as a new project. They invest in evaluation infrastructure before they invest in features. They design for provider flexibility from day one. And they bring in specialist expertise for the architecture decisions that are hardest to reverse.

If you're planning a production generative AI deployment and want an independent architecture review before you commit to an approach, let's talk.