Every CTO I talk to has the same story. Their team runs a proof-of-concept on GPT-4o, it costs $200 during testing, and they greenlight production. Six weeks later the invoice is $18,000. They call me.

I've been building ML systems since 2017 — computer vision pipelines, on-device models, and now LLM-native products at enterprise scale. Over the last two years I have helped scale generative AI applications serving tens of millions of tokens per day. The cost structure of LLMs in production is genuinely different from any prior cloud workload, and most teams don't model it correctly until they've already been burned.

This article gives you the actual numbers, the full cost taxonomy, and the formulas I use when scoping a new engagement. No "contact us for pricing." Just math.

1. Why LLM Costs Always Surprise You

Traditional cloud services charge for time (compute-hours) or storage (GB-months). LLMs charge for tokens — and tokens are invisible until you've instrumented your system. Three dynamics make this dangerous:

  • Context window inflation. RAG pipelines routinely inject 2,000–8,000 tokens of retrieved context per request — most of which the model reads but the user never sees. You're paying for every one of those tokens.
  • System prompt repetition. A 500-word system prompt is roughly 670 tokens; repeated across 1 million daily requests, that is about 20 billion input tokens per month, or roughly $3,000 even at GPT-4o-mini input rates. Cache it and that line item nearly disappears.
  • Output token asymmetry. Output tokens cost 2–5× more than input tokens across every model in the tables below. An agent that "thinks out loud" before answering (chain-of-thought) can multiply your bill by 5× with no change to the user-visible response.
~68% of enterprise teams underestimate their first-year LLM API spend by more than 3× — based on my observations across client engagements in 2024–2025.
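A back-of-envelope helper makes the prefix arithmetic concrete. The 0.75 words-per-token ratio is a rough heuristic for English prose, and the $0.15/1M price is illustrative, not a quote:

```python
def prompt_tokens_from_words(words: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate for English prose (~0.75 words per token)."""
    return round(words / words_per_token)

def monthly_prefix_cost(prompt_words: int, daily_requests: int,
                        price_per_million: float) -> float:
    """USD/month to resend the same prompt prefix on every request."""
    tokens = prompt_tokens_from_words(prompt_words)
    return tokens * daily_requests * 30 / 1_000_000 * price_per_million

# A 1,000-word prompt at 100K requests/day and $0.15/1M input
print(round(monthly_prefix_cost(1_000, 100_000, 0.15)))  # roughly $600/month
```

Run it against your own prompt lengths and request volumes before you trust any budget line.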

2. API Pricing: Model-by-Model Breakdown (Q1 2026)

All prices below are per million tokens (1M tokens ≈ 750,000 words ≈ 1,500 pages of text). Prices are in USD. I've grouped them into three tiers by capability vs. cost profile.

Tier 1 — Frontier Models (Maximum capability, highest cost)

| Model | Provider | Input ($/1M) | Output ($/1M) | Context Window | Best For |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | $15.00 | $75.00 | 200K | Complex reasoning, legal/medical analysis |
| GPT-4.5 | OpenAI | $75.00 | $150.00 | 128K | Highest-stakes generation tasks |
| Gemini 1.5 Ultra | Google | $10.00 | $30.00 | 1M | Long-document analysis, multimodal |

Tier 2 — Balanced Models (Production sweet spot for most enterprises)

| Model | Provider | Input ($/1M) | Output ($/1M) | Context Window | Best For |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K | Enterprise RAG, agentic workflows, coding |
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K | General-purpose production workloads |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | 1M (≤128K) | Long-context summarisation, document QA |
| Mistral Large 2 | Mistral AI | $2.00 | $6.00 | 128K | EU data-residency requirements |

Tier 3 — Efficient Models (High throughput, cost-optimised)

| Model | Provider | Input ($/1M) | Output ($/1M) | Context Window | Best For |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | $0.80 | $4.00 | 200K | Classification, routing, lightweight extraction |
| GPT-4o-mini | OpenAI | $0.15 | $0.60 | 128K | High-volume tasks, first-pass filtering |
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | 1M | Bulk processing, streaming pipelines |
| Llama 3.3 70B (self-hosted) | Meta / AWS/GCP | ~$0.90–$1.40 (blended)* | — | 128K | Data sovereignty, fine-tuning ownership |

* Self-hosted Llama 3.3 70B on a dedicated AWS ml.g5.48xlarge instance (~$16.29/hr on-demand) running at 80% utilisation with ~22M tokens/hr throughput. Your numbers will vary with batching efficiency and instance type. Prices as of Q1 2026 — verify at provider pricing pages before budgeting.

Important: Never budget based on API list prices alone. Prompt caching (available on Anthropic and OpenAI) can reduce input token costs by 75–90% for repeated system prompts. Batch API endpoints (Anthropic Batches, OpenAI Batch API) offer 50% discounts for asynchronous workloads. These two features alone can halve your total bill.

3. Token Economics — The Number Most Teams Get Wrong

The canonical mistake is pricing per request instead of per token. Requests vary by 10–100× in token count. Here's the model I use to baseline any new system:

# Monthly token cost — baseline formula

monthly_input_tokens = (
    avg_system_prompt_tokens
    + avg_conversation_history_tokens  # grows with every turn!
    + avg_rag_context_tokens           # often 2K–8K
    + avg_user_message_tokens
) * daily_requests * 30

monthly_output_tokens = avg_response_tokens * daily_requests * 30

monthly_api_cost = (
    monthly_input_tokens / 1_000_000 * input_price_per_million
    + monthly_output_tokens / 1_000_000 * output_price_per_million
)

# Don't forget: output tokens are typically 2–5× more expensive per token

Worked example: customer support bot at 50K requests/day

| Token Component | Avg Tokens/Request | Monthly Total |
|---|---|---|
| System prompt (cached after first call) | 600 | 900M (all cacheable) |
| RAG context (3 retrieved chunks) | 1,800 | 2.7B |
| Conversation history (avg 3 turns) | 900 | 1.35B |
| User message | 120 | 180M |
| Total input tokens | 3,420 | 5.13B |
| Model response (output) | 350 | 525M |

Cost on GPT-4o (no caching): 5.13B × $2.50/M + 525M × $10/M = $12,825 + $5,250 = $18,075/month

Cost on GPT-4o (with prompt caching for system prompt): 4.23B × $2.50/M + 900M × $0.50/M + 525M × $10/M = $10,575 + $450 + $5,250 = $16,275/month

Cost on GPT-4o-mini (same workload): 5.13B × $0.15/M + 525M × $0.60/M = $769.50 + $315 = ~$1,085/month
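The three figures above can be reproduced with a short script. Prices come from the tier tables; the $0.50/1M cached rate is this example's assumed 80% caching discount:

```python
def monthly_cost(input_tokens: float, output_tokens: float,
                 in_price: float, out_price: float,
                 cached_tokens: float = 0, cached_price: float = 0.0) -> float:
    """Monthly API cost in USD; all prices are per 1M tokens."""
    uncached = input_tokens - cached_tokens
    return (uncached / 1e6 * in_price
            + cached_tokens / 1e6 * cached_price
            + output_tokens / 1e6 * out_price)

INPUT, OUTPUT, CACHEABLE = 5.13e9, 525e6, 900e6  # from the table above

print(monthly_cost(INPUT, OUTPUT, 2.50, 10.00))                   # GPT-4o, no caching
print(monthly_cost(INPUT, OUTPUT, 2.50, 10.00, CACHEABLE, 0.50))  # with prompt caching
print(monthly_cost(INPUT, OUTPUT, 0.15, 0.60))                    # GPT-4o-mini
```

Swap in your own token profile and candidate model prices; the function is deliberately model-agnostic.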

This is why model selection is the single highest-leverage cost decision. A 16× cost difference between GPT-4o and GPT-4o-mini is not unusual for the same task when quality requirements allow it.

4. Cloud Infrastructure Overhead

API costs are only part of the picture. For any production LLM system you also need:

| Infrastructure Layer | What It Is | Typical Monthly Cost (mid-scale) |
|---|---|---|
| Vector database | Pinecone / Weaviate / pgvector on RDS | $70–$500 |
| Embedding model calls | text-embedding-3-small at $0.02/1M tokens | $50–$400 |
| Orchestration compute | LangChain/LlamaIndex app on ECS/Cloud Run | $200–$800 |
| API gateway + rate limiting | AWS API Gateway or Kong | $30–$150 |
| Observability stack | LangSmith / Helicone / custom OTEL pipeline | $100–$600 |
| Caching layer | Redis for semantic/exact-match caching | $50–$300 |
| Data storage (logs + evals) | S3 + Athena for cost tracking and QA | $30–$200 |
| Security / PII scrubbing | Custom middleware or Amazon Comprehend | $100–$500 |

For a mid-scale deployment (50K–200K requests/day), infrastructure overhead typically adds $600–$3,500/month on top of API spend. At 1M+ requests/day it becomes the dominant cost driver, especially if you run dedicated GPU instances for self-hosted models.

5. Hidden LLMOps Costs Nobody Advertises

This is where agency proposals fall apart. The following costs are real and recurring, but they rarely appear in a vendor slide deck:

Prompt Engineering & Maintenance — $5,000–$25,000/quarter
Every model version change (GPT-4o → GPT-4o-2024-11-20, Claude 3.5 → 4.x) can silently break prompt behaviour. Regression testing a library of 40–200 prompts against a new model version takes 2–5 engineer-days per model update. At enterprise scale, organisations experience 4–8 major model updates per year.
Evaluation & Quality Assurance — $3,000–$15,000/quarter
LLM outputs are probabilistic. You need an eval pipeline: a golden dataset, automated scoring (often using a judge LLM — which itself costs tokens), and human review sampling. A 1,000-sample eval set run weekly on GPT-4o-as-judge costs ~$50/run — $2,600/year — before human time.
Guardrails & Safety Filtering — 10–30% token overhead
Content moderation (OpenAI Moderation API: $0/call but adds latency; custom NeMo Guardrails or Llama Guard: ~$0.20–$0.80 per 1K requests) and PII detection add both cost and latency. Many teams underestimate this entirely.
Context Window Management Engineering — 1–3 engineer-weeks/quarter
As conversation history grows, you hit context limits. Building and maintaining a sliding-window summarisation system, hierarchical memory, or vector-based memory retrieval is non-trivial ongoing engineering work — not a one-time setup.
Retry Logic & Fallback Routing — 5–15% API cost inflation
Provider outages, rate limit errors (HTTP 429), and timeout retries all burn tokens and compute. A robust production system with exponential backoff, fallback providers, and circuit breakers adds engineering complexity and 5–15% effective API overhead from duplicate requests.
Fine-tuning (if applicable) — $2,000–$50,000 one-time + $500–$3,000/month hosting
OpenAI fine-tuning: ~$8/1M training tokens for GPT-4o-mini + $0.30/1M input inference on the fine-tuned model. A fine-tuning run on 500K token dataset = ~$4 training cost. However, fine-tuned model deployment costs more than base models. Self-hosted fine-tuned models on AWS SageMaker with ml.g5.2xlarge run ~$1.52/hr, requiring dedicated instance reservation.
2.3×–4.1×: the typical ratio of total LLMOps cost to raw API spend across the enterprise deployments I've audited. Budget for the multiplier, not just the API invoice.
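The retry pattern described under "Retry Logic & Fallback Routing" is a small amount of code. A minimal sketch; `TransientError` is a stand-in for whatever rate-limit (429) or timeout exceptions your client library actually raises:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for rate-limit (HTTP 429) or timeout errors from an LLM client."""

def call_with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry `call` on TransientError with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of retries; hand off to a fallback provider
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds
```

Note that every retried request is duplicate token spend, which is exactly where the 5–15% overhead figure comes from.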

6. Three Real-World Cost Scenarios

Scenario A: Internal Knowledge Base Q&A (500 employees, ~2K requests/day)

| Line Item | Monthly Cost |
|---|---|
| API (Claude Sonnet 4.6, avg 2K input + 400 output tokens) | $900 |
| Vector DB (Pinecone starter) | $70 |
| Embeddings (monthly re-index) | $15 |
| Orchestration compute (Cloud Run) | $120 |
| Observability (Helicone) | $80 |
| Total infra + API | ~$1,185 |
| Engineering maintenance (0.15 FTE) | ~$2,250 |
| True monthly TCO | ~$3,435 (~$41K/year) |

Scenario B: Customer-Facing AI Support Agent (50K requests/day, mixed model routing)

| Line Item | Monthly Cost |
|---|---|
| GPT-4o-mini for intent classification (95% of traffic) | $650 |
| Claude Sonnet 4.6 for complex escalations (5% of traffic) | $1,350 |
| Embeddings + vector DB (Weaviate Cloud) | $340 |
| Redis semantic cache (30% hit rate) | $190 |
| Orchestration + API gateway (ECS) | $420 |
| LangSmith observability | $250 |
| PII scrubbing middleware | $180 |
| Total infra + API | ~$3,380 |
| Engineering maintenance (0.5 FTE) | ~$7,500 |
| Prompt eval & QA | ~$1,200 |
| True monthly TCO | ~$12,080 (~$145K/year) |

Scenario C: Autonomous AI Agent Pipeline (document processing, 500K docs/month)

| Line Item | Monthly Cost |
|---|---|
| Gemini 2.0 Flash for extraction (high volume) | $1,200 |
| Claude Sonnet 4.6 for validation + final output | $4,800 |
| Claude Opus 4.6 for edge-case escalations (~2%) | $1,100 |
| Self-hosted Llama 3.3 70B on spot GPU (preprocessing) | $1,800 |
| Storage + compute (AWS) | $1,400 |
| Orchestration + monitoring | $600 |
| Total infra + API | ~$10,900 |
| Engineering team (1.5 FTE: 1 AI engineer + 0.5 MLOps) | ~$22,500 |
| Eval pipeline + human review | ~$3,000 |
| True monthly TCO | ~$36,400 (~$437K/year) |

7. The TCO Formula I Use With Clients

When I scope a new enterprise AI engagement, I use this total cost of ownership framework. It consistently predicts actual spend to within 20%:

# Enterprise LLM Total Cost of Ownership (annual)

TCO = API_cost + Infra_cost + LLMOps_cost + Engineering_cost

# where:
API_cost = sum_over_models(monthly_tokens_per_model * price_per_token) * 12

Infra_cost = (vector_db + embeddings + compute + gateway + cache + observability) * 12

LLMOps_cost = (eval_pipeline + guardrails + prompt_maintenance + retry_overhead) * 12

Engineering_cost = FTE_count * avg_annual_fully_loaded_cost * llm_allocation_fraction

# Rule-of-thumb multipliers (validated across 15+ deployments):
TCO ≈ API_cost * 3.2  # teams new to LLMOps
TCO ≈ API_cost * 1.8  # mature teams with tooling in place

The multiplier drops as your team matures tooling, builds a prompt library, and shifts high-volume tasks to cheaper model tiers. Teams that skip the LLMOps investment pay for it in model quality incidents, prompt regressions, and runaway token spend.
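As a runnable sketch of the same framework. Every input is the planner's own estimate; the 0.5 FTE at an assumed $180K fully loaded annual cost mirrors Scenario B:

```python
def annual_tco(monthly_api: float, monthly_infra: float, monthly_llmops: float,
               fte_count: float, fully_loaded_annual_cost: float,
               llm_allocation_fraction: float = 1.0) -> float:
    """Annual total cost of ownership: run rate plus allocated engineering."""
    run_rate = (monthly_api + monthly_infra + monthly_llmops) * 12
    engineering = fte_count * fully_loaded_annual_cost * llm_allocation_fraction
    return run_rate + engineering

# Scenario B: ~$3,380/mo API + infra combined, ~$1,200/mo eval/QA, 0.5 FTE
print(annual_tco(3_380, 0, 1_200, 0.5, 180_000))  # ≈ $145K/year
```

If your own estimate lands far outside the 1.8×–3.2× multiplier band, one of the inputs is probably wrong.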

8. Five Levers That Cut Your Bill by 40–70%

Lever 1: Prompt Caching (saves 15–40% on input costs)

Both Anthropic (cache_control) and OpenAI support prompt caching. Cache your system prompt and any static few-shot examples. At $3.00/1M input with Anthropic, cached tokens cost $0.30/1M — a 90% reduction. Payback threshold: the cached prefix must be reused within its time-to-live (about five minutes by default on Anthropic), and cache writes carry a ~25% premium over regular input tokens, so any prefix that recurs at least once per window already pays for itself.

Lever 2: Model Routing (saves 30–60% overall)

Not every request needs GPT-4o. Build a lightweight router (a fine-tuned classifier or even keyword rules) that sends simple queries to GPT-4o-mini or Gemini Flash and only escalates complex ones to frontier models. In my experience, 60–75% of enterprise support tickets can be handled by a Tier-3 model with no quality drop.
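A minimal version of such a router. The trigger keywords and the returned model labels below are placeholders, and a fine-tuned classifier would replace the keyword rules in production:

```python
ESCALATION_KEYWORDS = ("refund", "legal", "complaint", "cancel")  # hypothetical triggers

def route_model(query: str) -> str:
    """Send simple queries to a cheap tier; escalate on trigger words or long inputs."""
    q = query.lower()
    if any(keyword in q for keyword in ESCALATION_KEYWORDS) or len(q) > 2_000:
        return "tier-2-balanced-model"   # e.g. Claude Sonnet / GPT-4o
    return "tier-3-efficient-model"      # e.g. GPT-4o-mini / Gemini Flash
```

Even this crude version captures most of the savings; measure the router's false-escalation rate before investing in a learned classifier.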

Lever 3: Semantic Caching (saves 10–25% on repeated queries)

Cache LLM responses keyed by embedding similarity. Tools like GPTCache or a Redis + pgvector setup can serve identical or near-identical queries from cache. For FAQ-heavy workflows, cache hit rates of 25–40% are achievable.
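A toy version of the idea, with a pluggable embedding function. In production, embed would be an embedding-model call and the linear scan over entries would be a vector index; the 0.92 threshold is an assumption to tune per workload:

```python
import math
from typing import Callable, List, Optional, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Serve a stored response when a new query embeds close enough to a cached one."""

    def __init__(self, embed: Callable[[str], List[float]], threshold: float = 0.92):
        self.embed = embed          # embedding fn: API call or local model
        self.threshold = threshold  # too low and you serve wrong answers
        self.entries: List[Tuple[List[float], str]] = []  # use a vector DB at scale

    def get(self, query: str) -> Optional[str]:
        vec = self.embed(query)
        scored = [(cosine(vec, v), resp) for v, resp in self.entries]
        if scored:
            score, resp = max(scored, key=lambda pair: pair[0])
            if score >= self.threshold:
                return resp
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

The key design decision is the threshold: set it by replaying real query logs and checking how often a cache hit would have returned a stale or wrong answer.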

Lever 4: Output Token Reduction (saves 20–50% on output costs)

Explicitly constrain output format. Forcing JSON-only responses, capping response length via max_tokens, and instructing the model to be concise all reduce output tokens substantially. A one-line "be brief" instruction can cut your output bill by 30–50% in structured-output pipelines.

Lever 5: Batch API for Async Workloads (saves 50% flat)

If your use case tolerates 24-hour turnaround (document processing, overnight report generation, batch classification), use the OpenAI Batch API or Anthropic Message Batches. Both offer a 50% discount over synchronous calls with identical quality. For a document-processing workload like Scenario C, that discount alone is worth several thousand dollars a month.
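The saving is simple to estimate once you know what share of your traffic tolerates asynchronous turnaround; the figures in the example are illustrative:

```python
def batch_savings(monthly_sync_cost: float, async_fraction: float,
                  discount: float = 0.50) -> float:
    """Monthly saving from routing the async-tolerant traffic share to a batch endpoint."""
    return monthly_sync_cost * async_fraction * discount

# A $10K/month synchronous bill where 75% of traffic can wait 24 hours
print(batch_savings(10_000, 0.75))  # $3,750/month saved
```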

9. Build vs. Buy vs. Fine-Tune — The Honest Matrix

| Option | Upfront Cost | Ongoing Cost | Control | Time-to-Production | When to Choose |
|---|---|---|---|---|---|
| API (off-the-shelf) | Low ($0) | High (per-token) | Low | Days–weeks | MVP, uncertain volume, general tasks |
| Fine-tuned hosted model | Medium ($2K–$20K) | Medium (reduced tokens) | Medium | 4–8 weeks | Domain-specific tasks, consistent format, 10M+ tokens/month |
| Self-hosted open model | High ($20K–$100K infra setup) | Predictable (GPU instances) | High | 8–16 weeks | Data sovereignty, regulated industries, >1B tokens/month |
| Distilled/quantised model | High (distillation pipeline) | Low (cheap inference) | Very high | 12–24 weeks | Edge deployment, offline use, extreme cost sensitivity |

The crossover point for self-hosting versus API: if your monthly API spend exceeds ~$8,000–$12,000 and your task is well-defined enough for a specialised model, self-hosting (or fine-tuning on a cheaper tier) typically breaks even within 6–9 months. Below that threshold, the engineering overhead rarely justifies it.
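That break-even logic as a formula; the setup and run-rate figures in the example are illustrative, not a quote:

```python
def self_host_breakeven_months(setup_cost: float, monthly_api_spend: float,
                               monthly_self_host_cost: float):
    """Months until self-hosting recoups its setup cost; None if it never does."""
    monthly_saving = monthly_api_spend - monthly_self_host_cost
    if monthly_saving <= 0:
        return None  # self-hosting costs more per month than the API
    return setup_cost / monthly_saving

# $60K setup, replacing $12K/mo of API spend with $4K/mo of GPU instances
print(self_host_breakeven_months(60_000, 12_000, 4_000))  # 7.5 months
```

If the function returns None, or a number beyond your planning horizon, stay on the API.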

The Bottom Line

The question "how much does it cost to deploy an LLM?" has no universal answer — but it has a rigorous one once you define your request volume, token profile, latency requirements, and quality bar. I've watched companies spend $200K/year on a system that should cost $40K, and I've watched others under-invest in observability and pay for it in silent quality degradation.

The framework above gives you enough structure to build a credible cost model before you write a line of production code. If you're a CTO or engineering leader evaluating an LLM initiative and want a second opinion on the numbers — or if you're looking for help designing a cost-efficient LLM architecture from the start — that's exactly the kind of work I do.

Transparency in AI pricing is a competitive advantage, not a risk. The teams that model costs rigorously from day one ship faster, justify budgets more easily, and avoid the invoice shock that kills otherwise promising AI programmes.