- Why LLM costs always surprise you
- API pricing: model-by-model breakdown (Q1 2026)
- Token economics — the number most teams get wrong
- Cloud infrastructure overhead
- Hidden LLMOps costs nobody advertises
- Three real-world cost scenarios
- The TCO formula I use with clients
- Five levers that cut your bill by 40–70%
- Build vs. buy vs. fine-tune — the honest matrix
Every CTO I talk to has the same story. Their team runs a proof-of-concept on GPT-4o, it costs $200 during testing, and they greenlight production. Six weeks later the invoice is $18,000. They call me.
I've been building ML systems since 2017 — computer vision pipelines, on-device models, and now LLM-native products at enterprise scale. Over the last two years I have helped scale generative AI applications serving tens of millions of tokens per day. The cost structure of LLMs in production is genuinely different from any prior cloud workload, and most teams don't model it correctly until they've already been burned.
This article gives you the actual numbers, the full cost taxonomy, and the formulas I use when scoping a new engagement. No "contact us for pricing." Just math.
1. Why LLM Costs Always Surprise You
Traditional cloud services charge for time (compute-hours) or storage (GB-months). LLMs charge for tokens — and tokens are invisible until you've instrumented your system. Three dynamics make this dangerous:
- Context window inflation. RAG pipelines routinely inject 2,000–8,000 tokens of retrieved context per request — most of which the model reads but the user never sees. You're paying for every one of those tokens.
- System prompt leakage. A 500-word system prompt (roughly 670 tokens at the word-to-token ratio below) repeated across 1 million daily requests is about 20 billion input tokens per month — roughly $3,000 at GPT-4o-mini rates, most of which disappears if you cache it. The token-counting sketch after this list shows how to measure your own prompt.
- Output token asymmetry. Output tokens cost roughly 3–5× more than input tokens on every model in the tables below. An agent that "thinks out loud" before answering (chain-of-thought) can multiply your bill by 5× with no change to the user-visible response.
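Before you can model any of this, you need to know how many tokens your prompts actually are. A minimal sketch using OpenAI's tiktoken tokenizer — other providers tokenise slightly differently, but the order of magnitude carries over:

```python
import tiktoken

SYSTEM_PROMPT = "..."  # paste your real system prompt here

# o200k_base is the encoding used by the GPT-4o family
enc = tiktoken.get_encoding("o200k_base")
prompt_tokens = len(enc.encode(SYSTEM_PROMPT))

daily_requests = 1_000_000
monthly_tokens = prompt_tokens * daily_requests * 30
print(f"{prompt_tokens} tokens/request -> {monthly_tokens / 1e9:.1f}B tokens/month")
```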
2. API Pricing: Model-by-Model Breakdown (Q1 2026)
All prices below are per million tokens (1M tokens ≈ 750,000 words ≈ 1,500 pages of text). Prices are in USD. I've grouped them into three tiers by capability vs. cost profile.
Tier 1 — Frontier Models (Maximum capability, highest cost)
| Model | Provider | Input ($/1M) | Output ($/1M) | Context Window | Best For |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | $15.00 | $75.00 | 200K | Complex reasoning, legal/medical analysis |
| GPT-4.5 | OpenAI | $75.00 | $150.00 | 128K | Highest-stakes generation tasks |
| Gemini 1.5 Ultra | Google | $10.00 | $30.00 | 1M | Long-document analysis, multimodal |
Tier 2 — Balanced Models (Production sweet spot for most enterprises)
| Model | Provider | Input ($/1M) | Output ($/1M) | Context Window | Best For |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K | Enterprise RAG, agentic workflows, coding |
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K | General-purpose production workloads |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | 1M (price for ≤128K prompts) | Long-context summarisation, document QA |
| Mistral Large 2 | Mistral AI | $2.00 | $6.00 | 128K | EU data-residency requirements |
Tier 3 — Efficient Models (High throughput, cost-optimised)
| Model | Provider | Input ($/1M) | Output ($/1M) | Context Window | Best For |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | $0.80 | $4.00 | 200K | Classification, routing, lightweight extraction |
| GPT-4o-mini | OpenAI | $0.15 | $0.60 | 128K | High-volume tasks, first-pass filtering |
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | 1M | Bulk processing, streaming pipelines |
| Llama 3.3 70B (self-hosted) | Meta (on AWS/GCP) | ~$0.90–$1.40 blended* | (included in blended rate) | 128K | Data sovereignty, fine-tuning ownership |
* Self-hosted Llama 3.3 70B on a dedicated AWS ml.g5.48xlarge instance (~$16.29/hr on-demand) running at 80% utilisation with ~22M tokens/hr throughput. Your numbers will vary with batching efficiency and instance type. Prices as of Q1 2026 — verify at provider pricing pages before budgeting.
3. Token Economics — The Number Most Teams Get Wrong
The canonical mistake is pricing per request instead of per token. Requests vary by 10–100× in token count. Here's the model I use to baseline any new system:
monthly_input_tokens = (
    avg_system_prompt_tokens
    + avg_conversation_history_tokens  # grows with turns!
    + avg_rag_context_tokens           # often 2K–8K
    + avg_user_message_tokens
) * daily_requests * 30

monthly_output_tokens = avg_response_tokens * daily_requests * 30

monthly_api_cost = (
    monthly_input_tokens / 1_000_000 * input_price_per_million
    + monthly_output_tokens / 1_000_000 * output_price_per_million
)
# Don't forget: output tokens are typically 3–5× more expensive per token
Worked example: customer support bot at 50K requests/day
| Token Component | Avg Tokens/Request | Monthly Total (tokens) |
|---|---|---|
| System prompt (cached after first call) | 600 | 900M (900M cacheable) |
| RAG context (3 retrieved chunks) | 1,800 | 2.7B |
| Conversation history (avg 3 turns) | 900 | 1.35B |
| User message | 120 | 180M |
| Total input | 3,420 | 5.13B |
| Model response (output) | 350 | 525M |
Cost on GPT-4o (no caching): 5.13B × $2.50/M + 525M × $10/M = $12,825 + $5,250 = $18,075/month
Cost on GPT-4o (with prompt caching for system prompt): 4.23B × $2.50/M + 900M × $0.50/M + 525M × $10/M = $10,575 + $450 + $5,250 = $16,275/month
Cost on GPT-4o-mini (same workload): 5.13B × $0.15/M + 525M × $0.60/M = $769.50 + $315 = ~$1,085/month
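The arithmetic is easy to script. A minimal sketch that reproduces the no-caching GPT-4o and GPT-4o-mini figures above (prices hard-coded from the section 2 tables):

```python
def monthly_cost(input_tokens_per_req, output_tokens_per_req, daily_requests,
                 input_price_per_m, output_price_per_m):
    """Monthly API cost in USD for a flat per-request token profile."""
    monthly_in = input_tokens_per_req * daily_requests * 30
    monthly_out = output_tokens_per_req * daily_requests * 30
    return monthly_in / 1e6 * input_price_per_m + monthly_out / 1e6 * output_price_per_m

# Support-bot profile: 3,420 input + 350 output tokens per request, 50K requests/day
print(monthly_cost(3_420, 350, 50_000, 2.50, 10.00))  # GPT-4o      -> 18075.0
print(monthly_cost(3_420, 350, 50_000, 0.15, 0.60))   # GPT-4o-mini -> 1084.5
```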
This is why model selection is the single highest-leverage cost decision. A 16× cost difference between GPT-4o and GPT-4o-mini is not unusual for the same task when quality requirements allow it.
4. Cloud Infrastructure Overhead
API costs are only part of the picture. For any production LLM system you also need:
| Infrastructure Layer | What It Is | Typical Monthly Cost (mid-scale) |
|---|---|---|
| Vector database | Pinecone / Weaviate / pgvector on RDS | $70–$500/month |
| Embedding model calls | text-embedding-3-small at $0.02/1M tokens | $50–$400/month |
| Orchestration compute | LangChain/LlamaIndex app on ECS/Cloud Run | $200–$800/month |
| API gateway + rate limiting | AWS API Gateway or Kong | $30–$150/month |
| Observability stack | LangSmith / Helicone / custom OTEL pipeline | $100–$600/month |
| Caching layer | Redis for semantic/exact-match caching | $50–$300/month |
| Data storage (logs + evals) | S3 + Athena for cost tracking and QA | $30–$200/month |
| Security / PII scrubbing | Custom middleware or Amazon Comprehend | $100–$500/month |
For a mid-scale deployment (50K–200K requests/day), infrastructure overhead typically adds $600–$3,500/month on top of API spend. At 1M+ requests/day it becomes the dominant cost driver, especially if you run dedicated GPU instances for self-hosted models.
5. Hidden LLMOps Costs Nobody Advertises
This is where agency proposals fall apart. The following costs are real and recurring, but they rarely appear in a vendor slide deck:
- Prompt regression testing. Every model version change (GPT-4o → GPT-4o-2024-11-20, Claude 3.5 → 4.x) can silently break prompt behaviour. Regression-testing a library of 40–200 prompts against a new model version takes 2–5 engineer-days per update, and at enterprise scale you'll see 4–8 major model updates per year.
- Evaluation pipelines. LLM outputs are probabilistic, so you need an eval pipeline: a golden dataset, automated scoring (often using a judge LLM — which itself costs tokens), and human review sampling. A 1,000-sample eval set run weekly with GPT-4o as judge costs ~$50/run — $2,600/year — before human time.
- Guardrails and PII handling. Content moderation (OpenAI Moderation API: free per call but adds latency; custom NeMo Guardrails or Llama Guard: ~$0.20–$0.80 per 1K requests) and PII detection add both cost and latency. Many teams underestimate this entirely.
- Conversation memory management. As conversation history grows, you hit context limits. Building and maintaining sliding-window summarisation, hierarchical memory, or vector-based memory retrieval is non-trivial ongoing engineering work — not a one-time setup.
- Retries and fallbacks. Provider outages, rate-limit errors (HTTP 429), and timeout retries all burn tokens and compute. A robust production system with exponential backoff, fallback providers, and circuit breakers adds engineering complexity plus 5–15% effective API overhead from duplicate requests — see the backoff sketch after this list.
- Fine-tuning and hosting. OpenAI fine-tuning runs ~$8/1M training tokens for GPT-4o-mini plus $0.30/1M input at inference on the fine-tuned model, so a run on a 500K-token dataset costs only ~$4 to train — but the fine-tuned model costs more to serve than the base model. Self-hosting a fine-tuned model on AWS SageMaker (ml.g5.2xlarge at ~$1.52/hr) requires a dedicated instance reservation.
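For the retries-and-fallbacks point, here is a minimal sketch of exponential backoff with a cheaper fallback tier. It is illustrative only: the model labels and the call_model helper are placeholders for whichever provider client you actually use.

```python
import random
import time

def call_model(model: str, prompt: str) -> str:
    """Placeholder for your real provider call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def robust_completion(prompt: str, max_retries: int = 4) -> str:
    # Try the primary model with exponential backoff, then fall back to a cheaper tier.
    for model in ("primary-model", "fallback-model"):
        for attempt in range(max_retries):
            try:
                return call_model(model, prompt)
            except Exception:  # in practice, catch rate-limit / timeout errors specifically
                # Exponential backoff with jitter: ~1s, 2s, 4s, 8s (±25%)
                time.sleep((2 ** attempt) * random.uniform(0.75, 1.25))
    raise RuntimeError("All models and retries exhausted")
```

Every duplicate request this loop fires is billed; that is where the 5–15% overhead comes from.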
6. Three Real-World Cost Scenarios
Scenario A: Internal Knowledge Base Q&A (500 employees, ~2K requests/day)
| Cost Item | Monthly Cost |
|---|---|
| API (Claude Sonnet 4.6, avg 2K input + 400 output tokens) | $900/month |
| Vector DB (Pinecone starter) | $70/month |
| Embeddings (monthly re-index) | $15/month |
| Orchestration compute (Cloud Run) | $120/month |
| Observability (Helicone) | $80/month |
| Total infra + API | ~$1,185/month |
| Engineering maintenance (0.15 FTE) | ~$2,250/month |
| True monthly TCO | ~$3,435/month (~$41K/year) |
Scenario B: Customer-Facing AI Support Agent (50K requests/day, mixed model routing)
| Cost Item | Monthly Cost |
|---|---|
| GPT-4o-mini for intent classification (95% of traffic) | $650/month |
| Claude Sonnet 4.6 for complex escalations (5% of traffic) | $1,350/month |
| Embeddings + vector DB (Weaviate Cloud) | $340/month |
| Redis semantic cache (30% hit rate) | $190/month |
| Orchestration + API gateway (ECS) | $420/month |
| LangSmith observability | $250/month |
| PII scrubbing middleware | $180/month |
| Total infra + API | ~$3,380/month |
| Engineering maintenance (0.5 FTE) | ~$7,500/month |
| Prompt eval & QA | ~$1,200/month |
| True monthly TCO | ~$12,080/month (~$145K/year) |
Scenario C: Autonomous AI Agent Pipeline (document processing, 500K docs/month)
| Cost Item | Monthly Cost |
|---|---|
| Gemini 2.0 Flash for extraction (high volume) | $1,200/month |
| Claude Sonnet 4.6 for validation + final output | $4,800/month |
| Claude Opus 4.6 for edge-case escalations (~2%) | $1,100/month |
| Self-hosted Llama 3.3 70B on spot GPU (preprocessing) | $1,800/month |
| Storage + compute (AWS) | $1,400/month |
| Orchestration + monitoring | $600/month |
| Total infra + API | ~$10,900/month |
| Engineering team (1.5 FTE: 1 AI engineer + 0.5 MLOps) | ~$22,500/month |
| Eval pipeline + human review | ~$3,000/month |
| True monthly TCO | ~$36,400/month (~$437K/year) |
7. The TCO Formula I Use With Clients
When I scope a new enterprise AI engagement, I use this total cost of ownership framework. It consistently predicts actual spend to within 20%:
TCO = API_cost + Infra_cost + LLMOps_cost + Engineering_cost
# Where:
API_cost = Σ (monthly_tokens_per_model × price_per_token) × 12
Infra_cost = (vector_db + embeddings + compute + gateway + cache + observability) × 12
LLMOps_cost = (eval_pipeline + guardrails + prompt_maintenance + retry_overhead) × 12
Engineering_cost = FTE_count × avg_annual_fully_loaded_cost × llm_allocation_fraction
# Rule of thumb multiplier (validated across 15+ deployments):
TCO ≈ API_cost × 3.2 # for teams new to LLMOps
TCO ≈ API_cost × 1.8 # for mature teams with tooling in place
The multiplier drops as your team matures tooling, builds a prompt library, and shifts high-volume tasks to cheaper model tiers. Teams that skip the LLMOps investment pay for it in model quality incidents, prompt regressions, and runaway token spend.
8. Five Levers That Cut Your Bill by 40–70%
Lever 1: Prompt Caching (saves 15–40% on input costs)
Both Anthropic (explicit cache_control breakpoints) and OpenAI (automatic for prompt prefixes above roughly 1,024 tokens) support prompt caching. Cache your system prompt and any static few-shot examples. At $3.00/1M input with Anthropic, cached reads cost $0.30/1M — a 90% reduction, offset only by a small premium on the initial cache write. Payback threshold: if the same prefix is reused within the cache lifetime — minutes, not hours — caching wins.
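A minimal sketch of the Anthropic pattern. The model ID and prompt are placeholders, and field names can change between SDK versions, so check the current docs before relying on this shape:

```python
import anthropic

LONG_SYSTEM_PROMPT = "..."              # your static ~600-token system prompt
user_message = "How do I reset my password?"

client = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-model-id",     # placeholder — substitute your actual model ID
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)

# usage.cache_creation_input_tokens / usage.cache_read_input_tokens show how much of
# the prompt was written to or served from the cache on this call.
print(response.usage)
```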
Lever 2: Model Routing (saves 30–60% overall)
Not every request needs GPT-4o. Build a lightweight router (a fine-tuned classifier or even keyword rules) that sends simple queries to GPT-4o-mini or Gemini Flash and only escalates complex ones to frontier models. In my experience, 60–75% of enterprise support tickets can be handled by a Tier-3 model with no quality drop.
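A deliberately naive sketch of the routing idea — keyword rules standing in for a real classifier, and the returned strings are just labels for a cheap tier and an expensive tier, not real model IDs:

```python
ESCALATION_SIGNALS = (
    "refund", "legal", "complaint", "cancel my account", "escalate",
)

def pick_model(user_message: str, history_turns: int) -> str:
    """Route cheap/simple traffic to an efficient model, escalate the rest."""
    msg = user_message.lower()
    needs_frontier = (
        any(signal in msg for signal in ESCALATION_SIGNALS)
        or history_turns > 6      # long conversations tend to be the hard ones
        or len(msg) > 1_500       # very long messages usually mean complex issues
    )
    return "tier-2-model" if needs_frontier else "tier-3-model"

# Example: pick the tier, then call whichever provider client maps to that label
model = pick_model("I want a refund for last month's invoice", history_turns=2)
```

In production the rules are usually replaced by a fine-tuned small classifier, but the shape — classify first, spend second — stays the same.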
Lever 3: Semantic Caching (saves 10–25% on repeated queries)
Cache LLM responses keyed by embedding similarity. Tools like GPTCache or a Redis + pgvector setup can serve identical or near-identical queries from cache. For FAQ-heavy workflows, cache hit rates of 25–40% are achievable.
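The core of a semantic cache is simply: embed the query, compare against embeddings of previously answered queries, and serve the stored answer when similarity clears a threshold. A minimal in-memory sketch — embed() is a placeholder for your real embedding call, and production setups use Redis or pgvector rather than a Python list:

```python
import math

cache: list[tuple[list[float], str]] = []   # (query_embedding, cached_response)
SIMILARITY_THRESHOLD = 0.92                 # tune against your own traffic

def embed(text: str) -> list[float]:
    """Placeholder for a real embedding call (e.g. text-embedding-3-small)."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cached_or_none(query: str) -> str | None:
    q = embed(query)
    best = max(cache, key=lambda item: cosine(q, item[0]), default=None)
    if best and cosine(q, best[0]) >= SIMILARITY_THRESHOLD:
        return best[1]                      # cache hit — no LLM call, no output tokens
    return None

def store(query: str, response: str) -> None:
    cache.append((embed(query), response))
```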
Lever 4: Output Token Reduction (saves 20–50% on output costs)
Explicitly constrain output format. Forcing JSON-only responses, limiting response length via max_tokens, and instructing the model to be concise all reduce output tokens substantially. Spending a couple hundred input tokens on explicit brevity and format instructions can cut your output bill by 30–50% in structured-output pipelines — a good trade, since those instructions are billed at the cheaper input rate and can themselves be cached.
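A sketch of the same idea with the OpenAI Python SDK — a hard output cap plus JSON mode plus a brevity instruction. Treat the exact parameter names as version-dependent (newer endpoints prefer max_completion_tokens):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=200,                            # hard cap on billable output tokens
    response_format={"type": "json_object"},   # JSON mode: no prose padding
    messages=[
        {"role": "system",
         "content": 'Answer with a JSON object {"category": ..., "confidence": ...}. '
                    "Be concise; do not explain your reasoning."},
        {"role": "user", "content": "Ticket: 'My invoice was charged twice this month.'"},
    ],
)
print(response.choices[0].message.content)
```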
Lever 5: Batch API for Async Workloads (saves 50% flat)
If your use case tolerates a 24-hour turnaround (document processing, overnight report generation, batch classification), use the OpenAI Batch API or Anthropic Message Batches. Both offer a 50% discount over synchronous calls with identical quality. In a document-processing workload like Scenario C above, batching the extraction and validation passes alone cuts the API line by roughly $3,000/month.
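A sketch of the OpenAI Batch API flow (Anthropic's Message Batches follow a similar submit-and-poll pattern). The document list and extraction prompt here are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
documents = ["Invoice 2026-0144: ...", "Contract addendum: ..."]  # your source docs

# 1. One request per line in the Batch API's JSONL format
with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(documents):
        f.write(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "max_tokens": 300,
                "messages": [{"role": "user", "content": f"Extract the key fields: {doc}"}],
            },
        }) + "\n")

# 2. Upload the file and start the batch — results arrive within 24h at 50% off
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until completed
```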
9. Build vs. Buy vs. Fine-Tune — The Honest Matrix
| Option | Upfront Cost | Ongoing Cost | Control | Time-to-Production | When to Choose |
|---|---|---|---|---|---|
| API (off-the-shelf) | Low ($0) | High (per-token) | Low | Days–weeks | MVP, uncertain volume, general tasks |
| Fine-tuned hosted model | Medium ($2K–$20K) | Medium (reduced tokens) | Medium | 4–8 weeks | Domain-specific tasks, consistent format, 10M+ tokens/month |
| Self-hosted open model | High ($20K–$100K infra setup) | Predictable (GPU instances) | High | 8–16 weeks | Data sovereignty, regulated industries, >1B tokens/month |
| Distilled/quantised model | High (distillation pipeline) | Low (cheap inference) | Very High | 12–24 weeks | Edge deployment, offline use, extreme cost sensitivity |
The crossover point for self-hosting versus API: if your monthly API spend exceeds ~$8,000–$12,000 and your task is well-defined enough for a specialised model, self-hosting (or fine-tuning on a cheaper tier) typically breaks even within 6–9 months. Below that threshold, the engineering overhead rarely justifies it.
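That crossover is easy to sanity-check against your own numbers. A back-of-envelope sketch — every figure in the example call is an assumption you should replace with your own measurements:

```python
def self_host_payback_months(
    monthly_api_spend: float,       # current API bill, e.g. 12_000
    monthly_gpu_cost: float,        # dedicated/spot GPU instances, e.g. 3_500
    monthly_extra_ops_cost: float,  # added MLOps effort, e.g. 0.25 FTE ≈ 3_750
    setup_cost: float,              # migration + infra build-out, e.g. 40_000
) -> float | None:
    """Months until self-hosting pays back its setup cost, or None if it never does."""
    monthly_saving = monthly_api_spend - monthly_gpu_cost - monthly_extra_ops_cost
    if monthly_saving <= 0:
        return None
    return setup_cost / monthly_saving

print(self_host_payback_months(12_000, 3_500, 3_750, 40_000))  # ≈ 8.4 months
```

With those illustrative inputs the payback lands inside the 6–9-month window above; with a smaller API bill the saving goes negative and self-hosting never pays back, which is the point of the threshold.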
The Bottom Line
The question "how much does it cost to deploy an LLM?" has no universal answer — but it has a rigorous one once you define your request volume, token profile, latency requirements, and quality bar. I've watched companies spend $200K/year on a system that should cost $40K, and I've watched others under-invest in observability and pay for it in silent quality degradation.
The framework above gives you enough structure to build a credible cost model before you write a line of production code. If you're a CTO or engineering leader evaluating an LLM initiative and want a second opinion on the numbers — or if you're looking for help designing a cost-efficient LLM architecture from the start — that's exactly the kind of work I do.
Transparency in AI pricing is a competitive advantage, not a risk. The teams that model costs rigorously from day one ship faster, justify budgets more easily, and avoid the invoice shock that kills otherwise promising AI programmes.