March 30, 2026 · Iridis Engineering
Someone on your team built a demo. It’s impressive. You paste in a document, GPT-4 extracts the key fields, a nice UI renders the results, and the CEO is ready to fund a six-month roadmap. Ship it.
Except you won’t ship it. Not like that. What you have is a script that works on 10 documents with a human watching it. What you need is a system that works on 10,000 documents at 3 AM with nobody watching it. Those are not the same engineering problem, and the gap between them is where companies burn $100K+ and two quarters before they either figure it out or quietly kill the project.
We’ve spent the last two years pulling teams out of this gap. Here’s what actually breaks, why, and what production AI engineering looks like when you stop pretending the demo is the product.
What breaks? Everything. But specifically:
Concurrency. Your demo runs one request at a time. Production runs 50. OpenAI’s rate limits don’t care about your launch date. GPT-4o gives you maybe 10,000 TPM on Tier 1. Your document pipeline burns 4,000 tokens per document. Do the math — you’re processing 2.5 documents per minute before you even hit application-level bottlenecks.
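That arithmetic can live in the client as a crude token-budget limiter. A sketch, using a sliding one-minute window (the 10,000 TPM and 4,000 tokens per document are the Tier 1 figures above; the windowing scheme is a simplifying assumption, not how any provider meters internally):

```python
import time

class TokenBudget:
    """Sliding one-minute window of token spend; reports how long a new
    request must wait once the per-minute budget is exhausted."""

    def __init__(self, tokens_per_minute):
        self.tpm = tokens_per_minute
        self.spent = []  # list of (timestamp, tokens)

    def record(self, tokens, now=None):
        self.spent.append((time.monotonic() if now is None else now, tokens))

    def wait_time(self, tokens, now=None):
        """Seconds to wait before a request of `tokens` fits in the budget."""
        now = time.monotonic() if now is None else now
        # Drop spend older than the 60-second window.
        self.spent = [(t, n) for t, n in self.spent if now - t < 60]
        used = sum(n for _, n in self.spent)
        if used + tokens <= self.tpm:
            return 0.0
        # Otherwise wait until enough earlier spend ages out of the window.
        needed = used + tokens - self.tpm
        freed = 0
        for t, n in self.spent:
            freed += n
            if freed >= needed:
                return max(0.0, 60 - (now - t))
        return 60.0

# 10,000 TPM and 4,000-token documents: two calls fit, the third waits --
# roughly 2.5 documents per minute sustained, as computed above.
budget = TokenBudget(10_000)
budget.record(4_000, now=0.0)
budget.record(4_000, now=1.0)
```

Scheduling against the budget before you send the request is what turns "we got 429'd" into "the queue drained a bit slower."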
Error handling. Your demo has a try/catch that logs the error and moves on. Production needs to answer: which document failed? Can we retry it? After how long? How many times? What do we tell the user? What happens to the seven documents behind it in the queue? Is the failure transient or structural? Your demo answers none of these questions.
Edge cases. You tested on clean PDFs with standard formatting. Production gets scanned documents at 150 DPI, 47-page contracts with tables that span page breaks, handwritten margin notes, files named final_FINAL_v3_USE_THIS_ONE.pdf. Your extraction prompt doesn’t handle any of this because you never saw it during the demo.
Cost. Your demo processed 10 documents for $0.30 and everyone agreed AI is basically free. At production scale, you’re processing 500 documents a day. With GPT-4o at $2.50/$10 per million tokens, your 4K-token extraction running on 80-page documents (averaging 60K input tokens each) costs $0.15 per document. That’s $75/day, $2,250/month — and that’s before you add the validation pass, the summarization step, and the retry overhead. The CFO’s question went from “this is so cheap” to “what is this line item and why does it keep going up.”
Latency. Your demo takes 8 seconds to process a document and nobody cares because someone is sitting there watching it. Production needs P95 under 30 seconds because there’s an SLA, a queue backing up, and a user staring at a progress bar. Under concurrent load, that 8 seconds becomes 14, then 23, then timeouts.
Monitoring. You have console.log. That’s it. You have no idea what your hallucination rate is. You don’t know which documents fail most often. You can’t tell if Tuesday’s cost spike was because someone uploaded 200 documents or because your prompt started generating 3x more tokens than usual. You are flying blind.
What fails first? In roughly this order, every time:
Context length. Your test documents were 3–5 pages. Real documents are 40–120 pages. A 100-page PDF tokenizes to roughly 75,000 tokens. GPT-4o’s context window handles it, but now you’re paying 15x more per document than your cost model predicted, latency triples, and your extraction quality drops because the model is trying to find a three-line clause buried in 75K tokens of boilerplate. The attention mechanism doesn’t scale linearly with context length — relevant information in the middle of long contexts gets lost. This is documented, measured, and it will bite you.
Latency stacking. A single GPT-4o call averages 3–8 seconds for a moderate completion. Stack three sequential LLM calls in your pipeline (extraction, reasoning, formatting) and you’re at 12–24 seconds for a single document. Now run 20 of these concurrently. You hit rate limits, responses slow down, and your P95 latency climbs from 20 seconds to 90+. We’ve seen pipelines that ran fine in testing hit 3-minute P95s under real load because nobody modeled the queuing behavior.
Hallucinations at scale. Here’s the thing about a 2% hallucination rate: on 10 test documents, that’s zero hallucinations. On 500 documents a day, that’s 10 documents with fabricated data going into your downstream systems every single day. The hallucination rate also isn’t constant — it’s correlated with document complexity, token count, and how far the input deviates from the patterns in your prompt examples. Your hardest documents hallucinate at 8–12%, not 2%. You won’t know this until production unless you’re measuring it, and you’re not measuring it.
Cost surprises. This one is simple math that nobody does until the invoice arrives. Development: 50 test runs/day, $15/day, $450/month. Production: 500 documents/day, multi-step pipeline with retries, $150/day baseline, $4,500/month. Then someone adds a “just re-run the failures” button that triggers a full re-extraction and your February bill is $8,200. We have seen teams 10x their projected API costs within 60 days of launch. Every time, the response is surprise. Every time, the cost model existed on a napkin.
Observability. You cannot fix what you cannot see. Most teams launch their AI pipeline with exactly the same observability they’d give a curl command: did it return 200 or not. They don’t track tokens consumed per request, latency percentiles per pipeline stage, output quality metrics, cost per document, error categorization, or retry rates. When something goes wrong — and it will go wrong on a Thursday afternoon — they’re reading raw logs trying to figure out which of 47 documents in a batch caused the pipeline to stall.
Production AI engineering means treating LLM calls like you’d treat any unreliable external dependency — because that’s what they are. An LLM API is a third-party service with variable latency, occasional errors, rate limits, and non-deterministic output. If you’re not engineering around all four of those properties, you don’t have a production system.
Retry logic with exponential backoff and jitter. Not retry(3). Actual backoff that respects rate limit headers, adds jitter to prevent thundering herds, and distinguishes between retryable errors (429, 500, timeout) and permanent failures (400, invalid content). Track retry rates as a first-class metric — if your retry rate crosses 15%, something is wrong upstream.
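A minimal sketch of that retry loop, assuming the call reports its status code and any Retry-After value (the status sets, attempt counts, and delays here are illustrative defaults, not recommendations for your workload):

```python
import random
import time

RETRYABLE = {429, 500, 502, 503}  # plus timeouts; errors like 400 are permanent

def call_with_backoff(call, max_attempts=5, base=1.0, cap=30.0):
    """Retry `call` -- which returns (status, body, retry_after) -- with
    exponential backoff and full jitter; fail fast on permanent errors."""
    for attempt in range(max_attempts):
        status, body, retry_after = call()
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError("permanent failure: %s" % status)
        # Respect the server's Retry-After when present, else exponential.
        delay = retry_after if retry_after else min(cap, base * 2 ** attempt)
        time.sleep(random.uniform(0, delay))  # full jitter breaks up herds
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

The `random.uniform(0, delay)` jitter is the detail teams skip: without it, every request that 429’d at 2 PM retries at exactly the same instant and 429s again.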
Circuit breakers. When OpenAI is having a bad day (and they will), your system should detect the failure pattern and stop sending requests, not pile up 200 timed-out calls in your queue. Trip the circuit at 5 consecutive failures or 50% error rate over a 60-second window. Half-open after 30 seconds. This is not optional.
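A sketch of the breaker with the consecutive-failure trip and half-open timer described above (the 50%-error-rate window is omitted for brevity; the injectable `clock` is a testing convenience, not part of any standard API):

```python
import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; after
    `reset_timeout` seconds it half-opens and lets probe traffic through."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # probe the provider again
                return True
            return False  # fail fast instead of queuing doomed calls
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Callers check `allow()` before sending; a `False` goes straight back into the queue (or to the fallback chain) instead of becoming another timed-out call.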
Fallback chains. Your primary model is GPT-4o. Your fallback is Claude 3.5 Sonnet with an adapted prompt. Your second fallback is a rule-based extraction that handles the 60% of cases that don’t actually need an LLM. If you don’t have a fallback chain, a single provider outage takes down your entire product. We’ve built systems where the fallback path handles 30% of total volume during normal operation because it’s cheaper and faster for simple cases.
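One way to sketch that chain — the extractor list, the `rule_based_extract` helper, and the `TOTAL:` pattern are all hypothetical stand-ins for your real paths:

```python
def extract_with_fallbacks(document, extractors):
    """Try each (name, fn) extractor in order; return the first usable
    result, tagged with which path produced it."""
    errors = []
    for name, fn in extractors:
        try:
            result = fn(document)
            if result is not None:
                return name, result
            errors.append((name, "returned nothing"))
        except Exception as exc:  # outage, timeout, validation failure
            errors.append((name, str(exc)))
    raise RuntimeError("all extractors failed: %r" % errors)

def rule_based_extract(document):
    # Hypothetical cheap path: pattern matching for simple documents
    # that never needed an LLM in the first place.
    if "TOTAL:" in document:
        return {"total": document.split("TOTAL:")[1].split()[0]}
    return None
```

Recording which path produced each result is what lets you later discover that the cheap path can handle a third of your volume.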
Structured output validation. The LLM returns JSON. Or it returns JSON with a trailing comma. Or it returns markdown-wrapped JSON. Or it returns a conversational explanation with JSON embedded somewhere in the middle. Use constrained decoding (function calling, response_format) where available, and still validate the output against a schema. Check value ranges, required fields, type constraints, and cross-field consistency. Reject and retry on validation failure. Log every rejection with the raw output for debugging.
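A sketch of the salvage-then-validate step, with a hand-rolled required-fields check standing in for a real schema library (the field names and the range check are hypothetical):

```python
import json
import re

REQUIRED = {"vendor": str, "contract_value": (int, float), "start_date": str}

def extract_json(raw):
    """Pull the first JSON object out of a possibly markdown-wrapped or
    chatty LLM response."""
    raw = re.sub(r"```(?:json)?", "", raw)  # strip markdown fences
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object in output")
    # raw_decode parses the first object and ignores trailing prose.
    obj, _ = json.JSONDecoder().raw_decode(raw[start:])
    return obj

def validate(obj):
    """Return a list of schema violations; an empty list means valid."""
    problems = []
    for name, typ in REQUIRED.items():
        if name not in obj:
            problems.append("missing field: %s" % name)
        elif not isinstance(obj[name], typ):
            problems.append("wrong type for %s" % name)
    if not problems and obj["contract_value"] < 0:
        problems.append("contract_value out of range")
    return problems
```

Anything that fails `extract_json` or returns a non-empty problem list gets rejected and retried, with the raw output logged.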
Dead-letter queues. After 3 retries across 2 models with validation failures on each attempt, the document goes to a dead-letter queue for human review. Not silently dropped. Not retried infinitely. Queued, tagged with the failure mode, and surfaced in a dashboard. This is how you maintain data integrity without blocking the pipeline.
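The bookkeeping can be sketched in a few lines — the in-memory deque is a stand-in for whatever durable queue and dashboard you actually run, and the failure-mode tags are illustrative:

```python
import time
from collections import deque

MAX_ATTEMPTS = 3

dead_letters = deque()  # in production: a durable store, not process memory

def process_or_dead_letter(doc_id, raw_output, attempts, failure_mode):
    """After MAX_ATTEMPTS, park the document with its failure mode for
    human review instead of dropping it or retrying forever."""
    if attempts < MAX_ATTEMPTS:
        return "retry"
    dead_letters.append({
        "doc_id": doc_id,
        "failure_mode": failure_mode,  # e.g. "validation", "timeout", "parse"
        "attempts": attempts,
        "dead_lettered_at": time.time(),
        "raw_output": raw_output,      # kept verbatim for debugging
    })
    return "dead-lettered"
```

The queue depth and the failure-mode histogram both belong on a dashboard; a rising "validation" count usually means a model or prompt regression, not bad documents.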
Graceful degradation. When your pipeline is overloaded, serve partial results rather than no results. Extract the high-confidence fields and flag the rest for async processing. Return what you have with a confidence score rather than blocking on a perfect result that may never arrive.
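A sketch of that split, assuming each extracted field carries a confidence score (the 0.8 floor and the field names are arbitrary illustrations):

```python
from dataclasses import dataclass, field

CONFIDENCE_FLOOR = 0.8  # assumed threshold; tune per field in practice

@dataclass
class PartialResult:
    fields: dict = field(default_factory=dict)    # serve these now
    deferred: list = field(default_factory=list)  # flag for the async pass

def degrade(extracted):
    """Split an extraction into fields we can serve immediately vs. fields
    routed to the slow path, instead of blocking on a perfect result."""
    out = PartialResult()
    for name, (value, confidence) in extracted.items():
        if confidence >= CONFIDENCE_FLOOR:
            out.fields[name] = value
        else:
            out.deferred.append(name)
    return out
```

The caller gets a usable answer plus an honest list of what is still pending, which beats a spinner that never resolves.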
At minimum, you need these dashboards on day one: tokens consumed per request and per pipeline stage, latency percentiles (P95 at least) per stage, cost per document and per day, error counts by category, retry rates, and output-validation failure rates.
If you don’t have this, you’re not running a production system. You’re running a demo with a domain name.
Here’s a real pattern. A team builds a pipeline that extracts structured data from vendor contracts. Works great on 10 test PDFs. Here’s what happens when they push it to 500 documents.
Documents 1–50: Everything works. Extraction accuracy is 94%. Average latency 11 seconds. Cost is tracking to projections. Team is confident.
Documents 51–200: The first scanned PDF arrives. OCR output is garbled. The extraction prompt, tuned for clean text, hallucinates field values from the noise. Nobody notices because there’s no output validation — the hallucinated values are syntactically valid JSON with plausible-looking numbers. Three contracts with fabricated payment terms enter the downstream system.
Documents 200–350: A batch of 40 documents hits at 2 PM. Rate limits kick in. The pipeline has no backpressure mechanism, so all 40 requests fire simultaneously, 35 get 429’d, the retry logic (a flat 3-second delay) puts them all back in the queue at the same time, they all 429 again. The queue depth hits 120. P95 latency is now 4 minutes. A user reports the system is “frozen.”
Documents 350–450: Someone uploads a 247-page master service agreement. It tokenizes to 185,000 tokens. The single-pass extraction prompt blows past the output token limit, returns a truncated JSON response, the parser throws an unhandled exception, and the worker process crashes. The remaining 12 documents in that worker’s batch are silently dropped. Nobody knows they were dropped until a client calls three days later asking where their report is.
Document 500: The monthly OpenAI invoice arrives. It’s $14,200. The projection was $4,000. Nobody can explain the difference because there’s no cost attribution. Was it the retries? The 247-page document? The batch that got stuck in a retry loop? All of the above, but you can’t prove it.
Every one of these failures was preventable with standard engineering practices. Chunking strategy for long documents. Input validation and OCR quality scoring. Rate-limit-aware request scheduling with backpressure. Output schema validation with dead-lettering. Per-document cost tracking. Worker health checks and batch recovery.
After building enough of these systems, clear patterns emerge:
Separate extraction from reasoning from generation. Don’t ask one mega-prompt to read a document, understand it, make decisions about it, and produce formatted output. That’s three or four distinct operations with different accuracy requirements, different cost profiles, and different failure modes. Split them. Your extraction stage can use a cheaper, faster model. Your reasoning stage needs the best model you have. Your generation stage might not need an LLM at all — template engines exist.
Chunk and map-reduce for long documents. Don’t stuff 75K tokens into one prompt and hope. Split the document into semantic chunks (by section, by page, by heading structure). Extract from each chunk independently. Merge the results with a lightweight reconciliation step that resolves conflicts and deduplicates. This is slower on simple documents but dramatically more reliable on complex ones, and it lets you parallelize.
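A sketch of the pattern, using character counts as a stand-in for a real token budget and a deliberately naive first-wins merge as the reconciliation step:

```python
def chunk_by_section(text, max_chars=4000):
    """Split on blank-line-delimited sections, packing sections into
    chunks under max_chars (a proxy for a token budget)."""
    chunks, current = [], ""
    for section in text.split("\n\n"):
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ""
        current += section + "\n\n"
    if current.strip():
        chunks.append(current)
    return chunks

def map_reduce_extract(text, extract_fn):
    """Extract from each chunk independently (parallelizable), then
    reconcile: later chunks only fill fields earlier chunks left empty."""
    merged = {}
    for chunk in chunk_by_section(text):
        for key, value in extract_fn(chunk).items():
            merged.setdefault(key, value)
    return merged
```

A real reconciliation step also has to resolve conflicting values and deduplicate repeated clauses; `setdefault` is the minimum viable version of that decision.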
Validate outputs structurally and semantically. Structural: does the JSON parse? Are required fields present? Are dates actually dates? Semantic: is the extracted contract value within a plausible range? Does the vendor name match something in your known-vendor list? Is the extracted date in the future when it shouldn’t be? These checks catch 80% of hallucinations before they reach downstream systems.
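The semantic layer can be as plain as a checklist — the vendor list, value range, and field names below are all hypothetical examples of the checks described above:

```python
import datetime

KNOWN_VENDORS = {"Acme Corp", "Globex", "Initech"}  # your vendor master list
VALUE_RANGE = (100, 10_000_000)                     # plausible contract values

def semantic_checks(record):
    """Structural validation catches malformed output; these checks catch
    plausible-looking hallucinations."""
    problems = []
    lo, hi = VALUE_RANGE
    if not (lo <= record["contract_value"] <= hi):
        problems.append("contract_value outside plausible range")
    if record["vendor"] not in KNOWN_VENDORS:
        problems.append("vendor not in known-vendor list")
    try:
        signed = datetime.date.fromisoformat(record["signed_date"])
        if signed > datetime.date.today():
            problems.append("signed_date is in the future")
    except ValueError:
        problems.append("signed_date is not a valid date")
    return problems
```

Each rule is trivial on its own; the leverage comes from running all of them on every document and alerting on the rejection rate.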
Build the pipeline as a DAG, not a script. Each stage is an independent unit with defined inputs, outputs, and failure modes. Stages communicate through a message queue or workflow engine. Failed stages can be retried independently without re-running the entire pipeline. This is boring distributed systems engineering, and it’s exactly what’s needed.
Cache aggressively. If you’ve already extracted data from a document, store the result. Documents don’t change. LLM calls are expensive and non-deterministic — running the same extraction twice might give you different results, so run it once, validate it, and cache the validated output. Invalidate only on prompt changes or model version updates.
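A sketch of that keying scheme — content hash plus prompt and model version, so invalidation happens exactly when those change (the version strings and the in-memory dict are illustrative):

```python
import hashlib

PROMPT_VERSION = "extract-v7"         # bump to invalidate cached extractions
MODEL_VERSION = "gpt-4o-2024-08-06"   # assumed model pin

_cache = {}  # in production: Redis/Postgres keyed the same way

def cache_key(document_bytes):
    """Documents don't change, so key on content hash + prompt + model."""
    h = hashlib.sha256()
    h.update(document_bytes)
    h.update(PROMPT_VERSION.encode())
    h.update(MODEL_VERSION.encode())
    return h.hexdigest()

def extract_cached(document_bytes, extract_fn):
    key = cache_key(document_bytes)
    if key not in _cache:
        _cache[key] = extract_fn(document_bytes)  # one validated LLM call
    return _cache[key]
```

Because the key bakes in the prompt and model versions, a prompt change naturally misses the cache and re-extracts, while an unchanged document never pays for a second call.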
The gap between an AI demo and an AI product isn’t AI knowledge. It’s engineering discipline applied to a system with a fundamentally unreliable core component. The teams that ship production AI systems aren’t the ones with the most sophisticated prompts. They’re the ones that treat LLM calls as unreliable I/O, build observable pipelines with proper failure handling, and never mistake “it worked on my laptop” for “it’s ready.”
The demo was the easy part.
15 minutes. No slides. No pitch. Just a conversation about what you’re building.