TL;DR: Anyone can demo an AI that writes a plausible expert quote. The hard part is everything around it: matching one specific spokesperson's voice instead of a generic "expert tone," grounding the quote in real sources, returning it before the deadline, handling the cases where there's no source or an ambiguous question, and logging every step so you can defend what shipped. We built the Media Comment Generator. The demo took a weekend. Production took the rest of the year.
There's a particular kind of LinkedIn post I've stopped being impressed by. Someone wires up a model, types "write me an expert quote about AI in marketing," gets back three tidy sentences, and announces they've replaced the PR consultant. The quote even reads well.
Then you ask the obvious question: whose voice is that? Where did the claim in sentence two come from? What happens when the journalist asks about something there's no article on yet? What happens at 2pm on deadline day when it needs to work, not maybe work?
That's the gap between a demo and a product. I know it precisely, because we crossed it building the Media Comment Generator — the tool our agency clients now use to answer journalists every day. The demo was a weekend. The product was a year. Here's where that year went.
A demo proves it's possible. Production proves it's reliable.
A demo runs once, in a scenario the demoer chose, with a forgiving audience. Production runs a hundred times a week, in scenarios nobody chose, with a PR director's relationship with a Reuters reporter on the line.
The demo answers "can the model produce a good quote?" The answer has been yes for a while. Production answers a harder question: "can a team build an entire deadline around this and not get burned?" Everything below is what that second question costs.
1. Voice fidelity, not "expert tone"
A demo produces a quote in an expert voice. Production produces a quote in your spokesperson's voice — and they are not the same thing.
A real spokesperson has a vocabulary, a sentence rhythm, opinions they'll defend, and things they will never say. The CEO who says "I'll be blunt" and the CMO who hedges everything in data are not interchangeable, and a journalist who's interviewed them before will notice instantly if the quote sounds wrong.
So production needs a voice profile per spokesperson: characteristic phrasing, areas of genuine expertise, reasoning style, and a hard do-not-say list (claims they can't make, competitors they won't name, positions off-limits for legal reasons). The model writes as that person, not as a generic authority. Get this wrong and you haven't saved time — you've created a quote the spokesperson has to rewrite, which is slower than starting from scratch.
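To make the idea concrete, here is a minimal sketch of what a per-spokesperson profile could look like, with a simple check of a draft against the do-not-say list. The `VoiceProfile` class, its field names, and the example values are all hypothetical and chosen for illustration; a real check would need more than substring matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    """Hypothetical per-spokesperson profile; field names are illustrative."""
    name: str
    characteristic_phrases: tuple = ()
    expertise_areas: tuple = ()
    reasoning_style: str = ""
    do_not_say: tuple = ()  # hard list: banned claims, names, positions

    def violations(self, draft: str) -> list:
        """Return every do-not-say term found in the draft (case-insensitive)."""
        lowered = draft.lower()
        return [term for term in self.do_not_say if term.lower() in lowered]


# Example values are invented for this sketch.
ceo = VoiceProfile(
    name="Example CEO",
    characteristic_phrases=("I'll be blunt",),
    expertise_areas=("fintech", "payments"),
    reasoning_style="direct, opinion first",
    do_not_say=("guaranteed returns", "AcmeBank"),
)

draft = "I'll be blunt: nobody can promise guaranteed returns."
print(ceo.violations(draft))  # ['guaranteed returns']
```

A draft that trips the list goes back for a rewrite before the spokesperson ever sees it, which is cheaper than having them catch it.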
2. Source attribution and fact-grounding
A demo quote can say "studies show engagement is up 40%." In production that sentence is a liability. Which study? Up since when? If a journalist prints it and it's invented, that damages the client's credibility first, then the agency's.
Production grounds every factual claim. The engine searches for current, real sources, ties claims back to them, and surfaces the citations so a human can check before anything ships. When the model wants to assert a number it can't support, the right behavior isn't to invent one — it's to soften to a defensible claim or flag that the number needs verification.
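One small piece of that behavior can be sketched as a check that flags any number in a draft that no retrieved source contains. This is a deliberately crude illustration, assuming sources arrive as plain text; the `Source` type and `unsupported_numbers` function are hypothetical names, and exact token matching is a stand-in for real claim verification.

```python
import re
from dataclasses import dataclass


@dataclass
class Source:
    title: str
    text: str


# Matches integers, decimals, and percentages such as 40, 3.5, 12%.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def unsupported_numbers(draft: str, sources: list) -> list:
    """Return numeric tokens in the draft that appear in none of the sources."""
    supported = set()
    for source in sources:
        supported.update(NUMBER.findall(source.text))
    return [token for token in NUMBER.findall(draft) if token not in supported]


sources = [Source("Example report", "Engagement rose 40% year over year.")]
draft = "Engagement is up 40%, and churn fell 12%."

flagged = unsupported_numbers(draft, sources)
print(flagged)  # ['12%']
```

Anything flagged is either softened to a claim the sources do support or surfaced to the human reviewer as needing verification.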
This is the difference between a writing toy and a tool a PR professional will put their name behind. The grounding layer is unglamorous and it's most of the trust.
3. Latency under a real deadline
The demo doesn't care if it takes ninety seconds or four minutes. Production does, because the entire value proposition is "before the deadline." A journalist who needs three quotes by 4pm will not wait while your multi-step pipeline thinks.
Real comment generation is several steps: parse the request, research, draft in voice, ground the facts, strip AI markers, validate. Each step adds time. Production engineering is the work of making that chain fast enough to feel responsive under pressure: parallelizing what can run in parallel, caching spokesperson profiles, picking the right model for each step instead of the biggest model for all of them. Our target is well under a minute end to end, because past that point the human starts wondering whether they should've just written it themselves.
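A rough sketch of those three levers, using Python's `asyncio` and `functools.lru_cache`. Everything here is a stand-in: the model names are placeholders, `call_model` and `search` only simulate latency with a short sleep, and the three search queries are invented. The point is the shape: cached profile lookup, concurrent searches, and a per-step model choice.

```python
import asyncio
from functools import lru_cache

# Placeholder model names: a cheap model for parsing, a stronger one for drafting.
MODEL_FOR_STEP = {"parse": "small-model", "draft": "large-model"}


@lru_cache(maxsize=256)
def load_profile(spokesperson_id: str) -> str:
    """Stand-in for a profile fetch; cached so repeat requests skip the lookup."""
    return f"profile for {spokesperson_id}"


async def call_model(step: str, prompt: str) -> str:
    await asyncio.sleep(0.05)  # simulated model latency
    return f"{MODEL_FOR_STEP[step]} <- {prompt}"


async def search(query: str) -> str:
    await asyncio.sleep(0.05)  # simulated search latency
    return f"result for {query}"


async def generate(request: str, spokesperson_id: str) -> dict:
    profile = load_profile(spokesperson_id)
    parsed = await call_model("parse", request)
    # The searches are independent of each other, so they run concurrently.
    queries = [f"{request} news", f"{request} data", f"{request} regulation"]
    sources = await asyncio.gather(*(search(q) for q in queries))
    draft = await call_model("draft", f"{profile} | {len(sources)} sources")
    return {"parsed": parsed, "sources": list(sources), "draft": draft}


result = asyncio.run(generate("new ad rules", "ceo-1"))
print(result["draft"])
```

With three searches running concurrently, the research step costs roughly one search's latency instead of three.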
4. Edge cases — where demos quietly fail
The demo scenario always has a clean article and a clear question. Production gets the messy reality:
- No source found. The topic is breaking and nothing's been published yet. The engine must say "I couldn't find current sources" rather than confidently hallucinate around the gap.
- Ambiguous question. "What do you think about the new rules?" Which rules? Production needs to ask or make a clearly stated assumption, not guess silently.
- Multiple spokespeople. The agency represents three experts; which voice does this quote take? The system has to route to the right profile, or ask.
- Out-of-scope request. A journalist asks the fintech CEO about a medical topic. The do-not-say boundary has to hold even when the question invites crossing it.
A demo never hits these because the demoer steers around them. Production lives in them. Handling edge cases gracefully — degrading honestly instead of failing confidently — is most of what separates a tool people trust from one they quietly stop using after it embarrasses them once.
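The four cases above can be sketched as an explicit triage step that runs before any drafting, so that every failure mode has a named outcome instead of a silent guess. The `Outcome` enum, `Request` fields, and `triage` function are hypothetical, and the order of the checks is one reasonable choice among several.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    DRAFT = "draft"
    NO_SOURCES = "no_sources"
    NEEDS_CLARIFICATION = "needs_clarification"
    CHOOSE_SPOKESPERSON = "choose_spokesperson"
    OUT_OF_SCOPE = "out_of_scope"


@dataclass
class Request:
    question: str
    topic: Optional[str]  # None when the question names no clear subject
    candidates: list      # spokespeople the agency could route to
    sources: list         # sources found by the research step


def triage(req: Request, expertise: dict) -> tuple:
    """Return (outcome, detail); only Outcome.DRAFT proceeds to writing."""
    if req.topic is None:
        return Outcome.NEEDS_CLARIFICATION, "Which topic does the question refer to?"
    in_scope = [s for s in req.candidates if req.topic in expertise.get(s, set())]
    if not in_scope:
        return Outcome.OUT_OF_SCOPE, f"No spokesperson covers {req.topic!r}."
    if len(in_scope) > 1:
        return Outcome.CHOOSE_SPOKESPERSON, "Candidates: " + ", ".join(in_scope)
    if not req.sources:
        return Outcome.NO_SOURCES, "I couldn't find current sources."
    return Outcome.DRAFT, in_scope[0]


# Invented example data.
expertise = {"ceo": {"fintech"}, "cmo": {"marketing", "fintech"}, "cto": {"security"}}
req = Request("What do you think about the new rules?", None, ["ceo", "cmo"], [])
print(triage(req, expertise)[0])  # Outcome.NEEDS_CLARIFICATION
```

Each non-draft outcome carries a message a human can act on, which is what degrading honestly looks like in practice.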
5. Reliability and observability
The demo runs on the demoer's laptop while they watch. Production runs unattended while a PR manager is in a meeting, trusting it'll be done when they're out.
That means the boring infrastructure that no one demos:
- Retries and graceful failure when a search API times out or a model call drops, instead of a dead end.
- Rate-limit handling so a busy Tuesday doesn't get throttled into errors.
- Audit trails — every request logged with its sources, its voice profile, and its output, so when a client asks "where did this quote come from?" three weeks later, there's an answer.
- Observability — you can see what the pipeline is doing, where it slows down, and where it's drifting, before a user reports it.
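Two of those items, retries and audit trails, are small enough to sketch. The following is an illustration under assumptions, not a description of any particular stack: `with_retries` retries a call with exponential backoff on timeout or connection errors, and `audit_record` builds one JSON line per request. The function names, record fields, and the simulated flaky search are all hypothetical.

```python
import json
import time
from datetime import datetime, timezone


def with_retries(fn, attempts=3, base_delay=0.1, retry_on=(TimeoutError, ConnectionError)):
    """Call fn(); on a retryable error wait base_delay * 2**n, then try again."""
    for n in range(attempts):
        try:
            return fn()
        except retry_on:
            if n == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** n)


def audit_record(request_id, profile_id, sources, output):
    """Build one JSON line recording what shipped and what it was based on."""
    return json.dumps({
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "voice_profile": profile_id,
        "sources": sources,
        "output": output,
    })


calls = {"count": 0}


def flaky_search():
    """Simulated search API that times out twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("search API timed out")
    return ["https://example.com/report"]


sources = with_retries(flaky_search, base_delay=0.01)
line = audit_record("req-001", "ceo-1", sources, "Draft quote text")
print(calls["count"])  # 3
```

Appending one such line per request to durable storage is enough to answer the "where did this quote come from?" question weeks later.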
The unglamorous work is the work. Voice profiles, fact-grounding, retries, audit logs — none of it screenshots well for a launch post. All of it is what makes a PR team rely on the thing at 2pm on a deadline instead of keeping a human on standby "just in case." If they keep the human on standby, you haven't shipped a product. You've shipped a demo with extra steps.
The 2pm test
Here's how I judge whether an AI tool is real. Picture a PR manager on Tuesday at 2pm. A journalist from a major outlet needs three quotes by 4pm. Does the manager open your tool with confidence, or do they open it and also start drafting manually because they don't trust it to deliver?
A demo earns applause. A product earns that 2pm trust — and the only way to earn it is to do the year of unglamorous work that no demo shows. Voice fidelity so the quote sounds like the actual person. Grounding so the facts hold up. Speed so it beats the clock. Edge-case honesty so it fails safely. Reliability so it works while no one's watching.
We built the demo in a weekend. Then we spent a year making it something a PR team relies on without a backup plan. That year is the product.
