Numbers don’t lie. But they can stay selectively silent. This chapter puts both the good and the ugly in print.
All figures from Baiyuan Pilot: 12 tenants, ~1.2M queries in Q1 2026. Principles: aggregate only, industry-delidentified, no absolute-number stacking, granularity useful for engineers but not for architecture copying.
Context: Consumer electronics brand, annual revenue > USD 5M, Taiwan market. Widget + LINE deployed.
| Metric | Before | After (3 mo) |
|---|---|---|
| Daily tickets | 120 | 38 (−68%) |
| First response time | 18 min | 0.8 s |
| L1 hit rate | — | 52% |
| Cache hit rate | — | 31% |
| Monthly LLM spend | — | USD 680 |
| CSAT | 4.1/5 | 4.3/5 |
| Handoff rate | 100% | 11% |
Observations:
Lesson learned: Week 1 hallucination event — AI said free-shipping threshold was NT$500, actually NT$800. Root cause: L2 retrieved an old FAQ chunk. Fix: elevated “shipping policy” to L1 Wiki with monthly revalidation.
Context: B2B SaaS with 300+ articles across API docs, integration guides, SDK samples. Developer self-serve.
| Metric | Value |
|---|---|
| Monthly queries | 120,000 |
| L1 hit rate | 38% |
| L2 with Rerank | 18% |
| Avg answer length | 340 chars |
| With code block | 61% |
| Follow-up rate | 22% |
Observations:
Context: Mid-sized skincare brand, 14 SKUs needing 2026 Q1 PIF filing.
| Metric | Consultant | PIF AI |
|---|---|---|
| Per-SKU time | 30 workdays | 4 workdays |
| Per-SKU cost | USD 3,500 | USD 600 |
| Regulation update tracking | Monthly manual | Weekly auto |
| Citation traceability | 60–70% | 100% |
| TFDA first-pass approval | 70% | 88% |
| Monthly LLM spend | — | USD 320 |
Observations:
paragraph_hash — TFDA reviewers stop questioning sourcesLesson: ECHA had a major 2026/02 update; old Wiki expired overnight. Added “source-change alert” — tenant Dashboard now shows “7 PIF filings cite expired data, review recommended.”
Context: B2B strategy consultancy; 10 partner bios, 30 research reports, 12 industry analyses. GEO for AI visibility + RAG for internal search. Both share the same brand facts.
| Metric | W0 | W6 |
|---|---|---|
| AI citation rate (ChatGPT) | 18% | 41% |
| AI citation rate (Perplexity) | 22% | 58% |
| Fact accuracy (NLI) | 67% | 94% |
| Hallucination events / week | 12 | 2 |
| Avg repair latency | — | 6.2 days |
| Internal CS hit rate | 72% | 89% |
Most striking: Week 3 system caught Perplexity saying Partner Alice “graduated from Harvard” — actually Stanford. GEO triggered:
This is the concrete value of deep integration.
| Metric | A | B | C | D |
|---|---|---|---|---|
| L1 hit | 52% | 38% | 62% | 41% |
| Cache hit | 31% | 22% | 14% | 26% |
| Monthly cost | $680 | $450 | $320 | $520 |
| Main hallucination | Numbers | Nonexistent endpoint | None (NLI catches) | Person facts |
| Handoff rate | 11% | N/A | 24% | N/A |
Conclusion 1: Structure drives L1 hit. FAQs / regulations → 50%+. Dev docs / free Q&A → 30–40%.
Conclusion 2: NLI pays off for regulated/academic domains. +18% cost, hallucination → 0.
Conclusion 3: GEO + RAG coupling shifts “brand AI health” overall. A single metric misleads.
Conclusion 4: Token cost absolute ≠ cost ratio. E-commerce at $680 is 0.016% of revenue. PIF at $20 per $600 filing is 3.3%. PIF demands aggressive optimization.
Navigation: ← Ch 10 · 📖 Contents · Ch 12 →