An enterprise drops its PDFs into ChatGPT. The next day customers discover the AI quoted wrong prices, got the return policy backwards, and leaked Company A’s confidential data to Company B. Welcome to the dark forest.
Since mid-2024, nearly every enterprise CTO has heard the same request: “Turn our product manual into an AI that can answer questions.” The first implementation usually looks like this:
```python
# Naive pipeline (pseudocode: load_pdfs, split_into_chunks, and the
# openai/qdrant objects are stand-ins for the real SDK calls):
documents = load_pdfs("docs/")
chunks = split_into_chunks(documents, size=500)   # fixed-size chunks
vectors = openai.embed(chunks)                    # one embedding per chunk
qdrant.upsert(vectors)                            # straight into the vector DB

def ask(question):
    q_vec = openai.embed(question)
    top_k = qdrant.search(q_vec, k=5)             # top-5 nearest chunks
    context = "\n".join(top_k)
    return openai.chat([
        {"role": "system", "content": "Answer based on the following"},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ])
```
It seems magical in week one. By month one, five blind spots emerge:
Each problem has a solution. Solving all five in a multi-tenant SaaS is not a prompt-tweaking exercise — it’s an infrastructure problem.
Engineers blame “GPT-4o still makes stuff up,” but hallucinations have four sources:
| Source | Responsibility | Fix Layer |
|---|---|---|
| Model limit | Model | Switch model (Claude/Gemini/…) |
| Wrong chunks retrieved | Infrastructure | Higher recall, hybrid retrieval, reranking |
| Outdated chunks | Infrastructure | Version tagging, freshness signals |
| LLM “completes” beyond the chunk | Prompt engineering | Strict citation, NLI verification |
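The "strict citation" fix in the last row can be enforced mechanically: if the system prompt requires every factual sentence to end with a marker like `[chunk-3]`, a post-check can flag any sentence that cites nothing. A minimal sketch (the marker format and helper name are illustrative assumptions, not a fixed convention from this book):

```python
import re

def uncited_sentences(answer: str) -> list[str]:
    """Return answer sentences carrying no [chunk-N] citation marker.

    Assumes the system prompt instructed the model to end every factual
    sentence with a marker like [chunk-3]; uncited sentences get flagged
    for regeneration or removal instead of reaching the user.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not re.search(r"\[chunk-\d+\]", s)]

answer = (
    "Returns are accepted within 30 days [chunk-2]. "
    "Shipping is always free worldwide."
)
print(uncited_sentences(answer))  # only the free-shipping claim cites nothing
```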
Blaming the model is lazy. Over 60% of hallucinations are fixable at the infrastructure layer (Baiyuan internal data, Q1 2026):
```mermaid
pie title Hallucination root-cause distribution (Baiyuan Q1 2026, n=1200)
    "Wrong chunks (fixable)" : 42
    "Outdated chunks (fixable)" : 18
    "Prompt didn't require citation (fixable)" : 12
    "Model overreach (hard)" : 28
```
Fig 1-1: Hallucination root-cause breakdown
The book’s core thesis: treat the fixable 72% (wrong, outdated, or uncited chunks) as an infrastructure problem, then let NLI + ChainPoll (Ch 12) handle the remaining 28% of model overreach.
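Ch 12's gate uses a real NLI model (entail / contradict / neutral). Purely to show where the check sits in the pipeline, here is a stand-in using lexical overlap; the overlap heuristic is my illustration, not the book's method:

```python
def supported(claim: str, chunks: list[str], threshold: float = 0.5) -> bool:
    """Crude entailment proxy: fraction of claim words found in one chunk.

    A production gate would ask an NLI model whether some retrieved chunk
    entails the claim; this lexical overlap is only a placeholder so the
    control flow (drop unsupported claims) is concrete.
    """
    words = {w.lower().strip(".,") for w in claim.split()}
    if not words:
        return False
    best = max(
        len(words & {w.lower().strip(".,") for w in c.split()}) / len(words)
        for c in chunks
    )
    return best >= threshold

chunks = ["Refunds are issued within 14 business days of receiving the item."]
print(supported("Refunds are issued within 14 business days", chunks))  # True
print(supported("Refunds are instant and automatic", chunks))           # False
```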
Many RAG demos quote “3,000 tokens per query.” The real cost curve at enterprise scale:
| Scale | Queries/day | Monthly tokens | GPT-4o API cost |
|---|---|---|---|
| Pilot | 500 | 5M | ~USD 150 |
| SMB | 5,000 | 50M | ~USD 1,500 |
| Mid-market SaaS | 50,000 | 500M | ~USD 15,000 |
| Large call center | 500,000 | 5B | ~USD 150,000 |
But most of this is avoidable:
Baiyuan’s measured result: with an L1 cache hit rate of 35–60%, monthly token spend drops to 20–40% of the naive baseline.
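The arithmetic behind that claim, as a sketch. Cache hits are assumed to cost roughly zero tokens; the optional routing of misses to a cheaper model is my assumption for how spend gets below what the cache alone explains:

```python
def spend_ratio(hit_rate: float, cheap_share: float = 0.0,
                cheap_price_ratio: float = 1.0) -> float:
    """Monthly token spend relative to the naive baseline.

    hit_rate: fraction of queries served from the L1 cache (near-zero cost).
    cheap_share / cheap_price_ratio: fraction of cache misses routed to a
    cheaper model, and that model's price relative to the default model.
    """
    miss = 1.0 - hit_rate
    routed = miss * cheap_share * cheap_price_ratio   # cheap-model spend
    full = miss * (1.0 - cheap_share)                 # full-price spend
    return routed + full

# Cache alone: a 60% hit rate leaves 40% of baseline spend.
print(round(spend_ratio(0.60), 2))  # 0.4
# 35% hit rate plus half of misses routed to a 10x-cheaper model:
print(round(spend_ratio(0.35, cheap_share=0.5, cheap_price_ratio=0.1), 2))
```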
SaaS RAG has a fundamental difference from in-house RAG: isolation is not optional. Four real industry incidents (anonymized, 2024–2025):
In each incident the missing piece was, in effect, a `WHERE tenant_id = ?`. These map to the three-layer tenant isolation in Ch 6:
```mermaid
flowchart LR
    R[Request] --> A[Layer 1<br/>App<br/>X-Tenant-ID]
    A --> D[Layer 2<br/>DB<br/>PostgreSQL RLS]
    D --> Q[Layer 3<br/>Query<br/>WHERE tenant_id=?]
    Q --> OK[Safe]
```
Fig 1-2: Three-layer defense-in-depth
Each layer you skip is one more hole in the defense.
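All three layers can be compressed into one request path. A framework-free sketch: the `X-Tenant-ID` header, PostgreSQL RLS, and the `tenant_id` filter come from Fig 1-2; the table name, session variable, and `retrieve` helper are illustrative assumptions:

```python
RLS_POLICY = """
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
"""  # Layer 2 setup: standard PostgreSQL RLS, applied once at migration time.

def retrieve(headers: dict, db) -> list:
    """Fetch chunks for one request; db is any DB-API-style cursor."""
    # Layer 1 (app): refuse any request without a tenant identity.
    tenant = headers.get("X-Tenant-ID")
    if not tenant:
        raise PermissionError("missing X-Tenant-ID")
    # Layer 2 (DB session): scope RLS so it filters every statement.
    db.execute("SET app.tenant_id = %(t)s", {"t": tenant})
    # Layer 3 (query): filter explicitly anyway; defense in depth.
    return db.execute(
        "SELECT text FROM chunks WHERE tenant_id = %(t)s", {"t": tenant})
```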
“Knowledge base” sounds simple to business stakeholders. To engineers it’s a haunted house. The source types we support (Ch 7):
| Source | Example | Pain |
|---|---|---|
| Paste text | FAQs typed by employees | Format noise |
| Upload file | PDF, Word, PPT, TXT | OCR, tables, linebreaks |
| URL import | Marketing pages, Notion | JS rendering, login walls |
| Site scrape | Periodic full-site crawl | robots.txt, rate limits, dedupe |
| Webhook push | ERP/CRM events | Increments, dedup, versioning |
| API pull | Internal services | Auth, schema drift |
This isn’t a RAG product — it’s a knowledge ETL platform. Ch 7 details each pipeline.
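One way to keep six intake paths manageable is a single registry mapping source type to its pipeline, so every path ends in the same normalized-text shape. A hypothetical sketch; none of these function or key names come from Ch 7:

```python
from typing import Callable

PIPELINES: dict[str, Callable[[dict], str]] = {}

def pipeline(source_type: str):
    """Register a normalizer for one intake path."""
    def wrap(fn):
        PIPELINES[source_type] = fn
        return fn
    return wrap

@pipeline("paste")
def clean_paste(payload: dict) -> str:
    # Strip the format noise employees paste in along with the FAQ text.
    return " ".join(payload["text"].split())

@pipeline("webhook")
def apply_event(payload: dict) -> str:
    # ERP/CRM increments: versioning/dedup assumed handled upstream.
    return payload["new_text"]

def ingest(source_type: str, payload: dict) -> str:
    if source_type not in PIPELINES:
        raise ValueError(f"no pipeline for {source_type!r}")
    return PIPELINES[source_type](payload)

print(ingest("paste", {"text": "  Refund  policy:\n 30 days  "}))
# Refund policy: 30 days
```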
Counterintuitive engineering decision. Baiyuan has three product lines:
The “obvious” approach is one RAG per product. We chose one shared RAG infrastructure. Reasons:
JSON-LD `@id` interlinking shares three layers (Organization → Service → Person) across products. The cost: multi-tenant plus multi-product complexity. Ch 9 and Ch 10 break down the integration patterns.
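What `@id` interlinking means in practice, as a minimal JSON-LD sketch: each entity gets a stable `@id`, and other entities reference that id instead of duplicating the data, so all product lines resolve to one graph. The URLs and field values here are illustrative; only the Organization → Service → Person pattern comes from the text:

```python
import json

org = {"@context": "https://schema.org", "@type": "Organization",
       "@id": "https://example.com/#org", "name": "Example Corp"}
service = {"@context": "https://schema.org", "@type": "Service",
           "@id": "https://example.com/#svc",
           "provider": {"@id": "https://example.com/#org"}}  # link, not copy
person = {"@context": "https://schema.org", "@type": "Person",
          "@id": "https://example.com/#cto",
          "worksFor": {"@id": "https://example.com/#org"}}   # same node

graph = {"@graph": [org, service, person]}
print(json.dumps(graph)[:60], "...")
```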
How do we build a single multi-tenant RAG infrastructure that simultaneously supports customer Q&A, GEO hallucination repair, and PIF regulatory filing at production grade on cost, hallucination, and isolation?
This is the thread running through all 12 chapters.
Navigation: 📖 Contents · Ch 2 →