The honest chapter. What we haven’t solved, what we might reverse, what’s coming.
Per-tenant embeddings at ~100k vectors: HNSW P95 < 120 ms. Past ~5M vectors, latency degrades. If any tenant approaches 5M, evaluate a Qdrant/Milvus migration or sharding.
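A minimal sketch of how the 5M-vector ceiling could be operationalized as a per-tenant capacity check. The warning ratio and the function itself are illustrative assumptions, not product code:

```python
# Sketch: flag tenants approaching the ~5M-vector HNSW degradation point.
# HNSW_SOFT_LIMIT comes from the observed ceiling above; WARN_RATIO is an
# assumed planning margin.

HNSW_SOFT_LIMIT = 5_000_000   # latency degrades past this point
WARN_RATIO = 0.8              # start planning migration at 80%

def migration_action(vector_count: int) -> str:
    """Return the recommended action for a tenant's vector count."""
    if vector_count >= HNSW_SOFT_LIMIT:
        return "migrate"          # evaluate Qdrant/Milvus or shard now
    if vector_count >= WARN_RATIO * HNSW_SOFT_LIMIT:
        return "plan-migration"   # schedule evaluation before it degrades
    return "ok"

print(migration_action(120_000))    # typical tenant today
print(migration_action(4_200_000))  # inside the warning band
print(migration_action(6_000_000))  # past the limit
```

The point of the warning band is to trigger the migration evaluation before P95 visibly degrades, not after.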
zh_parser (SCWS-based) misses neologisms, brand names, and product names. We patch with synonym dictionaries, an ongoing maintenance burden. Experimental alternative: tokenization by the LLM at query time; better accuracy, but ~+100 ms latency and higher cost.
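The synonym-dictionary patch can be sketched as a term-expansion pass around the tokenizer. The dictionary entries here are hypothetical stand-ins, not our actual mappings:

```python
# Sketch: synonym-dictionary patching for a tokenizer that misses new words.
# Entries below are illustrative; the real dictionaries are per-tenant and
# maintained by hand (the burden described above).

SYNONYMS = {
    "gpt4o": ["gpt-4o", "gpt 4o"],                 # brand-name variants
    "rag": ["retrieval augmented generation"],     # acronym expansion
}

def expand_terms(tokens: list[str]) -> list[str]:
    """Append dictionary synonyms so recall survives tokenizer misses."""
    out = list(tokens)
    for tok in tokens:
        out.extend(SYNONYMS.get(tok.lower(), []))
    return out

print(expand_terms(["rag", "pricing"]))
# → ['rag', 'pricing', 'retrieval augmented generation']
```

The LLM-time alternative replaces this table lookup with a model call, which is where the ~+100 ms comes from.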
Text only, while real knowledge is mixed-media. CLIP-style multimodal embedding is experimental; targeted for 2026 Q3.
Deployed only in AWS Tokyo. EU compliance requires an EU region, and the Docker Compose architecture doesn't support multi-region; it needs a K8s refactor.
The LLM-authored Wiki has systematic biases: Western-centric examples, inconsistent transliteration, and knowledge stale past the training cutoff. Our mitigations (a strict "chunks-only" instruction, a cross-chunk consistency lint) partially help but don't fix the root cause.
The paper suggests 60 but gives no theoretical justification, and we haven't run enough A/B testing to validate it for Chinese. Worth revisiting.
GPT-4o-mini misclassifies vague openings ("I was wondering…"). Routing a knowledge question to smalltalk means the customer gets a polite non-answer. Fix direction: an expanded training set, a confidence threshold, and dual-path execution on low confidence.
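The dual-path idea can be sketched as confidence-gated routing. `route()` and the 0.75 threshold are assumptions for illustration; the real classifier is the GPT-4o-mini call described above:

```python
# Sketch: confidence-gated intent routing. The threshold is an assumed
# tuning value, and the intent labels mirror the knowledge/smalltalk
# split described above.

CONFIDENCE_THRESHOLD = 0.75

def route(intent: str, confidence: float) -> list[str]:
    """Return the pipeline(s) to run for a classified message."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return [intent]
    # Low confidence: run both paths and pick the better answer,
    # instead of gambling on a polite non-answer.
    return ["knowledge", "smalltalk"]

print(route("smalltalk", 0.92))  # confident: single path
print(route("smalltalk", 0.41))  # vague opening: dual path
```

The trade-off is cost: every low-confidence message pays for two pipelines, which is why the threshold matters.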
English NLI (DeBERTa-v3-NLI) is excellent. Chinese NLI quality varies; we use mDeBERTa-multi plus human audit at ~85% accuracy. Production-grade Chinese NLI remains an open problem.
Current pricing is by message count, but actual cost varies with precision settings: high-precision tenants underpay and low-precision tenants overpay. Planned for 2026 Q3: precision-tier pricing.
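Why per-message pricing misprices tenants can be shown with a toy cost model. All per-call costs below are hypothetical numbers for illustration, not our unit economics:

```python
# Sketch: per-message cost depends on which precision features run.
# Every figure here is a made-up placeholder.

COST = {
    "embed": 0.0001,   # query embedding
    "rerank": 0.0020,  # cross-encoder rerank pass
    "nli": 0.0015,     # entailment verification
    "llm": 0.0040,     # generation call
}

def cost_per_message(rerank: bool, nli: bool) -> float:
    total = COST["embed"] + COST["llm"]
    if rerank:
        total += COST["rerank"]
    if nli:
        total += COST["nli"]
    return round(total, 4)

low = cost_per_message(rerank=False, nli=False)
high = cost_per_message(rerank=True, nli=True)
print(low, high)  # high-precision costs ~1.9x per message here
```

Under flat message pricing, both tenants pay the same despite the ~1.9x cost gap; precision-tier pricing closes it.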
Shared infrastructure is convenient, but token usage from GEO-triggered RAG repairs is hard to attribute. GEO API calls currently count against the RAG tenant quota, which is financially imprecise.
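One fix direction is tagging every call with the subsystem that triggered it. This ledger and its field names are assumptions sketched for illustration:

```python
# Sketch: attribute token usage to the triggering subsystem, so a
# GEO-triggered Wiki repair stops billing against the RAG quota.
# The ledger structure and labels are illustrative assumptions.

from collections import defaultdict

ledger: dict[tuple[str, str], int] = defaultdict(int)  # (tenant, subsystem) -> tokens

def record_usage(tenant: str, subsystem: str, tokens: int) -> None:
    ledger[(tenant, subsystem)] += tokens

record_usage("acme", "rag", 1200)  # normal retrieval answer
record_usage("acme", "geo", 800)   # repair triggered by a GEO diff

print(dict(ledger))
```

With the subsystem label attached at call time, billing can split the quota instead of guessing after the fact.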
Upgrading the embedding model (text-embedding-3-small → -large) requires a full re-embed: USD 2,000+ for a large tenant. We've deferred such upgrades, so this tech debt accumulates.
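The USD 2,000+ figure is back-of-envelope arithmetic of this shape. The corpus size and the per-million-token price below are illustrative placeholders; substitute the provider's current rates:

```python
# Sketch: re-embed cost for a model upgrade. Both inputs are hypothetical
# placeholders, not a quote.

def reembed_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    return total_tokens / 1_000_000 * usd_per_million_tokens

corpus_tokens = 16_000_000_000            # hypothetical large tenant
print(reembed_cost(corpus_tokens, 0.13))  # ≈ USD 2,080 at $0.13 / 1M tokens
```

The cost scales linearly with corpus size, which is why the largest tenants are exactly the ones whose upgrades get deferred.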
How often to compile?
Current practice: fingerprint-change detection, a weekly lint, and a manual trigger. It works, but there is no clean theory behind it.
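The three triggers combine into one decision. A minimal sketch, assuming the fingerprint is a hash over the source chunks (the function names here are illustrative):

```python
# Sketch: the three compile triggers, combined. Fingerprinting source
# chunks is the only signal with crisp semantics; the weekly cadence and
# the manual flag are policy, not theory.

import hashlib

def fingerprint(chunks: list[str]) -> str:
    h = hashlib.sha256()
    for c in chunks:
        h.update(c.encode("utf-8"))
    return h.hexdigest()

def should_compile(chunks: list[str], last_fp: str,
                   days_since_lint: int, manual: bool = False) -> bool:
    return (
        manual
        or fingerprint(chunks) != last_fp   # source changed
        or days_since_lint >= 7             # weekly lint window elapsed
    )

fp = fingerprint(["chunk-a", "chunk-b"])
print(should_compile(["chunk-a", "chunk-b"], fp, days_since_lint=2))  # False
print(should_compile(["chunk-a", "edited"], fp, days_since_lint=2))   # True
```

Note the fingerprint only detects that something changed, not whether the change warrants a recompile; that gap is the missing theory.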
Customer says “your website says CEO is Bob.” RAG Wiki says “Alice.” Who wins?
This is a trust chain problem with no engineering answer yet.
Claude offers 200k tokens of context, Gemini 2M; it's tempting to "stuff everything in the prompt." Our position: RAG doesn't die, it mutates.
The L1 Wiki becomes the tool for aligning LLM attention precisely, rather than a substitute for vector retrieval.
A text Wiki is natural. What is a Wiki for images, video, or audio? We have no unified answer yet.
Tentative roadmap (subject to market feedback):
| Quarter | Item | Priority |
|---|---|---|
| 2026 Q2 | Multimodal embedding (CLIP-style) | High |
| 2026 Q2 | Rerank default-on evaluation | Medium |
| 2026 Q2 | GEO ↔ RAG Wiki patch API launch | High |
| 2026 Q3 | Precision-tier pricing | High |
| 2026 Q3 | EU region deployment (K8s) | Medium |
| 2026 Q3 | Japanese NLI self-training | Medium |
| 2026 Q4 | Long-context + Wiki hybrid strategy | Medium |
| 2026 Q4 | Self-hosted edition | Low |
This book is a living document: minor versions each quarter, a major version annually. Reader feedback goes through GitHub Issues; updates are tracked in CHANGELOG.md.
Navigation: ← Ch 11 · 📖 Contents · Appendix A →