Baiyuan RAG Knowledge Platform Whitepaper

Chapter 2 — Baiyuan RAG System Overview

Map first, details later. This chapter is the skeleton for the next eleven.

2.1 The System in One Sentence

Baiyuan RAG Knowledge Platform is a shared AI knowledge infrastructure built on PostgreSQL + pgvector (storage), Redis (cache), Node.js (API), multi-tenant isolation (security), and L1 Wiki + L2 RAG (retrieval). Three product lines (CS / GEO / PIF) access it through the X-RAG-API-Key and X-Tenant-ID request headers.
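For orientation, a minimal client call looks roughly like this. The host, path, and header names are the ones named above; the request and response body shapes ({ question } in, { answer, sources } out) are assumptions for illustration:

```typescript
// Minimal client sketch (run as an ES module; top-level await).
// Body shapes are illustrative assumptions, not the documented contract.
const res = await fetch("https://rag.baiyuan.io/api/v1/ask", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-RAG-API-Key": process.env.RAG_API_KEY!, // tenant API key
    "X-Tenant-ID": process.env.TENANT_ID!,     // tenant identifier
  },
  body: JSON.stringify({ question: "What is the return policy?" }),
});
const { answer, sources } = await res.json();
console.log(answer, sources);
```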

2.2 Request-to-Response Path

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant GW as Gateway
    participant Auth
    participant Cache as Redis
    participant L1 as L1 Wiki
    participant L2 as L2 pgvector+BM25
    participant LLM
    participant Audit

    Client->>GW: POST /api/v1/ask
    GW->>Auth: verify key + tenant
    GW->>Cache: lookup
    alt Cache hit
        Cache-->>GW: cached answer
        GW-->>Client: return (~0.1s)
    else Cache miss
        GW->>L1: slug query
        alt L1 hit
            L1-->>GW: wiki body
        else L1 miss
            GW->>L2: vector+BM25+RRF
            L2-->>GW: top-K chunks
            GW->>LLM: chunks + question
            LLM-->>GW: answer
        end
        GW->>Cache: store (TTL=600s)
        GW->>Audit: log
        GW-->>Client: answer + sources
    end
```

Fig 2-1: /api/v1/ask sequence

About two-thirds of queries are answered before reaching LLM generation (cache hits plus L1 Wiki hits). This is the core of the platform's token economics.
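The tiering in Fig 2-1 reduces to a short fallback chain. A sketch of that orchestration, with hypothetical function names standing in for the real services (the actual Ask Service is covered in later chapters):

```typescript
interface Answer {
  answer: string;
  sources: string[];
}

// Stubs standing in for the real services (Redis, Postgres, LLM router).
declare function lookupCache(tenant: string, q: string): Promise<Answer | null>;
declare function storeCache(tenant: string, q: string, a: Answer, ttlSec: number): Promise<void>;
declare function queryWiki(tenant: string, q: string): Promise<{ slug: string; body: string } | null>;
declare function hybridSearch(tenant: string, q: string): Promise<string[]>;
declare function generate(q: string, chunks: string[]): Promise<Answer>;
declare function auditLog(tenant: string, q: string, a: Answer): Promise<void>;

async function ask(tenantId: string, question: string): Promise<Answer> {
  // Tier 0: Redis (~0.1s on hit).
  const cached = await lookupCache(tenantId, question);
  if (cached) return cached;

  // Tier 1: L1 Wiki; a hit skips the LLM entirely.
  const page = await queryWiki(tenantId, question);
  let answer: Answer;
  if (page) {
    answer = { answer: page.body, sources: [page.slug] };
  } else {
    // Tier 2: hybrid retrieval (vector + BM25 + RRF), then generation.
    const chunks = await hybridSearch(tenantId, question);
    answer = await generate(question, chunks);
  }

  await storeCache(tenantId, question, answer, 600); // TTL=600s, as in Fig 2-1
  await auditLog(tenantId, question, answer);
  return answer;
}
```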

2.3 Core Database Schema

| Table | Purpose | Key Fields |
| --- | --- | --- |
| tenants | Tenant master | id, api_key, plan, quota |
| knowledge_bases | KB per tenant | id, tenant_id, is_default |
| documents | Source docs | id, kb_id, doc_type, status |
| chunks | Splits | id, document_id, content, fts (generated tsvector) |
| embeddings | Vectors | chunk_id, embedding vector(1536) |
| wiki_pages | L1 pages | id, kb_id, slug, body |
| queries | Audit log | id, tenant_id, question, from_wiki, latency_ms |

All tenant-scoped tables enable PostgreSQL Row-Level Security (Ch 6).
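A common way to wire RLS into the Node.js layer is a per-transaction tenant setting that each policy compares against the row's tenant_id. The setting name (app.tenant_id) and helper below are illustrative assumptions; Ch 6 documents the actual policies:

```typescript
import { Pool, PoolClient } from "pg";

const pool = new Pool();

// Runs fn inside a transaction whose RLS context is pinned to one tenant.
// set_config(..., true) scopes the value to the current transaction only.
async function withTenant<T>(
  tenantId: string,
  fn: (client: PoolClient) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

// Every query inside the callback then sees only that tenant's rows, e.g.:
// await withTenant(tenantId, (c) => c.query("SELECT * FROM chunks LIMIT 5"));
```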

2.4 Component Roles

```mermaid
flowchart TB
    GW[Gateway Node.js] --> MW[Middleware]
    MW --> ASK[Ask Service<br/>L1→L2 orchestrator]
    GW --> INGEST[Ingestion Worker]
    ASK --> PG[(PostgreSQL + pgvector)]
    ASK --> RD[(Redis)]
    ASK --> LLM[OpenAI/Claude/Gemini]
    INGEST --> PG
    WIKIC[Wiki Compiler<br/>nightly] --> PG
    WIKIC --> LLM
    WIKIL[Wiki Linter<br/>daily] --> PG
```

Fig 2-2: Component layout

2.5 Three Product Lines Sharing the Platform

| Product | Uses RAG For | Feeds RAG With | Special Need |
| --- | --- | --- | --- |
| AI CS | Q&A, handoff summary | FAQ, product manual | SSE, <3s latency |
| GEO | Hallucination repair | GT brand bio, team, services | NLI, strict citation |
| PIF AI | Ingredient/toxicology lookup | PubChem/ECHA/TFDA | Traceable citation, version lock |

Shared points:

  1. Same tenant_id maps to one brand across three products
  2. Schema.org @id cross-reference (Ch 9)
  3. Shared Wiki compiler with product-tuned prompts
  4. Single API endpoint: https://rag.baiyuan.io

2.6 Technology Decisions

| Decision | Choice | Alternatives | Why |
| --- | --- | --- | --- |
| Vector store | pgvector | Pinecone, Qdrant, Milvus | Same Postgres: transactions, ops simplicity |
| Main DB | PostgreSQL 16 | MySQL, CockroachDB | Mature pgvector, RLS, JSONB |
| FTS | PG tsvector | Elasticsearch | One fewer service |
| Fusion | RRF (k=60) | Weighted avg, ColBERT | Robust, no tuning |
| Cache | Redis 7 | Memcached | Shared, precise TTL |
| Language | Node.js (TS) | Python, Go | Same stack as chat-gateway |
| Wiki LLM | Claude Sonnet 4.6 | Smaller model | Offline, quality matters |
| Answer LLM | Router (multi) | Single vendor | Cost/availability spread |
| Deploy | Docker Compose / Lightsail | Kubernetes | Tenant scale, lower overhead |
| Auth | Header-based API key | OAuth | Product-to-product calls |
Every choice is a trade-off. Ch 12 revisits which of these decisions may need revision.
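The fusion row deserves one concrete snippet, since "robust, no tuning" is the whole argument for RRF: each ranked list contributes 1/(k + rank) per document, and the summed scores decide the fused order. This is the standard formula; the platform's production code may differ in detail:

```typescript
// Reciprocal Rank Fusion with k = 60, as chosen in the table above.
function rrfFuse(rankings: string[][], k = 60, topK = 10): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      // i is 0-based, so rank = i + 1.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id]) => id);
}

// Usage: rrfFuse([vectorTopIds, bm25TopIds]) returns the fused top-10 chunk ids.
```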


2.7 Widget Edge Distribution and Cache Strategy

The Chat Widget for the AI Customer Service line is a ~35KB JavaScript bundle embedded on every page of customer websites. At scale (100 tenants × 10K daily page views each), that is ~1M widget loads per day; serving every hit from the Lightsail origin nginx would make widget delivery the platform's first bottleneck.

The platform uses a two-tier cache architecture — origin + CDN edge — so that end-user requests almost never reach origin.

2.7.1 Delivery Path

```mermaid
flowchart LR
    Browser[Customer Browser<br/>1-day cache]
    Edge[Cloudflare Edge<br/>300+ PoP / 1-year cache]
    Nginx[Origin nginx<br/>Lightsail]
    FS[/usr/share/nginx/<br/>html/widget/]
    Browser -- MISS --> Edge
    Edge -- MISS --> Nginx
    Nginx -- alias --> FS
```

Fig 2-3: Widget delivery path

Best case: the browser cache serves instantly (< 10ms). Typical case: a Cloudflare edge HIT returns in < 60ms TTFB from the Taipei PoP. Worst case: the first request in a region pays one origin round-trip, after which that regional PoP serves subsequent hits.

2.7.2 Cache Headers

Origin nginx returns for /widget/*:

```
Cache-Control: public, max-age=86400, s-maxage=31536000, immutable
Access-Control-Allow-Origin: *
```

| Directive | Audience | Meaning | Rationale |
| --- | --- | --- | --- |
| max-age=86400 | Browser | Revalidate after 1 day | Supports rapid bug-fix rollout |
| s-maxage=31536000 | Shared (CDN) | Cache for 1 year | Edge HIT rate → 100%; origin rarely fetched |
| immutable | Browser | No revalidation during TTL | Skips conditional GETs, cuts an RTT |

A Cloudflare Cache Rule overrides edge TTL to 1 year (Override origin → 1 year), guaranteeing long edge retention even on the Free plan.
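Put together, the origin side of this section amounts to a few lines of nginx. A sketch using the paths shown in Fig 2-3 (an assumption-level illustration, not the production file):

```nginx
# Sketch: serve the widget from disk with the headers described above.
location /widget/ {
    alias /usr/share/nginx/html/widget/;
    add_header Cache-Control "public, max-age=86400, s-maxage=31536000, immutable";
    add_header Access-Control-Allow-Origin "*";
}
```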

2.7.3 CORS and Versioning

The widget loads cross-origin, so Access-Control-Allow-Origin: * is required. This is a public resource — no secrets — and tenant identity is passed at runtime via window.BAIYUAN_WIDGET.tenantKey.

Current strategy: versionless URL + short browser TTL.
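A sketch of the boot path this implies, inside the widget bundle itself. Only window.BAIYUAN_WIDGET.tenantKey is named in this section; the embed snippet in the comment, the script URL, and the mountChat entry point are illustrative:

```typescript
// Host pages embed the widget roughly like this (hypothetical snippet):
//   <script>window.BAIYUAN_WIDGET = { tenantKey: "YOUR_TENANT_KEY" };</script>
//   <script src="https://rag.baiyuan.io/widget/widget.js" async></script>

declare global {
  interface Window {
    BAIYUAN_WIDGET?: { tenantKey: string };
  }
}

export function bootWidget(): void {
  const tenantKey = window.BAIYUAN_WIDGET?.tenantKey;
  if (!tenantKey) {
    console.warn("BAIYUAN_WIDGET.tenantKey missing; widget not mounted");
    return;
  }
  // The bundle is a public, secret-free asset; only this runtime value is
  // tenant-specific, which is why ACAO: * is safe for /widget/*.
  mountChat(tenantKey);
}

declare function mountChat(tenantKey: string): void;
```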

2.7.4 Invalidation and Purge
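With a versionless URL and a 1-year edge TTL, a deploy must explicitly purge the widget from Cloudflare; browsers then converge within the 1-day max-age window. Cloudflare's purge-by-URL API (available on the Free plan) makes this a one-call post-deploy step. A sketch, with the zone ID, API token, and widget URL as placeholders:

```typescript
// Cloudflare cache purge by URL; run as a post-deploy step.
// The endpoint is Cloudflare's real purge_cache API; all values are placeholders.
async function purgeWidget(): Promise<void> {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${process.env.CF_ZONE_ID}/purge_cache`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.CF_API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        files: ["https://rag.baiyuan.io/widget/widget.js"], // placeholder URL
      }),
    },
  );
  if (!res.ok) throw new Error(`purge failed: ${res.status}`);
}
```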

2.7.5 Measured Performance

| Metric | Value (Taipei PoP, CF HIT) |
| --- | --- |
| TTFB | < 60ms |
| Total | < 70ms |
| Origin fetch rate | < 0.1% |
| Edge HIT rate | > 99.9% |
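These numbers are easy to re-check. A quick probe (the widget URL is a placeholder; fetch resolves when headers arrive, so the elapsed time approximates TTFB):

```typescript
// Reads Cloudflare's cf-cache-status response header (HIT / MISS / etc.).
// cache: "no-store" bypasses the local browser cache but still hits the edge.
const url = "https://rag.baiyuan.io/widget/widget.js"; // placeholder
const t0 = performance.now();
const res = await fetch(url, { cache: "no-store" });
const ttfb = performance.now() - t0;
console.log(res.headers.get("cf-cache-status"), `${ttfb.toFixed(0)}ms`);
```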

Key Takeaways

  1. About two-thirds of /api/v1/ask traffic is answered by the Redis cache or the L1 Wiki before any LLM call; that ratio drives the token economics.
  2. One PostgreSQL instance carries vectors (pgvector), full-text search (tsvector), and tenant isolation (RLS), trading best-of-breed components for transactional and operational simplicity.
  3. The widget rides a two-tier cache (1-day browser TTL, 1-year edge TTL plus purge-on-deploy), keeping the origin fetch rate below 0.1%.



Navigation: ← Ch 1 · 📖 Contents · Ch 3 →