Customers first asked, "Can we ask questions against our PDFs?" Then they added, "We also have Notion, our website, an ERP, and monthly Excel exports." This chapter covers all six ingestion paths: pasted text, URLs, file uploads, site scraping, webhook push, and API sync.
```mermaid
stateDiagram-v2
    [*] --> Uploaded
    Uploaded --> Extracting
    Extracting --> Chunking
    Chunking --> Embedding
    Embedding --> Ready
    Ready --> ReEmbedding: update
    ReEmbedding --> Ready
    Extracting --> Failed
    Chunking --> Failed
    Embedding --> Failed
    Failed --> Extracting: retry
    Ready --> Deleted
    Deleted --> [*]
```

Fig 7-1: Document lifecycle
```sql
CREATE TABLE documents (
  id UUID PRIMARY KEY, tenant_id UUID NOT NULL, kb_id UUID NOT NULL,
  title TEXT, doc_type TEXT, -- text|url|file|scraped|auto_push|api
  source_uri TEXT, file_name TEXT, file_size BIGINT,
  char_count INT, chunk_count INT,
  status TEXT DEFAULT 'uploaded', error TEXT,
  ingested_at TIMESTAMPTZ, ready_at TIMESTAMPTZ, deleted_at TIMESTAMPTZ,
  source_hash TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);
```
```http
POST /api/v1/documents/text
{"knowledge_base_id":"uuid","title":"Return Policy 2026","content":"..."}
```
| Type | Extension | Extractor |
|---|---|---|
| PDF | .pdf | pdfjs + OCR fallback |
| Word | .doc/.docx | mammoth |
| PowerPoint | .ppt/.pptx | pptx-parser |
| Excel | .xls/.xlsx | xlsx (per sheet) |
| Text | .txt/.md | direct |
| HTML | .html | cheerio |
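The extractor choice reduces to an extension-to-function map. A minimal sketch (the mammoth, xlsx, and cheerio calls are their real Node APIs; PDF and PowerPoint are stubbed out because those pipelines are covered separately):

```js
import mammoth from 'mammoth';
import * as XLSX from 'xlsx';
import * as cheerio from 'cheerio';

// extension -> plain-text extractor
const extractors = {
  docx: async (buf) => (await mammoth.extractRawText({ buffer: buf })).value,
  xlsx: async (buf) => {
    const wb = XLSX.read(buf, { type: 'buffer' });
    // one text block per sheet, matching the table above
    return wb.SheetNames.map((n) => XLSX.utils.sheet_to_csv(wb.Sheets[n])).join('\n\n');
  },
  html: async (buf) => cheerio.load(buf.toString('utf8'))('body').text(),
  txt:  async (buf) => buf.toString('utf8'),
  md:   async (buf) => buf.toString('utf8'),
  // pdf / pptx route to the dedicated pipelines described in this chapter
};
```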
URL ingestion: the backend fetches the URL and branches on Content-Type: HTML is rendered with Puppeteer, PDFs are routed into the file pipeline, and other types are rejected.
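A sketch of that branch, assuming hypothetical renderWithPuppeteer and ingestAsFile helpers:

```js
const resp = await fetch(url, { redirect: 'follow' });
const type = (resp.headers.get('content-type') ?? '').split(';')[0].trim();

if (type === 'text/html') {
  await renderWithPuppeteer(url);          // JS-heavy pages need a real browser
} else if (type === 'application/pdf') {
  await ingestAsFile(url, Buffer.from(await resp.arrayBuffer())); // reuse the file pipeline
} else {
  throw new Error(`unsupported content-type: ${type}`);           // everything else rejected
}
```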
```mermaid
flowchart LR
    ROOT[root_url] --> Q[BFS queue]
    Q --> FETCH[fetch]
    FETCH --> ROBOTS{robots.txt}
    ROBOTS -->|disallow| SKIP[skip]
    ROBOTS -->|allow| PARSE[parse]
    PARSE --> EXTRACT["@mozilla/readability"]
    EXTRACT --> LINKS[links] --> Q
    EXTRACT --> DOC[document]
```

Fig 7-2: Website scraping flow
Details: the main content is extracted with @mozilla/readability, pages are deduplicated by URL and content hash, and fetches are throttled to 1 request/s per site.
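A condensed version of that loop; isAllowedByRobots, extractWithReadability, and saveDocument stand in for robots-parser, @mozilla/readability + jsdom, and the pipeline entry point:

```js
import crypto from 'crypto';

const queue = [rootUrl];
const seenUrls = new Set();
const seenHashes = new Set();

while (queue.length) {
  const url = queue.shift();                      // BFS: FIFO queue
  if (seenUrls.has(url)) continue;
  seenUrls.add(url);

  if (!(await isAllowedByRobots(url))) continue;  // robots.txt gate

  const html = await (await fetch(url)).text();
  const { text, links } = await extractWithReadability(html, url);

  // dedupe mirrors/aliases by content hash, not just URL
  const hash = crypto.createHash('sha256').update(text).digest('hex');
  if (!seenHashes.has(hash)) {
    seenHashes.add(hash);
    await saveDocument(url, text);                // feeds the ingestion pipeline
  }

  // stay on the same site, then honor the 1 req/s budget
  queue.push(...links.filter((l) => new URL(l).origin === new URL(rootUrl).origin));
  await new Promise((r) => setTimeout(r, 1000));
}
```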
```http
POST /api/v1/documents/push
X-Webhook-Signature: <hmac-sha256>
```
HMAC verification:

```js
const crypto = require('crypto');
const expected = crypto.createHmac('sha256', tenant.webhook_secret).update(rawBody).digest('hex');
const given = req.headers['x-webhook-signature'] ?? '';
// timingSafeEqual throws on length mismatch, so compare lengths first
if (given.length !== expected.length ||
    !crypto.timingSafeEqual(Buffer.from(given), Buffer.from(expected))) {
  return res.status(401).send('invalid signature');
}
```
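One footgun: the HMAC must be computed over the raw request bytes. With express.json() the body is already parsed, so capture the bytes in its verify hook, e.g. express.json({ verify: (req, res, buf) => { req.rawBody = buf } }).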
Periodic sync for Notion / Confluence / Zendesk: for each remote page, compare its external_id and version against the stored copy and upsert only when the version has changed.
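A sketch of one sync pass, assuming a hypothetical connector.fetchPages() and a documents table extended with external_id / external_version columns:

```js
async function syncConnector(connector, kbId) {
  for await (const page of connector.fetchPages()) {   // Notion / Confluence / Zendesk API
    const { rows } = await db.query(
      'SELECT external_version FROM documents WHERE kb_id = $1 AND external_id = $2',
      [kbId, page.id],
    );
    if (rows[0]?.external_version === page.version) continue;  // unchanged: skip
    await upsertDocument({                                      // re-enters the pipeline at Extracting
      kb_id: kbId, external_id: page.id, external_version: page.version,
      title: page.title, content: page.body,
    });
  }
}
```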
Three PDF categories:
| Category | Extraction |
|---|---|
| Text PDF | pdfjs-dist |
| Mixed | pdfjs + per-image OCR |
| Image-only (scan) | Google Vision OCR |
Detection heuristic: if the first 3 pages contain fewer than 300 characters of embedded text in total, treat the PDF as image-only. Google Vision was chosen over local Tesseract for its CJK accuracy (96% vs 82%) and layout preservation; $1.50 per 1,000 pages is acceptable for SaaS margins.
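The heuristic itself is a few lines with pdfjs-dist (the exact import path varies across pdfjs-dist versions):

```js
import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs';

async function isImageOnlyPdf(data) {
  const pdf = await getDocument({ data }).promise;
  let chars = 0;
  for (let i = 1; i <= Math.min(3, pdf.numPages); i++) {
    const content = await (await pdf.getPage(i)).getTextContent();
    chars += content.items.map((it) => it.str ?? '').join('').length;
    if (chars >= 300) return false;  // enough embedded text: text or mixed PDF
  }
  return true;                       // <300 chars in first 3 pages: route to Vision OCR
}
```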
```mermaid
flowchart TB
    API["POST /documents/*"] --> INS["insert status=uploaded"]
    INS --> PUB[Redis Stream]
    PUB --> RESP[return document_id]
    PUB --> W1[Extractor]
    W1 --> W2[Chunker]
    W2 --> W3[Embedder]
    W3 --> READY["status=ready"]
```

Fig 7-3: Ingestion pipeline
Each worker stage scales horizontally and independently. Redis Streams won over RabbitMQ because Redis is already in the stack, consumer groups give per-stage fan-out, and XLEN makes queue depth trivially observable.
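A minimal ioredis consumer for one stage; the stream and group names and extractDocument are illustrative:

```js
import Redis from 'ioredis';

const redis = new Redis();
// one consumer group per worker type, so each stage scales on its own
await redis.xgroup('CREATE', 'ingest', 'extract', '$', 'MKSTREAM').catch(() => {});

for (;;) {
  const batch = await redis.xreadgroup(
    'GROUP', 'extract', `worker-${process.pid}`,
    'COUNT', 10, 'BLOCK', 5000, 'STREAMS', 'ingest', '>',
  );
  for (const [, entries] of batch ?? []) {
    for (const [id, fields] of entries) {
      await extractDocument(fields);              // the stage's actual work
      await redis.xack('ingest', 'extract', id);  // ack only after success
    }
  }
}
```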
Deduplication: sha256 of the raw content is stored as source_hash, and a matching hash skips re-ingestion.

| Error | Cause | Strategy |
|---|---|---|
| OCR fail | File corrupt / Vision rate limit | Retry 3×, exponential backoff |
| Embedding fail | OpenAI rate limit | Worker pause 60s, requeue |
| Parse fail | Unsupported format | Immediate fail, notify user |
| Wiki compile fail | LLM error | Revert to previous Wiki, set lint_status = failed |
Dead Letter Queue: jobs that exhaust their 3 retries land in a DLQ and are summarized in a daily report for human review.
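A sketch of the retry wrapper that feeds it (the stream name and backoff base are assumptions):

```js
async function runWithRetry(job, handler, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await handler(job);
    } catch (err) {
      if (attempt === maxRetries) {  // retries exhausted: hand off to the DLQ
        return redis.xadd('ingest:dlq', '*', 'job', JSON.stringify(job), 'error', String(err));
      }
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1000)); // 1s, 2s, 4s backoff
    }
  }
}
```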