The website for humans and the content for AI should not be the same HTML. Forcing one document to serve both leaves both parties worse off.
A modern website is designed for humans:

- React / Vue / Next.js front ends with full UI, animation, and ads
- <div class="col-md-6"> layouts that lack semantic meaning

For humans these are UX. For AI crawlers they are noise. When an AI bot (GPTBot, ClaudeBot, PerplexityBot, Googlebot, and 20+ others) fetches a modern brand page, several failure modes are common; the most basic is that a client-rendered app returns little more than <div id="app"></div>. The result: the AI’s perception of the brand is either wrong or thin. The remedy is not to rewrite every customer website for AI’s benefit. It is to prepare a dedicated, clean shadow document for the AI.
flowchart LR
Brand[Brand data] --> H[Human version<br/>React / Vue / Next.js<br/>full UI / animation / ads]
Brand --> A[AI version shadow doc<br/>Pure HTML + JSON-LD<br/>pure semantics, no noise]
H -->|Browser| User[User experience]
A -->|AI Bot| LLM[Model training and retrieval]
Fig 6-1: The same brand data produces two renderings — human version optimized for experience, AI version optimized for semantics.
AXP (AI-ready eXchange Page) is Baiyuan’s name for such shadow documents. Each AXP page has three layers.
flowchart TB
subgraph AXP["AXP shadow document"]
HTML["(1) Pure HTML semantic skeleton<br/>h1 / h2 / p / ul / table"]
JSONLD["(2) Schema.org JSON-LD<br/>Organization / Service / Person<br/>three-layer @id interlinking"]
MD["(3) Markdown raw text block<br/>for RAG chunking and vectorization"]
end
HTML --> AI[AI bot consumption]
JSONLD --> AI
MD --> AI
Fig 6-2: Three complementary layers — pure HTML for coarse-grain crawling, JSON-LD for knowledge graphs, Markdown for RAG.
(1) Pure HTML skeleton
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Acme Aesthetics — downtown Chicago laser clinic</title>
<meta name="description" content="Acme Aesthetics offers laser...">
<link rel="canonical" href="https://acme-aesthetics.example/">
</head>
<body>
<main>
<h1>Acme Aesthetics</h1>
<section>
<h2>About</h2>
<p>Founded in 2018, Acme focuses on...</p>
</section>
<section>
<h2>Services</h2>
<ul>
<li>Laser hair removal</li>
<li>Botulinum treatments</li>
</ul>
</section>
</main>
</body>
</html>
(2) Schema.org JSON-LD — see Ch 7 for the full design.
(3) Markdown raw-text block — for RAG chunking:
# Acme Aesthetics
## About
Founded in 2018 in downtown Chicago, specializing in...
## Services
- Laser hair removal
- Botulinum treatments
All three layers coexist in the same URL response: HTML as the body, JSON-LD in a <script type="application/ld+json"> block, Markdown in a <script type="text/markdown" id="axp-markdown"> block.
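As a sketch, an assembled AXP response might look like the abridged document below; the id on the markdown block follows the convention described above, and the JSON-LD body is elided (see Ch 7).

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Acme Aesthetics — downtown Chicago laser clinic</title>
<script type="application/ld+json">
{ "@context": "https://schema.org", "@graph": [] }
</script>
<script type="text/markdown" id="axp-markdown">
# Acme Aesthetics
## About
Founded in 2018 in downtown Chicago, specializing in...
</script>
</head>
<body>
<main>
<h1>Acme Aesthetics</h1>
<!-- layer (1): the pure HTML skeleton shown earlier -->
</main>
</body>
</html>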
AXP is delivered by edge injection: the CDN layer intercepts requests and chooses the response body based on User-Agent. Our platform uses Cloudflare Workers.
flowchart TD
Req[HTTP request] --> UA{UA matches<br/>AI bot?}
UA -->|yes| Cache{Worker cache<br/>hit?}
UA -->|no| Pass[pass through<br/>to customer origin]
Cache -->|hit| Serve[return cached shadow doc]
Cache -->|miss| Fetch[fetch shadow doc from<br/>geo.baiyuan.io API]
Fetch --> Store[write to Worker cache<br/>TTL 15 min]
Store --> Serve
Pass --> Origin[customer origin server]
Fig 6-3: Worker tries cache first, only fetches from origin API on miss. Human requests pass through with no added latency.
export default {
async fetch(request, env) {
const ua = request.headers.get('user-agent') || '';
const url = new URL(request.url);
// 1. Not an AI bot: pass through to customer origin
if (!isAIBot(ua)) {
return fetch(request);
}
// 2. AI bot: try cache
const cacheKey = `axp:${url.hostname}:${url.pathname}`;
const cached = await env.KV.get(cacheKey);
if (cached) {
return new Response(cached, {
headers: { 'content-type': 'text/html; charset=utf-8' },
});
}
// 3. Cache miss: fetch shadow doc from our API
const axpUrl = `https://api.geo.baiyuan.io/axp?host=${url.hostname}&path=${url.pathname}`;
const axpRes = await fetch(axpUrl);
if (!axpRes.ok) {
return fetch(request); // shadow fetch failed — fall back to origin
}
const body = await axpRes.text();
await env.KV.put(cacheKey, body, { expirationTtl: 900 });
return new Response(body, {
headers: { 'content-type': 'text/html; charset=utf-8' },
});
},
};
Key design points:

- Human traffic is passed straight through to the customer origin, so non-bot requests incur no added latency.
- Shadow documents are cached in Workers KV with a 15-minute TTL; the shadow-document API is called only on cache misses.
- If the shadow-document fetch fails, the Worker falls back to the customer origin rather than returning an error.
The platform currently recognizes 25 AI bot UAs, grouped by function:
flowchart LR
subgraph LLM["LLM crawlers (10)"]
GPTBot
ClaudeBot
GoogleExtended[Google-Extended]
CCBot[CCBot / Common Crawl]
MetaBot[FacebookBot / Meta-ExternalAgent]
ByteBot[Bytespider]
AnthropicBot[anthropic-ai]
AppleBot[Applebot-Extended]
end
subgraph Search["Search-type crawlers (6)"]
Perplexity[PerplexityBot]
ChatGPTUser[ChatGPT-User]
PerplexityUser[Perplexity-User]
YouBot[YouBot]
PhindBot
end
subgraph Preview["Preview / link generators (5)"]
LinkedInBot
FacebookExt[facebookexternalhit]
TwitterBot[Twitterbot]
DiscordBot[Discordbot]
end
subgraph RAG["RAG / enterprise crawlers (4)"]
CohereBot[cohere-ai]
DiffBot[Diffbot]
OmigoBot[Omigo]
end
Fig 6-4: 25 AI bots grouped by role. The platform enables AXP injection for all four groups by default; admins can disable groups per customer if needed.
Implementation uses a combined regex rather than nested if/else, for maintainability:
const AI_BOT_REGEX = new RegExp(
[
'GPTBot', 'ChatGPT-User', 'OAI-SearchBot',
'ClaudeBot', 'anthropic-ai', 'Claude-Web',
'Google-Extended', 'GoogleOther',
'PerplexityBot', 'Perplexity-User',
'CCBot', 'Bytespider', 'FacebookBot',
'Meta-ExternalAgent', 'Applebot-Extended',
'cohere-ai', 'Diffbot', 'YouBot', 'PhindBot',
'LinkedInBot', 'facebookexternalhit', 'Twitterbot',
'Discordbot', 'Omigo', 'DuckAssistBot',
].join('|'),
'i'
);
function isAIBot(ua) {
return AI_BOT_REGEX.test(ua);
}
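For example (UA strings abridged, shown only to illustrate the match):

isAIBot('Mozilla/5.0 AppleWebKit/537.36; compatible; GPTBot/1.1; +https://openai.com/gptbot'); // true (matched case-insensitively)
isAIBot('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36'); // false, regular browser passes through to origin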
The UA list is reviewed quarterly; new crawlers (e.g., OAI-SearchBot first observed in July 2025) must be added promptly.
A practical edge case: when the SaaS platform itself is also a user of the SaaS (dogfooding), the same domain needs to serve both “platform users” (logged-in product) and “brand website visitors” (anonymous content).
Baiyuan’s own geo.baiyuan.io is exactly this case:
| Path | Human user | AI bot |
|---|---|---|
| / | Marketing home (public) | AXP brand page for “Baiyuan” |
| /dashboard | Authenticated dashboard (private) | 403 — should not be AXP-ified |
| /features, /pricing | Product pages (public) | AXP service pages |
| /login, /signup | Auth pages (public but no brand content) | Do not inject; pass through |
flowchart TD
R[Request] --> B{AI bot?}
B -->|no| P1[pass through to origin]
B -->|yes| P{Path category}
P -->|marketing / brand page| A[inject AXP]
P -->|authenticated feature| X1[return 403 Forbidden]
P -->|public page without brand content| X2[pass through to origin]
Fig 6-5: The path-category table is maintained per-brand in admin settings. Paths not in the table default to pass-through — conservative by design.
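A minimal sketch of how the Worker could apply such a table; the data structure and values below are illustrative, not the real per-brand configuration:

// Illustrative path rules for geo.baiyuan.io; the real table lives in admin settings.
const AXP_PATHS = new Set(['/', '/features', '/pricing']);      // marketing / brand pages
const FORBIDDEN_PREFIXES = ['/dashboard'];                       // authenticated features
const PASSTHROUGH_PREFIXES = ['/login', '/signup'];              // public, no brand content

function classifyPath(pathname) {
  if (FORBIDDEN_PREFIXES.some((p) => pathname.startsWith(p))) return 'forbid'; // 403
  if (PASSTHROUGH_PREFIXES.some((p) => pathname.startsWith(p))) return 'pass';
  if (AXP_PATHS.has(pathname)) return 'axp';
  return 'pass'; // not in the table: conservative pass-through
}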
AI bot crawl efficiency depends on sitemap.xml. Under AXP, the sitemap must be dynamically generated in alignment with AXP paths, otherwise we end up in the confusing state of “sitemap lists URL X, but the Worker does not inject AXP on path X.”
The platform generates a sitemap per customer domain following these rules:
- URL entries come from the brand_locations, brand_services, and brand_employees tables
- <lastmod> is taken from the corresponding entity’s updated_at
- <priority> is set by path type: home 1.0, service pages 0.8, employee pages 0.6
- robots.txt actively declares Sitemap: https://<domain>/sitemap.xml

The sitemap is also served by the Cloudflare Worker; human visitors that hit /sitemap.xml see it too (SEO convention — not hidden).
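A sketch of the generation step under these rules. The buildSitemap helper and the row shape (path, type, updated_at) are assumptions; lastmod and priority follow the rules above.

// Sketch: build sitemap.xml from entity rows (path/type/updated_at are assumed columns).
function buildSitemap(domain, rows) {
  const priorityByType = { home: '1.0', service: '0.8', employee: '0.6' };
  const urls = rows.map((r) => [
    '  <url>',
    `    <loc>https://${domain}${r.path}</loc>`,
    `    <lastmod>${new Date(r.updated_at).toISOString().slice(0, 10)}</lastmod>`,
    `    <priority>${priorityByType[r.type] ?? '0.5'}</priority>`,
    '  </url>',
  ].join('\n')).join('\n');
  return `<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls}\n</urlset>`;
}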
The Schema.org spec allows nested arrays, but in practice the following pattern triggers issues:
// WRONG: nested array — some AI parsers reject the whole block
{
"@context": "https://schema.org",
"@graph": [
[
{ "@type": "Organization", "name": "Acme Aesthetics" }
],
[
{ "@type": "Service", "name": "Laser treatment" }
]
]
}
// RIGHT: flat array with @id references
{
"@context": "https://schema.org",
"@graph": [
{ "@type": "Organization", "@id": "#org", "name": "Acme Aesthetics" },
{ "@type": "Service", "@id": "#svc-laser", "name": "Laser treatment",
"provider": { "@id": "#org" } }
]
}
Fig 6-6: Relationships between entities must be expressed via @id references, not via array nesting. This is a hard requirement of the Schema.org validator tooling.
Benefits of flattening + @id linking:

- Parsers that reject nested arrays (the WRONG example above) can still consume the whole @graph.
- Relationships (e.g., a Service pointing to its provider Organization) stay explicit and machine-resolvable instead of being implied by array position.
- The same @id scheme supports the three-layer interlinking (Organization / Service / Person) used across AXP pages.
A record of issues encountered with Google Search Console (GSC) during 2024–2025 operation:
| Pitfall | Symptom | Root cause | Remedy |
|---|---|---|---|
| noindex meta override | GSC reports “excluded by noindex tag” | UAT .env accidentally deployed to PROD | Strict env-var validation at app startup refusing unreasonable combinations |
| Cross-domain canonical | PROD page’s canonical points to UAT domain | Same codebase uses shared canonical logic | Make canonical dynamic, derived from request.hostname |
| Missing bot UA | GSC index shifts but AI citations drop | New bot UA not yet in regex | Quarterly review of CF Worker logs for unmatched UAs |
| Sitemap inconsistency | GSC reports “Discovered — currently not indexed” | AXP page exists but sitemap missed it | Generate sitemap from the same source (AXP index table) |
| HTTPS/HTTP mixing | robots.txt returns 200 on HTTP, 404 on HTTPS | Worker did not handle http:// traffic | Force 301 to HTTPS and inject robots.txt on both |
These issues are not AXP-specific, but AXP amplifies their severity: AI bots recrawl far less frequently than Googlebot, so a single mistake can go unnoticed for weeks before it is corrected. It is better to run a pre-flight check-prod-seo.sh in CI than to discover these problems in production.
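The real check-prod-seo.sh is not reproduced here; the sketch below shows the kinds of checks it could run, expressed as a small Node script (domain and regexes are illustrative).

// Sketch: pre-deploy SEO checks mirroring the pitfall table above (Node 18+).
const PROD = 'https://acme-aesthetics.example'; // illustrative domain

async function preflight() {
  const res = await fetch(PROD + '/');
  const html = await res.text();

  // 1. No noindex leaking from a UAT .env
  if (/<meta[^>]+name=["']robots["'][^>]+noindex/i.test(html)) {
    throw new Error('PROD homepage carries a noindex meta tag');
  }

  // 2. Canonical must point at the PROD domain, not UAT
  const canonical = html.match(/<link rel="canonical" href="([^"]+)"/)?.[1] ?? '';
  if (!canonical.startsWith(PROD)) {
    throw new Error(`canonical points off-domain: ${canonical}`);
  }

  // 3. robots.txt must be reachable over HTTPS
  const robots = await fetch(PROD + '/robots.txt');
  if (robots.status !== 200) throw new Error('robots.txt not reachable over HTTPS');
}

preflight()
  .then(() => console.log('SEO pre-flight passed'))
  .catch((err) => { console.error(err.message); process.exit(1); });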
AXP page quality depends not just on “can AI bots fetch it” but on how much factual brand knowledge is inside. Pages generated purely from LLM inference tend to contain hallucinations or vague generalities. Injecting knowledge from the brand’s own RAG knowledge base yields a qualitatively different result.
Baiyuan uses a central shared RAG engine (§9.4), but each brand’s documents live in a dedicated Knowledge Base (KB) identified by rag_kb_id:
Every generation query goes through askWithKbId, which attaches the brand’s kbId to the query, ensuring only that brand’s facts are retrieved.

flowchart LR
subgraph Tenant["Same tenant"]
BrandA["Brand A<br/>rag_kb_id: kb-aaa"]
BrandB["Brand B<br/>rag_kb_id: kb-bbb"]
end
subgraph RAG["Central RAG engine"]
KBA["KB: kb-aaa<br/>(Brand A only)"]
KBB["KB: kb-bbb<br/>(Brand B only)"]
end
BrandA -->|askWithKbId| KBA
BrandB -->|askWithKbId| KBB
Fig 6-7: Brand-level KB isolation. The engine is shared; the knowledge is not.
seedBrandRAGKB: automatic KB creation and seeding on AXP enable

When a brand enables AXP (the enableAXP API), the system runs three steps asynchronously in the background:

1. ragCreateKnowledgeBase returns a new kbId, which is written to brand_rag_configs
2. The brand profile document is uploaded to the new KB
3. The brand’s website URLs are submitted (ordered by geo_importance DESC) — the RAG backend crawls and vectorises each page, supplementing static profile knowledge with live site content

// Inside enableAXP setImmediate block (non-blocking)
await initialCrawl(brandId, tenantId); // generate AXP pages
await seedBrandRAGKB(brandId, queryFn); // parallel: seed RAG KB
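A sketch of what seedBrandRAGKB might do for the three steps above. ragCreateKnowledgeBase, brand_rag_configs, and queryFn are named in the text; ragUploadDocument, ragSubmitUrls, buildBrandProfile, and the brand_pages query are hypothetical stand-ins for the RAG backend’s API and the URL source.

async function seedBrandRAGKB(brandId, queryFn) {
  // 1. Create the brand's dedicated KB and persist its id
  const kbId = await ragCreateKnowledgeBase(`brand-${brandId}`);
  await queryFn(
    'INSERT INTO brand_rag_configs (brand_id, rag_kb_id) VALUES ($1, $2)',
    [brandId, kbId]
  );

  // 2. Upload the static brand profile document
  const profile = await buildBrandProfile(brandId, queryFn); // hypothetical helper
  await ragUploadDocument(kbId, profile);                    // hypothetical RAG API

  // 3. Submit website URLs by geo_importance so the backend crawls live content
  const { rows } = await queryFn(
    'SELECT url FROM brand_pages WHERE brand_id = $1 ORDER BY geo_importance DESC',
    [brandId]
  );
  await ragSubmitUrls(kbId, rows.map((r) => r.url));         // hypothetical RAG API
}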
Design principles:

- Seeding runs in the background (inside setImmediate), so enabling AXP returns immediately and never blocks on the RAG backend
- Each brand gets its own kbId, preserving the brand-level KB isolation shown in Fig 6-7
- No manual intervention: profile upload and URL seeding are fully automatic once AXP is enabled
brand.keywords (the brand’s target GEO/SEO keywords) are injected into every RAG query as a keywordsHint suffix:
// hybridCoordinator.service.js
const keywordsHint = keywords.length
? `\n\nPlease naturally incorporate the following target keywords (not all need to appear): ${keywords.join(', ')}`
: '';
const question = PAGE_TYPE_QUESTIONS[pageType](brandName) + keywordsHint;
For pricing_summary and product_features — page types prone to “pure tables, zero keywords” — the RAG prompt explicitly requires a keyword-rich introductory paragraph before the table:
1. Opening paragraph (2–3 sentences): describe the brand's positioning,
naturally incorporating the target keywords
2. Full pricing table: only list data confirmed in the knowledge base;
do not infer or fabricate
Observed keyword coverage (ILIKE substring match, 5 brands × 6 page types, 2026-04-21):
| Brand | Keyword count | Lowest-coverage page | Min coverage | Coverage on most pages |
|---|---|---|---|---|
| Brand A | 13 | pricing_summary | 9/13 | 11–13/13 |
| Brand B | 12 | pricing_summary | 7/12 | 11–12/12 |
| Brand C | 10 | pricing_summary | 7/10 | 9–10/10 |
| Brand D | 12 | faq / pricing | 7/12 | 10–12/12 |
| Brand E | 10 | pricing_summary | 1/10 ★ | 9–10/10 |
★ Brand website has no public pricing page; RAG correctly refuses to fabricate data — low coverage is expected.
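The coverage figures come from a plain substring check. A sketch of the kind of query used (axp_pages and the keyword array are assumptions; only content_md and the ILIKE match are taken from the text):

-- Per page type: how many target keywords appear verbatim in the generated Markdown
SELECT p.page_type,
       COUNT(*) FILTER (WHERE p.content_md ILIKE '%' || kw.keyword || '%') AS covered,
       COUNT(*) AS total
FROM axp_pages p
CROSS JOIN unnest(ARRAY['laser hair removal', 'botulinum treatment']) AS kw(keyword)
WHERE p.brand_id = 'brand-a'
GROUP BY p.page_type;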
content_preview: a per-page quality signal

The AXP page list API now includes a content_preview field: the first 150 characters of content_md after stripping multiline HTML comments.
This fixes a practical problem: when all six page types of the same brand displayed the same fingerprint_phrase, users could not confirm whether each page had generated correctly.
-- Strip multiline HTML comments; take first 150 chars
LEFT(REGEXP_REPLACE(content_md, '<!--[\s\S]*?-->', '', 'g'), 150) AS content_preview
The regex must use [\s\S]*? (a lazy match-anything, equivalent to dotall) rather than [^>]* — the latter stops at the first >, so it breaks on comments that contain markup, as multi-line comments typically do.
After AXP had been live for a year, fragmentation built up: the same concept had multiple names (facts / fact_check / factCheck), section names drifted, generator logic was scattered. In April 2026 we ran a P1-P9 unified pipeline refactor with the goal “one pipeline, one naming convention, one output.”
flowchart LR
P1[P1 Naming<br/>22 page_type] --> P2[P2 Generators<br/>9 producers]
P2 --> P3[P3 RAG loop<br/>gap then auto fill]
P3 --> P4[P4 Single-page UI<br/>Pipeline status]
P4 --> P5[P5 Old UI sunset<br/>-2997 lines]
P5 --> P6[P6 brand-create<br/>auto hook]
P6 --> P7[P7 Mon 06:00<br/>cron rerun-missing]
P7 --> P8[P8 Schema.org<br/>FAQPage / Review]
P8 --> P9[P9 Spec 7 rules<br/>fully enforced]
Fig 6-10: Nine stages. Each is independently verifiable; we don’t move on until the previous one ships.
22 page_type categories

The old system had homepage / home / brandHome all referring to the same page; fact_check / factCheck / facts were mixed. The refactor consolidated everything into 22 snake_case categories:
| Bucket | Example page_type | Notes |
|---|---|---|
| Foundation | brand_overview, faq, about | Universal across brands |
| Product | product_features, pricing, competitor_comparison | B2B products |
| Trust | fact_check, review_aggregate, media_coverage | Third-party validation |
| Local | service_area, office_address, gbp_profile | Geo-bound |
| Knowledge | glossary, case_study, industry_report | Content depth |
| Personal IP | creator_profile, talk_topics, future_plans | ME platform only (categories 23+) |
A page_type without a matching generator is architecturally forbidden — no orphan categories.
Each of the 9 generators obeys three rules:

- Inputs come only from brand data + RAG knowledge + the pricing API; never from scattered DB tables
- Output carries source_chunks: [{rag_chunk_id, score}] so customers can verify citations
- Generators are idempotent, so re-running them is safe (one of the P9 spec rules below)

Three crons run the loop end-to-end:

- Gap detection: detectContentGaps(brandId) writes missing categories to content_gaps
- Auto fill: reads content_gaps WHERE status='pending' and runs the matching generator (see the sketch below)
- Weekly rerun: the Monday 06:00 cron re-runs missing pages (P7)

Across 5 pilot brands, average llms-full.txt size grew from 8K to 52K characters (6.5×) — eliminating “the customer forgot” dead zones.
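A sketch of the auto-fill cron under these rules. detectContentGaps, content_gaps, status='pending', and queryFn are from the text; generatorFor and the exact query shape are assumptions.

async function runGapFill(queryFn) {
  // Pick up gaps written earlier by detectContentGaps(brandId)
  const { rows: gaps } = await queryFn(
    "SELECT id, brand_id, page_type FROM content_gaps WHERE status = 'pending'"
  );
  for (const gap of gaps) {
    const generate = generatorFor(gap.page_type); // assumed lookup: one generator per page_type
    await generate(gap.brand_id);                 // generator output includes source_chunks
    await queryFn("UPDATE content_gaps SET status = 'filled' WHERE id = $1", [gap.id]);
  }
}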
P1-P3 injected Article / WebPage; P8 added three advanced types:
- faq.js output is injected as FAQPage > mainEntity[Question]
- Review pages expose aggregateRating.ratingValue
- Creator pages use Person > knowsAbout + worksFor + award, treating creators as person entities

Injection follows §6.7’s flat strategy (@id references, not nested arrays).
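For illustration, a minimal FAQPage injection following the flat @id strategy might look like this (the question text is invented for the example):

{
  "@context": "https://schema.org",
  "@graph": [
    { "@type": "Organization", "@id": "#org", "name": "Acme Aesthetics" },
    { "@type": "FAQPage", "@id": "#faq", "about": { "@id": "#org" },
      "mainEntity": [
        { "@type": "Question", "name": "Do you offer laser hair removal?",
          "acceptedAnswer": { "@type": "Answer",
            "text": "Yes, laser hair removal is offered at the downtown Chicago clinic." } }
      ] }
  ]
}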
| Metric | Before | After |
|---|---|---|
| Naming variants | 8 mixed | 22 snake_case |
| Duplicate / orphan page_types | 13 | 0 |
| Avg llms-full.txt size | 8K | 52K (6.5×) |
| Manual customer fill-in count | High | 0 (auto-fill) |
| Old UI dead code | ~3000 lines | -2997 lines (P5 cleanup) |
Refactor spec, 7 rules (P9 enforced): one naming per concept; no orphan page_types; idempotent generators; traceable RAG citations; gap detection runs end-to-end; old UI sunset has a migration path; any new page_type passes spec review first.
Chapter takeaways:

- JSON-LD entities are linked by @id flattening, not nested arrays
- Each brand has a dedicated RAG knowledge base (rag_kb_id); seedBrandRAGKB automatically uploads a brand profile and website URLs when AXP is enabled — no manual intervention required
- brand.keywords are injected into RAG queries as a keywordsHint; pricing and feature pages require a keyword-rich intro paragraph; content_preview replaces fingerprint_phrase as the per-page quality signal

Navigation: ← Ch 5: Multi-Provider Routing · 📖 Index · Ch 7: Schema.org Phase 1 →