Baiyuan GEO Platform Whitepaper

Chapter 7 — Schema.org Phase 1: 25 Industries × Three-Layer @id Interlinking

Schema.org is not “adding a few tags.” Without industry specialization, without entity interconnection, without auto-generation, it is effectively invisible to AI.


7.1 Schema.org’s role has shifted in the AI era

Schema.org was launched in 2011 by Google, Microsoft (Bing), and Yahoo, with Yandex joining later that year, as a shared structured-data vocabulary. Its original purpose was to give traditional search engines the data behind Rich Results (star ratings, breadcrumb trails, expandable FAQs, and so on).

Since 2024 its role has shifted in two fundamental ways:

  1. From “search-engine decoration” to “structured source for AI training data”: many major LLMs are pretrained on corpora derived from Common Crawl, and Schema.org JSON-LD is among the densest sources of entity data in that crawl.
  2. From “nice-to-have” to “required”: to AI, a website without Schema.org looks like “a blob of text”, while one with Schema.org looks like “an identifiable entity”. The gap is comparable to whether an image has alt text for a screen reader.

This whitepaper treats Schema.org as the first lever in Baiyuan GEO’s optimization path. Without a solid Schema.org structure, tuning the other dimensions cannot stabilize how AI perceives the brand.


7.2 Industry-specialized @type across 25 categories

Schema.org defines hundreds of @type values, many highly specialized (e.g., MedicalClinic, VeterinaryCare, CafeOrCoffeeShop). Picking the wrong @type is the equivalent of filing yourself under the wrong cabinet — AI uses @type as a key dimension when placing an entity in its knowledge graph.

Our platform distills common industries into 25 categories, each mapping to a primary Schema.org @type and, in most cases, a secondary one.

Fig 7-1: 25-industry classification (16 physical + 7 online + 2 fallback)

| code | Name | Schema.org @type |
| --- | --- | --- |
| medical_clinic | Medical / aesthetics clinic | MedicalClinic, LocalBusiness |
| dental_clinic | Dental clinic | Dentist, LocalBusiness |
| general_clinic | General medical clinic | MedicalOrganization, LocalBusiness |
| beauty_salon | Beauty / hair salon | BeautySalon, LocalBusiness |
| fitness | Gym / yoga / pilates | HealthClub, SportsActivityLocation |
| restaurant | Restaurant | Restaurant, FoodEstablishment |
| cafe | Cafe | CafeOrCoffeeShop |
| legal_service | Law firm | LegalService, ProfessionalService |
| accounting | Accounting firm | AccountingService, ProfessionalService |
| real_estate | Real estate agency | RealEstateAgent, ProfessionalService |
| auto_repair | Auto repair | AutoRepair, AutomotiveBusiness |
| education_offline | Tutoring / training center | EducationalOrganization, LocalBusiness |
| veterinary | Veterinary clinic | VeterinaryCare, MedicalOrganization |
| lodging | Hotel / B&B | LodgingBusiness, Hotel |
| retail_store | Retail store | Store, LocalBusiness |
| financial_service | Financial service | FinancialService, ProfessionalService |
| saas_application | SaaS product | SoftwareApplication, Organization |
| web_application | Web tool | WebApplication, Organization |
| mobile_app | Mobile app | MobileApplication, Organization |
| ecommerce | Pure e-commerce | OnlineStore, Organization |
| online_education | Online learning platform | EducationalOrganization |
| news_media | News / content site | NewsMediaOrganization |
| online_professional | Online professional service | ProfessionalService, Organization |
| other_physical | Other physical business | LocalBusiness |
| other_online | Other online service | Organization |

Fig 7-1: 16 physical + 7 online + 2 fallback. Most categories specify two @types (primary + secondary), relying on JSON-LD allowing @type to be an array; a few, such as cafe and news_media, use a single type.
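The mapping in Fig 7-1 can be sketched as a lookup table. This is a minimal illustration, not the platform's actual code: INDUSTRY_TYPES and resolveIndustry are hypothetical names, only a few of the 25 rows are reproduced, and the fallback rule (unknown codes resolve to the matching catch-all category) is an assumption based on the two fallback rows.

```javascript
// Hypothetical excerpt of the Fig 7-1 mapping; names are illustrative only.
const INDUSTRY_TYPES = {
  medical_clinic: { is_physical: true, schema_types: ['MedicalClinic', 'LocalBusiness'] },
  cafe: { is_physical: true, schema_types: ['CafeOrCoffeeShop'] },
  saas_application: { is_physical: false, schema_types: ['SoftwareApplication', 'Organization'] },
  other_physical: { is_physical: true, schema_types: ['LocalBusiness'] },
  other_online: { is_physical: false, schema_types: ['Organization'] },
};

// Resolve an industry code; unknown codes fall back to the catch-all
// category that matches the brand's physical/online nature (assumed rule).
function resolveIndustry(code, isPhysical) {
  const fallback = isPhysical ? 'other_physical' : 'other_online';
  const known = Object.prototype.hasOwnProperty.call(INDUSTRY_TYPES, code);
  const key = known ? code : fallback;
  return { code: key, ...INDUSTRY_TYPES[key] };
}

console.log(resolveIndustry('cafe', true).schema_types); // [ 'CafeOrCoffeeShop' ]
console.log(resolveIndustry('bakery', true).code);       // other_physical
```

Keeping schema_types as an array even for single-type rows means the generator can assign it to @type without branching.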

Why 25 categories and not more

Schema.org includes hundreds of subtypes, but over-specialization actually lowers AI recognition rates, which is why the platform stops at 25 categories.


7.3 Three-layer @id interlinking

Fig 7-2: Three-layer entity knowledge graph

flowchart TB
    subgraph L1["Layer 1: the subject"]
      Org["Organization / LocalBusiness<br/>@id = #org<br/>name / description / url / logo /<br/>address / telephone / sameAs"]
    end
    subgraph L2["Layer 2: services"]
      Svc1["Service<br/>@id = #svc-1"]
      Svc2["Service<br/>@id = #svc-2"]
      SvcN["..."]
    end
    subgraph L3["Layer 3: people"]
      Emp1["Person / Physician<br/>@id = #emp-1"]
      Emp2["Person / Attorney<br/>@id = #emp-2"]
    end
    Svc1 -->|provider| Org
    Svc2 -->|provider| Org
    Emp1 -->|worksFor| Org
    Emp2 -->|worksFor| Org
    Svc1 -->|performer| Emp1
    Svc2 -->|performer| Emp2
    Org -.->|sameAs| Wiki[Wikipedia]
    Org -.->|sameAs| WD[Wikidata]
    Org -.->|sameAs| LI[LinkedIn]
    Org -.->|sameAs| GBP[Google Business Profile]

Fig 7-2: Three layers reference each other by @id to form a local knowledge graph; external authoritative nodes are linked via sameAs.

Why three layers rather than one blob

A common mistake is to stuff everything into a single Organization:

{
  "@type": "Organization",
  "name": "Acme Aesthetics",
  "employees": [
    { "name": "Dr. Smith", "jobTitle": "Director" }
  ],
  "services": [
    "Laser hair removal", "Double-eyelid surgery"
  ]
}

The problem: AI cannot treat “Dr. Smith” as an independently referenceable entity (Person); “Laser hair removal” is a string, not an entity (Service). A question like “who performs laser hair removal?” has no structured answer to reach.

The three-layer @id pattern creates addressable entities:

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": ["MedicalClinic", "LocalBusiness"],
      "@id": "https://acme.example/#org",
      "name": "Acme Aesthetics",
      "sameAs": [
        "https://www.wikidata.org/wiki/Q...",
        "https://www.linkedin.com/company/..."
      ]
    },
    {
      "@type": "Physician",
      "@id": "https://acme.example/#emp-1",
      "name": "Dr. Smith",
      "jobTitle": "Director",
      "worksFor": { "@id": "https://acme.example/#org" }
    },
    {
      "@type": "Service",
      "@id": "https://acme.example/#svc-laser",
      "name": "Laser hair removal",
      "provider": { "@id": "https://acme.example/#org" },
      "performer": { "@id": "https://acme.example/#emp-1" }
    }
  ]
}

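When a user asks AI “who performs laser hair removal at Acme?”, the AI has a complete entity chain to reason across, not a fuzzy string match. One caveat: Schema.org itself defines performer on Event rather than Service, and models Physician as a medical business type rather than a subtype of Person, so strict validators may flag these two usages.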
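Because the pattern depends on every @id reference resolving to a node in the same @graph, a small consistency check is useful before publishing. The sketch below is illustrative only: findDanglingRefs and collectRefs are hypothetical helpers, and the check assumes a reference is an object whose only key is "@id".

```javascript
// Collect every { "@id": ... } reference found in a property value.
function collectRefs(value, refs = []) {
  if (Array.isArray(value)) {
    value.forEach((v) => collectRefs(v, refs));
  } else if (value && typeof value === 'object') {
    const keys = Object.keys(value);
    // A pure reference is an object whose only key is "@id".
    if (keys.length === 1 && keys[0] === '@id') refs.push(value['@id']);
    else keys.forEach((k) => collectRefs(value[k], refs));
  }
  return refs;
}

// Return the @id references that do not resolve to a node in the same @graph.
function findDanglingRefs(doc) {
  const nodes = doc['@graph'] ?? [];
  const defined = new Set(nodes.map((n) => n['@id']).filter(Boolean));
  const dangling = new Set();
  for (const node of nodes) {
    for (const [key, value] of Object.entries(node)) {
      if (key === '@id') continue; // the node's own identifier, not a reference
      for (const ref of collectRefs(value)) {
        if (!defined.has(ref)) dangling.add(ref);
      }
    }
  }
  return [...dangling];
}

const doc = {
  '@context': 'https://schema.org',
  '@graph': [
    { '@type': 'MedicalClinic', '@id': 'https://acme.example/#org' },
    {
      '@type': 'Service',
      '@id': 'https://acme.example/#svc-laser',
      provider: { '@id': 'https://acme.example/#org' },
      performer: { '@id': 'https://acme.example/#emp-1' }, // no such node yet
    },
  ],
};
console.log(findDanglingRefs(doc)); // [ 'https://acme.example/#emp-1' ]
```

External sameAs links are plain URL strings rather than @id objects, so they are not reported by this check.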


7.4 Physical vs online: divergent field weights

The is_physical flag determines the completeness weight table. The two types of businesses influence AI citation through completely different dimensions.

Fig 7-3: Weight divergence

flowchart LR
    subgraph Physical["Physical (is_physical=true)"]
      P1["address 15%"]
      P2["GBP Place ID 15%"]
      P3["opening hours 10%"]
      P4["phone 10%"]
      P5["services 10%"]
      P6["employees 10%"]
      P7["url / name / logo 30%"]
    end
    subgraph Online["Online (is_physical=false)"]
      O1["url 20%"]
      O2["description 15%"]
      O3["logo_url 10%"]
      O4["services / features 20%"]
      O5["sameAs external links 15%"]
      O6["FAQ 10%"]
      O7["other 10%"]
    end

Fig 7-3: For physical businesses, address + GBP dominate at 30%. For online services, url + description dominate at 35%. Same algorithm, two weight tables, reflecting real user intent differences.
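The two weight tables in Fig 7-3 can be written down as plain objects with a guard that each sums to 100. This is a sketch under stated assumptions: the key names are hypothetical, and the grouped entries (“url / name / logo” and “other”) are kept as single keys because the figure does not give a per-field split.

```javascript
// Weight tables transcribed from Fig 7-3. Key names are hypothetical;
// grouped entries stay grouped because the figure gives no per-field split.
const PHYSICAL_WEIGHTS = {
  address: 15,
  gbp_place_id: 15,
  opening_hours: 10,
  telephone: 10,
  services: 10,
  employees: 10,
  url_name_logo: 30,
};

const ONLINE_WEIGHTS = {
  url: 20,
  description: 15,
  logo_url: 10,
  services_features: 20,
  same_as: 15,
  faq: 10,
  other: 10,
};

// Sum of all weights in a table.
function totalWeight(weights) {
  return Object.values(weights).reduce((sum, w) => sum + w, 0);
}

// Guard against a table drifting away from 100 when a weight is edited.
function assertSumsTo100(name, weights) {
  const total = totalWeight(weights);
  if (total !== 100) throw new Error(`${name} sums to ${total}, expected 100`);
}

assertSumsTo100('PHYSICAL_WEIGHTS', PHYSICAL_WEIGHTS);
assertSumsTo100('ONLINE_WEIGHTS', ONLINE_WEIGHTS);

// Worked example: a physical brand with only address, telephone, and the
// url/name/logo group filled in scores 15 + 10 + 30 = 55.
const filled = ['address', 'telephone', 'url_name_logo'];
const score = filled.reduce((sum, f) => sum + PHYSICAL_WEIGHTS[f], 0);
console.log(score); // 55
```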

Rationale

The platform UI hides/shows fields dynamically based on is_physical: physical customers see Address and Opening Hours cards; online customers do not.


7.5 Data completeness algorithm

Each field carries a weight (0–100); filling it in adds its weight to the score. Total completeness is the filled weight divided by the total weight, expressed as a percentage.

// Completeness = filled weight / total weight, as a 0–100 integer
function computeCompletion(brand, industry) {
  const weights = industry.is_physical ? PHYSICAL_WEIGHTS : ONLINE_WEIGHTS;
  let score = 0;
  let maxScore = 0;

  for (const [field, weight] of Object.entries(weights)) {
    maxScore += weight;
    if (isFilledMeaningfully(brand, field)) {
      score += weight;
    }
  }

  // Guard against an empty weight table, which would otherwise yield NaN
  if (maxScore === 0) return 0;
  return Math.round((score / maxScore) * 100);
}

// Not just "non-empty": checks for meaningful content
function isFilledMeaningfully(brand, field) {
  const value = getField(brand, field);
  if (!value) return false; // null, undefined, or empty string
  if (typeof value === 'string' && PLACEHOLDER_PATTERNS.test(value)) return false;
  if (Array.isArray(value) && value.length === 0) return false;
  return true;
}

Why “non-empty” is not enough

The early implementation only checked that fields were non-empty. Customers started entering url: "https://", description: "company", and other placeholder strings to inflate their completeness score. isFilledMeaningfully adds three checks:

  1. Placeholder regex — catches ^(https?:\/\/)?$, single-character strings, and known stubs
  2. Minimum length — e.g., descriptions must be at least 20 characters to count
  3. Format validation — URLs must be resolvable, phones must parse to E.164 format, etc.

The UI does not block such entries, but the algorithm does not count them toward the score. This keeps the completeness figure from giving users a false signal of improvement that would mislead later optimization work.
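The three checks can be sketched for string fields as follows. Everything here is an assumption made for illustration: the stub list, the 20-character threshold applied only to description, and the helper names are hypothetical, and the URL check is purely syntactic (a real resolvability check would need a network request).

```javascript
// Hypothetical rules; the stub list, thresholds, and patterns are illustrative only.
const PLACEHOLDER_PATTERNS = /^(https?:\/\/)?$|^.$|^(n\/a|none|test|tbd|company)$/i;
const MIN_LENGTH = { description: 20 };
const E164 = /^\+[1-9]\d{1,14}$/;

// Syntactic URL check only; it does not confirm that the URL resolves.
function isWellFormedHttpUrl(text) {
  try {
    const u = new URL(text);
    const isHttp = u.protocol === 'http:' || u.protocol === 'https:';
    return isHttp && u.hostname.includes('.');
  } catch {
    return false;
  }
}

// Apply the three checks in order to a string field.
function isMeaningfulString(field, raw) {
  const value = raw.trim();
  if (PLACEHOLDER_PATTERNS.test(value)) return false;        // 1. placeholder regex
  if (value.length < (MIN_LENGTH[field] ?? 1)) return false; // 2. minimum length
  if (field === 'url') return isWellFormedHttpUrl(value);    // 3. format validation
  if (field === 'telephone') return E164.test(value);
  return true;
}

console.log(isMeaningfulString('url', 'https://'));              // false
console.log(isMeaningfulString('url', 'https://acme.example'));  // true
console.log(isMeaningfulString('description', 'company'));       // false
console.log(isMeaningfulString('telephone', '+886212345678'));   // true
```

The regex deliberately has no global flag, so repeated test calls do not carry state between fields.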


7.6 Dual entry points: Wizard + Edit

Fig 7-4: Entry-point flow

flowchart TD
    Start{User type} -->|new brand| Wiz[Wizard<br/>linear 7-step flow]
    Start -->|existing brand| Dash[Dashboard<br/>completeness banner]
    Wiz --> W1[Step 1: basic info]
    W1 --> W2[Step 2: industry and description]
    W2 --> W3[Step 3: address & location<br/>if is_physical]
    W3 --> W4[Step 4: opening hours<br/>if is_physical]
    W4 --> W5[Step 5: services]
    W5 --> W6[Step 6: employees]
    W6 --> W7[Step 7: FAQ and social]
    W7 --> Done[done]
    Dash -->|<80%| Alert[red / amber warning]
    Dash --> Edit["/brands/:id/entity<br/>jump to any card"]
    Alert --> Edit
    Edit --> Save[save → completeness %<br/>updates live]

Fig 7-4: New brands go through Wizard to guarantee first-time coverage; existing brands use Edit to update at will. Both paths share the same Card components (DRY).

Why the Wizard does not force every field

Each Wizard step allows “skip for now”.

This is a product-philosophy choice: let the brand exist in AI first, then chase perfection.
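The step gating in Fig 7-4 can be sketched as a filtered step list. This is an illustration under assumptions: WIZARD_STEPS, visibleSteps, and nextStep are hypothetical names, and the sketch treats a skipped step the same as a completed one when advancing.

```javascript
// Step list transcribed from Fig 7-4; the object shape is hypothetical.
const WIZARD_STEPS = [
  { id: 1, title: 'basic info' },
  { id: 2, title: 'industry and description' },
  { id: 3, title: 'address & location', physicalOnly: true },
  { id: 4, title: 'opening hours', physicalOnly: true },
  { id: 5, title: 'services' },
  { id: 6, title: 'employees' },
  { id: 7, title: 'FAQ and social' },
];

// Steps shown for a brand: online brands never see the physical-only steps.
function visibleSteps(isPhysical) {
  return WIZARD_STEPS.filter((step) => isPhysical || !step.physicalOnly);
}

// Step that follows the current one, whether it was completed or skipped.
// Returns null once the last visible step has been passed.
function nextStep(currentId, isPhysical) {
  const steps = visibleSteps(isPhysical);
  const index = steps.findIndex((step) => step.id === currentId);
  return steps[index + 1] ?? null;
}

console.log(visibleSteps(false).map((s) => s.id)); // [ 1, 2, 5, 6, 7 ]
console.log(nextStep(2, false).title);             // services
console.log(nextStep(7, true));                    // null
```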


7.7 GBP URL Parser

Google Business Profile (GBP) exposes location identity through three different ID forms, and customers often only have one of the three URLs handy:

| ID type | Example URL | Use |
| --- | --- | --- |
| place_id | https://www.google.com/maps/place/?q=place_id:ChIJ... | Places Details API primary key |
| FTID | https://maps.google.com/maps?ftid=0x0:0xe6... | Google Maps internal ID |
| CID | https://www.google.com/maps?cid=... | Customer ID short URL form |

Fig 7-5: Parser decision tree

flowchart TD
    In[Paste any GBP URL] --> Split{URL form}
    Split -->|contains place_id:| P[extract place_id]
    Split -->|contains ftid=| F[extract FTID<br/>convert to place_id]
    Split -->|contains cid=| C[extract CID<br/>Places API → place_id]
    Split -->|short URL goo.gl| R[resolve 301 → re-enter Split]
    Split -->|other| X[return null<br/>ask user for full URL]
    P --> Done[return Place ID]
    F --> Done
    C --> Done

Fig 7-5: The parser branches explicitly on the four URL forms. Any unparseable URL returns null and the user is asked for the full URL, with no guessing.
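The first stage of the decision tree, recognizing which form a pasted URL takes, can be sketched with the standard URL API. classifyGbpUrl is a hypothetical name and the return shape is an assumption; the sketch only classifies, leaving redirect resolution and the FTID/CID conversions to later stages. Treating subdomains of goo.gl as short links is also an assumption.

```javascript
// Classify a pasted Google Maps / GBP URL into one of the forms in Fig 7-5.
// Returns { kind, value }, or null when the URL is not recognized.
function classifyGbpUrl(input) {
  let url;
  try {
    url = new URL(input.trim());
  } catch {
    return null; // not a URL at all
  }

  // Short links must be resolved (follow the redirect) and classified again.
  if (url.hostname === 'goo.gl' || url.hostname.endsWith('.goo.gl')) {
    return { kind: 'short', value: url.href };
  }

  // place_id appears inside the q parameter as "place_id:ChIJ..."
  const q = url.searchParams.get('q') ?? '';
  const placeMatch = q.match(/^place_id:([\w-]+)$/);
  if (placeMatch) return { kind: 'place_id', value: placeMatch[1] };

  const ftid = url.searchParams.get('ftid');
  if (ftid) return { kind: 'ftid', value: ftid };

  const cid = url.searchParams.get('cid');
  if (cid) return { kind: 'cid', value: cid };

  return null; // caller asks the user for the full URL
}

console.log(classifyGbpUrl('https://www.google.com/maps?cid=123'));
// { kind: 'cid', value: '123' }
console.log(classifyGbpUrl('https://example.com/'));
// null
```

Parsing query parameters instead of substring matching avoids false positives when, for example, "cid=" appears inside another parameter's value.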

Why CID requires an API call

CID is a Google-internal serial number and cannot be converted to a Place ID without calling Google’s Places API (findPlaceFromText):

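async function cidToPlaceId(cid) {
  // Build the query with URLSearchParams so every value is URL-encoded
  const params = new URLSearchParams({
    input: `cid:${cid}`,
    inputtype: 'textquery',
    fields: 'place_id',
    key: API_KEY,
  });
  const res = await fetch(
    `https://maps.googleapis.com/maps/api/place/findplacefromtext/json?${params}`
  );
  if (!res.ok) return null; // HTTP-level failure: treat as "not found"
  const data = await res.json();
  return data.candidates?.[0]?.place_id ?? null;
}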

This call consumes Google API quota; the parser caches per-URL results for 24 hours to avoid repeat consumption.
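A 24-hour per-URL cache of this kind can be sketched as a wrapper around any async resolver. The names here (withTtlCache, cachedLookup) are hypothetical, the store is a simple in-memory Map, and caching only non-null results is an assumption made so that a transient failure can be retried.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Wrap an async resolver with an in-memory cache keyed by URL.
// `resolve` and `now` are injected so the sketch runs without network access.
function withTtlCache(resolve, ttlMs = DAY_MS, now = () => Date.now()) {
  const cache = new Map(); // url -> { value, expiresAt }
  return async function cachedResolve(url) {
    const hit = cache.get(url);
    if (hit && hit.expiresAt > now()) return hit.value;
    const value = await resolve(url);
    // Cache successful lookups only, so a transient failure can be retried.
    if (value !== null) cache.set(url, { value, expiresAt: now() + ttlMs });
    return value;
  };
}

// Usage with a stand-in resolver (a real one would call the Places API).
let lookups = 0;
const cachedLookup = withTtlCache(async () => {
  lookups += 1;
  return 'ChIJ-demo';
});
cachedLookup('https://www.google.com/maps?cid=123')
  .then(() => cachedLookup('https://www.google.com/maps?cid=123'))
  .then(() => console.log(lookups)); // 1
```

A production version would likely use a shared store rather than process memory, so that the quota saving survives restarts and applies across instances.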


7.8 Function skeleton

generateBrandEntitySchema

function generateBrandEntitySchema(brand, industry) {
  const base = `https://${brand.primary_domain}`;
  const graph = [];

  // Layer 1: Organization / LocalBusiness
  graph.push({
    '@type': industry.schema_types, // array, e.g. ["MedicalClinic", "LocalBusiness"]
    '@id': `${base}/#org`,
    name: brand.name,
    url: brand.url,
    description: brand.description,
    logo: brand.logo_url,
    ...(industry.is_physical && {
      address: buildAddress(brand.location),
      telephone: brand.location?.telephone,
      openingHoursSpecification: buildHours(brand.hours),
      geo: buildGeo(brand.location),
    }),
    sameAs: buildSameAs(brand), // Wikipedia / Wikidata / LinkedIn / GBP
  });

  // Layer 2: services
  for (const svc of brand.services ?? []) {
    graph.push({
      '@type': 'Service',
      '@id': `${base}/#svc-${svc.slug}`,
      name: svc.name,
      description: svc.description,
      provider: { '@id': `${base}/#org` },
    });
  }

  // Layer 3: employees
  for (const emp of brand.employees ?? []) {
    graph.push({
      '@type': emp.specialized_type ?? 'Person', // Physician / Attorney / ...
      '@id': `${base}/#emp-${emp.slug}`,
      name: emp.name,
      jobTitle: emp.job_title,
      worksFor: { '@id': `${base}/#org` },
    });
  }

  return {
    '@context': 'https://schema.org',
    '@graph': graph,
  };
}

This function is the shared foundation for AXP generation (Ch 6) and closed-loop hallucination remediation (Ch 9).
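Once generated, the @graph object still has to be embedded in the page. A minimal sketch of that last step, assuming a hypothetical toJsonLdScript helper and a hand-written sample object in place of real generator output:

```javascript
// Serialize a JSON-LD object into a script tag for the page <head>.
// Escaping "<" keeps a string value such as "</script>" from closing the
// tag early; "\u003c" is still valid JSON for "<".
function toJsonLdScript(schema) {
  const json = JSON.stringify(schema, null, 2).replace(/</g, '\\u003c');
  return `<script type="application/ld+json">\n${json}\n</script>`;
}

// Sample input standing in for the output of generateBrandEntitySchema.
const sample = {
  '@context': 'https://schema.org',
  '@graph': [
    {
      '@type': ['MedicalClinic', 'LocalBusiness'],
      '@id': 'https://acme.example/#org',
      name: 'Acme <Aesthetics>',
      description: undefined, // unset fields are dropped by JSON.stringify
    },
  ],
};

console.log(toJsonLdScript(sample));
```

Because JSON.stringify omits properties whose value is undefined, optional fields left unset in the generator do not appear as empty keys in the published markup.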



