If you ask different questions every time, you can never tell whether the score moved because the brand changed or because the questions changed.
The regular scan from Ch 2 dynamically generates intent queries on each run to simulate the variety of real user questions. That works for the horizontal question "how often is the brand mentioned right now?", but it cannot answer the longitudinal one: "how has the AI's perception of this brand changed between last week and this week?"
Because the two scans used different query sets, the score delta has at least three possible causes:

1. The brand's AI perception genuinely changed.
2. The query set itself changed.
3. The model's output simply varied between runs, even for identical input.
Without separating these three causes, the trend line is noise. To isolate real change, we need a fixed query set with repeated testing.
```mermaid
flowchart TB
    subgraph Regular["Regular scan (horizontal)"]
        R1[new query each run]
        R2["answers 'what's the current score?'"]
        R3[not suited for time-series comparison]
    end
    subgraph Baseline["Phase baseline (longitudinal)"]
        B1[fixed query set]
        B2["repeated weekly / bi-weekly"]
        B3["answers 'how is perception evolving?'"]
    end
    Regular --- Baseline
```
Fig 10-1: The two scan types answer different types of questions. Complementary, not substitutable.
The fixed query set and every phase's full responses (response_text) are stored in baseline_test_runs.queries_json and baseline_test_responses, respectively:

```mermaid
flowchart LR
    Q["Intent queries × 20 (fixed)"] --> P1["Phase 1<br/>Day 0"]
    Q --> P2["Phase 2<br/>Day 7"]
    Q --> P3["Phase 3<br/>Day 14"]
    P1 -->|compare| P2
    P2 -->|compare| P3
    P1 -->|compare| P3
    P1 --> R["response_text × platforms"]
    P2 --> R
    P3 --> R
```
Fig 10-2: One query set, three runs, three complete response sets compared.
20 is an empirical choice. It can be revised if data supports a different number later.
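For concreteness, here is a minimal sketch of the storage shape as Python dataclasses. Only `baseline_test_runs.queries_json`, `baseline_test_responses`, `response_text`, and `baseline_cohort_id` come from this chapter; every other field name is an assumption about what such tables would plausibly hold.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class BaselineTestRun:
    """One phase of one baseline cohort (table: baseline_test_runs)."""
    id: int
    baseline_cohort_id: int   # groups Phase 1/2/3 of a single baseline
    phase: int                # 1, 2, or 3
    queries_json: str         # the fixed query set, frozen at Phase 1
    started_at: datetime      # assumed bookkeeping field
    is_complete: bool         # a failed run stays marked incomplete

@dataclass
class BaselineTestResponse:
    """One platform's answer to one query (table: baseline_test_responses)."""
    id: int
    run_id: int               # -> BaselineTestRun.id
    platform: str             # which AI platform produced the answer
    query_index: int          # position within queries_json
    response_text: str        # full text, retained permanently
```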
Phase baseline testing runs on a fully independent data path that does not overlap with regular scanning:
| Facet | Regular scan | Phase baseline |
|---|---|---|
| Query source | Dynamically generated each run | Fixed at Phase 1 |
| Trigger frequency | Daily / every 4h | Scheduled or manually triggered |
| Included in main GEO score? | Yes | No (shown separately) |
| Subject to Stale Carry-Forward? | Yes | No (failed runs are marked incomplete) |
| Data retention | Rolling window | Permanent response_text retention |
| Uses Redis cache? | Yes (reduces duplicate API cost) | No (every query is asked fresh) |
Regular scans cache a recent response to the same question (on the assumption that the AI's opinion does not shift within minutes) to keep API costs down. The baseline's whole purpose, however, is to measure change in the AI's opinion, so serving a cached answer would destroy the measurement.
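A minimal sketch of that dispatch rule, assuming a Redis client and with the platform call injected as a function; the key scheme, TTL, and names are illustrative, not the actual implementation:

```python
import hashlib
from typing import Callable

import redis  # assumed: any cache client with get/setex works the same way

cache = redis.Redis()
CACHE_TTL = 4 * 3600  # seconds; matches the regular scan's 4h cadence

def ask(platform: str, query: str, is_baseline: bool,
        fetch: Callable[[str, str], str]) -> str:
    """Regular scans may reuse a recent cached answer; baseline runs
    always go to the platform fresh, so opinion change stays measurable."""
    key = "geo:resp:" + hashlib.sha256(f"{platform}|{query}".encode()).hexdigest()
    if not is_baseline:
        cached = cache.get(key)
        if cached is not None:
            return cached.decode()
    response_text = fetch(platform, query)  # the real AI platform call
    if not is_baseline:
        cache.setex(key, CACHE_TTL, response_text)  # baseline never populates it
    return response_text
```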
If baseline results counted toward the main score, Phase 2 and Phase 3 retests would create a triple-count effect (same brand counted multiple times in adjacent time windows), polluting the dashboard trend. Keeping them separate preserves score purity.
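In code terms, the headline aggregation simply never sees baseline rows. A sketch, assuming each mention record carries a source tag (an illustrative field, not necessarily the real schema):

```python
def main_geo_score(mentions: list[dict]) -> float:
    """Headline score over regular-scan mentions only; baseline retests
    are excluded so Phase 2/3 reruns cannot triple-count the brand."""
    regular = [m for m in mentions if m["source"] != "baseline"]
    if not regular:
        return 0.0
    return sum(m["score"] for m in regular) / len(regular)
```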
The value of phase baseline data is not just the score change; it supports four independent observation axes.
```mermaid
flowchart TB
    subgraph Quant["1. Quantitative"]
        Q1["citation rate up/down"]
        Q2["position forward/backward"]
        Q3["platform coverage widens/narrows"]
    end
    subgraph Qual["2. Qualitative"]
        L1[description wording change]
        L2[depth of language]
        L3[added or removed points]
    end
    subgraph Comp["3. Competitive"]
        C1[co-occurring competitor list]
        C2[competitor positional relationships]
        C3[new competitors appear]
    end
    subgraph Sent["4. Sentiment"]
        S1[sentiment distribution shift]
        S2["rise/fall of strong language"]
        S3["neutral → positive / negative shift"]
    end
```
Fig 10-3: Four orthogonal axes. Quantitative is directly computable; Qualitative needs text diffing; Competitive needs a graph diff; Sentiment needs per-sentence scoring.
- **Quantitative**: compute differences directly: score delta, percentage change, trend slope.
- **Qualitative**: diff the Phase 1 and Phase 2 response_text; highlight added paragraphs, removed paragraphs, and replaced phrases, and surface the diff visually to the customer.
- **Competitive**: extract all brand entities in each response and compare the sets across phases (new arrivals / departures / retained); treat it as a set difference over time.
- **Sentiment**: run sentiment classification per sentence and compare distributions. E.g., Phase 1 at neutral 80% / positive 15% / negative 5% versus Phase 2 at neutral 60% / positive 30% / negative 10% is a clear sentiment polarization.

All four axes are sketched in code below.
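This minimal Python sketch assumes per-phase scores, pre-extracted brand-entity sets, and per-sentence sentiment labels are already available from the pipeline's existing models; the function names are illustrative.

```python
import difflib
from collections import Counter

# 1. Quantitative: least-squares slope of the score series (change per phase).
def trend_slope(scores: list[float]) -> float:
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

# 2. Qualitative: unified diff of two phases' response_text,
#    ready to render as added/removed/replaced highlights.
def response_diff(text_a: str, text_b: str) -> str:
    return "\n".join(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(),
        fromfile="phase1", tofile="phase2", lineterm=""))

# 3. Competitive: set difference over brand entities extracted per phase.
def competitor_delta(entities_a: set[str], entities_b: set[str]) -> dict[str, set[str]]:
    return {
        "arrived": entities_b - entities_a,
        "departed": entities_a - entities_b,
        "retained": entities_a & entities_b,
    }

# 4. Sentiment: per-sentence label counts normalized into a distribution.
def sentiment_distribution(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    return {k: counts.get(k, 0) / len(labels)
            for k in ("neutral", "positive", "negative")}
```

For instance, `competitor_delta({"Acme", "Beta"}, {"Acme", "Gamma"})` reports Gamma as arrived, Beta as departed, and Acme as retained between phases (hypothetical brand names).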
Three recommended triggers:
Rebuild (do not extend) a Phase baseline when any of these occur:
On rebuild, create a new baseline_cohort_id; keep the old for historical reference but do not add new data points to it.
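A minimal sketch of the rebuild rule, with `store` standing in for a hypothetical data-access layer; the one load-bearing idea is that a rebuild mints a fresh baseline_cohort_id instead of appending to the old one.

```python
import json

def rebuild_baseline(store, brand_id: int, new_queries: list[str]) -> int:
    """Close the current cohort and start a new one at Phase 1."""
    old_cohort_id = store.current_cohort_id(brand_id)
    store.mark_cohort_closed(old_cohort_id)  # kept read-only for history;
                                             # no new data points are added
    new_cohort_id = store.create_cohort(brand_id)  # fresh baseline_cohort_id
    # Phase 1 freezes the query set that all later phases will reuse
    store.create_run(new_cohort_id, phase=1,
                     queries_json=json.dumps(new_queries))
    return new_cohort_id
```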
- /baseline page showing Phase 1→2→3 comparison views
- response_text is retained permanently

Navigation: ← Ch 9: Closed-Loop Remediation · 📖 Index · Ch 11: Case Studies →