Any system that depends on a single AI vendor is a fragile system. Multi-path fault tolerance belongs at the start of the architecture, not bolted on after an incident.
Before discussing how to tolerate faults, it is worth articulating why. Single-vendor AI dependency carries five structural risks that are independent of the vendor’s size or brand:
gpt-4-0314, claude-2), every feature depending on it must migrate immediately.These are not “handle when they happen” risks. They are “they will happen” realities. Depending on any single vendor means locking all five categories into your own SLA. For a SaaS this is unacceptable design.
Baiyuan GEO uses a primary/fallback dual-path abstraction named modelRouter.
flowchart LR
App[Application layer<br/>scan / eval / hallucination] --> MR[modelRouter]
MR -->|primary| Agg[OpenAI-compatible<br/>aggregator]
MR -->|fallback| D1[OpenAI direct]
MR -->|fallback| D2[Anthropic direct]
MR -->|fallback| D3[Google direct]
MR -->|fallback| D4[DeepSeek direct]
MR -->|fallback| D5[Moonshot direct]
MR -->|fallback| Dn[other vendor-direct]
Agg --> LLM1[GPT-4o / Claude / Gemini / ...]
D1 --> LLM1
Fig 5-1: The application layer calls an abstract modelRouter.complete(). The decision of which path to take is encapsulated inside the router and is transparent to the application.
| Path | Properties | Typical use |
|---|---|---|
| Primary (aggregator) | One API endpoint covers multiple vendor models, unified billing, low onboarding cost | Day-to-day scanning, scoring, general inference |
| Fallback (vendor-direct) | Per-vendor API key, per-vendor billing, per-vendor SDK | Activated on primary failure; also used when a vendor-specific feature is needed |
The primary path’s strength is “onboard once, use everywhere.” The fallback path’s strength is independence. The modelRouter abstracts both so the application layer neither knows nor cares which path a given request took.
Dual-path routing sounds simple. Implementing it surfaces three thorny issues.
The most commonly underestimated issue. The same model is identified by different strings in the aggregator and the native API:
| Vendor-native ID | Aggregator ID |
|---|---|
deepseek-chat |
deepseek-v3 |
moonshot-v1-8k |
kimi-k2 |
claude-3-5-sonnet-20241022 |
claude-3-5-sonnet |
qwen-plus |
qwen3-plus |
meta-llama/Llama-3.3-70B-Instruct |
llama-3.3-70b |
Without handling this mapping, a model name that succeeds on the primary path returns HTTP 404 on direct-call fallback. The fix is a mapping table:
const DIRECT_MODEL_ID_MAP = {
// aggregator ID → direct provider ID
'deepseek-v3': { provider: 'deepseek', model: 'deepseek-chat' },
'deepseek-r1': { provider: 'deepseek', model: 'deepseek-chat' },
'kimi-k2': { provider: 'moonshot', model: 'moonshot-v1-8k' },
'qwen3-plus': { provider: 'alibaba', model: 'qwen-plus' },
'grok-2': { provider: 'xai', model: 'grok-2-latest' },
'claude-3-5-sonnet': { provider: 'anthropic', model: 'claude-3-5-sonnet-20241022' },
// ... full mapping covers 15+ entries
};
Every model addition or deprecation must update this table or a path silently breaks.
Vendor-native APIs each require certain parameters the aggregator silently defaults for you. When switching to direct, those parameters must be supplied explicitly:
| Model family | Required extraParams |
|---|---|
| Qwen3 family | enable_thinking: false (otherwise returns the reasoning trace, wasting tokens) |
| DeepSeek-R1 | Defaults to reasoning mode; set temperature: 0.6 for stable output |
| Claude (Anthropic SDK) | max_tokens mandatory (OpenAI-style vendors allow it omitted) |
| Gemini | safetySettings defaults are strict; commercial content often needs relaxation |
We maintain a DIRECT_EXTRA_PARAMS table per provider and merge it in at path-switch time.
DeepSeek-R1 and OpenAI o1/o3 are reasoning-type models — they think internally before responding, producing 15–30 second latencies that exceed our scanner’s 25-second timeout.
The fix is not to raise the timeout (that would drag the entire scan cadence). The fix is to fall back to the same vendor’s non-reasoning variant:
| Primary model | Fallback direct model | Compromise |
|---|---|---|
deepseek-r1 |
deepseek-chat |
Loses reasoning_content field |
o1-mini |
gpt-4o-mini |
Loses the reasoning chain; citation detection does not depend on it |
For GEO scanning, citation detection only needs the final text. The reasoning chain is a nice-to-have, not a requirement. Trading reasoning_content for completion rate is the right call.
This is the most common production trap. A typical system has at least three layers of retry:
Business Layer ──┐
├── withRetry(fn, { retries: 3 })
Router Layer ────┤
├── primary fail → fallback (= another attempt)
SDK Layer ───────┤
└── OpenAI SDK built-in maxRetries: 2
Multiplied out: a single OpenAI SDK timeout of 30s × 2 retries × withRetry 3 × primary-to-fallback switch 1 = worst case 7 minutes before the system gives up. The scanner expects a request to finish in 25 seconds. Instead, a single request blocks an entire worker for more than an order of magnitude longer.
Only the outermost layer gets to decide when to give up. Every inner layer has
maxRetries: 0and an explicitly settimeout.
Concrete implementation:
// SDK layer: no internal retry
const openai = new OpenAI({
apiKey: API_KEY,
maxRetries: 0,
timeout: 20_000, // 20s hard limit
});
// Router layer: single failover, no loop
async function routeComplete(request) {
try {
return await primaryPath(request);
} catch (err) {
if (isRetryableError(err)) {
return await fallbackPath(request);
}
throw err;
}
}
// Business layer: caller decides retry policy
const result = await withRetry(
() => modelRouter.complete(request),
{ retries: 2, backoff: 'exponential' }
);
Three layers each handle their own responsibility. Total worst-case elapsed time = 20s × 2 (primary + fallback) × 3 (outer retry) = 120s. Predictable, monitorable, boundable.
Not every error should trigger a fallback. Some errors a switch cannot help — switching wastes a fallback slot for nothing.
flowchart TD
E[Primary path error] --> C{Error type}
C -->|HTTP 5xx| S[switch to fallback]
C -->|HTTP 429 rate-limit| S
C -->|Timeout| S
C -->|Network error| S
C -->|HTTP 401 auth| X[re-raise immediately<br/>switching won't help]
C -->|HTTP 400 param| X
C -->|HTTP 402 payment| X
C -->|Model not found| S
S --> R[record event<br/>provider, model, reason, duration]
X --> R
R --> O[emit metrics<br/>for observability]
Fig 5-2: Switch decision tree. 5xx / 429 / timeout / network errors are *transient and warrant a switch; 401 / 400 / 402 are permanent and should not trigger one.*
Every switch event must record:
| Field | Purpose |
|---|---|
provider_primary |
Which vendor served the primary path |
provider_fallback |
Which vendor served the fallback (if triggered) |
model_requested |
The model ID the application asked for |
model_actual |
The model that actually served (may differ on fallback) |
reason |
transient / permanent / timeout / model_not_found / … |
latency_primary_ms |
Primary path elapsed time (including timeout wait) |
latency_fallback_ms |
Fallback path elapsed time |
status |
success_primary / success_fallback / both_failed |
Over time these metrics answer important questions: which vendor is least stable, which model triggers fallbacks most often, which time windows are riskiest? Without this data, debugging is driven by customer complaints.
The worst state a fallback can be in is not “failing” — it is “unused daily, broken when actually needed.” Three mechanisms mitigate this.
On every worker startup, send a minimal request ("ping" or a one-word completion) to every fallback provider:
async function smokeTestFallbacks() {
const providers = Object.keys(DIRECT_CLIENTS);
const results = await Promise.allSettled(
providers.map(p => pingProvider(p, { timeout: 5_000 }))
);
return results.map((r, i) => ({
provider: providers[i],
ok: r.status === 'fulfilled',
error: r.status === 'rejected' ? r.reason.message : null,
}));
}
Output hits the startup log and a /health/fallbacks endpoint. Any broken path is visible immediately to ops.
Every 30 minutes, a light check using a low-cost prompt ("OK" → 2 tokens). Three consecutive failures flip the provider’s status to “fallback unavailable” in the UI.
Some providers have no available fallback. Three common cases:
| Case | Example | Policy |
|---|---|---|
| Closed-source, no public API | Some proprietary enterprise models | UI states “no fallback”; primary failure → fail |
| Requires customer-supplied key | Certain paid frontier models | User fills key in settings; left blank → no auto-fallback |
| Geopolitical restriction | Certain cross-border services | Dynamic enablement by customer’s region |
Transparency beats “pretending to have coverage.” Telling the customer “this provider has no fallback” is better than letting them discover it during an outage.
export async function complete({ prompt, model, temperature, ...opts }) {
const request = buildRequest({ prompt, model, temperature, ...opts });
try {
const result = await callPrimary(request);
emitMetric({ status: 'success_primary', model });
return result;
} catch (err) {
if (!isRetryableError(err)) {
emitMetric({ status: 'permanent_fail', reason: err.code, model });
throw err;
}
const fallback = resolveFallback(model);
if (!fallback) {
emitMetric({ status: 'no_fallback_available', model });
throw err;
}
const fallbackReq = mapRequestForProvider(request, fallback);
try {
const result = await callDirect(fallback.provider, fallbackReq);
emitMetric({
status: 'success_fallback',
provider_fallback: fallback.provider,
model_actual: fallback.model,
});
return result;
} catch (fbErr) {
emitMetric({ status: 'both_failed', model });
throw fbErr;
}
}
}
function resolveFallback(aggregatorModelId) {
// 1. Exact match in DIRECT_MODEL_ID_MAP
if (DIRECT_MODEL_ID_MAP[aggregatorModelId]) {
return DIRECT_MODEL_ID_MAP[aggregatorModelId];
}
// 2. Pattern match for versioned models
// e.g. "deepseek-r1-0528" → "deepseek-chat"
for (const [pattern, target] of DIRECT_PATTERN_MAP) {
if (pattern.test(aggregatorModelId)) return target;
}
// 3. No fallback available
return null;
}
These two functions are the heart of the routing architecture. Token counting, prompt assembly, and response normalization all layer on top of them.
modelRouter abstracts primary/fallback dual paths; the application layer is blind to the choicemaxRetries: 0 to avoid exponential elapsed timeNavigation: ← Ch 4: Stale Carry-Forward · 📖 Index · Ch 6: AXP Shadow Document →