AI Factory & Agentic Inference Playbook — Architecture, FinOps, and Migration for Token-Heavy Workloads
By Vatsal Shah · 2026-05-31 · AI Infrastructure / FinOps
Chatbots optimized cost per message. Agents optimize cost per workflow, and your factory wasn't built for workflows.
Table of Contents
- Chapter 1: Workload Science for Agents
- Chapter 2: Reference Architectures
- Chapter 3: FinOps Model & Showback
- Chapter 4: Migration Methodology
- Chapter 5: Day-2 Operations
- Key Takeaways & FAQ
Introduction: From Chat Endpoints to Workflow Factories
If you've spent the last eighteen months buying GPU capacity for "AI," odds are your dashboards still measure cost per chat message. That's the wrong unit. A customer-support bot might burn 800–1,200 tokens per turn. A production agent closing an insurance claim fans out across retrieval, planning, tool calls, verification, and summarization—often 40,000–120,000 tokens per completed workflow, with bursty parallelism that looks nothing like steady QPS on a single model endpoint.
I've audited factories where utilization graphs looked healthy at 62% GPU average—while p99 workflow latency blew past SLO because three planner agents spawned twelve sub-agents each during batch reconciliation windows. The hardware wasn't idle; it was mis-scheduled. Agentic inference is a scheduling problem dressed up as an API problem.
This playbook is the operating manual I wish existed when I migrated the first "copilot cluster" into a real factory: workload science first, architecture second, FinOps as the forcing function, migration as controlled risk, and Day-2 as measurable SLOs—not heroics.
Before we touch rack diagrams, align on outcomes. An AI factory is a platform layer that provisions compute, routes models, meters tokens, enforces policy, and exposes SLAs to product teams running agents—not a single vLLM pod behind a load balancer. If your Agentic SDLC operating model is the "how we build," the factory is the "how we run at scale."
The left side of Figure A is familiar: one gateway, one model pool, one autoscaler on request rate. The right side adds a workflow orchestration plane, model tier registry, online vs batch schedulers, and cost attribution tags that follow a workflow_id from first planner call to final audit log. Without those controls, FinOps can't answer the only question executives care about: What did we pay to finish the job?
Practitioner insight: Don't benchmark agents with "tokens per second" alone. Benchmark tokens per successful workflow under realistic tool latency and failure retries. That's the number your CFO will remember.
Who This Playbook Is For
Platform engineers building inference control planes, FinOps leads translating tokens into P&L, SRE teams carrying agent SLOs, and engineering executives preparing for agent scale without invoice shock. If you're still buying GPUs per chat endpoint, start at Chapter 1. If you're mid-migration to tiered routing, jump to Chapter 4. If finance is asking why AI spend doubled while tickets fell, start at Chapter 3.
How to Use This Document
Read straight through once for the narrative, then use chapters as reference during architecture reviews, migration ceremonies, and monthly factory councils. Code labs are starting points—adapt to your observability stack and cloud contracts. Pair this playbook with hands-on process assessments when you need independent validation of factory readiness.
Chapter 1: Workload Science for Agents
1.1 Why Chat Metrics Lie About Agent Load
The first mistake I see in capacity planning is treating an agent like a chatbot with extra steps. Chat traffic is roughly sequential: one user message in, one model completion out, with context growing linearly with turn count. Agent traffic is branching. A planner spawns researchers; researchers call tools; tools return payloads that get re-summarized; a verifier model may re-read the entire thread. Depth isn't "turns," it's graph depth.
When you profile agents, capture four dimensions on every workflow span:
- Fan-out factor (F): max parallel model calls per orchestration tick.
- Context growth rate (G): tokens added per hop (tool JSON is brutal here).
- Retry multiplier (R): expected re-runs after tool failure or policy rejection.
- Cache affinity (C): share of prompt prefix stable across hops.
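Expected tokens per workflow ≈ base_prompt × (1 + R) × Σ(hops) × (1 + tool_attachment_factor). Here R is the retry multiplier defined above, and tool_attachment_factor is the extra share of tokens that tool payloads add per hop. If you autoscale only on HTTP QPS, you'll miss the cliff when F jumps from 2 to 16 during month-end jobs.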
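To make the rule of thumb concrete, here is a minimal Python sketch. It assumes Σ(hops) can be read as the total number of model calls in the workflow graph; the function name and the example inputs are illustrative, not measurements.

```python
def expected_tokens_per_workflow(
    base_prompt_tokens: int,
    hops: int,
    retry_multiplier: float,
    tool_attachment_factor: float,
) -> int:
    """Rule-of-thumb estimate: base_prompt x (1 + R) x hops x (1 + attachment)."""
    if hops < 1:
        raise ValueError("a workflow needs at least one model call")
    estimate = (
        base_prompt_tokens
        * (1 + retry_multiplier)
        * hops
        * (1 + tool_attachment_factor)
    )
    return round(estimate)


# Hypothetical claim workflow: 2,000-token base prompt, 12 hops,
# 10% retries, tool payloads adding 50% per hop.
tokens = expected_tokens_per_workflow(2_000, 12, 0.10, 0.50)
```

With these made-up inputs the estimate lands just under 40k tokens, and doubling the hop count doubles it. That linear dependence on hops is why a fan-out jump hurts far more than a QPS bump.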
Practitioner insight: Add a workflow_id label on every inference request on day one. Without it, you'll never reconcile FinOps to product outcomes—and you'll re-litigate the same chargeback fight quarterly.
1.2 Fan-Out, Backpressure, and Queue Discipline
Fan-out is where innocent pilot clusters go to die. Suppose a coordinator dispatches eight sub-agents, each with a 32k context window prefill. That's eight concurrent prefills on the same GPU pool unless you shard by lane. Lanes are logical queues with independent concurrency caps: online-interactive, online-standard, batch-offline, eval-regression.
Backpressure belongs in the orchestrator, not the GPU driver. When lane saturation exceeds a watermark (e.g., 85% of negotiated concurrency tokens), the orchestrator should:
- Degrade model tier for non-critical sub-agents (frontier → mid → small).
- Collapse duplicate retrieval hops via deduplicated embedding cache.
- Shed lowest-priority workflows (marketing copy gen) before payroll reconciliation agents.
I've implemented token debt counters per tenant: if a team exceeds their in-flight token budget, new fan-out branches queue with visible ETA rather than silently piling onto shared H100s.
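A token debt counter can be very small. The sketch below is one possible shape, not the implementation from any specific engagement: the class name, fields, and the 0.85 default watermark (borrowed from the example above) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TokenDebtCounter:
    """Tracks in-flight tokens for one tenant against a negotiated budget."""

    budget_tokens: int       # negotiated in-flight token budget
    watermark: float = 0.85  # saturation level at which new branches must queue
    in_flight: int = 0

    def try_admit(self, tokens: int) -> bool:
        """Admit a fan-out branch only if it keeps the tenant under the watermark."""
        if self.in_flight + tokens > self.budget_tokens * self.watermark:
            return False  # caller should queue the branch and surface an ETA
        self.in_flight += tokens
        return True

    def release(self, tokens: int) -> None:
        """Return tokens to the budget when a branch completes or is cancelled."""
        self.in_flight = max(0, self.in_flight - tokens)
```

The orchestrator calls try_admit before dispatching each branch and release in a finally block, so cancelled branches do not leak budget.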
| Workload Pattern | Typical Fan-Out | Scheduling Lane | Primary Risk |
|---|---|---|---|
| Interactive copilot | 1–2 | online-interactive | Tail latency spikes |
| Research agent mesh | 6–24 | online-standard | KV cache thrash |
| Batch reconciliation | 8–64 | batch-offline | Cluster hogging |
| Eval / regression | 4–12 | eval-regression | Contaminating prod SLO |
1.3 Caching: Prefix, KV, and Tool Result Memoization
Caching for agents is a portfolio, not a checkbox. Provider prefix caching rewards stable system prompts and JSON schemas at the top of the context window—move volatile user/tool payloads to the tail. KV cache reuse matters when sub-agents share parent context; some runtimes support session IDs that map to shared physical pages. Tool memoization is underrated: if get_customer(12345) returned 40KB JSON ten seconds ago, don't re-embed it across six sub-agents.
Policy guardrails: memoize only idempotent reads; TTL by data class (public docs 24h, PII 60s). Log cache keys in your observability plane so security can audit what was shared across agents.
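As a rough illustration of those guardrails, the sketch below memoizes idempotent tool reads with a TTL chosen by data class and keeps an audit trail of cache keys. ToolMemo, its method names, and the data-class labels are hypothetical; the TTL values are the ones quoted above.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# TTL per data class, from the guardrails above: public docs 24h, PII 60s.
TTL_SECONDS = {"public": 24 * 3600, "pii": 60}


class ToolMemo:
    """Memoizes idempotent tool reads; never use it for writes."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store: Dict[Tuple[str, tuple], Tuple[float, Any]] = {}
        self.audit_log: List[Tuple[str, Tuple[str, tuple]]] = []

    def get_or_call(self, tool_name: str, args: tuple, data_class: str, fn: Callable[..., Any]) -> Any:
        key = (tool_name, args)  # args must be hashable
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[0]:
            self.audit_log.append(("hit", key))  # lets security audit what was shared
            return entry[1]
        value = fn(*args)
        self._store[key] = (now + TTL_SECONDS[data_class], value)
        self.audit_log.append(("miss", key))
        return value
```

Six sub-agents asking for the same customer within the TTL then cost one tool call instead of six.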
1.4 Long-Context Economics and Compaction
Long context is seductive and expensive. Every extra 8k tokens in prefill burns memory bandwidth and extends time-to-first-token. For agents, implement compaction hops: a cheap summarizer model collapses tool traces before the planner re-enters. Compaction quality gates matter—if summaries drop constraint IDs, downstream models hallucinate compliance approvals.
Heuristics I use:
- Hard cap tool JSON at ingress (truncate with structured pointers, not naive ellipsis).
- Promote "durable facts" to a workflow scratchpad store; pass handles, not blobs.
- Reserve 128k+ windows for true multivariate reasoning, not lazy logging dumps.
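The first two heuristics can be combined in a few lines. This is a sketch under assumptions: the scratch:// handle scheme, the function name, and the dict standing in for a scratchpad store are all hypothetical.

```python
import hashlib
import json


def cap_tool_payload(payload: dict, max_chars: int, scratchpad: dict) -> dict:
    """Pass small payloads through; replace oversized ones with a structured pointer."""
    raw = json.dumps(payload, sort_keys=True)
    if len(raw) <= max_chars:
        return payload
    # Park the full payload in the scratchpad and hand the model a handle instead.
    handle = "scratch://" + hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
    scratchpad[handle] = payload
    return {
        "truncated": True,
        "handle": handle,
        "keys": sorted(payload),      # tells the model which fields exist
        "original_chars": len(raw),
    }
```

The model sees which fields exist and can request them by handle, which is more useful to a planner than a blob cut off mid-string.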
1.5 Batch vs Online: Two Factories in One
Online inference optimizes p95/p99 latency under variable fan-out. Batch inference optimizes throughput per dollar via continuous batching, padded sequences, and opportunistic use of spot/preemptible capacity. Agents need both: customer-facing steps on online lanes; nightly reconciliation on batch lanes with SLA measured in hours, not milliseconds.
Never let batch queues borrow from interactive GPU pools without hard isolation—I've seen "just this one nightly job" degrade executive dashboards at 06:00 when batch spilled over.
1.6 Workload Profiling Lab: Instrumentation Schema
You can't optimize what you don't label. Minimum span attributes for every model call:
workflow_id, parent_span_id, agent_role, model_tier, input_tokens, output_tokens, cached_tokens, prefill_ms, decode_ms, tool_wait_ms, usd_estimate.
1.7 Codelab: Workload Profiler Emitter (Python)
# workload_profiler.py: emit OpenTelemetry-friendly spans for agent hops
from dataclasses import dataclass, asdict
from time import perf_counter
from typing import Optional
import json
import sys


@dataclass
class InferenceSpan:
    workflow_id: str
    span_id: str
    parent_span_id: Optional[str]
    agent_role: str
    model_tier: str
    lane: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0


class Profiler:
    def __init__(self, sink=print):
        self.sink = sink

    def run_hop(self, span: InferenceSpan, callable_fn):
        # Time the hop, then emit the span plus measured latency as one JSON line.
        t0 = perf_counter()
        result = callable_fn()
        elapsed_ms = (perf_counter() - t0) * 1000
        payload = {**asdict(span), "latency_ms": round(elapsed_ms, 2)}
        self.sink(json.dumps(payload))
        return result


# Example: coordinator calls researcher sub-agent
if __name__ == "__main__":
    prof = Profiler(sink=lambda line: sys.stdout.write(line + "\n"))
    span = InferenceSpan(
        workflow_id="wf-claim-88421",
        span_id="span-research-3",
        parent_span_id="span-plan-1",
        agent_role="researcher",
        model_tier="mid",
        lane="online-standard",
        input_tokens=18240,
        output_tokens=960,
        cached_tokens=12000,
    )
    prof.run_hop(span, lambda: {"status": "ok", "docs": 4})
1.8 Codelab: Fan-Out Limiter (TypeScript)
// fanoutLimiter.ts: cap parallel model calls per workflow
type Task<T> = () => Promise<T>;

export class FanoutLimiter {
  private inFlight = 0;

  constructor(
    private readonly maxParallel: number,
    private readonly onReject?: (reason: string) => void,
  ) {}

  // Rejects immediately when the cap is hit; the caller decides whether to queue or degrade tier.
  async run<T>(task: Task<T>): Promise<T> {
    if (this.inFlight >= this.maxParallel) {
      this.onReject?.("fanout_cap");
      throw new Error("Fan-out cap exceeded; queue or degrade tier");
    }
    this.inFlight++;
    try {
      return await task();
    } finally {
      this.inFlight--;
    }
  }
}

// Usage in orchestrator (metrics, urls, and callModel are supplied by the surrounding code).
// With more than 8 urls, the extra calls reject with the fan-out error, so handle that rejection.
const limiter = new FanoutLimiter(8, (r) => metrics.increment("fanout_reject", { reason: r }));
await Promise.all(urls.map((u) => limiter.run(() => callModel(u))));
1.9 Capacity Signatures and Seasonality
Agents inherit business seasonality. Month-end finance agents don't look like Tuesday marketing agents. Build capacity signatures: vectors of (lane, hour, fan_out_p95, tokens_per_workflow_p95). Forecast GPU needs from signatures, not vanity utilization averages.
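A capacity signature can be computed straight from workflow-level records. The sketch below assumes each record is a dict with lane, hour, fan_out, and tokens keys (hypothetical field names) and uses a nearest-rank p95; swap in your observability stack's percentile if it differs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def p95(values: List[int]) -> int:
    """Nearest-rank 95th percentile using integer arithmetic."""
    if not values:
        raise ValueError("p95 of an empty sample is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def capacity_signatures(workflows: Iterable[dict]) -> Dict[Tuple[str, int], tuple]:
    """Group by (lane, hour) and emit (lane, hour, fan_out_p95, tokens_per_workflow_p95)."""
    groups = defaultdict(list)
    for wf in workflows:
        groups[(wf["lane"], wf["hour"])].append(wf)
    return {
        (lane, hour): (
            lane,
            hour,
            p95([w["fan_out"] for w in rows]),
            p95([w["tokens"] for w in rows]),
        )
        for (lane, hour), rows in groups.items()
    }
```

Comparing the month-end signature for a lane against an ordinary Tuesday makes the seasonality argument with numbers instead of anecdotes.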
1.11 Deep Dive: Prefill vs Decode in Multi-Hop Agents
GPU time splits into prefill (process the prompt, populate KV cache) and decode (emit tokens autoregressively). Chat turns with short prompts are decode-heavy. Agent hops with 20k-token tool dumps are prefill-heavy. When eight sub-agents launch together, you create eight prefill storms on shared memory bandwidth—p99 TTFT explodes even if decode QPS looks fine.
Mitigations I've shipped in anger:
- Staggered fan-out: release sub-agents in waves of N, not all-at-once.
- Prompt deduplication: hash parent context; reuse KV where runtime allows.
- Speculative sub-agent cancellation: if planner confidence drops, kill in-flight branches before prefill completes.
Track prefill_ms / (prefill_ms + decode_ms) per workflow class. When that ratio crosses 0.55 for interactive lanes, you're mis-classifying workloads—move heavy hops to batch or compact first.
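The staggered fan-out mitigation is easy to prototype with asyncio. This is a minimal sketch that assumes sub-agents are exposed as zero-argument coroutine factories; run_in_waves is a hypothetical helper name, and a production orchestrator would add cancellation and per-wave timeouts.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def run_in_waves(tasks: List[Callable[[], Awaitable[T]]], wave_size: int) -> List[T]:
    """Release sub-agents in waves of wave_size so prefills do not all start at once."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    results: List[T] = []
    for start in range(0, len(tasks), wave_size):
        wave = tasks[start:start + wave_size]
        # Each wave runs concurrently; the next wave waits for this one to finish.
        results.extend(await asyncio.gather(*(task() for task in wave)))
    return results
```

Waiting for a whole wave is the simplest policy. A sliding window (start the next sub-agent as soon as one finishes) keeps GPUs busier but is harder to reason about during incidents.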
1.12 Tool Latency and the Hidden Queue
Agents spend wall-clock waiting on tools—CRM APIs, SQL, vector DB, human approval. That wait isn't billed to GPUs but still blocks workflows and triggers retry loops that are billed. Instrument tool_wait_ms separately. I've seen workflows where tool wait was 70% of duration but teams only optimized model tier.
Design tool SLAs paired with inference SLAs: if CRM p95 is 4s, don't spawn six parallel CRM calls without a circuit breaker. Use bulkhead patterns per downstream system so one poisoned dependency doesn't convert into a fan-out retry tsunami.
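A circuit breaker per downstream system does not need a framework. The sketch below is a deliberately simple version with assumed defaults (three failures, 30-second reset); the class and method names are hypothetical, and it omits the per-dependency concurrency cap a full bulkhead would add.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calls to a failing dependency until a cool-down has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may proceed (closed, or cool-down elapsed so a probe is allowed)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open (or re-open) the breaker
```

Keep one breaker instance per downstream system (CRM, SQL, vector DB) so a poisoned CRM cannot block SQL-backed hops.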
1.13 Determinism, Temperature, and Cost Variance
Low temperature doesn't guarantee deterministic spend. Tool outputs vary; retrieval sets shift; cache keys miss. For FinOps forecasting, maintain distributions—not point estimates. Report p50 and p95 $/outcome per workflow class in showback dashboards.
For eval and regression lanes, pin temperatures and model builds. For creative marketing agents, allow variance but cap spend with hard max_cost_usd on the router.
1.14 Regulatory and Residency Impacts on Workload Shape
Restricted data often can't use public API caching or cross-region failover. That forces higher prefill repetition and blocks shared prefix caches across tenants. Factory planners must add residency overhead coefficients to capacity models—typically 15–30% more tokens per outcome for strictly pinned workloads.
1.15 Observability Anti-Patterns
- Aggregating all models into one "AI spend" line item.
- Logging only final hop tokens.
- Ignoring failed workflows in success-rate SLOs.
- Using average fan-out instead of p95 fan-out for capacity.
Fix these before you buy more GPUs. Otherwise you're funding noise.
1.16 Workshop: Building a Workload Profile in One Sprint
Week 1: instrument workflow_id and span attributes. Week 2: sample 500 production workflows per class. Week 3: compute F, G, R, C distributions and draft lane mapping. Week 4: present capacity signature to finance with scenario bands. This beats six months of "we'll monitor it later."
1.17 Case Study: Month-End Fan-Out Cliff
A fintech client ran claims reconciliation agents on the 1st of each month. Pilot metrics looked tame: 12 RPS average. Production month-end hit 180 RPS equivalent when fan-out expanded to 24 parallel researchers per coordinator. p99 workflow latency went from 38s to 11 minutes; GPUs weren't down—they were prefill-saturated.
We fixed it in three moves without buying hardware first: staggered fan-out waves of six, compaction hops before planner re-entry, and moving reconciliation to batch lanes with 4-hour SLA. p99 returned to 52s; $/outcome dropped 31% because we stopped infinite frontier retries on timeout.
1.18 KV Cache Fragmentation
When agents share partial context, fragmentation still happens if each sub-agent tweaks system prompts. Standardize system prompts per role (researcher, verifier, summarizer) across squads. Platform team owns three canonical prompts, not three hundred snowflakes.
1.19 Token Budgeting at Orchestration Design Time
Product specs should include token budgets alongside functional requirements: "This workflow may consume at most 80k tokens at p95." Architects sign off before implementation. This prevents surprise graphs that no router can tier away.
1.20 Interplay with Agentic SDLC
Coding agents from your Agentic SDLC playbook generate PRs; factory metrics prove whether those agents are affordable at scale. Link CI agent spans to the same workflow_id scheme so engineering and product agents appear in one FinOps plane.
1.21 Extended Comparison: Batch Scheduling Policies
| Policy | Strength | Weakness |
|---|---|---|
| FIFO | Simple fairness | Head-of-line blocking |
| Priority by SLO | Protects interactive | Starves batch if mis-tuned |
| Cost-aware | Minimizes $/token | Complex to explain |
| Deadline-aware | Meets cutoffs | Requires accurate duration est. |
Most factories start FIFO, then add priority lanes—not the other way around.
1.22 When to Reject Work
Healthy factories say no. If tokens_in_flight exceeds global watermark and workflow class isn't critical, return 429 with retry-after and estimated queue time. Silent degradation breeds mistrust; explicit queueing breeds trust.
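In code, the admission decision is a single comparison plus an honest response. The sketch below is framework-agnostic and returns a plain dict; the QUEUE_FULL error code, the function name, and the response shape are assumptions, while Retry-After is the standard HTTP header.

```python
import math
from typing import Collection


def admit_or_reject(
    tokens_in_flight: int,
    global_watermark: int,
    workflow_class: str,
    critical_classes: Collection[str],
    estimated_queue_s: float,
) -> dict:
    """Reject non-critical work with 429 when the factory is over its watermark."""
    over_watermark = tokens_in_flight > global_watermark
    if over_watermark and workflow_class not in critical_classes:
        return {
            "status": 429,
            "headers": {"Retry-After": str(math.ceil(estimated_queue_s))},
            "body": {"error": "QUEUE_FULL", "estimated_queue_seconds": estimated_queue_s},
        }
    return {"status": 200, "headers": {}, "body": {"admitted": True}}
```

Critical classes bypass the watermark here. In practice you would still cap them with a second, higher limit so a runaway critical workflow cannot take the factory down.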
1.23 The Physics of Concurrent Agent Graphs
Think of your factory as a queueing network. Each orchestration tick is a node; each model call is a server with service time dominated by prefill length and decode tokens. Fan-out multiplies arrivals at child servers simultaneously. The queueing theory you learned in undergrad still applies: as utilization ρ → 1, latency explodes. Agents push ρ toward 1 faster than chat because they parallelize intentionally.
That's why lane isolation is non-negotiable. Interactive ρ should stay below 0.7 sustained; batch can run hotter because SLA is measured in hours. When product demands "run everything now," translate the argument into ρ numbers executives understand.
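To translate ρ into something an executive can feel, a textbook single-server model is enough. The sketch below assumes an M/M/1 queue, which real GPU pools with batching only approximate, so treat the numbers as intuition rather than a forecast.

```python
def mm1_latency_multiplier(rho: float) -> float:
    """Mean time in system divided by mean service time for an M/M/1 queue: 1 / (1 - rho)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - rho)


for rho in (0.5, 0.7, 0.9, 0.95):
    print(f"rho={rho:.2f} -> {mm1_latency_multiplier(rho):.1f}x service time")
```

Under this model, ρ = 0.7 costs about 3.3x the bare service time, while ρ = 0.95 costs about 20x. That gap is the argument for keeping interactive lanes cooler than batch.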
1.24 Context Windows as a Budget, Not a Feature
Vendors market 200k+ context windows. Operationally, each additional token in prefill is a microsecond tax multiplied by cluster width. Teach product managers to write specs in token budgets and evidence handles, not "send the whole PDF." The factory enforces budgets at router ingress with structured errors when exceeded—fail fast, don't truncate silently without audit logs.
1.25 Cross-Functional Review Cadence
Weekly 30-minute review: platform, FinOps, one product owner. Agenda: top three workflows by spend, one incident, one optimization shipped. Continuity beats quarterly hero projects.
1.26 Long-Horizon Trends
Agents will get more autonomous, not less. Fan-out depth will rise unless regulators or economics constrain it. Factories that instrument workflow science now will adapt; those that bolt GPUs onto chat endpoints will replatform under duress.
1.27 Extended Narrative: Profiling in Practice
When I arrive on site, the first question isn't "how many H100s do you own?" It's "show me one workflow trace." Most teams can't. We spend week one instrumenting orchestrators (often LangGraph, Temporal, or custom asyncio graphs) and week two sampling production. The aha moment is always the same: a "simple" customer service agent actually fans out to fourteen model calls because retrieval, rerank, draft, verify, and compliance checks each became separate hops without anyone drawing the graph.
We label hops with agent_role even when the same binary serves multiple roles via prompts. Roles matter for tiering: verifiers stay frontier longer; formatters drop to small tiers. We measure attachment factor—how many tokens tools add per hop. CRM JSON is the usual villain. Compression policies (field allowlists, stable key ordering) cut tokens and improve cache hits because hashes stabilize.
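A compression policy of that kind fits in a few lines. The sketch below assumes a flat tool record and a per-tool field allowlist; compact_tool_json and the example field names are hypothetical.

```python
import json
from typing import Iterable


def compact_tool_json(record: dict, allowlist: Iterable[str]) -> str:
    """Keep only allowlisted fields and serialize with stable key order and no padding."""
    kept = {key: record[key] for key in allowlist if key in record}
    return json.dumps(kept, sort_keys=True, separators=(",", ":"))


# Two CRM responses that differ only in key order and noisy fields
# serialize to the same string, so prompt hashes and cache keys match.
a = compact_tool_json({"name": "Ada", "id": 7, "debug_trace": "abc"}, ["id", "name"])
b = compact_tool_json({"id": 7, "audit_blob": "zzz", "name": "Ada"}, ["id", "name"])
```

Because the serialized form is byte-identical across equivalent responses, any prefix or memoization cache keyed on the text gets the hit it would otherwise miss.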
Batch vs online isn't religious war; it's SLO war. I ask product: "Is a human waiting?" If yes, online lane with aggressive caps. If no, batch lane with cost-aware scheduling. Mixed workflows get step-level routing inside the orchestrator—don't classify entire workflows as batch just because 80% of steps are offline; one interactive confirmation step forces online protection for that segment.
Finally, we document retry economics. A 10% retry rate on a 50k-token workflow adds 5k tokens in expectation (0.10 × 50k), but tail risk is higher because retries often escalate tier or widen context. Model retries as Bernoulli trials; plan capacity for tails, not means.
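The retry arithmetic, and the tail it hides, can be checked with a short script. This sketch assumes at most one retry per workflow and independent retries across workflows, which is optimistic when a shared dependency fails; both function names are illustrative.

```python
from math import comb


def expected_retry_tokens(workflow_tokens: int, retry_rate: float) -> float:
    """Mean extra tokens per workflow if each workflow retries at most once."""
    return workflow_tokens * retry_rate


def prob_at_least(n: int, p: float, m: int) -> float:
    """Binomial tail: probability that at least m of n concurrent workflows retry."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


mean_extra = expected_retry_tokens(50_000, 0.10)  # about 5k tokens, as in the text
tail = prob_at_least(20, 0.10, 5)                 # chance that 5+ of 20 workflows retry together
```

Size headroom from the tail probability you are willing to tolerate, then add margin for correlated failures that the independence assumption ignores.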
1.10 Chapter 1 Synthesis
Workload science is the prerequisite for every architecture diagram that follows. Profile fan-out, segregate lanes, treat caching as a system, compact context deliberately, and split batch from online as if they were two products—because to your users' SLOs, they are.
Chapter 2: Reference Architectures
2.1 Factory Layers: Control Plane vs Data Plane
A reference AI factory splits cleanly into control plane (policy, routing, registry, FinOps tags) and data plane (GPU/TPU workers, model servers, embedding indexes). Product teams submit workflows to the control plane API; they never pin to individual pods. This mirrors how Kubernetes separated scheduling from container execution—except our "pods" are quantized model replicas with wildly different memory footprints.
Core components:
- Ingress gateway — authN/Z, rate limits, request normalization.
- Workflow router — selects model tier, lane, and residency rules per hop.
- Model registry — versions, capability matrix, cost coefficients.
- Scheduler — binds requests to pools (online H100, batch A100, edge NPU).
- Observability bus — spans, dollars, watts (where available).
Practitioner insight: Keep the router stateless; put workflow state in a durable orchestration store. Routers scale horizontally; sticky GPU sessions do not.
2.2 Routing Layer: Policy, Not Just Load Balancing
Routing is where FinOps meets security. Inputs: workflow_class, data_sensitivity, latency_slo, max_cost_usd, required_capabilities (tools, vision, JSON mode). Outputs: model_tier, region, lane, fallback_chain.
Implement fallback chains explicitly: if frontier times out, mid-tier summarizer completes with degraded quality flag—don't infinite-retry frontier and torch spend.
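An explicit fallback chain can be as plain as a loop that tries each tier once. In this sketch, call_model is a hypothetical callable supplied by the caller and timeouts are assumed to surface as TimeoutError; adapt both to your client library.

```python
from typing import Callable, List


def complete_with_fallback(chain: List[str], call_model: Callable[[str, str], str], prompt: str) -> dict:
    """Try each tier in order, once each; flag the result as degraded if a fallback answered."""
    errors = []
    for position, tier in enumerate(chain):
        try:
            text = call_model(tier, prompt)
        except TimeoutError as exc:
            errors.append((tier, str(exc)))
            continue  # move down the chain instead of retrying the same tier
        return {"text": text, "tier": tier, "degraded": position > 0, "errors": errors}
    raise RuntimeError(f"all tiers in the fallback chain failed: {errors}")
```

Because every tier is attempted exactly once, worst-case spend per hop is bounded by the sum of the chain, and the degraded flag lets downstream verifiers treat the answer with suspicion.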
2.3 Model Tiering Matrix
| Tier | Typical use | Latency target | Cost index (illustrative) |
|---|---|---|---|
| Frontier | Planning, complex reasoning | < 8s TTFT | 1.00× |
| Mid | Tool synthesis, summarization | < 3s TTFT | 0.35× |
| Small | Classification, extraction | < 1s TTFT | 0.08× |
| Local CPU | PII regex, format validation | < 50ms | ~0× API |
Tiering isn't "use cheap models whenever." It's risk-adjusted tiering: high-stakes hops stay frontier until verifier confidence exceeds threshold.
2.4 Edge vs Core: Where Inference Should Run
Core datacenter pools host large models, big KV caches, and centralized FinOps. Edge nodes (factory floor, retail store, regulated VPC) run small models for redaction, intent detection, and offline-first buffering. Agents spanning edge and core need synchronization contracts: edge agents emit signed summaries; core agents never pull raw PII back across the boundary without policy tokens.
Trend note (vendor-neutral): 2026 hardware roadmaps emphasize higher tokens per watt and rack-scale power orchestration—industry discussion often cites next-gen accelerators (e.g., Vera Rubin-class) and datacenter power fabrics (DSX-style orchestration) as drivers for denser agent factories. Treat these as capacity planning signals, not purchase mandates. Your factory abstractions should survive SKU changes.
2.5 Embedding and Retrieval Tier
Agents spend tokens re-reading retrieval results. Architect a dedicated embedding tier with:
- Hybrid search (dense + lexical) behind a stable MCP/tool interface.
- Chunk size standards (512–1k tokens) with metadata for citation.
- Re-rankers on mid-tier models before context injection.
2.6 Multi-Region and Residency
Regulated workloads require data-plane pinning. The router enforces allowed_regions per workflow class. Failover across regions for stateful agents is painful—design idempotent workflow steps and externalize checkpoints.
2.7 Codelab: Routing Policy Engine (Python)
# routing_policy.py: declarative tier selection
from dataclasses import dataclass


@dataclass
class RouteRequest:
    workflow_class: str
    sensitivity: str  # public | internal | restricted
    latency_slo_ms: int
    max_cost_usd: float
    needs_vision: bool = False


# Relative cost index per tier (matches the tiering matrix in 2.3)
TIER_COST = {"frontier": 1.0, "mid": 0.35, "small": 0.08}


def select_tier(req: RouteRequest) -> str:
    # Rules are evaluated top to bottom; the first match wins.
    if req.sensitivity == "restricted" and req.workflow_class == "underwriting":
        return "frontier"
    if req.latency_slo_ms < 1500:
        return "small"
    if req.max_cost_usd < 0.02:
        return "mid"
    return "frontier" if req.needs_vision else "mid"
2.8 Codelab: Factory Client SDK (TypeScript)
// factoryClient.ts: single entry for agent hops
export interface HopRequest {
  workflowId: string;
  spanId: string;
  agentRole: string;
  messages: Array<{ role: string; content: string }>;
  workflowClass: string;
  maxCostUsd: number;
}

export class FactoryClient {
  constructor(private readonly baseUrl: string, private readonly token: string) {}

  async complete(hop: HopRequest): Promise<{ text: string; tier: string; usd: number }> {
    const res = await fetch(`${this.baseUrl}/v1/complete`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${this.token}`,
        "Content-Type": "application/json",
        "X-Workflow-Id": hop.workflowId,
      },
      body: JSON.stringify(hop),
    });
    if (!res.ok) throw new Error(`factory error ${res.status}`);
    return res.json();
  }
}
2.9 Failure Domains and Blast Radius
Shard factories by blast radius: payments-agents never shares GPU pools with marketing-agents. Shared infrastructure is fine; shared scheduler debt is not.
2.11 Control Plane APIs Product Teams Actually Use
Expose a single POST /v1/hops/complete with explicit policy headers. Product teams shouldn't pick GPU types—they declare intent: workflow_class, sensitivity, max_cost_usd, latency_slo_ms. The factory returns tier, region, lane, usd_estimate, and trace_id.
Version the API. Pin breaking changes to quarterly trains so agent frameworks don't fracture.
2.12 High Availability Without Sticky Sessions
Stateful KV sessions tempt architects to use sticky sessions on load balancers. That complicates failover. Prefer externalized session stores or recomputable context with durable scratchpads. When sticky sessions are unavoidable, document session migration playbooks during node drains.
2.13 Heterogeneous Pools: CPU, GPU, NPU
Not every hop needs a GPU. Regex validators, JSON schema checkers, and lightweight classifiers belong on CPU pools—or WASM sandboxes. NPUs at the edge handle embedding micro-batches efficiently. The reference architecture should show three elastic pools with a unified router, not a monolith icon labeled "AI."
2.14 Network Egress and Egress Cost
Tool-heavy agents generate egress charges that dwarf token costs in some SaaS setups. Track egress per workflow. Place tool gateways in the same region as data sources. Cache tool responses where policy allows.
2.15 Vera Rubin / DSX as Planning Signals (Not Ads)
Industry chatter in 2026 highlights next-gen accelerators with improved matrix ops per watt and datacenter power orchestration that treats racks as unified power domains. Regardless of vendor, your abstraction layer should record tokens_per_watt and $/trillion_tokens from benchmarks you run—not slides you receive. Schedule re-benchmarks quarterly; agent mixes shift faster than silicon roadmaps.
2.16 Factory Maturity Model
| Stage | Characteristics |
|---|---|
| 0 Ad hoc | One endpoint, no tags |
| 1 Metered | workflow_id, basic dashboards |
| 2 Routed | tiers, lanes, fallbacks |
| 3 FinOps | $/outcome, showback |
| 4 Resilient | shadow/canary, workflow SLOs |
| 5 Optimized | automated tier tuning, predictive capacity |
Most enterprises claiming "AI factory" are stage 1–2. Be honest in assessments.
2.17 Partner Integration: Orchestrators and MCP
If you deploy MCP tool meshes, the factory router should sit above tool execution, not behind it—policy first, then tools, then models. Cross-link your MCP gateway standards with factory identity so audit logs correlate tool calls and inference spans under one workflow_id.
2.18 Disaster Recovery
DR for factories isn't "restore the model weights." It's: restore routing tables, tier maps, budget configs, observability pipelines, and orchestrator checkpoints. Run DR drills that fail over routers while keeping data-plane residency constraints intact.
2.19 Reference Deployment Topologies
- Topology A (single region hub): simplest, best for mid-market.
- Topology B (active-active dual region): for resiliency with residency constraints per tenant.
- Topology C (hub + edge satellites): manufacturing, retail, healthcare bedside.
Pick a topology before SKU shopping.
2.20 API Gateway vs Service Mesh
Gateways handle auth, WAF, rate limits. Mesh handles mTLS and fine-grained service policies. Inference routers often live as a control plane service behind the gateway, not inside the mesh data path—keep hot paths short.
2.21 Model Registry Fields
Registry entries should include: model_id, build, context_limit, supports_tools, supports_vision, cost_coefficient, residency, deprecation_date, eval_scorecard_id. Deprecation dates force migration conversations early.
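Those fields map naturally onto a small immutable record. The field names below come from the list above; the types, the tuple-of-regions encoding for residency, and the helper method are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class RegistryEntry:
    model_id: str
    build: str
    context_limit: int
    supports_tools: bool
    supports_vision: bool
    cost_coefficient: float            # relative cost index, e.g. 0.35 for a mid tier
    residency: Tuple[str, ...]         # regions where this build may serve traffic
    deprecation_date: Optional[date]   # None means no retirement scheduled yet
    eval_scorecard_id: str

    def days_until_deprecation(self, today: date) -> Optional[int]:
        """Days left before retirement; negative once the date has passed."""
        if self.deprecation_date is None:
            return None
        return (self.deprecation_date - today).days
```

A nightly job that lists entries with fewer than, say, 90 days left is usually enough to start the migration conversation on time.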
2.22 Routing Experiments (A/B)
Run controlled experiments on tier policies with guardrails: max 5% traffic, automatic rollback on $/outcome regression. Experiments are how you learn if mid-tier can handle planner hops for specific classes—opinions don't scale.
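Both guardrails can be expressed as pure functions. The sketch below assumes deterministic bucketing by workflow_id and a simple mean comparison for the rollback test; the function names and the tolerance parameter are hypothetical, and a real rollout would use a proper statistical test.

```python
import hashlib
from statistics import fmean
from typing import Sequence


def in_experiment(workflow_id: str, traffic_fraction: float = 0.05) -> bool:
    """Deterministically assign a workflow to the experiment arm (default cap: 5%)."""
    bucket = int(hashlib.sha256(workflow_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < traffic_fraction * 10_000


def should_rollback(control_usd: Sequence[float], treatment_usd: Sequence[float], tolerance: float = 0.0) -> bool:
    """Roll back when mean $/outcome in the treatment arm regresses beyond the tolerance."""
    if not control_usd or not treatment_usd:
        raise ValueError("both arms need at least one completed workflow")
    return fmean(treatment_usd) > fmean(control_usd) * (1 + tolerance)
```

Hashing the workflow_id keeps every hop of a workflow in the same arm, which matters because $/outcome is measured per workflow, not per call.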
2.23 Private vs Public Model Paths
Hybrid factories route restricted workflows to private weights; general tasks to public APIs with redaction gateways in between. Document data flow diagrams for security reviews—auditors ask every time.
2.24 Cost of Complexity
Every routing dimension adds operational burden. Start with three workflow classes and three tiers. Expand when showback proves pain, not when architects get bored.
2.25 Security Architecture Overlays
Place policy enforcement points before model calls: PII scanners, prompt injection classifiers (small models), tool allowlists, output filters for restricted classes. Security isn't a post-hoc filter on completions; it's part of routing decisions (deny, degrade, require_hitl).
2.26 Interoperability Standards
Push internal teams toward OpenTelemetry trace context propagation. Align tool interfaces with MCP where possible so agent frameworks swap without rewriting factory clients. Standards reduce glue work more than another Kubernetes operator.
2.27 Scaling the Control Plane
Control plane services are cheap relative to GPUs—but they must scale horizontally. Use read replicas for registry and policy stores; cache hot policy bundles at routers with version hashes. Stale policy is a security incident; overloaded policy servers are an availability incident.
2.28 Architectural Review Checklist
Before production launch: lanes defined? tiers mapped? residency enforced? budgets attached? rollback tested? shadow metrics compared? If any answer is no, you're not ready—regardless of demo applause.
2.29 Extended Narrative: Designing the Router
The router is the moral center of the factory. It encodes what you value: safety, cost, speed, quality. Start with a rules engine you can read in a code review; add ML routing only when rules leave money on the table and you have labeled outcomes. Rules should live in git, versioned, reviewed by security and FinOps, deployed like any other service.
Fallback chains must be tested. If mid-tier fails open to frontier automatically, you'll never know mid-tier was broken until the invoice arrives. Tests should inject failures: latency spikes, 429 storms, garbage outputs. The router should degrade gracefully—shorter answers, narrower tools, delayed fan-out—before hard failing user workflows.
Edge vs core decisions are data residency decisions first, latency second. I've seen edge deployments justified for latency while the real win was keeping patient data off the WAN. Be honest in architecture decision records.
When industry news discusses Vera Rubin-class accelerators or DSX power fabrics, translate hype into benchmark tasks: your top five workflows, measured on candidate kit, reported as $/outcome and tokens/watt. Everything else is marketing until it passes your harness.
2.30 Service Level Objectives for Routing
The router itself needs SLOs: policy evaluation p99 <20ms, registry lookup p99 <10ms, decision availability 99.95%. Router outages stall every agent—treat control plane as tier-0.
2.31 Data Planes for Embeddings vs Generation
Split embedding inference from autoregressive generation pools. Embedding bursts from retrieval shouldn't delay decode on interactive lanes. Publish separate capacity signatures for each.
2.32 Quota and Throttle Design
Per-tenant quotas: max concurrent workflows, max tokens in flight, max fan-out depth. Expose quota headers in API responses so product UIs explain waits instead of mysterious hangs.
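The quota check and the headers it exposes could look roughly like the sketch below. The `X-Quota-*` header names and the `CONCURRENCY_CAP` and `TOKENS_IN_FLIGHT_CAP` reason codes are assumptions made for illustration; only `FANOUT_CAP` appears elsewhere in this playbook.

```python
from dataclasses import dataclass


@dataclass
class TenantQuota:
    max_concurrent_workflows: int
    max_tokens_in_flight: int
    max_fanout_depth: int


@dataclass
class TenantUsage:
    concurrent_workflows: int
    tokens_in_flight: int


def admit(quota: TenantQuota, usage: TenantUsage, fanout_depth: int,
          requested_tokens: int) -> tuple[bool, dict[str, str]]:
    """Return (admitted, headers). Header names are illustrative assumptions."""
    headers = {
        "X-Quota-Workflows-Remaining": str(
            max(0, quota.max_concurrent_workflows - usage.concurrent_workflows)),
        "X-Quota-Tokens-Remaining": str(
            max(0, quota.max_tokens_in_flight - usage.tokens_in_flight)),
        "X-Quota-Fanout-Limit": str(quota.max_fanout_depth),
    }
    if fanout_depth > quota.max_fanout_depth:
        headers["X-Quota-Denied-Reason"] = "FANOUT_CAP"
        return False, headers
    if usage.concurrent_workflows >= quota.max_concurrent_workflows:
        headers["X-Quota-Denied-Reason"] = "CONCURRENCY_CAP"
        return False, headers
    if usage.tokens_in_flight + requested_tokens > quota.max_tokens_in_flight:
        headers["X-Quota-Denied-Reason"] = "TOKENS_IN_FLIGHT_CAP"
        return False, headers
    return True, headers
```

Returning the remaining-quota headers on both admitted and denied requests lets a product UI show "waiting for capacity" before the tenant actually hits the cap.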
2.33 Testing Reference Architectures
Contract tests between orchestrator and factory client. Golden routing decisions for fixture requests. Chaos tests: registry down, policy store stale, GPU pool 429 storm.
2.34 Documentation for Product Teams
Developer portal pages: how to declare workflow classes, how to read showback tags, how to request tier exceptions, how to interpret errors (BUDGET_EXCEEDED, RESIDENCY_DENIED, FANOUT_CAP).
2.35 Platform Boundaries
Clarify what central platform owns vs what product squads own. Platform owns router, pools, observability baselines; squads own agent graphs within policy. Boundary disputes cause shadow endpoints.
2.36 Future-Proofing Routing Schema
Extensible policy schema with version field. Add dimensions without breaking clients: carbon_budget, jurisdiction, experiment_id as optional fields.
2.37 Reference Architecture Variants for Regulated Industries
Variant R1: no public API paths, all weights on-prem, HSM-backed keys. Variant R2: hybrid with redaction gateway. Variant R3: sovereign cloud regions only. Document each with network diagrams for auditors.
2.38 Cost-Aware Routing Simulation
Offline simulator: feed it a week of traces, try tier policies, output $/outcome distributions. Use it before changing production weights; avoid live experiments on revenue workflows without a backup.
2.39 Closing Architecture Principles
Keep hot paths short, policies versioned, pools isolated, observability mandatory, and vendors interchangeable. Architectures that fail these principles don't survive the first hardware refresh cycle.
2.40 Platform Engineering Operating Model
Running a factory is a product, not a project. Staff a platform product manager, three to six senior platform engineers, an SRE liaison, and a FinOps analyst embedded at least quarter-time. Roadmap items come from showback pain, incident retros, and migration milestones—not vendor roadshows.
Sprint rhythm: two-week delivery with one week stabilization each quarter for game days and DR. OKRs tie to workflow SLO attainment and $/outcome improvement, not 'models deployed.'
Internal customers rate the factory via quarterly surveys: time to onboard new workflow class, clarity of errors, trust in chargeback. Low scores trigger usability epics, not more policy PDFs.
Partner with security early on router policies. A late security review rewrites routing and delays migrations by months. Security champions attend factory council.
Finally, celebrate decommissioning legacy endpoints. Each decommissioned chat cluster means less operational drag and clearer cost attribution.
2.41 Reference Architecture Review Questions
Before any executive demo, answer these in writing: Where does policy enforce residency? What happens when frontier tier times out? How are tool calls audited? What is max fan-out? How is $/outcome computed? Weak answers predict production pain.
2.42 Building vs Buying Control Planes
Buy components (observability, GPU cloud), build differentiation (router, FinOps tags, workflow lanes). Over-buying "AI suites" often reintroduces chat-centric metrics. Under-building governance reintroduces shadow APIs.
2.43 Technical Debt in Routing Rules
Rules accumulate exceptions: "workflow X may use frontier on Tuesdays." Schedule rule retirements. Debt-laden rules confuse simulators and humans alike.
2.44 Multi-Tenant Noisy Neighbor Controls
Per-tenant concurrency caps and token debt prevent one tenant's marketing campaign from starving another's payroll agents. Noisy neighbor stories are common in first multi-tenant factories.
2.45 Architecture Documentation Set
Maintain: C4 context diagram, data flow for PII, sequence for happy path, sequence for degrade path, tier matrix CSV, DR topology. Update within one sprint of changes or docs lie.
2.10 Chapter 2 Synthesis
Reference architecture is deliberately boring: stateless routers, explicit tiers, separated lanes, edge/core boundaries, and observability everywhere. Boring factories survive Black Friday fan-out.
Chapter 3: FinOps Model & Showback
3.1 From Tokens to Outcomes: The Only Metric Executives Trust
FinOps for chat asked: How much per thousand messages? FinOps for agents must ask: How much to complete the workflow successfully? Define $/outcome = (total_inference_cost + tool_cost + human_review_cost) / successful_workflows. The parentheses matter: all three cost terms are divided by successful workflows, not just human review.
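A small worked example of the $/outcome definition, reading it as the sum of all three cost terms divided by successful workflows. The dollar figures are invented for illustration and do not come from a real factory.

```python
def usd_per_outcome(inference_usd: float, tool_usd: float,
                    human_review_usd: float, successful_workflows: int) -> float:
    """All three cost terms are summed before dividing by successful workflows."""
    if successful_workflows <= 0:
        raise ValueError("no successful workflows: $/outcome is undefined")
    return (inference_usd + tool_usd + human_review_usd) / successful_workflows


# Illustrative month: 1,000 successful workflows.
baseline = usd_per_outcome(9_000, 1_000, 2_000, 1_000)

# Same month after a 30% cut in inference spend that caused more
# escalations (higher review cost) and fewer successful workflows.
cheaper_tokens = usd_per_outcome(6_300, 1_000, 4_500, 950)
```

In this made-up scenario the second figure comes out higher than the first even though inference spend fell, which is the pattern to watch for when token reduction is celebrated on its own.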
I've watched teams celebrate 30% token reduction while $/outcome rose—because cheap models triggered more retries and more human escalations. Optimize the denominator and numerator together.
Practitioner insight: Publish a monthly "factory P&L" per product line: outcomes, success rate, $/outcome, and waste bucket (failed workflows, overruns). Transparency beats surprise invoices.
3.2 Cost Allocation Tags and Chargeback
Tag every span: cost_center, product_id, workflow_class, environment. Chargeback models:
- Showback (default): teams see costs, no invoice—builds awareness.
- Chargeback: internal invoices fund central GPU pools.
- Hybrid: showback until spend exceeds threshold, then chargeback with caps.
| Model | When to use | Behavioral effect |
|---|---|---|
| Showback | Early agent adoption | Visibility without blocking teams |
| Chargeback | Mature factories, GPU scarcity | Forces tier discipline |
| Hybrid | Enterprise politics | Balances innovation and accountability |
3.3 Unit Economics and Scenario Planning
Build scenarios: base, growth, black swan. Variables: workflows/month, fan-out depth, frontier share, cache hit rate, GPU $/hour. Scenario planning prevents the classic board-meeting trap—"we need 4× GPUs next quarter" without ranges.
3.4 Budget Guardrails and Token Debt
Implement soft and hard budgets per squad. Soft = alerts; hard = orchestrator rejects new fan-out branches unless an override role approves. Pair this with token debt tracking of in-flight work, not just monthly totals.
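The soft/hard split can be sketched as a single decision function the orchestrator calls before opening a fan-out branch. This sketch treats token debt as the estimated cost of work already in flight, which is an assumption about the term; `check_fanout` and `BudgetDecision` are hypothetical names.

```python
from enum import Enum


class BudgetDecision(Enum):
    ALLOW = "allow"
    ALLOW_WITH_ALERT = "allow_with_alert"  # soft budget crossed, or hard with override
    REJECT = "reject"                      # hard budget crossed, no override


def check_fanout(spent_usd: float, in_flight_usd: float, branch_estimate_usd: float,
                 soft_usd: float, hard_usd: float,
                 has_override: bool = False) -> BudgetDecision:
    """Decide whether the orchestrator may open a new fan-out branch.

    Settled spend, in-flight work (the token debt), and the estimate for
    the new branch are added up and compared with the squad's budgets.
    """
    projected = spent_usd + in_flight_usd + branch_estimate_usd
    if projected > hard_usd:
        return BudgetDecision.ALLOW_WITH_ALERT if has_override else BudgetDecision.REJECT
    if projected > soft_usd:
        return BudgetDecision.ALLOW_WITH_ALERT
    return BudgetDecision.ALLOW
```

Including in-flight cost in the projection is what stops a squad from blowing through the hard budget with branches that were all admitted before any of them finished.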
3.5 Codelab: Cost Estimator (Python)
# cost_estimator.py: estimate $/outcome from span aggregates
from dataclasses import dataclass

# USD per 1,000 tokens by tier; one rate covers input and output in this sketch.
PRICE_PER_1K = {"frontier": 0.015, "mid": 0.004, "small": 0.0009}

@dataclass
class SpanCost:
    tier: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0

def span_usd(span: SpanCost) -> float:
    # Cached input tokens are treated as free; clamp so a bad count cannot go negative.
    billable_in = max(0, span.input_tokens - span.cached_tokens)
    rate = PRICE_PER_1K[span.tier]  # raises KeyError on an unknown tier
    return (billable_in + span.output_tokens) / 1000 * rate

def workflow_usd(spans: list[SpanCost]) -> float:
    # A workflow's cost is the sum of its spans.
    return sum(span_usd(s) for s in spans)
3.6 Codelab: Showback Reporter (TypeScript)
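// showback.ts: roll up spans to one row per squad per month
// Assumes `success` carries the workflow-level outcome on each of its spans.
type Span = { squad: string; usd: number; workflowId: string; success: boolean };

export function monthlyShowback(spans: Span[]) {
  const bySquad = new Map<string, { usd: number; workflows: Set<string>; ok: Set<string> }>();
  for (const s of spans) {
    const row = bySquad.get(s.squad) ?? { usd: 0, workflows: new Set<string>(), ok: new Set<string>() };
    row.usd += s.usd;
    row.workflows.add(s.workflowId);
    // Count successful workflows, not successful spans, so the rate cannot exceed 1.
    if (s.success) row.ok.add(s.workflowId);
    bySquad.set(s.squad, row);
  }
  return [...bySquad.entries()].map(([squad, v]) => ({
    squad,
    usd: v.usd,
    workflows: v.workflows.size,
    successRate: v.ok.size / v.workflows.size,
    // With zero successes there is no $/outcome; report null rather than a misleading figure.
    usdPerOutcome: v.ok.size > 0 ? v.usd / v.ok.size : null,
  }));
}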
3.7 Waste Buckets: Failed, Abandoned, Over-Tiered
Classify waste:
- Failed workflows — burn tokens, no outcome (fix reliability).
- Abandoned workflows — user timeout (fix UX/latency).
- Over-tiered hops — frontier where mid suffices (fix router).
3.8 Executive Narrative and Case Study Pattern
Tie factory metrics to revenue and risk: faster claims processing, fewer compliance escapes. Reference anonymized case study patterns where migration + tiering reduced $/outcome 35–50% without success-rate drop.
3.9 Contracting with Cloud and Silicon Vendors
Negotiate committed use with burst buffers for agent seasonality. Include observability rights (per-minute GPU metrics) in contracts—vendor averages hide fan-out spikes.
3.11 Amortizing CapEx in $/Outcome
Cloud inference is OpEx-heavy; on-prem clusters blend CapEx amortization, power, cooling, and staff. For hybrid factories, build a blended rate card per GPU-hour that finance accepts, then let the cost pipeline allocate span dollars against that card. Without blended rates, product teams compare apples (API list price) to oranges (owned H100s).
3.12 Chargeback Politics and Productivity
Chargeback can starve innovation if applied too early. I recommend 12 weeks of showback with weekly office hours before first invoices. Pair chargeback with guardrailed sandboxes so teams can experiment without production budgets.
3.13 Outcome Definitions That Don't Lie
Define "successful outcome" per workflow class with product and legal sign-off:
- Insurance claim: adjudicated status in {approved, denied} with audit trail.
- Code migration agent: PR merged with green CI.
- Research agent: report delivered with citations meeting policy.
Failed or partial outcomes must still record spend in a waste bucket—otherwise teams game success flags.
3.14 FinOps Rituals
Monthly: factory P&L review. Quarterly: scenario replan against actual F and G distributions. Annually: vendor commit renegotiation using your tokens-per-watt benchmarks.
3.15 Integration with Enterprise GL
Map cost_center tags to GL accounts. Export CSV or API feeds finance can ingest. The chargeback flowchart isn't vanity—it prevents manual spreadsheet hell every month.
3.16 Sensitivity Analysis Template
Variables to stress-test:
- Frontier share +10 pts
- Cache hit rate -15 pts
- Fan-out p95 +4
- Tool latency +2s
- Failure rate +3 pts
Present tornado charts to executives. They understand ranges better than point forecasts.
3.17 Human-in-the-Loop Economics
HITL isn't free. Model a fully loaded reviewer minute and add to $/outcome when workflows escalate. Sometimes a slightly more expensive tier eliminates escalations—net win on $/outcome even if tokens rise.
3.18 Case Study Narrative (Anonymized Pattern)
A global insurer moved claims agents from a single frontier endpoint to tiered factory routing with compaction hops. Tokens per workflow dropped 22%, but outcomes per hour rose 41% because p99 latency improved and escalations fell. $/outcome fell 38%. Reference similar patterns in your case study portfolio when pitching executives.
3.19 Finance Partnership Checklist
- Agree on blended GPU-hour rate
- Define outcome catalog
- Align calendar close dates for chargeback exports
- Establish variance threshold alerts (>10% MoM)
- Sponsor executive readout quarterly
Without finance partnership, FinOps remains a dashboard nobody funds.
3.20 Token Price Volatility
API list prices change. Maintain price version tables in the cost pipeline; backfill last quarter when prices drop so teams see goodwill credits in showback. Transparency builds trust when vendors cut prices.
3.21 Squad-Level Coaching
When showback highlights a squad with high $/outcome and low success rate, assign platform coach for two sprints—fix graphs, not blame people. Culture matters for sustainable factories.
3.22 Reserved Capacity vs On-Demand
Model reserved GPU blocks for baseline signatures; burst on-demand for seasonality. FinOps scenario planning should include commit utilization—finance hates 40% reserved idle.
3.23 Attribution Edge Cases
Shared platform services (embedding index, reranker) need allocation rules: by token share, by query count, or by workflow count. Document the rule; change it yearly if unfairness appears.
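Whichever rule is chosen, the allocation itself is a proportional split. A minimal sketch, assuming usage has already been aggregated per squad; `allocate_shared_cost` is a hypothetical helper name.

```python
def allocate_shared_cost(total_usd: float,
                         usage_by_squad: dict[str, float]) -> dict[str, float]:
    """Split a shared service's cost in proportion to each squad's usage.

    `usage_by_squad` can hold tokens, query counts, or workflow counts,
    depending on which allocation rule was documented.
    """
    total_usage = sum(usage_by_squad.values())
    if total_usage <= 0:
        # No usage recorded: split evenly rather than dividing by zero.
        share = total_usd / len(usage_by_squad) if usage_by_squad else 0.0
        return {squad: share for squad in usage_by_squad}
    return {squad: total_usd * used / total_usage
            for squad, used in usage_by_squad.items()}
```

Running the same month through the function with token share and then with query count is a quick way to see which squads a rule change would help or hurt before the rule is revised.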
3.24 Executive One-Pager Template
One page: outcomes/month, $/outcome trend, top waste bucket, migration status, next quarter CapEx/OpEx ask. If you can't fit it on one page, the narrative isn't crisp enough.
3.25 Building the First Showback Dashboard
Start ugly but correct: table of squads with workflows, success rate, tokens, estimated USD, $/outcome. Add charts later. Finance prefers accurate tables over pretty lies.
3.26 Negotiating with Product Leadership
Product wants infinite frontier; finance wants caps. Mediate with data: show three tier policies side by side with projected outcomes/hour and $/outcome. Decisions become rational.
3.27 FinOps Toolchain
Typical stack: span exporter → streaming bus → warehouse → dbt models → BI dashboard → monthly CSV to ERP. Keep the pipeline boring and tested.
3.28 When Chargeback Fails
If chargeback causes teams to bypass the factory with shadow API keys, you've lost governance. Fix incentives: fund innovation sandboxes with explicit caps instead of forcing shadow IT.
3.29 Extended Narrative: FinOps as Product Management
FinOps for factories is product management with dollars attached. Outcome catalogs are your SKU list. If you can't name outcomes, you can't price them. Workshop outcomes with legal and operations before finance—otherwise you'll argue about definitions during invoice disputes.
Showback is a teaching tool. I run office hours where squads see their graphs and propose optimizations—compaction, tier changes, tool bulkheads. The best ideas come from teams who feel costs, not from central mandates.
Chargeback is a behavior tool. Apply it when scarcity is real and culture is ready. Hybrid models work: platform funds baseline capacity; squads pay marginal burst. Transparency about the formula matters more than precision to the penny.
Scenario planning saved a retail client from over-procuring 30% extra GPUs for holiday agents. We modeled fan-out under promotional campaigns, stress-tested tool latency, and kept headroom in on-demand burst instead of reserved idle. Finance approved because we showed bands, not points.
3.30 FinOps Data Model
Core tables: spans, workflows, outcomes, price_versions, allocations. Enforce referential integrity on workflow_id. Document grain: one row per hop, aggregated to workflow in BI layer.
3.31 Anomaly Detection on Spend
Alert when squad spend z-score >3 vs trailing 28 days. Investigate: new agent launch, fan-out bug, cache break, pricing change. Automate tickets to squad + platform.
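The z-score rule can be sketched with the standard library alone. This assumes one spend figure per squad per day and uses the sample standard deviation; the function names are illustrative.

```python
from statistics import mean, stdev


def spend_zscore(trailing_daily_usd: list[float], today_usd: float) -> float:
    """Z-score of today's spend against the trailing window (e.g. 28 days)."""
    if len(trailing_daily_usd) < 2:
        raise ValueError("need at least two days of history")
    mu = mean(trailing_daily_usd)
    sigma = stdev(trailing_daily_usd)  # sample standard deviation
    if sigma == 0:
        # Perfectly flat history: any change is unbounded in z-score terms.
        if today_usd == mu:
            return 0.0
        return float("inf") if today_usd > mu else float("-inf")
    return (today_usd - mu) / sigma


def should_alert(trailing_daily_usd: list[float], today_usd: float,
                 threshold: float = 3.0) -> bool:
    """True when today's spend is more than `threshold` deviations above the mean."""
    return spend_zscore(trailing_daily_usd, today_usd) > threshold
```

The alert is one-sided on purpose: a sudden drop in spend may also deserve a look (a broken exporter, for instance), but that would be a separate rule.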
3.32 Unit Economics for Internal Platforms
If you sell AI capabilities to internal business units, $/outcome becomes internal transfer price. Finance may require cost-plus model; engineering provides span-level COGS.
3.33 Budget Planning Season
Annual planning: ingest growth forecasts, agent roadmap, hardware contracts, scenario bands. Present three plans: lean, base, aggressive. Executives pick risk posture explicitly.
3.34 Transparency Reports
Quarterly AI factory transparency memo: total outcomes, total spend, $/outcome trend, top optimizations, incidents affecting cost. Builds trust with board and regulators.
3.35 FinOps for Multi-Cloud
Allocate egress, cross-cloud API fees, and duplicate indexing costs. Multi-cloud factories cost more; don't hide the overhead in a generic "AI line item."
3.36 Incentive Alignment
Reward squads for $/outcome improvement, not token reduction alone. Pair metrics with quality scores to prevent reckless tier chopping.
3.37 Contractual Pass-Through
When using third-party APIs, pass through list price changes with 30-day notice in internal chargeback. Surprises destroy FinOps credibility.
3.38 FinOps Tooling Evaluation Criteria
Accuracy, latency of cost pipeline (<24h lag acceptable for showback), auditability, RBAC, export formats, API for ERP. Buy vs build depends on data warehouse maturity.
3.39 Closing FinOps Principles
Measure outcomes, tag everything, showback before chargeback, scenario bands beat point forecasts, and finance is a partner—not an afterthought.
3.40 Board-Ready FinOps Narrative
Executives don't want token counts—they want risk-adjusted ROI. Frame factory investments as: capacity to complete N more outcomes per month at stable quality, with downside band if adoption overshoots.
Use analogies: GPUs are freight capacity; agents are trucks; workflows are deliveries. Empty trucks (idle GPUs) and detours (retries) cost money. FinOps makes logistics visible.
When spend spikes, bring three explanations: volume growth, efficiency regression, or price change. Data splits prevent witch hunts against single squads.
Align AI factory budget with product P&L owners who benefit from outcomes. Shared fate improves tier discipline more than central mandates.
Document assumptions in board decks: fan-out depth, frontier share, cache hit rate. Assumptions age; date-stamp them.
3.41 Working with Procurement
Procurement wants commit discounts; engineering wants burst. Present scenario bands to negotiate commit utilization targets with escape valves for burst via on-demand. Avoid commits based on pre-agent chat forecasts.
3.42 Tax and Transfer Pricing
Multinationals may need transfer pricing for internal AI charges. Add legal entity IDs to FinOps tags early; retrofitting is painful.
3.43 Sustainability Reporting
If ESG reports include IT carbon, factory metrics enable tokens/watt trends post-tiering. Even rough proxies beat "we bought renewables" without workload efficiency.
3.44 Dispute Resolution
When squads dispute charges, provide span-level drill-down within 48 hours. Slow disputes erode chargeback acceptance.
3.45 FinOps Certification for Platform Team
Encourage FinOps Certified Practitioner training for platform leads. Shared vocabulary with finance accelerates budget cycles.
3.46 Quarterly Business Review Pack
Export slides automatically from warehouse: outcomes trend, $/outcome by class, waste bucket pie, migration status RAG, next quarter CapEx/OpEx ask. Consistent format builds executive trust quarter over quarter.
3.47 Aligning R&D and FinOps
Research sandboxes get explicit monthly burn caps with auto-shutdown. Research without caps becomes "surprise invoice" stories that kill chargeback programs.
3.48 Outcome Quality Metrics in FinOps
Attach quality score (human eval or automated) to $/outcome charts. Cheap outcomes that fail quality aren't wins—they're future incident costs.
3.49 Closing the Loop with Product Roadmaps
When FinOps shows rising $/outcome for a workflow class, product must choose: fund optimization, accept price increase, or reduce automation scope. Escalate to VP if unresolved two quarters—unbounded spend is a strategy failure, not an ops puzzle.
3.50 Factory Subsidy vs Full Chargeback
Some enterprises subsidize early agent adoption. Document subsidy end date upfront. Permanent subsidy trains teams to ignore efficiency forever.
FinOps maturity is a journey: showback teaches, chargeback disciplines, scenarios prevent panic.
3.10 Chapter 3 Synthesis
FinOps isn't finance vs engineering—it's a shared language. $/outcome, tagged spans, showback → chargeback maturity, and scenario bands keep factories fundable.
Chapter 4: Migration Methodology
4.1 Why Big-Bang Migrations Fail for Agents
Agents are stateful graphs. Big-bang cutovers break tool versions, change latency profiles, and invalidate prompt caches overnight. Use shadow → canary → cutover with explicit promotion criteria tied to $/outcome and SLO—not gut feel.
4.2 Shadow Traffic: Compare Without Risk
Shadow mode duplicates inference requests to the target factory while serving users from legacy. Compare: token counts, latencies, output hash matches (for deterministic tasks), and $/outcome. Don't shadow PII-heavy flows until redaction parity is proven.
4.3 Canary Releases and Automated Promotion
Start canary at 1–5% workflows per class. Promotion gates example:
- p99 latency within 10% of legacy
- $/outcome no more than 5% worse than legacy
- error rate ≤ legacy baseline
- no security policy regressions
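The example gates above can be encoded as a function that returns the failed gates, so a promotion bot has something concrete to log. This sketch reads "within 10%" as "no more than 10% slower than legacy", which is an interpretation; `LaneMetrics` and `promotion_failures` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LaneMetrics:
    p99_latency_ms: float
    usd_per_outcome: float
    error_rate: float
    policy_regressions: int = 0


def promotion_failures(legacy: LaneMetrics, canary: LaneMetrics) -> list[str]:
    """Return the names of failed gates; an empty list means promote."""
    failures = []
    if canary.p99_latency_ms > legacy.p99_latency_ms * 1.10:
        failures.append("p99_latency")
    if canary.usd_per_outcome > legacy.usd_per_outcome * 1.05:
        failures.append("usd_per_outcome")
    if canary.error_rate > legacy.error_rate:
        failures.append("error_rate")
    if canary.policy_regressions > 0:
        failures.append("security_policy")
    return failures
```

Returning the list of failed gates, rather than a single boolean, keeps the promotion record useful in the retrospective: it shows which criterion blocked each attempt.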
4.4 GPU Generation Planning (Vendor-Neutral)
Hardware generations differ in memory bandwidth, FP8 support, and tokens/watt. Plan migrations as capacity equivalence exercises:
- Benchmark representative agent DAGs on old vs candidate silicon.
- Model $/outcome and watts/outcome, not sticker price.
- Stage rack power and cooling before software cutover—DSX-style power orchestration trends matter at rack scale even if your design stays vendor-agnostic.
Practitioner insight: Migrate workflows, not clusters. A cluster cutover without per-workflow shadow data is a coin flip with expensive tails.
4.5 Data Plane vs Control Plane Migration Order
Migrate control plane (router, tags, observability) first—legacy data plane can remain temporarily. Then shift pools workflow-by-workflow. Embeddings and vector indexes migrate on their own track with re-embed validation.
4.6 Rollback and Failure Drills
Pre-write rollback runbooks: DNS/weight shifts, feature flags, orchestrator pins. Quarterly game-day: force canary failure and measure MTTR.
4.7 Codelab: Shadow Comparator (Python)
# shadow_compare.py: compare a legacy output with its shadowed candidate output
import hashlib

def fingerprint(text: str) -> str:
    # Short SHA-256 prefix: enough to spot differences without storing the raw output.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def compare_outputs(legacy: str, candidate: str) -> dict:
    return {
        "match": legacy == candidate,  # exact match; only meaningful for deterministic tasks
        "fp_legacy": fingerprint(legacy),
        "fp_candidate": fingerprint(candidate),
        "len_delta": len(candidate) - len(legacy),  # in characters, not tokens
    }
4.8 Codelab: Canary Controller (TypeScript)
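// canary.ts: deterministic bucketing so a workflow always lands in the same pool
export function pickPool(workflowId: string, canaryPercent: number): "legacy" | "factory" {
  // FNV-1a-style hash over the whole ID; parsing the last 4 chars as hex yields NaN for non-hex IDs.
  let h = 0x811c9dc5;
  for (let i = 0; i < workflowId.length; i++) {
    h = Math.imul(h ^ workflowId.charCodeAt(i), 0x01000193) >>> 0;
  }
  return h % 100 < canaryPercent ? "factory" : "legacy";
}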
4.9 Organizational Readiness
Train SREs on agent SLOs, not just HTTP 500s. Align FinOps on new price sheets before cutover—surprise chargeback destroys trust.
4.11 Migration Inventory: What Must Be Cataloged
Before shadow traffic: model versions, LoRA adapters, embedding indexes, router weights, prompt templates, tool schemas, budget configs, observability dashboards, and on-call runbooks. Missing any one causes false parity signals.
4.12 Parallel Embeddings Migration
Re-embed corpora on a schedule that doesn't starve online lanes. Use batch lanes for backfill; validate recall@k on golden sets before cutover. Embedding drift hurts agents silently—quality sinks while costs look fine.
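Recall@k on a golden set is straightforward to compute once each golden query has a known set of relevant document IDs. A minimal sketch, assuming results are ranked lists of document IDs; the function names and data shapes are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("golden query has no relevant documents")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(golden: dict[str, set[str]],
                     results: dict[str, list[str]], k: int) -> float:
    """Average recall@k over a golden set mapping query -> relevant document IDs."""
    if not golden:
        raise ValueError("golden set is empty")
    scores = [recall_at_k(results.get(query, []), relevant, k)
              for query, relevant in golden.items()]
    return sum(scores) / len(scores)
```

Running the same golden queries against the old and the re-embedded index and comparing the two means gives a single number to gate the cutover on; a per-query breakdown helps locate where the drift is.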
4.13 Contract and License Gates
Verify license terms for weights and APIs in the target environment. Regulated industries may prohibit cross-border failover weights. Legal sign-off is a migration gate, not a paperwork afterthought.
4.14 Communication Plan
Stakeholders fear "AI downtime." Publish migration windows, success metrics, and rollback criteria. Tie comms to workflow classes affected, not technical pool names.
4.15 Post-Cutover Hypercare
First 72 hours after cutover: war room with router metrics, $/outcome deltas, and fan-out depth alarms. Freeze unrelated releases. Hypercare isn't optional for first factory cutovers.
4.16 Generational Benchmark Harness
Build a harness of 30–50 frozen workflows representing production mixes. Run on old and candidate silicon weekly. Track tokens/s, watts (if available), $/outcome, and quality scores. This is how you neutralize vendor marketing.
4.17 Decommissioning Legacy Pools
After cutover, legacy pools linger "just in case" and burn money. Set a decommission date with executive sponsor. Keep read-only shadow capability in cold storage configs, not hot GPUs.
4.18 Lessons from Failed Migrations
Common failures: migrating models before routers; skipping FinOps tag parity; allowing eval jobs into prod pools during canary; no rollback drill. Learn from others' outages—don't sponsor your own.
4.19 Shadow Metrics That Matter
Compare distributions, not just means: p99 latency, p95 tokens, error taxonomy counts. Use population stability indices on outcome labels. Shadow diffs should be automatic nightly reports, not manual spreadsheets.
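For categorical outcome labels, a population stability index can be computed directly from label counts. This is a sketch of the usual PSI formula with a small floor to avoid taking the log of zero; the floor value and the function name are assumptions.

```python
import math
from collections import Counter


def population_stability_index(legacy_labels: list[str],
                               candidate_labels: list[str],
                               eps: float = 1e-6) -> float:
    """PSI between two categorical samples, e.g. workflow outcome labels.

    PSI = sum over labels of (c - l) * ln(c / l), where l and c are the
    legacy and candidate proportions. `eps` floors proportions so labels
    seen on only one side do not produce log(0).
    """
    if not legacy_labels or not candidate_labels:
        raise ValueError("both samples must be non-empty")
    legacy_counts = Counter(legacy_labels)
    candidate_counts = Counter(candidate_labels)
    psi = 0.0
    for label in set(legacy_counts) | set(candidate_counts):
        p_legacy = max(legacy_counts[label] / len(legacy_labels), eps)
        p_candidate = max(candidate_counts[label] / len(candidate_labels), eps)
        psi += (p_candidate - p_legacy) * math.log(p_candidate / p_legacy)
    return psi
```

Identical label distributions give a PSI of zero, and the value grows as the candidate's outcome mix moves away from legacy, so it fits naturally into a nightly diff report with a tolerance band.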
4.20 Canary Cohort Selection
Stratify canary by workflow class and tenant size. Don't canary only friendly internal tenants—you'll miss production skew.
4.21 Blue/Green vs Canary for Stateless Routers
Routers can blue/green quickly; GPU pools often cannot. Sequence: blue/green router → canary data plane → full cutover.
4.22 Migration Tooling
Invest in replay tools: re-run production traces against candidate factories in batch lanes. Replay accelerates shadow coverage without risking live traffic.
4.23 Organizational RACI
| Activity | Platform | FinOps | Product | Security |
|---|---|---|---|---|
| Shadow sign-off | R | C | I | C |
| Canary promotion | R | C | A | C |
| Budget change | C | R | I | I |
Clear RACI prevents migration stalls.
4.24 Post-Migration Optimization Window
First 90 days after cutover: tune tier map weekly using showback. Biggest wins arrive after migration, not before.
4.25 Migration Communication Templates
Pre-migration: what changes, when, no customer impact expected, rollback window.
During canary: metrics tracked, known issues list, owner on-call.
Post-cutover: success criteria met, legacy decommission date, hypercare schedule.
4.26 Legal and Data Retention During Migration
Ensure shadow logs don't retain PII longer than policy allows. Mask in shadow paths when needed—even if comparison is harder.
4.27 Multi-Factory Consolidation
Enterprises sometimes operate regional factories. Consolidation promises efficiency but risks residency violations. Consolidate control planes, not necessarily data planes.
4.28 Learning Loop
After every migration, publish internal postmortem: estimated vs actual $/outcome, incidents, timeline slip causes. Institutional memory beats repeating mistakes.
4.29 Extended Narrative: Migration War Stories
The smoothest migration I led shadowed 12% of workflows for three weeks before any user-visible change. The roughest tried big-bang over a weekend because a vendor contract ended Sunday night. We rolled back at 3 AM, not from lack of talent, but from missing embedding parity and untested fallback chains.
Canary selection bias is subtle. If you only canary internal users, you'll promote on misleading metrics. Stratify by tenant size and workflow class. Include at least one "noisy neighbor" tenant in canary if production has them—you'll thank yourself later.
GPU generation planning without application benchmarks is gambling. Tokens per watt only matters through your agent graphs. A 2× hardware improvement eaten by 3× fan-out depth is a net loss.
Decommission legacy pools on calendar dates. Orphan pools are recurring invoices with zero owners.
4.30 Detailed Shadow Traffic Implementation
Shadow traffic should be representative, not merely volumetric. Stratify samples across workflow classes, tenants, and time-of-day buckets. Store shadow outputs in a comparison warehouse partitioned by workflow_id. For each shadowed hop, record legacy_hash, candidate_hash, latency_delta, token_delta, and usd_delta. Nightly jobs flag regressions beyond tolerance bands.
Privacy: mask PII in shadow storage when production payloads include restricted fields. Use tokenized identifiers for join keys. Security teams should approve shadow retention TTLs before enabling production mirroring.
Performance: shadow must not starve legacy. Cap shadow concurrency at 10–15% of legacy pool capacity or route shadow exclusively through dedicated candidate pools that don't share schedulers with production.
4.31 Canary Mathematics
If canary is 5% of workflows and error rate doubles in canary, overall error rate increases by 5% relative—small but customer-visible at scale. Compute minimum detectable effect given traffic volume before choosing canary percentage. Low-traffic classes need higher canary share or longer canary windows.
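The arithmetic behind that 5% figure, plus a quick check of how many errors a canary can even expect to see, can be written out as below. The traffic volume and baseline error rate in the example are invented for illustration.

```python
def blended_error_rate(baseline_error: float, canary_share: float,
                       canary_multiplier: float) -> float:
    """Overall error rate when `canary_share` of traffic sees a multiplied error rate."""
    return baseline_error * ((1 - canary_share) + canary_share * canary_multiplier)


def relative_increase(canary_share: float, canary_multiplier: float) -> float:
    """Relative change in the overall error rate; independent of the baseline."""
    return canary_share * (canary_multiplier - 1)


def expected_canary_errors(workflows: int, canary_share: float,
                           error_rate: float) -> float:
    """Expected number of failed workflows inside the canary cohort."""
    return workflows * canary_share * error_rate


# 5% canary whose error rate doubles: overall rate rises by 5% relative.
uplift = relative_increase(0.05, 2.0)

# Illustrative low-traffic class: 1,000 workflows/day at a 2% baseline error
# rate gives the canary roughly one expected failure per day.
daily_canary_errors = expected_canary_errors(1_000, 0.05, 0.02)
```

With about one expected failure per day, a doubling to two is hard to tell apart from noise, which is why low-traffic classes need a larger canary share or a longer window.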
Promotion criteria should be statistical, not vibes. Use sequential testing or Bayesian dashboards to avoid peeking bias where on-call promotes early because charts "look fine."
4.32 Cutover Weekend vs Rolling
Rolling cutover shifts traffic gradually (10% → 30% → 60% → 100%) with hold points. Weekend cutover tries 0→100 quickly. Rolling suits large user bases and agents with long workflows; weekends suit internal-only agents with short workflows. Pick based on workflow duration distribution, not tradition.
4.33 Hardware Generation Cutover Checklist
- Benchmark harness green on candidate silicon
- Power/cooling validated for target racks
- Network bandwidth validated for embedding backfills
- Router cost coefficients updated
- FinOps rate card published
- Shadow parity signed
- Canary promotion signed
- Rollback weights tested
- Hypercare staffed
- Legacy decommission date scheduled
4.34 Embedding and Vector Index Cutover
Treat vector stores as migration peers, not afterthoughts. Steps: freeze writes, snapshot index, rebuild on target, recall@k validation, dual-read period, cut read traffic, decommission old index. Skipping dual-read causes subtle retrieval drift that shows up as quality regressions without hard errors.
4.35 Organizational Training Before Cutover
Run tabletop exercises: orchestrator owners, SRE, FinOps, product on-call. Walk through rollback script line by line. Untrained humans revert to restarting pods—a placebo for agent factories.
4.36 Vendor Contract Alignment
Align contract renewal with migration windows. Avoid forced migrations during holiday peaks because a vendor contract ends. Negotiate overlap months where both environments are licensed—cheaper than outage.
4.37 Post-Cutover Metrics Review
At day 7 and day 30, compare $/outcome, success rate, p99 latency, and waste buckets vs pre-migration baseline. Publish internally. If metrics aren't better or neutral, open optimization epics before declaring victory.
4.38 When to Abort Migration
Abort triggers: shadow token delta >15% without quality gain; canary success rate drop >2 pts; security policy regression; residency violation. Pre-agree abort authority (role, not committee of ten).
4.39 Documentation Deliverables
Migration isn't done without updated architecture diagrams, router policy git tags, runbooks, and FinOps rate cards. Auditors and new hires need paper trails.
4.40 Migration Program Office
For enterprises, stand up a lightweight MPO: weekly standup, RAID log, executive dashboard. Migrations fail from coordination gaps more than technical gaps.
4.41 Replay Testing at Scale
Batch-replay last week's traces nightly against candidate factories during migration programs. Automate diff reports; humans triage only regressions beyond thresholds.
4.42 Configuration Drift Detection
Drift between shadow and canary configs (prompt hashes, router version) invalidates comparisons. Config checksums must match except intentional deltas.
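A checksum comparison with an explicit allow-list of intentional deltas might look like the sketch below. It assumes configs are JSON-serializable dictionaries and compares top-level keys only; the key names in the usage example are hypothetical.

```python
import hashlib
import json

_MISSING = object()  # sentinel so a missing key differs from an explicit None


def config_checksum(config: dict) -> str:
    """Stable checksum: key order does not change the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def unexpected_drift(shadow: dict, canary: dict, intentional: set[str]) -> list[str]:
    """Top-level keys that differ between configs and are not declared intentional."""
    drifted = []
    for key in sorted(set(shadow) | set(canary)):
        if key in intentional:
            continue
        if shadow.get(key, _MISSING) != canary.get(key, _MISSING):
            drifted.append(key)
    return drifted
```

For example, with `shadow = {"prompt_hash": "abc", "router_version": "1.4.0", "pool": "legacy"}` and a canary config that changes both `router_version` and `pool`, declaring only `pool` as intentional leaves `router_version` flagged as drift.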
4.43 Migration KPI Dashboard
Single pane: shadow parity %, canary error delta, $/outcome delta, promotion readiness score, days to legacy decommission. Executives consume this, not raw logs.
4.44 Cross-Functional Migration Standups
Daily 15 minutes during canary/cutover weeks. Attendees: platform, SRE, FinOps, product owner, security delegate. Blockers escalated same day.
4.45 Lessons for SaaS Vendors
If you're a vendor migrating customer tenants, migration windows multiply. Stagger tenants by risk tier. Never migrate all tenants Friday 5 PM.
4.46 Hardware Refresh Without Application Migration
Sometimes silicon refresh doesn't require router changes—only pool swaps. Still run harness benchmarks; power and drivers change behavior.
4.47 Migration Debt Tracking
Track deferred migrations (legacy pools, old embeddings). Debt accrues interest as incidents and costs rise. Review debt quarterly in architecture council.
4.48 Closing Migration Mantra
Shadow proves truth, canary proves scale, cutover is boring, rollback is rehearsed, hypercare is staffed, decommission is dated.
4.49 Enterprise Migration Calendar
Publish a 12-month migration calendar visible to all engineering. Blackout windows for retail peaks, tax season, open enrollment. Migrations slot into green windows or don't ship.
Coordinate with procurement for hardware lead times—often longer than software schedules. Hardware on dock before cutover weekend, not 'arriving someday.'
Run executive checkpoint before canary promotion: metrics, risk, rollback owner named. No name, no promotion.
Capture lessons in a migration playbook wiki page per workflow class. Future teams copy patterns instead of reinventing shadow infrastructure.
Remember: migration ends when legacy invoices end. Until then, you're paying twice.
4.50 Migration Toolchain Wishlist
Trace replay, config diff, automated shadow reports, canary promotion bot with guardrails, one-click rollback weights. Invest once, reuse across migrations.
4.51 Parallel Run Economics
Running legacy and factory doubles cost short-term. Finance must expect temporary uplift; document end date or parallel run becomes permanent tax.
4.52 Agent Framework Version Pinning
Migrate agent frameworks and factories together when breaking SDK changes land. Coordinate version pins in monorepo tags.
4.53 Customer Zero Programs
Pilot migrations with friendly internal "customer zero" teams before regulated workflows. Learn empathy for rollback UX.
4.54 Migration Retrospective Template
What we estimated, what happened, what we'll change next migration. Store in wiki tagged #factory-migration.
4.55 Regulatory Sign-off Gates
Regulated workflows need compliance sign-off between shadow and canary. Document signatories in the migration ticket. Skipping this gate does not accelerate delivery; it delays audits.
4.56 Automated Rollback Triggers
Wire metrics to rollback bot: if canary $/outcome or error rate breaches threshold for 15 minutes, revert weights and page owner. Humans sleep; bots guard rails.
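One way to express the sustained-breach rule, as a sketch. It assumes the metrics pipeline emits one sample per minute; the `Sample` fields and threshold parameters are hypothetical names, and the actual weight revert and paging calls are left out.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int             # minutes since canary start (assumed one sample per minute)
    error_rate: float       # fraction of failed workflows in that minute
    usd_per_outcome: float  # canary $/outcome in that minute


def should_rollback(samples, max_error_rate, max_usd_per_outcome, sustain_minutes=15):
    """True only if every sample in the last `sustain_minutes` breaches a threshold."""
    if not samples:
        return False
    latest = samples[-1].minute
    window = [s for s in samples if s.minute > latest - sustain_minutes]
    # Require a full window so a short run of bad samples cannot trigger a revert.
    if len(window) < sustain_minutes:
        return False
    return all(
        s.error_rate > max_error_rate or s.usd_per_outcome > max_usd_per_outcome
        for s in window
    )
```

Requiring every sample in the window to breach is deliberately conservative. A stricter bot could trigger on a majority of samples instead, at the cost of more false reverts.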
4.57 Knowledge Share After Migration
Host 60-minute internal tech talk: what we migrated, metrics, surprises. Recording becomes onboarding asset for next wave.
4.58 Sponsor Communication
Executive sponsors want green/yellow/red migration status weekly during program. One slide, no jargon. Sponsors unblock procurement and staffing when informed.
4.59 Freeze Windows for Dependencies
If CRM or vector DB migrations align with factory migration, sequence dependencies explicitly. Parallel breaking changes multiply rollback complexity exponentially.
4.60 Migration Success Criteria Sign-Off
Document sign-off owners for shadow parity, canary health, and cutover completion in the migration ticket. Ambiguous ownership causes migrations to stall in "almost done" for quarters.
4.61 Celebrate Decommission
When legacy pools power off, send a short note to all engineering: what improved, what we learned, who to thank. Rituals reinforce that migration programs end—not linger as zombie infrastructure.
4.62 Keep Migration Playbooks Current
Update migration playbooks after every wave. Stale playbooks with wrong CLI flags cause weekend rollbacks that benchmarks never predicted.
4.63 Migration Metrics Archive
Archive shadow and canary metrics for three years if compliance requires. Cold storage is cheaper than re-running migrations because audit asked for proof.
4.64 Final Migration Principle
If shadow metrics aren't boringly green, don't canary. If canary isn't boring, don't cut over. Boring migrations are successful migrations.
4.10 Chapter 4 Synthesis
Migration is a product release with scientific promotion. Shadow proves parity; canary proves scale; cutover is boring if you did the work. Document every wave.
Chapter 5: Day-2 Operations
5.1 SLOs for Workflows, Not Requests
Define SLOs on workflow success rate, workflow p99 latency, and $/outcome variance. Supplement with per-lane GPU saturation SLOs. Error budgets: when budget burns, freeze non-critical releases and ban eval jobs from prod pools.
5.2 Incident Response for Agent Factories
Incidents differ from microservice outages:
- Model regression — quality drop without 5xx (detect via eval canaries).
- Fan-out storm — orchestrator bug spawns exponential sub-agents.
- Cache poisoning — bad memoized tool results.
- Cost runaway — budget guard failure.
Runbooks: degrade tier, disable fan-out, drain lane, pin model version. Human comms template: customer impact, ETA, dollars at risk.
Practitioner insight: Keep a "big red switch" that sets global max fan-out to 2. You'll use it once—and be glad.
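A sketch of how a per-workflow cap and the big red switch might fit together, assuming an asyncio-based orchestrator. `FanoutLimiter`, `run_branches`, and the `emergency` flag are hypothetical names; a real factory would read the flag from shared config rather than an in-process attribute.

```python
import asyncio


class FanoutLimiter:
    def __init__(self, per_workflow_cap: int, emergency_cap: int = 2):
        self.per_workflow_cap = per_workflow_cap
        self.emergency_cap = emergency_cap
        self.emergency = False  # the "big red switch"

    def effective_cap(self) -> int:
        if self.emergency:
            return min(self.per_workflow_cap, self.emergency_cap)
        return self.per_workflow_cap


async def run_branches(limiter, branch_factories):
    """Run sub-agent coroutines with at most effective_cap() in flight at once."""
    sem = asyncio.Semaphore(limiter.effective_cap())
    in_flight = 0
    peak = 0

    async def guarded(make_coro):
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await make_coro()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(f) for f in branch_factories))
    return results, peak


async def sub_agent():
    await asyncio.sleep(0.01)  # stand-in for a model or tool call
    return "done"


limiter = FanoutLimiter(per_workflow_cap=4)
results, normal_peak = asyncio.run(run_branches(limiter, [sub_agent] * 12))

limiter.emergency = True  # flip the switch: cap drops to 2
_, emergency_peak = asyncio.run(run_branches(limiter, [sub_agent] * 12))
```

The cap is read once per workflow here, so flipping the switch affects new workflows, not branches already in flight. Draining in-flight work is a separate runbook step.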
5.3 Capacity Triggers and Autoscaling
Triggers should combine queue depth, tokens in flight, and p99 prefill latency—not CPU percent. Scale-out lead time for GPU nodes is hours to days; predictive scaling from capacity signatures beats reactive panic.
5.4 Model Version Drift and Eval Canaries
Run continuous eval canaries on production routers—small deterministic tasks with golden outputs. Block promotion if drift exceeds tolerance.
5.5 Security Operations Integration
Feed factory audit logs to SOC: tool calls, policy denials, override events. Correlate with identity tokens per agent role.
5.6 Codelab: SLO Burn Alert (Python)
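# slo_burn.py
def error_budget_burn(success_rate: float, target: float, window_minutes: int) -> float:
    """Fraction of the error budget consumed over the measurement window."""
    # window_minutes records the window success_rate was measured over; it is not used in the math.
    budget = 1.0 - target          # allowed failure fraction, e.g. 0.005 for 99.5%
    consumed = 1.0 - success_rate  # observed failure fraction
    # A 100% target leaves no budget, so report it as fully consumed.
    return consumed / budget if budget > 0 else 1.0


def should_page(burn: float, threshold: float = 0.5) -> bool:
    """Page once the consumed share of the budget reaches the threshold."""
    return burn >= threshold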
5.7 Codelab: Capacity Webhook (TypeScript)
// capacityHook.ts
export type Signal = { lane: string; tokensInFlight: number; watermark: number };

export function action(sig: Signal): "ok" | "scale" | "degrade" {
  const ratio = sig.tokensInFlight / sig.watermark;
  if (ratio < 0.85) return "ok";
  if (ratio < 1.0) return "scale";
  return "degrade";
}
5.8 Post-Incident Reviews and Factory Changelog
Every sev-1/2 gets a blameless review: spans, dollars burned, guardrails that failed. Maintain a factory changelog—router weights, tier maps, promotion events.
5.9 Continuous Improvement Loop
Monthly factory council: platform, FinOps, product, security. Agenda: $/outcome trends, waste buckets, migration status, next quarter capacity.
5.11 On-Call Playbooks (Condensed)
Sev-1 Fan-out storm: enable global fan-out cap → drain batch lanes → page orchestrator owner.
Sev-1 Cost runaway: enable hard budget stop → list top workflows by spend → require VP override to resume.
Sev-2 Model regression: pin previous model version → open quality incident → run eval harness diff.
Sev-2 Cache poisoning: flush memoization namespace → disable tool memo for class → root-cause tool output change.
5.12 SLO Documentation Template
For each workflow class document: objective, measurement window, error budget, alert routes, runbook links, dependencies, and customer-facing comms template. Store in git beside router config—SLOs are code.
5.13 Capacity Planning Calendar
Align with business events: open enrollment, tax season, holiday retail, quarter close. Pre-scale two weeks ahead using signatures; don't wait for dashboards to turn red.
5.14 Green Ops: Tokens per Watt
Sustainability teams increasingly ask about energy. If you can't meter watts per workflow yet, proxy with tokens per watt from benchmark harnesses and publish improvement trends after tiering or silicon migrations.
5.15 Knowledge Transfer
Rotate on-call across platform and product teams quarterly. Agents fail in weird ways; siloed ops teams miss orchestrator bugs.
5.16 Audit and Compliance Logs
Retain span logs per policy (often 90–365 days). Archive to cold storage with tamper-evident buckets. Auditors ask for proof of human oversight on restricted workflows—correlate HITL tickets to workflow_id.
5.17 Continuous Profiling
Re-run workload profiling quarterly. Agent graphs drift as product teams add tools and hops. Capacity signatures go stale like firewall rules.
5.18 Factory Roadmap Linkage
Day-2 metrics should feed the factory roadmap: if waste bucket "over-tiered hops" grows, invest in router ML; if tool wait dominates, invest in integration performance, not GPUs.
5.19 Incident Metrics Beyond MTTR
Track cost of incident (tokens burned during degradation), workflows affected, and escalations to humans. MTTR alone ignores economic damage.
5.20 Game Days
Quarterly game days: inject fan-out bug in staging, cost runaway in staging, model regression in canary. Measure detection time and runbook effectiveness.
5.21 Observability Stack
Minimum: traces (workflow spans), metrics (lanes, tokens in flight), logs (policy denials), dashboards (SLO/error budget), alerts (multi-window burn rates). If you lack traces, you don't have a factory—you have servers.
5.22 Vendor Escalation Paths
When underlying GPU cloud has regional impairment, factory ops needs vendor TAM contacts and comms templates pre-written. Don't draft during outage.
5.23 Toil Reduction
Automate: tier map rollbacks, cache flushes, budget overrides with approval tokens. Manual SSH to restart model pods should be rare.
5.24 Handoff to Continuous Improvement
Close the loop: incidents → corrective actions → router/config PRs → verified in eval harness → documented in factory changelog. Ops without closure is recurring pain.
5.25 Customer Trust and External SLAs
If you sell agent outcomes externally, external SLAs must derive from internal workflow SLOs with margin. Don't promise 99.9% on workflows you haven't measured.
5.26 SLI Catalog Examples
- workflow_success_rate = successes / attempts
- workflow_latency_p99 = p99 end-to-end seconds
- usd_per_outcome_p50 = median cost for successes
- lane_saturation = tokens_in_flight / watermark
- fanout_depth_p95 = p95 parallel branches per tick
Publish SLIs to product teams; SLOs are negotiated from SLIs.
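A sketch of how these SLIs could be computed from per-workflow records. The record fields (`success`, `latency_s`, `usd`, `max_fanout`) are assumed names, and `max_fanout` per workflow stands in for "parallel branches per tick", which needs tick-level data this example does not model.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compute_slis(workflows, tokens_in_flight, watermark):
    """Assumes at least one workflow and at least one success in the window."""
    successes = [w for w in workflows if w["success"]]
    return {
        "workflow_success_rate": len(successes) / len(workflows),
        "workflow_latency_p99": percentile([w["latency_s"] for w in workflows], 99),
        "usd_per_outcome_p50": median(w["usd"] for w in successes),
        "lane_saturation": tokens_in_flight / watermark,
        "fanout_depth_p95": percentile([w["max_fanout"] for w in workflows], 95),
    }


# Illustrative records, not real measurements.
workflows = [
    {"success": True, "latency_s": 10.0, "usd": 0.40, "max_fanout": 3},
    {"success": True, "latency_s": 12.0, "usd": 0.60, "max_fanout": 4},
    {"success": False, "latency_s": 30.0, "usd": 0.90, "max_fanout": 12},
    {"success": True, "latency_s": 11.0, "usd": 0.50, "max_fanout": 2},
]
slis = compute_slis(workflows, tokens_in_flight=600_000, watermark=800_000)
```

Note how the single failed workflow dominates both tail SLIs in this sample: tail percentiles surface the fan-out outliers that averages hide.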
5.27 Alerting Anti-Patterns
Alerting on average GPU utilization hides fan-out cliffs. Alert on burn rates, queue depth, and fan-out depth. Page humans for sustained SLO budget burn, not single blips.
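To make "sustained burn, not single blips" concrete, here is a sketch of a two-window burn-rate check. The window lengths and the threshold of 10 are assumptions for illustration; tune them to your own error budget policy.

```python
def burn_rate(failures: int, attempts: int, slo_target: float) -> float:
    """Observed failure rate divided by the failure rate the SLO allows."""
    if attempts == 0:
        return 0.0
    budget = 1.0 - slo_target
    if budget <= 0:
        return float("inf") if failures else 0.0
    return (failures / attempts) / budget


def should_page_multiwindow(short, long, slo_target, threshold):
    """short and long are (failures, attempts) tuples, e.g. 5-minute and 1-hour windows.

    Paging requires both windows to burn fast: the long window filters blips,
    the short window confirms the problem is still happening.
    """
    return (
        burn_rate(*short, slo_target) >= threshold
        and burn_rate(*long, slo_target) >= threshold
    )


# A blip: the short window is hot, the long window is healthy, so no page.
blip = should_page_multiwindow((10, 100), (12, 2000), slo_target=0.995, threshold=10)
# Sustained burn: both windows are hot, so page.
sustained = should_page_multiwindow((10, 100), (150, 2000), slo_target=0.995, threshold=10)
```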
5.28 Runbook Quality Bar
Runbooks must be executable by someone who didn't write them. Test quarterly. If runbook requires tribal knowledge, fix the runbook.
5.29 Preparing for the Next Hardware Generation
When new silicon arrives, don't migrate in panic. Benchmark with harness, update tokens/watt tables, adjust capacity signatures, run shadow on one workflow class, expand. Repeatable process beats launch day heroics.
5.30 Closing Operations Philosophy
Day-2 isn't maintenance—it's product development for platform teams. The factory gets better every sprint or it gets more expensive every sprint. There's no steady state in agent land.
5.31 Extended Narrative: Living with Agent Incidents
The first fan-out storm I debugged looked like a DDoS from inside: same workflow class, hundreds of sub-agents, all calling the same degraded CRM. Circuit breakers weren't fashionable yet; we hard-coded a global parallel cap and survived. Today, I'd implement token debt, per-dependency bulkheads, and an executive-visible "big red switch" tested monthly.
Model regressions are insidious—no red HTTP codes, just worse answers and more retries. Eval canaries on production routers catch these within hours if you invest in golden tasks. Skimp on eval, pay in escalations.
Capacity triggers should be rehearsed. If scale-out lead time is 6 hours, autoscaling on threshold breach must start 6 hours before you expect breach—predictive scaling from signatures, not reactive paging at 2 AM.
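The lead-time rule can be sketched as a simple linear extrapolation. This is an illustration under a strong assumption (load grows linearly over the horizon); a real predictive scaler would use capacity signatures and calendar inputs. All function and parameter names are hypothetical.

```python
def minutes_until_breach(current: float, growth_per_minute: float, watermark: float):
    """Linear estimate of time until tokens in flight hit the watermark.

    Returns 0.0 if already breached and None if load is flat or falling.
    """
    if current >= watermark:
        return 0.0
    if growth_per_minute <= 0:
        return None
    return (watermark - current) / growth_per_minute


def should_start_scale_out(current, growth_per_minute, watermark, lead_time_minutes):
    """Start provisioning once the projected breach falls inside the lead time."""
    eta = minutes_until_breach(current, growth_per_minute, watermark)
    return eta is not None and eta <= lead_time_minutes


# With a 6-hour (360-minute) lead time, a breach projected in 300 minutes
# means scale-out should already be under way.
start_now = should_start_scale_out(400_000, 1_000, 700_000, lead_time_minutes=360)
```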
Close incidents with factory changelog entries. Ops knowledge should be durable, not Slack scrollback.
5.32 SLO Error Budget Policy
Define error budget policies per workflow class. Example: 99.5% monthly success allows 0.5% failures. Burn budget on deployments, model promotions, and infra changes. When budget exhausted, freeze risky changes until budget recovers. This aligns product velocity with reliability.
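A worked example of the arithmetic, assuming a class that runs 200,000 workflows a month (an illustrative volume, not a benchmark). At a 99.5% target the budget is about 1,000 failed workflows.

```python
def error_budget_status(attempts: int, failures: int, slo_target: float) -> dict:
    """Budget in failed workflows, the fraction still unspent, and the freeze decision."""
    allowed = attempts * (1.0 - slo_target)
    # A 100% target has no budget, so treat it as exhausted.
    remaining = 1.0 - failures / allowed if allowed > 0 else 0.0
    return {
        "allowed_failures": allowed,
        "remaining_fraction": remaining,
        "freeze_risky_changes": remaining <= 0.0,
    }


mid_month = error_budget_status(attempts=200_000, failures=400, slo_target=0.995)
exhausted = error_budget_status(attempts=200_000, failures=1_200, slo_target=0.995)
```

In the first case 400 failures leave roughly 60% of the budget, so releases continue. In the second, 1,200 failures overspend it and risky changes freeze.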
5.33 Incident Severity Rubric for Agents
Sev-1: widespread workflow failure, cost runaway threatening monthly budget, residency breach.
Sev-2: single class degradation, partial fan-out failure, model regression detected by canary.
Sev-3: elevated latency within SLO margin, non-critical tool degradation.
Sev-4: cosmetic dashboard issues.
Attach runbook links per severity in paging tools.
5.34 Capacity Trigger Tuning Guide
Start watermarks conservative (70% tokens in flight), observe false positive rate for two weeks, tune upward until false positives <5%. Document final values in git next to router config.
5.35 Predictive Scaling Inputs
Feed predictive scaler: business calendar events, marketing campaign schedule, historical signatures, weather if retail, tax calendar if finance. Humans override predictions with explicit flags—don't fight automation silently.
5.36 Eval Canary Design
Golden tasks: 50–200 per critical workflow class, updated monthly. Include edge cases discovered in incidents. Run every 15 minutes in production canary lane with alert on score drop >ε.
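A sketch of the score-drop check, assuming golden tasks are graded by exact match against a stored answer. Real harnesses often use rubric or model-based grading instead; the function names and the ε value are placeholders.

```python
def canary_score(outputs: dict, golden: dict) -> float:
    """Fraction of golden tasks whose output exactly matches the expected answer."""
    if not golden:
        raise ValueError("golden set is empty")
    matches = sum(
        1 for task_id, expected in golden.items() if outputs.get(task_id) == expected
    )
    return matches / len(golden)


def drift_exceeded(baseline_score: float, current_score: float, epsilon: float) -> bool:
    """Alert when the score drops by more than epsilon relative to the baseline."""
    return (baseline_score - current_score) > epsilon


# Hypothetical golden set for a claims workflow class.
golden = {"t1": "APPROVE", "t2": "DENY", "t3": "ESCALATE", "t4": "APPROVE"}
outputs = {"t1": "APPROVE", "t2": "DENY", "t3": "APPROVE", "t4": "APPROVE"}

score = canary_score(outputs, golden)
alert = drift_exceeded(baseline_score=1.0, current_score=score, epsilon=0.1)
```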
5.37 SOC Integration Details
Export spans including tool_name, policy_decision, tier, usd_estimate. Map to SIEM correlation rules for impossible travel (agent calling tools from wrong region) and privilege anomalies.
5.38 Toil Metrics
Track toil hours per week on factory ops. Goal: downward trend via automation. If toil rises with agent adoption, platform team is underwater—hire or simplify architecture.
5.39 Multi-Region Failover Drills
Fail region B while region A serves traffic—verify residency constraints still hold per tenant. Failover without residency checks is a compliance incident waiting to happen.
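A drill can assert this before any traffic moves. The sketch below assumes each tenant declares a set of allowed jurisdictions and each region maps to exactly one jurisdiction; the tenant names, region names, and data shapes are invented for illustration.

```python
def failover_violations(tenants: dict, surviving_regions: set, region_jurisdiction: dict) -> list:
    """Tenants that have no permitted jurisdiction left after the failover.

    tenants maps tenant_id -> set of allowed jurisdictions.
    """
    available = {region_jurisdiction[r] for r in surviving_regions}
    return sorted(t for t, allowed in tenants.items() if not (allowed & available))


region_jurisdiction = {"us-east": "US", "eu-west": "EU"}
tenants = {"acme": {"EU"}, "globex": {"EU", "US"}}

# eu-west is failed in the drill, so only us-east survives.
blocked = failover_violations(tenants, {"us-east"}, region_jurisdiction)
```

A non-empty result means those tenants must degrade or queue rather than fail over; routing them anyway is the compliance incident.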
5.40 Customer Communication During Incidents
Template external comms: impact scope, workflows affected, ETA, workaround, postmortem promise. Legal reviews template once, not per incident at 3 AM.
5.41 Long-Term Capacity Roadmap
Rolling 12-month GPU/OpEx forecast tied to business growth assumptions and agent roadmap. Update quarterly. Tie to business advisory planning cycles.
5.42 Factory Maturity Assessments
Annual assessment against maturity model (metered → routed → FinOps → resilient → optimized). Publish gap list and investment ask. Executives fund gaps when narrative is crisp.
5.43 Handoff to Platform Product Roadmap
Day-2 findings should create epics: router ML, better compaction, tool gateway caching, etc. Ops data is product discovery.
5.44 Celebrating Reliability Wins
When error budgets recover after optimization, share credit publicly. Reliability culture needs positive reinforcement, not only incident blame.
5.45 Final Operations Checklist (Printable)
- [ ] Workflow SLOs published
- [ ] Runbooks tested this quarter
- [ ] Game day completed
- [ ] Eval canaries green
- [ ] Capacity signatures updated
- [ ] FinOps showback reviewed monthly
- [ ] Migration RAID log clear
- [ ] Big red switch tested
5.46 On-Call Health Metrics
Track pages per engineer per week, repeat incidents, mean time to mitigate. Unhealthy on-call drives attrition—fix root causes, not heroes.
5.47 Dependency Catalog for Agents
Maintain a catalog of the downstream systems agents call, with owners and SLOs. Incidents are often external; triaging agent failures without dependency context wastes time.
5.48 Progressive Delivery for Router Changes
Use feature flags for policy bundles: 1% → 10% → 50% → 100% with automated rollback on $/outcome regression.
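The staged rollout can be sketched as a small state function. The 5% regression tolerance is an assumption, and returning 0 is just this example's way of signalling rollback.

```python
STAGES = [1, 10, 50, 100]  # percent of traffic, as listed above


def next_stage(current_pct: int, baseline_usd: float, candidate_usd: float,
               max_regression: float = 0.05) -> int:
    """Return the next traffic percentage, or 0 to signal rollback.

    current_pct must be one of STAGES; anything else raises ValueError.
    """
    if candidate_usd > baseline_usd * (1 + max_regression):
        return 0
    idx = STAGES.index(current_pct)
    return STAGES[min(idx + 1, len(STAGES) - 1)]


promoted = next_stage(1, baseline_usd=1.00, candidate_usd=1.02)      # within tolerance
rolled_back = next_stage(10, baseline_usd=1.00, candidate_usd=1.20)  # 20% regression
```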
5.49 Waste Elimination Sprints
Quarterly sprint dedicated to top waste bucket from FinOps. Platform + squads pair; success measured in $/outcome delta next month.
5.50 Knowledge Base Hygiene
Runbooks in git with owners and last-tested dates. Stale runbooks worse than none—they breed false confidence.
5.51 Bridging Ops and Research
When research wants new frontier model, ops requires harness results and shadow week before any canary. Research velocity continues within guardrails.
5.52 Closing Day-2 Mantra
Measure workflows, page on burn, automate toil, drill failures, publish changelogs, fund improvements.
5.53 Sustainable On-Call for Agent Factories
Agent incidents are cognitively heavy—ambiguous symptoms, expensive blast radius. Limit on-call shifts to experienced engineers with factory context. Rotate shadow on-call for training without paging juniors alone.
Post-incident, fund fixes before new features. Unfixed factory debt compounds fan-out risk nonlinearly.
Measure customer-visible outcomes during incidents, not just infra green lights. Workflows failing silently hurt trust more than loud 500 errors.
Define escalation contacts for sev-1 incidents that exhaust internal runbooks: know when to pull in vendor TAMs and external architects.
Sustainable ops means predictable improvement, not heroic weekends every month.
5.54 Metrics for Platform Team Health
Track: deploy frequency for router, mean time to restore factory SLO, toil hours, incident repeat rate. Healthy team improves these while agent adoption grows.
5.55 Blameless Culture with Accountability
Blameless doesn't mean consequence-free. Repeated policy bypasses get engineering manager attention. Culture supports learning; governance stops repeat negligence.
5.56 External Benchmarking
Compare your $/outcome to anonymized industry peers via advisors. Isolation breeds complacency or panic without context.
5.57 Upgrade Windows
Coordinate model upgrades with low-business-impact windows per signature calendar. Upgrades during peaks are self-inflicted sev-1s.
5.58 Ops Handover to New Hires
Onboard with game day in week two, not slide deck month two. Muscle memory matters for fan-out incidents.
5.59 Pairing SRE with FinOps During Incidents
Cost runaway incidents need joint bridge: SRE stops bleeding, FinOps estimates dollar exposure for executive updates. Siloed bridges waste critical minutes.
5.60 Publishing SLO Reports
Monthly SLO report to product VPs: error budget status, top incidents, planned improvements. Transparency reduces "why is AI slow" hallway questions.
5.61 Factory Changelog Newsletter
Monthly newsletter to engineering: router changes, tier map updates, price version changes, upcoming migrations. Surprises create resistance; newsletters create partners.
5.62 Runbook for Model Provider Outages
When public API regions fail, router should fail over region or degrade tier with customer comms template ready. Practice provider outage quarterly—it's when factories prove maturity.
5.63 Continuous Learning Budget
Allocate 10% platform capacity to toil reduction and eval improvements. Without budget, Day-2 decays into permanent firefighting.
5.64 Factory Ops Quarterly Goals
Set explicit goals: reduce p99 workflow latency 10%, cut top waste bucket 15%, complete one game day, ship two runbook automations. Goals without numbers are wishes.
5.65 Handoff to Leadership
Escalate structural factory gaps—insufficient GPU contract, missing FinOps headcount, policy gridlock—to leadership with data and proposed investment. Ops teams can't policy-hack around capacity starvation forever.
5.66 Sleep Better
Well-instrumented factories with rehearsed runbooks let on-call engineers sleep. That's the real ROI of Day-2 discipline—not slide aesthetics.
5.67 Ops Metrics in Executive Dashboards
Expose workflow SLO attainment and error budget status to executive dashboards monthly. Visibility prevents "AI is flaky" narratives without data.
5.68 Final Operations Principle
Day-2 excellence is measured in uneventful Tuesdays—not heroic Sundays. Build the factory so Tuesdays stay quiet, budgets stay predictable, and agents stay trustworthy.
5.10 Chapter 5 Synthesis
Day-2 is where factories earn trust in production. Workflow SLOs, incident runbooks, capacity triggers, and eval canaries turn agents from demo to reliable utility at scale.
Key Takeaways & FAQ
Key Takeaways
- Measure workflows, not messages: Agent economics are driven by fan-out depth, context growth, and retries—profile before you scale GPUs.
- Build a factory, not an endpoint: Separate control plane routing, model tiering, online vs batch lanes, and edge/core boundaries.
- FinOps on $/outcome: Tag spans, run showback, graduate to chargeback, and scenario-plan seasonality.
- Migrate scientifically: Shadow, canary, cutover—with GPU generation planning that's vendor-neutral and benchmark-backed.
- Operate Day-2 with workflow SLOs: Incidents include cost runaway and fan-out storms; capacity triggers use tokens in flight, not CPU alone.
Frequently Asked Questions
What's the difference between an AI factory and a model hosting cluster?
A hosting cluster serves inference requests. A factory adds workflow-aware routing, lane isolation, FinOps tagging, tier policies, migration controls, and Day-2 SLOs for multi-hop agents. Agents need orchestration economics, not just low-latency tokens.
How do I estimate GPU capacity for agent workloads?
Build capacity signatures from traced production or representative DAGs: measure tokens per workflow, fan-out p95, and lane mix. Forecast by business seasonality, not average utilization. Include burst buffers for month-end and batch reconciliation patterns.
What is $/outcome and why not tokens per dollar?
$/outcome divides total cost (inference, tools, human review) by successfully completed workflows. Tokens per dollar ignores retries, over-tiering, and failed jobs—metrics that mislead executives during agent scale-up.
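As a worked example with made-up figures: the function below follows the definition above, with human review priced per minute. The parameter names and dollar amounts are illustrative.

```python
def usd_per_outcome(inference_usd: float, tool_usd: float, review_minutes: float,
                    review_usd_per_minute: float, successes: int) -> float:
    """Total cost of all attempts divided by successfully completed workflows."""
    if successes == 0:
        return float("inf")  # money spent, nothing delivered
    total = inference_usd + tool_usd + review_minutes * review_usd_per_minute
    return total / successes


# $900 inference + $150 tools + 300 review minutes at $1.50 = $1,500 total.
healthy = usd_per_outcome(900.0, 150.0, 300, 1.5, successes=1_000)
# Same spend, but a fifth of the workflows failed: cost per outcome rises.
degraded = usd_per_outcome(900.0, 150.0, 300, 1.5, successes=800)
```

Total spend, and therefore any token-based ratio, is identical in both cases. Only $/outcome shows the difference.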
When should we use batch vs online inference lanes?
Online lanes serve user-facing steps with tight p99 latency. Batch lanes handle offline reconciliation, eval at scale, and non-interactive summarization with throughput-optimized scheduling. Never share pools without hard isolation.
How does model tiering reduce cost without hurting quality?
Route hops by risk and capability: frontier for planning and high-stakes verification, mid for synthesis, small for classification. Use verifier confidence and policy metadata—not blanket cheap models for every hop.
What promotion criteria should gate canary to cutover?
Compare shadow and canary cohorts on p99 latency, $/outcome, success rate, and policy regressions. Promote only when metrics stay within agreed bands (e.g., latency within 10%, cost within 5%). Pre-authorize rollback weights and feature flags.
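A sketch of such a gate using the example bands from the answer above. The metric keys and the 0.5-point success-rate tolerance are assumptions; returning the per-check results keeps the decision explainable in the migration ticket.

```python
def within_band(baseline: float, candidate: float, tolerance: float) -> bool:
    """True if candidate is at most `tolerance` (a fraction) worse than baseline."""
    return candidate <= baseline * (1 + tolerance)


def promotion_ready(shadow: dict, canary: dict, latency_tol: float = 0.10,
                    cost_tol: float = 0.05, success_tol: float = 0.005):
    checks = {
        "latency_p99": within_band(shadow["latency_p99"], canary["latency_p99"], latency_tol),
        "usd_per_outcome": within_band(shadow["usd_per_outcome"], canary["usd_per_outcome"], cost_tol),
        "success_rate": canary["success_rate"] >= shadow["success_rate"] - success_tol,
        "policy_regressions": canary["policy_regressions"] == 0,
    }
    return all(checks.values()), checks


# Illustrative cohort metrics.
shadow = {"latency_p99": 40.0, "usd_per_outcome": 1.00, "success_rate": 0.990, "policy_regressions": 0}
canary = {"latency_p99": 43.0, "usd_per_outcome": 1.04, "success_rate": 0.991, "policy_regressions": 0}

ready, checks = promotion_ready(shadow, canary)
```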
How do prefix caches interact with multi-agent prompts?
Place static system instructions and schemas at the prompt prefix; append volatile tool outputs at the tail. Lint prompts in CI to prevent dynamic fields from invalidating cache keys across hops.
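A CI lint for this can be very small. The sketch below assumes prompts are templates with `{{ name }}` placeholders; the list of volatile field names is hypothetical and should be replaced with your own.

```python
import re

# Hypothetical patterns for volatile content; tune these to your own templates.
VOLATILE_PATTERNS = [
    re.compile(r"\{\{\s*(timestamp|request_id|tool_output|user_input)\s*\}\}"),
    re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}"),  # inline ISO-8601 timestamps
]


def lint_prefix(prefix: str) -> list:
    """Return volatile fragments found in the cacheable prompt prefix."""
    findings = []
    for pattern in VOLATILE_PATTERNS:
        findings.extend(m.group(0) for m in pattern.finditer(prefix))
    return findings


clean = lint_prefix("You are a claims agent. Follow the response schema exactly.")
dirty = lint_prefix("Now: {{ timestamp }}. You are a claims agent.")
```

Fail the build when the list is non-empty, and move the flagged fields to the tail of the prompt.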
Should edge inference replace core GPUs?
Edge handles redaction, buffering, and small models near data sources. Core hosts large models and centralized governance. Agents crossing boundaries need signed summaries and residency-aware routers—not raw PII shuttling.
How do Vera Rubin-class and DSX trends affect architecture?
They signal higher tokens-per-watt and rack-level power orchestration becoming first-class constraints. Design abstractions around workload lanes and benchmarks, not SKU loyalty, so hardware generations swap without rewriting orchestration.
What SLOs should we publish to product teams?
Workflow success rate, workflow p99 latency by class, and $/outcome variance bands. Supplement with lane saturation indicators. Avoid promising single-request latency for multi-hop agents.
How do we prevent fan-out storms?
Enforce per-workflow parallel caps in the orchestrator, token debt budgets, and a global emergency degrade switch. Alert on abnormal branching depth before GPUs saturate.
Where should human review sit in FinOps models?
Include human review minutes in $/outcome when agents escalate. Over-automation without review gates can lower token cost while increasing operational risk and rework expense.
Author Bio
Vatsal Shah is the Principal AI Architect at Agile Tech Guru. He designs AI factories, agent inference platforms, and FinOps showback systems for regulated enterprises. His work spans GPU migration programs, workflow-level observability, and factory Day-2 operations that keep agent fleets within SLO and budget.
Ready to benchmark your factory?
AI factory TCO review — We'll profile your agent DAGs, model tier mix, and $/outcome baselines, then deliver a migration and FinOps roadmap aligned to your capacity horizon.
Social Excerpt
Chatbots optimized cost per message. Agents optimize cost per workflow—and most "AI factories" are still chat clusters with extra GPUs.
Our new AI Factory & Agentic Inference Playbook covers:
- Workload science — fan-out, caching, batch vs online lanes
- Reference architectures — routing, tiering, edge vs core (vendor-neutral)
- FinOps — $/outcome, showback, chargeback scenarios
- Migration — shadow, canary, cutover + GPU generation planning
- Day-2 — workflow SLOs, incidents, capacity triggers
Read the full manual: https://agiletechguru.com/playbooks/ai-factory-agentic-inference-playbook #AIInfrastructure #FinOps #AgenticAI #MLOps
X/Twitter
1/ If your FinOps dashboard still shows cost per chat message, agent scale-up will hurt. Agents burn tokens across fan-out DAGs—not single turns. 🧵
2/ Profile fan-out, context growth, retries, and cache affinity. Meter with workflow_id from day one.
3/ Build a factory: router + tiers + online/batch lanes + FinOps tags—not one vLLM pool.
4/ Migrate with shadow → canary → cutover. Benchmark $/outcome on new silicon, not marketing TFLOPS.
5/ Day-2 = workflow SLOs + fan-out storm runbooks + capacity triggers on tokens in flight.
https://agiletechguru.com/playbooks/ai-factory-agentic-inference-playbook #AIFactory #Inference