AI Factory & Agentic Inference Playbook — Architecture, FinOps, and Migration for Token-Heavy Workloads

By Vatsal Shah · 2026-05-31 · AI Infrastructure / FinOps

STRATEGIC OVERVIEW: Chatbots optimized cost per message. Agents optimize cost per workflow—and your factory wasn't built for workflows. This five-chapter playbook gives platform engineering, FinOps, and infrastructure leaders a vendor-neutral blueprint to profile agent workloads, design routing and tiering layers, implement $/outcome showback, execute shadow-canary-cutover migrations, and run Day-2 capacity triggers with production-grade SLOs.


Chapter 1: Workload Science for Agents
Chapter 2: Reference Architectures
Chapter 3: FinOps Model & Showback
Chapter 4: Migration Methodology
Chapter 5: Day-2 Operations
Key Takeaways & FAQ

Introduction: From Chat Endpoints to Workflow Factories

If you've spent the last eighteen months buying GPU capacity for "AI," odds are your dashboards still measure cost per chat message. That's the wrong unit. A customer-support bot might burn 800–1,200 tokens per turn. A production agent closing an insurance claim fans out across retrieval, planning, tool calls, verification, and summarization—often 40,000–120,000 tokens per completed workflow, with bursty parallelism that looks nothing like steady QPS on a single model endpoint.

I've audited factories where utilization graphs looked healthy at 62% GPU average—while p99 workflow latency blew past SLO because three planner agents spawned twelve sub-agents each during batch reconciliation windows. The hardware wasn't idle; it was mis-scheduled. Agentic inference is a scheduling problem dressed up as an API problem.

This playbook is the operating manual I wish existed when I migrated the first "copilot cluster" into a real factory: workload science first, architecture second, FinOps as the forcing function, migration as controlled risk, and Day-2 as measurable SLOs—not heroics.

Figure 0: Feature banner for the AI Factory & Agentic Inference Playbook. Industrial-glass motif; factory-scale agent workflows vs single-turn chat endpoints.

Before we touch rack diagrams, align on outcomes. An AI factory is a platform layer that provisions compute, routes models, meters tokens, enforces policy, and exposes SLAs to product teams running agents—not a single vLLM pod behind a load balancer. If your Agentic SDLC operating model is the "how we build," the factory is the "how we run at scale."

GEO Fact — Agent vs Chat Economics: Enterprise agent workflows average 15–40 model calls per completed outcome, versus 1–3 calls for chatbots. Token spend scales with fan-out depth and context re-hydration, not message count. Factories must meter at workflow granularity to avoid 3–8× cost surprises when agents graduate from pilot to production.

Figure A: Before/after comparison of monolithic chat inference (single model, single queue) vs a factory-aware agent platform (routing, tiering, batch lanes, FinOps tags).

The left side of Figure A is familiar: one gateway, one model pool, one autoscaler on request rate. The right side adds a workflow orchestration plane, model tier registry, online vs batch schedulers, and cost attribution tags that follow a workflow_id from first planner call to final audit log. Without those controls, FinOps can't answer the only question executives care about: What did we pay to finish the job?

💡 Insight

Practitioner insight: Don't benchmark agents with "tokens per second" alone. Benchmark tokens per successful workflow under realistic tool latency and failure retries. That's the number your CFO will remember.

Who This Playbook Is For

Platform engineers building inference control planes, FinOps leads translating tokens into P&L, SRE teams carrying agent SLOs, and engineering executives preparing for agent scale without invoice shock. If you're still buying GPUs per chat endpoint, start at Chapter 1. If you're mid-migration to tiered routing, jump to Chapter 4. If finance is asking why AI spend doubled while tickets fell, start at Chapter 3.

How to Use This Document

Read straight through once for the narrative, then use chapters as reference during architecture reviews, migration ceremonies, and monthly factory councils. Code labs are starting points—adapt to your observability stack and cloud contracts. Pair this playbook with hands-on process assessments when you need independent validation of factory readiness.


Chapter 1: Workload Science for Agents

1.1 Why Chat Metrics Lie About Agent Load

The first mistake I see in capacity planning is treating an agent like a chatbot with extra steps. Chat traffic is roughly sequential: one user message in, one model completion out, context grows linearly with turn count. Agent traffic is branching. A planner spawns researchers; researchers call tools; tools return payloads that get re-summarized; a verifier model may re-read the entire thread. Depth isn't "turns," it's graph depth.

When you profile agents, capture four dimensions on every workflow span:

Fan-out factor (F): max parallel model calls per orchestration tick.
Context growth rate (G): tokens added per hop (tool JSON is brutal here).
Retry multiplier (R): expected re-runs after tool failure or policy rejection.
Cache affinity (C): share of prompt prefix stable across hops.

Expected tokens per workflow ≈ base_prompt × (1 + R) × Σ(hops) × (1 + tool_attachment_factor). If you only autoscale on HTTP QPS, you'll miss the cliff when F jumps from 2 to 16 during month-end jobs.
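As a rough worked example of the estimate above, the sketch below reads Σ(hops) as the number of model hops in the workflow. That reading, the function name, and the sample inputs are illustrative assumptions, not measured values.

```python
def expected_tokens_per_workflow(
    base_prompt: int,
    retry_multiplier: float,
    hops: int,
    tool_attachment_factor: float,
) -> float:
    """Rough estimate: base_prompt x (1 + R) x hops x (1 + tool_attachment_factor)."""
    return base_prompt * (1 + retry_multiplier) * hops * (1 + tool_attachment_factor)


# Hypothetical inputs: 2,000-token base prompt, 10% retries, 20 hops,
# and tool payloads that add 50% on top of each hop's prompt.
estimate = expected_tokens_per_workflow(2_000, 0.10, 20, 0.50)
print(round(estimate))
```

With these made-up inputs the estimate lands around 66,000 tokens, inside the 40,000–120,000 range quoted in the introduction. Doubling the hop count doubles the bill, which is why fan-out needs its own autoscaling signal.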

💡 Insight

Practitioner insight: Add a workflow_id label on every inference request on day one. Without it, you'll never reconcile FinOps to product outcomes—and you'll re-litigate the same chargeback fight quarterly.

Figure 1: Agent fan-out architecture with planner, parallel researchers, tool gateways, verifier, and summarizer nodes, annotated with tokens and latency per edge.

1.2 Fan-Out, Backpressure, and Queue Discipline

Fan-out is where innocent pilot clusters go to die. Suppose a coordinator dispatches eight sub-agents, each with a 32k context window prefill. That's eight concurrent prefills on the same GPU pool unless you shard by lane. Lanes are logical queues with independent concurrency caps: online-interactive, online-standard, batch-offline, eval-regression.

Backpressure belongs in the orchestrator, not the GPU driver. When lane saturation exceeds a watermark (e.g., 85% of negotiated concurrency tokens), the orchestrator should:

Degrade model tier for non-critical sub-agents (frontier → mid → small).
Collapse duplicate retrieval hops via deduplicated embedding cache.
Shed lowest-priority workflows (marketing copy gen) before payroll reconciliation agents.

I've implemented token debt counters per tenant: if a team exceeds their in-flight token budget, new fan-out branches queue with visible ETA rather than silently piling onto shared H100s.
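A per-tenant token debt counter can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions: the class name, method names, and the budget figure are hypothetical, and a production version would need shared state across orchestrator replicas.

```python
from collections import defaultdict


class TokenDebtCounter:
    """Tracks in-flight tokens per tenant against a fixed budget (in-memory sketch)."""

    def __init__(self, budgets: dict[str, int]):
        self.budgets = budgets
        self.in_flight: dict[str, int] = defaultdict(int)

    def try_admit(self, tenant: str, tokens: int) -> bool:
        """Admit a fan-out branch if it fits the budget; otherwise the caller queues it."""
        if self.in_flight[tenant] + tokens > self.budgets.get(tenant, 0):
            return False
        self.in_flight[tenant] += tokens
        return True

    def release(self, tenant: str, tokens: int) -> None:
        """Return tokens to the budget when a branch completes or is cancelled."""
        self.in_flight[tenant] = max(0, self.in_flight[tenant] - tokens)


# Hypothetical tenant with a 100k in-flight token budget.
counter = TokenDebtCounter({"claims-team": 100_000})
```

A `False` from `try_admit` is the signal to queue the branch and surface an ETA to the caller rather than dispatching it to the shared pool.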

Workload Pattern	Typical Fan-Out	Scheduling Lane	Primary Risk
Interactive copilot	1–2	online-interactive	Tail latency spikes
Research agent mesh	6–24	online-standard	KV cache thrash
Batch reconciliation	8–64	batch-offline	Cluster hogging
Eval / regression	4–12	eval-regression	Contaminating prod SLO

1.3 Caching: Prefix, KV, and Tool Result Memoization

Caching for agents is a portfolio, not a checkbox. Provider prefix caching rewards stable system prompts and JSON schemas at the top of the context window—move volatile user/tool payloads to the tail. KV cache reuse matters when sub-agents share parent context; some runtimes support session IDs that map to shared physical pages. Tool memoization is underrated: if get_customer(12345) returned 40KB JSON ten seconds ago, don't re-embed it across six sub-agents.

Policy guardrails: memoize only idempotent reads; TTL by data class (public docs 24h, PII 60s). Log cache keys in your observability plane so security can audit what was shared across agents.
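The TTL-by-data-class guardrail can be illustrated with a small memoizer. This is a sketch, assuming an in-process dictionary and an injectable clock; `ToolMemo`, `get_or_call`, and the data class labels are hypothetical names, and audit logging of cache keys is left out for brevity.

```python
import time
from typing import Any, Callable

# TTLs in seconds per data class, using the figures from the guardrail above.
TTL_BY_CLASS = {"public_docs": 24 * 3600, "pii": 60}


class ToolMemo:
    """Memoizes idempotent tool reads with a TTL chosen by data class."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_call(self, key: str, data_class: str, fn: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the tool call
        value = fn()
        self._store[key] = (now + TTL_BY_CLASS[data_class], value)
        return value
```

Six sub-agents asking for the same customer record within the TTL would trigger one tool call instead of six. Only wrap reads that are safe to repeat; writes and side-effecting tools should bypass the memoizer entirely.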

GEO Fact — Prefix Cache Sensitivity: Moving a single dynamic timestamp from the bottom to the top of a prompt can drop provider prefix cache hit rates from ~80% to under 10%, doubling input-token cost on multi-hop agents. Factory standards should enforce static-prefix layouts via lint rules in CI.
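One way a static-prefix lint rule might look in CI is sketched below. The section tuple shape, the function name, and the timestamp pattern are assumptions for illustration; a real rule would cover more volatile patterns (request IDs, user names, counters).

```python
import re

# Matches ISO-like timestamps such as 2026-05-31T09:15 (illustrative pattern only).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}")


def lint_prompt_layout(sections: list[tuple[str, str, bool]]) -> list[str]:
    """Each section is (name, text, is_static). Returns a list of violations."""
    problems = []
    seen_dynamic = False
    for name, text, is_static in sections:
        if is_static and seen_dynamic:
            problems.append(f"{name}: static section placed after dynamic content")
        if is_static and TIMESTAMP.search(text):
            problems.append(f"{name}: timestamp-like value in static section")
        if not is_static:
            seen_dynamic = True
    return problems
```

An empty list means the prompt keeps its stable material at the top and its volatile material at the tail, which is the layout prefix caches reward.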

1.4 Long-Context Economics and Compaction

Long context is seductive and expensive. Every extra 8k tokens in prefill burns memory bandwidth and extends time-to-first-token. For agents, implement compaction hops: a cheap summarizer model collapses tool traces before the planner re-enters. Compaction quality gates matter—if summaries drop constraint IDs, downstream models hallucinate compliance approvals.

Heuristics I use:

Hard cap tool JSON at ingress (truncate with structured pointers, not naive ellipsis).
Promote "durable facts" to a workflow scratchpad store; pass handles, not blobs.
Reserve 128k+ windows for true multivariate reasoning, not lazy logging dumps.
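The first two heuristics can be combined in one ingress step: oversized tool JSON is parked in a scratchpad store and replaced with a structured pointer. This is a minimal sketch; `cap_tool_payload`, the handle format, and the dictionary standing in for the scratchpad store are all hypothetical.

```python
import hashlib
import json


def cap_tool_payload(payload: dict, max_chars: int, store: dict) -> dict:
    """Pass small payloads through; replace large ones with a pointer to the stored blob."""
    raw = json.dumps(payload, sort_keys=True)
    if len(raw) <= max_chars:
        return payload
    handle = "blob-" + hashlib.sha256(raw.encode()).hexdigest()[:12]
    store[handle] = raw  # full payload stays out of the context window
    return {
        "truncated": True,
        "handle": handle,
        "original_chars": len(raw),
        "keys": sorted(payload),  # tells the model what it could ask for
    }
```

The model sees which fields exist and can request specific ones by handle, instead of receiving a blob cut off mid-value.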

1.5 Batch vs Online: Two Factories in One

Online inference optimizes p95/p99 latency under variable fan-out. Batch inference optimizes throughput per dollar via continuous batching, padded sequences, and opportunistic use of spot/preemptible capacity. Agents need both: customer-facing steps on online lanes; nightly reconciliation on batch lanes with SLA measured in hours, not milliseconds.

Figure 2: Batch vs online routing flowchart. The orchestrator classifies workflow steps by latency SLO and directs them to the appropriate scheduler pools.

Never let batch queues borrow from interactive GPU pools without hard isolation—I've seen "just this one nightly job" degrade executive dashboards at 06:00 when batch spilled over.

1.6 Workload Profiling Lab: Instrumentation Schema

You can't optimize what you don't label. Minimum span attributes for every model call:

workflow_id, parent_span_id, agent_role, model_tier, input_tokens, output_tokens, cached_tokens, prefill_ms, decode_ms, tool_wait_ms, usd_estimate.

Figure 3: Workload profiler UI showing fan-out timeline, per-hop token cost, cache hit ratio, and lane saturation heatmap.

1.7 Codelab: Workload Profiler Emitter (Python)

# workload_profiler.py: emit OpenTelemetry-friendly spans for agent hops
from dataclasses import dataclass, asdict
from time import perf_counter
from typing import Optional
import json
import sys

@dataclass
class InferenceSpan:
    workflow_id: str
    span_id: str
    parent_span_id: Optional[str]
    agent_role: str
    model_tier: str
    lane: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0

class Profiler:
    def __init__(self, sink=print):
        # sink receives one JSON line per hop; swap in a log shipper or exporter
        self.sink = sink

    def run_hop(self, span: InferenceSpan, callable_fn):
        # Time the hop, then emit the span attributes plus wall-clock latency
        t0 = perf_counter()
        result = callable_fn()
        elapsed_ms = (perf_counter() - t0) * 1000
        payload = {**asdict(span), "latency_ms": round(elapsed_ms, 2)}
        self.sink(json.dumps(payload))
        return result

# Example: coordinator calls researcher sub-agent
if __name__ == "__main__":
    prof = Profiler(sink=lambda line: sys.stdout.write(line + "\n"))
    span = InferenceSpan(
        workflow_id="wf-claim-88421",
        span_id="span-research-3",
        parent_span_id="span-plan-1",
        agent_role="researcher",
        model_tier="mid",
        lane="online-standard",
        input_tokens=18240,
        output_tokens=960,
        cached_tokens=12000,
    )
    prof.run_hop(span, lambda: {"status": "ok", "docs": 4})

1.8 Codelab: Fan-Out Limiter (TypeScript)

// fanoutLimiter.ts — cap parallel model calls per workflow
type Task<T> = () => Promise<T>;

export class FanoutLimiter {
  private inFlight = 0;
  constructor(
    private readonly maxParallel: number,
    private readonly onReject?: (reason: string) => void,
  ) {}

  async run<T>(task: Task<T>): Promise<T> {
    if (this.inFlight >= this.maxParallel) {
      this.onReject?.("fanout_cap");
      throw new Error("Fan-out cap exceeded; queue or degrade tier");
    }
    this.inFlight++;
    try {
      return await task();
    } finally {
      this.inFlight--;
    }
  }
}

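// Usage in orchestrator (metrics, urls, and callModel are assumed to exist in the caller's scope).
// run() rejects rather than queues: if urls holds more than 8 entries, the extra calls throw
// and Promise.all rejects, so batch the input or catch the error and degrade tier.
const limiter = new FanoutLimiter(8, (r) => metrics.increment("fanout_reject", { reason: r }));
await Promise.all(urls.map((u) => limiter.run(() => callModel(u))));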

1.9 Capacity Signatures and Seasonality

Agents inherit business seasonality. Month-end finance agents don't look like Tuesday marketing agents. Build capacity signatures: vectors of (lane, hour, fan_out_p95, tokens_per_workflow_p95). Forecast GPU needs from signatures, not vanity utilization averages.
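A capacity signature can be computed directly from sampled workflow records. The sketch below assumes each record is a dictionary with `lane`, `hour`, `fan_out`, and `tokens` keys (hypothetical field names) and uses a nearest-rank p95.

```python
from collections import defaultdict


def p95(values: list[int]) -> int:
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def capacity_signatures(workflows: list[dict]) -> dict:
    """Group sampled workflows by (lane, hour) and report p95 fan-out and tokens."""
    groups = defaultdict(list)
    for wf in workflows:
        groups[(wf["lane"], wf["hour"])].append(wf)
    return {
        key: {
            "fan_out_p95": p95([w["fan_out"] for w in rows]),
            "tokens_per_workflow_p95": p95([w["tokens"] for w in rows]),
        }
        for key, rows in groups.items()
    }
```

Comparing the signature for a month-end hour against an ordinary weekday hour makes the seasonality visible in the same units the scheduler uses.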

1.11 Deep Dive: Prefill vs Decode in Multi-Hop Agents

GPU time splits into prefill (process the prompt, populate KV cache) and decode (emit tokens autoregressively). Chat turns with short prompts are decode-heavy. Agent hops with 20k-token tool dumps are prefill-heavy. When eight sub-agents launch together, you create eight prefill storms on shared memory bandwidth—p99 TTFT explodes even if decode QPS looks fine.

Mitigations I've shipped in anger:

Staggered fan-out: release sub-agents in waves of N, not all-at-once.
Prompt deduplication: hash parent context; reuse KV where runtime allows.
Speculative sub-agent cancellation: if planner confidence drops, kill in-flight branches before prefill completes.
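Staggered fan-out can be sketched with `asyncio.gather` applied one wave at a time. This is a simplified illustration: `staggered_fanout` is a hypothetical helper, and it waits for a whole wave to finish before releasing the next, which is stricter than a sliding window.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def staggered_fanout(
    tasks: Sequence[Callable[[], Awaitable[T]]],
    wave_size: int,
    pause_s: float = 0.0,
) -> list[T]:
    """Run sub-agent calls in waves of wave_size instead of all at once."""
    results: list[T] = []
    for start in range(0, len(tasks), wave_size):
        wave = tasks[start:start + wave_size]
        # Only this wave's prefills hit the pool concurrently.
        results.extend(await asyncio.gather(*(task() for task in wave)))
        if pause_s and start + wave_size < len(tasks):
            await asyncio.sleep(pause_s)  # optional gap between waves
    return results
```

Results come back in input order, so the coordinator can treat the output exactly as it would an all-at-once gather.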

Track prefill_ms / (prefill_ms + decode_ms) per workflow class. When that ratio crosses 0.55 for interactive lanes, you're mis-classifying workloads—move heavy hops to batch or compact first.

1.12 Tool Latency and the Hidden Queue

Agents spend wall-clock waiting on tools—CRM APIs, SQL, vector DB, human approval. That wait isn't billed to GPUs but still blocks workflows and triggers retry loops that are billed. Instrument tool_wait_ms separately. I've seen workflows where tool wait was 70% of duration but teams only optimized model tier.

Design tool SLAs paired with inference SLAs: if CRM p95 is 4s, don't spawn six parallel CRM calls without a circuit breaker. Use bulkhead patterns per downstream system so one poisoned dependency doesn't convert into a fan-out retry tsunami.
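A bulkhead per downstream system can be as small as one semaphore per dependency. The sketch below is a minimal illustration with hypothetical names and limits; it caps concurrency only and does not implement the circuit-breaker half of the pattern.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class Bulkhead:
    """Caps concurrent calls per downstream system so one slow dependency
    cannot absorb every in-flight agent branch."""

    def __init__(self, limits: dict[str, int]):
        self._limits = limits
        self._sems: dict[str, asyncio.Semaphore] = {}

    async def call(self, system: str, fn: Callable[[], Awaitable[T]]) -> T:
        sem = self._sems.get(system)
        if sem is None:
            # Created lazily so the semaphore belongs to the running event loop.
            sem = self._sems[system] = asyncio.Semaphore(self._limits[system])
        async with sem:
            return await fn()
```

With a limit of two on the CRM, six parallel sub-agents produce at most two concurrent CRM calls; the other four wait inside the orchestrator instead of piling onto a dependency with a 4s p95.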

1.13 Determinism, Temperature, and Cost Variance

Low temperature doesn't guarantee deterministic spend. Tool outputs vary; retrieval sets shift; cache keys miss. For FinOps forecasting, maintain distributions—not point estimates. Report p50 and p95 $/outcome per workflow class in showback dashboards.

For eval and regression lanes, pin temperatures and model builds. For creative marketing agents, allow variance but cap spend with hard max_cost_usd on the router.

1.14 Regulatory and Residency Impacts on Workload Shape

Restricted data often can't use public API caching or cross-region failover. That forces higher prefill repetition and blocks shared prefix caches across tenants. Factory planners must add residency overhead coefficients to capacity models—typically 15–30% more tokens per outcome for strictly pinned workloads.

1.15 Observability Anti-Patterns

Aggregating all models into one "AI spend" line item.
Logging only final hop tokens.
Ignoring failed workflows in success-rate SLOs.
Using average fan-out instead of p95 fan-out for capacity.

Fix these before you buy more GPUs. Otherwise you're funding noise.

1.16 Workshop: Building a Workload Profile in One Sprint

Week 1: instrument workflow_id and span attributes. Week 2: sample 500 production workflows per class. Week 3: compute F, G, R, C distributions and draft lane mapping. Week 4: present capacity signature to finance with scenario bands. This beats six months of "we'll monitor it later."

1.17 Case Study: Month-End Fan-Out Cliff

A fintech client ran claims reconciliation agents on the 1st of each month. Pilot metrics looked tame: 12 RPS average. Production month-end hit 180 RPS equivalent when fan-out expanded to 24 parallel researchers per coordinator. p99 workflow latency went from 38s to 11 minutes; GPUs weren't down—they were prefill-saturated.

We fixed it in three moves without buying hardware first: staggered fan-out waves of six, compaction hops before planner re-entry, and moving reconciliation to batch lanes with 4-hour SLA. p99 returned to 52s; $/outcome dropped 31% because we stopped infinite frontier retries on timeout.

1.18 KV Cache Fragmentation

When agents share partial context, fragmentation still happens if each sub-agent tweaks system prompts. Standardize system prompts per role (researcher, verifier, summarizer) across squads. Platform team owns three canonical prompts, not three hundred snowflakes.

1.19 Token Budgeting at Orchestration Design Time

Product specs should include token budgets alongside functional requirements: "This workflow may consume at most 80k tokens at p95." Architects sign off before implementation. This prevents surprise graphs that no router can tier away.

1.20 Interplay with Agentic SDLC

Coding agents from your Agentic SDLC playbook generate PRs; factory metrics prove whether those agents are affordable at scale. Link CI agent spans to the same workflow_id scheme so engineering and product agents appear in one FinOps plane.

1.21 Extended Comparison: Batch Scheduling Policies

Policy	Strength	Weakness
FIFO	Simple fairness	Head-of-line blocking
Priority by SLO	Protects interactive	Starves batch if mis-tuned
Cost-aware	Minimizes $/token	Complex to explain
Deadline-aware	Meets cutoffs	Requires accurate duration est.

Most factories start FIFO, then add priority lanes—not the other way around.

1.22 When to Reject Work

Healthy factories say no. If tokens_in_flight exceeds global watermark and workflow class isn't critical, return 429 with retry-after and estimated queue time. Silent degradation breeds mistrust; explicit queueing breeds trust.
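The admission rule might be sketched as follows. Estimating queue time from a token drain rate is an assumption added for illustration, as are the `Admission` type and the parameter names.

```python
from dataclasses import dataclass


@dataclass
class Admission:
    status: int
    retry_after_s: int = 0


def admit(
    tokens_in_flight: int,
    watermark: int,
    critical: bool,
    drain_tokens_per_s: int,
) -> Admission:
    """Admit critical work always; reject other work above the watermark with a wait hint."""
    if critical or tokens_in_flight <= watermark:
        return Admission(status=200)
    excess = tokens_in_flight - watermark
    wait_s = -(-excess // drain_tokens_per_s)  # ceiling division
    return Admission(status=429, retry_after_s=max(1, wait_s))
```

The `retry_after_s` value maps onto the HTTP `Retry-After` header, so well-behaved agent frameworks back off without custom client logic.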

1.23 The Physics of Concurrent Agent Graphs

Think of your factory as a queueing network. Each orchestration tick is a node; each model call is a server with service time dominated by prefill length and decode tokens. Fan-out multiplies arrivals at child servers simultaneously. The queueing math you learned in undergrad still applies: as utilization ρ → 1, waiting time grows without bound (in the textbook M/M/1 case, mean time in system is the service time divided by 1 − ρ). Agents push ρ toward 1 faster than chat because they parallelize intentionally.

That's why lane isolation is non-negotiable. Interactive ρ should stay below 0.7 sustained; batch can run hotter because SLA is measured in hours. When product demands "run everything now," translate the argument into ρ numbers executives understand.
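To put numbers on the ρ argument, the sketch below uses the textbook M/M/1 result that mean time in system equals service time divided by (1 − ρ). Real agent traffic is burstier than M/M/1 assumes, so treat the output as a floor on how quickly latency degrades, not a forecast.

```python
def mm1_latency_multiplier(rho: float) -> float:
    """Mean time in system divided by service time for an M/M/1 queue: 1 / (1 - rho)."""
    if not 0 <= rho < 1:
        raise ValueError("rho must be in [0, 1)")
    return 1.0 / (1.0 - rho)


for rho in (0.5, 0.7, 0.9, 0.99):
    print(f"rho={rho:.2f} -> {mm1_latency_multiplier(rho):.1f}x service time")
```

At ρ = 0.7 a request spends roughly 3.3 times its service time in the system; at 0.9 that is 10 times, and at 0.99 about 100 times. That jump is the case for keeping interactive lanes below 0.7.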

1.24 Context Windows as a Budget, Not a Feature

Vendors market 200k+ context windows. Operationally, each additional token in prefill is a microsecond tax multiplied by cluster width. Teach product managers to write specs in token budgets and evidence handles, not "send the whole PDF." The factory enforces budgets at router ingress with structured errors when exceeded—fail fast, don't truncate silently without audit logs.

1.25 Cross-Functional Review Cadence

Weekly 30-minute review: platform, FinOps, one product owner. Agenda: top three workflows by spend, one incident, one optimization shipped. Continuity beats quarterly hero projects.

1.26 Long-Horizon Trends

Agents will get more autonomous, not less. Fan-out depth will rise unless regulators or economics constrain it. Factories that instrument workflow science now will adapt; those that bolt GPUs onto chat endpoints will replatform under duress.

1.27 Extended Narrative: Profiling in Practice

When I arrive on site, the first question isn't "how many H100s do you own?" It's "show me one workflow trace." Most teams can't. We spend week one instrumenting orchestrators (often LangGraph, Temporal, or custom asyncio graphs) and week two sampling production. The aha moment is always the same: a "simple" customer service agent actually fans out to fourteen model calls because retrieval, rerank, draft, verify, and compliance checks each became separate hops without anyone drawing the graph.

We label hops with agent_role even when the same binary serves multiple roles via prompts. Roles matter for tiering: verifiers stay frontier longer; formatters drop to small tiers. We measure attachment factor—how many tokens tools add per hop. CRM JSON is the usual villain. Compression policies (field allowlists, stable key ordering) cut tokens and improve cache hits because hashes stabilize.
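A field allowlist with stable key ordering takes only a few lines using the standard `json` module. The function name and the sample fields are hypothetical; the point is that two records with the same allowed content serialize to the same bytes regardless of upstream key order.

```python
import json


def compress_tool_json(record: dict, allowlist: set[str]) -> str:
    """Keep only allowlisted fields and serialize with sorted keys and no padding."""
    kept = {key: value for key, value in record.items() if key in allowlist}
    return json.dumps(kept, sort_keys=True, separators=(",", ":"))


# Same customer, different key order and different noise fields.
a = {"name": "Ada", "id": 7, "internal_notes": "long free text"}
b = {"id": 7, "last_login": "yesterday", "name": "Ada"}
allow = {"id", "name"}
print(compress_tool_json(a, allow) == compress_tool_json(b, allow))
```

Identical serialized output means identical hashes, which is what lets prefix caches and tool memoization recognize a repeat.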

Batch vs online isn't religious war; it's SLO war. I ask product: "Is a human waiting?" If yes, online lane with aggressive caps. If no, batch lane with cost-aware scheduling. Mixed workflows get step-level routing inside the orchestrator—don't classify entire workflows as batch just because 80% of steps are offline; one interactive confirmation step forces online protection for that segment.

Finally, we document retry economics. A 10% retry rate on a 50k-token workflow adds 5k tokens expected value—but tail risk is higher because retries often escalate tier or widen context. Model retries as Bernoulli trials; plan capacity for tails, not means.
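The gap between the mean and the tail can be sketched with two small functions. This assumes at most one retry per workflow (a single Bernoulli trial) and a hypothetical `retry_cost_factor` that captures retries escalating tier or widening context.

```python
def expected_retry_tokens(
    workflow_tokens: int,
    retry_rate: float,
    retry_cost_factor: float = 1.0,
) -> float:
    """Expected extra tokens from at most one retry: tokens x p x cost factor."""
    return workflow_tokens * retry_rate * retry_cost_factor


def tokens_with_retry(workflow_tokens: int, retry_cost_factor: float) -> float:
    """Tokens consumed by a workflow that actually retries once (the tail case)."""
    return workflow_tokens * (1 + retry_cost_factor)


mean_extra = expected_retry_tokens(50_000, 0.10)       # the 5k figure above
escalated = expected_retry_tokens(50_000, 0.10, 1.5)   # retry costs 1.5x the first attempt
tail_run = tokens_with_retry(50_000, 1.5)              # what one unlucky workflow burns
```

With the illustrative 1.5× factor, the mean rises to about 7,500 extra tokens, but the one-in-ten workflow that retries consumes 125,000 tokens in total. Capacity has to absorb that run, not the average.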

1.10 Chapter 1 Synthesis

Workload science is the prerequisite for every architecture diagram that follows. Profile fan-out, segregate lanes, treat caching as a system, compact context deliberately, and split batch from online as if they were two products—because to your users' SLOs, they are.

Chapter 2: Reference Architectures

2.1 Factory Layers: Control Plane vs Data Plane

A reference AI factory splits cleanly into control plane (policy, routing, registry, FinOps tags) and data plane (GPU/TPU workers, model servers, embedding indexes). Product teams submit workflows to the control plane API; they never pin to individual pods. This mirrors how Kubernetes separated scheduling from container execution—except our "pods" are quantized model replicas with wildly different memory footprints.

Core components:

Ingress gateway — authN/Z, rate limits, request normalization.
Workflow router — selects model tier, lane, and residency rules per hop.
Model registry — versions, capability matrix, cost coefficients.
Scheduler — binds requests to pools (online H100, batch A100, edge NPU).
Observability bus — spans, dollars, watts (where available).

💡 Insight

Practitioner insight: Keep the router stateless; put workflow state in a durable orchestration store. Routers scale horizontally; sticky GPU sessions do not.

Figure 4: Reference architecture with control plane services atop data plane GPU pools, embedding tier, and edge inference nodes.

2.2 Routing Layer: Policy, Not Just Load Balancing

Routing is where FinOps meets security. Inputs: workflow_class, data_sensitivity, latency_slo, max_cost_usd, required_capabilities (tools, vision, JSON mode). Outputs: model_tier, region, lane, fallback_chain.

Implement fallback chains explicitly: if frontier times out, mid-tier summarizer completes with degraded quality flag—don't infinite-retry frontier and torch spend.
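An explicit fallback chain might look like the sketch below: each tier is tried once, in order, and the response records whether it was degraded. `TierTimeout`, `complete_with_fallback`, and the callable-per-tier shape are assumptions for illustration, not a specific framework's API.

```python
from typing import Callable


class TierTimeout(Exception):
    """Raised by a tier's call function when the model does not answer in time."""


def complete_with_fallback(
    chain: list[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> dict:
    """Try each (tier, call) pair once in order; never retry the same tier."""
    for index, (tier, call) in enumerate(chain):
        try:
            text = call(prompt)
        except TierTimeout:
            continue  # move down the chain instead of retrying
        return {"text": text, "tier": tier, "degraded": index > 0}
    raise RuntimeError("all tiers in the fallback chain failed")
```

Because the `degraded` flag travels with the result, downstream verifiers and showback dashboards can tell a mid-tier completion from a frontier one.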

Figure 5: Model tier routing flowchart covering capability checks, cost ceiling, residency, then primary and fallback model selection.

2.3 Model Tiering Matrix

Tier	Typical use	Latency target	Cost index (illustrative)
Frontier	Planning, complex reasoning	< 8s TTFT	1.00×
Mid	Tool synthesis, summarization	< 3s TTFT	0.35×
Small	Classification, extraction	< 1s TTFT	0.08×
Local CPU	PII regex, format validation	< 50ms	~0× API

Tiering isn't "use cheap models whenever." It's risk-adjusted tiering: high-stakes hops stay frontier until verifier confidence exceeds threshold.

2.4 Edge vs Core: Where Inference Should Run

Core datacenter pools host large models, big KV caches, and centralized FinOps. Edge nodes (factory floor, retail store, regulated VPC) run small models for redaction, intent detection, and offline-first buffering. Agents spanning edge and core need synchronization contracts: edge agents emit signed summaries; core agents never pull raw PII back across the boundary without policy tokens.

Trend note (vendor-neutral): 2026 hardware roadmaps emphasize higher tokens per watt and rack-scale power orchestration—industry discussion often cites next-gen accelerators (e.g., Vera Rubin-class) and datacenter power fabrics (DSX-style orchestration) as drivers for denser agent factories. Treat these as capacity planning signals, not purchase mandates. Your factory abstractions should survive SKU changes.

Figure 6: Sequence diagram of orchestrator → router → scheduler → model server → tool gateway → verifier, with span propagation.

2.5 Embedding and Retrieval Tier

Agents spend tokens re-reading retrieval results. Architect a dedicated embedding tier with:

Hybrid search (dense + lexical) behind a stable MCP/tool interface.
Chunk size standards (512–1k tokens) with metadata for citation.
Re-rankers on mid-tier models before context injection.

2.6 Multi-Region and Residency

Regulated workloads require data-plane pinning. The router enforces allowed_regions per workflow class. Failover across regions for stateful agents is painful—design idempotent workflow steps and externalize checkpoints.

Figure 7: Routing control panel with live tier weights, lane watermarks, residency overlays, and emergency degrade switches.

2.7 Codelab: Routing Policy Engine (Python)

# routing_policy.py: declarative tier selection
from dataclasses import dataclass

@dataclass
class RouteRequest:
    workflow_class: str
    sensitivity: str  # public | internal | restricted
    latency_slo_ms: int
    max_cost_usd: float
    needs_vision: bool = False

# Illustrative cost index from the tiering matrix (not consulted by select_tier below)
TIER_COST = {"frontier": 1.0, "mid": 0.35, "small": 0.08}

def select_tier(req: RouteRequest) -> str:
    # Rules are evaluated top to bottom; the first match wins
    if req.sensitivity == "restricted" and req.workflow_class == "underwriting":
        return "frontier"
    if req.latency_slo_ms < 1500:
        return "small"
    if req.max_cost_usd < 0.02:
        return "mid"
    return "frontier" if req.needs_vision else "mid"

2.8 Codelab: Factory Client SDK (TypeScript)

// factoryClient.ts — single entry for agent hops
export interface HopRequest {
  workflowId: string;
  spanId: string;
  agentRole: string;
  messages: Array<{ role: string; content: string }>;
  workflowClass: string;
  maxCostUsd: number;
}

export class FactoryClient {
  constructor(private readonly baseUrl: string, private readonly token: string) {}

  async complete(hop: HopRequest): Promise<{ text: string; tier: string; usd: number }> {
    const res = await fetch(`${this.baseUrl}/v1/complete`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${this.token}`,
        "Content-Type": "application/json",
        "X-Workflow-Id": hop.workflowId,
      },
      body: JSON.stringify(hop),
    });
    if (!res.ok) throw new Error(`factory error ${res.status}`);
    return res.json();
  }
}

2.9 Failure Domains and Blast Radius

Shard factories by blast radius: payments-agents never shares GPU pools with marketing-agents. Shared infrastructure is fine; shared scheduler debt is not.

2.11 Control Plane APIs Product Teams Actually Use

Expose a single POST /v1/hops/complete with explicit policy headers. Product teams shouldn't pick GPU types—they declare intent: workflow_class, sensitivity, max_cost_usd, latency_slo_ms. The factory returns tier, region, lane, usd_estimate, and trace_id.

Version the API. Pin breaking changes to quarterly trains so agent frameworks don't fracture.

2.12 High Availability Without Sticky Sessions

Stateful KV sessions tempt architects to use sticky sessions on load balancers. That complicates failover. Prefer externalized session stores or recomputable context with durable scratchpads. When sticky sessions are unavoidable, document session migration playbooks during node drains.

2.13 Heterogeneous Pools: CPU, GPU, NPU

Not every hop needs a GPU. Regex validators, JSON schema checkers, and lightweight classifiers belong on CPU pools—or WASM sandboxes. NPUs at the edge handle embedding micro-batches efficiently. The reference architecture should show three elastic pools with a unified router, not a monolith icon labeled "AI."

2.14 Network Egress and Egress Cost

Tool-heavy agents generate egress charges that dwarf token costs in some SaaS setups. Track egress per workflow. Place tool gateways in the same region as data sources. Cache tool responses where policy allows.

2.15 Vera Rubin / DSX as Planning Signals (Not Ads)

Industry chatter in 2026 highlights next-gen accelerators with improved matrix ops per watt and datacenter power orchestration that treats racks as unified power domains. Regardless of vendor, your abstraction layer should record tokens_per_watt and $/trillion_tokens from benchmarks you run—not slides you receive. Schedule re-benchmarks quarterly; agent mixes shift faster than silicon roadmaps.

2.16 Factory Maturity Model

Stage	Characteristics
0 Ad hoc	One endpoint, no tags
1 Metered	workflow_id, basic dashboards
2 Routed	tiers, lanes, fallbacks
3 FinOps	$/outcome, showback
4 Resilient	shadow/canary, workflow SLOs
5 Optimized	automated tier tuning, predictive capacity

Most enterprises claiming "AI factory" are stage 1–2. Be honest in assessments.

2.17 Partner Integration: Orchestrators and MCP

If you deploy MCP tool meshes, the factory router should sit above tool execution, not behind it—policy first, then tools, then models. Cross-link your MCP gateway standards with factory identity so audit logs correlate tool calls and inference spans under one workflow_id.

2.18 Disaster Recovery

DR for factories isn't "restore the model weights." It's: restore routing tables, tier maps, budget configs, observability pipelines, and orchestrator checkpoints. Run DR drills that fail over routers while keeping data-plane residency constraints intact.

2.19 Reference Deployment Topologies

Topology A — Single region hub: simplest, best for mid-market. Topology B — Active-active dual region: for resiliency with residency constraints per tenant. Topology C — Hub + edge satellites: manufacturing, retail, healthcare bedside. Pick topology before SKU shopping.

2.20 API Gateway vs Service Mesh

Gateways handle auth, WAF, rate limits. Mesh handles mTLS and fine-grained service policies. Inference routers often live as a control plane service behind the gateway, not inside the mesh data path—keep hot paths short.

2.21 Model Registry Fields

Registry entries should include: model_id, build, context_limit, supports_tools, supports_vision, cost_coefficient, residency, deprecation_date, eval_scorecard_id. Deprecation dates force migration conversations early.
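The registry fields map naturally onto a frozen dataclass. The field types below, the `deprecating_soon` helper, and the 90-day window are assumptions for illustration; only the field names come from the list above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RegistryEntry:
    model_id: str
    build: str
    context_limit: int
    supports_tools: bool
    supports_vision: bool
    cost_coefficient: float
    residency: tuple[str, ...]  # regions where this build may serve traffic
    deprecation_date: date
    eval_scorecard_id: str


def deprecating_soon(
    entries: list[RegistryEntry],
    today: date,
    within_days: int = 90,
) -> list[str]:
    """Return model_ids whose deprecation date falls inside the warning window."""
    cutoff = today + timedelta(days=within_days)
    return [entry.model_id for entry in entries if entry.deprecation_date <= cutoff]
```

Running a check like this in a weekly report is one way to start the migration conversation before the date arrives.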

2.22 Routing Experiments (A/B)

Run controlled experiments on tier policies with guardrails: max 5% traffic, automatic rollback on $/outcome regression. Experiments are how you learn if mid-tier can handle planner hops for specific classes—opinions don't scale.
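The two guardrails can be expressed as a single decision function. This is a sketch under stated assumptions: the function name, the return strings, and the optional tolerance band are hypothetical, and a real system would also require a minimum sample size before judging a regression.

```python
def experiment_decision(
    traffic_share: float,
    control_usd_per_outcome: float,
    treatment_usd_per_outcome: float,
    max_share: float = 0.05,
    tolerance: float = 0.0,
) -> str:
    """Apply the traffic cap first, then roll back on a $/outcome regression."""
    if traffic_share > max_share:
        return "reject"
    if treatment_usd_per_outcome > control_usd_per_outcome * (1 + tolerance):
        return "rollback"
    return "continue"
```

Evaluating this on every metrics window keeps rollback automatic, so a tier experiment that turns out more expensive per outcome never waits for a human to notice.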

2.23 Private vs Public Model Paths

Hybrid factories route restricted workflows to private weights; general tasks to public APIs with redaction gateways in between. Document data flow diagrams for security reviews—auditors ask every time.

2.24 Cost of Complexity

Every routing dimension adds operational burden. Start with three workflow classes and three tiers. Expand when showback proves pain, not when architects get bored.

2.25 Security Architecture Overlays

Place policy enforcement points before model calls: PII scanners, prompt injection classifiers (small models), tool allowlists, output filters for restricted classes. Security isn't a post-hoc filter on completions; it's part of routing decisions (deny, degrade, require_hitl).

2.26 Interoperability Standards

Push internal teams toward OpenTelemetry trace context propagation. Align tool interfaces with MCP where possible so agent frameworks swap without rewriting factory clients. Standards reduce glue work more than another Kubernetes operator.

2.27 Scaling the Control Plane

Control plane services are cheap relative to GPUs—but they must scale horizontally. Use read replicas for registry and policy stores; cache hot policy bundles at routers with version hashes. Stale policy is a security incident; overloaded policy servers are an availability incident.

2.28 Architectural Review Checklist

Before production launch: lanes defined? tiers mapped? residency enforced? budgets attached? rollback tested? shadow metrics compared? If any answer is no, you're not ready—regardless of demo applause.

2.29 Extended Narrative: Designing the Router

The router is the moral center of the factory. It encodes what you value: safety, cost, speed, quality. Start with a rules engine you can read in a code review; add ML routing only when rules leave money on the table and you have labeled outcomes. Rules should live in git, versioned, reviewed by security and FinOps, deployed like any other service.

Fallback chains must be tested. If mid-tier fails open to frontier automatically, you'll never know mid-tier was broken until the invoice arrives. Tests should inject failures: latency spikes, 429 storms, garbage outputs. The router should degrade gracefully—shorter answers, narrower tools, delayed fan-out—before hard failing user workflows.
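
A failure-injection test for a fallback chain might look like the sketch below. The key design choice is that the router returns the list of tiers that failed, so a silent fail-open to frontier becomes a visible metric. `TierError`, `call_with_fallback`, and the stub backends are hypothetical names for illustration.

```python
from typing import Callable, Dict, List, Tuple


class TierError(Exception):
    """Raised by a tier backend on timeout, 429, or unusable output."""


def call_with_fallback(chain: List[str],
                       backends: Dict[str, Callable[[str], str]],
                       prompt: str) -> Tuple[str, str, List[str]]:
    """Try tiers in order; return (tier_used, output, failed_tiers)."""
    failed: List[str] = []
    for tier in chain:
        try:
            return tier, backends[tier](prompt), failed
        except TierError:
            failed.append(tier)  # record the failure instead of hiding it
    raise TierError(f"all tiers failed: {failed}")


def broken_mid(prompt: str) -> str:
    raise TierError("injected 429 storm")  # simulated failure


def healthy_frontier(prompt: str) -> str:
    return "ok:" + prompt


tier, output, failed = call_with_fallback(
    ["mid", "frontier"],
    {"mid": broken_mid, "frontier": healthy_frontier},
    "summarize claim",
)
```

Alerting on a non-empty `failed` list is what surfaces a broken mid-tier before the invoice does.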

Edge vs core decisions are data residency decisions first, latency second. I've seen edge deployments justified for latency while the real win was keeping patient data off the WAN. Be honest in architecture decision records.

When industry news discusses Vera Rubin-class accelerators or DSX power fabrics, translate hype into benchmark tasks: your top five workflows, measured on candidate kit, reported as $/outcome and tokens/watt. Everything else is marketing until it passes your harness.

2.30 Service Level Objectives for Routing

The router itself needs SLOs: policy evaluation p99 <20ms, registry lookup p99 <10ms, decision availability 99.95%. Router outages stall every agent—treat control plane as tier-0.

2.31 Data Planes for Embeddings vs Generation

Split embedding inference from autoregressive generation pools. Embedding bursts from retrieval shouldn't delay decode on interactive lanes. Publish separate capacity signatures for each.

2.32 Quota and Throttle Design

Per-tenant quotas: max concurrent workflows, max tokens in flight, max fan-out depth. Expose quota headers in API responses so product UIs explain waits instead of mysterious hangs.
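
As a rough illustration, quota state could be turned into response headers like this. The header names (`X-Quota-*`) and the two dataclasses are invented for the example; use whatever naming your API gateway standards already define.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TenantQuota:
    max_concurrent_workflows: int
    max_tokens_in_flight: int
    max_fanout_depth: int


@dataclass
class TenantUsage:
    concurrent_workflows: int
    tokens_in_flight: int


def quota_headers(quota: TenantQuota, usage: TenantUsage) -> Dict[str, str]:
    """Build hypothetical quota headers; remaining values never go below zero."""
    return {
        "X-Quota-Workflows-Remaining": str(
            max(0, quota.max_concurrent_workflows - usage.concurrent_workflows)),
        "X-Quota-Tokens-Remaining": str(
            max(0, quota.max_tokens_in_flight - usage.tokens_in_flight)),
        "X-Quota-Max-Fanout-Depth": str(quota.max_fanout_depth),
    }


headers = quota_headers(TenantQuota(20, 2_000_000, 3), TenantUsage(18, 2_100_000))
```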

2.33 Testing Reference Architectures

Contract tests between orchestrator and factory client. Golden routing decisions for fixture requests. Chaos tests: registry down, policy store stale, GPU pool 429 storm.

2.34 Documentation for Product Teams

Developer portal pages: how to declare workflow classes, how to read showback tags, how to request tier exceptions, how to interpret errors (BUDGET_EXCEEDED, RESIDENCY_DENIED, FANOUT_CAP).

2.35 Platform Boundaries

Clarify what central platform owns vs what product squads own. Platform owns router, pools, observability baselines; squads own agent graphs within policy. Boundary disputes cause shadow endpoints.

2.36 Future-Proofing Routing Schema

Extensible policy schema with version field. Add dimensions without breaking clients: carbon_budget, jurisdiction, experiment_id as optional fields.

2.37 Reference Architecture Variants for Regulated Industries

Variant R1: no public API paths, all weights on-prem, HSM-backed keys. Variant R2: hybrid with redaction gateway. Variant R3: sovereign cloud regions only. Document each with network diagrams for auditors.

2.38 Cost-Aware Routing Simulation

Offline simulator: feed week of traces, try tier policies, output $/outcome distributions. Use before changing production weights—avoid live experiments on revenue workflows without backup.
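
A toy version of such a simulator is sketched below. It replays recorded hops under different tier policies and reports per-workflow cost. It deliberately ignores the effect of a tier change on success rate, so it produces cost distributions rather than true $/outcome; a real simulator would need a quality model on top. The trace shape, hop kinds, and policies are assumptions, and the prices reuse the per-1K figures from the cost estimator codelab.

```python
import statistics
from typing import Dict, List, Tuple

PRICE_PER_1K = {"frontier": 0.015, "mid": 0.004, "small": 0.0009}

Hop = Tuple[str, int]   # (hop_kind, total_tokens), an assumed trace shape
Trace = List[Hop]


def simulate(traces: List[Trace], policy: Dict[str, str]) -> List[float]:
    """Return the USD cost of each recorded workflow under a hop_kind -> tier policy."""
    costs = []
    for trace in traces:
        usd = sum(tokens / 1000 * PRICE_PER_1K[policy[kind]] for kind, tokens in trace)
        costs.append(usd)
    return costs


# Two made-up workflows standing in for a week of traces.
traces = [
    [("plan", 4000), ("tool", 2000), ("summarize", 1000)],
    [("plan", 8000), ("summarize", 2000)],
]
all_frontier = {"plan": "frontier", "tool": "frontier", "summarize": "frontier"}
tiered = {"plan": "frontier", "tool": "small", "summarize": "mid"}

frontier_costs = simulate(traces, all_frontier)
tiered_costs = simulate(traces, tiered)
print("median USD/workflow:",
      statistics.median(frontier_costs), "vs", statistics.median(tiered_costs))
```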

2.39 Closing Architecture Principles

Keep hot paths short, policies versioned, pools isolated, observability mandatory, and vendors interchangeable. Architectures that fail these principles don't survive the first hardware refresh cycle.

2.40 Platform Engineering Operating Model

Running a factory is a product, not a project. Staff a platform product manager, three to six senior platform engineers, an SRE liaison, and a FinOps analyst embedded at least quarter-time. Roadmap items come from showback pain, incident retros, and migration milestones—not vendor roadshows.

Sprint rhythm: two-week delivery with one week stabilization each quarter for game days and DR. OKRs tie to workflow SLO attainment and $/outcome improvement, not 'models deployed.'

Internal customers rate the factory via quarterly surveys: time to onboard new workflow class, clarity of errors, trust in chargeback. Low scores trigger usability epics, not more policy PDFs.

Partner with security early on router policies. Late security review rewrites routing and delays migrations by months. Security champions attend factory council.

Finally, celebrate decommissioning legacy endpoints. Each decommissioned chat cluster means less operational drag and clearer cost attribution.

2.41 Reference Architecture Review Questions

Before any executive demo, answer these in writing: Where does policy enforce residency? What happens when frontier tier times out? How are tool calls audited? What is max fan-out? How is $/outcome computed? Weak answers predict production pain.

2.42 Building vs Buying Control Planes

Buy components (observability, GPU cloud), build differentiation (router, FinOps tags, workflow lanes). Over-buying "AI suites" often reintroduces chat-centric metrics. Under-building governance reintroduces shadow APIs.

2.43 Technical Debt in Routing Rules

Rules accumulate exceptions: "workflow X may use frontier on Tuesdays." Schedule rule retirements. Debt-laden rules confuse simulators and humans alike.

2.44 Multi-Tenant Noisy Neighbor Controls

Per-tenant concurrency caps and token debt prevent one tenant's marketing campaign from starving another's payroll agents. Noisy neighbor stories are common in first multi-tenant factories.

2.45 Architecture Documentation Set

Maintain: C4 context diagram, data flow for PII, sequence for happy path, sequence for degrade path, tier matrix CSV, DR topology. Update within one sprint of changes or docs lie.

2.10 Chapter 2 Synthesis

Reference architecture is deliberately boring: stateless routers, explicit tiers, separated lanes, edge/core boundaries, and observability everywhere. Boring factories survive Black Friday fan-out.

Chapter 3: FinOps Model & Showback

3.1 From Tokens to Outcomes: The Only Metric Executives Trust

FinOps for chat asked: How much per thousand messages? FinOps for agents must ask: How much to complete the workflow successfully? Define $/outcome = (total_inference_cost + tool_cost + human_review_cost) / successful_workflows.
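
A short worked example makes the metric concrete, reading the numerator as the sum of all three cost terms. All dollar figures below are hypothetical. The second scenario shows how a 30% cut in inference spend can still raise $/outcome when human review doubles and successes fall.

```python
def usd_per_outcome(inference_usd: float, tool_usd: float,
                    human_review_usd: float, successful_workflows: int) -> float:
    """Total cost of all attempts divided by the number of successful workflows."""
    if successful_workflows <= 0:
        raise ValueError("need at least one successful workflow")
    return (inference_usd + tool_usd + human_review_usd) / successful_workflows


# Hypothetical month before tier changes: $6,000 total over 1,500 successes.
before = usd_per_outcome(4200.0, 600.0, 1200.0, 1500)

# Hypothetical month after: inference down 30%, but review cost doubles
# and fewer workflows succeed.
after = usd_per_outcome(2940.0, 600.0, 2400.0, 1350)

print(f"before: ${before:.2f}  after: ${after:.2f}")
```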

I've watched teams celebrate 30% token reduction while $/outcome rose—because cheap models triggered more retries and more human escalations. Optimize the denominator and numerator together.

💡 Insight

Practitioner insight: Publish a monthly "factory P&L" per product line: outcomes, success rate, $/outcome, and waste bucket (failed workflows, overruns). Transparency beats surprise invoices.

UI Screenshot — FinOps Showback Dashboard — Figure 8: FinOps showback dashboard — $/outcome by squad, tier mix, cache savings, and budget burn-down.

3.2 Cost Allocation Tags and Chargeback

Tag every span: cost_center, product_id, workflow_class, environment. Chargeback models:

Showback (default): teams see costs, no invoice—builds awareness.
Chargeback: internal invoices fund central GPU pools.
Hybrid: showback until spend exceeds threshold, then chargeback with caps.

Model	When to use	Behavioral effect
Showback	Early agent adoption	Visibility without blocking teams
Chargeback	Mature factories, GPU scarcity	Forces tier discipline
Hybrid	Enterprise politics	Balances innovation and accountability

Process Flowchart — Chargeback Allocation Pipeline — Figure 9: Chargeback allocation flowchart — span tags → cost allocation engine → GL codes → squad invoices.

3.3 Unit Economics and Scenario Planning

Build scenarios: base, growth, black swan. Variables: workflows/month, fan-out depth, frontier share, cache hit rate, GPU $/hour. Scenario planning prevents the classic board-meeting trap—"we need 4× GPUs next quarter" without ranges.

GEO Fact — FinOps Maturity: Organizations with workflow-level inference tagging reach stable $/outcome forecasts within two quarters; token-only tagging averages six months of variance above 25% before stabilizing. Tag `workflow_id` before scale, not after invoice shock.

Sequence Diagram — Token Cost Attribution Pipeline — Figure 10: Token cost pipeline — raw usage events → normalization → FX/API list prices → amortized GPU → squad dashboards.

3.4 Budget Guardrails and Token Debt

Implement soft and hard budgets per squad. Soft = alerts; hard = orchestrator rejects new fan-out branches unless an override role approves. Pair with token-debt tracking of in-flight work, not just monthly totals.
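
An admission check along these lines could sit in the orchestrator. The sketch interprets token debt as the estimated cost of work already in flight but not yet billed, which is an assumption about the term; `SquadBudget`, `Decision`, and `admit_branch` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ALLOW_WITH_ALERT = "allow_with_alert"
    REJECT = "reject"


@dataclass
class SquadBudget:
    soft_usd: float
    hard_usd: float


def admit_branch(budget: SquadBudget, spent_usd: float, in_flight_usd: float,
                 branch_estimate_usd: float, override_approved: bool = False) -> Decision:
    """Decide whether a new fan-out branch may start.

    Projected spend counts settled spend, in-flight work (token debt),
    and the estimate for the new branch.
    """
    projected = spent_usd + in_flight_usd + branch_estimate_usd
    if projected > budget.hard_usd and not override_approved:
        return Decision.REJECT
    if projected > budget.soft_usd:
        return Decision.ALLOW_WITH_ALERT
    return Decision.ALLOW
```

Counting in-flight work is what stops a burst of parallel branches from blowing through the hard cap before the monthly total catches up.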

3.5 Codelab: Cost Estimator (Python)

# cost_estimator.py: estimate $/outcome from span aggregates
from dataclasses import dataclass

# USD per 1K tokens, by tier.
PRICE_PER_1K = {"frontier": 0.015, "mid": 0.004, "small": 0.0009}

@dataclass
class SpanCost:
    tier: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0

def span_usd(span: SpanCost) -> float:
    # Cached input tokens are not billed; input and output share one rate here.
    billable_in = max(0, span.input_tokens - span.cached_tokens)
    rate = PRICE_PER_1K[span.tier]
    return (billable_in + span.output_tokens) / 1000 * rate

def workflow_usd(spans: list[SpanCost]) -> float:
    # A workflow's cost is the sum over its hops.
    return sum(span_usd(s) for s in spans)

3.6 Codelab: Showback Reporter (TypeScript)

// showback.ts: roll up spans to squad monthly
type Span = { squad: string; usd: number; workflowId: string; success: boolean };

export function monthlyShowback(spans: Span[]) {
  // Track workflow IDs in sets so a multi-hop workflow is counted once, not once per span.
  const bySquad = new Map<string, { usd: number; outcomes: Set<string>; ok: Set<string> }>();
  for (const s of spans) {
    const row = bySquad.get(s.squad) ?? { usd: 0, outcomes: new Set<string>(), ok: new Set<string>() };
    row.usd += s.usd;
    row.outcomes.add(s.workflowId);
    // Assumes `success` is a workflow-level flag copied onto each of its spans.
    if (s.success) row.ok.add(s.workflowId);
    bySquad.set(s.squad, row);
  }
  return [...bySquad.entries()].map(([squad, v]) => ({
    squad,
    usd: v.usd,
    workflows: v.outcomes.size,
    successRate: v.ok.size / v.outcomes.size,
    // Math.max avoids division by zero; with no successes this reports total spend.
    usdPerOutcome: v.usd / Math.max(1, v.ok.size),
  }));
}

3.7 Waste Buckets: Failed, Abandoned, Over-Tiered

Classify waste:

Failed workflows — burn tokens, no outcome (fix reliability).
Abandoned workflows — user timeout (fix UX/latency).
Over-tiered hops — frontier where mid suffices (fix router).

3.8 Executive Narrative and Case Study Pattern

Tie factory metrics to revenue and risk: faster claims processing, fewer compliance escapes. Reference anonymized case study patterns where migration + tiering reduced $/outcome 35–50% without success-rate drop.

3.9 Contracting with Cloud and Silicon Vendors

Negotiate committed use with burst buffers for agent seasonality. Include observability rights (per-minute GPU metrics) in contracts—vendor averages hide fan-out spikes.

3.11 Amortizing CapEx in $/Outcome

Cloud inference is OpEx-heavy; on-prem clusters blend CapEx amortization, power, cooling, and staff. For hybrid factories, build a blended rate card per GPU-hour that finance accepts, then let the cost pipeline allocate span dollars against that card. Without blended rates, product teams compare apples (API list price) to oranges (owned H100s).
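
One simple way to derive a blended rate is to spread monthly owned-cluster cost over the GPU-hours you expect to actually use. The formula and every number below are illustrative assumptions; finance may prefer a different amortization schedule or utilization basis.

```python
def blended_gpu_hour_rate(capex_usd: float, amortization_months: int,
                          monthly_power_cooling_usd: float, monthly_staff_usd: float,
                          gpu_count: int, hours_per_month: float = 720.0,
                          expected_utilization: float = 0.5) -> float:
    """USD per utilized GPU-hour for an owned cluster (straight-line amortization)."""
    if not 0 < expected_utilization <= 1:
        raise ValueError("expected_utilization must be in (0, 1]")
    monthly_cost = (capex_usd / amortization_months
                    + monthly_power_cooling_usd + monthly_staff_usd)
    utilized_hours = gpu_count * hours_per_month * expected_utilization
    return monthly_cost / utilized_hours


# Hypothetical cluster: $3.6M CapEx over 36 months, 100 GPUs, 50% utilization.
rate = blended_gpu_hour_rate(3_600_000, 36, 20_000, 30_000, gpu_count=100)
print(f"blended rate: ${rate:.2f} per GPU-hour")
```

Dividing by utilized rather than total hours pushes idle capacity into the rate, which is one way to make low utilization visible to the teams consuming it.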

3.12 Chargeback Politics and Productivity

Chargeback can starve innovation if applied too early. I recommend 12 weeks of showback with weekly office hours before first invoices. Pair chargeback with guardrailed sandboxes so teams can experiment without production budgets.

3.13 Outcome Definitions That Don't Lie

Define "successful outcome" per workflow class with product and legal sign-off:

Insurance claim: adjudicated status in {approved, denied} with audit trail.
Code migration agent: PR merged with green CI.
Research agent: report delivered with citations meeting policy.

Failed or partial outcomes must still record spend in a waste bucket—otherwise teams game success flags.

3.14 FinOps Rituals

Monthly: factory P&L review. Quarterly: scenario replan against actual F and G distributions. Annually: vendor commit renegotiation using your tokens-per-watt benchmarks.

3.15 Integration with Enterprise GL

Map cost_center tags to GL accounts. Export CSV or API feeds finance can ingest. The chargeback flowchart isn't vanity—it prevents manual spreadsheet hell every month.

3.16 Sensitivity Analysis Template

Variables to stress-test:

Frontier share +10 pts
Cache hit rate -15 pts
Fan-out p95 +4
Tool latency +2s
Failure rate +3 pts

Present tornado charts to executives. They understand ranges better than point forecasts.
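
The data behind a tornado chart comes from varying one input at a time and ranking the impact. The sketch below uses a deliberately simple toy cost model: the base values, the model formula, and the choice to price non-frontier traffic at the mid-tier rate are all assumptions. Tool latency is left out because this toy model has no latency term.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Scenario:
    frontier_share: float = 0.30
    cache_hit_rate: float = 0.40
    fan_out: int = 8
    failure_rate: float = 0.05
    tokens_per_hop: int = 5000


def usd_per_outcome(s: Scenario) -> float:
    """Toy model: blended token price, cached tokens free, failures inflate cost."""
    blended_rate = s.frontier_share * 0.015 + (1 - s.frontier_share) * 0.004
    billable_tokens = s.fan_out * s.tokens_per_hop * (1 - s.cache_hit_rate)
    return billable_tokens / 1000 * blended_rate / (1 - s.failure_rate)


# One-at-a-time stresses taken from the list above.
STRESSES = {
    "frontier_share +10 pts": {"frontier_share": 0.10},
    "cache_hit_rate -15 pts": {"cache_hit_rate": -0.15},
    "fan_out +4": {"fan_out": 4},
    "failure_rate +3 pts": {"failure_rate": 0.03},
}


def tornado(base: Scenario) -> List[Tuple[str, float]]:
    """Return (label, relative change in $/outcome), largest impact first."""
    base_cost = usd_per_outcome(base)
    rows = []
    for label, deltas in STRESSES.items():
        stressed = replace(base, **{k: getattr(base, k) + v for k, v in deltas.items()})
        rows.append((label, usd_per_outcome(stressed) / base_cost - 1))
    return sorted(rows, key=lambda r: abs(r[1]), reverse=True)


rows = tornado(Scenario())
```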

3.17 Human-in-the-Loop Economics

HITL isn't free. Model a fully loaded reviewer minute and add to $/outcome when workflows escalate. Sometimes a slightly more expensive tier eliminates escalations—net win on $/outcome even if tokens rise.
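
The trade-off is easy to check with arithmetic. In this sketch the inference costs, escalation rates, review time, and reviewer rate are all hypothetical figures chosen to illustrate the point.

```python
def usd_per_workflow_with_hitl(inference_usd: float, escalation_rate: float,
                               review_minutes: float, reviewer_usd_per_minute: float) -> float:
    """Expected cost per workflow including the expected human-review cost."""
    return inference_usd + escalation_rate * review_minutes * reviewer_usd_per_minute


# Cheaper tier escalates 20% of workflows; pricier tier escalates 5%.
mid_tier = usd_per_workflow_with_hitl(0.40, 0.20, 6.0, 1.50)
frontier_tier = usd_per_workflow_with_hitl(0.90, 0.05, 6.0, 1.50)

print(f"mid: ${mid_tier:.2f}  frontier: ${frontier_tier:.2f}")
```

Under these assumed numbers the tier with more than double the inference cost is cheaper overall, because each avoided escalation saves nine dollars of reviewer time.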

3.18 Case Study Narrative (Anonymized Pattern)

A global insurer moved claims agents from a single frontier endpoint to tiered factory routing with compaction hops. Tokens per workflow dropped 22%, but outcomes per hour rose 41% because p99 latency improved and escalations fell. $/outcome fell 38%. Reference similar patterns in your case study portfolio when pitching executives.

3.19 Finance Partnership Checklist

Agree on blended GPU-hour rate
Define outcome catalog
Align calendar close dates for chargeback exports
Establish variance threshold alerts (>10% MoM)
Sponsor executive readout quarterly

Without finance partnership, FinOps remains a dashboard nobody funds.

3.20 Token Price Volatility

API list prices change. Maintain price version tables in the cost pipeline; backfill last quarter when prices drop so teams see goodwill credits in showback. Transparency builds trust when vendors cut prices.
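
A price version table can be as small as a sorted list of effective dates per tier. In the sketch below the table contents are invented (the second frontier price in particular is hypothetical), and `price_on` is an illustrative helper name.

```python
import bisect
from datetime import date

# (effective_from, USD per 1K tokens), sorted by date. Values are examples only.
PRICE_VERSIONS = {
    "frontier": [(date(2026, 1, 1), 0.015), (date(2026, 4, 1), 0.012)],
    "mid": [(date(2026, 1, 1), 0.004)],
}


def price_on(tier: str, day: date) -> float:
    """Return the price in effect for a tier on a given day."""
    versions = PRICE_VERSIONS[tier]
    effective_dates = [d for d, _ in versions]
    idx = bisect.bisect_right(effective_dates, day) - 1
    if idx < 0:
        raise LookupError(f"no price for {tier} on {day}")
    return versions[idx][1]
```

Pricing each span by its own timestamp, rather than by today's price, is what makes backfills and dispute drill-downs reproducible.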

3.21 Squad-Level Coaching

When showback highlights a squad with high $/outcome and low success rate, assign platform coach for two sprints—fix graphs, not blame people. Culture matters for sustainable factories.

3.22 Reserved Capacity vs On-Demand

Model reserved GPU blocks for baseline signatures; burst on-demand for seasonality. FinOps scenario planning should include commit utilization—finance hates 40% reserved idle.

3.23 Attribution Edge Cases

Shared platform services (embedding index, reranker) need allocation rules: by token share, by query count, or by workflow count. Document the rule; change it yearly if unfairness appears.
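
Whichever driver you choose, the allocation itself is a proportional split. A minimal sketch, with made-up squad names and an even-split fallback for a month with no usage (that fallback is an assumption, not a rule from the text):

```python
from typing import Dict


def allocate_shared_cost(shared_usd: float, usage_by_squad: Dict[str, float]) -> Dict[str, float]:
    """Split a shared service's cost in proportion to each squad's usage.

    Usage can be tokens, queries, or workflows, depending on the documented rule.
    """
    total = sum(usage_by_squad.values())
    if total == 0:
        # Assumed fallback: split evenly when nobody used the service.
        return {squad: shared_usd / len(usage_by_squad) for squad in usage_by_squad}
    return {squad: shared_usd * usage / total for squad, usage in usage_by_squad.items()}


shares = allocate_shared_cost(1000.0, {"claims": 600, "billing": 300, "research": 100})
```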

3.24 Executive One-Pager

One page: outcomes/month, $/outcome trend, top waste bucket, migration status, next quarter CapEx/OpEx ask. If you can't fit it on one page, the narrative isn't crisp enough.

3.25 Building the First Showback Dashboard

Start ugly but correct: table of squads with workflows, success rate, tokens, estimated USD, $/outcome. Add charts later. Finance prefers accurate tables over pretty lies.

3.26 Negotiating with Product Leadership

Product wants infinite frontier; finance wants caps. Mediate with data: show three tier policies side by side with projected outcomes/hour and $/outcome. Decisions become rational.

3.27 FinOps Toolchain

Typical stack: span exporter → streaming bus → warehouse → dbt models → BI dashboard → monthly CSV to ERP. Keep the pipeline boring and tested.

3.28 When Chargeback Fails

If chargeback causes teams to bypass the factory with shadow API keys, you've lost governance. Fix incentives: fund innovation sandboxes with explicit caps instead of forcing shadow IT.

3.29 Extended Narrative: FinOps as Product Management

FinOps for factories is product management with dollars attached. Outcome catalogs are your SKU list. If you can't name outcomes, you can't price them. Workshop outcomes with legal and operations before finance—otherwise you'll argue about definitions during invoice disputes.

Showback is a teaching tool. I run office hours where squads see their graphs and propose optimizations—compaction, tier changes, tool bulkheads. The best ideas come from teams who feel costs, not from central mandates.

Chargeback is a behavior tool. Apply it when scarcity is real and culture is ready. Hybrid models work: platform funds baseline capacity; squads pay marginal burst. Transparency about the formula matters more than precision to the penny.

Scenario planning saved a retail client from over-procuring 30% extra GPUs for holiday agents. We modeled fan-out under promotional campaigns, stress-tested tool latency, and kept headroom in on-demand burst instead of reserved idle. Finance approved because we showed bands, not points.

3.30 FinOps Data Model

Core tables: spans, workflows, outcomes, price_versions, allocations. Enforce referential integrity on workflow_id. Document grain: one row per hop, aggregated to workflow in BI layer.

3.31 Anomaly Detection on Spend

Alert when squad spend z-score >3 vs trailing 28 days. Investigate: new agent launch, fan-out bug, cache break, pricing change. Automate tickets to squad + platform.
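
The check itself is a few lines with the standard library. This sketch assumes daily spend totals per squad and a one-sided test (only spikes alert); the sample data is synthetic.

```python
import statistics
from typing import List


def spend_zscore(trailing_daily_usd: List[float], today_usd: float) -> float:
    """Z-score of today's spend against the trailing window (sample stdev)."""
    mean = statistics.fmean(trailing_daily_usd)
    stdev = statistics.stdev(trailing_daily_usd)
    if stdev == 0:
        # Flat history: any increase is treated as infinitely unusual.
        return 0.0 if today_usd == mean else float("inf")
    return (today_usd - mean) / stdev


def is_anomalous(trailing_daily_usd: List[float], today_usd: float,
                 threshold: float = 3.0) -> bool:
    return spend_zscore(trailing_daily_usd, today_usd) > threshold


# Synthetic 28-day history hovering around $100 to $104 per day.
trailing = [100.0 + (i % 5) for i in range(28)]
```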

3.32 Unit Economics for Internal Platforms

If you sell AI capabilities to internal business units, $/outcome becomes internal transfer price. Finance may require cost-plus model; engineering provides span-level COGS.

3.33 Budget Planning Season

Annual planning: ingest growth forecasts, agent roadmap, hardware contracts, scenario bands. Present three plans: lean, base, aggressive. Executives pick risk posture explicitly.

3.34 Transparency Reports

Quarterly AI factory transparency memo: total outcomes, total spend, $/outcome trend, top optimizations, incidents affecting cost. Builds trust with board and regulators.

3.35 FinOps for Multi-Cloud

Allocate egress, cross-cloud API fees, and duplicate indexing costs. Multi-cloud factories cost more—don't hide overhead in generic "AI line item."

3.36 Incentive Alignment

Reward squads for $/outcome improvement, not token reduction alone. Pair metrics with quality scores to prevent reckless tier chopping.

3.37 Contractual Pass-Through

When using third-party APIs, pass through list price changes with 30-day notice in internal chargeback. Surprises destroy FinOps credibility.

3.38 FinOps Tooling Evaluation Criteria

Accuracy, latency of cost pipeline (<24h lag acceptable for showback), auditability, RBAC, export formats, API for ERP. Buy vs build depends on data warehouse maturity.

3.39 Closing FinOps Principles

Measure outcomes, tag everything, showback before chargeback, scenario bands beat point forecasts, and finance is a partner—not an afterthought.

3.40 Board-Ready FinOps Narrative

Executives don't want token counts—they want risk-adjusted ROI. Frame factory investments as: capacity to complete N more outcomes per month at stable quality, with downside band if adoption overshoots.

Use analogies: GPUs are freight capacity; agents are trucks; workflows are deliveries. Empty trucks (idle GPUs) and detours (retries) cost money. FinOps makes logistics visible.

When spend spikes, bring three explanations: volume growth, efficiency regression, or price change. Data splits prevent witch hunts against single squads.

Align AI factory budget with product P&L owners who benefit from outcomes. Shared fate improves tier discipline more than central mandates.

Document assumptions in board decks: fan-out depth, frontier share, cache hit rate. Assumptions age; date-stamp them.

3.41 Working with Procurement

Procurement wants commit discounts; engineering wants burst. Present scenario bands to negotiate commit utilization targets with escape valves for burst via on-demand. Avoid commits based on pre-agent chat forecasts.

3.42 Tax and Transfer Pricing

Multinationals may need transfer pricing for internal AI charges. Tag spans with legal entity IDs early; retrofitting is painful.

3.43 Sustainability Reporting

If ESG reports include IT carbon, factory metrics enable tokens/watt trends post-tiering. Even rough proxies beat "we bought renewables" without workload efficiency.

3.44 Dispute Resolution

When squads dispute charges, provide span-level drill-down within 48 hours. Slow disputes erode chargeback acceptance.

3.45 FinOps Certification for Platform Team

Encourage FinOps Certified Practitioner training for platform leads. Shared vocabulary with finance accelerates budget cycles.

3.46 Quarterly Business Review Pack

Export slides automatically from warehouse: outcomes trend, $/outcome by class, waste bucket pie, migration status RAG, next quarter CapEx/OpEx ask. Consistent format builds executive trust quarter over quarter.

3.47 Aligning R&D and FinOps

Research sandboxes get explicit monthly burn caps with auto-shutdown. Research without caps becomes "surprise invoice" stories that kill chargeback programs.

3.48 Outcome Quality Metrics in FinOps

Attach quality score (human eval or automated) to $/outcome charts. Cheap outcomes that fail quality aren't wins—they're future incident costs.

3.49 Closing the Loop with Product Roadmaps

When FinOps shows rising $/outcome for a workflow class, product must choose: fund optimization, accept price increase, or reduce automation scope. Escalate to VP if unresolved two quarters—unbounded spend is a strategy failure, not an ops puzzle.

3.50 Factory Subsidy vs Full Chargeback

Some enterprises subsidize early agent adoption. Document subsidy end date upfront. Permanent subsidy trains teams to ignore efficiency forever.

FinOps maturity is a journey: showback teaches, chargeback disciplines, scenarios prevent panic.

3.10 Chapter 3 Synthesis

FinOps isn't finance vs engineering—it's a shared language. $/outcome, tagged spans, showback → chargeback maturity, and scenario bands keep factories fundable.

Chapter 4: Migration Methodology

4.1 Why Big-Bang Migrations Fail for Agents

Agents are stateful graphs. Big-bang cutovers break tool versions, change latency profiles, and invalidate prompt caches overnight. Use shadow → canary → cutover with explicit promotion criteria tied to $/outcome and SLO—not gut feel.

System Architecture — Phased Migration Topology — Figure 11: Phased migration topology — legacy pool, shadow mirror, canary slice, target factory pool with traffic percentages.

4.2 Shadow Traffic: Compare Without Risk

Shadow mode duplicates inference requests to the target factory while serving users from legacy. Compare: token counts, latencies, output fingerprint matches (for deterministic tasks), and $/outcome. Don't shadow PII-heavy flows until redaction parity is proven.

4.3 Canary Releases and Automated Promotion

Start canary at 1–5% workflows per class. Promotion gates example:

p99 latency within 10% of legacy
$/outcome not worse than 5%
error rate ≤ legacy baseline
no security policy regressions
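
The example gates above translate directly into an automated check. This sketch reads "within 10%" as "no more than 10% slower than legacy", which is an interpretation; `CohortMetrics` and `failed_gates` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CohortMetrics:
    p99_latency_ms: float
    usd_per_outcome: float
    error_rate: float
    security_regressions: int


def failed_gates(legacy: CohortMetrics, canary: CohortMetrics) -> list[str]:
    """Return the names of promotion gates the canary fails (empty list = promote)."""
    failures = []
    if canary.p99_latency_ms > legacy.p99_latency_ms * 1.10:
        failures.append("p99_latency")
    if canary.usd_per_outcome > legacy.usd_per_outcome * 1.05:
        failures.append("usd_per_outcome")
    if canary.error_rate > legacy.error_rate:
        failures.append("error_rate")
    if canary.security_regressions > 0:
        failures.append("security_policy")
    return failures
```

Returning the failing gate names rather than a bare boolean gives the on-call engineer the reason for a blocked promotion without another query.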

Process Flowchart — Shadow Canary Cutover Migration — Figure 12: Shadow → canary → cutover flowchart with rollback triggers and executive checkpoint.

4.4 GPU Generation Planning (Vendor-Neutral)

Hardware generations differ in memory bandwidth, FP8 support, and tokens/watt. Plan migrations as capacity equivalence exercises:

Benchmark representative agent DAGs on old vs candidate silicon.
Model $/outcome and watts/outcome, not sticker price.
Stage rack power and cooling before software cutover—DSX-style power orchestration trends matter at rack scale even if your design stays vendor-agnostic.

💡 Insight

Practitioner insight: Migrate workflows, not clusters. A cluster cutover without per-workflow shadow data is a coin flip with expensive tails.

UI Screenshot — Migration Status Dashboard — Figure 13: Migration status dashboard — shadow diff metrics, canary cohort health, rollback arm button.

4.5 Data Plane vs Control Plane Migration Order

Migrate control plane (router, tags, observability) first—legacy data plane can remain temporarily. Then shift pools workflow-by-workflow. Embeddings and vector indexes migrate on their own track with re-embed validation.

4.6 Rollback and Failure Drills

Pre-write rollback runbooks: DNS/weight shifts, feature flags, orchestrator pins. Quarterly game-day: force canary failure and measure MTTR.

4.7 Codelab: Shadow Comparator (Python)

# shadow_compare.py
import hashlib

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def compare_outputs(legacy: str, candidate: str) -> dict:
    return {
        "match": legacy == candidate,
        "fp_legacy": fingerprint(legacy),
        "fp_candidate": fingerprint(candidate),
        "len_delta": len(candidate) - len(legacy),
    }

4.8 Codelab: Canary Controller (TypeScript)

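// canary.ts: deterministic per-workflow bucketing
// Assumes workflowId ends in hex characters (for example, a UUID).
export function pickPool(workflowId: string, canaryPercent: number): "legacy" | "factory" {
  const parsed = parseInt(workflowId.slice(-4), 16);
  // Unparseable IDs yield NaN; fail safe by keeping them on legacy.
  if (Number.isNaN(parsed)) return "legacy";
  const bucket = parsed % 100;
  return bucket < canaryPercent ? "factory" : "legacy";
}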

4.9 Organizational Readiness

Train SREs on agent SLOs, not just HTTP 500s. Align FinOps on new price sheets before cutover—surprise chargeback destroys trust.

4.11 Migration Inventory: What Must Be Cataloged

Before shadow traffic: model versions, LoRA adapters, embedding indexes, router weights, prompt templates, tool schemas, budget configs, observability dashboards, and on-call runbooks. Missing any one causes false parity signals.

4.12 Parallel Embeddings Migration

Re-embed corpora on a schedule that doesn't starve online lanes. Use batch lanes for backfill; validate recall@k on golden sets before cutover. Embedding drift hurts agents silently—quality sinks while costs look fine.
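
Recall@k on a golden set is straightforward to compute. In this sketch the document IDs and the golden query are made up; a real validation would average the score over many golden queries and compare the old and new indexes.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("golden set for this query is empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# One golden query: three documents are known to be relevant.
golden = {"doc-1", "doc-4", "doc-9"}
old_index_results = ["doc-1", "doc-4", "doc-7", "doc-9", "doc-2"]
new_index_results = ["doc-1", "doc-7", "doc-2", "doc-4", "doc-8"]

old_recall = recall_at_k(old_index_results, golden, k=5)
new_recall = recall_at_k(new_index_results, golden, k=5)
```

A drop like the one in this example is the kind of silent regression that never surfaces as an error and only shows up in quality scores.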

4.13 Contract and License Gates

Verify license terms for weights and APIs in the target environment. Regulated industries may prohibit cross-border failover weights. Legal sign-off is a migration gate, not a paperwork afterthought.

4.14 Communication Plan

Stakeholders fear "AI downtime." Publish migration windows, success metrics, and rollback criteria. Tie comms to workflow classes affected, not technical pool names.

4.15 Post-Cutover Hypercare

First 72 hours after cutover: war room with router metrics, $/outcome deltas, and fan-out depth alarms. Freeze unrelated releases. Hypercare isn't optional for first factory cutovers.

4.16 Generational Benchmark Harness

Build a harness of 30–50 frozen workflows representing production mixes. Run on old and candidate silicon weekly. Track tokens/s, watts (if available), $/outcome, and quality scores. This is how you neutralize vendor marketing.

4.17 Decommissioning Legacy Pools

After cutover, legacy pools linger "just in case" and burn money. Set a decommission date with executive sponsor. Keep read-only shadow capability in cold storage configs, not hot GPUs.

4.18 Lessons from Failed Migrations

Common failures: migrating models before routers; skipping FinOps tag parity; allowing eval jobs into prod pools during canary; no rollback drill. Learn from others' outages—don't sponsor your own.

4.19 Shadow Metrics That Matter

Compare distributions, not just means: p99 latency, p95 tokens, error taxonomy counts. Use population stability indices on outcome labels. Shadow diffs should be automatic nightly reports, not manual spreadsheets.
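
For categorical outcome labels, a population stability index can be computed as the sum over categories of (actual share minus expected share) times the log of their ratio. The sketch below uses a small epsilon floor for categories missing on one side, which is a common workaround rather than a fixed standard; the label data is synthetic.

```python
import math
from collections import Counter
from typing import Sequence


def psi(expected_labels: Sequence[str], actual_labels: Sequence[str],
        eps: float = 1e-6) -> float:
    """Population stability index between two sets of categorical labels."""
    e_counts, a_counts = Counter(expected_labels), Counter(actual_labels)
    total = 0.0
    for category in set(e_counts) | set(a_counts):
        # Floor each share at eps so a missing category does not cause log(0).
        e = max(e_counts[category] / len(expected_labels), eps)
        a = max(a_counts[category] / len(actual_labels), eps)
        total += (a - e) * math.log(a / e)
    return total


legacy_outcomes = ["approved"] * 70 + ["denied"] * 30
candidate_outcomes = ["approved"] * 50 + ["denied"] * 50
shift = psi(legacy_outcomes, candidate_outcomes)
```

Identical distributions score zero, and the score grows as the candidate's outcome mix drifts from legacy; the alert threshold is a policy choice for the nightly report.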

4.20 Canary Cohort Selection

Stratify canary by workflow class and tenant size. Don't canary only friendly internal tenants—you'll miss production skew.

4.21 Blue/Green vs Canary for Stateless Routers

Routers can blue/green quickly; GPU pools often cannot. Sequence: blue/green router → canary data plane → full cutover.

4.22 Migration Tooling

Invest in replay tools: re-run production traces against candidate factories in batch lanes. Replay accelerates shadow coverage without risking live traffic.

4.23 Organizational RACI

Activity	Platform	FinOps	Product	Security
Shadow sign-off	R	C	I	C
Canary promotion	R	C	A	C
Budget change	C	R	I	I

Clear RACI prevents migration stalls.

4.24 Post-Migration Optimization Window

First 90 days after cutover: tune tier map weekly using showback. Biggest wins arrive after migration, not before.

4.25 Migration Communication Templates

Pre-migration: what changes, when, no customer impact expected, rollback window.

During canary: metrics tracked, known issues list, owner on-call.

Post-cutover: success criteria met, legacy decommission date, hypercare schedule.

4.26 Legal and Data Retention During Migration

Ensure shadow logs don't retain PII longer than policy allows. Mask in shadow paths when needed—even if comparison is harder.

4.27 Multi-Factory Consolidation

Enterprises sometimes operate regional factories. Consolidation promises efficiency but risks residency violations. Consolidate control planes, not necessarily data planes.

4.28 Learning Loop

After every migration, publish internal postmortem: estimated vs actual $/outcome, incidents, timeline slip causes. Institutional memory beats repeating mistakes.

4.29 Extended Narrative: Migration War Stories

The smoothest migration I led shadowed 12% of workflows for three weeks before any user-visible change. The roughest tried big-bang over a weekend because a vendor contract ended Sunday night. We rolled back at 3 AM, not from lack of talent, but from missing embedding parity and untested fallback chains.

Canary selection bias is subtle. If you only canary internal users, you'll promote on misleading metrics. Stratify by tenant size and workflow class. Include at least one "noisy neighbor" tenant in canary if production has them—you'll thank yourself later.

GPU generation planning without application benchmarks is gambling. Tokens per watt only matters through your agent graphs. A 2× hardware improvement eaten by 3× fan-out depth is a net loss.

Decommission legacy pools on calendar dates. Orphan pools are recurring invoices with zero owners.

4.30 Detailed Shadow Traffic Implementation

Shadow traffic should be representative, not merely volumetric. Stratify samples across workflow classes, tenants, and time-of-day buckets. Store shadow outputs in a comparison warehouse partitioned by workflow_id. For each shadowed hop, record legacy_hash, candidate_hash, latency_delta, token_delta, and usd_delta. Nightly jobs flag regressions beyond tolerance bands.

Privacy: mask PII in shadow storage when production payloads include restricted fields. Use tokenized identifiers for join keys. Security teams should approve shadow retention TTLs before enabling production mirroring.

Performance: shadow must not starve legacy. Cap shadow concurrency at 10–15% of legacy pool capacity or route shadow exclusively through dedicated candidate pools that don't share schedulers with production.

4.31 Canary Mathematics

If canary is 5% of workflows and error rate doubles in canary, overall error rate increases by 5% relative—small but customer-visible at scale. Compute minimum detectable effect given traffic volume before choosing canary percentage. Low-traffic classes need higher canary share or longer canary windows.
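
Both calculations fit in a few lines. The first function reproduces the 5% figure. The second uses the standard normal-approximation sample size for comparing two proportions with equal-size arms, which is conservative for the canary arm when the legacy arm is much larger. The baseline error rates are hypothetical.

```python
import math
from statistics import NormalDist


def blended_error_rate(baseline: float, canary_share: float,
                       canary_multiplier: float) -> float:
    """Overall error rate when the canary slice errs at a multiple of baseline."""
    return (1 - canary_share) * baseline + canary_share * baseline * canary_multiplier


def workflows_needed(p_legacy: float, p_canary: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate workflows per arm to detect p_legacy vs p_canary (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_legacy * (1 - p_legacy) + p_canary * (1 - p_canary)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_legacy - p_canary) ** 2)


overall = blended_error_rate(baseline=0.02, canary_share=0.05, canary_multiplier=2.0)
relative_increase = overall / 0.02 - 1      # about 0.05, the 5% from the text
n_doubling = workflows_needed(0.02, 0.04)   # detect a doubling of a 2% error rate
n_subtle = workflows_needed(0.02, 0.03)     # a smaller regression needs far more traffic
```

If a class sees only a few hundred workflows a week, a 5% canary will take a long time to reach these counts, which is the argument for a larger share or a longer window.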

Promotion criteria should be statistical, not vibes. Use sequential testing or Bayesian dashboards to avoid peeking bias where on-call promotes early because charts "look fine."

4.32 Cutover Weekend vs Rolling

Rolling cutover shifts traffic gradually (10% → 30% → 60% → 100%) with hold points. Weekend cutover tries 0→100 quickly. Rolling suits large user bases and agents with long workflows; weekends suit internal-only agents with short workflows. Pick based on workflow duration distribution, not tradition.

4.33 Hardware Generation Cutover Checklist

Benchmark harness green on candidate silicon
Power/cooling validated for target racks
Network bandwidth validated for embedding backfills
Router cost coefficients updated
FinOps rate card published
Shadow parity signed
Canary promotion signed
Rollback weights tested
Hypercare staffed
Legacy decommission date scheduled

4.34 Embedding and Vector Index Cutover

Treat vector stores as migration peers, not afterthoughts. Steps: freeze writes, snapshot index, rebuild on target, recall@k validation, dual-read period, cut read traffic, decommission old index. Skipping dual-read causes subtle retrieval drift that shows up as quality regressions without hard errors.
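The recall@k validation step can be sketched as an overlap check between the two indexes during dual-read. This version treats the legacy index as the reference, which is an assumption; if you have labeled relevance judgments, score against those instead. The function names and the `floor` parameter are illustrative.

```python
def recall_at_k(legacy_ids: list[str], candidate_ids: list[str], k: int) -> float:
    """Fraction of the legacy top-k that the candidate index also returns in its top-k."""
    reference = set(legacy_ids[:k])
    if not reference:
        return 1.0
    return len(reference & set(candidate_ids[:k])) / len(reference)


def dual_read_report(pairs: list[tuple[list[str], list[str]]], k: int, floor: float) -> dict:
    """Summarise a batch of (legacy, candidate) result lists from the dual-read period."""
    scores = [recall_at_k(legacy, candidate, k) for legacy, candidate in pairs]
    mean = sum(scores) / len(scores)
    return {"mean_recall": mean, "worst": min(scores), "pass": mean >= floor}
```

Reporting the worst query alongside the mean helps surface the retrieval drift that averages hide.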

4.35 Organizational Training Before Cutover

Run tabletop exercises: orchestrator owners, SRE, FinOps, product on-call. Walk through rollback script line by line. Untrained humans revert to restarting pods—a placebo for agent factories.

4.36 Vendor Contract Alignment

Align contract renewal with migration windows. Avoid forced migrations during holiday peaks because a vendor contract ends. Negotiate overlap months where both environments are licensed—cheaper than outage.

4.37 Post-Cutover Metrics Review

At day 7 and day 30, compare $/outcome, success rate, p99 latency, and waste buckets vs pre-migration baseline. Publish internally. If metrics aren't better or neutral, open optimization epics before declaring victory.

4.38 When to Abort Migration

Abort triggers: shadow token delta >15% without quality gain; canary success rate drop >2 pts; security policy regression; residency violation. Pre-agree abort authority (role, not committee of ten).
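Encoding the abort triggers as a function keeps the decision mechanical for whoever holds abort authority. The thresholds below are the ones stated above; the `MigrationSnapshot` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationSnapshot:
    shadow_token_delta_pct: float   # candidate vs legacy token usage, percent
    quality_gain: bool              # did evals show a quality improvement?
    canary_success_drop_pts: float  # percentage points lost vs baseline
    security_regression: bool
    residency_violation: bool


def abort_reasons(s: MigrationSnapshot) -> list[str]:
    """Return every abort trigger that currently fires; an empty list means continue."""
    reasons = []
    if s.shadow_token_delta_pct > 15 and not s.quality_gain:
        reasons.append("shadow_token_delta")
    if s.canary_success_drop_pts > 2:
        reasons.append("canary_success_drop")
    if s.security_regression:
        reasons.append("security_regression")
    if s.residency_violation:
        reasons.append("residency_violation")
    return reasons
```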

4.39 Documentation Deliverables

Migration isn't done without updated architecture diagrams, router policy git tags, runbooks, and FinOps rate cards. Auditors and new hires need paper trails.

4.40 Migration Program Office

For enterprises, stand up a lightweight MPO: weekly standup, RAID log, executive dashboard. Migrations fail from coordination gaps more than technical gaps.

4.41 Replay Testing at Scale

Batch-replay last week's traces nightly against candidate factories during migration programs. Automate diff reports; humans triage only regressions beyond thresholds.

4.42 Configuration Drift Detection

Drift between shadow and canary configs (prompt hashes, router version) invalidates comparisons. Config checksums must match except intentional deltas.
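One way to check this is to checksum a canonical serialization of each config and list the keys that differ outside an allow-list of intentional deltas. This is a sketch using only the standard library; the config keys in the usage example are made up for illustration.

```python
import hashlib
import json


def config_checksum(config: dict, ignore: frozenset = frozenset()) -> str:
    """SHA-256 over a canonical JSON dump, skipping keys that are allowed to differ."""
    kept = {k: v for k, v in config.items() if k not in ignore}
    canonical = json.dumps(kept, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_keys(shadow: dict, canary: dict, intentional: frozenset = frozenset()) -> list[str]:
    """Keys whose values differ between the two configs, excluding intentional deltas."""
    keys = (set(shadow) | set(canary)) - intentional
    return sorted(k for k in keys if shadow.get(k) != canary.get(k))


# Hypothetical configs: the pool differs on purpose, the router version does not.
shadow_cfg = {"prompt_hash": "p1", "router_version": "1.4", "pool": "legacy"}
canary_cfg = {"prompt_hash": "p1", "router_version": "1.5", "pool": "candidate"}
unexpected = drifted_keys(shadow_cfg, canary_cfg, intentional=frozenset({"pool"}))
```

Here `unexpected` names `router_version`, which would invalidate the shadow-to-canary comparison until it is reconciled or declared intentional.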

4.43 Migration KPI Dashboard

Single pane: shadow parity %, canary error delta, $/outcome delta, promotion readiness score, days to legacy decommission. Executives consume this, not raw logs.

4.44 Cross-Functional Migration Standups

Daily 15 minutes during canary/cutover weeks. Attendees: platform, SRE, FinOps, product owner, security delegate. Blockers escalated same day.

4.45 Lessons for SaaS Vendors

If you're a vendor migrating customer tenants, migration windows multiply. Stagger tenants by risk tier. Never migrate all tenants Friday 5 PM.

4.46 Hardware Refresh Without Application Migration

Sometimes silicon refresh doesn't require router changes—only pool swaps. Still run harness benchmarks; power and drivers change behavior.

4.47 Migration Debt Tracking

Track deferred migrations (legacy pools, old embeddings). Debt accrues interest as incidents and costs rise. Review debt quarterly in architecture council.

4.48 Closing Migration Mantra

Shadow proves truth, canary proves scale, cutover is boring, rollback is rehearsed, hypercare is staffed, decommission is dated.

4.49 Enterprise Migration Calendar

Publish a 12-month migration calendar visible to all engineering. Blackout windows for retail peaks, tax season, open enrollment. Migrations slot into green windows or don't ship.

Coordinate with procurement for hardware lead times—often longer than software schedules. Hardware on dock before cutover weekend, not 'arriving someday.'

Run executive checkpoint before canary promotion: metrics, risk, rollback owner named. No name, no promotion.

Capture lessons in a migration playbook wiki page per workflow class. Future teams copy patterns instead of reinventing shadow infrastructure.

Remember: migration ends when legacy invoices end. Until then, you're paying twice.

4.50 Migration Toolchain Wishlist

Trace replay, config diff, automated shadow reports, canary promotion bot with guardrails, one-click rollback weights. Invest once, reuse across migrations.

4.51 Parallel Run Economics

Running legacy and factory doubles cost short-term. Finance must expect temporary uplift; document end date or parallel run becomes permanent tax.

4.52 Agent Framework Version Pinning

Migrate agent frameworks and factories together when breaking SDK changes land. Coordinate version pins in monorepo tags.

4.53 Customer Zero Programs

Pilot migrations with friendly internal "customer zero" teams before regulated workflows. Learn empathy for rollback UX.

4.54 Migration Retrospective Template

What we estimated, what happened, what we'll change next migration. Store in wiki tagged #factory-migration.

4.55 Regulatory Sign-off Gates

Regulated workflows need compliance sign-off between shadow and canary. Document signatories in the migration ticket. Skipping this gate delays audits; it does not accelerate delivery.

4.56 Automated Rollback Triggers

Wire metrics to rollback bot: if canary $/outcome or error rate breaches threshold for 15 minutes, revert weights and page owner. Humans sleep; bots guard rails.
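The "breach for 15 minutes" rule can be sketched as a check over recent per-minute samples. The `Sample` type and the limits are assumptions for illustration; a real bot would read from your metrics store and then call whatever reverts router weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    minute: int             # minutes since canary start
    usd_per_outcome: float
    error_rate: float


def should_rollback(samples: list[Sample], usd_limit: float, error_limit: float,
                    sustain_minutes: int = 15) -> bool:
    """True when the most recent samples breach a limit continuously for sustain_minutes."""
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.minute)
    breach_start = None
    # Walk backwards from the newest sample until the first healthy one.
    for s in reversed(ordered):
        if s.usd_per_outcome > usd_limit or s.error_rate > error_limit:
            breach_start = s.minute
        else:
            break
    return breach_start is not None and ordered[-1].minute - breach_start >= sustain_minutes
```

Note that this sketch treats gaps between samples as part of the breach; if your metrics pipeline can drop points, decide explicitly whether missing data should count as healthy or unhealthy.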

Host 60-minute internal tech talk: what we migrated, metrics, surprises. Recording becomes onboarding asset for next wave.

Executive sponsors want green/yellow/red migration status weekly during program. One slide, no jargon. Sponsors unblock procurement and staffing when informed.

4.59 Freeze Windows for Dependencies

If CRM or vector DB migrations align with factory migration, sequence dependencies explicitly. Parallel breaking changes multiply rollback complexity exponentially.

4.60 Migration Success Criteria Sign-Off

Document sign-off owners for shadow parity, canary health, and cutover completion in the migration ticket. Ambiguous ownership causes migrations to stall in "almost done" for quarters.

4.61 Celebrate Decommission

When legacy pools power off, send a short note to all engineering: what improved, what we learned, who to thank. Rituals reinforce that migration programs end—not linger as zombie infrastructure.

4.62 Keep Migration Playbooks Current

Update migration playbooks after every wave. Stale playbooks with wrong CLI flags cause weekend rollbacks that benchmarks never predicted.

4.63 Migration Metrics Archive

Archive shadow and canary metrics for three years if compliance requires. Cold storage is cheaper than re-running migrations because audit asked for proof.

4.64 Final Migration Principle

If shadow metrics aren't boringly green, don't canary. If canary isn't boring, don't cut over. Boring migrations are successful migrations.

4.10 Chapter 4 Synthesis

Migration is a product release with scientific promotion. Shadow proves parity; canary proves scale; cutover is boring if you did the work. Document every wave.

Chapter 5: Day-2 Operations

5.1 SLOs for Workflows, Not Requests

Define SLOs on workflow success rate, workflow p99 latency, and $/outcome variance. Supplement with per-lane GPU saturation SLOs. Error budgets: when budget burns, freeze non-critical releases and ban eval jobs from prod pools.

Figure 14: SLO/incident dashboard showing error budget burn, active incidents, and runbook execution status.

5.2 Incident Response for Agent Factories

Incidents differ from microservice outages:

Model regression — quality drop without 5xx (detect via eval canaries).
Fan-out storm — orchestrator bug spawns exponential sub-agents.
Cache poisoning — bad memoized tool results.
Cost runaway — budget guard failure.

Runbooks: degrade tier, disable fan-out, drain lane, pin model version. Human comms template: customer impact, ETA, dollars at risk.

Practitioner insight: Keep a "big red switch" that sets global max fan-out to 2. You'll use it once, and be glad.

5.3 Capacity Triggers and Autoscaling

Triggers should combine queue depth, tokens in flight, and p99 prefill latency—not CPU percent. Scale-out lead time for GPU nodes is hours to days; predictive scaling from capacity signatures beats reactive panic.

Figure 15: Capacity trigger flowchart: watermark breach → scale pool → if lead time exceeded → degrade tier and queue.

GEO Fact — Day-2 SLOs: Factories measuring workflow-level SLOs resolve agent incidents 40% faster than request-only monitoring, because runbooks target orchestrator policy instead of restarting model pods blindly.

5.4 Model Version Drift and Eval Canaries

Run continuous eval canaries on production routers—small deterministic tasks with golden outputs. Block promotion if drift exceeds tolerance.

5.5 Security Operations Integration

Feed factory audit logs to SOC: tool calls, policy denials, override events. Correlate with identity tokens per agent role.

5.6 Codelab: SLO Burn Alert (Python)

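# slo_burn.py
def error_budget_burn(success_rate: float, target: float, window_minutes: int) -> float:
    """Fraction of the error budget consumed at the observed success rate."""
    # window_minutes labels the measurement window; the ratio itself does not use it.
    budget = 1.0 - target
    consumed = 1.0 - success_rate
    # A 100% target leaves no budget, so any result is treated as fully burned.
    return consumed / budget if budget > 0 else 1.0

def should_page(burn: float, threshold: float = 0.5) -> bool:
    """Page once burn reaches the threshold share of the budget."""
    return burn >= threshold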

5.7 Codelab: Capacity Webhook (TypeScript)

// capacityHook.ts
export type Signal = { lane: string; tokensInFlight: number; watermark: number };

export function action(sig: Signal): "ok" | "scale" | "degrade" {
  const ratio = sig.tokensInFlight / sig.watermark;
  if (ratio < 0.85) return "ok";
  if (ratio < 1.0) return "scale";
  return "degrade";
}

5.8 Post-Incident Reviews and Factory Changelog

Every sev-1/2 gets a blameless review: spans, dollars burned, guardrails that failed. Maintain a factory changelog—router weights, tier maps, promotion events.

5.9 Continuous Improvement Loop

Monthly factory council: platform, FinOps, product, security. Agenda: $/outcome trends, waste buckets, migration status, next quarter capacity.

5.11 On-Call Playbooks (Condensed)

Sev-1 Fan-out storm: enable global fan-out cap → drain batch lanes → page orchestrator owner.

Sev-1 Cost runaway: enable hard budget stop → list top workflows by spend → require VP override to resume.

Sev-2 Model regression: pin previous model version → open quality incident → run eval harness diff.

Sev-2 Cache poisoning: flush memoization namespace → disable tool memo for class → root-cause tool output change.

5.12 SLO Documentation Template

For each workflow class document: objective, measurement window, error budget, alert routes, runbook links, dependencies, and customer-facing comms template. Store in git beside router config—SLOs are code.

5.13 Capacity Planning Calendar

Align with business events: open enrollment, tax season, holiday retail, quarter close. Pre-scale two weeks ahead using signatures; don't wait for dashboards to turn red.

5.14 Green Ops: Tokens per Watt

Sustainability teams increasingly ask about energy. If you can't meter watts per workflow yet, proxy with tokens per watt from benchmark harnesses and publish improvement trends after tiering or silicon migrations.

5.15 Knowledge Transfer

Rotate on-call across platform and product teams quarterly. Agents fail in weird ways; siloed ops teams miss orchestrator bugs.

5.16 Audit and Compliance Logs

Retain span logs per policy (often 90–365 days). Archive to cold storage with tamper-evident buckets. Auditors ask for proof of human oversight on restricted workflows—correlate HITL tickets to workflow_id.

5.17 Continuous Profiling

Re-run workload profiling quarterly. Agent graphs drift as product teams add tools and hops. Capacity signatures go stale like firewall rules.

5.18 Factory Roadmap Linkage

Day-2 metrics should feed the factory roadmap: if waste bucket "over-tiered hops" grows, invest in router ML; if tool wait dominates, invest in integration performance, not GPUs.

5.19 Incident Metrics Beyond MTTR

Track cost of incident (tokens burned during degradation), workflows affected, and escalations to humans. MTTR alone ignores economic damage.

5.20 Game Days

Quarterly game days: inject fan-out bug in staging, cost runaway in staging, model regression in canary. Measure detection time and runbook effectiveness.

5.21 Observability Stack

Minimum: traces (workflow spans), metrics (lanes, tokens in flight), logs (policy denials), dashboards (SLO/error budget), alerts (multi-window burn rates). If you lack traces, you don't have a factory—you have servers.

5.22 Vendor Escalation Paths

When underlying GPU cloud has regional impairment, factory ops needs vendor TAM contacts and comms templates pre-written. Don't draft during outage.

5.23 Toil Reduction

Automate: tier map rollbacks, cache flushes, budget overrides with approval tokens. Manual SSH to restart model pods should be rare.

5.24 Handoff to Continuous Improvement

Close the loop: incidents → corrective actions → router/config PRs → verified in eval harness → documented in factory changelog. Ops without closure is recurring pain.

5.25 Customer Trust and External SLAs

If you sell agent outcomes externally, external SLAs must derive from internal workflow SLOs with margin. Don't promise 99.9% on workflows you haven't measured.

5.26 SLI Catalog Examples

workflow_success_rate = successes / attempts
workflow_latency_p99 = p99 end-to-end seconds
usd_per_outcome_p50 = median cost for successes
lane_saturation = tokens_in_flight / watermark
fanout_depth_p95 = p95 parallel branches per tick

Publish SLIs to product teams; SLOs are negotiated from SLIs.
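The first three SLIs in the catalog can be computed directly from completed workflow records, as in this sketch. The `WorkflowRun` type is an assumption; lane saturation and fan-out depth need lane and orchestrator telemetry and are left out here.

```python
from dataclasses import dataclass
from statistics import median, quantiles


@dataclass(frozen=True)
class WorkflowRun:
    succeeded: bool
    latency_s: float  # end-to-end seconds
    usd: float        # total cost attributed to this workflow


def sli_snapshot(runs: list[WorkflowRun]) -> dict:
    """Compute workflow-level SLIs; needs at least two runs and one success."""
    successes = [r for r in runs if r.succeeded]
    latencies = [r.latency_s for r in runs]
    return {
        "workflow_success_rate": len(successes) / len(runs),
        # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
        "workflow_latency_p99": quantiles(latencies, n=100, method="inclusive")[98],
        "usd_per_outcome_p50": median(r.usd for r in successes),
    }
```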

5.27 Alerting Anti-Patterns

Alerting on average GPU utilization hides fan-out cliffs. Alert on burn rates, queue depth, and fan-out depth. Page humans for sustained SLO budget burn, not single blips.
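Paging on sustained burn rather than single blips is commonly done by requiring both a long and a short window to exceed the same burn-rate threshold. The sketch below assumes that pattern; the function names are illustrative, and the 14.4 default is a frequently cited fast-burn multiplier for a 30-day budget that you should treat as a starting assumption, not a recommendation from this playbook.

```python
def burn_rate(success_rate: float, target: float) -> float:
    """How many times faster than 'exactly on budget' failures are accruing."""
    budget = 1.0 - target
    if budget <= 0:
        raise ValueError("target must be below 1.0")
    return (1.0 - success_rate) / budget


def page_on_sustained_burn(long_window_success: float, short_window_success: float,
                           target: float, threshold: float = 14.4) -> bool:
    """Page only if the long window is burning fast AND the short window confirms it is ongoing."""
    return (burn_rate(long_window_success, target) >= threshold
            and burn_rate(short_window_success, target) >= threshold)
```

The short window is what suppresses pages for incidents that have already recovered; the long window is what suppresses pages for momentary blips.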

5.28 Runbook Quality Bar

Runbooks must be executable by someone who didn't write them. Test quarterly. If runbook requires tribal knowledge, fix the runbook.

5.29 Preparing for the Next Hardware Generation

When new silicon arrives, don't migrate in panic. Benchmark with harness, update tokens/watt tables, adjust capacity signatures, run shadow on one workflow class, expand. Repeatable process beats launch day heroics.

5.30 Closing Operations Philosophy

Day-2 isn't maintenance—it's product development for platform teams. The factory gets better every sprint or it gets more expensive every sprint. There's no steady state in agent land.

5.31 Extended Narrative: Living with Agent Incidents

The first fan-out storm I debugged looked like a DDoS from inside: same workflow class, hundreds of sub-agents, all calling the same degraded CRM. Circuit breakers weren't fashionable yet; we hard-coded a global parallel cap and survived. Today, I'd implement token debt, per-dependency bulkheads, and an executive-visible "big red switch" tested monthly.

Model regressions are insidious—no red HTTP codes, just worse answers and more retries. Eval canaries on production routers catch these within hours if you invest in golden tasks. Skimp on eval, pay in escalations.

Capacity triggers should be rehearsed. If scale-out lead time is 6 hours, autoscaling on threshold breach must start 6 hours before you expect breach—predictive scaling from signatures, not reactive paging at 2 AM.

Close incidents with factory changelog entries. Ops knowledge should be durable, not Slack scrollback.

5.32 SLO Error Budget Policy

Define error budget policies per workflow class. Example: 99.5% monthly success allows 0.5% failures. Burn budget on deployments, model promotions, and infra changes. When budget exhausted, freeze risky changes until budget recovers. This aligns product velocity with reliability.
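As a worked example of the 99.5% policy, the sketch below converts a target into allowed failures and a remaining-budget fraction. The function names and the freeze rule's string labels are illustrative assumptions.

```python
def allowed_failures(target: float, attempts: int) -> int:
    """Failures the class may absorb in the window at the given success target."""
    return int(round((1.0 - target) * attempts))


def budget_remaining(target: float, attempts: int, failures: int) -> float:
    """Fraction of the error budget left; zero or negative means it is exhausted."""
    budget = (1.0 - target) * attempts
    if budget <= 0:
        raise ValueError("target must be below 1.0 and attempts must be positive")
    return 1.0 - failures / budget


def release_policy(remaining: float) -> str:
    """Freeze risky changes once the budget is gone."""
    return "freeze_risky_changes" if remaining <= 0 else "normal"


# 200,000 monthly workflows at a 99.5% target allow 1,000 failures.
monthly_allowance = allowed_failures(0.995, 200_000)
```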

5.33 Incident Severity Rubric for Agents

Sev-1: widespread workflow failure, cost runaway threatening monthly budget, residency breach.

Sev-2: single class degradation, partial fan-out failure, model regression detected by canary.

Sev-3: elevated latency within SLO margin, non-critical tool degradation.

Sev-4: cosmetic dashboard issues.

Attach runbook links per severity in paging tools.

5.34 Capacity Trigger Tuning Guide

Start watermarks conservative (70% tokens in flight), observe false positive rate for two weeks, tune upward until false positives <5%. Document final values in git next to router config.

5.35 Predictive Scaling Inputs

Feed predictive scaler: business calendar events, marketing campaign schedule, historical signatures, weather if retail, tax calendar if finance. Humans override predictions with explicit flags—don't fight automation silently.

5.36 Eval Canary Design

Golden tasks: 50–200 per critical workflow class, updated monthly. Include edge cases discovered in incidents. Run every 15 minutes in production canary lane with alert on score drop >ε.
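A minimal scoring loop for golden tasks might look like the following. Exact-match scoring and the helper names are assumptions; many teams would substitute a rubric or similarity scorer, and ε is whatever tolerance you have agreed per workflow class.

```python
def canary_score(results: dict[str, str], golden: dict[str, str]) -> float:
    """Share of golden tasks whose output matches the expected answer exactly."""
    if not golden:
        raise ValueError("golden set must not be empty")
    hits = sum(1 for task_id, expected in golden.items()
               if results.get(task_id) == expected)
    return hits / len(golden)


def drift_alert(current: float, baseline: float, epsilon: float) -> bool:
    """Alert when the score has dropped by more than epsilon from the baseline."""
    return (baseline - current) > epsilon
```

Missing results count as failures here, so a canary lane that silently stops answering also trips the alert.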

5.37 SOC Integration Details

Export spans including tool_name, policy_decision, tier, usd_estimate. Map to SIEM correlation rules for impossible travel (agent calling tools from wrong region) and privilege anomalies.

5.38 Toil Metrics

Track toil hours per week on factory ops. Goal: downward trend via automation. If toil rises with agent adoption, platform team is underwater—hire or simplify architecture.

5.39 Multi-Region Failover Drills

Fail region B while region A serves traffic—verify residency constraints still hold per tenant. Failover without residency checks is a compliance incident waiting to happen.

5.40 Customer Communication During Incidents

Template external comms: impact scope, workflows affected, ETA, workaround, postmortem promise. Legal reviews template once, not per incident at 3 AM.

5.41 Long-Term Capacity Roadmap

Rolling 12-month GPU/OpEx forecast tied to business growth assumptions and agent roadmap. Update quarterly. Tie to business advisory planning cycles.

5.42 Factory Maturity Assessments

Annual assessment against maturity model (metered → routed → FinOps → resilient → optimized). Publish gap list and investment ask. Executives fund gaps when narrative is crisp.

5.43 Handoff to Platform Product Roadmap

Day-2 findings should create epics: router ML, better compaction, tool gateway caching, etc. Ops data is product discovery.

5.44 Celebrating Reliability Wins

When error budgets recover after optimization, share credit publicly. Reliability culture needs positive reinforcement, not only incident blame.

5.45 Final Operations Checklist (Printable)

[ ] Workflow SLOs published
[ ] Runbooks tested this quarter
[ ] Game day completed
[ ] Eval canaries green
[ ] Capacity signatures updated
[ ] FinOps showback reviewed monthly
[ ] Migration RAID log clear
[ ] Big red switch tested

5.46 On-Call Health Metrics

Track pages per engineer per week, repeat incidents, mean time to mitigate. Unhealthy on-call drives attrition—fix root causes, not heroes.

5.47 Dependency Catalog for Agents

Maintain a catalog of the downstream systems agents call, with owners and SLOs. Incidents often originate in external dependencies; triaging agent failures without dependency context wastes time.

5.48 Progressive Delivery for Router Changes

Use feature flags for policy bundles: 1% → 10% → 50% → 100% with automated rollback on $/outcome regression.

5.49 Waste Elimination Sprints

Quarterly sprint dedicated to top waste bucket from FinOps. Platform + squads pair; success measured in $/outcome delta next month.

5.50 Knowledge Base Hygiene

Runbooks in git with owners and last-tested dates. Stale runbooks worse than none—they breed false confidence.

5.51 Bridging Ops and Research

When research wants new frontier model, ops requires harness results and shadow week before any canary. Research velocity continues within guardrails.

5.52 Closing Day-2 Mantra

Measure workflows, page on burn, automate toil, drill failures, publish changelogs, fund improvements.

5.53 Sustainable On-Call for Agent Factories

Agent incidents are cognitively heavy—ambiguous symptoms, expensive blast radius. Limit on-call shifts to experienced engineers with factory context. Rotate shadow on-call for training without paging juniors alone.

Post-incident, fund fixes before new features. Unfixed factory debt compounds fan-out risk nonlinearly.

Measure customer-visible outcomes during incidents, not just infra green lights. Workflows failing silently hurt trust more than loud 500 errors.

Define escalation paths for sev-1 incidents that exhaust internal runbooks: know when to pull in vendor TAMs and external architects.

Sustainable ops means predictable improvement, not heroic weekends every month.

5.54 Metrics for Platform Team Health

Track: deploy frequency for router, mean time to restore factory SLO, toil hours, incident repeat rate. Healthy team improves these while agent adoption grows.

5.55 Blameless Culture with Accountability

Blameless doesn't mean consequence-free. Repeated policy bypasses get engineering manager attention. Culture supports learning; governance stops repeat negligence.

5.56 External Benchmarking

Compare your $/outcome to anonymized industry peers via advisors. Isolation breeds complacency or panic without context.

5.57 Upgrade Windows

Coordinate model upgrades with low-business-impact windows per signature calendar. Upgrades during peaks are self-inflicted sev-1s.

5.58 Ops Handover to New Hires

Onboard with game day in week two, not slide deck month two. Muscle memory matters for fan-out incidents.

5.59 Pairing SRE with FinOps During Incidents

Cost runaway incidents need joint bridge: SRE stops bleeding, FinOps estimates dollar exposure for executive updates. Siloed bridges waste critical minutes.

5.60 Publishing SLO Reports

Monthly SLO report to product VPs: error budget status, top incidents, planned improvements. Transparency reduces "why is AI slow" hallway questions.

Monthly newsletter to engineering: router changes, tier map updates, price version changes, upcoming migrations. Surprises create resistance; newsletters create partners.

5.62 Runbook for Model Provider Outages

When public API regions fail, router should fail over region or degrade tier with customer comms template ready. Practice provider outage quarterly—it's when factories prove maturity.

5.63 Continuous Learning Budget

Allocate 10% platform capacity to toil reduction and eval improvements. Without budget, Day-2 decays into permanent firefighting.

5.64 Factory Ops Quarterly Goals

Set explicit goals: reduce p99 workflow latency 10%, cut top waste bucket 15%, complete one game day, ship two runbook automations. Goals without numbers are wishes.

5.65 Handoff to Leadership

Escalate structural factory gaps—insufficient GPU contract, missing FinOps headcount, policy gridlock—to leadership with data and proposed investment. Ops teams can't policy-hack around capacity starvation forever.

5.66 Sleep Better

Well-instrumented factories with rehearsed runbooks let on-call engineers sleep. That's the real ROI of Day-2 discipline—not slide aesthetics.

5.67 Ops Metrics in Executive Dashboards

Expose workflow SLO attainment and error budget status to executive dashboards monthly. Visibility prevents "AI is flaky" narratives without data.

5.68 Final Operations Principle

Day-2 excellence is measured in uneventful Tuesdays—not heroic Sundays. Build the factory so Tuesdays stay quiet, budgets stay predictable, and agents stay trustworthy.

5.10 Chapter 5 Synthesis

Day-2 is where factories earn trust in production. Workflow SLOs, incident runbooks, capacity triggers, and eval canaries turn agents from demo to reliable utility at scale.

Key Takeaways & FAQ

Key Takeaways

Measure workflows, not messages: Agent economics are driven by fan-out depth, context growth, and retries—profile before you scale GPUs.
Build a factory, not an endpoint: Separate control plane routing, model tiering, online vs batch lanes, and edge/core boundaries.
FinOps on $/outcome: Tag spans, run showback, graduate to chargeback, and scenario-plan seasonality.
Migrate scientifically: Shadow, canary, cutover—with GPU generation planning that's vendor-neutral and benchmark-backed.
Operate Day-2 with workflow SLOs: Incidents include cost runaway and fan-out storms; capacity triggers use tokens in flight, not CPU alone.

Frequently Asked Questions

What's the difference between an AI factory and a model hosting cluster?

A hosting cluster serves inference requests. A factory adds workflow-aware routing, lane isolation, FinOps tagging, tier policies, migration controls, and Day-2 SLOs for multi-hop agents. Agents need orchestration economics, not just low-latency tokens.

How do I estimate GPU capacity for agent workloads?

Build capacity signatures from traced production or representative DAGs: measure tokens per workflow, fan-out p95, and lane mix. Forecast by business seasonality, not average utilization. Include burst buffers for month-end and batch reconciliation patterns.

What is $/outcome and why not tokens per dollar?

$/outcome divides total cost (inference, tools, human review) by successfully completed workflows. Tokens per dollar ignores retries, over-tiering, and failed jobs—metrics that mislead executives during agent scale-up.
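A short worked example of the definition, with made-up dollar figures. The function name and cost categories mirror the sentence above; spend on failed attempts is assumed to be already included in the totals.

```python
def usd_per_outcome(inference_usd: float, tool_usd: float, review_usd: float,
                    completed_workflows: int) -> float:
    """Total cost divided by successfully completed workflows."""
    if completed_workflows <= 0:
        raise ValueError("need at least one completed workflow")
    return (inference_usd + tool_usd + review_usd) / completed_workflows


# Hypothetical month: $900 inference, $60 tools, $240 human review, 1,000 completions.
cost = usd_per_outcome(900.0, 60.0, 240.0, 1000)  # 1.2 dollars per outcome
```

If retries push inference spend up while completions stay flat, this number rises even when tokens per dollar looks unchanged.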

When should we use batch vs online inference lanes?

Online lanes serve user-facing steps with tight p99 latency. Batch lanes handle offline reconciliation, eval at scale, and non-interactive summarization with throughput-optimized scheduling. Never share pools without hard isolation.

How does model tiering reduce cost without hurting quality?

Route hops by risk and capability: frontier for planning and high-stakes verification, mid for synthesis, small for classification. Use verifier confidence and policy metadata—not blanket cheap models for every hop.

What promotion criteria should gate canary to cutover?

Compare shadow and canary cohorts on p99 latency, $/outcome, success rate, and policy regressions. Promote only when metrics stay within agreed bands (e.g., latency within 10%, cost within 5%). Pre-authorize rollback weights and feature flags.
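The example bands above translate into a simple gate. This is a sketch: the dictionary keys are hypothetical, and requiring the canary success rate to be no lower than baseline is an assumption you may want to relax to a tolerance.

```python
def promotion_ready(baseline: dict, canary: dict,
                    latency_band: float = 0.10, cost_band: float = 0.05) -> bool:
    """True when canary stays inside the agreed bands relative to baseline."""
    return (
        canary["p99_latency_s"] <= baseline["p99_latency_s"] * (1 + latency_band)
        and canary["usd_per_outcome"] <= baseline["usd_per_outcome"] * (1 + cost_band)
        and canary["success_rate"] >= baseline["success_rate"]
        and canary["policy_regressions"] == 0
    )
```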

How do prefix caches interact with multi-agent prompts?

Place static system instructions and schemas at the prompt prefix; append volatile tool outputs at the tail. Lint prompts in CI to prevent dynamic fields from invalidating cache keys across hops.
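A CI lint for this rule can be as simple as rejecting volatile placeholders in the prefix template. The placeholder names below are hypothetical; substitute whatever dynamic fields your prompt templates actually use.

```python
import re

# Hypothetical names of fields that change per request and would break prefix reuse.
VOLATILE = re.compile(r"\{(timestamp|request_id|tool_output|user_input)\}")


def lint_prefix(prefix_template: str) -> list[str]:
    """Return volatile placeholders found in the static prefix; empty means cache-safe."""
    return sorted(set(VOLATILE.findall(prefix_template)))


def build_prompt(static_prefix: str, volatile_tail: str) -> str:
    """Keep stable instructions first and per-request content last."""
    return static_prefix + "\n" + volatile_tail
```

Failing the build when `lint_prefix` returns anything keeps a stray timestamp from silently invalidating cache hits across every hop.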

Should edge inference replace core GPUs?

Edge handles redaction, buffering, and small models near data sources. Core hosts large models and centralized governance. Agents crossing boundaries need signed summaries and residency-aware routers—not raw PII shuttling.

How do Vera Rubin-class and DSX trends affect architecture?

They signal higher tokens-per-watt and rack-level power orchestration becoming first-class constraints. Design abstractions around workload lanes and benchmarks, not SKU loyalty, so hardware generations swap without rewriting orchestration.

What SLOs should we publish to product teams?

Workflow success rate, workflow p99 latency by class, and $/outcome variance bands. Supplement with lane saturation indicators. Avoid promising single-request latency for multi-hop agents.

How do we prevent fan-out storms?

Enforce per-workflow parallel caps in the orchestrator, token debt budgets, and a global emergency degrade switch. Alert on abnormal branching depth before GPUs saturate.

Where should human review sit in FinOps models?

Include human review minutes in $/outcome when agents escalate. Over-automation without review gates can lower token cost while increasing operational risk and rework expense.

Author Bio

Vatsal Shah is the Principal AI Architect at Agile Tech Guru. He designs AI factories, agent inference platforms, and FinOps showback systems for regulated enterprises. His work spans GPU migration programs, workflow-level observability, and factory Day-2 operations that keep agent fleets within SLO and budget.

Ready to benchmark your factory?

AI factory TCO review — We'll profile your agent DAGs, model tier mix, and $/outcome baselines, then deliver a migration and FinOps roadmap aligned to your capacity horizon.

AI Factory & Agentic Inference Playbook — Architecture, FinOps, and Migration for Token-Heavy Workloads

Table of Contents

Introduction: From Chat Endpoints to Workflow Factories

Who This Playbook Is For

How to Use This Document

Chapter 1: Workload Science for Agents

1.1 Why Chat Metrics Lie About Agent Load

1.2 Fan-Out, Backpressure, and Queue Discipline

1.3 Caching: Prefix, KV, and Tool Result Memoization

1.4 Long-Context Economics and Compaction

1.5 Batch vs Online: Two Factories in One

1.6 Workload Profiling Lab: Instrumentation Schema

1.7 Codelab: Workload Profiler Emitter (Python)

1.8 Codelab: Fan-Out Limiter (TypeScript)

1.9 Capacity Signatures and Seasonality

1.11 Deep Dive: Prefill vs Decode in Multi-Hop Agents

1.12 Tool Latency and the Hidden Queue

1.13 Determinism, Temperature, and Cost Variance

1.14 Regulatory and Residency Impacts on Workload Shape

1.15 Observability Anti-Patterns

1.16 Workshop: Building a Workload Profile in One Sprint

1.17 Case Study: Month-End Fan-Out Cliff

1.18 KV Cache Fragmentation

1.19 Token Budgeting at Orchestration Design Time

1.20 Interplay with Agentic SDLC

1.21 Extended Comparison: Batch Scheduling Policies

1.22 When to Reject Work

1.23 The Physics of Concurrent Agent Graphs

1.24 Context Windows as a Budget, Not a Feature

1.25 Cross-Functional Review Cadence

1.26 Long-Horizon Trends

1.27 Extended Narrative: Profiling in Practice

1.10 Chapter 1 Synthesis

Chapter 2: Reference Architectures

2.1 Factory Layers: Control Plane vs Data Plane

2.2 Routing Layer: Policy, Not Just Load Balancing

2.3 Model Tiering Matrix

2.4 Edge vs Core: Where Inference Should Run

2.5 Embedding and Retrieval Tier

2.6 Multi-Region and Residency

2.7 Codelab: Routing Policy Engine (Python)

2.8 Codelab: Factory Client SDK (TypeScript)

2.9 Failure Domains and Blast Radius

2.11 Control Plane APIs Product Teams Actually Use

2.12 High Availability Without Sticky Sessions

2.13 Heterogeneous Pools: CPU, GPU, NPU

2.14 Network Egress and Egress Cost

2.15 Vera Rubin / DSX as Planning Signals (Not Ads)

2.16 Factory Maturity Model

2.17 Partner Integration: Orchestrators and MCP

2.18 Disaster Recovery

2.19 Reference Deployment Topologies

2.20 API Gateway vs Service Mesh

2.21 Model Registry Fields

2.22 Routing Experiments (A/B)

2.23 Private vs Public Model Paths

2.24 Cost of Complexity

2.25 Security Architecture Overlays

2.26 Interoperability Standards

2.27 Scaling the Control Plane

2.28 Architectural Review Checklist

2.29 Extended Narrative: Designing the Router

2.30 Service Level Objectives for Routing

2.31 Data Planes for Embeddings vs Generation

2.32 Quota and Throttle Design

2.33 Testing Reference Architectures

2.34 Documentation for Product Teams

2.35 Platform Boundaries

2.36 Future-Proofing Routing Schema

2.37 Reference Architecture Variants for Regulated Industries

2.38 Cost-Aware Routing Simulation

2.39 Closing Architecture Principles

2.40 Platform Engineering Operating Model

2.41 Reference Architecture Review Questions

2.42 Building vs Buying Control Planes

2.43 Technical Debt in Routing Rules

2.44 Multi-Tenant Noisy Neighbor Controls

2.45 Architecture Documentation Set

2.10 Chapter 2 Synthesis