Cost-Aware Model Tiering

Guild auto-scores each lane and routes it to the cheap, mid, or powerful tier. You pay only for the compute the task actually needs, using deterministic signals, per-lane overrides, and zero-config defaults.

Every specialist lane runs on the cheapest model that can handle the work — file reads and summarization on haiku, drafting and reasoning on sonnet, architecture and security review on opus. Guild scores each lane automatically before dispatch and prints the resolved tier so you can see exactly what ran and why.

Figure: three-tier model diagram showing cheap (haiku) for I/O and classification, mid (sonnet) for drafting and reasoning, and powerful (opus) for architecture and security. Guild routes each lane to the right tier, so you pay opus rates only when the stakes justify it.

The three tiers

| Tier | Default model | Typical work |
| --- | --- | --- |
| cheap | haiku | File read, tokenize, chunk, summarize, classify, tag (pure I/O, template-guided, low ambiguity) |
| mid | sonnet | Draft, reason, plan subtasks, single-doc and cross-file relationship extraction (default task-agent tier) |
| powerful | opus | Architecture decisions, security review, graph schema/topology, advisor/critic passes (high-stakes, low frequency) |

The tier-to-model map is host-agnostic: it lives in settings.json under models.tiers as { cheap, mid, powerful } → { claude, codex, gemini }. Codex and Gemini slots are null by default; adding a new host is a config edit plus an adapter.
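As a rough illustration of that shape, the sketch below models the tier map as a TypeScript type and looks up one slot. Only the key names (cheap, mid, powerful, claude, codex, gemini) come from the text above; the type names, the `resolveModel` helper, and the use of the short model names from the table as values are assumptions made for the example, and real settings may use full model identifiers.

```typescript
// Hypothetical types. Only the key names come from the documentation above.
type Tier = "cheap" | "mid" | "powerful";
type Host = "claude" | "codex" | "gemini";
type TierMap = Record<Tier, Record<Host, string | null>>;

// Assumed default map: claude values follow the tier table,
// codex and gemini slots are null by default.
const defaultTiers: TierMap = {
  cheap: { claude: "haiku", codex: null, gemini: null },
  mid: { claude: "sonnet", codex: null, gemini: null },
  powerful: { claude: "opus", codex: null, gemini: null },
};

// Returns the configured model for a tier on a host,
// or null when that host slot has not been filled in.
function resolveModel(tiers: TierMap, tier: Tier, host: Host): string | null {
  return tiers[tier][host];
}
```

A null result is a natural place for a caller to report that the host has no model configured for that tier, rather than silently falling back.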

Auto-scoring a lane

For each lane, the orchestrator computes a complexity score from deterministic signals — no LLM call, no guesswork:

| Signal | Score contribution |
| --- | --- |
| workType verb: read/summarize | 0 |
| workType verb: draft/extract | +1 |
| workType verb: architect/review/schema | +2 |
| Declared blast-radius or file count: moderate | +1 |
| Declared blast-radius or file count: high | +2 |
| Upstream depends-on: contract present | +1 |
| Security/correctness sensitivity flag | +1 |
| Prior-attempt escalation on this lane | +1 (sticky for the run) |

Score bands map to tiers:

| Score | Tier |
| --- | --- |
| 0 | cheap |
| 1–2 | mid |
| ≥ 3 | powerful |

Score and resolved tier are printed at dispatch. Signal weights are tunable via models.scoreWeights in settings.json.
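The scoring rules above can be sketched as a pure function. This is a minimal illustration, not Guild's implementation: the `LaneSignals` field names are hypothetical, a "low" blast radius is assumed to contribute 0, and unknown verbs are assumed to score +1. The weights are the defaults from the signal table.

```typescript
type Tier = "cheap" | "mid" | "powerful";
type Level = "low" | "moderate" | "high";

// Hypothetical shape for the deterministic signals of one lane.
interface LaneSignals {
  workType: string;             // verb, e.g. "read", "draft", "architect"
  blastRadius: Level;           // declared blast-radius or file count
  hasUpstreamContract: boolean; // upstream depends-on contract present
  sensitive: boolean;           // security/correctness sensitivity flag
  priorEscalation: boolean;     // prior-attempt escalation, sticky for the run
}

// Default weights taken from the signal table.
const VERB_SCORE: Record<string, number> = {
  read: 0, summarize: 0,
  draft: 1, extract: 1,
  architect: 2, review: 2, schema: 2,
};
const BLAST_SCORE: Record<Level, number> = { low: 0, moderate: 1, high: 2 };

function scoreLane(s: LaneSignals): number {
  const verb = VERB_SCORE[s.workType] ?? 1; // assumption: unknown verbs score +1
  return (
    verb +
    BLAST_SCORE[s.blastRadius] +
    (s.hasUpstreamContract ? 1 : 0) +
    (s.sensitive ? 1 : 0) +
    (s.priorEscalation ? 1 : 0)
  );
}

// Score bands: 0 is cheap, 1 to 2 is mid, 3 or more is powerful.
function tierForScore(score: number): Tier {
  if (score >= 3) return "powerful";
  if (score >= 1) return "mid";
  return "cheap";
}
```

For example, a draft lane with a moderate blast radius scores 1 + 1 = 2 and lands on mid; adding the sensitivity flag pushes it to 3 and therefore to powerful.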

Precedence

--model-tier=<tier>   CLI flag           (top — run-level escape hatch)
  > model_tier: <tier> in plan lane      (per-lane override in .guild/plan/*.md)
    > settings.json models:              (repo config)
      > built-in default                 (cheap-biased tier-map)

Use --model-tier only as a one-off override. Permanent adjustments belong in settings.json or the plan lane. See the configuration reference for all models.* keys.
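The precedence ladder amounts to taking the first source that is set. The sketch below assumes each source has already been reduced to an optional tier; the `TierSources` shape and the `resolveTier` name are hypothetical, and `settings` stands in for whatever tier the repo configuration yields.

```typescript
type Tier = "cheap" | "mid" | "powerful";

// Hypothetical container: one optional tier per precedence level.
interface TierSources {
  cliFlag?: Tier;  // --model-tier=<tier>
  planLane?: Tier; // model_tier: <tier> in the plan lane
  settings?: Tier; // tier derived from settings.json models
  builtin: Tier;   // built-in default, always present
}

// Walks the ladder top to bottom and reports which source won,
// so the resolved tier can be printed alongside its origin.
function resolveTier(src: TierSources): { tier: Tier; source: string } {
  if (src.cliFlag !== undefined) return { tier: src.cliFlag, source: "cli flag" };
  if (src.planLane !== undefined) return { tier: src.planLane, source: "plan lane" };
  if (src.settings !== undefined) return { tier: src.settings, source: "settings.json" };
  return { tier: src.builtin, source: "built-in default" };
}
```

Returning the winning source with the tier makes an accidental run-level override easy to spot in dispatch output.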

Advisor escalation

When a cheap or mid specialist hits a sub-question above its tier — something ambiguous enough that guessing would be wrong — it gets one powerful sub-answer for that specific question. The original specialist continues with the answer folded in. No wholesale re-run on the expensive model.

Three triggers for advisor escalation:

  1. Explicit signal — the specialist emits status: "escalate" plus an escalate_reason in its guild.handoff.v2 envelope.
  2. Uncertainty markers — the orchestrator detects uncertainty phrases in the output (e.g., “I’m not sure”, “unclear”, “cannot determine”) matching the models.escalationMarkers list.
  3. O-3 short-output heuristic — output token count falls below the per-(task_type, tier) floor stored in models.shortOutputThreshold. Silent until the bucket has ≥30 calibration samples.
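The three triggers can be checked in sequence, as in the sketch below. The `LaneOutput` and `EscalationConfig` shapes are hypothetical, the check order (explicit signal first) is an assumption, and so is the case-insensitive substring match for markers. Only `status`, `escalate_reason`, `escalationMarkers`, and `shortOutputThreshold` are names from the text.

```typescript
type Trigger = "explicit" | "uncertainty-marker" | "short-output";

// task_type -> tier -> output-token floor; missing buckets stay dormant.
type ThresholdMap = {
  [taskType: string]: { [tier: string]: number | undefined } | undefined;
};

// Hypothetical view of what the orchestrator knows about a finished lane.
interface LaneOutput {
  status: string;           // from the guild.handoff.v2 envelope
  escalate_reason?: string; // set when status is "escalate"
  text: string;             // specialist output
  outputTokens: number;
  task_type: string;
  tier: string;
}

interface EscalationConfig {
  escalationMarkers: string[];
  shortOutputThreshold: ThresholdMap;
}

function detectTrigger(out: LaneOutput, cfg: EscalationConfig): Trigger | null {
  // 1. Explicit signal from the specialist.
  if (out.status === "escalate") return "explicit";

  // 2. Uncertainty phrases (assumed: case-insensitive substring match).
  const lower = out.text.toLowerCase();
  if (cfg.escalationMarkers.some((m) => lower.includes(m.toLowerCase()))) {
    return "uncertainty-marker";
  }

  // 3. Short-output floor; undefined means the bucket is not calibrated.
  const floor = cfg.shortOutputThreshold[out.task_type]?.[out.tier];
  if (floor !== undefined && out.outputTokens < floor) return "short-output";

  return null;
}
```

With an empty `shortOutputThreshold`, the third check never fires, which matches the heuristic staying silent until a floor exists for the bucket.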

Advisor protocol:

  • The advisor receives only the draft + the escalated sub-question + a compact critique instruction (~50 tokens).
  • The advisor never sees raw file context — this keeps the expensive call cheap.
  • Advisor consults are capped per lane at models.advisorRounds (default 2).
  • Exhausting the round cap records inconclusive: advisor budget exhausted rather than silently escalating cost.
  • The escalation trail (trigger, sub-question, advisor tier, result, round count) is written to .guild/runs/<run-id>/.
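The round cap in the protocol above can be sketched as a bounded loop. This is illustrative only: the `Advisor` callback, the request field names, the instruction wording, and the idea that a round ends when the advisor returns null are all assumptions. The default of 2 rounds and the `inconclusive: advisor budget exhausted` record come from the text.

```typescript
// The advisor sees only these three fields, never raw file context.
interface AdvisorRequest {
  draft: string;
  subQuestion: string;
  instruction: string;
}

// Hypothetical advisor call: returns an answer, or null if unresolved.
type Advisor = (req: AdvisorRequest) => string | null;

interface ConsultResult {
  answer: string | null;
  rounds: number; // rounds actually used
  note?: string;  // set when the budget is exhausted
}

// Hypothetical wording for the compact critique instruction.
const CRITIQUE_INSTRUCTION = "Critique the draft and answer only the sub-question.";

function consultAdvisor(
  draft: string,
  subQuestion: string,
  advisor: Advisor,
  advisorRounds = 2, // models.advisorRounds default
): ConsultResult {
  for (let round = 1; round <= advisorRounds; round++) {
    const answer = advisor({ draft, subQuestion, instruction: CRITIQUE_INSTRUCTION });
    if (answer !== null) return { answer, rounds: round };
  }
  // Cap reached: record the outcome instead of escalating cost further.
  return {
    answer: null,
    rounds: advisorRounds,
    note: "inconclusive: advisor budget exhausted",
  };
}
```

The returned round count and note are the kind of fields that would feed the escalation trail written for the run.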

O-3 short-output threshold — calibration

models.shortOutputThreshold maps task_type → tier → output-token floor. When a lane’s output token count falls below the floor for its (task_type, tier) bucket, the orchestrator fires advisor escalation.

The key is empty by default. O-3 is dormant until you calibrate it. Nothing auto-writes this key.

To calibrate:

  1. Accumulate ≥30 run samples for the (task_type, tier) buckets you want to tune (normal runs create samples automatically).
  2. Run the analyzer:
    npx tsx benchmark/src/calibrate-o3-cli.ts
  3. The CLI prints a proposed models.shortOutputThreshold JSON fragment (p10 output-token baseline per bucket). Review the proposal, then land it in .guild/settings.json yourself — nothing is auto-written.

Example proposal output:

// proposed — review before landing in .guild/settings.json
"shortOutputThreshold": {
  "draft": { "cheap": 40, "mid": 120 },
  "extract": { "mid": 80 }
}
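To show how a proposal like this might be derived, the sketch below computes a p10 floor per (task_type, tier) bucket and skips buckets with fewer than 30 samples. It is not the analyzer's code: the `Samples` shape, the function names, and the nearest-rank percentile method are assumptions, and the real CLI may interpolate or round differently.

```typescript
const MIN_SAMPLES = 30; // buckets below this are left uncalibrated

// task_type -> tier -> observed output-token counts (hypothetical shape).
type Samples = Record<string, Record<string, number[]>>;
type Thresholds = Record<string, Record<string, number>>;

// Nearest-rank percentile. Assumed method; values must be non-empty.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Builds a proposed shortOutputThreshold fragment from run samples.
function proposeThresholds(samples: Samples): Thresholds {
  const proposal: Thresholds = {};
  for (const [taskType, tiers] of Object.entries(samples)) {
    for (const [tier, counts] of Object.entries(tiers)) {
      if (counts.length < MIN_SAMPLES) continue; // not enough data yet
      if (!proposal[taskType]) proposal[taskType] = {};
      proposal[taskType][tier] = percentile(counts, 10);
    }
  }
  return proposal;
}
```

The function only returns a proposal. Writing it into settings stays a manual step, in line with the review-then-land workflow described above.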

The §task§agent ephemeral lifecycle

One agent per task. Dismissed on completion. Never shared across tasks.

  1. Spawn — a new agent at the resolved tier with task-scoped context pulled from the wiki (recall-before-read; 6k hard cap — see Context Assembly).
  2. Work — the agent executes, escalating via advisor protocol if it hits something above its tier.
  3. Extract — on completion, the agent extracts learnings into its guild.handoff.v2 envelope (learnings[]). The orchestrator lands these in .guild/runs/<run-id>/ as candidates for gated reflection.
  4. Dismiss — the agent terminates. No idle agents persist. The next task spawns a fresh agent.

Two concurrent tasks get two distinct agents — never shared. This lifecycle is orthogonal to the D5 agent_mode dispatch ladder (see Architecture & Lifecycle): D5 picks the backend, the §task§agent lifecycle fixes the per-task lifecycle on whichever backend D5 selects.
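The four steps above can be sketched as a single function that owns the agent from spawn to dismissal. Everything here is a hypothetical stand-in (the `Handoff` shape beyond `learnings`, the agent ID scheme, the `activeAgents` registry, and the placeholder context string); it only illustrates the one-agent-per-task rule, not the real dispatch code.

```typescript
// Minimal stand-in for the part of the handoff envelope used here.
interface Handoff {
  taskId: string;
  agentId: string;
  learnings: string[];
}

// Hypothetical registry of live agents, used to show that none persist.
const activeAgents = new Set<string>();
let agentCounter = 0;

function runTask(taskId: string, work: (context: string) => string[]): Handoff {
  // Spawn: a fresh agent for this task only.
  const agentId = `agent-${++agentCounter}`;
  activeAgents.add(agentId);
  try {
    // Work, then Extract: the work callback returns the learnings.
    const context = `task-scoped context for ${taskId}`; // placeholder
    const learnings = work(context);
    return { taskId, agentId, learnings };
  } finally {
    // Dismiss: runs on success and on failure, so no idle agent remains.
    activeAgents.delete(agentId);
  }
}
```

Putting the dismissal in `finally` is a design choice for the sketch: a task that throws still releases its agent, which keeps the "no idle agents persist" rule true on error paths.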

See also

  • Configuration reference — all models.* config keys including models.tiers, models.scoreWeights, models.advisorRounds, models.escalationMarkers, models.shortOutputThreshold.
  • Context Assembly — the recall-before-read implementation + the two recall paths (SQLite FTS5 / guild-memory MCP BM25).
  • Architecture & Lifecycle — the D5 dispatch ladder and execution backends.