Cost-Aware Model Tiering
Every specialist lane runs on the cheapest model that can handle the work — file reads and summarization on haiku, drafting and reasoning on sonnet, architecture and security review on opus. Guild scores each lane automatically before dispatch and prints the resolved tier so you can see exactly what ran and why.
The three tiers
| Tier | Default model | Typical work |
|---|---|---|
| cheap | haiku | File read, tokenize, chunk, summarize, classify, tag; pure I/O, template-guided, low ambiguity |
| mid | sonnet | Draft, reason, plan subtasks, single-doc + cross-file relationship extraction; default task-agent tier |
| powerful | opus | Architecture decisions, security review, graph schema/topology, advisor/critic passes; high-stakes, low frequency |
The tier-to-model map is host-agnostic: it lives in settings.json under models.tiers as { cheap, mid, powerful } → { claude, codex, gemini }. Codex and Gemini slots are null by default; adding a new host is a config edit plus an adapter.
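The shape of the tier map described above can be sketched in TypeScript. This is an illustration under assumptions, not Guild's actual types: only the key names (cheap, mid, powerful, claude, codex, gemini) and the null defaults come from the text, while `TierMap`, `defaultTiers`, and `resolveModel` are hypothetical names.

```typescript
// Hypothetical types; only the key names come from the documentation.
type Tier = "cheap" | "mid" | "powerful";
type Host = "claude" | "codex" | "gemini";
type TierMap = Record<Tier, Record<Host, string | null>>;

// Assumed defaults: Claude slots use the model names from the tier table,
// Codex and Gemini slots stay null until a host is configured.
const defaultTiers: TierMap = {
  cheap: { claude: "haiku", codex: null, gemini: null },
  mid: { claude: "sonnet", codex: null, gemini: null },
  powerful: { claude: "opus", codex: null, gemini: null },
};

// Look up the model for a tier on a given host.
// A null slot means the host has not been configured for that tier.
function resolveModel(tiers: TierMap, tier: Tier, host: Host): string {
  const model = tiers[tier][host];
  if (model === null) {
    throw new Error(`No model configured for host "${host}" at tier "${tier}"`);
  }
  return model;
}
```

With these assumed defaults, looking up the mid tier on the claude host yields sonnet, and any lookup on codex or gemini fails until the corresponding slot is filled in.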
Auto-scoring a lane
For each lane, the orchestrator computes a complexity score from deterministic signals — no LLM call, no guesswork:
| Signal | Score contribution |
|---|---|
| workType verb: read/summarize | 0 |
| workType verb: draft/extract | +1 |
| workType verb: architect/review/schema | +2 |
| Declared blast-radius or file count: moderate | +1 |
| Declared blast-radius or file count: high | +2 |
| Upstream depends-on: contract present | +1 |
| Security/correctness sensitivity flag | +1 |
| Prior-attempt escalation on this lane | +1 (sticky for the run) |
Score bands map to tiers:
| Score | Tier |
|---|---|
| 0 | cheap |
| 1–2 | mid |
| ≥ 3 | powerful |
Score and resolved tier are printed at dispatch. Signal weights are tunable via models.scoreWeights in settings.json.
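The scoring tables above can be read as a small pure function. The following is a minimal sketch assuming the default weights from the signal table; the interface and field names (`LaneSignals`, `blastRadius`, `hasUpstreamContract`, and so on) are hypothetical, as is the "low" value used for the zero-score blast-radius case. A real implementation would read weights from models.scoreWeights rather than hard-coding them.

```typescript
type Tier = "cheap" | "mid" | "powerful";
type WorkVerb =
  | "read" | "summarize"
  | "draft" | "extract"
  | "architect" | "review" | "schema";
// "low" is an assumed name for a blast radius that contributes nothing.
type BlastRadius = "low" | "moderate" | "high";

// Hypothetical lane descriptor; field names are illustrative only.
interface LaneSignals {
  workType: WorkVerb;
  blastRadius: BlastRadius;
  hasUpstreamContract: boolean;
  sensitive: boolean;        // security/correctness sensitivity flag
  priorEscalation: boolean;  // sticky for the run once set
}

// Default weights copied from the signal table.
const VERB_SCORE: Record<WorkVerb, number> = {
  read: 0, summarize: 0,
  draft: 1, extract: 1,
  architect: 2, review: 2, schema: 2,
};
const BLAST_SCORE: Record<BlastRadius, number> = { low: 0, moderate: 1, high: 2 };

// Sum the deterministic signals; no model call is involved.
function scoreLane(s: LaneSignals): number {
  return (
    VERB_SCORE[s.workType] +
    BLAST_SCORE[s.blastRadius] +
    (s.hasUpstreamContract ? 1 : 0) +
    (s.sensitive ? 1 : 0) +
    (s.priorEscalation ? 1 : 0)
  );
}

// Map a score to a tier using the bands 0, 1-2, and >= 3.
function tierForScore(score: number): Tier {
  if (score >= 3) return "powerful";
  if (score >= 1) return "mid";
  return "cheap";
}
```

For example, a draft lane (+1) with a moderate blast radius (+1) and the sensitivity flag set (+1) scores 3 and resolves to powerful, while a plain read lane with no flags scores 0 and stays on cheap.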
Precedence
--model-tier=<tier> CLI flag (top — run-level escape hatch)
> model_tier: <tier> in plan lane (per-lane override in .guild/plan/*.md)
> settings.json models: (repo config)
> built-in default (cheap-biased tier-map)
Use --model-tier only as a one-off override. Permanent adjustments belong in settings.json or the plan lane. See the configuration reference for all models.* keys.
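The precedence chain above amounts to "first defined source wins". A minimal sketch, assuming each level can be reduced to an optional tier value; the `TierSources` shape and the `resolveTier` function are hypothetical and do not describe Guild's internals.

```typescript
type Tier = "cheap" | "mid" | "powerful";

// Hypothetical view of the four precedence levels, highest first.
interface TierSources {
  cliFlag?: Tier;        // --model-tier=<tier>
  planLane?: Tier;       // model_tier: <tier> in the plan lane
  settings?: Tier;       // tier derived from settings.json models
  builtInDefault: Tier;  // built-in default, always present
}

interface ResolvedTier {
  tier: Tier;
  source: "cli" | "plan" | "settings" | "built-in";
}

// Walk the levels from highest to lowest precedence and stop at the first one set.
function resolveTier(src: TierSources): ResolvedTier {
  if (src.cliFlag !== undefined) return { tier: src.cliFlag, source: "cli" };
  if (src.planLane !== undefined) return { tier: src.planLane, source: "plan" };
  if (src.settings !== undefined) return { tier: src.settings, source: "settings" };
  return { tier: src.builtInDefault, source: "built-in" };
}
```

Returning the winning source alongside the tier mirrors the idea of printing what ran and why: a run-level flag visibly overrides a per-lane setting instead of silently replacing it.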
Advisor escalation
When a cheap or mid specialist hits a sub-question above its tier — something ambiguous enough that guessing would be wrong — it gets one powerful sub-answer for that specific question. The original specialist continues with the answer folded in. No wholesale re-run on the expensive model.
Three triggers for advisor escalation:
- Explicit signal: the specialist emits status: "escalate" plus an escalate_reason in its guild.handoff.v2 envelope.
- Uncertainty markers: the orchestrator detects uncertainty phrases in the output (e.g., “I’m not sure”, “unclear”, “cannot determine”) matching the models.escalationMarkers list.
- O-3 short-output heuristic: output token count falls below the per-(task_type, tier) floor stored in models.shortOutputThreshold. Silent until the bucket has ≥30 calibration samples.
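The three triggers above can be sketched as a single check over a lane's result. This is an illustration under assumptions: the `LaneResult` and `Bucket` shapes, the evaluation order, and the case-insensitive substring match are all hypothetical. Only the status value, the escalate_reason field, the marker list, and the 30-sample minimum come from the text.

```typescript
// Hypothetical view of a lane result; only status and escalate_reason are documented.
interface LaneResult {
  status: string;
  escalate_reason?: string;
  output: string;        // assumed: the specialist's text output
  outputTokens: number;  // assumed: output token count
}

// Hypothetical calibration bucket for one (task_type, tier) pair.
interface Bucket {
  floor: number;    // output-token floor from models.shortOutputThreshold
  samples: number;  // calibration samples collected so far
}

type Trigger = "explicit" | "uncertainty-marker" | "short-output";

const MIN_CALIBRATION_SAMPLES = 30;

// Return the first trigger that fires, or null when no escalation is needed.
function detectTrigger(
  result: LaneResult,
  markers: string[],
  bucket?: Bucket,
): Trigger | null {
  if (result.status === "escalate") return "explicit";

  // Assumed matching rule: case-insensitive substring search.
  const text = result.output.toLowerCase();
  if (markers.some((m) => text.includes(m.toLowerCase()))) {
    return "uncertainty-marker";
  }

  // The short-output check stays silent until the bucket is calibrated.
  if (
    bucket !== undefined &&
    bucket.samples >= MIN_CALIBRATION_SAMPLES &&
    result.outputTokens < bucket.floor
  ) {
    return "short-output";
  }
  return null;
}
```

Passing no bucket, or a bucket with fewer than 30 samples, disables the third trigger entirely, which matches the dormant-by-default behaviour of the short-output heuristic.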
Advisor protocol:
- The advisor receives only the draft + the escalated sub-question + a compact critique instruction (~50 tokens).
- The advisor never sees raw file context — this keeps the expensive call cheap.
- Advisor consults are capped per lane at models.advisorRounds (default 2).
- Exhausting the round cap records inconclusive: advisor budget exhausted rather than silently escalating cost.
- The escalation trail (trigger, sub-question, advisor tier, result, round count) is written to .guild/runs/<run-id>/.
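The round cap in the protocol above can be sketched as a bounded loop. This is a simplified, synchronous illustration: the `Advisor` callback, the `resolved` flag, and the record shape are hypothetical, and the critique instruction text is a placeholder. The default of 2 rounds and the inconclusive result string come from the text.

```typescript
// What the advisor is allowed to see: draft, sub-question, short instruction.
// Raw file context is deliberately absent from this shape.
interface AdvisorRequest {
  draft: string;
  subQuestion: string;
  critiqueInstruction: string;
}

// Hypothetical reply shape; "resolved" marks a usable sub-answer.
interface AdvisorReply {
  answer: string;
  resolved: boolean;
}

type Advisor = (req: AdvisorRequest) => AdvisorReply;

// Hypothetical subset of the escalation trail.
interface EscalationRecord {
  subQuestion: string;
  rounds: number;
  result: string;
}

// Placeholder text; the real instruction is described only as roughly 50 tokens.
const CRITIQUE_INSTRUCTION = "Answer the sub-question concisely; critique the draft only where relevant.";

// Consult the advisor at most advisorRounds times, then give up explicitly.
function consultAdvisor(
  advisor: Advisor,
  draft: string,
  subQuestion: string,
  advisorRounds = 2,
): EscalationRecord {
  for (let round = 1; round <= advisorRounds; round++) {
    const reply = advisor({ draft, subQuestion, critiqueInstruction: CRITIQUE_INSTRUCTION });
    if (reply.resolved) {
      return { subQuestion, rounds: round, result: reply.answer };
    }
  }
  return { subQuestion, rounds: advisorRounds, result: "inconclusive: advisor budget exhausted" };
}
```

Because the loop bound is fixed before the first call, the worst-case cost of an escalation is known up front: at most advisorRounds calls on the powerful tier per lane.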
O-3 short-output threshold — calibration
models.shortOutputThreshold maps task_type → tier → output-token floor. When a lane’s output token count falls below the floor for its (task_type, tier) bucket, the orchestrator fires advisor escalation.
The key is empty by default. O-3 is dormant until you calibrate it. Nothing auto-writes this key.
To calibrate:
- Accumulate ≥30 run samples for the (task_type, tier) buckets you want to tune (normal runs create samples automatically).
- Run the analyzer: npx tsx benchmark/src/calibrate-o3-cli.ts
- The CLI prints a proposed models.shortOutputThreshold JSON fragment (p10 output-token baseline per bucket). Review the proposal, then land it in .guild/settings.json yourself; nothing is auto-written.
Example proposal output:
// proposed — review before landing in .guild/settings.json
"shortOutputThreshold": {
"draft": { "cheap": 40, "mid": 120 },
"extract": { "mid": 80 }
}
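To make the "p10 baseline per bucket" idea concrete, here is a rough sketch of how such a proposal could be derived from run samples. It does not describe the actual calibrate-o3-cli.ts implementation: the `RunSample` shape, the function names, and the nearest-rank percentile method are assumptions.

```typescript
// Hypothetical sample record; field names are illustrative only.
interface RunSample {
  taskType: string;
  tier: string;
  outputTokens: number;
}

// 10th percentile by the nearest-rank method (an assumed choice of method).
function p10(values: number[]): number {
  if (values.length === 0) throw new Error("p10 of an empty bucket is undefined");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(sorted.length / 10)); // 1-based rank
  return sorted[rank - 1] as number;
}

// Group samples by (task_type, tier) and propose a floor for each bucket
// that has at least minSamples samples; smaller buckets are left out.
function proposeThresholds(
  samples: RunSample[],
  minSamples = 30,
): Record<string, Record<string, number>> {
  const grouped: Record<string, Record<string, number[]>> = {};
  for (const s of samples) {
    const byTier = (grouped[s.taskType] ??= {});
    (byTier[s.tier] ??= []).push(s.outputTokens);
  }

  const proposal: Record<string, Record<string, number>> = {};
  for (const [taskType, byTier] of Object.entries(grouped)) {
    for (const [tier, tokens] of Object.entries(byTier)) {
      if (tokens.length < minSamples) continue;
      (proposal[taskType] ??= {})[tier] = p10(tokens);
    }
  }
  return proposal;
}
```

The function only returns a proposal object; writing it into settings is left to the caller, in keeping with the review-then-land workflow described above.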
The §task§agent ephemeral lifecycle
One agent per task. Dismissed on completion. Never shared across tasks.
- Spawn: a new agent at the resolved tier with task-scoped context pulled from the wiki (recall-before-read; 6k hard cap, see Context Assembly).
- Work: the agent executes, escalating via the advisor protocol if it hits something above its tier.
- Extract: on completion, the agent extracts learnings into its guild.handoff.v2 envelope (learnings[]). The orchestrator lands these in .guild/runs/<run-id>/ as candidates for gated reflection.
- Dismiss: the agent terminates. No idle agents persist. The next task spawns a fresh agent.
Two concurrent tasks get two distinct agents — never shared. This lifecycle is orthogonal to the D5 agent_mode dispatch ladder (see Architecture & Lifecycle): D5 picks the backend, the §task§agent lifecycle fixes the per-task lifecycle on whichever backend D5 selects.
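The spawn, work, extract, dismiss sequence can be illustrated with a small synchronous sketch. Everything here is hypothetical scaffolding (the `liveAgents` registry, the `runTask` function, and the envelope shape beyond learnings[]); it only demonstrates the invariants stated above: one fresh agent per task, and no agent surviving its task.

```typescript
type Tier = "cheap" | "mid" | "powerful";

// Hypothetical subset of a handoff envelope; only learnings[] is documented.
interface Handoff {
  taskId: string;
  agentId: string;
  learnings: string[];
}

// Registry of agents that currently exist; used to show that none persist.
const liveAgents = new Set<string>();
let agentCounter = 0;

// Run one task on its own agent and dismiss the agent afterwards.
function runTask(
  taskId: string,
  tier: Tier,
  work: (agentId: string) => string[],
): Handoff {
  // Spawn: a fresh agent id for every task, never reused.
  agentCounter += 1;
  const agentId = `agent-${agentCounter}-${tier}`;
  liveAgents.add(agentId);
  try {
    // Work: the agent executes and returns what it learned.
    const learnings = work(agentId);
    // Extract: learnings travel in the handoff envelope.
    return { taskId, agentId, learnings };
  } finally {
    // Dismiss: the agent is removed even if the work throws.
    liveAgents.delete(agentId);
  }
}
```

Running two tasks through this function yields two different agent ids, and the registry is empty again after each call, which is the per-task isolation the lifecycle is meant to guarantee.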
See also
- Configuration reference: all models.* config keys, including models.tiers, models.scoreWeights, models.advisorRounds, models.escalationMarkers, models.shortOutputThreshold.
- Context Assembly: the recall-before-read implementation and the two recall paths (SQLite FTS5 / guild-memory MCP BM25).
- Architecture & Lifecycle: the D5 dispatch ladder and execution backends.