Evolve skills and specialists, Guild Stack

After this page, you can run an evolution workflow, read the proposal artifacts, decide whether shadow-mode evidence is good enough, and roll back a promoted skill if the next run proves it wrong. The invariant is visible on every path: no silent mutation. Guild Stack may propose, evaluate, archive, version, and report; it does not quietly rewrite live skills, agents, wiki content, permissions, sandbox policy, or runtime policy.

Try it: /guild:evolve <skill> snapshots the current skill and opens the eval-and-shadow evolution workspace for that named skill.

Skill evolution loop:

Reflect: run lesson
Propose: candidate change
Evaluate: positive + negative cases
Shadow: historical traces
Promote: human approval
Rollback: versioned snapshot

Each run can leave proposal evidence. Evals, shadow mode, and an approval gate decide whether anything becomes live.

The proposal workspace

Power users should start with the run artifact, not a promise. A skill evolution attempt writes a workspace like this:

.guild/evolve/<run-id>/
  pipeline.md
  evals.json
  assertions.json
  runs/
    A/                 # current skill baseline
    B/                 # proposed edit
  grading.json
  flip-report.md
  shadow-report.md
  gate.json
  archived/            # rejected attempts are kept, not deleted

The live skill is snapshotted before the attempt under .guild/skill-versions/<skill>/v<n>/. The wrapper stops before promotion, so the proposal workspace is inspectable before any write-back.

Shadow-mode artifact panel

Shadow mode replays the proposed trigger behavior against historical prompts without changing live routing. For skill evolution, the report is diagnostic: it gives you evidence at the gate. For evolution-proposed specialist creation, historical shadow mode is a hard gate because a new routable role can steal work from existing specialists. Human-requested creation uses applicable history when available and relies on the mandatory prospective paired evals when no relevant corpus exists.

Field	What to inspect
`historical_runs`	Which previous runs were replayed.
`total_prompts`	How much real prompt history the proposed trigger saw.
`total_divergences`	How often the proposed behavior differed from the historical route.
`divergence_rate`	Whether the proposal is a narrow correction or a routing risk.
`gate.json`	Which promotion condition cleared, or why the attempt stayed archived.

skill: guild-context-assemble
proposed_name: tighter-context-boundary
historical_runs: 12
total_prompts: 84
total_divergences: 3
divergence_rate: 0.036

That artifact does not approve the change by itself. It makes the approval conversation concrete.
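
As a quick check on the sample above, the reported rate is just divergences over prompts. The following is a minimal sketch, assuming the report rounds to three decimals; the `ShadowSummary` type and `divergenceRate` helper are hypothetical names, not part of Guild Stack.

```typescript
// Hypothetical shape mirroring the fields in the sample report above.
interface ShadowSummary {
  historical_runs: number;
  total_prompts: number;
  total_divergences: number;
}

// divergence_rate = total_divergences / total_prompts, rounded to 3 decimals
// (the rounding is an assumption). Returns 0 for an empty corpus to avoid
// dividing by zero.
function divergenceRate(s: ShadowSummary): number {
  if (s.total_prompts === 0) return 0;
  return Math.round((s.total_divergences / s.total_prompts) * 1000) / 1000;
}

const sampleSummary: ShadowSummary = {
  historical_runs: 12,
  total_prompts: 84,
  total_divergences: 3,
};
console.log(divergenceRate(sampleSummary)); // 0.036
```

Three divergences in 84 prompts is a narrow change; the same three divergences in ten prompts would be a very different conversation.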

What can change

Guild Stack’s evolution surface is intentionally narrow:

Surface	What Guild Stack may propose	What stays gated
Skill instances	clearer workflow steps, output shape, examples, anti-patterns, or trigger text	high-impact behavior changes and live promotion
Agent instances	trigger boundaries, handoff expectations, adjacent `DO NOT TRIGGER` clauses	tool permissions, isolation, model changes, and routing promotion
Templates	versioned template changes only after an explicit template-change gate	bulk mutation of existing skills or agents
Knowledge and retrieval	candidate improvements, ranking or packaging suggestions	treating a new source as normative
Runtime policy	proposals only	permissions, sandbox, destructive actions, network, spend, and production-sensitive work

That boundary matters for power users: you get better routing and clearer specialist behavior without losing an inspectable approval path. It also keeps the beginner path calm: a first project does not need to understand the factory before running Guild. Harness developers can inspect the actual artifacts under .guild/evolve/, .guild/reflections/, .guild/skill-versions/, and project-local .guild/agents/ when a specialist has been approved.

The run-to-proposal loop

Every completed run can feed the next one through five visible states:

Run: specialists execute; handoff receipts and telemetry accumulate under .guild/runs/<run-id>/.
Reflect: the Stop hook may fire guild:reflect; one reflection file lands at .guild/reflections/<run-id>.md. Reflections can name skill improvements, missing-specialist candidates, context-bundle issues, or followup backlog. No live skill, agent, or wiki page is mutated.
Propose: thresholds turn repeated evidence into a candidate, either skill evolution when repeated reflections name the same skill, or specialist creation when the same capability gap recurs across separate runs.
Evaluate and shadow: paired variants, grader output, flip reports, and shadow-mode replay show what would improve, regress, or route differently.
Promote or archive: the promotion gate writes gate.json; rejected attempts stay under .guild/evolve/<run-id>/archived/.

Learning checkpoints follow the same posture. They auto-capture run facts and propose candidates; they do not create a new promotion path and they do not bypass guild:wiki-ingest, guild:decisions, or an evolve gate.

Promotion conditions

The skill gate promotes only when one of the approved conditions is true:

Condition	Why it can pass
No regressions and at least one fix	The proposed edit improves coverage without breaking an existing case.
No behavior flip and token usage drops by at least 10%	The skill gets cheaper without changing observable routing.
Regressions exist and you explicitly approve	The tradeoff is visible and accepted.
Doc-only fast path	The edit is prose or description only, changes no trigger/body/eval logic, and you explicitly approve it.

If none of those conditions clears, the attempt is archived. The rejected attempt still teaches future work because it keeps the evals, grading, and shadow evidence.
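
The four conditions in the table above can be read as a single decision function. This is a minimal sketch under stated assumptions: the `GateInput` fields, the reason strings, and the reading of "no behavior flip" as zero regressions and zero fixes are all illustrative, and the real gate.json schema is not shown on this page.

```typescript
// Hypothetical inputs; field names are assumptions for illustration.
interface GateInput {
  regressions: number;    // cases that pass on A (current) and fail on B (proposed)
  fixes: number;          // cases that fail on A and pass on B
  tokenDeltaPct: number;  // negative means B uses fewer tokens than A
  docOnly: boolean;       // prose or description edit only
  humanApproved: boolean; // explicit approval recorded at the gate
}

interface GateResult {
  decision: "promote" | "archive";
  reason: string; // hypothetical label for whichever condition cleared
}

function evaluateGate(g: GateInput): GateResult {
  // Condition 1: no regressions and at least one fix.
  if (g.regressions === 0 && g.fixes >= 1) {
    return { decision: "promote", reason: "no_regressions_with_fix" };
  }
  // Condition 2: no behavior flip and token usage drops by at least 10%.
  if (g.regressions === 0 && g.fixes === 0 && g.tokenDeltaPct <= -10) {
    return { decision: "promote", reason: "no_flip_token_drop" };
  }
  // Condition 3: regressions exist and the user explicitly approves.
  if (g.regressions > 0 && g.humanApproved) {
    return { decision: "promote", reason: "regressions_approved" };
  }
  // Condition 4: doc-only fast path, still requires explicit approval.
  if (g.docOnly && g.humanApproved) {
    return { decision: "promote", reason: "doc_only_fast_path" };
  }
  // Nothing cleared: the attempt is archived, not deleted.
  return { decision: "archive", reason: "no_condition_cleared" };
}
```

Note that every branch involving a tradeoff requires `humanApproved`; only the two regression-free conditions can clear on evidence alone.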

Template provenance

Every skill and specialist instance derives from one of two read-only authoring templates shipped by the plugin:

Template	Produces
`guild.skill_template.vN`	skill directories and `SKILL.md` bodies
`guild.agent_template.vN`	agent definitions and specialist metadata

The domain specialist roster adds a third read-only template surface: the specialist type templates at templates/specialists/<role>.md (guild.specialist_template.v1). Minting copies a type template byte-for-byte into .guild/agents/<role>.md and stamps the instance with derived_from_template.

Project-local artifacts fill these templates; they do not fork the template contract. Authored instances carry derived_from_template: guild.{skill,agent}_template.vN frontmatter so later migrations can tell which template produced them. Template changes are lazy and staged: an existing skill or agent is migrated only when it next enters the normal paired-eval and shadow-mode gate.

Product templates are a different surface. They seed product-definition outputs, but v2 does not ship a paired-eval migration policy for product-template instances. Treat those as one-shot seeds or new template versions, not living instances Guild Stack re-conforms automatically.

Project-specific specialists

Guild Stack ships 15 domain specialist type templates plus two registered machinery agents (advisor, developer). Template-covered domains are minted into your project deterministically, so most teams never need this path. When your project repeatedly needs work none of the shipped roles should own, Guild can propose a project-specific specialist for that boundary.

When it triggers. Specialist creation has two origins. guild:reflect can propose a missing specialist when the same kind of work repeatedly surfaces; that evolution-proposed path requires repeated evidence, a distinct trigger boundary, context-isolation payoff, reflection/team gaps, eval coverage, and historical shadow replay. A user can also explicitly approve creation for a genuine team-compose gap. That human-requested path does not require pre-existing run or reflection history, but it keeps the prospective design and routing gates.

What gets created. A new agent definition at .guild/agents/<role>.md and companion skills under .guild/skills/<role>-*/, both written to your repo’s .guild/ directory, not the Guild Stack plugin itself.

The minting gates (7 steps):

Step	What happens
1	Interview: role name, trigger phrases, example outputs, dependencies
2	Draft under `.guild/agents/proposed/<role>.md` (incubation tree, not yet routable)
3	Boundary scan: description-similarity check against existing project instances and shipped specialist templates
4	Propose adjacent-boundary edits: add `DO NOT TRIGGER for: <new-domain>` to overlapping specialists
5	Gate boundary edits via `guild:evolve-skill`: paired evals verify adjacent specialists still route correctly
6	Gate new specialist: at least 3 positive and 3 negative paired eval cases for every origin. Historical shadow replay is required for evolution-proposed roles and used for human-requested roles when an applicable corpus exists; no applicable history is recorded as `not_applicable`.
7	Register: move from `proposed/` to `.guild/agents/<role>.md`; add the specialist to future team composition

After step 7, Guild Stack can dispatch the project specialist immediately through the definition-path mechanism at the role’s configured tier. The role is not host-registered as a new native subagent type, and restarting does not change that; Guild Stack composition and dispatch carry the project definition explicitly.
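
The step 6 gate in the table above combines an eval-coverage floor with an origin-dependent shadow rule. The sketch below is illustrative only: the type names and the `specialistGatePasses` helper are hypothetical, it checks coverage and replay status rather than eval outcomes, and treating a failed replay as blocking for human-requested roles is an assumption.

```typescript
type Origin = "evolution_proposed" | "human_requested";
type ShadowStatus = "passed" | "failed" | "not_applicable";

// Hypothetical gate input; field names are assumptions for illustration.
interface SpecialistGateInput {
  origin: Origin;
  positiveCases: number; // paired eval cases that should trigger the role
  negativeCases: number; // paired eval cases that should not trigger it
  shadow: ShadowStatus;  // result of historical shadow replay
}

function specialistGatePasses(i: SpecialistGateInput): boolean {
  // Every origin needs at least 3 positive and 3 negative paired eval cases.
  if (i.positiveCases < 3 || i.negativeCases < 3) return false;

  // Evolution-proposed roles: historical shadow replay is a hard gate,
  // so "not_applicable" is not enough.
  if (i.origin === "evolution_proposed") return i.shadow === "passed";

  // Human-requested roles: replay runs when applicable history exists;
  // an empty or irrelevant corpus is recorded as "not_applicable".
  // Assumption: a replay that ran and failed still blocks.
  return i.shadow !== "failed";
}
```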

Reuse, never re-creation. guild:team-compose reads the shipped template library and your repo’s .guild/agents/*.md on future runs. A minted or promoted specialist becomes a candidate for later teams without reopening the creation workflow; an existing instance is always reused, never re-created.

guild:team-compose
  -> shipped specialist templates (templates/specialists/*.md), minted on demand
  -> project-local specialists (.guild/agents/*), minted or promoted, reused every run
  -> gap? -> propose, skip, substitute, or compose from existing roles

Skills that evolve

guild:evolve-skill runs the 10-step pipeline when repeated reflections propose the same edit, or when you explicitly run /guild:evolve <skill>.

The threshold is counted across .guild/reflections/*.md, not from one bad run. On demand, the command creates the same evolution workspace so you can inspect the candidate without waiting for the threshold.

Pipeline (10 steps):

Step	What happens	Tooling
1	Snapshot: copy the current skill to `.guild/skill-versions/<skill>/v<n>/`	`scripts/evolve-loop.ts`
2	Load eval cases from `skills/<path>/evals.json`; bootstrap 2-3 from reflections if none exist	-
3	Spawn paired subagents: A = current skill, B = proposed edit	-
4	Drafter writes assertions in parallel	-
5	Grader evaluates each assertion and writes results to `.guild/evolve/<run-id>/grading.json`	-
6	Flip report: pass rate, token delta, pass-to-fail (P->F) regressions vs fail-to-pass (F->P) fixes	`scripts/flip-report.ts`
7	Shadow mode: replays proposed trigger behavior against historical prompts and writes `shadow-report.md` with `total_prompts`, `total_divergences`, and `divergence_rate`	`scripts/shadow-mode.ts`
8	Promotion gate: see Promotion conditions above	-
9	Description optimizer: on promote, derives a bounded description from eval cases	`scripts/description-optimizer.ts`
10	Reject path: archive the attempt under `.guild/evolve/<run-id>/archived/`	-

The promotion gate uses the four conditions listed above and records the reason in gate.json.

On promote: scripts/description-optimizer.ts runs, the live skill is written back through the evolve path, and the version history grows under .guild/skill-versions/<skill>/. On reject: archived under .guild/evolve/<run-id>/archived/. No live state changes.
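
Step 6 of the pipeline compares the paired A/B results case by case. The sketch below shows one way such a flip report could be computed; it is not the code in scripts/flip-report.ts, and the `PairedResult` and `FlipReport` shapes are hypothetical.

```typescript
// Hypothetical per-case result for the paired run (A = current, B = proposed).
interface PairedResult {
  caseId: string;
  aPassed: boolean;
  bPassed: boolean;
  aTokens: number;
  bTokens: number;
}

interface FlipReport {
  passRateA: number;
  passRateB: number;
  regressions: string[]; // pass on A, fail on B
  fixes: string[];       // fail on A, pass on B
  tokenDeltaPct: number; // negative means B is cheaper
}

function flipReport(results: PairedResult[]): FlipReport {
  const n = results.length;
  const passedA = results.filter((r) => r.aPassed).length;
  const passedB = results.filter((r) => r.bPassed).length;
  const tokensA = results.reduce((sum, r) => sum + r.aTokens, 0);
  const tokensB = results.reduce((sum, r) => sum + r.bTokens, 0);
  return {
    passRateA: n === 0 ? 0 : passedA / n,
    passRateB: n === 0 ? 0 : passedB / n,
    regressions: results.filter((r) => r.aPassed && !r.bPassed).map((r) => r.caseId),
    fixes: results.filter((r) => !r.aPassed && r.bPassed).map((r) => r.caseId),
    // Multiply before dividing to keep round percentages exact.
    tokenDeltaPct: tokensA === 0 ? 0 : ((tokensB - tokensA) * 100) / tokensA,
  };
}
```

Equal pass rates can hide a flip: one fix and one regression cancel out in the rate but both appear in the lists, which is why the gate looks at the lists rather than the rate.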

Learning from previous runs

Evolution proposals are grounded in evidence from your actual runs rather than a single post-run hunch.

How the thresholds work. guild:reflect writes structured frontmatter to each reflection file:

proposals:
  skill_improvement: [guild:context-assemble]
  missing_specialist: [data-scientist]

guild:evolve walks .guild/reflections/*.md and counts how many times each skill appears across runs. Evolution fires at the ≥3 threshold: evidence from three separate real runs, not one bad run.
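
The counting rule can be sketched as follows, assuming the reflection frontmatter has already been parsed into objects. The `Reflection` type and `skillsOverThreshold` helper are hypothetical; file discovery and YAML parsing are left out.

```typescript
// Hypothetical parsed form of one reflection file's frontmatter.
interface Reflection {
  runId: string;
  skillImprovement: string[]; // values of proposals.skill_improvement
}

// Returns the skills named in at least `threshold` distinct runs.
function skillsOverThreshold(reflections: Reflection[], threshold = 3): string[] {
  const runsBySkill = new Map<string, Set<string>>();
  for (const r of reflections) {
    for (const skill of r.skillImprovement) {
      const runs = runsBySkill.get(skill) ?? new Set<string>();
      runs.add(r.runId); // a Set, so repeats within one run count once
      runsBySkill.set(skill, runs);
    }
  }
  return Array.from(runsBySkill.entries())
    .filter(([, runs]) => runs.size >= threshold)
    .map(([skill]) => skill)
    .sort();
}
```

Counting distinct run IDs rather than raw mentions keeps one noisy run from reaching the threshold on its own.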

Shadow mode replays history. When guild:evolve-skill reaches step 7, scripts/shadow-mode.ts replays the proposed skill against the UserPromptSubmit events captured in .guild/runs/<id>/events.ndjson, the actual prompts your team sent. If the proposed edit would have misfired on past real work, you see that before promotion.

The same pattern holds for specialists. A specialist candidate recorded in one reflection is a weak signal; three separate runs surfacing the same gap is a strong signal. For evolution-proposed roles, the 7-step minting workflow requires that evidence before step 2 completes.

Versioning and rollback

Every skill edit is a versioned snapshot under .guild/skill-versions/<skill>/v<n>/. No operation destroys history:

/guild:rollback <skill> [n]

Walks the skill back n versions. Rollbacks themselves are snapshotted as new versions, so there is no destructive path.

scripts/rollback-walker.ts enumerates versions and, with --steps <n>, emits a proposed_rollback YAML action. The script is read-only; the actual rollback is performed by skills/meta/rollback-skill/SKILL.md. The rollback path appends a new snapshot sourced from the older version and reruns evals so drift is visible.
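
The append-only behavior can be modeled in memory. This sketch is a hypothetical illustration of "rollback as a new snapshot", not the logic in scripts/rollback-walker.ts or the rollback skill; the `Snapshot` type and `rollback` helper are assumed names.

```typescript
// Hypothetical in-memory model of .guild/skill-versions/<skill>/v<n>/.
interface Snapshot {
  version: number;
  body: string;
  sourcedFrom?: number; // set when the snapshot was produced by a rollback
}

// Walks back `steps` versions by appending a copy of the older snapshot.
// The input history is never modified or truncated.
function rollback(history: readonly Snapshot[], steps: number): Snapshot[] {
  if (!Number.isInteger(steps) || steps < 1 || steps >= history.length) {
    throw new RangeError(`cannot walk back ${steps} of ${history.length} versions`);
  }
  const latest = history[history.length - 1];
  const target = history[history.length - 1 - steps];
  return [
    ...history,
    { version: latest.version + 1, body: target.body, sourcedFrom: target.version },
  ];
}
```

Rolling back from v3 by one step yields a v4 whose body matches v2, with v1 through v3 still present, so the rollback itself can later be rolled back.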

Shadow mode detail

Shadow mode (skill evolution step 7) runs the proposed trigger description against historical traces without changing live routing. It records:

how many historical prompts were replayed
how many trigger decisions diverged
the divergence rate
the most frequent historical specialist involved in the replay

For skill evolution, shadow mode is diagnostic: it exits 0 and does not block the pipeline by itself. Its shadow-report.md gives you evidence at the promotion gate.

For evolution-proposed specialist creation, historical shadow mode is a hard gate at step 6. For human-requested creation, Guild Stack replays applicable history when it exists; an empty or irrelevant corpus is recorded as not_applicable, while the prospective paired eval gate remains mandatory.
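
The four recorded values can be derived from a single pass over the captured prompt events. The sketch below assumes an NDJSON event shape that this page does not document, so the `PromptEvent` field names and the `replay` helper are hypothetical; only the `UserPromptSubmit` event name comes from the text above.

```typescript
// Hypothetical event shape; the real events.ndjson schema may differ.
interface PromptEvent {
  type: string;             // e.g. "UserPromptSubmit"
  prompt: string;
  routed_to: string;        // specialist that handled the prompt historically
  skill_triggered: boolean; // whether the live skill fired at the time
}

interface ReplaySummary {
  total_prompts: number;
  total_divergences: number;
  divergence_rate: number;
  top_specialist: string | null;
}

function replay(ndjson: string, proposedTrigger: (prompt: string) => boolean): ReplaySummary {
  const events = ndjson
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as PromptEvent)
    .filter((e) => e.type === "UserPromptSubmit");

  let divergences = 0;
  const bySpecialist = new Map<string, number>();
  for (const e of events) {
    // A divergence is a prompt where the proposed trigger decides differently.
    if (proposedTrigger(e.prompt) !== e.skill_triggered) divergences += 1;
    bySpecialist.set(e.routed_to, (bySpecialist.get(e.routed_to) ?? 0) + 1);
  }

  let top: string | null = null;
  let topCount = 0;
  for (const [name, count] of Array.from(bySpecialist.entries())) {
    if (count > topCount) {
      top = name;
      topCount = count;
    }
  }

  return {
    total_prompts: events.length,
    total_divergences: divergences,
    divergence_rate: events.length === 0 ? 0 : divergences / events.length,
    top_specialist: top,
  };
}
```

Nothing here touches live routing: the replay only reads recorded events and reports what would have changed.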

Description optimizer

scripts/description-optimizer.ts runs as the final step before a promoted skill edit is written back.

Inputs: the skill’s evals.json should_trigger / should_not_trigger arrays.
Output: a YAML description: <...> to stdout (no file writes).
Deterministic, no LLM call. Tests in scripts/__tests__/ pin the output for a given input fixture.

Purpose: prevent under-trigger bias and keep the description within the 1024-character limit that Claude Code enforces on skill descriptions.
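
To illustrate what "deterministic and bounded" can mean in practice, the sketch below builds a description from the two eval arrays and never exceeds the limit. The phrase format and greedy packing are assumptions made for this example; this is not the algorithm in scripts/description-optimizer.ts.

```typescript
const MAX_DESCRIPTION_CHARS = 1024;

interface EvalCases {
  should_trigger: string[];
  should_not_trigger: string[];
}

// Trim, drop empties, dedupe, and sort so input order never changes the output.
function normalize(xs: string[]): string[] {
  return Array.from(new Set(xs.map((x) => x.trim()).filter((x) => x.length > 0))).sort();
}

// Greedily packs whole phrases, joined by single spaces, and skips any
// phrase that would push the result past the limit.
function boundedDescription(evals: EvalCases, limit = MAX_DESCRIPTION_CHARS): string {
  const parts: string[] = [];
  let length = 0;
  const push = (phrase: string): void => {
    const added = (parts.length === 0 ? 0 : 1) + phrase.length;
    if (length + added <= limit) {
      parts.push(phrase);
      length += added;
    }
  };
  for (const p of normalize(evals.should_trigger)) push(`TRIGGER for: ${p}.`);
  for (const n of normalize(evals.should_not_trigger)) push(`DO NOT TRIGGER for: ${n}.`);
  return parts.join(" ");
}
```

Because the inputs are sorted and no model is called, the same evals.json always yields the same string, which is what makes pinning the output in a test fixture possible.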

Where reflections go

Post-run reflections flow through a staged write path that produces proposals, not immediate promotion:

guild:reflect writes proposals to .guild/reflections/<run-id>.md.
When ≥3 proposals accumulate for a skill, evolution is queued.
When a missing-specialist gap recurs across ≥3 runs, the minting workflow opens.
You gate promotion via /guild:evolve (skill pipeline) or approve the create-specialist gates.
Promoted skills and registered specialists take effect on the next run.

See Project Memory & Wiki Pattern for how reflections that contain knowledge insights feed into .guild/wiki/ via guild:wiki-ingest.

Reporting problems upstream

Some run learnings are not about your project — they are about Guild Stack itself: a broken flow, a missing host-adapter behavior, an unsafe default, a portability defect. Guild Stack routes those to the plugin instead of your wiki, and the routing is deterministic code, not a judgment call — so a plugin bug can’t be silently rewritten as a project note, and a project detail can’t leak into a public issue.

The flow is consent-gated at every step:

Findings are written as structured JSON and classified project-vs-plugin by run-learning-classifier.ts — the same classifier every time, not per-run discretion.
Plugin-classified findings become sanitized issue drafts under .guild/feedback/<run-id>/, with private absolute paths, tokens, and emails redacted. Nothing is sent yet.
Guild Stack asks you per draft. Only an explicit approve step reaches gh issue create against the Guild Stack repo. Denials are recorded, and non-interactive sessions never file.
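
The redaction step in the flow above can be pictured as a pass of pattern replacements over the draft text. The patterns and placeholder strings below are illustrative assumptions, not Guild Stack's actual sanitizer, and a real implementation would need broader coverage.

```typescript
// Illustrative patterns only; each one is an assumption for this sketch.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// GitHub-style token prefixes (ghp_, gho_, ghu_, ghs_, ghr_).
const TOKEN = /\bgh[pousr]_[A-Za-z0-9]{20,}\b/g;
// Absolute paths under a user's home directory on macOS or Linux.
const HOME_PATH = /\/(?:Users|home)\/[^\s"']+/g;

// Replaces private details with fixed placeholders before a draft is shown.
function sanitizeDraft(text: string): string {
  return text
    .replace(EMAIL, "[REDACTED_EMAIL]")
    .replace(TOKEN, "[REDACTED_TOKEN]")
    .replace(HOME_PATH, "[REDACTED_PATH]");
}

const draft =
  "Failed at /Users/alice/proj/.guild/runs/1 for alice@example.com with ghp_abcdefghijklmnopqrstuvwxyz012345";
console.log(sanitizeDraft(draft));
// Failed at [REDACTED_PATH] for [REDACTED_EMAIL] with [REDACTED_TOKEN]
```

Sanitizing happens before the per-draft prompt, so what you approve is exactly what would be filed.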

You reach this path through /guild:fix when diagnosing a suspicious run, or automatically after a non-trivial run via the reflect phase. Project-level learnings stay in your .guild/ tree behind the normal review gate; only the plugin-level ones are ever proposed for upstream filing, and only with your per-draft approval.

Source-backed boundaries

The public rule is simple: Guild may propose, evaluate, archive, and version; humans and gates decide what becomes live. The shipped evolution module owns /guild:evolve, /guild:rollback, guild-evolve-skill, guild-reflect, guild-learning-checkpoint, shadow-mode reports, append-only skill snapshots, and proposal artifacts under .guild/. It does not promote wiki content, permissions, sandbox policy, runtime policy, or template migrations by itself.
