Self-Evolving Skills

Guild skills improve through a 10-step gated pipeline: paired evals, flip report, shadow mode, promotion gate. You approve every promotion — nothing auto-promotes.

After each run, Guild proposes improvements to its own skills. You gate every promotion — paired evaluations must pass, shadow-mode replay must confirm no regressions, and you approve the final gate. Nothing auto-promotes.

Figure: Memory and recall diagram. Observations from each run flow through a staged write path: reflect → wiki candidates → gated skill promotion, and from there back into the specialist context bundle. You control what gets promoted.

Two triggers

Automatic: reflection threshold. After each run, the Stop hook fires guild:reflect when the completion heuristic passes (≥1 specialist dispatched, ≥1 edit, no error). guild:reflect writes its proposed edits to .guild/reflections/<skill>/. When ≥3 proposed edits accumulate for a single skill, the orchestrator queues that skill for evolution.
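The two conditions in the automatic trigger can be sketched as plain predicates. This is an illustration only: the `RunSummary` shape, its field names, and both function names are assumptions made for the example, not Guild's actual schema or API.

```typescript
// Hypothetical summary of a finished run; field names are assumed for illustration.
interface RunSummary {
  specialistsDispatched: number;
  edits: number;
  hadError: boolean;
}

// Completion heuristic as described: >=1 specialist dispatched, >=1 edit, no error.
function shouldReflect(run: RunSummary): boolean {
  return run.specialistsDispatched >= 1 && run.edits >= 1 && !run.hadError;
}

// A skill is queued for evolution once it has accumulated >=3 proposed edits.
const EVOLUTION_THRESHOLD = 3;

// Input maps a skill name to its count of pending proposed edits.
function skillsToQueue(proposedEditsBySkill: Record<string, number>): string[] {
  return Object.keys(proposedEditsBySkill)
    .filter((skill) => proposedEditsBySkill[skill] >= EVOLUTION_THRESHOLD)
    .sort();
}
```

For example, a run with one dispatched specialist but zero edits would not trigger reflection, and a skill with two pending proposals would stay unqueued until a third arrives.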

Explicit — /guild:evolve [skill] [--auto]. Trigger evolution for a specific skill on demand. --auto runs unattended through the promotion gate.

Specialist candidate. A cluster of related skill edits that repeatedly co-activate, exceed the token budget, or appear as a missing role in ≥3 team-compose runs queues a candidate specialist — not an immediate add. It incubates under agents/proposed/<role>.md until the gates pass.

Pipeline — 10 steps

Driven by skills/meta/evolve-skill/SKILL.md, with tooling under scripts/:

| Step | What happens | Tooling |
| --- | --- | --- |
| 1 | Snapshot: current skill → .guild/skill-versions/<skill>/v<n>/ | scripts/evolve-loop.ts |
| 2 | Load eval cases from skills/<path>/evals/evals.json. If none exist, bootstrap 2–3 cases from accumulated reflections | |
| 3 | Spawn paired subagents in the same turn: Agent A = current skill, Agent B = proposed edit. Net-new skill: A = no-skill baseline | |
| 4 | Drafter writes assertions in parallel while the runs execute | |
| 5 | Grader evaluates each assertion → .guild/evolve/<run-id>/grading.json | |
| 6 | Flip report: computes pass_rate, duration_ms, total_tokens, mean ± stddev, delta; P→F regressions vs F→P fixes | scripts/flip-report.ts |
| 7 | Shadow mode: replays proposed skill against historical traces under .guild/runs/*/events.ndjson. Diagnostic only; never blocks | scripts/shadow-mode.ts |
| 8 | Promotion gate: see below | |
| 9 | Description optimizer: on promote, derives a ≤1024-char description from should_trigger / should_not_trigger evals. Deterministic, no LLM call | scripts/description-optimizer.ts |
| 10 | Reject path: archive the attempt under .guild/evolve/<run-id>/archive/ for future iterations | |
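To make step 6 concrete, here is a minimal sketch of how pass-to-fail and fail-to-pass flips and the mean ± stddev figures could be computed. The `GradedAssertion` record is a hypothetical shape; the real grading.json schema and the internals of scripts/flip-report.ts are not shown in this document.

```typescript
type Outcome = "pass" | "fail";

// Hypothetical per-assertion record; not the actual grading.json schema.
interface GradedAssertion {
  id: string;
  baseline: Outcome;  // Agent A: current skill (or no-skill baseline)
  candidate: Outcome; // Agent B: proposed edit
}

interface FlipSummary {
  regressions: string[]; // P→F
  fixes: string[];       // F→P
  baselinePassRate: number;
  candidatePassRate: number;
  delta: number;         // candidate pass rate minus baseline pass rate
}

function summarizeFlips(graded: GradedAssertion[]): FlipSummary {
  const regressions = graded
    .filter((g) => g.baseline === "pass" && g.candidate === "fail")
    .map((g) => g.id);
  const fixes = graded
    .filter((g) => g.baseline === "fail" && g.candidate === "pass")
    .map((g) => g.id);
  const rate = (pick: (g: GradedAssertion) => Outcome): number =>
    graded.length === 0 ? 0 : graded.filter((g) => pick(g) === "pass").length / graded.length;
  const baselinePassRate = rate((g) => g.baseline);
  const candidatePassRate = rate((g) => g.candidate);
  return {
    regressions,
    fixes,
    baselinePassRate,
    candidatePassRate,
    delta: candidatePassRate - baselinePassRate,
  };
}

// Mean and sample standard deviation, e.g. for duration_ms or total_tokens across runs.
function meanStddev(values: number[]): { mean: number; stddev: number } {
  const n = values.length;
  if (n === 0) return { mean: 0, stddev: 0 };
  const mean = values.reduce((a, b) => a + b, 0) / n;
  if (n === 1) return { mean, stddev: 0 };
  const sumSq = values.reduce((acc, v) => acc + (v - mean) * (v - mean), 0);
  return { mean, stddev: Math.sqrt(sumSq / (n - 1)) };
}
```

The sample (n − 1) standard deviation is a choice made for this sketch; the document does not say which estimator the flip report uses.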

Promotion gate — 3 criteria

Promote if any of:

  1. 0 regressions AND ≥1 fix — the proposed edit strictly improves coverage.
  2. No flip change AND tokens ↓ ≥10% — no behavioral change, real efficiency win.
  3. Regressions present AND you approve via the review viewer.

On promote: scripts/description-optimizer.ts runs, the commit lands, version is bumped. On reject: the attempt is archived under .guild/evolve/<run-id>/archive/; no live state changes.
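The three criteria above reduce to a small decision function. The sketch below is one reading of them, not Guild's implementation; `GateInput`, its field names, and `promotionGate` are hypothetical, and "no flip change" is interpreted here as zero regressions and zero fixes.

```typescript
// Hypothetical inputs to the gate; names are assumed for illustration.
interface GateInput {
  regressions: number;     // P→F flips
  fixes: number;           // F→P flips
  baselineTokens: number;  // total_tokens for the current skill
  candidateTokens: number; // total_tokens for the proposed edit
  humanApproved: boolean;  // approval given in the review viewer
}

interface GateDecision {
  promote: boolean;
  reason: string;
}

function promotionGate(input: GateInput): GateDecision {
  const { regressions, fixes, baselineTokens, candidateTokens, humanApproved } = input;

  // Criterion 1: 0 regressions AND >=1 fix.
  if (regressions === 0 && fixes >= 1) {
    return { promote: true, reason: "0 regressions and at least 1 fix" };
  }

  // Criterion 2: no flips AND tokens down by >=10%.
  // Integer comparison (candidate <= 90% of baseline) avoids floating-point edge cases.
  const tokensDownTenPercent =
    baselineTokens > 0 && candidateTokens * 10 <= baselineTokens * 9;
  if (regressions === 0 && fixes === 0 && tokensDownTenPercent) {
    return { promote: true, reason: "no flip change and tokens down by at least 10%" };
  }

  // Criterion 3: regressions present AND explicit approval.
  if (regressions > 0 && humanApproved) {
    return { promote: true, reason: "regressions present, approved in review" };
  }

  return { promote: false, reason: "no promotion criterion met" };
}
```

Under this reading, an edit with regressions never promotes without approval, even if it also fixes other cases.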

Versioning and rollback

Every skill edit is a versioned artifact under .guild/skill-versions/<skill>/v<n>/. No operation destroys history:

/guild:rollback <skill> [n]

Walks the skill back n versions. Rollbacks themselves snapshot as new versions.

scripts/rollback-walker.ts enumerates versions and, with --steps <n>, emits a proposed_rollback YAML action. It is read-only — the actual rollback is performed by skills/meta/rollback-skill/SKILL.md.
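The "rollbacks snapshot as new versions" rule means history only ever grows. A minimal in-memory sketch of that append-only behavior follows; the `SkillVersion` type and `rollback` function are hypothetical and do not reflect the format of the files under .guild/skill-versions/ or the proposed_rollback YAML.

```typescript
// Hypothetical in-memory model of a skill's version history (oldest first).
interface SkillVersion {
  version: number;
  content: string;
  note: string;
}

// Walk back `steps` versions by appending a new version that copies the target's
// content. Nothing is removed, so the rollback itself can later be rolled back.
function rollback(history: SkillVersion[], steps: number = 1): SkillVersion[] {
  const validSteps = Math.floor(steps) === steps && steps >= 1 && steps < history.length;
  if (!validSteps) {
    throw new RangeError(
      `cannot walk back ${steps} version(s) from a history of ${history.length}`
    );
  }
  const latest = history[history.length - 1];
  const target = history[history.length - 1 - steps];
  const restored: SkillVersion = {
    version: latest.version + 1,
    content: target.content,
    note: `rollback to v${target.version}`,
  };
  // concat returns a new array; the input history is left untouched.
  return history.concat([restored]);
}
```

Starting from v1, v2, v3, a two-step rollback yields a v4 whose content matches v1, while v2 and v3 remain available.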

Shadow mode

Shadow mode (step 7) is the safety valve before promotion. scripts/shadow-mode.ts replays the proposed skill against historical traces without changing live routing, recording:

  • Trigger accuracy vs the live skill on the same prompts
  • Boundary collisions with adjacent specialists
  • Token deltas vs baseline
  • Output-quality divergence

Shadow mode is diagnostic only — it always exits 0 and never blocks the pipeline. Its shadow-report.md feeds your decision on the promotion gate when regressions are flagged.
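As a rough illustration of the kind of comparison a shadow replay produces, the sketch below parses newline-delimited JSON and summarizes trigger agreement and token delta. The `ShadowRecord` fields are invented for the example; the actual events.ndjson schema and the output of scripts/shadow-mode.ts are not described here.

```typescript
// Hypothetical replay record; the real events.ndjson schema may differ entirely.
interface ShadowRecord {
  prompt: string;
  liveTriggered: boolean;
  shadowTriggered: boolean;
  liveTokens: number;
  shadowTokens: number;
}

// NDJSON: one JSON value per line; blank lines are skipped.
function parseNdjson<T>(text: string): T[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as T);
}

interface ShadowSummary {
  triggerAgreement: number; // fraction of prompts where live and shadow agree
  tokenDelta: number;       // total shadow tokens minus total live tokens
}

function summarizeShadow(records: ShadowRecord[]): ShadowSummary {
  const agree = records.filter((r) => r.liveTriggered === r.shadowTriggered).length;
  const tokenDelta = records.reduce((acc, r) => acc + (r.shadowTokens - r.liveTokens), 0);
  return {
    triggerAgreement: records.length === 0 ? 1 : agree / records.length,
    tokenDelta,
  };
}
```

Because the summary is purely descriptive, a caller can report it and still exit successfully, which matches the diagnostic-only role described above.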

Description optimizer

scripts/description-optimizer.ts runs as the final step before commit.

  • Inputs: the skill’s evals.json should_trigger / should_not_trigger arrays.
  • Output: a YAML description: <...> to stdout (no file writes).
  • Deterministic — no LLM call. Tests in scripts/__tests__/ pin the output for a given input fixture.

Purpose: prevent under-trigger bias and keep the description within the 1024-character limit that Claude Code enforces on skill descriptions.
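One way to get a deterministic, length-bounded description is to normalize the inputs and trim whole phrases until the text fits. The sketch below shows that idea only; the wording template, the trimming order, and every name in it are assumptions, and the real scripts/description-optimizer.ts may work quite differently.

```typescript
const MAX_DESCRIPTION_CHARS = 1024;

// Mirrors the two arrays named in the text; any other evals.json fields are omitted.
interface TriggerEvals {
  should_trigger: string[];
  should_not_trigger: string[];
}

// Trim, de-duplicate, and sort so the result does not depend on input order.
function uniqueSorted(items: string[]): string[] {
  const out: string[] = [];
  for (const raw of items) {
    const s = raw.trim();
    if (s.length > 0 && out.indexOf(s) === -1) out.push(s);
  }
  return out.sort();
}

// Hypothetical template; the real description wording is not specified in this document.
function render(pos: string[], neg: string[]): string {
  const parts: string[] = [];
  if (pos.length > 0) parts.push("Use when: " + pos.join("; ") + ".");
  if (neg.length > 0) parts.push("Do not use for: " + neg.join("; ") + ".");
  return parts.join(" ");
}

function buildDescription(evals: TriggerEvals): string {
  const pos = uniqueSorted(evals.should_trigger);
  const neg = uniqueSorted(evals.should_not_trigger);
  let text = render(pos, neg);
  // Drop whole phrases until the text fits: negatives first, so positive
  // triggers are kept as long as possible (an assumed policy for this sketch).
  while (text.length > MAX_DESCRIPTION_CHARS && pos.length + neg.length > 0) {
    if (neg.length > 0) neg.pop();
    else pos.pop();
    text = render(pos, neg);
  }
  return text;
}
```

Pure string operations with a fixed sort order give the same output for the same input, which is the property that makes pinned-output tests possible.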

Creating a new specialist — same gate

skills/meta/create-specialist/SKILL.md handles the net-new-specialist path. It is invoked when guild:team-compose hits a capability gap it can’t fill from the existing 14 specialists:

  1. Interview the user for role, responsibilities, example outputs, dependencies.
  2. Draft agents/proposed/<new>.md + 2–5 proposed T5 skills.
  3. Boundary scan — description-similarity check against all existing agents/*.md.
  4. Propose boundary edits — add DO NOT TRIGGER for: <new-domain> lines to each overlapping specialist.
  5. Gate the boundary edits through guild:evolve-skill — paired evals verify adjacent specialists still trigger for their domain but don’t steal the new specialist’s triggers.
  6. Gate the new specialist — paired evals + shadow mode on historical specs.
  7. Register — move proposed files to agents/ and skills/specialists/; add to guild:team-compose’s candidate list.

Failure at any gate stops the process and returns refinement options.
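The boundary scan in step 3 is described only as a description-similarity check. As one possible illustration, the sketch below uses Jaccard similarity over word sets; the metric, the 0.3 threshold, and the function names are all assumptions, since the document does not say how similarity is measured.

```typescript
// Lowercase, split into alphanumeric words, and de-duplicate.
function tokenize(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two word sets.
function jaccard(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.length === 0 && tb.length === 0) return 0;
  const intersection = ta.filter((w) => tb.indexOf(w) !== -1).length;
  const union = ta.length + tb.length - intersection;
  return intersection / union;
}

// Return the names of existing agents whose descriptions overlap the candidate's
// at or above the threshold. `existing` maps agent name to description text.
function boundaryScan(
  candidateDescription: string,
  existing: Record<string, string>,
  threshold: number = 0.3 // assumed value, chosen only for the example
): string[] {
  return Object.keys(existing)
    .filter((name) => jaccard(candidateDescription, existing[name]) >= threshold)
    .sort();
}
```

Any specialist returned by such a scan would be a candidate for the DO NOT TRIGGER boundary edit in step 4.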

Restart required after registration. Claude Code loads plugin agent and skill manifests at session startup. The files are live on disk immediately, but the current session won’t route to the new specialist until the plugin reloads.

Where reflections go

Post-run reflections flow through a staged write path — they’re proposals, not immediate commits:

  1. guild:reflect writes proposed edits to .guild/reflections/<skill>/.
  2. When ≥3 proposals accumulate for a skill, evolution is queued.
  3. You gate promotion via /guild:evolve — the 10-step pipeline runs, and you approve the promotion gate.
  4. Promoted skills are committed; rejected attempts are archived.

See Project Memory & Wiki Pattern for how reflections that contain knowledge insights feed into .guild/wiki/ via guild:wiki-ingest.

See also