Tracks skill usage patterns, edit frequency, and success rates to suggest improvements and optimizations. Manages skill versioning with safe rollback capability and confidence scoring for suggestions. Use when reviewing skill performance, applying auto-suggested changes, or rolling back problematic versions.

Reference medium

Auto-activated — this skill loads automatically when Claude detects matching context.

Skill Evolution Manager

Enables skills to automatically improve based on usage patterns, user edits, and success rates. Provides version control with safe rollback capability.

Overview

Use this skill for:

Reviewing how skills are performing across sessions
Identifying patterns in user edits to skill outputs
Applying learned improvements to skill templates
Rolling back problematic skill changes
Tracking skill version history and success rates

Quick Reference

Command	Description
`/ork:skill-evolution`	Show evolution report for all skills
`/ork:skill-evolution analyze <skill-id>`	Analyze specific skill patterns
`/ork:skill-evolution evolve <skill-id>`	Review and apply suggestions
`/ork:skill-evolution history <skill-id>`	Show version history
`/ork:skill-evolution rollback <skill-id> <version>`	Restore previous version
`/ork:skill-evolution promote <skill-id>`	Holdout bake-off: promote a candidate only if it beats champion by margin

How It Works

The skill evolution system operates in three phases:

COLLECT                    ANALYZE                    ACT
───────                    ───────                    ───
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│ PostTool    │──────────▶│ Evolution   │──────────▶│ /ork:skill- │
│ Edit        │  patterns │ Analyzer    │ suggest   │ evolution   │
│ Tracker     │           │ Engine      │           │ command     │
└─────────────┘           └─────────────┘           └─────────────┘
     │                          │                          │
     ▼                          ▼                          ▼
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│ edit-       │           │ evolution-  │           │ versions/   │
│ patterns.   │           │ registry.   │           │ snapshots   │
│ jsonl       │           │ json        │           │             │
└─────────────┘           └─────────────┘           └─────────────┘

Load details: Read("${CLAUDE_SKILL_DIR}/rules/pattern-detection-heuristics.md") for tracked edit patterns and detection regexes.

Load details: Read("${CLAUDE_SKILL_DIR}/rules/confidence-scoring.md") for suggestion thresholds.

Subcommands

Each subcommand is documented with implementation details, shell commands, and sample output. Load details: Read("${CLAUDE_SKILL_DIR}/references/evolution-commands.md")

Report (Default)

/ork:skill-evolution — Shows evolution report for all tracked skills with usage counts, success rates, and pending suggestions.

Analyze

/ork:skill-evolution analyze <skill-id> — Deep-dives into edit patterns for a specific skill, showing frequency, sample counts, and confidence scores.

Evolve

/ork:skill-evolution evolve <skill-id> — Interactive review of improvement suggestions. Uses AskUserQuestion for each suggestion (Apply / Skip / Reject). Creates version snapshot before applying.

History

/ork:skill-evolution history <skill-id> — Shows version history with performance metrics per version.

Rollback

/ork:skill-evolution rollback <skill-id> <version> — Restores a previous version after confirmation. Current version is backed up automatically.

Data Files

File	Purpose	Format
`.claude/feedback/edit-patterns.jsonl`	Raw edit pattern events	JSONL (append-only)
`.claude/feedback/evolution-registry.json`	Aggregated suggestions	JSON
`.claude/feedback/metrics.json`	Skill usage metrics	JSON
`skills/<cat>/<name>/versions/`	Version snapshots	Directory
`skills/<cat>/<name>/versions/manifest.json`	Version metadata	JSON

Auto-Evolution Safety

Load details: Read("${CLAUDE_SKILL_DIR}/rules/auto-evolution-triggers.md") for full safety mechanisms, health monitoring, and trigger criteria.

Key safeguards: version snapshots before changes, auto-alert on >20% success rate drop, human review required, rejected suggestions never re-suggested.

Holdout-Promotion Gate (Champion / Challenger)

/ork:skill-evolution promote <skill-id> — promote a candidate edit ONLY if it beats the current version on a sealed holdout eval set by a margin (default 0.5 on the 0–10 rubric). Both scores + the promote/reject decision append to an append-only promotion-ledger.jsonl for audit. This is the objective gate evolve was missing — an "Apply?" prompt is not proof the edit is better — and the native mechanism for "auto-evaluate all skills + agents to standard": a skill graduates by winning a bake-off, not by a human eyeballing a diff.

champion (HEAD) -- vs -- challenger (candidate)
        +--------+--------+
   SEALED HOLDOUT (hash-locked) -- bare-eval forked graders
                 v
   delta = challenger - champion >= margin ?  ->  PROMOTE : REJECT  (both logged)

Grading runs through bare-eval (--bare --print, CLAUDE_CODE_FORK_SUBAGENT=1) so champion and challenger see byte-identical, isolated conditions — the only variable is the SKILL.md version. Ties reject (incumbent wins); a per-dimension min_blocker auto-rejects a one-axis regression even when the composite improves.

Anti-gaming: the challenger generator never reads evals/holdout.jsonl (input allowlist = edit-patterns.jsonl only), and every bake-off + CI recomputes LC_ALL=C sha256 of the holdout against holdout.lock.json — train-on-test or a silent holdout edit fails closed. Expensive by design (2*N grader calls; --bare requires ANTHROPIC_API_KEY and bills tokens), so it runs on-demand or in CI only, never in a background hook ($0 idle).

Load details: Read("${CLAUDE_SKILL_DIR}/references/holdout-promotion-gate.md") for ledger/lock JSON schemas, decision rule, run loop, and /goal + CI wiring.

References

Load on demand with Read("${CLAUDE_SKILL_DIR}/references/<file>"):

File	Content
`evolution-commands.md`	Subcommand implementation, shell commands, and sample output
`evolution-analysis.md`	Evolution analysis methodology
`version-management.md`	Version management guide
`holdout-promotion-gate.md`	Champion/challenger gate: sealed holdout, ledger schema, decision rule, /goal + CI wiring

Rules

Load on demand with Read("${CLAUDE_SKILL_DIR}/rules/<file>"):

File	Content
`pattern-detection-heuristics.md`	Edit pattern categories and regex detection
`confidence-scoring.md`	Suggestion thresholds and confidence criteria
`auto-evolution-triggers.md`	Safety mechanisms and trigger criteria

ork:configure - Configure OrchestKit settings
ork:doctor - Diagnose OrchestKit issues
feedback-dashboard - View comprehensive feedback metrics

Rules (3)

Auto-Evolution Triggers — HIGH

Auto-Evolution Safety & Trigger Criteria

Safety Mechanisms

Version Snapshots: Always created before changes
Rollback Triggers: Auto-alert if success rate drops >20%
Human Review: High-confidence suggestions require approval
Rejection Memory: Rejected suggestions are never re-suggested

Health Monitoring

The system monitors skill health and can trigger warnings:

WARNING: api-design-framework success rate dropped from 94% to 71%
Consider: /ork:skill-evolution rollback api-design-framework 1.1.0

Incorrect:

# Auto-apply pattern after 2 uses, no rollback tracking
confidence: 60%, samples: 2 → APPLY

Correct:

# Require minimum samples and high confidence before suggesting
confidence: 85%, samples: 8 → SUGGEST (requires human approval)
confidence: 60%, samples: 2 → TRACK ONLY (below threshold)

When Auto-Evolution Activates

Pattern frequency exceeds the Add Threshold (70%)
At least Minimum Samples (5) uses recorded
No prior rejection for the same pattern on the same skill
Current skill version success rate is stable (no recent drops)

When Rollback Is Triggered

Success rate drops more than 20% after an evolution
Alert is surfaced in the next report or analyze invocation
User is prompted to rollback via AskUserQuestion
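
The rollback trigger can be sketched as a single comparison. This is a minimal illustration, not the engine's actual check: the function name is hypothetical, and it assumes the 20% drop is measured in absolute percentage points with rates expressed as fractions, which the rule text does not state explicitly.

```python
def should_alert_rollback(rate_before: float, rate_after: float,
                          max_drop: float = 0.20) -> bool:
    """Return True when the success rate fell by more than max_drop.

    Assumption: the drop is an absolute difference between two rates
    expressed as fractions (0.94 means 94%), not a relative change.
    """
    return (rate_before - rate_after) > max_drop


# The health-monitoring example above: 94% -> 71% is a 23-point drop.
print(should_alert_rollback(0.94, 0.71))  # True
print(should_alert_rollback(0.94, 0.80))  # False
```

If the intended measure is a relative drop instead, only the comparison inside the function would change.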

Confidence Scoring — HIGH

Confidence Scoring & Suggestion Thresholds

Thresholds

Threshold	Default	Description
Minimum Samples	5	Uses before generating suggestions
Add Threshold	70%	Frequency to suggest adding pattern
Auto-Apply Confidence	85%	Confidence for auto-application
Rollback Trigger	-20%	Success rate drop to trigger rollback

Confidence Calculation

Confidence is calculated as the share of skill uses in which the pattern was applied:

confidence = pattern_frequency / total_uses

Below 70%: Pattern tracked but no suggestion generated
70%-84%: Suggestion generated, requires human approval via evolve subcommand
85%+: Auto-apply eligible (still requires human confirmation via AskUserQuestion)

Incorrect:

# Apply pattern with only 2 data points
pattern_frequency: 2/3 (67%) → auto-apply  # Too few samples, unreliable

Correct:

# Wait for minimum samples before generating suggestions
pattern_frequency: 6/8 (75%) → suggest (requires approval)
pattern_frequency: 2/3 (67%) → track only (below 5 minimum samples)
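
The tiers above can be read as a small classifier. The sketch below uses the default thresholds from the table; the function name and the returned labels are illustrative assumptions, not identifiers from the evolution engine.

```python
MIN_SAMPLES = 5               # uses required before any suggestion
ADD_THRESHOLD = 0.70          # frequency needed to suggest adding a pattern
AUTO_APPLY_CONFIDENCE = 0.85  # auto-apply eligible (still needs confirmation)


def classify(pattern_count: int, total_uses: int) -> str:
    """Map a pattern's usage counts to one of the three tiers."""
    if total_uses < MIN_SAMPLES:
        return "track_only"  # too few samples to trust the ratio
    confidence = pattern_count / total_uses
    if confidence >= AUTO_APPLY_CONFIDENCE:
        return "auto_apply_eligible"
    if confidence >= ADD_THRESHOLD:
        return "suggest"
    return "track_only"


print(classify(6, 8))   # suggest (75%, enough samples)
print(classify(2, 3))   # track_only (below 5 minimum samples)
```

Checking the sample count first is what separates the Correct case from the Incorrect one: 2/3 never reaches the ratio comparison.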

Suggestion States

Suggestions progress through: pending → applied | rejected

Applied: Pattern added to skill template, version bumped
Rejected: Marked in registry, never re-suggested for this skill

Pattern Detection Heuristics — HIGH

Edit Pattern Detection Heuristics

The system tracks these common edit patterns users apply after skill output:

Pattern	Description	Detection Regex
`add_pagination`	User adds pagination to API responses	`limit.*offset`, `cursor.*pagination`
`add_rate_limiting`	User adds rate limiting	`rate.?limit`, `throttl`
`add_error_handling`	User adds try/catch blocks	`try.*catch`, `except`
`add_types`	User adds TypeScript/Python types	`interface\s`, `Optional`
`add_validation`	User adds input validation	`validate`, `Pydantic`, `Zod`
`add_logging`	User adds logging/observability	`logger\.`, `console\.log`
`remove_comments`	User removes generated comments	Pattern removal detection
`add_auth_check`	User adds authentication checks	`@auth`, `@require_auth`

Incorrect:

# Generic pattern — matches too broadly
{"pattern": "add_.*", "regex": ".*"}  # Matches everything, useless signal

Correct:

# Specific pattern with focused regex
{"pattern": "add_pagination", "regex": r"limit.*offset|cursor.*pagination"}

How Detection Works

The PostTool Edit Tracker hook monitors file edits after skill invocations. When a user edits skill output, the edit is classified against the patterns above using regex matching. Results are appended to .claude/feedback/edit-patterns.jsonl.
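
As a rough illustration of the classification step, the sketch below matches edit content against a subset of the regexes from this rule and builds one JSONL line. The record fields (`skill_id`, `patterns`) are assumptions for illustration; the actual schema of edit-patterns.jsonl is not shown in this document.

```python
from __future__ import annotations

import json
import re

# Subset of the detectors listed in this rule (case-sensitive, as written).
PATTERN_DETECTORS = {
    "add_pagination": r"limit.*offset|cursor.*pagination",
    "add_rate_limiting": r"rate.?limit|throttl",
    "add_error_handling": r"try.*catch|except",
    "add_logging": r"logger\.|console\.log",
}


def detect_patterns(content: str) -> list[str]:
    """Return the names of all patterns whose regex matches the edit."""
    return [name for name, regex in PATTERN_DETECTORS.items()
            if re.search(regex, content)]


def to_jsonl_line(skill_id: str, content: str) -> str:
    """Serialize one detection event (hypothetical field names)."""
    return json.dumps({"skill_id": skill_id,
                       "patterns": detect_patterns(content)})


print(to_jsonl_line("api-design-framework",
                    "query = query.limit(10).offset(20)"))
```

An edit can match several patterns at once, which is why the function returns a list instead of the first hit.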

References (5)

Evolution Analysis

Evolution Analysis Methodology

Reference guide for understanding how the skill evolution system analyzes patterns and generates suggestions.

Pattern Detection Algorithm

1. Data Collection (PostTool Hook)

When a Write or Edit tool is used after a skill was recently loaded:

IF skill_loaded_within(5_minutes) AND tool IN (Write, Edit):
    content = get_edit_content()
    patterns = detect_patterns(content)
    IF patterns.length > 0:
        log_to_edit_patterns_jsonl(skill_id, patterns)

2. Pattern Matching

The system uses regex patterns to categorize edits:

# Bash 4+ associative array; declare -A is required for string keys
declare -A PATTERN_DETECTORS=(
    ["add_pagination"]="limit.*offset|page.*size|cursor.*pagination|Paginated"
    ["add_rate_limiting"]="rate.?limit|throttl|RateLimiter|requests.?per"
    ["add_caching"]="@cache|cache_key|TTL|redis|memcache|@cached"
    ["add_retry_logic"]="retry|backoff|max_attempts|tenacity|Retry"
    ["add_error_handling"]="try.*catch|except|raise.*Exception|throw.*Error"
    ["add_validation"]="validate|Validator|@validate|Pydantic|Zod|yup"
    ["add_logging"]="logger\.|logging\.|console\.log|winston|pino"
    ["add_types"]=": *(str|int|bool|List|Dict|Optional)|interface\s|type\s.*="
    ["add_auth_check"]="@auth|@require_auth|isAuthenticated|requiresAuth"
    ["add_test_case"]="def test_|it\(|describe\(|expect\(|@pytest"
)

3. Frequency Calculation

For each skill with sufficient usage:

frequency = pattern_count / total_skill_uses

4. Confidence Scoring

Confidence combines frequency with sample size:

confidence = frequency × min(samples / 20, 1.0)

This means:

100% frequency with 5 samples = 0.25 confidence (needs more data)
100% frequency with 20+ samples = 1.0 confidence (high certainty)
70% frequency with 15 samples = 0.53 confidence (moderate)
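
The formula and the three worked values above can be checked with a few lines. This sketch assumes "samples" means the number of recorded skill uses and treats 20 as the saturation point, as in the formula; the function and parameter names are illustrative.

```python
def confidence(frequency: float, samples: int, full_at: int = 20) -> float:
    """Scale frequency by sample size, saturating at full_at samples."""
    if samples <= 0:
        return 0.0
    return frequency * min(samples / full_at, 1.0)


# The three cases listed above.
for freq, n in [(1.0, 5), (1.0, 20), (0.70, 15)]:
    print(f"frequency={freq:.2f} samples={n} -> {confidence(freq, n):.3f}")
```

The sample-size factor is what keeps a perfect frequency over five uses from outranking a 70% frequency observed over fifteen.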

Suggestion Thresholds

Metric	Threshold	Purpose
MIN_SAMPLES	5	Prevent premature suggestions
ADD_THRESHOLD	0.70	70%+ users add = suggest adding
REMOVE_THRESHOLD	0.70	70%+ users remove = suggest removing
AUTO_APPLY_CONFIDENCE	0.85	Auto-apply if very high confidence

Suggestion Types

Add Suggestions

Generated when users frequently add similar content:

{
  "type": "add",
  "target": "template",
  "pattern": "add_pagination",
  "reason": "85% of users add pagination after using this skill"
}

Remove Suggestions

Generated when users frequently remove generated content:

{
  "type": "remove",
  "target": "template",
  "pattern": "remove_comments",
  "reason": "72% of users remove docstrings from generated code"
}

Analysis Best Practices

Wait for sufficient data: Don't act on suggestions until MIN_SAMPLES reached
Review high-confidence first: Focus on suggestions with confidence > 0.80
Consider context: A pattern may be added for specific use cases only
Monitor after changes: Track success rate changes after evolution

Interpreting Results

High-Value Improvements

Frequency > 80%, Confidence > 0.70
Pattern is universally applicable
Easy to add to skill template

Conditional Improvements

Frequency 50-80%
May be context-dependent
Consider adding as optional reference

Skip/Investigate

Frequency < 50%
Might be edge case or user preference
Review individual edit patterns for context

Evolution Commands

Evolution Subcommand Reference

Detailed implementation and sample output for each subcommand.

Subcommand: Report (Default)

Usage: /ork:skill-evolution

Shows evolution report for all tracked skills.

Implementation

# Run the evolution engine report
"${CLAUDE_PROJECT_DIR}/.claude/scripts/evolution-engine.sh" report

Sample Output

Skill Evolution Report
══════════════════════════════════════════════════════════════

Skills Summary:
┌────────────────────────────┬─────────┬─────────┬───────────┬────────────┐
│ Skill                      │ Uses    │ Success │ Avg Edits │ Suggestions│
├────────────────────────────┼─────────┼─────────┼───────────┼────────────┤
│ api-design-framework       │     156 │     94% │       1.8 │          2 │
│ database-schema-designer   │      89 │     91% │       2.1 │          1 │
│ fastapi-patterns           │      67 │     88% │       2.4 │          3 │
└────────────────────────────┴─────────┴─────────┴───────────┴────────────┘

Summary:
  Skills tracked: 3
  Total uses: 312
  Overall success rate: 91%

Top Pending Suggestions:
1. 93% | api-design-framework | add add_pagination
2. 88% | api-design-framework | add add_rate_limiting
3. 85% | fastapi-patterns | add add_error_handling

Subcommand: Analyze

Usage: /ork:skill-evolution analyze <skill-id>

Analyzes edit patterns for a specific skill.

Implementation

# Run analysis for specific skill
"${CLAUDE_PROJECT_DIR}/.claude/scripts/evolution-engine.sh" analyze "$SKILL_ID"

Sample Output

Skill Analysis: api-design-framework
────────────────────────────────────
Uses: 156 | Success: 94% | Avg Edits: 1.8

Edit Patterns Detected:
┌──────────────────────────┬─────────┬──────────┬────────────┐
│ Pattern                  │ Freq    │ Samples  │ Confidence │
├──────────────────────────┼─────────┼──────────┼────────────┤
│ add_pagination           │    85%  │ 132/156  │       0.93 │
│ add_rate_limiting        │    72%  │ 112/156  │       0.88 │
│ add_error_handling       │    45%  │  70/156  │       0.56 │
└──────────────────────────┴─────────┴──────────┴────────────┘

Pending Suggestions:
1. 93% conf: ADD add_pagination to template
2. 88% conf: ADD add_rate_limiting to template

Run `/ork:skill-evolution evolve api-design-framework` to review

Subcommand: Evolve

Usage: /ork:skill-evolution evolve <skill-id>

Interactive review and application of improvement suggestions.

Implementation

Get Suggestions:

SUGGESTIONS=$("${CLAUDE_PROJECT_DIR}/.claude/scripts/evolution-engine.sh" suggest "$SKILL_ID")

For Each Suggestion, Present Interactive Options:

Use AskUserQuestion to let the user decide on each suggestion:

{
  "questions": [{
    "question": "Apply suggestion: ADD add_pagination to template? (93% confidence, 132/156 users add this)",
    "header": "Evolution",
    "options": [
      {"label": "Apply", "description": "Add this pattern to the skill template"},
      {"label": "Skip", "description": "Skip for now, ask again later"},
      {"label": "Reject", "description": "Never suggest this again"}
    ],
    "multiSelect": false
  }]
}

On Apply:
- Create version snapshot first
- Apply the suggestion to skill files
- Update evolution registry
On Reject:
- Mark suggestion as rejected in registry
- Will not be suggested again

Applying Suggestions

When a user accepts a suggestion, the implementation depends on the suggestion type:

For add suggestions to templates:

Add the pattern to the skill's template files
Update SKILL.md with new guidance

For add suggestions to references:

Create new reference file in references/ directory

For remove suggestions:

Remove the identified content
Archive in version snapshot first

Subcommand: History

Usage: /ork:skill-evolution history <skill-id>

Shows version history with performance metrics.

Implementation

# Run version manager list
"${CLAUDE_PROJECT_DIR}/.claude/scripts/version-manager.sh" list "$SKILL_ID"

Sample Output

Version History: api-design-framework
══════════════════════════════════════════════════════════════

Current Version: 1.2.0

┌─────────┬────────────┬─────────┬───────┬───────────┬────────────────────────────┐
│ Version │ Date       │ Success │ Uses  │ Avg Edits │ Changelog                  │
├─────────┼────────────┼─────────┼───────┼───────────┼────────────────────────────┤
│ 1.2.0   │ 2026-01-14 │    94%  │   156 │       1.8 │ Added pagination pattern   │
│ 1.1.0   │ 2026-01-05 │    89%  │    80 │       2.3 │ Added error handling ref   │
│ 1.0.0   │ 2025-11-01 │    78%  │    45 │       3.2 │ Initial release            │
└─────────┴────────────┴─────────┴───────┴───────────┴────────────────────────────┘

Subcommand: Rollback

Usage: /ork:skill-evolution rollback <skill-id> <version>

Restores a skill to a previous version.

Implementation

Confirm with User:

Use AskUserQuestion for confirmation:

{
  "questions": [{
    "question": "Rollback api-design-framework from 1.2.0 to 1.0.0? Current version will be backed up.",
    "header": "Rollback",
    "options": [
      {"label": "Confirm Rollback", "description": "Restore version 1.0.0"},
      {"label": "Cancel", "description": "Keep current version"}
    ],
    "multiSelect": false
  }]
}

On Confirm:

"${CLAUDE_PROJECT_DIR}/.claude/scripts/version-manager.sh" restore "$SKILL_ID" "$VERSION"

Report Result:

Restored api-design-framework to version 1.0.0
Previous version backed up to: versions/.backup-1.2.0-1736867234

Holdout Promotion Gate

Holdout-Promotion Gate (Champion / Challenger)

Promote a skill/agent edit ONLY if a challenger beats the champion on a fresh, sealed holdout eval set by a configurable margin. Every decision — promoted or rejected, with both scores — is persisted for audit. This is the objective gate the evolve subcommand was missing: an AskUserQuestion "Apply?" is not evidence the new version is better.

This is the native mechanism behind the recorded goal auto-evaluate all skills + subagents to standard: a skill graduates by winning a holdout bake-off, not by a human eyeballing a diff.

 CHAMPION (current SKILL.md)        CHALLENGER (candidate edit)
        |                                   |
        +------------------+----------------+
                           v
            SEALED HOLDOUT EVAL SET  (N cases, hash-locked)
                           |  bare-eval forked graders (isolated context)
                           v
        champion_score            challenger_score
                           |
                           v
         challenger - champion >= margin ?
                +----------+----------+
              YES                    NO
          PROMOTE                 REJECT
   (snapshot + apply)       (discard challenger; ledger: rejected;
   (ledger: promoted)        SKILL.md byte-identical)

Vocabulary

Term	Meaning
champion	The version currently on disk (`SKILL.md` HEAD). The incumbent.
challenger	The candidate edit `evolve` produced, applied to a scratch copy.
holdout	A sealed eval set the challenger was NOT derived from. Frozen by content hash so it can't be silently tuned to.
margin	Minimum `challenger - champion` to promote. Default `0.5` on the 0–10 rubric. Per-skill configurable.
bake-off	One champion-vs-challenger run over the full holdout, both graded by identical forked graders.

The sealed holdout set

The holdout lives beside the skill. Paths are flat (src/skills/<skill>/) and resolve at runtime via ${CLAUDE_SKILL_DIR}/evals/:

src/skills/<skill>/evals/holdout.jsonl      # sealed cases (append-only)
src/skills/<skill>/evals/holdout.lock.json  # {hash, n, frozen_at, rubric, min_pass, margin}

One eval case per line in holdout.jsonl:

{"id":"h-001","prompt":"...task the skill must handle...","must":["assertions present in output"],"difficulty":"medium"}

holdout.lock.json pins it:

{
  "schema": "ork-holdout/1.0",
  "skill": "assess",
  "hash": "sha256:1f3a...",
  "n": 12,
  "frozen_at": "2026-06-20T00:00:00Z",
  "rubric": "src/skills/assess/rubric.json",
  "min_pass": 7.0,
  "margin": 0.5
}

rubric points at the per-skill rubric.json (the file with the actual weight / min_pass / min_blocker values) — NOT shared/rubric.schema.json, which only defines structure and carries no thresholds to enforce against.

Sealing rules (the anti-tuning contract):

The challenger generator (evolve / evolution-engine.sh) MUST NOT read evals/holdout.jsonl. Its input allowlist is .claude/feedback/edit-patterns.jsonl + evolution-registry.json only — the train signal, never the holdout.
A bake-off recomputes LC_ALL=C sha256(sort) of the cases and aborts if it differs from holdout.lock.json.hash. (LC_ALL=C so the sort — and therefore the hash — is locale-independent across machines/CI.) A mid-flight edit invalidates the run, fail-closed.
Growing the holdout is a separate, reviewed commit that re-freezes holdout.lock.json AND resets stored scores for that skill (old scores were graded against a different set; they are not comparable). No special CI label in v1 — a re-freeze is just a normal reviewed diff.
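
One way to picture the seal check is below. The canonicalization (byte-wise sort of lines, joined with newlines) is an assumption meant to mirror `LC_ALL=C sort | sha256`; the real script may canonicalize differently, and both function names are hypothetical.

```python
import hashlib


def holdout_hash(jsonl_bytes: bytes) -> str:
    """Hash the holdout cases independent of line order and locale.

    Sorting raw bytes matches what LC_ALL=C sort does: plain byte-value
    comparison with no locale collation rules.
    """
    lines = sorted(jsonl_bytes.splitlines())
    canonical = b"\n".join(lines) + b"\n"
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


def verify_seal(jsonl_bytes: bytes, lock: dict) -> None:
    """Fail closed when the holdout no longer matches the lock file."""
    if holdout_hash(jsonl_bytes) != lock["hash"]:
        raise RuntimeError("holdout hash mismatch")


cases = b'{"id":"h-002"}\n{"id":"h-001"}\n'
lock = {"hash": holdout_hash(cases)}
verify_seal(cases, lock)  # passes silently while the set is untouched
```

Because the lines are sorted before hashing, reordering cases does not break the seal, while adding, dropping, or editing a case does.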

Build the holdout with golden-dataset curation rules (difficulty balance, ≥2 domain tags, canonical inputs). The holdout is a golden-dataset slice that is never shown to the challenger generator.

Grading: bare-eval, forked, isolated

Both versions are graded by the SAME grader over the SAME holdout via bare-eval:

# illustrative — from src/skills/bare-eval; forked graders for cross-case determinism
export CLAUDE_CODE_FORK_SUBAGENT=1      # fresh context per case (CC 2.1.121)
claude -p "$grade_prompt" --bare --print --max-turns 1 \
  --json-schema "$skill_rubric_json" --output-format json

Dependency / cost (declare it): --bare requires ANTHROPIC_API_KEY (OAuth/keychain is disabled in bare mode) and bills tokens directly (not subscription). A bake-off is 2 * N grader calls. So this runs on-demand (promote) or in CI only — never in a background hook ($0 idle, matching the no-paid-background-LLM rule).

Identical grader + identical ork-rubric/1.0 rubric + identical holdout → the only variable is the SKILL.md version. That isolation is the whole point.
Forked subagents stop case N's state leaking into case N+1, so champion and challenger see byte-identical conditions.
Score = weighted composite over the rubric dimensions, averaged over N cases. A challenger below min_blocker on ANY dimension is auto-rejected regardless of margin (a one-axis regression can't be bought with gains elsewhere).
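
A minimal sketch of the scoring described above follows. The data shapes are assumptions: each case is a dict of dimension scores, and weights and `min_blocker` values are plain dicts keyed by dimension. Checking `min_blocker` against the per-dimension mean across cases is also an assumption; the rubric file may define it per case.

```python
from __future__ import annotations


def composite(case_scores: list[dict[str, float]],
              weights: dict[str, float]) -> float:
    """Weighted composite per case, averaged over all N cases."""
    total_weight = sum(weights.values())
    per_case = [
        sum(scores[dim] * w for dim, w in weights.items()) / total_weight
        for scores in case_scores
    ]
    return sum(per_case) / len(per_case)


def blocked_dimensions(case_scores: list[dict[str, float]],
                       min_blocker: dict[str, float]) -> list[str]:
    """Dimensions whose mean score falls below their min_blocker."""
    n = len(case_scores)
    return [
        dim for dim in sorted(min_blocker)
        if sum(s[dim] for s in case_scores) / n < min_blocker[dim]
    ]


cases = [{"correctness": 8.0, "clarity": 5.0},
         {"correctness": 6.0, "clarity": 7.0}]
print(composite(cases, {"correctness": 2.0, "clarity": 1.0}))
print(blocked_dimensions(cases, {"correctness": 7.5, "clarity": 4.0}))
```

Keeping the blocker check separate from the composite is what prevents a gain on one dimension from hiding a regression on another.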

Decision rule

promote IFF:
  challenger_composite - champion_composite >= margin
  AND challenger_composite >= holdout.min_pass
  AND no challenger dimension < its min_blocker
  AND holdout hash matches lock
ELSE reject

Ties, and any delta below the margin, reject: the incumbent wins. New work must clear the bar, not merely match it; this biases toward stability and stops score noise from churning versions.
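
The Decision Rule translates almost line for line into code. In this sketch the function name, argument names, and reason strings are illustrative; the hash check is included because the rule lists it, although the run loop described in this reference verifies the seal before any grading happens.

```python
from __future__ import annotations


def decide(champion: float, challenger: float, margin: float,
           min_pass: float, blocked: list[str],
           hash_ok: bool) -> tuple[str, str]:
    """Return (decision, reason); every failing clause rejects."""
    if not hash_ok:
        return "rejected", "holdout hash mismatch"
    if blocked:
        return "rejected", "dimension under blocker: " + ", ".join(blocked)
    if challenger < min_pass:
        return "rejected", f"composite {challenger} below min_pass {min_pass}"
    delta = challenger - champion
    if delta < margin:
        return "rejected", f"delta {delta:.2f} below margin {margin}"
    return "promoted", f"delta {delta:.2f} >= margin {margin}"


print(decide(7.4, 8.1, 0.5, 7.0, [], True))   # promoted
print(decide(7.4, 7.4, 0.5, 7.0, [], True))   # rejected: a tie loses
```

Returning the failing clause as the reason gives the ledger record something concrete to store on a rejection.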

Precedence: LOCK `min_pass` vs RUBRIC `min_blocker`

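Three thresholds appear across the lock and the rubric: two named min_pass and one named min_blocker. They are not in conflict. Each governs a different axis, so the lock's min_pass and the rubric's min_blocker both apply to promotion and neither overrides the other: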

Source	Governs	Effect
LOCK `holdout.lock.json.min_pass` (`ork-holdout/1.0`)	the composite (whole-bake-off eligibility)	challenger's weighted composite must be `>= min_pass` to be promotable — the bake-off's pass bar
RUBRIC per-dimension `min_blocker` (`ork-rubric/1.0`)	each single dimension	any one dimension below its `min_blocker` is a hard blocker, regardless of composite
RUBRIC `composite.min_pass` (`ork-rubric/1.0`)	the rubric's own composite floor	the skill's standalone grading gate (e.g. assess's 5.5 implement gate); the bake-off uses the stricter LOCK value as its composite bar

So the LOCK's min_pass is the authoritative composite gate for promotion (it's holdout-specific and can be set stricter than the rubric's general-purpose floor — 7.0 vs 5.5 for assess), while min_blocker is the per-dimension floor that a high composite can never buy back. The rubric's own composite.min_pass is the skill's day-to-day grading gate and is not consulted by the Decision Rule. (The schema's min_blocker <= min_pass invariant is per-dimension and unrelated to the LOCK composite.)

The promotion ledger

Every bake-off appends one immutable record (both outcomes) to src/skills/<skill>/evals/promotion-ledger.jsonl (ork-promotion/1.0):

{
  "schema": "ork-promotion/1.0",
  "skill": "assess",
  "ts": "2026-06-20T14:03:11Z",
  "holdout_hash": "sha256:1f3a...",
  "holdout_n": 12,
  "margin": 0.5,
  "champion_version": "2.3.0",
  "challenger_source": "edit-pattern:add-pagination-assert (conf 0.82, 14 samples)",
  "champion_score": 7.4,
  "challenger_score": 8.1,
  "delta": 0.7,
  "per_dimension": { "correctness": {"champion": 7.0, "challenger": 8.2} },
  "decision": "promoted",
  "reason": "delta 0.7 >= margin 0.5; min_pass 7.0 met; no dimension under blocker",
  "promoted_to_version": "2.4.0",
  "grader_model": "claude-opus-4-8[1m]",
  "fork_subagent": true
}

On rejected, promoted_to_version is null and reason names the failing clause. grep '"decision":"rejected"' shows every change tried and refused, with numbers — that is the auditability deliverable.
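
A sketch of append-only ledger writes and the rejected-records query follows. It assumes records are serialized compactly (no space after the colon) so that a plain grep for `"decision":"rejected"` matches; the helper names are hypothetical and the demo uses a temporary directory instead of a real skill path.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def append_record(ledger: Path, record: dict) -> None:
    """Append exactly one compact JSON line; never rewrite old lines."""
    line = json.dumps(record, separators=(",", ":"))
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def rejected(ledger: Path) -> list[dict]:
    """Return every refused bake-off, oldest first."""
    with ledger.open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return [r for r in records if r.get("decision") == "rejected"]


ledger = Path(tempfile.mkdtemp()) / "promotion-ledger.jsonl"
append_record(ledger, {"skill": "assess", "decision": "promoted",
                       "promoted_to_version": "2.4.0"})
append_record(ledger, {"skill": "assess", "decision": "rejected",
                       "promoted_to_version": None})
print(len(rejected(ledger)))  # 1
```

Opening the file in append mode for every write keeps earlier records untouched, which is the property the audit trail depends on.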

Run loop — `/ork:skill-evolution promote <skill-id>`

1. LOAD    champion = SKILL.md HEAD; suggestion = top pending from evolution-registry.json
2. SEAL    recompute LC_ALL=C sha256(sort holdout.jsonl); abort if != holdout.lock.json.hash
3. BUILD   write challenger to a scratch copy; apply the suggestion.
           generator never reads holdout.jsonl (allowlist enforced)
4. GRADE   per case, forked bare-eval grader scores champion AND challenger → composites
5. DECIDE  apply the Decision Rule
6. RECORD  append one promotion-ledger.jsonl record (ALWAYS, both outcomes)
7. ACT     promoted -> snapshot champion into versions/<v>/, copy challenger over SKILL.md,
           bump frontmatter version, update versions/manifest.json successRate.
           rejected -> rm scratch; nothing on disk changes.
8. REPORT  print delta table + decision + the ledger line

`/goal` and CI wiring

As a /goal boolean (promotion is falsifiable, so it composes via ork:prd-to-goal):

/goal until jq -se '.[-1].decision=="promoted"' src/skills/assess/evals/promotion-ledger.jsonl abort-if no_progress_for_3_turns
# run /ork:skill-evolution promote assess each turn; abort if SKILL.md never changes (challenger never won)

As a CI gate (wired; assess is the first instance): run-skill-eval.sh --holdout-promote <skill> runs the bake-off headless and exits non-zero on rejected, so a PR editing a skill can't merge unless its challenger beat the champion. The sealed-hash check runs first and fails with "holdout hash mismatch" (exit 3) when the holdout was edited without a re-freeze. Today only assess is seeded (src/skills/assess/evals/, a starter holdout whose expected labels still need human review before real promotions are trusted); other skills fail closed (exit 2, "holdout eval set not found") until seeded. Run it on-demand or in CI; --dry-run validates the seal + rubric with zero grader spend.

Agents too: the same loop applies to src/agents/<name>.md — holdout cases are tasks, the grader scores the agent's transcript. bare-eval honors agent tools:/permissionMode under --print (CC 2.1.119), so the agent is graded with its real tool surface.

Anti-gaming guardrail

The one way it gets gamed: the challenger generator peeks at the holdout and tunes the edit to those exact cases (train-on-test), or an operator quietly edits holdout.jsonl to drop the cases the challenger fails.

The checks that block it:

Hash seal (step 2 + CI), computed LC_ALL=C so it's deterministic. Any holdout change without a reviewed re-freeze → holdout hash mismatch, fail-closed, no promotion.
Generator isolation — evolution-engine.sh's input allowlist must exclude evals/holdout.jsonl. Enforced by tests/unit/test-evolution-engine.sh (two assertions: the engine source statically references no holdout path, and a planted holdout canary token never leaks into a generated challenger) — the engine script is the real surface, not a generic markdown grep.
Tie-loses + per-dimension blocker — even a higher composite can't promote if it tanks any dimension below min_blocker, so you can't trade a correctness regression for a verbosity win to clear the margin.

Anti-patterns

Anti-pattern	Why it's wrong	Do instead
Re-use the train set as holdout	Train-on-test; every challenger "wins"	Hold cases out; freeze with `holdout.lock.json`
Promote on a tie (delta 0)	Score noise churns versions endlessly	Incumbent wins ties; require `>= margin`
Different grader/holdout per side	The version isn't the only variable; result is meaningless	Same grader, rubric, sealed set, forked context
Average composite hides a one-axis regression	Ship a more-correct-but-insecure version	Per-dimension `min_blocker` auto-rejects
Edit `holdout.jsonl` to drop failing cases	Silent goal-shifting; numbers stop being comparable	Hash seal + reviewed re-freeze + score reset
Run the bake-off in a background hook	Bills `ANTHROPIC_API_KEY` per session per dev	On-demand `promote` / CI only; $0 idle
Skip the ledger on rejection	Lose the audit trail of what was refused	Append a record for BOTH outcomes, always
Bump version without a winning bake-off	The "to standard" claim is unproven	`promoted_to_version` is set only by the Decision Rule

Acceptance (self-check)

LC_ALL=C hash of holdout.jsonl equals holdout.lock.json.hash before any bake-off; a tampered holdout aborts with no ledger append.
A challenger scoring < champion + margin → ledger decision:"rejected" AND git diff --quiet src/skills/<skill>/SKILL.md (unchanged).
A winning challenger → decision:"promoted", a new versions/<v>/ snapshot, a bumped frontmatter version.
Every bake-off appends exactly one promotion-ledger.jsonl line (both outcomes).

Storage Patterns

Storage Patterns: Rolling Logbook vs Index-Per-Entry

When a skill (or a project's .claude/rules/*.md files) needs to accumulate state across sessions — decisions, observations, patterns, knowledge — there are two storage patterns. Pick wrong and Claude Code's 40k-char auto-load threshold ambushes you 3-6 months later.

Critical: .claude/rules/** is auto-loaded RECURSIVELY. CC globs the rules directory recursively — every *.md at any depth (.claude/rules/decisions/2026-04-15-postgres.md included) loads into every <system-reminder>. Splitting a rolling file into per-entry files inside .claude/rules/ therefore multiplies the loaded surface instead of cutting it. Per-entry files must live outside .claude/rules/ (e.g. docs/decisions/); keep only a slim index in .claude/rules/. This is the #1 mistake — see #2589.

TL;DR

Need	Use
Append-only, < 30k chars total expected lifetime	Rolling logbook
Append-forever, no natural upper bound	Index-per-entry
Reads are chronological narrative	Rolling logbook
Reads are by-key or by-date lookup	Index-per-entry
Each entry independently meaningful	Index-per-entry

If unsure, default to index-per-entry — bounded by construction (with entries stored outside .claude/rules/).

Pattern 1 — Rolling Logbook

Single Markdown file appended forever.

.claude/rules/recent-decisions.md
  # Recent Project Decisions
  ## 2026-04-15 — Use Postgres not MongoDB
  ## 2026-04-22 — Brainstorm: dual-write
  ## ... (one entry every week, no upper bound)

Strengths

Trivial to write: >> file.md and you're done
Operator-readable as chronological narrative
One file means one place to check in / out of cache
Diffs cleanly in PRs (additions only)

Weaknesses

Grows unbounded. No mechanism stops you at any size.
CC auto-loads everything under .claude/rules/** into every <system-reminder>. At 40,000 chars CC emits a yellow warning, but by then you've burned that context on every prompt for weeks.
Stale entries pollute the loaded context (Q1 2024 decisions for a Q4 2026 session — irrelevant but billed).
Bulk search/replace is fragile (one wrong-line edit corrupts a 30k-char file).

Concrete failure case: yonatan-hq/platform/.claude/rules/recent-decisions.md ballooned to 53.8k chars in seven months. CC flagged it as "Large file will impact performance" — every session paid ~14k extra context tokens before the operator noticed.

Pattern 2 — Index-Per-Entry

One small index file inside .claude/rules/ (one bullet per entry); individual entries live outside the auto-load zone (e.g. docs/decisions/) and are loaded on-demand via Read.

.claude/rules/                            ←  auto-loaded RECURSIVELY by CC
└── decisions-index.md                    ←  ≤ 200 lines, the ONLY decisions file that loads
    - [Use Postgres not MongoDB](../../docs/decisions/2026-04-15-postgres.md) — chose Postgres for jsonb
    - [Dual-write analytics](../../docs/decisions/2026-04-22-dual-write.md) — HTTP sink alongside JSONL
    ...

docs/decisions/                           ←  OUTSIDE .claude/rules/ → never auto-loaded
├── 2026-04-15-postgres.md                ←  loaded only when relevant via Read
├── 2026-04-22-dual-write.md
└── ...

Do NOT put the per-entry files under .claude/rules/decisions/. That directory is inside the recursive auto-load zone, so all N entries would load every session — the exact bloat this pattern exists to prevent. Keep entries in docs/decisions/ (the same not-auto-loaded zone as docs/archives/).

Strengths

Bounded. The index grows by one line per entry, so even 500 entries is only ~500 short lines, and it's the only thing that loads.
On-demand load. Operator (or Claude) reads the specific entry that matters; the other 499 stay on disk, out of context.
Entries live outside the auto-load zone (docs/decisions/), so CC never globs them into a <system-reminder>.
Each entry is independently meaningful, addressable, and editable.
Old entries don't pollute current context.

Weaknesses

Two-step writes: append to the index + write a new file under docs/decisions/. That costs about five extra seconds per entry; a Bash function or skill can hide it.
Two-step reads: scan index, then Read the relevant file. Costs one extra tool call.
More PR-diff noise (one new file per entry vs append to one).
Filenames must be unique and well-chosen — bad naming kills the on-demand pattern.

Migration Path

When a rolling logbook crosses the 30k-char mark, migrate proactively. The entries move out of .claude/rules/; only a slim index stays behind.

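1. Create the out-of-zone entry directory: mkdir -p docs/decisions

2. Split entries one-per-file into it. GNU csplit can split on ## headers; -f and -b name the pieces entry-000.md, entry-001.md, and so on, and -z drops an empty leading piece:

   cd docs/decisions && csplit -z -k -f entry- -b '%03d.md' ../../.claude/rules/recent-decisions.md '/^## /' '{*}'

   The first piece holds whatever precedes the first ## header (usually just the file title); delete it, then rename the rest to descriptive names such as 2026-04-15-postgres.md.

3. Create .claude/rules/decisions-index.md with one bullet per file, each linking to ../../docs/decisions/<file>.md.
4. Remove the old rolling file from .claude/rules/ (archive it: git mv .claude/rules/recent-decisions.md docs/decisions/_legacy-rolling.md).
5. Update any skill/hook references to point at the index.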

docs/decisions/_legacy-rolling.md stays accessible via Read but won't auto-load (it's outside .claude/rules/).
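
The index-creation step can be scripted. The sketch below is one possible approach under stated assumptions: the function name build_decisions_index is hypothetical, each entry's first Markdown heading is used as its title, and files whose names start with an underscore (such as _legacy-rolling.md) are treated as archives and skipped. It does not generate the short per-entry summary shown in the index example, which still needs a human or a model to write.

```shell
# Hypothetical helper: print one index bullet per entry file.
# Usage: build_decisions_index <entries_dir> <link_prefix> > .claude/rules/decisions-index.md
build_decisions_index() {
    local entries_dir="$1" link_prefix="$2" file name title
    for file in "$entries_dir"/*.md; do
        [ -e "$file" ] || continue          # the glob matched nothing
        name=$(basename "$file")
        case "$name" in
            _*) continue ;;                 # assumed convention: _-prefixed files are archives
        esac
        # The first heading line, with its leading #'s removed, becomes the link text.
        title=$(sed -n 's/^##* *//p' "$file" | head -n 1)
        printf '%s\n' "- [${title:-$name}]($link_prefix/$name)"
    done
}
```

For example, build_decisions_index docs/decisions ../../docs/decisions run from the repository root emits links in the same relative form the index uses.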

Anti-pattern (causes #2589): git mv recent-decisions.md decisions/ while staying inside .claude/rules/. That keeps every split file in the recursive auto-load zone — yonatan-hq/platform did this and grew to 95 files / 656 KB + a 132 KB index ≈ 200k tokens/session before the relocation to docs/decisions/ (platform#5468) fixed it.

Pattern selection by skill

Skill	Pattern	Why
`memory`	index-per-entry	Per-fact files keyed by topic; index in MEMORY.md, entries outside `.claude/rules/`
`remember`	index-per-entry	Same as memory — entries grow forever, lookups are by-key
`recent-decisions` (rules-level)	index-per-entry: slim index in `.claude/rules/` + entries in `docs/decisions/`	Entries MUST sit outside the recursive auto-load zone (#2589)
`goal-history.jsonl`	rolling (JSONL)	Not auto-loaded by CC; consumed by monitor on-demand. Different mechanism.
Skill-internal `references/*.md`	one file per concept	Loaded explicitly via Read in SKILL.md, not auto-globbed

Detection

The lifecycle/rules-size-check hook (#1815, hardened in #2589) runs on SessionStart and emits an operator-facing stderr warning on two independent signals:

Per-file: any single auto-loaded file ≥ 35k chars (WARN) / 38k (CRITICAL), just under CC's 40k cliff. The scan is recursive, so a large file nested under .claude/rules/decisions/ is caught too, not just top-level files.
Aggregate: the SUM of every *.md under .claude/rules/** ≥ 50k chars (WARN) / 64k (CRITICAL, matching yonatan-hq/platform#5468's budget). This catches the #2589 failure mode that per-file checks miss: 95 per-entry files of ~7k chars each individually pass the per-file gate, but their aggregate (~656 KB, roughly 200k tokens) loads into every prompt.

If the aggregate warning ever fires, the fix is exactly this doc's prescription: move the per-entry files to docs/decisions/ and keep only the slim index in .claude/rules/.
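
To make the two signals concrete, here is a minimal sketch of such a check. It is not the hook's source: the function name rules_size_check and the message wording are assumptions, and only the thresholds are taken from the text above. File names containing newlines are not handled.

```shell
# Hypothetical sketch of the per-file and aggregate size signals.
# Usage: rules_size_check .claude/rules
rules_size_check() {
    local dir="$1" total
    # Signal 1: per-file size, scanned recursively so nested files count too.
    find "$dir" -type f -name '*.md' | while IFS= read -r file; do
        chars=$(( $(wc -m < "$file") ))
        if [ "$chars" -ge 38000 ]; then
            echo "CRITICAL per-file: $file ($chars chars)" >&2
        elif [ "$chars" -ge 35000 ]; then
            echo "WARN per-file: $file ($chars chars)" >&2
        fi
    done
    # Signal 2: aggregate size of every *.md under the directory.
    total=$(( $(find "$dir" -type f -name '*.md' -exec cat {} + | wc -m) ))
    if [ "$total" -ge 64000 ]; then
        echo "CRITICAL aggregate: $total chars" >&2
    elif [ "$total" -ge 50000 ]; then
        echo "WARN aggregate: $total chars" >&2
    fi
}
```

The aggregate is computed independently of the per-file loop, which mirrors why the two signals are described as independent: many small files can trip the second without ever tripping the first.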

When the rolling pattern is still right

Don't blanket-reject rolling logbooks. They're correct when:

The file has a natural upper bound (e.g., "this lists the 12 active milestones — milestones don't accumulate, they close")
Total size is known to stay under 20k chars even at 5× growth
Readability as a single narrative is the primary read mode

The 40k-char cliff isn't a hard rule — it's a heuristic for "auto-loaded into every prompt." If your file isn't under .claude/rules/** at all (e.g. it's in docs/), it isn't auto-loaded and the trade-off shifts.

Related

lifecycle/rules-size-check (hook, #1815) — pre-flight warning when .claude/rules/** approaches the cliff
src/skills/CONTRIBUTING-SKILLS.md#storage-patterns — short pointer to this reference
src/skills/memory/ — index-per-entry pattern done well (canonical reference implementation)
#2589 / yonatan-hq/platform#5468 — the recursive-autoload bug this guidance now warns against

Version Management

Version Management Guide

Reference guide for managing skill versions with safe rollback capability.

Version Structure

Each skill can have versioned snapshots stored in:

skills/<category>/<skill-name>/
├── SKILL.md                 # Current version (content + frontmatter metadata)
├── references/              # Current references
├── scripts/                 # Current templates
└── versions/
    ├── manifest.json        # Version history metadata
    ├── 1.0.0/
    │   ├── SKILL.md
    │   ├── references/
    │   └── CHANGELOG.md
    └── 1.1.0/
        ├── SKILL.md
        ├── references/
        └── CHANGELOG.md

Manifest Schema

The manifest.json tracks version history:

{
  "$schema": "../../../../../../.claude/schemas/skill-evolution.schema.json",
  "skillId": "api-design-framework",
  "currentVersion": "1.2.0",
  "versions": [
    {
      "version": "1.0.0",
      "date": "2025-11-01",
      "successRate": 0.78,
      "uses": 45,
      "avgEdits": 3.2,
      "changelog": "Initial release"
    },
    {
      "version": "1.1.0",
      "date": "2026-01-05",
      "successRate": 0.89,
      "uses": 80,
      "avgEdits": 1.8,
      "changelog": "Added pagination pattern (85% users added manually)"
    }
  ],
  "suggestions": [],
  "editPatterns": {},
  "lastAnalyzed": "2026-01-14T10:30:00Z"
}

Versioning Workflow

Creating a Version

Before making changes, create a version snapshot:

version-manager.sh create <skill-id> "Description of changes"

The system:
- Bumps version number (patch by default)
- Copies current files to versions/<new-version>/
- Records current metrics in manifest
- Creates CHANGELOG.md

Comparing Versions

Compare two versions to see what changed:

version-manager.sh diff <skill-id> 1.0.0 1.1.0

Shows:

File differences (unified diff)
Metrics comparison (success rate, uses, avg edits)

Restoring a Version

If a change causes problems, roll back:

version-manager.sh restore <skill-id> <version>

The system:

Backs up current version to .backup-<version>-<timestamp>
Copies snapshot files to skill root
Updates manifest with rollback entry
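
The first two steps can be pictured with the sketch below. It is an illustration only, not the contents of version-manager.sh: the function name restore_snapshot is hypothetical, only SKILL.md is handled, the timestamp format is an assumption, and the manifest update is left out.

```shell
# Hypothetical sketch: back up the live SKILL.md, then copy a snapshot over it.
# Usage: restore_snapshot <skill_dir> <version>   (prints the backup directory on success)
restore_snapshot() {
    local skill_dir="$1" version="$2"
    local snapshot="$skill_dir/versions/$version"
    local backup
    [ -d "$snapshot" ] || { echo "no snapshot for $version" >&2; return 1; }
    # Back up first, so a failed copy never leaves the skill without a recoverable state.
    backup="$skill_dir/.backup-$version-$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup" || return 1
    cp "$skill_dir/SKILL.md" "$backup/SKILL.md" || return 1
    cp "$snapshot/SKILL.md" "$skill_dir/SKILL.md" || return 1
    echo "$backup"
}
```

Ordering the backup before the copy is the design point: the restore itself becomes reversible.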

Automatic Safety Checks

Rollback Triggers

The system monitors for:

Trigger	Threshold	Action
Success rate drop	-20%	Warning + rollback suggestion
Avg edits increase	+50%	Warning (users fighting skill)
Consecutive failures	5+	Alert to review

Health Check Integration

The posttool hooks monitor skill health:

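check_skill_health() {
    local skill_id="$1"
    local current_rate baseline_rate
    # Assign separately from `local` so a failing helper's exit status is not masked.
    # The literal 10 is passed through to get_recent_success_rate unchanged.
    current_rate=$(get_recent_success_rate "$skill_id" 10) || return 1
    baseline_rate=$(get_version_baseline "$skill_id") || return 1

    # Warn when the recent rate sits more than 0.20 below the version baseline.
    if (( $(echo "$baseline_rate - $current_rate > 0.20" | bc -l) )); then
        echo "WARNING: $skill_id dropped from ${baseline_rate} to ${current_rate}"
    fi
}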

Best Practices

When to Create Versions

Before applying evolution suggestions
Before major skill modifications
After validating improvements work well
At regular intervals (weekly/monthly) for active skills

Version Naming

Use semantic versioning:

Major (2.0.0): Breaking changes to skill behavior
Minor (1.1.0): New features/patterns added
Patch (1.0.1): Bug fixes, minor improvements
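
A version bump following these rules might look like the sketch below. The function name bump_version is hypothetical and not necessarily how version-manager.sh does it; the default of patch follows the "patch by default" behaviour described under Creating a Version. It assumes a plain MAJOR.MINOR.PATCH string with no pre-release suffix.

```shell
# Hypothetical helper: bump one part of a MAJOR.MINOR.PATCH version string.
# Usage: bump_version <version> [major|minor|patch]
bump_version() {
    local version="$1" part="${2:-patch}"
    local major minor patch rest
    major=${version%%.*}      # text before the first dot
    rest=${version#*.}        # text after the first dot
    minor=${rest%%.*}
    patch=${rest#*.}
    case "$part" in
        major) major=$((major + 1)); minor=0; patch=0 ;;   # breaking change resets the rest
        minor) minor=$((minor + 1)); patch=0 ;;
        patch) patch=$((patch + 1)) ;;
        *) echo "unknown part: $part" >&2; return 1 ;;
    esac
    echo "$major.$minor.$patch"
}
```

Resetting the lower parts on a major or minor bump is what keeps the ordering of versions meaningful when versions are compared later.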

Cleanup Policy

Keep last 5 versions minimum
Archive versions older than 90 days
Never delete versions with good metrics (baseline references)
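
One way to read the three rules together is sketched below. Everything here is an assumption for illustration: the function name select_archivable, the input format (one "version age_days success_rate" line per snapshot, newest first), and the 0.85 cut-off standing in for "good metrics", which the policy does not define. The sketch also exempts good-metrics versions from archiving, which is a stricter reading than "never delete".

```shell
# Hypothetical sketch: list snapshot versions that are candidates for archiving.
# stdin: "version age_days success_rate" per line, newest first.
# Usage: select_archivable [keep_min] [max_age_days] [good_rate] < snapshots.txt
select_archivable() {
    local keep_min="${1:-5}" max_age="${2:-90}" good_rate="${3:-0.85}"
    awk -v keep="$keep_min" -v age="$max_age" -v good="$good_rate" '
        NR <= keep { next }         # always keep the newest snapshots
        $3 >= good { next }         # keep good-metrics versions as baseline references
        $2 > age   { print $1 }     # the rest are archivable once older than max_age
    '
}
```

Only versions that fail all three protections are printed, so the output can be fed to whatever archiving step the project uses.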

Metrics Interpretation

Success Rate Trends

Pattern	Interpretation
Increasing	Evolution working well
Stable	Skill mature and effective
Decreasing	Investigate recent changes

Average Edits Trends

Pattern	Interpretation
Decreasing	Skill producing better output
Stable	Consistent quality
Increasing	Users modifying more (skill may need updates)

Recovery Scenarios

Accidental Breaking Change

# 1. Check history
version-manager.sh list <skill-id>

# 2. Find last good version
version-manager.sh metrics <skill-id>

# 3. Restore
version-manager.sh restore <skill-id> 1.1.0

Gradual Degradation

# 1. Compare versions
version-manager.sh diff <skill-id> 1.0.0 1.2.0

# 2. Identify problematic changes
# 3. Create new version fixing issues
