Comprehensive verification using parallel test agents for unit tests, integration tests, E2E validation, security scanning, and type checking. Runs coverage analysis, detects regressions, and validates against project conventions. Reports pass/fail with detailed findings and coverage deltas. Use when verifying implementations, validating changes after /ork:implement, or running pre-merge quality gates.

Command high

Invoke

/ork:verify

Connections

Depends on

Code Review Playbook Testing Unit Testing E2e Testing Llm Testing Integration Testing Perf Memory Quality Gates Chain Patterns Browser Tools

Used by

Fix Issue Implement Swarm Migrate Upgrade Assessment

Bare Eval Review Pr Architecture Patterns Assess Audit Skills

Verify Feature

Comprehensive verification using parallel specialized agents with nuanced grading (0-10 scale) and improvement suggestions.

Quick Start

/ork:verify authentication flow
/ork:verify --model=opus user profile feature
/ork:verify --scope=backend database migrations

Argument Resolution

SCOPE = "$ARGUMENTS"       # Full argument string, e.g., "authentication flow"
SCOPE_TOKEN = "$ARGUMENTS[0]"  # First token for flag detection (e.g., "--scope=backend")
# $ARGUMENTS[0], $ARGUMENTS[1] etc. for indexed access (CC 2.1.59)

# Model override detection (CC 2.1.72)
MODEL_OVERRIDE = None
for token in "$ARGUMENTS".split():
    if token.startswith("--model="):
        MODEL_OVERRIDE = token.split("=", 1)[1]  # "opus", "sonnet", "haiku", "fable"
        SCOPE = SCOPE.replace(token, "").strip()

# Streak gate detection (#2540) — consecutive-pass mode
STREAK_TARGET = None
for token in "$ARGUMENTS".split():
    if token.startswith("--streak="):
        STREAK_TARGET = int(token.split("=", 1)[1])  # N consecutive READY verdicts required (N >= 2)
        SCOPE = SCOPE.replace(token, "").strip()
# When set, apply the Streak Gate (see below). Full protocol: references/streak-gate.md

Pass MODEL_OVERRIDE to all Agent() calls via model=MODEL_OVERRIDE when set. Accepts symbolic names (opus, sonnet, haiku, fable on harnesses whose Agent tool lists it; note fable is premium API spend after 2026-07-12) or full IDs (claude-opus-4-8) per CC 2.1.74.
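
The flag detection above is written against the `$ARGUMENTS` template variable. As a minimal runnable sketch, assuming the argument string is available as a plain Python string, the same logic could look like the following. The names `VerifyArgs` and `parse_verify_args` are hypothetical and not part of the skill; `--scope=` tokens are left in the scope text, as in the original.

```python
from typing import NamedTuple, Optional


class VerifyArgs(NamedTuple):
    scope: str                      # free-text scope with --model/--streak removed
    model_override: Optional[str]   # value of --model=, if present
    streak_target: Optional[int]    # value of --streak=, if present


def parse_verify_args(arguments: str) -> VerifyArgs:
    """Split --model= and --streak= flags from the free-text scope (hypothetical helper)."""
    model_override = None
    streak_target = None
    scope_tokens = []
    for token in arguments.split():
        if token.startswith("--model="):
            model_override = token.split("=", 1)[1]
        elif token.startswith("--streak="):
            streak_target = int(token.split("=", 1)[1])
            if streak_target < 2:
                # The streak gate is documented as requiring N >= 2
                raise ValueError("--streak=N requires N >= 2")
        else:
            scope_tokens.append(token)
    return VerifyArgs(" ".join(scope_tokens), model_override, streak_target)


print(parse_verify_args("--model=opus --streak=3 user profile feature"))
```

Unlike the `SCOPE.replace(token, "")` approach, rebuilding the scope from the remaining tokens avoids leaving double spaces behind when a flag sits in the middle of the string.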

Opus 4.8: Agents use native adaptive thinking (no MCP sequential-thinking needed); defaults to high effort (CC 2.1.154+). Extended 128K output supports comprehensive verification reports.

STEP 0: Effort-Aware Verification Scaling (CC 2.1.76)

Scale verification depth based on /effort level:

Effort Level	Phases Run	Agents	Output
low	Run tests only → pass/fail	0 agents	Quick check
medium	Tests + code quality + security	3 agents	Score + top issues
high (default)	All 8 phases + visual capture	6-7 agents	Full report + grades
xhigh (Opus 4.8, CC 2.1.111+)	All 8 phases + additional cross-file pattern sweep + self-verification pass	6-7 agents	Full report with uncertainty annotations

Override: Explicit user selection (e.g., "Full verification") overrides /effort downscaling.

STEP 0a: Verify User Intent with AskUserQuestion

BEFORE creating tasks, clarify verification scope:

AskUserQuestion(
  questions=[{
    "question": "What scope for this verification?",
    "header": "Scope",
    "options": [
      # multiSelect questions do not render previews (single-select only) — kept text-only
      {"label": "Full verification (Recommended)", "description": "All tests + security + code quality + visual + grades"},
      {"label": "Tests only", "description": "Run unit + integration + e2e tests"},
      {"label": "Security & code quality", "description": "Security audit (OWASP/CVE/secrets) + lint/types/complexity"},
      {"label": "Quick check", "description": "Just run tests, skip detailed analysis"}
    ],
    "multiSelect": true
  }]
)

Based on answer, adjust workflow:

Full verification: All 10 phases (8 + 2.5 + 8.5), 7 parallel agents including visual capture
Tests only: Skip phases 2 (security), 5 (UI/UX analysis)
Security & code quality: Run security-auditor + code-quality-reviewer agents
Quick check: Run tests only, skip grading and suggestions

STEP 0b: Select Orchestration Mode

Load details: Read("$\{CLAUDE_SKILL_DIR\}/references/orchestration-mode.md") for env var check logic, Agent Teams vs Task Tool comparison, and mode selection rules.

Choose Agent Teams (mesh -- verifiers share findings) or Task tool (star -- all report to lead) based on the orchestration mode reference.

MCP Probe + Resume

# memory is alwaysLoad in .mcp.json (CC 2.1.121+, #1541) — probe below kept as fallback for older CC:
ToolSearch(query="select:mcp__memory__search_nodes")
Write(".claude/chain/capabilities.json", { memory, timestamp })

Read(".claude/chain/state.json")  # resume if exists

Handoff File

After verification completes, write results:

Write(".claude/chain/verify-results.json", JSON.stringify({
  "phase": "verify", "skill": "verify",
  "timestamp": now(), "status": "completed",
  "outputs": {
    "tests_passed": N, "tests_failed": N,
    "coverage": "87%", "security_scan": "clean"
  }
}))

Regression Monitor (CC 2.1.71)

Optionally schedule post-verification monitoring:

# Guard: Skip cron in headless/CI (CLAUDE_CODE_DISABLE_CRON)
# if env CLAUDE_CODE_DISABLE_CRON is set, run a single check instead
CronCreate(
  schedule="0 8 * * *",
  prompt="Daily regression check: npm test.
    If 7 consecutive passes → CronDelete.
    If failures → alert with details."
)

Task Management (CC 2.1.16)

# 1. Create main verification task
TaskCreate(
  subject="Verify [feature-name] implementation",
  description="Comprehensive verification with nuanced grading",
  activeForm="Verifying [feature-name] implementation"
)

# 2. Create subtasks for 8-phase process
TaskCreate(subject="Run code quality checks", activeForm="Running quality checks")    # id=2
TaskCreate(subject="Execute security audit", activeForm="Running security audit")     # id=3
TaskCreate(subject="Verify test coverage", activeForm="Verifying test coverage")      # id=4
TaskCreate(subject="Validate API", activeForm="Validating API")                       # id=5
TaskCreate(subject="Check UI/UX", activeForm="Checking UI/UX")                       # id=6
TaskCreate(subject="Calculate grades", activeForm="Calculating grades")               # id=7
TaskCreate(subject="Generate suggestions", activeForm="Generating suggestions")       # id=8
TaskCreate(subject="Compile report", activeForm="Compiling report")                   # id=9

# 3. Set dependencies: tasks 2-6 run in parallel, tasks 7-9 are sequential
TaskUpdate(taskId="7", addBlockedBy=["2", "3", "4", "5", "6"])  # Grading needs all checks
TaskUpdate(taskId="8", addBlockedBy=["7"])  # Suggestions need grades
TaskUpdate(taskId="9", addBlockedBy=["8"])  # Report needs suggestions

# 4. Before starting each task, verify it's unblocked
task = TaskGet(taskId="2")  # Verify blockedBy is empty

# 5. Update status as you progress
TaskUpdate(taskId="2", status="in_progress")  # When starting
TaskUpdate(taskId="2", status="completed")    # When done — repeat for each subtask
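
To make the dependency structure concrete, here is a small sketch that models the `addBlockedBy` relationships above as a plain dictionary and computes which tasks may start. It does not call the Task tools; `BLOCKED_BY` and `ready_tasks` are hypothetical names used only for illustration.

```python
# Task ids and blockers mirror the TaskUpdate calls above.
BLOCKED_BY = {
    "2": [], "3": [], "4": [], "5": [], "6": [],
    "7": ["2", "3", "4", "5", "6"],   # grading needs all checks
    "8": ["7"],                        # suggestions need grades
    "9": ["8"],                        # report needs suggestions
}


def ready_tasks(completed: set[str]) -> list[str]:
    """Return ids of unfinished tasks whose blockers are all completed."""
    return [
        task_id
        for task_id, blockers in BLOCKED_BY.items()
        if task_id not in completed and all(b in completed for b in blockers)
    ]


print(ready_tasks(set()))                       # ['2', '3', '4', '5', '6']
print(ready_tasks({"2", "3", "4", "5", "6"}))   # ['7']
```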

8-Phase Workflow

Load details: Read("$\{CLAUDE_SKILL_DIR\}/references/verification-phases.md") for complete phase details, agent spawn definitions, Agent Teams alternative, and team teardown.

Phase	Activities	Output
1. Context Gathering	Git diff, commit history	Changes summary
2. Parallel Agent Dispatch	6 agents evaluate	0-10 scores
2.5 Visual Capture	Screenshot routes, AI vision eval	Gallery + visual score
3. Test Execution	Backend + frontend tests	Coverage data
4. Nuanced Grading	Composite score calculation	Grade (A-F)
5. Improvement Suggestions	Effort vs impact analysis	Prioritized list
6. Alternative Comparison	Compare approaches (optional)	Recommendation
7. Metrics Tracking	Trend analysis	Historical data
8. Report Compilation	Evidence artifacts + gallery.html	Final report
8.5 Agentation Loop	User annotates, ui-feedback fixes	Before/after diffs

Phase 2 Agents (Quick Reference)

Agent	Focus	Output
code-quality-reviewer	Lint, types, patterns	Quality 0-10
security-auditor	OWASP, secrets, CVEs	Security 0-10
test-generator	Coverage, test quality	Coverage 0-10
backend-system-architect	API design, async	API 0-10
frontend-ui-developer	React 19, Zod, a11y	UI 0-10
python-performance-engineer	Latency, resources, scaling	Performance 0-10

Launch ALL agents in ONE message with run_in_background=True and max_turns=25.

Progressive Output (CC 2.1.76+)

Output each agent's score as soon as it completes — don't wait for all 6-7 agents.

Focus mode (CC 2.1.101): In focus mode, include the full composite score, all dimension scores, and the verdict in your final message — the user didn't see the incremental outputs.

Security:     8.2/10 — No critical vulnerabilities found
Code Quality: 7.5/10 — 3 complexity hotspots identified
[...remaining agents still running...]

This gives users real-time visibility into multi-agent verification. If any dimension scores below the security_minimum threshold (default 5.0), flag it as a blocker immediately — the user can terminate early without waiting for remaining agents.

Monitor + Partial Results (CC 2.1.98)

Use Monitor for streaming test execution output from background scripts:

# Stream test output in real-time instead of waiting for completion
Bash(command="npm test 2>&1", run_in_background=true)
Monitor(pid=test_task_id)  # Each line → notification

Full pattern reference (when to use vs. TaskOutput, until-condition gates, anti-patterns): Read("$\{CLAUDE_PLUGIN_ROOT\}/skills/chain-patterns/references/monitor-patterns.md").

Partial results (CC 2.1.98): If a verification agent fails mid-analysis, synthesize partial scores rather than re-spawning:

for agent_result in verification_results:
    if "[PARTIAL RESULT]" in agent_result.output:
        # Extract whatever scores the agent produced before crashing
        partial_score = parse_score(agent_result.output)  # May be incomplete
        scores[agent_result.dimension] = {
            "score": partial_score, "partial": True,
            "note": "Agent crashed — score based on partial analysis"
        }
        # A 4-dimension score is better than no score. Do NOT re-spawn.
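
The snippet above leaves `parse_score` and the result shape undefined. A self-contained sketch, assuming each agent's output contains a score written as `N/10` (as in the progressive output example) and that a result carries a `dimension` and an `output` string, might look like this. `AgentResult`, `parse_score`, and `collect_scores` are illustrative names, not the skill's actual API.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches scores such as "8.2/10" or "7 / 10"
SCORE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*10")


@dataclass
class AgentResult:          # hypothetical shape of one agent's result
    dimension: str
    output: str


def parse_score(text: str) -> Optional[float]:
    """Return the first 'N/10' score found in the text, or None if absent."""
    match = SCORE_PATTERN.search(text)
    return float(match.group(1)) if match else None


def collect_scores(results: list[AgentResult]) -> dict[str, dict]:
    """Keep every usable score and mark the ones that came from a partial result."""
    scores = {}
    for result in results:
        score = parse_score(result.output)
        if score is None:
            continue  # nothing usable; leave the dimension unscored
        partial = "[PARTIAL RESULT]" in result.output
        scores[result.dimension] = {"score": score, "partial": partial}
    return scores
```

Dimensions that produced no score at all are simply omitted, so the composite can be computed over whatever dimensions remain.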

Phase 2.5: Visual Capture (NEW — runs in parallel with Phase 2)

Load details: Read("$\{CLAUDE_SKILL_DIR\}/references/visual-capture.md") for auto-detection, route discovery, screenshot capture, and AI vision evaluation.

Summary: Auto-detects project framework, starts dev server, discovers routes, uses agent-browser to screenshot each route, evaluates with Claude vision, generates self-contained gallery.html with base64-embedded images.

Output: verification-output/\{timestamp\}/gallery.html — open in browser to see all screenshots with AI evaluations, scores, and annotation diffs.

Graceful degradation: If no frontend detected or server won't start, skips visual capture with a warning — never blocks verification.

Phase 8.5: Agentation Visual Feedback (opt-in)

Load details: Read("$\{CLAUDE_SKILL_DIR\}/references/visual-capture.md") (Phase 8.5 section) for agentation loop workflow.

Trigger: Only when agentation MCP is configured. Offers user the choice to annotate the live UI. ui-feedback agent processes annotations, re-screenshots show before/after.

Grading & Scoring

Load Read("$\{CLAUDE_PLUGIN_ROOT\}/skills/quality-gates/references/unified-scoring-framework.md") for dimensions, weights, grade thresholds, and improvement prioritization. Load Read("$\{CLAUDE_SKILL_DIR\}/references/quality-model.md") for verify-specific extensions (Visual dimension). Load Read("$\{CLAUDE_SKILL_DIR\}/references/grading-rubric.md") for per-agent scoring criteria.

Dimension-Level Blockers (ork-rubric/1.0)

Composite is necessary but not sufficient — a strong composite can average away a critical dimension. In Phase 4 (Nuanced Grading), read per-dimension thresholds from $\{CLAUDE_SKILL_DIR\}/rubric.json (schema: $\{CLAUDE_PLUGIN_ROOT\}/skills/shared/rubric.schema.json): security min_blocker 4.0, compliance min_pass 6.0.

ANY dimension below its min_blocker → verdict is BLOCKED regardless of composite. Report it explicitly: Security 3.2/10 (CRITICAL BLOCKER — below min_blocker 4.0).
A dimension below its min_pass (but at/above min_blocker) caps the verdict at IMPROVEMENTS RECOMMENDED — it cannot grade READY FOR MERGE.
Blocked verdicts list every tripped dimension first, each with the fix needed to clear it.
A project .claude/policies/verification-policy.json (see Policy-as-Code) may tighten these thresholds, never loosen them below the rubric defaults.

Threshold bands and reporting format: references/grading-rubric.md ("Dimension-Level Blockers" section).
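
The blocker rules above can be sketched as a function that adjusts a composite-derived verdict. This is an illustration only: the real `rubric.json` schema is not shown here, so the `THRESHOLDS` dictionary shape and the name `apply_dimension_floors` are assumptions; the two threshold values are the ones quoted above.

```python
# Assumed in-memory form of the per-dimension thresholds quoted above
THRESHOLDS = {
    "security": {"min_blocker": 4.0},
    "compliance": {"min_pass": 6.0},
}


def apply_dimension_floors(composite_verdict, scores, thresholds=THRESHOLDS):
    """Return (verdict, tripped_dimensions) after applying per-dimension floors."""
    blocked, capped = [], []
    for dimension, score in scores.items():
        limits = thresholds.get(dimension, {})
        if "min_blocker" in limits and score < limits["min_blocker"]:
            blocked.append(dimension)
        elif "min_pass" in limits and score < limits["min_pass"]:
            capped.append(dimension)
    if blocked:
        # Any dimension below min_blocker blocks regardless of composite
        return "BLOCKED", blocked
    if capped and composite_verdict == "READY FOR MERGE":
        # Below min_pass caps the verdict; it never upgrades a worse one
        return "IMPROVEMENTS RECOMMENDED", capped
    return composite_verdict, capped


print(apply_dimension_floors("READY FOR MERGE", {"security": 3.2, "compliance": 8.0}))
```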

Streak Gate (consecutive-pass mode)

A single green is not proof — flaky and order-dependent suites pass once and fail the next run. With --streak=N, verify declares READY FOR MERGE only after N consecutive passing runs, resetting the count to 0 on any non-ready verdict. The count persists across independent runs in .claude/chain/verify-streak.json, keyed by scope.

--streak=N (N ≥ 2; 3 is the sensible default). Absent ⇒ today's single pass/fail behavior, unchanged. Target may also come from .claude/policies/verification-policy.json ("streak_target"); the flag wins.
The gate sits above the verdict — it never loosens a blocker, it only withholds "done" until the streak is met. Each run re-executes the actual tests (no cached passes — that independence is the whole point).
Reset rule: any non-READY FOR MERGE verdict (tripped blocker, failing test, or IMPROVEMENTS RECOMMENDED) zeroes the count. No partial credit.
The verdict surfaces the count: STREAK 2/3 — one more green to merge, or streak reset to 0/3 (security 3.2 < 4.0).
This is the native mechanism the prd-to-goal quality-streak recipe (#2539) leans on. Pair it with a /goal loop, but rm the ledger first — /goal reads until before the turn's verify, so a stale met:true exits with zero runs (see streak-gate.md "Stale-ledger guard").

Full protocol — ledger schema, run loop, /goal wiring, and /ork:cover reuse: Read("$\{CLAUDE_SKILL_DIR\}/references/streak-gate.md").
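
As a rough sketch of the counting rule only, the streak update could be modeled as below. The actual ledger schema lives in `streak-gate.md` and is not reproduced here; the `count`, `target`, and `met` fields and the `record_run` name are assumptions for illustration, and the ledger is kept in memory rather than written to `.claude/chain/verify-streak.json`.

```python
def record_run(ledger: dict, scope: str, verdict: str, target: int) -> dict:
    """Update the per-scope streak after one verification run and return its entry."""
    previous = ledger.get(scope, {"count": 0})
    if verdict == "READY FOR MERGE":
        count = previous["count"] + 1
    else:
        count = 0  # any non-ready verdict resets the streak, no partial credit
    ledger[scope] = {"count": count, "target": target, "met": count >= target}
    return ledger[scope]


ledger = {}
for verdict in ["READY FOR MERGE", "READY FOR MERGE", "IMPROVEMENTS RECOMMENDED"]:
    entry = record_run(ledger, "authentication flow", verdict, target=3)
print(entry)  # {'count': 0, 'target': 3, 'met': False}
```

Keying the ledger by scope means a streak for one feature is unaffected by runs against another.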

Evidence & Test Execution

Load details: Read("$\{CLAUDE_SKILL_DIR\}/rules/evidence-collection.md") for git commands, test execution patterns, metrics tracking, and post-verification feedback.

Policy-as-Code

Load details: Read("$\{CLAUDE_SKILL_DIR\}/references/policy-as-code.md") for configuration.

Define verification rules in .claude/policies/verification-policy.json:

{
  "thresholds": {
    "composite_minimum": 6.0,
    "security_minimum": 7.0,
    "coverage_minimum": 70
  },
  "blocking_rules": [
    {"dimension": "security", "below": 5.0, "action": "block"}
  ]
}
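
The Dimension-Level Blockers section states that a project policy may tighten the rubric thresholds but never loosen them. One way to sketch that rule, assuming the policy's `blocking_rules` entries are the source of project-level blockers, is to take the stricter (higher) of the two floors per dimension. `effective_blockers` and `RUBRIC_MIN_BLOCKER` are hypothetical names; how the skill actually merges the two sources is not specified here.

```python
import json

# Rubric default quoted in the Dimension-Level Blockers section
RUBRIC_MIN_BLOCKER = {"security": 4.0}

# The example policy shown above
POLICY_JSON = """
{
  "thresholds": {"composite_minimum": 6.0, "security_minimum": 7.0, "coverage_minimum": 70},
  "blocking_rules": [{"dimension": "security", "below": 5.0, "action": "block"}]
}
"""


def effective_blockers(policy: dict, rubric: dict) -> dict:
    """Merge policy blocking rules with rubric floors, keeping the stricter value."""
    floors = dict(rubric)
    for rule in policy.get("blocking_rules", []):
        if rule.get("action") != "block":
            continue
        dimension = rule["dimension"]
        # max() means a policy can raise a floor but cannot lower it
        floors[dimension] = max(floors.get(dimension, 0.0), rule["below"])
    return floors


print(effective_blockers(json.loads(POLICY_JSON), RUBRIC_MIN_BLOCKER))  # {'security': 5.0}
```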

Verification Manifest (VERIFIED vs CLAIMED)

Agent scores, tool summaries, and every "X is clean / passing / fixed" sentence are claims until the lead re-runs the proof. Before the verdict, build a Verification Manifest marking every load-bearing claim ✅ VERIFIED (lead ran it fresh — cites command · exit · key line), 🟡 CLAIMED (an agent/tool/doc asserted it, not re-run), ⬜ UNCHECKED, or ⚪ WAIVED (accepted non-blocking, with a reason). An agent's "PASS" copied into the report is still CLAIMED — VERIFIED means the lead ran it; a sub-agent's number (price, model-id, count) is CLAIMED until checked against source.

Verdict rule: any load-bearing claim still 🟡 CLAIMED or ⬜ UNCHECKED caps the verdict at IMPROVEMENTS RECOMMENDED (never READY FOR MERGE) until it is ✅ VERIFIED or ⚪ WAIVED — this stacks with the dimension-level blockers (both must clear), and under --streak=N it resets the streak.

Protocol — claim sources, build step, template, and anti-patterns (laundering, optimism-marking, omission): Read("$\{CLAUDE_SKILL_DIR\}/references/verification-manifest.md").
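
The verdict rule can be sketched as a cap applied after grading. This is a simplified illustration that treats every manifest entry as load-bearing; the `Provenance` enum and `cap_verdict` function are hypothetical names, and the full manifest template is in the referenced file.

```python
from enum import Enum


class Provenance(Enum):
    VERIFIED = "VERIFIED"    # lead re-ran the proof
    CLAIMED = "CLAIMED"      # asserted by an agent, tool, or doc
    UNCHECKED = "UNCHECKED"  # not examined
    WAIVED = "WAIVED"        # accepted as non-blocking, with a reason


def cap_verdict(verdict: str, manifest: list[tuple[str, Provenance]]) -> str:
    """Cap READY FOR MERGE when any load-bearing claim is still unverified."""
    unresolved = [
        claim for claim, state in manifest
        if state in (Provenance.CLAIMED, Provenance.UNCHECKED)
    ]
    if unresolved and verdict == "READY FOR MERGE":
        return "IMPROVEMENTS RECOMMENDED"
    return verdict


manifest = [
    ("unit tests pass", Provenance.VERIFIED),
    ("no known CVEs in dependencies", Provenance.CLAIMED),
]
print(cap_verdict("READY FOR MERGE", manifest))  # IMPROVEMENTS RECOMMENDED
```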

Report Format

Load details: Read("$\{CLAUDE_SKILL_DIR\}/references/report-template.md") for full format. Summary:

# Feature Verification Report

**Composite Score: [N.N]/10** (Grade: [LETTER])

## Verdict
**[READY FOR MERGE | IMPROVEMENTS RECOMMENDED | BLOCKED]**

[--streak=N mode only: **STREAK [current]/[target]** — READY FOR MERGE requires the full target; any non-ready run resets to 0.]

## Verification Manifest
[✅ VERIFIED · 🟡 CLAIMED · ⬜ UNCHECKED · ⚪ WAIVED — any load-bearing 🟡/⬜ caps the verdict below READY FOR MERGE]
| # | Load-bearing claim | Asserted by | Provenance | Evidence (cmd · exit · key line) |

Push notifications (CC 2.1.110+): Verify runs for >5 min are common on complex changes. When the final verdict is ready, call PushNotification to alert the user — they likely walked away from the terminal. Requires Remote Control with "Push when Claude decides" config; fails silently for users without it.
PushNotification(
  message=f"ork:verify complete — \{verdict\} · \{score\}/10 · \{blockers_count\} blockers",
  status="proactive"
)

References

Load on demand with Read("$\{CLAUDE_SKILL_DIR\}/references/<file>"):

File	Content
`verification-phases.md`	8-phase workflow, agent spawn definitions, Agent Teams mode
`visual-capture.md`	Phase 2.5 + 8.5: screenshot capture, AI vision, gallery generation, agentation loop
`quality-model.md`	Scoring dimensions and weights (8 unified)
`grading-rubric.md`	Per-agent scoring criteria
`report-template.md`	Full report format with visual evidence section
`verification-manifest.md`	VERIFIED‑vs‑CLAIMED provenance ledger: states, verdict rule, claim sources, template, anti‑patterns
`alternative-comparison.md`	Approach comparison template
`orchestration-mode.md`	Agent Teams vs Task Tool
`policy-as-code.md`	Verification policy configuration
`verification-checklist.md`	Pre-flight checklist
`streak-gate.md`	`--streak=N` consecutive-pass gate: ledger schema, reset rule, `/goal` wiring, cover reuse

Rules

Load on demand with Read("$\{CLAUDE_SKILL_DIR\}/rules/<file>"):

File	Content
`scoring-rubric.md`	Composite scoring, grades, verdicts
`evidence-collection.md`	Evidence gathering and test patterns

Verification Gate (Cross-Cutting)

Load Read("$\{CLAUDE_PLUGIN_ROOT\}/skills/shared/rules/verification-gate.md") — the minimum 5-step gate that applies to ALL completion claims across all skills. This is non-negotiable: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE.

Anti-Sycophancy Protocol

Load Read("$\{CLAUDE_PLUGIN_ROOT\}/skills/shared/rules/anti-sycophancy.md") — all verification agents report findings directly without performative agreement. "Should be fine" is not evidence. "Tests pass (exit 0, 47/47)" is.

Agent Status Protocol

All verification agents MUST report using the standardized protocol: Read("$\{CLAUDE_PLUGIN_ROOT\}/agents/shared/status-protocol.md"). Never report DONE if concerns exist. Never silently produce work you're unsure about.

Agent Coordination

SendMessage (Cross-Agent Findings)

When a security agent finds a critical issue, share it with other verification agents:

SendMessage(to="test-generator", message="Security: SQL injection in user_service.py:88 — add parameterized query test")
SendMessage(to="code-quality-reviewer", message="Security finding at user_service.py:88 — flag in review")

Skill Chain

After verification, chain to commit if all gates pass:

TaskCreate(subject="Commit verified changes", activeForm="Committing")
TaskUpdate(taskId=commit_id, addBlockedBy=[verify_task_id])
# Then: /ork:commit

Session recovery (CC 2.1.108+): After idle periods or interruptions, use /recap to restore conversational context alongside checkpoint-resume state. Enabled by default since CC 2.1.110 (even with telemetry disabled).

Quality Bar

Done means all of these hold:

verdict is exactly one of READY FOR MERGE / IMPROVEMENTS RECOMMENDED / BLOCKED, with the composite and every dimension score cited
every load-bearing "passing/clean/fixed" claim sits in the Verification Manifest marked VERIFIED (lead re-ran, cites command · exit · key line), CLAIMED, UNCHECKED, or WAIVED
test evidence is the actual runner summary line (command, exit code, pass count) — never paraphrase
any dimension below its min_blocker is reported BLOCKED regardless of composite
READY FOR MERGE only when no load-bearing claim is still CLAIMED/UNCHECKED (and under --streak=N, the full streak is met)

Related Skills

ork:implement - Full implementation with verification
ork:review-pr - PR-specific verification
testing-unit / testing-integration / testing-e2e - Test execution patterns
ork:quality-gates - Quality gate patterns
browser-tools - Browser automation for visual capture

Version: 4.5.0 (July 2026). Added the Verification Manifest (VERIFIED vs CLAIMED), a load-bearing-claim provenance ledger that caps the verdict below READY FOR MERGE until unverified claims are re-run or waived.
Version: 4.4.0 (June 2026). Added the --streak=N consecutive-pass gate (#2540).

Rules (2)

Evidence Collection Patterns — HIGH

Evidence Collection Patterns

Phase 1: Context Gathering

Run these commands in parallel in ONE message:

git diff main --stat
git log main..HEAD --oneline
git diff main --name-only | sort -u

Incorrect:

# Sequential — wastes time, no coverage data
cd backend && pytest tests/
cd frontend && npm test

Correct:

# Parallel with coverage — run both in ONE message
cd backend && poetry run pytest tests/ -v --cov=app --cov-report=json
cd frontend && npm run test -- --coverage

Phase 3: Parallel Test Execution

Run backend and frontend tests in parallel:

# PARALLEL - Backend and frontend
cd backend && poetry run pytest tests/ -v --cov=app --cov-report=json
cd frontend && npm run test -- --coverage

Phase 7: Metrics Tracking

Store verification metrics in memory for trend analysis:

mcp__memory__create_entities(entities=[{
  "name": "verification-{date}-{feature}",
  "entityType": "VerificationMetrics",
  "observations": [f"composite_score: {score}", ...]
}])

Query trends: mcp__memory__search_nodes(query="VerificationMetrics")

Phase 2.5: Visual Evidence Collection

Run in parallel with Phase 2 agents. Auto-detects frontend framework and captures screenshots.

Incorrect:

# Manual screenshots with no structure
open http://localhost:3000
# Take manual screenshot...

Correct:

# Automated visual capture with AI evaluation
Agent(
  subagent_type="general-purpose",
  prompt="Visual capture: detect framework, start server, screenshot routes via agent-browser, evaluate with Claude vision, generate gallery.html",
  run_in_background=True
)

Output structure:

verification-output/{timestamp}/
├── screenshots/          (PNGs per route, base64 in gallery)
├── ai-evaluations/       (JSON per screenshot with score + issues)
├── annotations/          (before/after if agentation used)
│   ├── before/
│   └── after/
└── gallery.html          (self-contained, open in browser)

Phase 8.5: Post-Verification Feedback

After report compilation, store verification scores in the memory graph for KPI baseline tracking:

Query trends: mcp__memory__search_nodes(query="VerificationScores")

Scoring Rubric — HIGH

Scoring Rubric

Composite Score

Each agent produces a 0-10 score with decimals for nuance. The composite score is a weighted sum using the weights from Quality Model.

Grade Thresholds

Grade	Score Range	Verdict
A+	9.0-10.0	EXCELLENT
A	8.0-8.9	READY FOR MERGE
B	7.0-7.9	READY FOR MERGE
C	6.0-6.9	IMPROVEMENTS RECOMMENDED
D	5.0-5.9	IMPROVEMENTS RECOMMENDED
F	0.0-4.9	BLOCKED
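
As a worked sketch of the weighted sum and the grade lookup, the following uses the eight dimension weights listed in the Quality Model and the thresholds from the table above. Treating each range's lower bound as an inclusive floor (so 8.95 grades as A) is an assumption, since the table leaves the gaps between ranges unspecified. Dimension-level blockers and the manifest cap are applied separately and are not modeled here.

```python
# Dimension weights as listed in the Quality Model (with Visual active)
WEIGHTS = {
    "correctness": 0.14, "maintainability": 0.14, "performance": 0.11,
    "security": 0.18, "scalability": 0.09, "testability": 0.12,
    "compliance": 0.12, "visual": 0.10,
}

# (inclusive lower bound, grade, verdict) rows from the table above
GRADES = [
    (9.0, "A+", "EXCELLENT"),
    (8.0, "A", "READY FOR MERGE"),
    (7.0, "B", "READY FOR MERGE"),
    (6.0, "C", "IMPROVEMENTS RECOMMENDED"),
    (5.0, "D", "IMPROVEMENTS RECOMMENDED"),
    (0.0, "F", "BLOCKED"),
]


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of the per-dimension 0-10 scores, rounded to two decimals."""
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 2)


def grade(composite: float) -> tuple[str, str]:
    """Map a composite score to (grade, verdict) using the thresholds above."""
    for floor, letter, verdict in GRADES:
        if composite >= floor:
            return letter, verdict
    raise ValueError("composite score must be between 0 and 10")


scores = {d: 7.0 for d in WEIGHTS}
scores["security"] = 9.0
print(composite_score(scores), grade(composite_score(scores)))
```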

Key Decisions

Decision	Choice	Rationale
Scoring scale	0-10 with decimals	Nuanced, not binary
Improvement priority	Impact / Effort ratio	Do high-value first
Alternative comparison	Optional phase	Only when multiple valid approaches
Metrics persistence	Memory MCP	Track trends over time

Incorrect:

Security: "looks fine"  → 8/10    # No evidence, subjective
Performance: "fast enough" → 7/10  # No benchmarks

Correct:

Security: "11/11 injection tests pass, 13 deny patterns, 0 CVEs" → 9/10
Performance: "p99 latency 142ms (budget: 300ms), 0 N+1 queries" → 8.5/10

Improvement Suggestions

Each suggestion includes effort (1-5) and impact (1-5) with priority = impact/effort. See Quality Model for scale definitions and quick wins formula.
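
A minimal sketch of the prioritization, using made-up suggestions purely for illustration:

```python
def prioritize(suggestions: list[dict]) -> list[dict]:
    """Sort suggestions by priority = impact / effort, highest first."""
    return sorted(suggestions, key=lambda s: s["impact"] / s["effort"], reverse=True)


# Hypothetical suggestions; effort and impact are on the 1-5 scales described above
suggestions = [
    {"title": "Refactor complexity hotspot", "effort": 4, "impact": 3},   # priority 0.75
    {"title": "Add parameterized query test", "effort": 1, "impact": 4},  # priority 4.0
    {"title": "Cache profile lookups", "effort": 2, "impact": 5},         # priority 2.5
]

for item in prioritize(suggestions):
    print(f'{item["impact"] / item["effort"]:.2f}  {item["title"]}')
```

Low-effort, high-impact items rise to the top, which is the "do high-value first" behavior described in Key Decisions.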

Blocking Rules

Verification can be blocked by policy-as-code rules. See Policy-as-Code for configuration of composite minimums, dimension minimums, and blocking rules.

References (11)

Alternative Comparison

Evaluate current implementation against alternative approaches.

When to Compare

Multiple valid architectures exist
User asks "is this the best way?"
Major patterns were chosen (ORM vs raw SQL, REST vs GraphQL)
Performance/scalability concerns raised

Comparison Criteria

For Each Alternative

Criterion	Weight	Description
Effort	30%	Implementation complexity (1-5 scale)
Risk	25%	Technical and operational risk (1-5 scale)
Benefit	45%	Value delivered, performance, maintainability (1-5 scale)

Migration Cost

Factor	Estimate
Code changes	Files/lines affected
Data migration	Schema changes, backfill
Testing	New test coverage needed
Rollback risk	Reversibility

Decision Matrix Format

Approach	Effort	Risk	Benefit	Score
Current	N	N	N	((5-E)*0.3 + (5-R)*0.25 + B*0.45)
Alt A	N	N	N	calculated
Alt B	N	N	N	calculated

Note: Higher effort and risk are bad (invert for scoring), higher benefit is good.

Recommendation Formula:

Score = (5 - Effort) * 0.3 + (5 - Risk) * 0.25 + Benefit * 0.45
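
A short worked example of the formula, with invented ratings for two approaches:

```python
def recommendation_score(effort: int, risk: int, benefit: int) -> float:
    """Apply the recommendation formula; inputs are 1-5 ratings."""
    return (5 - effort) * 0.3 + (5 - risk) * 0.25 + benefit * 0.45


# Hypothetical ratings for illustration only
approaches = {
    "Current": {"effort": 1, "risk": 2, "benefit": 3},
    "Alt A": {"effort": 4, "risk": 3, "benefit": 5},
}

for name, ratings in approaches.items():
    print(name, round(recommendation_score(**ratings), 2))
# Current 3.3
# Alt A 3.05
```

Here the alternative's higher benefit does not outweigh its effort and risk, so the current approach scores higher. Note that with 1-5 inputs this formula tops out at 4.45 (effort 1, risk 1, benefit 5), so its results are not directly on the N/10 scale used in the output template.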

Output Template

### Alternative Comparison: [Topic]

**Current Approach:** [description]
- Score: N/10
- Pros: [strengths]
- Cons: [weaknesses]

**Alternative A:** [description]
- Score: N/10
- Pros: [strengths]
- Cons: [weaknesses]
- Migration effort: [1-5]

**Recommendation:** [Keep current / Switch to Alt A]
**Justification:** [1-2 sentences]

Grading Rubric

Verification Grading Rubric

0-10 scoring criteria for each verification dimension.

Score Levels

Range	Level	Description
0-3	Poor	Critical issues, blocks merge
4-6	Adequate	Functional but needs improvement
7-9	Good	Ready for merge, minor suggestions
10	Excellent	Exemplary, reference quality

Dimension Rubrics

Correctness (Weight: 14%)

Score	Criteria
10	All functional requirements met, edge cases handled, zero regressions
8-9	Core requirements met, most edge cases handled
6-7	Core paths work, some edge cases missing
4-5	Partial functionality, notable gaps
1-3	Broken core paths
0	Does not run

Maintainability (Weight: 14%)

Score	Criteria
10	Zero lint errors/warnings, strict types, exemplary patterns, low complexity
8-9	Zero errors, < 5 warnings, minimal `any`, good patterns
6-7	1-3 errors, some warnings, acceptable patterns
4-5	4-10 errors, pattern issues, needs refactoring
1-3	Many errors, poor patterns, high complexity
0	Lint/type check fails to run

Performance (Weight: 11%)

Score	Criteria
10	p99 within budget, zero N+1, optimal caching, efficient resource usage
8-9	Good latency, no N+1, reasonable caching
6-7	Acceptable latency, minor inefficiencies
4-5	Notable bottlenecks, missing caching
1-3	Severe bottlenecks, resource leaks
0	Unresponsive or crashes under load

Security (Weight: 18%)

Score	Criteria
10	No vulnerabilities, all OWASP compliant, secure by design
8-9	No critical/high, all OWASP, excellent practices
6-7	No critical, 1-2 high, most OWASP compliant
4-5	No critical, 3-5 high, some gaps
1-3	1+ critical or many high vulnerabilities
0	Multiple critical, secrets exposed

Scalability (Weight: 9%)

Score	Criteria
10	Horizontal scaling ready, stateless design, efficient data patterns
8-9	Good scaling patterns, minor bottlenecks
6-7	Scales for current needs, some concerns
4-5	Will hit limits soon, needs rework
1-3	Single-instance only, monolithic state
0	Cannot handle production load

Testability (Weight: 12%)

Score	Criteria
10	>= 90% coverage, meaningful assertions, edge cases, no flaky tests
8-9	>= 80% coverage, good assertions, critical paths
6-7	>= 70% coverage (target), basic assertions
4-5	50-69% coverage
1-3	30-49% coverage
0	< 30% coverage or tests fail to run

Compliance (Weight: 12%)

Score	Criteria
10	Perfect REST/UI contracts, RFC 9457 errors, full Zod, WCAG AA
8-9	Good conventions, proper validation, accessibility
6-7	Acceptable patterns, minor inconsistencies
4-5	Several convention violations
1-3	Poor API/UI design, missing validation
0	Broken contracts or inaccessible

Visual (Weight: 10%)

Score	Criteria
10	Pixel-perfect layout, full a11y, complete content, responsive
8-9	Good layout, minor visual issues, WCAG AA
6-7	Acceptable layout, some a11y gaps
4-5	Layout issues, missing content, a11y problems
1-3	Broken layout, major content missing
0	Page fails to render

Note: Visual weight is 0.00 for API-only projects — redistributed proportionally. See Quality Model.
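
Proportional redistribution can be sketched by dropping the Visual weight and rescaling the remaining weights so they sum to 1.00 again. The weights below are the percentages from the rubric headings above; `redistribute` is a hypothetical helper, and the canonical base weights remain those in the unified scoring framework.

```python
# Weights from the dimension rubric headings above
WEIGHTS = {
    "correctness": 0.14, "maintainability": 0.14, "performance": 0.11,
    "security": 0.18, "scalability": 0.09, "testability": 0.12,
    "compliance": 0.12, "visual": 0.10,
}


def redistribute(weights: dict[str, float], dropped: str = "visual") -> dict[str, float]:
    """Remove one dimension and rescale the rest proportionally to sum to 1.0."""
    remaining = {d: w for d, w in weights.items() if d != dropped}
    total = sum(remaining.values())
    return {d: w / total for d, w in remaining.items()}


api_only = redistribute(WEIGHTS)
print({d: round(w, 3) for d, w in api_only.items()})
```

Each remaining dimension keeps its share relative to the others; for example, Security moves from 0.18 to 0.18 / 0.90 = 0.20.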

Grade Interpretation

Composite	Grade	Verdict
9.0-10.0	A+	EXCELLENT
8.0-8.9	A	READY FOR MERGE
7.0-7.9	B	READY FOR MERGE
6.0-6.9	C	IMPROVEMENTS RECOMMENDED
5.0-5.9	D	IMPROVEMENTS RECOMMENDED
0.0-4.9	F	BLOCKED

Dimension-Level Blockers (ork-rubric/1.0)

The composite table above is overridden by per-dimension floors. Thresholds live in ../rubric.json (min_blocker / min_pass fields; schema: ../../shared/rubric.schema.json) — they map to the 0-3 "Poor, blocks merge" band of each dimension rubric.

Dimension	Threshold	Effect when below
Security	`min_blocker` 4.0	Verdict BLOCKED regardless of composite
Compliance	`min_pass` 6.0	Verdict capped at IMPROVEMENTS RECOMMENDED

Reporting format — the tripped dimension leads the verdict, with the threshold named:

Security 3.2/10 (CRITICAL BLOCKER — below min_blocker 4.0)

A passing composite never clears a tripped min_blocker; list every tripped dimension with the fix required to clear it.

Orchestration Mode

Orchestration Mode Selection

Shared logic for choosing between Agent Teams and Task tool orchestration in assess/verify skills.

Environment Check

# Agent Teams is GA since CC 2.1.33
import os


def select_mode(scope: str) -> str:
    """Return "agent_teams" or "task_tool" for a scope ("full" or narrower)."""
    # ORCHESTKIT_FORCE_TASK_TOOL=1 overrides everything else
    if os.environ.get("ORCHESTKIT_FORCE_TASK_TOOL") == "1":
        return "task_tool"
    # Teams available by default: use for full multi-dimensional work
    return "agent_teams" if scope == "full" else "task_tool"

Decision Rules

Full assessment/verification scope --> Agent Teams mode (GA since CC 2.1.33)
Quick/single-dimension scope --> Task tool mode
ORCHESTKIT_FORCE_TASK_TOOL=1 --> Task tool (override)

Agent Teams vs Task Tool

Aspect	Task Tool (Star)	Agent Teams (Mesh)
Topology	All agents report to lead	Agents communicate with each other
Finding correlation	Lead cross-references after completion	Agents share findings in real-time
Cross-domain overlap	Independent scoring	Agents alert each other about overlapping concerns
Cost	~200K tokens	~500K tokens
Best for	Focused/single-dimension work	Full multi-dimensional assessment/verification

Fallback

If Agent Teams encounters issues mid-execution, fall back to Task tool for remaining work. This is safe because both modes produce the same output format (dimensional scores 0-10).

Context Window Note

For full codebase work (>20 files), use the 1M context window to avoid agent context exhaustion. On 200K context, scope discovery should limit files to prevent overflow.

Policy As Code

Policy-as-Code

Define verification policies as machine-readable configuration.

Policy Structure

version: "1.0"
name: policy-name
description: What this policy enforces

thresholds:
  composite_minimum: 6.0
  coverage_minimum: 70

rules:
  blockers: []    # Fail verification
  warnings: []    # Note but continue
  info: []        # Informational only

Rule Definition

Blocker Rules (Must Pass)

blockers:
  - dimension: security
    condition: below
    value: 5.0
    message: "Security score below minimum"

  - check: critical_vulnerabilities
    condition: above
    value: 0
    message: "Critical vulnerabilities found"

  - check: type_errors
    condition: above
    value: 0
    message: "TypeScript errors must be zero"

Warning Rules (Should Fix)

warnings:
  - dimension: code_quality
    condition: below
    value: 7.0
    message: "Code quality could be improved"

  - check: test_coverage
    condition: below
    value: 80
    message: "Coverage below recommended 80%"

Info Rules (Awareness)

info:
  - check: todo_count
    condition: above
    value: 5
    message: "Multiple TODOs found in code"
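The rule shapes above can be evaluated mechanically. The following is a minimal sketch, assuming the policy YAML has already been parsed into a dict and that dimension scores and check counts arrive together in one flat `metrics` dict. The names `rule_fires`, `evaluate_policy`, and the `metrics` layout are hypothetical and only illustrate the `below` / `above` semantics; they are not part of ork.

```python
from typing import Any


def rule_fires(rule: dict[str, Any], metrics: dict[str, float]) -> bool:
    """Return True when the rule's condition holds for the measured value."""
    key = rule.get("dimension") or rule.get("check")
    # Assumption: a metric that was never measured is an error (KeyError),
    # not a silent pass.
    actual = metrics[key]
    if rule["condition"] == "below":
        return actual < rule["value"]
    if rule["condition"] == "above":
        return actual > rule["value"]
    raise ValueError(f"unknown condition: {rule['condition']!r}")


def evaluate_policy(policy: dict[str, Any], metrics: dict[str, float]) -> dict[str, list[str]]:
    """Collect the messages of every fired rule, grouped by severity."""
    rules = policy.get("rules", {})
    return {
        severity: [r["message"] for r in rules.get(severity, []) if rule_fires(r, metrics)]
        for severity in ("blockers", "warnings", "info")
    }


policy = {
    "rules": {
        "blockers": [
            {"dimension": "security", "condition": "below", "value": 5.0,
             "message": "Security score below minimum"},
            {"check": "type_errors", "condition": "above", "value": 0,
             "message": "TypeScript errors must be zero"},
        ],
        "warnings": [
            {"check": "test_coverage", "condition": "below", "value": 80,
             "message": "Coverage below recommended 80%"},
        ],
        "info": [],
    }
}
metrics = {"security": 6.5, "type_errors": 2, "test_coverage": 74}
result = evaluate_policy(policy, metrics)
passed = not result["blockers"]  # any fired blocker fails verification
```

With these sample metrics the type-error blocker and the coverage warning fire, so verification fails even though the security score clears its minimum.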

Threshold Configuration

Threshold	Type	Description
composite_minimum	float	Overall score minimum (0-10)
coverage_minimum	int	Test coverage percentage
critical_vulnerabilities	int	Max critical vulns (0)
high_vulnerabilities	int	Max high vulns
lint_errors	int	Max lint errors (0)
type_errors	int	Max type errors (0)

Custom Rules

custom_rules:
  - name: no_console_log
    pattern: "console\\.log"
    file_glob: "**/*.ts"
    exclude: ["**/*.test.ts"]
    severity: warning
    message: "Remove console.log from production"
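A custom rule like the one above amounts to a regex scan over a glob with exclusions. The sketch below shows one way that could work, assuming `pattern` is a Python-compatible regular expression and that `file_glob` and `exclude` follow `pathlib` glob semantics. `scan_custom_rule` is a hypothetical helper, not the actual ork implementation.

```python
import re
import tempfile
from pathlib import Path


def scan_custom_rule(rule: dict, root: Path) -> list[tuple[str, int]]:
    """Return (relative path, line number) for each line matching the rule's pattern."""
    pattern = re.compile(rule["pattern"])
    # Resolve the exclude globs once, then subtract them from the candidates.
    excluded = {p for glob in rule.get("exclude", []) for p in root.glob(glob)}
    hits = []
    for path in sorted(root.glob(rule["file_glob"])):
        if path in excluded or not path.is_file():
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if pattern.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno))
    return hits


rule = {
    "name": "no_console_log",
    "pattern": r"console\.log",
    "file_glob": "**/*.ts",
    "exclude": ["**/*.test.ts"],
    "severity": "warning",
}

# Demo on a throwaway directory: one production file, one test file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "src" / "app.ts").write_text("const x = 1;\nconsole.log(x);\n", encoding="utf-8")
    (root / "src" / "app.test.ts").write_text("console.log('fine in tests');\n", encoding="utf-8")
    hits = scan_custom_rule(rule, root)
```

Only the production file is reported; the test file matches the pattern but is removed by the `exclude` glob.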

Policy Location

Store at: .claude/policies/verification-policy.yaml

Multiple policies: .claude/policies/{name}-policy.yaml

Quality Model

Quality Model (verify)

Extends the unified scoring framework with Visual as the 8th dimension.

Canonical source: quality-gates/references/unified-scoring-framework.md
Load: Read("${CLAUDE_PLUGIN_ROOT}/skills/quality-gates/references/unified-scoring-framework.md")

verify-Specific Extensions

Visual Dimension (8th)

Dimension	Weight	What It Measures
Visual	0.10	Layout correctness, a11y, content completeness, responsiveness

When Visual is active, the base dimensions are scaled down to make room for it: adjusted = base_weight * (1.0 / 1.10). When Visual is skipped (API-only changes), the base weights are used unscaled (factor 1.00).

Dimensions Used (with Visual)

Dimension	Adjusted Weight
Correctness	0.14
Maintainability	0.14
Performance	0.11
Security	0.18
Scalability	0.09
Testability	0.12
Compliance	0.12
Visual	0.10

See unified framework for grade thresholds, improvement prioritization, effort/impact scales, and blocking rules.
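To make the adjusted weights concrete, here is a small sketch that combines per-dimension scores into a composite. It assumes the composite is a plain weighted average of the 0-10 dimension scores; the unified framework remains the canonical definition, and `composite_score` is a hypothetical helper name.

```python
# Adjusted weights copied from the "Dimensions Used (with Visual)" table.
ADJUSTED_WEIGHTS = {
    "correctness": 0.14,
    "maintainability": 0.14,
    "performance": 0.11,
    "security": 0.18,
    "scalability": 0.09,
    "testability": 0.12,
    "compliance": 0.12,
    "visual": 0.10,
}


def composite_score(scores: dict[str, float],
                    weights: dict[str, float] = ADJUSTED_WEIGHTS) -> float:
    """Weighted average of 0-10 dimension scores (assumed aggregation)."""
    missing = sorted(set(weights) - set(scores))
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    total_weight = sum(weights.values())
    weighted = sum(weights[dim] * scores[dim] for dim in weights)
    return round(weighted / total_weight, 2)


# Every dimension strong except security.
scores = {dim: 9.0 for dim in ADJUSTED_WEIGHTS}
scores["security"] = 3.0
composite = composite_score(scores)
```

Here the composite comes out at 7.92 despite a security score of 3.0, which illustrates why dimension-level blockers are checked alongside the composite rather than replaced by it.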

Report Template

Verification Report Template

Copy this template and fill in results from parallel agent verification.

Quick Copy Template

# Feature Verification Report

**Date**: [TODAY'S DATE]
**Branch**: [branch-name]
**Feature**: [feature description]
**Reviewer**: Claude Code with [N] parallel subagents
**Verification Duration**: [X minutes]

---

## Summary

**Status**: [READY FOR MERGE | NEEDS ATTENTION | BLOCKED]

[1-2 sentence summary of verification results]

---

## Verification Manifest

> Provenance of every load-bearing claim. Any 🟡 CLAIMED or ⬜ UNCHECKED row caps the
> status below READY FOR MERGE until it is ✅ VERIFIED or ⚪ WAIVED. See
> `references/verification-manifest.md`.

| # | Load-bearing claim | Asserted by | Provenance | Evidence (cmd · exit · key line) |
|---|--------------------|-------------|------------|----------------------------------|
| 1 | [e.g. Types check clean] | [agent/tool/doc] | ✅ VERIFIED / 🟡 CLAIMED / ⬜ UNCHECKED / ⚪ WAIVED | [`cmd` · exit 0 · key line, or reason] |

**Manifest impact**: [none — all load-bearing claims VERIFIED] / [verdict capped at IMPROVEMENTS RECOMMENDED — rows [N] unverified]

---

## Agent Results

### 1. Code Quality (code-quality-reviewer)

| Check | Tool | Exit Code | Errors | Warnings | Status |
|-------|------|-----------|--------|----------|--------|
| Backend Lint | Ruff | 0/1 | N | N | PASS/FAIL |
| Backend Types | ty | 0/1 | N | N | PASS/FAIL |
| Frontend Lint | Biome | 0/1 | N | N | PASS/FAIL |
| Frontend Types | tsc | 0/1 | N | N | PASS/FAIL |

**Pattern Compliance:**
- [ ] No `console.log` in production code
- [ ] No `any` types in TypeScript
- [ ] Exhaustive switches with `assertNever`
- [ ] SOLID principles followed
- [ ] Cyclomatic complexity < 10

**Findings:**
- [List any pattern violations]

---

### 2. Security Audit (security-auditor)

| Check | Tool | Critical | High | Medium | Low | Status |
|-------|------|----------|------|--------|-----|--------|
| JS Dependencies | npm audit | N | N | N | N | PASS/BLOCK |
| Python Dependencies | pip-audit | N | N | N | N | PASS/BLOCK |
| Secrets Scan | grep/gitleaks | N/A | N/A | N/A | N | PASS/BLOCK |

**OWASP Top 10 Compliance:**
- [ ] A01: Broken Access Control
- [ ] A02: Cryptographic Failures
- [ ] A03: Injection
- [ ] A04: Insecure Design
- [ ] A05: Security Misconfiguration
- [ ] A06: Vulnerable Components
- [ ] A07: Auth Failures
- [ ] A08: Data Integrity Failures
- [ ] A09: Logging Failures
- [ ] A10: SSRF

**Findings:**
- [List any security issues]

---

### 3. Test Coverage (test-generator)

| Suite | Total | Passed | Failed | Skipped | Coverage | Target | Status |
|-------|-------|--------|--------|---------|----------|--------|--------|
| Backend Unit | N | N | N | N | X% | 70% | PASS/FAIL |
| Backend Integration | N | N | N | N | X% | 70% | PASS/FAIL |
| Frontend Unit | N | N | N | N | X% | 70% | PASS/FAIL |
| E2E | N | N | N | N | N/A | N/A | PASS/FAIL |

**Test Quality:**
- [ ] Meaningful assertions (not just `assert result`)
- [ ] Edge cases covered (empty, error, timeout)
- [ ] No flaky tests (no sleep, no timing deps)
- [ ] MSW used for API mocking (not jest.mock)

**Coverage Gaps:**
- [List uncovered critical paths]

---

### 4. API Compliance (backend-system-architect)

| Check | Compliant | Issues |
|-------|-----------|--------|
| REST Conventions | Yes/No | [details] |
| Pydantic v2 Validation | Yes/No | [details] |
| RFC 9457 Error Handling | Yes/No | [details] |
| Async Timeout Protection | Yes/No | [details] |
| No N+1 Queries | Yes/No | [details] |

**Findings:**
- [List any API compliance issues]

---

### 5. UI Compliance (frontend-ui-developer)

| Check | Compliant | Issues |
|-------|-----------|--------|
| React 19 APIs (useOptimistic, useFormStatus, use()) | Yes/No | [details] |
| Zod Validation on API Responses | Yes/No | [details] |
| Exhaustive Type Checking | Yes/No | [details] |
| Skeleton Loading States | Yes/No | [details] |
| Prefetching on Navigation | Yes/No | [details] |
| WCAG 2.1 AA Accessibility | Yes/No | [details] |

**Findings:**
- [List any UI compliance issues]

---

## Quality Gates Summary

| Gate | Required | Actual | Status |
|------|----------|--------|--------|
| Test Coverage | >= 70% | X% | PASS/FAIL |
| Security Critical | 0 | N | PASS/FAIL |
| Security High | <= 5 | N | PASS/FAIL |
| Type Errors | 0 | N | PASS/FAIL |
| Lint Errors | 0 | N | PASS/FAIL |

**Overall Gate Status**: [ALL PASS | SOME FAIL]

---

## Blockers (Must Fix Before Merge)

1. [Blocker description with file:line reference]
2. [Blocker description with file:line reference]

---

## Suggestions (Non-Blocking)

1. [Suggestion for improvement]
2. [Suggestion for improvement]

---

## Visual Verification

**Visual Score: [N.N]/10**

| Route | Screenshot | AI Score | Issues | Status |
|-------|-----------|----------|--------|--------|
| / | [thumbnail] | N.N/10 | N | PASS/WARN/FAIL |
| /dashboard | [thumbnail] | N.N/10 | N | PASS/WARN/FAIL |
| /settings | [thumbnail] | N.N/10 | N | PASS/WARN/FAIL |

**Gallery**: Open `verification-output/{timestamp}/gallery.html` for full screenshots with AI evaluations.

### Agentation Annotations (if applicable)

| Annotation | Route | Resolution | Before/After |
|-----------|-------|------------|--------------|
| [user comment] | /dashboard | [fix summary] | [see gallery] |

---

## Evidence Artifacts

| Artifact | Location | Generated |
|----------|----------|-----------|
| Test Results | `/tmp/test_results.log` | [timestamp] |
| Coverage Report | `/tmp/coverage.json` | [timestamp] |
| Security Scan | `/tmp/security_audit.json` | [timestamp] |
| Lint Report | `/tmp/lint_results.log` | [timestamp] |
| Visual Gallery | `verification-output/{timestamp}/gallery.html` | [timestamp] |
| Screenshots | `verification-output/{timestamp}/screenshots/` | [timestamp] |
| AI Evaluations | `verification-output/{timestamp}/ai-evaluations/` | [timestamp] |

---

## Verification Metadata

- **Agents Used**: 7 (code-quality-reviewer, security-auditor, test-generator, backend-system-architect, frontend-ui-developer, python-performance-engineer, visual-capture)
- **Parallel Execution**: Yes
- **Total Tool Calls**: ~N
- **Context Usage**: ~N tokens

Status Definitions

Status	Emoji	Meaning	Action Required
READY FOR MERGE	Green	All checks pass, no blockers	Approve PR
NEEDS ATTENTION	Yellow	Minor issues found	Review suggestions, optionally fix
BLOCKED	Red	Critical issues found	Must fix before merge

Severity Levels

Level	Threshold	Action	Blocks Merge
Critical	Any	Fix immediately	YES
High	> 5	Fix before merge	YES
Medium	> 20	Should fix	NO (with justification)
Low	> 50	Nice to have	NO
Info	N/A	Informational	NO
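The blocking columns of the table above reduce to a simple count check. A minimal sketch, assuming findings have already been tallied per severity level; `BLOCKING_LIMITS` and `merge_blockers` are hypothetical names chosen for illustration.

```python
# Limits from the Severity Levels table: a level blocks the merge when its
# finding count exceeds the limit. Medium, Low, and Info never block.
BLOCKING_LIMITS = {"critical": 0, "high": 5}


def merge_blockers(counts: dict[str, int]) -> list[str]:
    """Return one human-readable entry per severity level that blocks the merge."""
    blockers = []
    for level, limit in BLOCKING_LIMITS.items():
        found = counts.get(level, 0)
        if found > limit:
            blockers.append(f"{level}: {found} findings (limit {limit})")
    return blockers


clean = merge_blockers({"critical": 0, "high": 5, "medium": 30})
blocked = merge_blockers({"critical": 1, "high": 6})
```

Exactly five High findings still pass, matching the `<= 5` gate used elsewhere in the report; the sixth one blocks.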

Agent Output JSON Schemas

code-quality-reviewer Output

{
  "linting": {"tool": "ruff|biome", "exit_code": 0, "errors": 0, "warnings": 0},
  "type_check": {"tool": "ty|tsc", "exit_code": 0, "errors": 0},
  "patterns": {"violations": [], "compliance": "PASS|FAIL"},
  "approval": {"status": "APPROVED|NEEDS_FIXES", "blockers": []}
}

security-auditor Output

{
  "scan_summary": {"files_scanned": 100, "vulnerabilities_found": 0},
  "critical": [],
  "high": [],
  "secrets_detected": [],
  "recommendations": [],
  "approval": {"status": "PASS|BLOCK", "blockers": []}
}

test-generator Output

{
  "coverage": {"current": 85, "target": 70, "passed": true},
  "test_summary": {"total": 100, "passed": 98, "failed": 2, "skipped": 0},
  "gaps": ["file:line - reason"],
  "quality_issues": [],
  "approval": {"status": "PASS|FAIL", "blockers": []}
}

backend-system-architect Output

{
  "api_compliance": {"rest_conventions": true, "issues": []},
  "validation": {"pydantic_v2": true, "issues": []},
  "error_handling": {"rfc9457": true, "issues": []},
  "async_safety": {"timeouts": true, "issues": []},
  "approval": {"status": "PASS|FAIL", "blockers": []}
}

frontend-ui-developer Output

{
  "react_19": {"apis_used": ["useOptimistic"], "missing": [], "compliant": true},
  "zod_validation": {"validated_endpoints": 10, "unvalidated": []},
  "type_safety": {"exhaustive_switches": true, "any_types": 0},
  "ux_patterns": {"skeletons": true, "prefetching": true},
  "accessibility": {"wcag_issues": []},
  "approval": {"status": "PASS|FAIL", "blockers": []}
}
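Because every agent output carries the same `approval` object, the lead can fold them into one summary. The sketch below assumes the outputs have been parsed from JSON into dicts keyed by agent name; `summarize_approvals` is a hypothetical helper, and the statuses treated as passing are inferred from the schemas above.

```python
# Statuses that count as passing across the schemas above
# (APPROVED|NEEDS_FIXES, PASS|BLOCK, PASS|FAIL).
PASSING_STATUSES = {"APPROVED", "PASS"}


def summarize_approvals(outputs: dict[str, dict]) -> dict:
    """Fold per-agent approval objects into one summary for the lead."""
    failing = sorted(
        name for name, out in outputs.items()
        if out["approval"]["status"] not in PASSING_STATUSES
    )
    blockers = [
        f"{name}: {blocker}"
        for name, out in sorted(outputs.items())
        for blocker in out["approval"]["blockers"]
    ]
    return {"all_passed": not failing, "failing_agents": failing, "blockers": blockers}


outputs = {
    "code-quality-reviewer": {"approval": {"status": "APPROVED", "blockers": []}},
    "security-auditor": {"approval": {"status": "BLOCK", "blockers": ["hardcoded API key"]}},
    "test-generator": {"approval": {"status": "PASS", "blockers": []}},
}
summary = summarize_approvals(outputs)
```

A summary built this way only aggregates what the agents asserted; it does not replace re-running the underlying checks.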

Streak Gate

Streak Gate — consecutive-pass verification

A single green is not proof. Flaky suites, race conditions, and order-dependent tests pass once and fail the next run. The streak gate makes /ork:verify declare a feature done only after N consecutive passing runs, resetting the count to zero on any failure. It is the flakiness defense that single-shot pass/fail can't give.

Loop-Library theme: Quality Streak ("fixes product failures until a defined streak of realistic tests passes"). This is the native ork mechanism the prd-to-goal quality-streak recipe leans on.

Invocation

/ork:verify --streak=3 authentication flow     # need 3 greens in a row
/ork:verify --streak=3                          # continues an existing streak for the same scope

--streak=N requires N ≥ 2. When the flag is absent, verify behaves exactly as before (single pass/fail). The target may also come from the verify rubric's streak_target slot in src/skills/verify/rubric.json, an integer ≥ 2 validated by src/skills/shared/rubric.schema.json, so a configured value is schema-checked rather than silently ignored. The explicit flag always wins.

Parse it alongside the other flags in Argument Resolution:

STREAK_TARGET = None
for token in "$ARGUMENTS".split():
    if token.startswith("--streak="):
        STREAK_TARGET = int(token.split("=", 1)[1])   # explicit override; ValueError if not an integer
        if STREAK_TARGET < 2:
            raise ValueError("--streak=N requires N >= 2")
        SCOPE = SCOPE.replace(token, "").strip()
if STREAK_TARGET is None:
    STREAK_TARGET = rubric.get("streak_target")  # validated slot; may stay None

Ledger

The streak persists across independent verify runs in .claude/chain/verify-streak.json so each invocation extends (or breaks) the run before it:

{
  "schema": "verify-streak/1.0",
  "scope": "authentication flow",
  "target": 3,
  "current": 2,
  "met": false,
  "reset_count": 1,
  "last_run_ts": "2026-06-20T10:31Z",
  "history": [
    { "ts": "2026-06-20T10:01Z", "verdict": "fail", "composite": 5.4, "blocker": "security 3.2" },
    { "ts": "2026-06-20T10:14Z", "verdict": "pass", "composite": 7.8 },
    { "ts": "2026-06-20T10:31Z", "verdict": "pass", "composite": 8.1 }
  ]
}

scope keys the streak — switching scope starts a fresh streak. current is the live consecutive-pass count; met is current >= target. last_run_ts is the timestamp of the run that last wrote the ledger — the freshness stamp a /goal loop checks so it never trusts a met:true it didn't just produce (see Stale-ledger guard).

Scope keying is normalized. The raw scope string is trimmed and its internal whitespace collapsed before it keys the streak, so "auth flow" and "auth  flow" (two spaces, or a trailing space) extend the same streak instead of silently starting fresh ones:

def streak_key(scope: str) -> str:
    return " ".join(scope.split())   # trim + collapse runs of whitespace

Run protocol

Each /ork:verify --streak=N invocation:

key = streak_key(SCOPE)                     # normalized: trim + collapse whitespace
ledger = read(".claude/chain/verify-streak.json") or new_ledger(key, N)
if ledger.scope != key or ledger.target != N:
    ledger = new_ledger(key, N)            # scope/target change → fresh streak

verdict = run_full_verification()          # the normal 8-phase verify, UNCHANGED

if verdict == "READY FOR MERGE":
    ledger.current += 1                    # extend the streak
else:
    if ledger.current > 0: ledger.reset_count += 1
    ledger.current = 0                     # ANY non-ready verdict breaks it
ledger.history.append({ts, verdict, composite, blocker?})
ledger.met = ledger.current >= ledger.target
ledger.last_run_ts = now_iso()             # stamp THIS run — the freshness proof
write_atomic(".claude/chain/verify-streak.json", ledger)   # tmp + rename, never in-place

Atomic write (concurrency safety). The ledger is the counter, so a torn or last-writer-wins write corrupts the whole feature. Two verify runs on the same scope (e.g. parallel worktrees) must not race: write to verify-streak.json.tmp.<pid> then rename() over the target (an atomic filesystem op on POSIX). Never mutate the file in place. If two runs still interleave, rename-last-wins loses at most one increment — it never leaves a half-written ledger.
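The run protocol and the atomic write can be sketched together in plain Python. This assumes the ledger is handled as a JSON dict and that the verify verdict and composite are passed in by the caller; `record_run`, `new_ledger`, and `write_atomic` here are illustrative implementations of the pseudocode names, not the shipped ork code. `os.replace` performs the rename, and no `fsync` is done, so durability across a crash is not covered.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def new_ledger(scope: str, target: int) -> dict:
    return {"schema": "verify-streak/1.0", "scope": scope, "target": target,
            "current": 0, "met": False, "reset_count": 0,
            "last_run_ts": None, "history": []}


def write_atomic(path: Path, ledger: dict) -> None:
    """Write to a per-process temp file, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(f"{path.name}.tmp.{os.getpid()}")
    tmp.write_text(json.dumps(ledger, indent=2), encoding="utf-8")
    os.replace(tmp, path)  # atomic rename on POSIX; never mutates in place


def record_run(path: Path, scope: str, target: int, verdict: str, composite: float) -> dict:
    """Apply one verify result to the streak ledger and persist it."""
    key = " ".join(scope.split())  # trim + collapse whitespace
    if path.exists():
        ledger = json.loads(path.read_text(encoding="utf-8"))
    else:
        ledger = new_ledger(key, target)
    if ledger["scope"] != key or ledger["target"] != target:
        ledger = new_ledger(key, target)  # scope/target change: fresh streak

    ready = verdict == "READY FOR MERGE"
    if ready:
        ledger["current"] += 1
    else:
        if ledger["current"] > 0:
            ledger["reset_count"] += 1
        ledger["current"] = 0  # any non-ready verdict breaks the streak

    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    ledger["history"].append(
        {"ts": now, "verdict": "pass" if ready else "fail", "composite": composite})
    ledger["met"] = ledger["current"] >= ledger["target"]
    ledger["last_run_ts"] = now
    write_atomic(path, ledger)
    return ledger
```

A caller would invoke `record_run(Path(".claude/chain/verify-streak.json"), SCOPE, N, verdict, composite)` once per verify run, after the real suite has been re-executed.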

Reset rule: a streak breaks on any verdict other than READY FOR MERGE, whether that is a tripped dimension blocker, a failing test, or an IMPROVEMENTS RECOMMENDED verdict. One red zeroes the count. There is no partial credit.

Independence rule (the whole point): each run must re-execute the actual tests — no cached results, no "already passed last turn." A streak over cached runs proves nothing. If the suite is fast, run it fresh each turn; if it is slow, that cost is the price of trusting the green.

Verdict mapping

The streak gate sits above the normal verdict — it never loosens a blocker, it only withholds "done" until the streak is met:

Streak state	Reported verdict
this run not READY (blocker/fail)	the normal verdict (BLOCKED / IMPROVEMENTS RECOMMENDED) + `streak reset to 0/N`
READY but `current < target`	STREAK PROGRESS — `current`/`target` (not done; run again)
READY and `current >= target`	READY FOR MERGE (streak `target`/`target` met)

Always surface the count: STREAK 2/3 — one more green to merge or streak reset to 0/3 (security 3.2 < 4.0). The user must see how close (or how broken) the streak is.
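The mapping table can be expressed as a small pure function. This is a sketch under the assumption that `current` is the ledger count after this run has been applied; the exact message wording is illustrative, and `reported_verdict` is a hypothetical name.

```python
def reported_verdict(run_verdict: str, current: int, target: int, reason: str = "") -> str:
    """Map this run's verdict plus the streak state to the verdict shown to the user."""
    if run_verdict != "READY FOR MERGE":
        # The gate never loosens a blocker: pass the normal verdict through
        # and report the reset.
        suffix = f" ({reason})" if reason else ""
        return f"{run_verdict} + streak reset to 0/{target}{suffix}"
    if current < target:
        return f"STREAK PROGRESS {current}/{target} (not done; run again)"
    return f"READY FOR MERGE (streak {target}/{target} met)"


progress = reported_verdict("READY FOR MERGE", 2, 3)
done = reported_verdict("READY FOR MERGE", 3, 3)
reset = reported_verdict("BLOCKED", 0, 3, "security 3.2")
```

Keeping the count in every branch means the user always sees how close, or how broken, the streak is.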

Wiring into `/goal`

The streak gate is what makes the quality-streak recipe converge. The until-clause reads the ledger; each loop turn runs verify:

# Stamp the loop start; honor met only for a ledger written AFTER it.
LOOP_START="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
/goal until jq -e --arg t "$LOOP_START" '.met==true and .last_run_ts >= $t' .claude/chain/verify-streak.json
/goal abort-if turns > 15 OR tokens > 150000 OR no_progress_for_4_turns

no_progress_for_4_turns is deliberately generous: a streak that keeps resetting is progress information (it's surfacing real flakiness), so give it room before aborting.

Stale-ledger guard (the first-run race). /goal evaluates the until-clause at the top of each turn, before that turn's verify runs. If a previously completed streak for the same scope left met:true in the ledger, a bare .met==true check exits on turn 1 having run zero fresh verifications. The robust defense is for the until-clause to compare last_run_ts against the loop-start timestamp, so met:true is honored only when the ledger was written during this loop: you never trust a met you didn't just produce. This needs no rm and is safe even if a stale ledger exists. The older rm -f reset still works as a simpler fallback, but the timestamp guard is preferred because it also survives a mid-loop scope reuse. ISO-8601 UTC timestamps compare correctly as strings only when both sides use the same format and precision, so write last_run_ts in the same seconds-precision format as LOOP_START.
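The same freshness check can be written in Python by parsing the timestamps instead of comparing strings. This matters when precisions differ: the sample ledger stamps runs to the minute (`2026-06-20T10:31Z`), while `LOOP_START` includes seconds. A minimal sketch, with `parse_utc` and `streak_is_fresh_and_met` as hypothetical helper names and only those two timestamp shapes supported:

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' or 'YYYY-MM-DDTHH:MMZ' as an aware UTC datetime."""
    for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%MZ"):
        try:
            return datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {ts!r}")


def streak_is_fresh_and_met(ledger: dict, loop_start: str) -> bool:
    """Honor met:true only if the ledger was written at or after the loop start."""
    if not ledger.get("met") or not ledger.get("last_run_ts"):
        return False
    return parse_utc(ledger["last_run_ts"]) >= parse_utc(loop_start)


stale = {"met": True, "last_run_ts": "2026-06-20T10:31Z"}
fresh = {"met": True, "last_run_ts": "2026-06-20T11:05:00Z"}

stale_ok = streak_is_fresh_and_met(stale, "2026-06-20T11:00:00Z")
fresh_ok = streak_is_fresh_and_met(fresh, "2026-06-20T11:00:00Z")
# Mixed precision: parsed comparison rejects a ledger written before the loop
# started, whereas a raw string comparison would accept it.
mixed_ok = streak_is_fresh_and_met(stale, "2026-06-20T10:31:05Z")
```

In the mixed-precision case a raw string comparison would treat the stale stamp as newer, because `Z` sorts after `:`; parsing avoids that.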

Reuse in `/ork:cover`

cover already auto-heals up to 3 iterations. The same ledger + reset protocol applies: after generating tests, require the suite to pass N times consecutively before declaring coverage done — this catches flaky generated tests before they land. Same verify-streak/1.0 ledger, keyed by the cover scope. (Cover wiring is a follow-up; the protocol here is the shared contract.)

Anti-patterns

Bad	Why it fails	Good
Count cached "passes" toward the streak	Proves the cache is green, not the code	Re-run the real suite each turn
Let an IMPROVEMENTS RECOMMENDED verdict extend the streak	Streak then means "good enough once", not "green N times"	Only `READY FOR MERGE` extends; everything else resets
Streak target of 1	Identical to single-shot verify — no flakiness defense	N ≥ 2 (3 is the sensible default)
Raise the streak target to force a pass	Gaming in reverse — moving the goalposts	Target is set once, up front, from intent/policy

Verification Checklist

Pre-flight checklist for comprehensive feature verification with parallel agents.

Pre-Verification Setup

Context Gathering

Run git diff main --stat to understand change scope
Run git log main..HEAD --oneline to see commit history
Identify affected domains (backend/frontend/both)
Check for any existing failing tests

Task Creation (CC 2.1.16)

Create parent verification task
Create subtasks for each agent domain
Set proper dependencies if needed

Agent Dispatch Checklist

Required Agents (Full-Stack)

Agent	Launched	Completed	Status
code-quality-reviewer	[ ]	[ ]	Pending
security-auditor	[ ]	[ ]	Pending
test-generator	[ ]	[ ]	Pending
backend-system-architect	[ ]	[ ]	Pending
frontend-ui-developer	[ ]	[ ]	Pending

Optional Agents (Add as Needed)

Condition	Agent	Launched
AI/ML features	llm-integrator	[ ]
Performance-critical	frontend-performance-engineer	[ ]
Database changes	database-engineer	[ ]

Quality Gate Checklist

Mandatory Gates

Gate	Threshold	Actual	Pass
Test Coverage	>= 70%	___%	[ ]
Security Critical	0	___	[ ]
Security High	<= 5	___	[ ]
Type Errors	0	___	[ ]
Lint Errors	0	___	[ ]

Code Quality Gates

Check	Status
No console.log in production	[ ]
No `any` types	[ ]
Exhaustive switches (assertNever)	[ ]
Proper error handling	[ ]
No hardcoded secrets	[ ]

Frontend-Specific Gates (if applicable)

Check	Status
React 19 APIs used	[ ]
Zod validation on API responses	[ ]
Skeleton loading states	[ ]
Prefetching on links	[ ]
WCAG 2.1 AA compliance	[ ]

Backend-Specific Gates (if applicable)

Check	Status
REST conventions followed	[ ]
Pydantic v2 validation	[ ]
RFC 9457 error handling	[ ]
Async timeout protection	[ ]
No N+1 queries	[ ]

Evidence Collection

Required Evidence

Optional Evidence

E2E test screenshots
Performance benchmarks
Bundle size analysis
Accessibility audit

Report Generation

Report Sections

Final Steps

Update all task statuses to completed
Store verification evidence in context
Generate final report markdown

Quick Reference: Agent Prompts

code-quality-reviewer

Focus: Lint, type check, anti-patterns, SOLID, complexity

security-auditor

Focus: Dependency audit, secrets, OWASP Top 10, rate limiting

test-generator

Focus: Coverage gaps, test quality, edge cases, flaky tests

backend-system-architect

Focus: REST, Pydantic v2, RFC 9457, async safety, N+1

frontend-ui-developer

Focus: React 19, Zod, exhaustive types, skeletons, prefetch, a11y

Troubleshooting

Agent Not Responding

Check if agent was launched with run_in_background=True
Verify agent name matches exactly
Check for context window limits

Tests Failing

Run tests locally first
Check for missing dependencies
Verify test database state
Look for timing-dependent tests

Coverage Below Threshold

Identify uncovered files
Check for excluded patterns
Focus on critical paths first

Verification Manifest

Verification Manifest (VERIFIED vs CLAIMED)

The report's agent scores, tool summaries, and every "X is fixed / clean / passing" sentence are claims until the lead independently re-runs the proof. The single most common verification failure is relaying an agent's assertion as a fact — a sub-agent reports "tsc clean", the lead copies that into the report, it ships, and it does not work.

The Verification Manifest is a provenance ledger that closes that seam. For every load-bearing claim in the run, the lead records whether it was independently VERIFIED (with the exact command + output), merely CLAIMED (asserted by an agent / tool / doc, never re-run), or UNCHECKED (assumed). It is the operational form of the Verification Gate and Anti-Sycophancy rules — turning "no completion claims without fresh evidence" from a principle into a filled-in table.

Definitions

Load-bearing claim — one where flipping it false would change the verdict, ship a bug, or invalidate the merge. You do not manifest every trivial statement; you manifest the claims the merge rests on. If in doubt, it's load-bearing.

Provenance states

State	Meaning	Row must include
✅ VERIFIED	The lead ran the proof fresh this session and read the output.	The command · exit code · the key output line
🟡 CLAIMED	A sub-agent, tool summary, doc, or memory asserted it — not independently re-run by the lead.	Who asserted it
⬜ UNCHECKED	Assumed true; nobody ran a proof (e.g. "no other caller depends on this" with no grep).	Why it was assumed
⚪ WAIVED	A CLAIMED/UNCHECKED item deliberately accepted as non-blocking.	A one-line reason + issue ref if deferred

An agent reporting "PASS" and the lead copying that into the manifest is still 🟡 CLAIMED — not ✅ VERIFIED. VERIFIED means the lead ran it. The distinction is the whole point.

The rule (wired to the verdict)

Any load-bearing claim that is 🟡 CLAIMED or ⬜ UNCHECKED caps the verdict at IMPROVEMENTS RECOMMENDED — it cannot grade READY FOR MERGE — until it is ✅ VERIFIED or explicitly ⚪ WAIVED with a reason.

This sits alongside the dimension-level blockers (see grading-rubric.md): a strong composite score never launders an unverified load-bearing claim into "done." Both gates must clear. Under --streak=N, a run carrying any load-bearing CLAIMED/UNCHECKED item is not READY and therefore resets the streak to 0 (consistent with the existing reset rule).
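The cap is easy to state as code. The sketch below assumes each manifest row is a dict with an `id`, a `load_bearing` flag, and a `provenance` state; that row shape and the name `apply_manifest_cap` are hypothetical, chosen only to illustrate the rule.

```python
# Provenance states that cap the verdict when the claim is load-bearing.
CAPPING_STATES = {"CLAIMED", "UNCHECKED"}


def apply_manifest_cap(verdict: str, rows: list[dict]) -> tuple[str, list[int]]:
    """Return the (possibly capped) verdict and the ids of the unverified rows."""
    unverified = [
        row["id"] for row in rows
        if row["load_bearing"] and row["provenance"] in CAPPING_STATES
    ]
    if unverified and verdict == "READY FOR MERGE":
        return "IMPROVEMENTS RECOMMENDED", unverified
    # Weaker verdicts (e.g. BLOCKED) pass through; the cap never upgrades.
    return verdict, unverified


rows = [
    {"id": 1, "load_bearing": True, "provenance": "VERIFIED"},
    {"id": 2, "load_bearing": True, "provenance": "VERIFIED"},
    {"id": 3, "load_bearing": True, "provenance": "CLAIMED"},
    {"id": 4, "load_bearing": True, "provenance": "UNCHECKED"},
    {"id": 5, "load_bearing": True, "provenance": "WAIVED"},
]
capped_verdict, unverified_rows = apply_manifest_cap("READY FOR MERGE", rows)
```

Under `--streak=N`, a capped verdict is not READY FOR MERGE, so the same result would also reset the streak.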

How to build it (a synthesis step, after agents return, before the verdict)

1. Enumerate claims from three sources:
   - a. Agent output: every sub-agent approval / score / "PASS" / "clean" / "no X found" is a CLAIM by default.
   - b. The working narrative: every "X is fixed / passing / done" sentence.
   - c. Premises the change rests on: facts pulled from a doc, memory, or prior session ("the migration already ran", "the endpoint returns Y"). Docs and memory are Tier 2–4 (see the context-precedence rule), so they stay CLAIMED until re-checked at HEAD.
2. For each, decide: can I cheaply run the proof now?
   - Yes → run it fresh, capture command · exit · key line → ✅ VERIFIED.
   - No → 🟡 CLAIMED / ⬜ UNCHECKED; if load-bearing, it caps the verdict (or ⚪ WAIVE it with a reason).
3. Spend verification on the load-bearing few. You need not re-run everything; you need to never present a CLAIMED item as VERIFIED.

Template

## Verification Manifest

| # | Load-bearing claim | Asserted by | Provenance | Evidence (cmd · exit · key line) |
|---|--------------------|-------------|------------|----------------------------------|
| 1 | Types check clean (web) | frontend-ui-developer | ✅ VERIFIED | `pnpm --filter web type-check` · exit 0 · 0 errors |
| 2 | Unit suite green | test-generator | ✅ VERIFIED | `uv run pytest tests/unit -q` · exit 0 · 214 passed |
| 3 | No N+1 in order_service | backend-system-architect | 🟡 CLAIMED | agent asserted; not re-run → **caps verdict** |
| 4 | Migration a2f1b already applied | memory (2026-07-01) | ⬜ UNCHECKED | Tier-2 premise; re-check at HEAD before relying |
| 5 | Bundle within budget | — | ⚪ WAIVED | deferred, tracked #NNNN (non-blocking) |

**Manifest verdict impact:** rows 3 & 4 are load-bearing and unverified →
verdict capped at IMPROVEMENTS RECOMMENDED until VERIFIED or WAIVED.

Anti-patterns

Laundering — copying an agent's "PASS" into the table as ✅ VERIFIED without re-running. That's CLAIMED. VERIFIED is the lead's fresh run.
Optimism-marking — flipping everything to ✅ to clear the gate. The manifest measures honesty, not confidence. "Should be fine" is ⬜, not ✅.
Convenient omission — leaving a load-bearing claim off the table to dodge the verdict cap. The omission is the bug the manifest exists to catch.
Trusting agent numbers — a price, model-id, cost, or count emitted by a sub-agent is CLAIMED until checked against source. (Sub-agents have fabricated off-by-1000× figures and retired model ids.) Central-verify before the row goes ✅.

Effort scaling

`/effort`	Manifest
low (quick check)	Optional — skip unless a claim would block.
medium	Required for any claim that would block or pass the verdict.
high / xhigh	Full manifest, including doc/memory premises (source c).

Verification Phases

Verification Phases — Detailed Workflow

Phase Overview

Phase	Activities	Output
1. Context Gathering	Git diff, commit history	Changes summary
2. Parallel Agent Dispatch	6 agents evaluate	0-10 scores
2.5 Visual Capture	Screenshot routes, AI vision eval	Gallery + visual score
3. Test Execution	Backend + frontend tests	Coverage data
4. Nuanced Grading	Composite score calculation	Grade (A-F)
5. Improvement Suggestions	Effort vs impact analysis	Prioritized list
6. Alternative Comparison	Compare approaches (optional)	Recommendation
7. Metrics Tracking	Trend analysis	Historical data
8. Report Compilation	Evidence artifacts + gallery.html	Final report
8.5 Agentation Loop	User annotates, ui-feedback fixes	Before/after diffs

Phase 2: Parallel Agent Dispatch (6 Agents)

Launch ALL agents in ONE message with run_in_background=True and max_turns=25. Pass model=MODEL_OVERRIDE when the user supplies a --model= override, e.g. --model=opus (CC 2.1.72).

Agent	Focus	Output
code-quality-reviewer	Lint, types, patterns	Quality 0-10
security-auditor	OWASP, secrets, CVEs	Security 0-10
test-generator	Coverage, test quality	Coverage 0-10
backend-system-architect	API design, async	API 0-10
frontend-ui-developer	React 19, Zod, a11y	UI 0-10
python-performance-engineer	Latency, resources, scaling	Performance 0-10

Use python-performance-engineer for backend-focused verification or frontend-performance-engineer for frontend-focused verification. See Quality Model for Performance (0.11) and Scalability (0.09) weights.

Optionally add monitoring-engineer as a conditional observability verifier when the change touches services, handlers, background jobs, or infra (skip for pure UI/docs). It scores whether the new code is operable in production — structured logging on critical paths, metrics/SLIs, error/alert coverage — not just correct.

See Grading Rubric for detailed scoring criteria.

Task Tool Mode (Default)

# PARALLEL — All 6 in ONE message
Agent(
  subagent_type="ork:code-quality-reviewer",
  model=MODEL_OVERRIDE,  # None inherits default; "opus" for thorough verification (CC 2.1.72)
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify code quality. Score 0-10.
  Check: lint errors, type coverage, cyclomatic complexity, DRY, SOLID.
  Budget: 15 tool calls max.
  Return: score (0-10), reasoning, evidence, 2-3 improvement suggestions.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
Agent(
  subagent_type="ork:security-auditor",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Security verification. Score 0-10.
  Check: OWASP Top 10, secrets in code, dependency CVEs, auth patterns.
  Budget: 15 tool calls max.
  Return: score (0-10), vulnerabilities found, severity ratings.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
Agent(
  subagent_type="ork:test-generator",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify test coverage. Score 0-10.
  Check: test existence, type matching, quality, edge cases, coverage %.
  Run existing tests and report results.
  Budget: 15 tool calls max.
  Return: score (0-10), coverage %, gaps identified.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
Agent(
  subagent_type="ork:backend-system-architect",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify API design and backend patterns. Score 0-10.
  Check: REST conventions, async patterns, transaction boundaries, error handling.
  Budget: 15 tool calls max.
  Return: score (0-10), pattern compliance, issues found.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
Agent(
  subagent_type="ork:frontend-ui-developer",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify frontend implementation. Score 0-10.
  Check: React 19 patterns, Zod schemas, accessibility (WCAG 2.1 AA), loading states.
  Budget: 15 tool calls max.
  Return: score (0-10), pattern compliance, a11y issues.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
Agent(
  subagent_type="ork:python-performance-engineer",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify performance and scalability. Score 0-10.
  Check: latency hotspots, N+1 queries, resource usage, caching, scaling patterns.
  Budget: 15 tool calls max.
  Return: score (0-10), bottlenecks found, optimization suggestions.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=25
)
# Conditional 7th agent — observability completeness. Spawn ONLY when the change
# touches services/handlers, background jobs, or infra (skip for pure UI/docs).
Agent(
  subagent_type="ork:monitoring-engineer",
  model=MODEL_OVERRIDE,
  prompt="""# Cache-optimized: stable content first (CC 2.1.72)
  Verify observability completeness. Score 0-10.
  Check: structured logging on critical paths, metrics/SLIs for the new surface,
  error and alert coverage, trace propagation. Skip pure UI/docs changes.
  Budget: 12 tool calls max.
  Return: score (0-10), observability gaps, instrumentation suggestions.
  Feature: {feature}. Scope: ONLY review files in {scope_files}.""",
  run_in_background=True, max_turns=20
)

Agent Teams Alternative

In Agent Teams mode, form a verification team where agents share findings and coordinate scoring:

# CC 2.1.178+: one implicit team per session — no TeamCreate.
# Spawn teammates directly via Agent(name=...). Requires
# CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 (set in ork.settings.json).

Agent(subagent_type="ork:code-quality-reviewer", name="quality-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Verify code quality. Score 0-10.
     When you find patterns that affect security, message security-verifier.
     When you find untested code paths, message test-verifier.
     Share your quality score with all teammates for composite calculation.
     Feature: {feature}.""")

Agent(subagent_type="ork:security-auditor", name="security-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Security verification. Score 0-10.
     When quality-verifier flags security-relevant patterns, investigate deeper.
     When you find vulnerabilities in API endpoints, message api-verifier.
     Share severity findings with test-verifier for test gap analysis.
     Feature: {feature}.""")

Agent(subagent_type="ork:test-generator", name="test-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Verify test coverage. Score 0-10.
     When quality-verifier or security-verifier flag untested paths, quantify the gap.
     Run existing tests and report coverage metrics.
     Message the lead with coverage data for composite scoring.
     Feature: {feature}.""")

Agent(subagent_type="ork:backend-system-architect", name="api-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Verify API design and backend patterns. Score 0-10.
     When security-verifier flags endpoint issues, validate and score.
     Share API compliance findings with ui-verifier for consistency check.
     Feature: {feature}.""")

Agent(subagent_type="ork:frontend-ui-developer", name="ui-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Verify frontend implementation. Score 0-10.
     When api-verifier shares API patterns, verify frontend matches.
     Check React 19 patterns, accessibility, and loading states.
     Share findings with quality-verifier for overall assessment.
     Feature: {feature}.""")

# Conditional 6th agent — use python-performance-engineer for backend,
# frontend-performance-engineer for frontend
Agent(subagent_type="ork:python-performance-engineer", name="perf-verifier",
     team_name="verify-{feature}", model=MODEL_OVERRIDE,
     prompt="""# Cache-optimized: stable content first (CC 2.1.72)
     Verify performance and scalability. Score 0-10.
     Assess latency, resource usage, caching, and scaling patterns.
     When security-verifier flags resource-intensive endpoints, profile them.
     Share performance findings with api-verifier and quality-verifier.
     Feature: {feature}.""")

Team teardown after report compilation:

# After composite grading and report generation
# CC 2.1.178+: no TeamDelete — teammates wind down at turn end
# (press Ctrl+F twice to stop lingering background teammates).

# Worktree cleanup (CC 2.1.72)
ExitWorktree(action="keep")

Fallback: If team formation fails, use the standard Phase 2 Agent spawns above.

Manual cleanup: To stop lingering background teammates, press Ctrl+F twice (foreground/Stop).

Phase 2.5: Visual Capture (Parallel with Phase 2)

Runs as an additional parallel agent alongside the Phase 2 verification agents. See Visual Capture for full details.

# Launch IN THE SAME MESSAGE as Phase 2 agents
Agent(
  subagent_type="general-purpose",
  description="Visual capture and AI evaluation",
  prompt="""Visual verification capture for: {feature}
  1. Detect project type from package.json
  2. Start dev server (auto-detect framework)
  3. Discover routes (framework-aware scan)
  4. Use agent-browser to screenshot each route (max 20)
  5. Read each screenshot PNG for AI vision evaluation
  6. Score layout, accessibility, content completeness (0-10 per route)
  7. Read gallery template from ${CLAUDE_SKILL_DIR}/assets/gallery-template.html
  8. Generate gallery.html with base64-embedded screenshots
  9. Write to verification-output/{timestamp}/gallery.html
  10. Kill dev server

  If no frontend detected, write skip notice and exit.
  If server fails to start, write warning and exit.
  Never block — graceful degradation only.""",
  run_in_background=True, max_turns=30
)

Output: verification-output/{timestamp}/ folder with screenshots, AI evaluations (JSON), and gallery.html.

Phase 8.5: Agentation Visual Feedback (Opt-In)

Trigger: Only when agentation MCP is configured in .mcp.json. Runs AFTER Phase 8 report compilation.

# Check agentation availability
ToolSearch(query="select:mcp__agentation__agentation_get_all_pending")

# If available, offer user choice
AskUserQuestion(questions=[{
  "question": "Agentation detected. Annotate the live UI before finalizing?",
  "header": "Visual Feedback Loop",
  "options": [
    {"label": "Yes", "description": "I'll mark issues, ui-feedback agent fixes them, gallery updates with before/after"},
    {"label": "Skip", "description": "Finalize with current screenshots"}
  ]
}])

# If yes: watch → acknowledge → dispatch ui-feedback → re-screenshot → update gallery
# Max 3 rounds (configurable in verification-config.yaml)

Phase 4: Nuanced Grading

See Quality Model for scoring dimensions, weights, and grade interpretation. See Grading Rubric for detailed per-agent scoring criteria.

Phase 5: Improvement Suggestions

Each suggestion includes effort (1-5) and impact (1-5) with priority = impact/effort. See Quality Model for scale definitions and quick wins formula.
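The priority formula above can be sketched as follows. This is a minimal illustration, not the Quality Model's implementation: the `Suggestion` class, `rank_suggestions`, and the sample entries are hypothetical names and data introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    title: str
    effort: int  # 1-5, as described above
    impact: int  # 1-5, as described above

    @property
    def priority(self) -> float:
        # priority = impact / effort
        return self.impact / self.effort


def rank_suggestions(suggestions):
    """Return suggestions sorted by priority, highest first."""
    return sorted(suggestions, key=lambda s: s.priority, reverse=True)


# Hypothetical sample data for illustration only.
suggestions = [
    Suggestion("Add input validation", effort=2, impact=4),
    Suggestion("Rewrite caching layer", effort=5, impact=5),
    Suggestion("Fix typo in error message", effort=1, impact=1),
]

for s in rank_suggestions(suggestions):
    print(f"{s.priority:.1f}  {s.title}")
```

Low-effort, high-impact items rise to the top of the ranking, which is what the report's Priority column is meant to surface.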

Phase 6: Alternative Comparison (Optional)

See Alternative Comparison for template.

Use when:

Multiple valid approaches exist
User asked "is this the best way?"
Major architectural decisions made

Phase 8: Report Compilation

See Report Template for full format.

# Feature Verification Report

**Composite Score: [N.N]/10** (Grade: [LETTER])

## Top Improvement Suggestions
| # | Suggestion | Effort | Impact | Priority |
|---|------------|--------|--------|----------|
| 1 | [highest] | [N] | [N] | [N.N] |

## Verdict
**[READY FOR MERGE | IMPROVEMENTS RECOMMENDED | BLOCKED]**

Visual Capture

Visual Capture — Phase 2.5

Visual verification that produces browsable screenshot evidence with AI evaluation.

Architecture

Phase 2 agents (parallel)
         |
    Phase 2.5 (runs IN PARALLEL with Phase 2 agents)
         |
         v
┌─────────────────────────────────────────────────┐
│  1. Detect project type (package.json scan)      │
│  2. Start dev server (framework-aware)           │
│  3. Wait for server ready (poll localhost)        │
│  4. Discover routes (framework-aware)            │
│  5. agent-browser: navigate + screenshot each    │
│  6. Claude vision: evaluate each screenshot      │
│  7. Generate gallery.html (self-contained)       │
│  8. Stop dev server                              │
└─────────────────────────────────────────────────┘

Step 1: Project Type Detection

Scan codebase to determine framework and dev server command:

# PARALLEL — detect framework signals
Grep(pattern="\"next\":", glob="package.json", output_mode="content")
Grep(pattern="\"vite\":", glob="package.json", output_mode="content")
Grep(pattern="\"react-scripts\":", glob="package.json", output_mode="content")
Grep(pattern="\"vue\":", glob="package.json", output_mode="content")
Grep(pattern="\"nuxt\":", glob="package.json", output_mode="content")
Grep(pattern="\"@angular/core\":", glob="package.json", output_mode="content")
Glob(pattern="**/manage.py")
Glob(pattern="**/main.py")
Glob(pattern="**/app.py")
Glob(pattern="**/index.html")

Detection Matrix

Signal	Framework	Start Command	Default Port
`"next":` in package.json	Next.js	`npm run dev`	3000
`"vite":` in package.json	Vite	`npm run dev`	5173
`"react-scripts":`	CRA	`npm start`	3000
`"vue":` + no vite	Vue CLI	`npm run serve`	8080
`"nuxt":`	Nuxt	`npm run dev`	3000
`"@angular/core":`	Angular	`npx ng serve`	4200
`manage.py` exists	Django	`python manage.py runserver`	8000
`main.py`/`app.py` + FastAPI	FastAPI	`uvicorn app:app` (`uvicorn main:app` when the entry point is `main.py`)	8000
`index.html` only	Static	`npx serve .`	3000
None of the above	Skip visual capture	N/A	N/A
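The package.json rows of the matrix can be read as an ordered lookup. The sketch below is one possible reading, not the skill's actual detection code: `detect_framework` and `NODE_SIGNALS` are hypothetical names, and checking `nuxt` and `vite` before `vue` is an assumption made so that the "`vue` + no vite" rule holds and Nuxt projects (which usually also depend on `vue`) are not misclassified. The Python and static-site rows are left out.

```python
import json

# (dependency signal, framework, start command, default port)
# Order matters: more specific signals are checked before "vue" (assumption).
NODE_SIGNALS = [
    ("next", "Next.js", "npm run dev", 3000),
    ("nuxt", "Nuxt", "npm run dev", 3000),
    ("vite", "Vite", "npm run dev", 5173),
    ("react-scripts", "CRA", "npm start", 3000),
    ("vue", "Vue CLI", "npm run serve", 8080),
    ("@angular/core", "Angular", "npx ng serve", 4200),
]


def detect_framework(package_json_text):
    """Return (framework, start_command, port), or None if no signal matches."""
    pkg = json.loads(package_json_text)
    deps = {**pkg.get("dependencies", {}), **pkg.get("devDependencies", {})}
    for dep, framework, command, port in NODE_SIGNALS:
        if dep in deps:
            return framework, command, port
    return None
```

A `None` result corresponds to falling through to the non-Node rows of the matrix, or to skipping visual capture when nothing else matches.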

Override via Config

If .claude/verification-config.yaml exists with a visual section, use those settings instead of auto-detection.

Step 2: Start Dev Server

Bash(
  command=f"{start_command} &",
  description="Start dev server for visual capture",
  run_in_background=True
)

Wait for server readiness:

Bash(command=f"for i in $(seq 1 30); do curl -s http://localhost:{port} > /dev/null && exit 0; sleep 1; done; exit 1",
     description="Wait for dev server to be ready (max 30s)")

If server fails to start: Skip visual capture with a warning in the report. Do NOT block verification.

Step 3: Route Discovery

Next.js App Router

Glob(pattern="**/app/**/page.{tsx,jsx,ts,js}")
# Extract route from file path: app/dashboard/page.tsx → /dashboard

Next.js Pages Router

Glob(pattern="**/pages/**/*.{tsx,jsx,ts,js}")
# Exclude _app, _document, _error, api/
# Extract route: pages/about.tsx → /about
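The path-to-route mapping for both Next.js routers can be sketched as below. The function names are hypothetical, and the handling is deliberately partial: route groups such as `(marketing)` are dropped because Next.js omits them from the URL, while dynamic segments such as `[id]` are returned unchanged and would still need concrete values before they can be screenshotted.

```python
from pathlib import PurePosixPath

PAGES_EXCLUDED = {"_app", "_document", "_error"}


def app_router_route(path):
    """app/dashboard/page.tsx -> /dashboard ; app/page.tsx -> /"""
    parts = PurePosixPath(path).parts
    start = parts.index("app") + 1
    # Drop the trailing page file and any (route-group) folders.
    segments = [
        p for p in parts[start:-1]
        if not (p.startswith("(") and p.endswith(")"))
    ]
    return "/" + "/".join(segments)


def pages_router_route(path):
    """pages/about.tsx -> /about ; returns None for excluded files."""
    parts = PurePosixPath(path).parts
    rel = list(parts[parts.index("pages") + 1:])
    if not rel or rel[0] == "api":
        return None  # API routes are not pages
    stem = PurePosixPath(rel[-1]).stem
    if stem in PAGES_EXCLUDED:
        return None  # _app, _document, _error
    # index files map to their parent directory.
    segments = rel[:-1] + ([] if stem == "index" else [stem])
    return "/" + "/".join(segments)
```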

React Router

Grep(pattern="<Route.*path=[\"']([^\"']+)", glob="**/*.{tsx,jsx}", output_mode="content")

FastAPI / Express

Grep(pattern="@(app|router)\\.(get|post)\\([\"'](/[^\"']*)", glob="**/*.py", output_mode="content")
Grep(pattern="(app|router)\\.(get|post)\\([\"'](/[^\"']*)", glob="**/*.{ts,js}", output_mode="content")

Fallback

If no routes are discovered, screenshot just the root URL: http://localhost:{port}/

Max Routes

Cap at 20 routes to keep gallery manageable and generation fast. Prioritize:

Root /
Routes matching changed files (from Phase 1 git diff)
Routes with most sub-routes (likely important sections)
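One way to apply these priorities and the cap is a composite sort key. This is a sketch under assumptions: `prioritize_routes` is a hypothetical helper, the three criteria are treated as a strict ordering, and `changed_routes` is assumed to have already been derived from the Phase 1 diff (that mapping is not shown here).

```python
def prioritize_routes(routes, changed_routes, max_routes=20):
    """Order routes by the priorities above and cap the list at max_routes."""
    unique = list(dict.fromkeys(routes))  # de-duplicate, keep discovery order
    changed = set(changed_routes)

    def sub_route_count(route):
        prefix = route.rstrip("/") + "/"
        return sum(
            1 for other in unique
            if other != route and other.startswith(prefix)
        )

    def sort_key(route):
        return (
            0 if route == "/" else 1,      # root first
            0 if route in changed else 1,  # then routes touched by the diff
            -sub_route_count(route),       # then sections with most sub-routes
        )

    # sorted() is stable, so ties keep their discovery order.
    return sorted(unique, key=sort_key)[:max_routes]
```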

Step 4: Screenshot Capture

Use agent-browser to navigate and screenshot each route:

# For each route:
# 1. Navigate
agent-browser navigate http://localhost:{port}{route_path}
# 2. Wait for content
agent-browser wait-for-network-idle
# 3. Capture
agent-browser screenshot --full-page --path verification-output/{timestamp}/screenshots/{idx}-{slug}.png
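The `{idx}-{slug}.png` naming used in the capture command could be produced as in the sketch below. Both helpers are hypothetical, and the two-digit index and the `homepage` slug for `/` are assumptions inferred from the `01-homepage.png` example elsewhere in this document. The timestamp format is left to the caller.

```python
import re


def route_slug(route_path):
    """Turn a route path into a filesystem-safe slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", route_path.lower()).strip("-")
    return slug or "homepage"  # "/" has no segments of its own


def screenshot_path(timestamp, idx, route_path):
    """Build the output path for one route's screenshot."""
    return (
        f"verification-output/{timestamp}/screenshots/"
        f"{idx:02d}-{route_slug(route_path)}.png"
    )
```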

Auth-Protected Routes

If verification-config.yaml specifies auth:

# Login first
agent-browser navigate http://localhost:{port}/login
agent-browser fill "#email" "test@example.com"
agent-browser fill "#password" "test123"
agent-browser click "button[type=submit]"
agent-browser wait-for-navigation
# Then screenshot protected routes

Viewport Options

Default: 1280x720. If mobile: true in config, also capture at 375x812.

Step 5: AI Vision Evaluation

For each screenshot, use Claude's vision (Read tool on PNG) with a structured evaluation prompt:

Read(file_path=f"verification-output/{timestamp}/screenshots/{filename}")

Then evaluate using this prompt template (include it in the visual capture agent's instructions):

Evaluate this screenshot of route "{route_path}" against these 6 criteria.
For EACH criterion, provide a severity (ok/warning/error) and specific observation.
Do NOT use generic "looks good" — cite what you actually see.

1. LAYOUT: Overflow, alignment, spacing, responsive grid. Check: content cut off? Overlapping elements? Scroll needed?
2. NAVIGATION: Is nav present and functional? Sidebar, breadcrumbs, TOC visible? Active state correct?
3. CONTENT: Text readable? Headings hierarchical? Data populated (not placeholder/loading)? Counts/numbers accurate?
4. ACCESSIBILITY: Contrast sufficient? Focus indicators visible? Text size adequate? Color-only information?
5. INTERACTIVITY: Buttons/links styled consistently? Hover/focus states? Forms labeled? CTAs discoverable?
6. BRANDING: Consistent with site theme? Dark/light mode correct? Typography matches design system?

Output as JSON array — exactly 6 items, one per criterion:
[{"severity": "ok|warning|error", "message": "CRITERION: specific observation with evidence"}]
Score 0-10 based on: no warnings or errors = 9+, 1-2 warnings = 7-8, one error = 5-6, multiple errors = <5.

Per-route evaluation output (exactly 6 items, one per criterion, never a single line):

{
  "route": "/dashboard",
  "score": 7.5,
  "evaluation": [
    {"severity": "ok", "message": "LAYOUT: Content within viewport, no horizontal overflow, grid columns align properly"},
    {"severity": "ok", "message": "NAVIGATION: Sidebar present with 8 sections, 'Dashboard' correctly highlighted as active"},
    {"severity": "warning", "message": "CONTENT: Stats show '79 skills' but should be '89 skills' — stale count detected"},
    {"severity": "ok", "message": "ACCESSIBILITY: Body text ~16px on dark bg (#e6edf3 on #0d1117), contrast ratio ~13:1, passes WCAG AAA"},
    {"severity": "warning", "message": "INTERACTIVITY: Code block copy buttons present but no visible hover state change"},
    {"severity": "ok", "message": "BRANDING: Dark theme consistent, green accent (#3fb950) used for active states, monospace for code"}
  ]
}
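Because the evaluation comes back from a model, it may be worth checking its shape before it reaches the gallery. The validator below is a hypothetical sketch: it assumes six items in the same order as the criteria in the prompt, each message prefixed with its criterion name as in the example above.

```python
CRITERIA = [
    "LAYOUT", "NAVIGATION", "CONTENT",
    "ACCESSIBILITY", "INTERACTIVITY", "BRANDING",
]
SEVERITIES = {"ok", "warning", "error"}


def validate_evaluation(result):
    """Return a list of problems; an empty list means the result is well-formed."""
    problems = []
    items = result.get("evaluation", [])
    if len(items) != len(CRITERIA):
        problems.append(f"expected {len(CRITERIA)} items, got {len(items)}")
    for item, criterion in zip(items, CRITERIA):
        if item.get("severity") not in SEVERITIES:
            problems.append(f"{criterion}: invalid severity {item.get('severity')!r}")
        if not item.get("message", "").startswith(f"{criterion}:"):
            problems.append(f"{criterion}: message must start with '{criterion}:'")
    score = result.get("score")
    if not isinstance(score, (int, float)) or not 0 <= score <= 10:
        problems.append(f"score out of range: {score!r}")
    return problems
```

A non-empty result could be used to re-prompt for that route rather than embedding a malformed evaluation.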

Cross-Route Summary

After evaluating all routes, synthesize a summary object for the gallery:

# Build summary from all per-route evaluations
summary = {
  "total_routes": len(routes),
  "avg_score": round(sum(r.score for r in routes) / len(routes), 1),
  "pass_count": len([r for r in routes if r.score >= 7]),
  "warn_count": len([r for r in routes if 5 <= r.score < 7]),
  "fail_count": len([r for r in routes if r.score < 5]),
  "common_issues": [  # Issues appearing on 2+ routes
    {"count": 3, "message": "Stale skill count (79 instead of 89) on 3/5 pages"},
    {"count": 2, "message": "Code block copy buttons lack hover state feedback"}
  ],
  "strengths": [  # Positive patterns across routes
    "Consistent dark theme and typography across all pages",
    "Sidebar navigation present and correctly highlights active page"
  ]
}

Include this summary in GALLERY_JSON alongside routes.
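A runnable version of the numeric part of the summary might look like the sketch below, assuming each route is a dict shaped like the per-route evaluation output. `build_summary` is a hypothetical name. Matching `common_issues` by exact message text is a simplification, since in practice the agent would group similar observations by judgment, and `strengths` is omitted for the same reason.

```python
from collections import Counter


def build_summary(routes):
    """Aggregate per-route evaluations into a cross-route summary."""
    scores = [r["score"] for r in routes]
    issue_counts = Counter()
    for r in routes:
        # Count each distinct non-ok message once per route.
        issue_counts.update(
            {i["message"] for i in r["evaluation"] if i["severity"] != "ok"}
        )
    return {
        "total_routes": len(routes),
        # Guard against an empty route list.
        "avg_score": round(sum(scores) / len(scores), 1) if scores else 0.0,
        "pass_count": sum(1 for s in scores if s >= 7),
        "warn_count": sum(1 for s in scores if 5 <= s < 7),
        "fail_count": sum(1 for s in scores if s < 5),
        "common_issues": [
            {"count": n, "message": msg}
            for msg, n in issue_counts.most_common()
            if n >= 2  # issues appearing on 2+ routes
        ],
    }
```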

Step 6: Gallery Generation

Read the gallery template:

Read(file_path="${CLAUDE_SKILL_DIR}/assets/gallery-template.html")

Build the GALLERY_JSON data structure:

{
  "branch": "feat/new-feature",
  "date": "2026-03-10",
  "timestamp": "2026-03-10T14:30:00Z",
  "compositeScore": 8.2,
  "visualScore": 7.8,
  "routes": [
    {
      "id": "homepage",
      "name": "Homepage",
      "path": "/",
      "screenshot": "data:image/png;base64,...",
      "score": 8.5,
      "evaluation": [
        {"severity": "ok", "message": "Layout consistent"},
        {"severity": "warning", "message": "Hero image loading slowly"}
      ],
      "annotations": [],
      "apiResponse": null
    }
  ]
}

Base64 encoding: Convert each PNG to base64 for self-contained HTML:

base64 -i screenshots/01-homepage.png

Size guard: If the total HTML exceeds 10 MB, compress the screenshots or reduce the gallery to the top 10 routes.
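The encoding step and the size guard can be sketched in Python as follows. The helper names are hypothetical, and the guard here measures only the embedded screenshot payload as an approximation of the final HTML size. It also assumes the route list is already in priority order, so truncating keeps the most important routes.

```python
import base64
from pathlib import Path

MAX_GALLERY_BYTES = 10 * 1024 * 1024  # 10 MB limit from the size guard


def png_to_data_uri(png_bytes):
    """Encode raw PNG bytes as a data URI for self-contained HTML."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def file_to_data_uri(path):
    """Read a PNG from disk and return its data URI."""
    return png_to_data_uri(Path(path).read_bytes())


def apply_size_guard(routes, max_bytes=MAX_GALLERY_BYTES, fallback_count=10):
    """Keep all routes if they fit; otherwise keep the first fallback_count."""
    total = sum(len(r["screenshot"]) for r in routes)
    return routes if total <= max_bytes else routes[:fallback_count]
```

Truncation is the simpler of the two options; re-encoding or downscaling the screenshots would preserve route coverage at the cost of an extra image-processing dependency.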

Write the final gallery:

Write(file_path=f"verification-output/{timestamp}/gallery.html", content=rendered_html)

Step 7: Cleanup

# Kill dev server
Bash(command=f"kill $(lsof -ti :{port}) 2>/dev/null || true", description="Stop dev server")

Phase 8.5: Agentation Loop (Opt-In)

Trigger: Only when agentation MCP is configured in .mcp.json.

# Check if agentation is available
ToolSearch(query="select:mcp__agentation__agentation_get_all_pending")

If available, offer the user:

AskUserQuestion(questions=[{
  "question": "Agentation is configured. Want to annotate the UI before finalizing?",
  "header": "Visual Feedback Loop",
  "options": [
    {"label": "Yes, let me annotate", "description": "I'll mark issues on the live UI, then ui-feedback agent fixes them"},
    {"label": "Skip", "description": "Finalize gallery with current screenshots"}
  ]
}])

If yes:

# 1. Watch for annotations
mcp__agentation__agentation_get_all_pending()

# 2. For each annotation:
mcp__agentation__agentation_acknowledge(annotationId=id)

# 3. Dispatch ui-feedback agent
Agent(subagent_type="ork:ui-feedback",
  prompt="Process agentation annotation: {annotation}. Fix the issue, then resolve.",
  run_in_background=True)

# 4. After fixes, re-screenshot affected routes
# 5. Save before/after pairs
# 6. Update gallery with annotation diffs

Max Rounds

Default 3 rounds of annotate-fix-verify. Configurable in verification-config.yaml.

Graceful Degradation

Failure	Behavior
No frontend detected	Skip visual capture, log info in report
Dev server won't start	Skip visual capture with warning
agent-browser unavailable	Skip screenshots, try `curl` for API-only
Screenshot fails on a route	Skip that route, continue with others
Base64 output too large	Compress or reduce route count
Agentation not configured	Skip Phase 8.5 entirely (no prompt)
Auth flow fails	Skip protected routes, screenshot public only

All 5 dimensions rated (0-10 scale)
Weights applied correctly (20/25/20/20/15)
Composite score calculated
Grade letter assigned (A+ to F)

Evidence Collected

Improvements Documented

Each suggestion has effort estimate (1-5)
Each suggestion has impact estimate (1-5)
Priority calculated (Impact / Effort)
Quick wins identified (low effort, high impact)

Alternatives Considered

Current approach scored
At least one alternative evaluated
Migration cost estimated
Recommendation documented

Policy Compliance

No blocking rule violations
Warning rules acknowledged
Thresholds checked (composite, security, coverage)

Report Generated

All sections filled
Verdict assigned (Ready/Recommended/Blocked)
Tasks updated to completed