Fixes GitHub issues using parallel analysis agents for root cause investigation, code exploration, and regression detection. Reads issue context from gh CLI, searches codebase and memory for related patterns, generates a fix with tests, and links the resolution back to the issue via PR. Includes prevention analysis to avoid recurrence. Use when debugging errors, resolving regressions, fixing bugs, or triaging issues.
Opus 4.6: Root cause analysis uses native adaptive thinking. Dynamic token budgets scale with context window for thorough investigation.
CC ≥ 2.1.119 multi-host note (M122): Issue fetching works against GitHub, GitLab, Bitbucket, and GitHub Enterprise. The argument is either a numeric ID (use the configured default remote's host) or a full URL (parsed via parsePrUrl/parseIssueUrl from src/hooks/src/lib/pr-host-parser.ts). Branch on the detected host family for the right CLI: gh issue view (GitHub/GHE), glab issue view (GitLab), bb issue view (Bitbucket). Reference: src/skills/chain-patterns/references/pr-from-platform.md.
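The host branching described above can be sketched as follows. This is a minimal illustration and not the parsePrUrl/parseIssueUrl implementation from pr-host-parser.ts. The function names, the substring-based host matching, and the assumption that the issue ID is the last path segment of the URL are all hypothetical. The three CLI command prefixes are the ones named in the note.

```python
from typing import List
from urllib.parse import urlparse

# CLI prefixes taken from the note above; the dict itself is illustrative.
CLI_BY_FAMILY = {
    "github": ["gh", "issue", "view"],
    "gitlab": ["glab", "issue", "view"],
    "bitbucket": ["bb", "issue", "view"],
}


def detect_host_family(hostname: str) -> str:
    """Classify a hostname into a host family (hypothetical heuristic)."""
    host = hostname.lower()
    if "gitlab" in host:
        return "gitlab"
    if "bitbucket" in host:
        return "bitbucket"
    # Assumption: everything else (github.com or a GitHub Enterprise host)
    # is treated as GitHub-family.
    return "github"


def issue_view_command(argument: str, default_host: str = "github.com") -> List[str]:
    """Build the issue-view command for a numeric ID or a full issue URL."""
    if argument.isdigit():
        # Numeric ID: use the configured default remote's host.
        return CLI_BY_FAMILY[detect_host_family(default_host)] + [argument]

    parsed = urlparse(argument)
    if not parsed.hostname:
        raise ValueError(f"not a numeric ID or URL: {argument!r}")

    # Assumption: the issue ID is the last path segment.
    issue_id = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    if not issue_id.isdigit():
        raise ValueError(f"no issue ID at end of URL path: {argument!r}")
    return CLI_BY_FAMILY[detect_host_family(parsed.hostname)] + [issue_id]
```

URLs that carry a title slug after the ID would need extra handling that this sketch leaves out.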
ISSUE_NUMBER = "$ARGUMENTS[0]"  # e.g., "123" (CC 2.1.59 indexed access)
# $ARGUMENTS contains the full argument string
# $ARGUMENTS[0] is the first space-separated token
Run BEFORE any other step. Detect available MCP servers and check for resumable state.
# Probe MCPs (parallel, all in ONE message):
# memory is alwaysLoad in .mcp.json (CC 2.1.121+, #1541); the probe below is kept as a fallback for older CC:
ToolSearch(query="select:mcp__memory__search_nodes")
ToolSearch(query="select:mcp__context7__resolve-library-id")

# Write capability map:
Write(".claude/chain/capabilities.json", JSON.stringify({
  "memory": <true if found>,
  "context7": <true if found>,
  "timestamp": now()
}))

# Check for resumable state:
Read(".claude/chain/state.json")
# If exists and skill == "fix-issue":
#   Read last handoff, skip to current_phase
#   Tell user: "Resuming from Phase {N}"
# If not exists: write initial state
Write(".claude/chain/state.json", JSON.stringify({
  "skill": "fix-issue",
  "issue": ISSUE_NUMBER,
  "current_phase": 1,
  "completed_phases": [],
  "capabilities": capabilities
}))
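The resume-or-initialize decision above can be sketched in plain Python. This is an illustration under assumptions and not the skill's actual implementation: the function name is hypothetical, and treating a state file that is unreadable or belongs to a different skill as "start fresh" is a choice the steps above do not spell out.

```python
import json
from pathlib import Path
from typing import Tuple


def load_or_init_state(state_path: Path, issue: str, capabilities: dict) -> Tuple[dict, bool]:
    """Return (state, resumed). Hypothetical helper mirroring the steps above."""
    if state_path.exists():
        try:
            state = json.loads(state_path.read_text())
        except json.JSONDecodeError:
            state = None  # Assumption: a corrupt file is treated as no state.
        if isinstance(state, dict) and state.get("skill") == "fix-issue":
            # Resumable: the caller skips ahead to state["current_phase"].
            return state, True

    # No resumable state: write the initial state shown above.
    state = {
        "skill": "fix-issue",
        "issue": issue,
        "current_phase": 1,
        "completed_phases": [],
        "capabilities": capabilities,
    }
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps(state, indent=2))
    return state, False
```

A caller would check the returned flag and, when it is true, tell the user which phase is being resumed.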
Before diagnosis kicks off, optionally invoke scripts/prior_fix_lookup.py <session-dir> to surface similar fixes already recorded in the memory MCP. READ-ONLY — no writeback. Self-skips on every non-happy-path so it never blocks the fix:
Auto-skip conditions (all exit 0, all WARN-logged):
| Skip reason | Trigger |
|---|---|
| signal absent | error_text missing OR signature extractor returns None |
| yg-mcp-core not importable | yg-mcp-core>=0.3.0 not installed (orchestkit is public; yg-mcp-core lives on private pypi.yonyon.ai, HQ-only) |
| memory MCP unreachable | MCP server down OR .mcp.json doesn't define memory |
Session dir must contain fix-issue-input.json (with error_text: str). The signature extractor (signature_lib.extract_signature) normalizes Python tracebacks, JS stack traces, and generic <Type>: <msg> errors to a <error_type> <primary_path>:<lineno> shape used as the search_nodes query. Handoff JSON at <session-dir>/prior-fix-matches.json records status, signature, and matches_count; the top-3 matches land in <session-dir>/prior-fix-matches.md as a Markdown table.
Mirrors the memory-consumer pattern from PR #1889 but read-only. Closes orchestkit#1895.
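To make the `<error_type> <primary_path>:<lineno>` shape concrete, here is a rough sketch of that kind of normalization. It is not signature_lib.extract_signature, whose internals are not shown here. The regexes, the function name, the choice of innermost frame for Python and top frame for JS, and the fallback of returning only the error type for frameless errors are all assumptions.

```python
import re
from typing import Optional

# Python traceback frame:  File "path", line N
PY_FRAME = re.compile(r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+)', re.M)
# JS stack frame:  at fn (path:line:col)  or  at path:line:col
JS_FRAME = re.compile(r'^\s*at .*?\(?(?P<path>[^\s()]+):(?P<line>\d+):\d+\)?\s*$', re.M)
# Generic "<Type>: <msg>" line starting in column 0
ERROR_LINE = re.compile(r'^(?P<type>[A-Za-z_][\w.]*):\s', re.M)


def extract_signature_sketch(error_text: str) -> Optional[str]:
    """Normalize an error dump to '<error_type> <path>:<lineno>' (illustrative)."""
    if not error_text or not error_text.strip():
        return None
    errors = ERROR_LINE.findall(error_text)
    if not errors:
        return None  # Corresponds to the "signal absent" skip reason.

    py_frames = PY_FRAME.findall(error_text)
    if py_frames:
        # Python prints the innermost frame last and the error line at the end.
        path, line = py_frames[-1]
        return f"{errors[-1]} {path}:{line}"

    js_frames = JS_FRAME.findall(error_text)
    if js_frames:
        # JS prints the error line first and the throwing frame right after it.
        path, line = js_frames[0]
        return f"{errors[0]} {path}:{line}"

    # Assumption: with no frames available, fall back to the error type alone.
    return errors[0]
```

The resulting string would then be passed as the search_nodes query.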
BEFORE doing ANYTHING else (after MCP probe), create tasks to track progress:
# 1. Create main task IMMEDIATELY
TaskCreate(
  subject="Fix Issue: #{ISSUE_NUMBER}",
  description="Systematic issue resolution with RCA and prevention",
  activeForm="Fixing issue #{ISSUE_NUMBER}"
)

# 2. Create subtasks for each key phase
TaskCreate(subject="Understand issue", activeForm="Reading issue details")
TaskCreate(subject="Hypothesis & RCA", activeForm="Analyzing root cause")
TaskCreate(subject="Implement fix", activeForm="Applying fix with tests")
TaskCreate(subject="Validate & prevent", activeForm="Validating fix and prevention")
TaskCreate(subject="Commit and PR", activeForm="Creating PR for fix")

# 3. Set dependencies for sequential phases
TaskUpdate(taskId="3", addBlockedBy=["2"])
TaskUpdate(taskId="4", addBlockedBy=["3"])
TaskUpdate(taskId="5", addBlockedBy=["4"])
TaskUpdate(taskId="6", addBlockedBy=["5"])

# 4. Before starting each task, verify it's unblocked
task = TaskGet(taskId="2")  # Verify blockedBy is empty

# 5. Update status as you progress
TaskUpdate(taskId="2", status="in_progress")  # When starting
TaskUpdate(taskId="2", status="completed")    # When done
Once the approach is chosen, ask whether to run CI locally before pushing — orthogonal to fix depth:
# Skip when invocation flag is explicit:
#   /ork:fix-issue 123 --local-ci         → skip, run full suite locally
#   /ork:fix-issue 123 --security-only    → skip, security tests only
#   /ork:fix-issue 123 --push-and-let-ci  → skip, no local run
#
# Force local-CI when issue has security or data-loss labels (warns user it overrode their choice).
AskUserQuestion(questions=[{
  "question": "Before push?",
  "header": "Local CI",
  "options": [
    {"label": "Push and let CI run (default)",
     "description": "Fastest round-trip, CI catches failures",
     "markdown": "```\nPush + Remote CI\n────────────────\n fix ──▶ commit ──▶ push ──▶ CI\n │\n v\n 5-15 min\n + fastest local turnaround\n - failures discovered remotely\n - rebase if CI red on main move\n```"},
    {"label": "Run full suite locally first",
     "description": "~2-3 min extra; catches CI failures locally before push",
     "markdown": "```\nLocal Full Suite + Push\n───────────────────────\n fix ──▶ commit ──▶ npm test ──▶ push\n │\n v\n ~2-3 min\n + catches all CI failures locally\n - slower per-iteration\n recommended when issue label =\n security | data-loss\n```"},
    {"label": "Run security tests only",
     "description": "~30s; covers the usual blocker class: secrets, deps, common vulns",
     "markdown": "```\nSecurity-only + Push\n────────────────────\n fix ──▶ commit ──▶ test:security ──▶ push\n │\n v\n ~30s\n + catches secrets/deps/owasp\n + faster than full suite\n - misses lint/type/unit issues\n```"}
  ]
}])
Override rule: if the issue's GitHub labels include security or data-loss, override the user's selection with "Run full suite locally first" and surface a one-line notification: "Security/data-loss label detected — running full local suite as a precaution." The user can still bypass with the --push-and-let-ci arg, which logs the bypass for audit.
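The override rule can be expressed as a small decision function. This is a sketch under assumptions: the function and dataclass names are hypothetical, label matching is assumed to be case-insensitive, and the audit logging is reduced to a boolean the caller would act on.

```python
from dataclasses import dataclass
from typing import Iterable

# Option labels copied from the question above.
FULL_SUITE = "Run full suite locally first"
PUSH_ONLY = "Push and let CI run (default)"
SENSITIVE_LABELS = frozenset({"security", "data-loss"})


@dataclass(frozen=True)
class CiDecision:
    choice: str          # the option that will actually run
    overridden: bool     # True when the user's pick was replaced
    bypass_logged: bool  # True when --push-and-let-ci bypassed the override


def resolve_local_ci(user_choice: str, labels: Iterable[str],
                     push_and_let_ci: bool = False) -> CiDecision:
    """Apply the security/data-loss override (illustrative helper)."""
    sensitive = any(label.strip().lower() in SENSITIVE_LABELS for label in labels)
    if not sensitive:
        return CiDecision(user_choice, overridden=False, bypass_logged=False)
    if push_and_let_ci:
        # Explicit bypass: honour the flag, but record it for audit.
        return CiDecision(PUSH_ONLY, overridden=False, bypass_logged=True)
    return CiDecision(FULL_SUITE, overridden=(user_choice != FULL_SUITE),
                      bypass_logged=False)
```

When `overridden` is true, the caller would surface the one-line notification. When `bypass_logged` is true, it would write the audit entry.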
If 'Investigate first' selected:
# 1. Enter read-only plan mode
EnterPlanMode("Investigate issue: $ISSUE_REF")

# 2. Investigation phase: Read/Grep/Glob ONLY, no Write/Edit
#    - Read the issue description and linked context
#    - Trace the error path through relevant code
#    - Search for related issues, past fixes, test failures
#    - Build hypothesis list with evidence

# 3. Produce RCA report:
#    - Root cause hypothesis (ranked by confidence)
#    - Affected files and blast radius
#    - Recommended approach (proper fix vs quick fix)
#    - Risk assessment

# 4. Exit plan mode: returns analysis for user decision
ExitPlanMode()

# 5. User reviews RCA. If "proceed with fix" → continue to Phase 5 (Fix).
#    If "need more info" → re-enter investigation.
Load Read("${CLAUDE_SKILL_DIR}/rules/evidence-gathering.md") for detailed workflow adjustments per approach.
Choose Agent Teams (mesh) or Task tool (star). Load Read("${CLAUDE_SKILL_DIR}/references/agent-selection.md") for the selection criteria, cost comparison, and task creation patterns.
When the issue involves a running web app, API, or UI bug, discover services and inspect visually before forming hypotheses:
# 1. Discover services via Portless (preferred)
portless list 2>/dev/null
#   api → api.localhost:1355 (port 8080)
#   app → app.localhost:1355 (port 3000)

# 2. Fallback: discover ports manually
lsof -iTCP -sTCP:LISTEN -nP | grep -E 'node|python|java'

# 3. Visual inspection with agent-browser
agent-browser open "http://app.localhost:1355"
agent-browser screenshot /tmp/issue-before.png  # capture broken state
agent-browser console                           # check for JS errors
agent-browser network log                       # inspect failed API calls
agent-browser get text @error-banner            # extract error messages
Use Portless named URLs (*.localhost:1355) in all investigation steps — they're stable, self-documenting, and eliminate port-guessing failures. Install with npm i -g portless.
Output results incrementally as each phase completes — don't batch until the PR:
| After Phase | Show User |
|---|---|
| 1. Understand Issue | Problem statement, affected files |
| 3. Hypothesis Formation | Ranked hypotheses with confidence scores |
| 4. RCA | Confirmed root cause, evidence chain |
| 6. Implementation | Fix description, files changed |
| 7. Validation | Test results, before/after behavior |
For the proper fix path with 5 parallel RCA agents, output each agent's findings as they return — don't wait for all 5. If one agent identifies the root cause with high confidence early, flag it immediately so the user can confirm and skip remaining agents.
After Phase 11 (commit + PR), schedule CI monitoring:
# Guard: Skip cron in headless/CI (CLAUDE_CODE_DISABLE_CRON)
# if env CLAUDE_CODE_DISABLE_CRON is set, run a single check instead
CronCreate(
  schedule="*/5 * * * *",
  prompt="Check CI for PR #{pr_number}: gh pr checks {pr_number} --repo {repo}. All pass → CronDelete this job. Any fail → alert with details."
)
If worktree isolation was used in Phase 4, clean up after validation:
# After Phase 7 validation passes: exit worktree, keep branch for PR
ExitWorktree(action="keep")
Every EnterWorktree or isolation: "worktree" agent must have a matching cleanup. If agents used isolation: "worktree", they handle their own exit — but if the lead entered a worktree in Step 0, it must call ExitWorktree before Phase 11 commit.
Fault tree analysis, AND/OR gates, critical systems
Push notifications (CC 2.1.110+): Issue-fix flows can span 10–20 min with RCA → fix → test → PR. When the fix lands and tests pass, call PushNotification so the user knows the fix is ready for review. Requires Remote Control + "Push when Claude decides" config; fails silently if unavailable.
When spawning the 5 RCA agents (debug-investigator, code-quality-reviewer, test-generator, etc.) — whether in-session via the Agent tool or headless via claude -p --bare — set explicit per-role flags so behaviour is deterministic across interactive and CI runs:
| Agent role | --permission-mode | --effort |
|---|---|---|
| RCA / investigation (debug-investigator, Explore) | dontAsk | low to medium |
| Test reproduction (test-generator) | acceptEdits | medium |
| Fix authoring (production code) | default (keep user in loop) | medium to high |
| Verification (code-quality-reviewer) | dontAsk | low |
Never use bypassPermissions — fix-issue's RCA phase often touches code paths; the audit trail matters. For headless invocations (e.g. from /ork:ci-sentinel or a cron-driven bug sweep), pass the flags explicitly:
claude -p --bare \
  --permission-mode dontAsk \
  --effort medium \
  --max-turns 12 \
  "/ork:fix-issue <N>"
Before claiming any fix is complete, apply the 5-step verification gate: Read("${CLAUDE_PLUGIN_ROOT}/skills/shared/rules/verification-gate.md"). "Should work now" is not evidence: run the test, read the output, cite the result.
When reporting fix status, follow Read("${CLAUDE_PLUGIN_ROOT}/skills/shared/rules/anti-sycophancy.md"): state findings directly, with no performative language. Use the agent status protocol: DONE, DONE_WITH_CONCERNS, BLOCKED, or NEEDS_CONTEXT.
browser-tools - Visual inspection with agent-browser + Portless
ork:issue-progress-tracking - Auto-updates from commits
ork:remember - Store lessons learned
Session recovery (CC 2.1.108+): After idle periods or interruptions, use /recap to restore conversational context alongside checkpoint-resume state. Enabled by default since CC 2.1.110 (even with telemetry disabled).
If the AskUserQuestion picker stalls (CC 2.1.139 input bug — orchestkit#1795), set ORK_ASK_FALLBACK=text before starting CC. The lifecycle/ask-fallback-injector hook injects a reminder telling the assistant to pose options inline as a numbered list and ask the user to reply with the option number.
If 'Plan first' selected: Call EnterPlanMode("Investigate issue #$ISSUE: $TITLE"), perform research using Read/Grep/Glob only, then ExitPlanMode with the plan for user approval before proceeding.
Based on answer, adjust workflow:
Proper fix: All 11 phases, parallel agents for RCA
Top-down, deductive analysis mapping all paths to a failure using boolean logic (AND/OR gates). Best for critical systems and exhaustive failure analysis.
A minimal cut set is the smallest set of basic events that together cause the top event:
Top: User Cannot Authenticate (OR gate)
  Cut Set 1: {Wrong Password}: single point of failure
  Cut Set 2: {Expired Token}: single point of failure
  Cut Set 3: {DB Down}: single point of failure
  Cut Set 4: {Account Locked}: single point of failure
Single-event cut sets indicate no redundancy — add defense-in-depth.
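Cut sets like the ones listed above can be derived mechanically from a tree of AND/OR gates. The following is a minimal sketch. The nested-tuple tree encoding and the function names are assumptions made for illustration. An OR gate contributes the union of its children's cut sets, and an AND gate contributes every combination of one cut set per child.

```python
from itertools import product
from typing import FrozenSet, List, Set


def cut_sets(node) -> Set[FrozenSet[str]]:
    """All cut sets of a tree. A node is a basic-event string or (gate, children)."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any single child failing is enough.
        return set().union(*child_sets)
    if gate == "AND":
        # Every child must fail: merge one cut set from each child.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate!r}")


def minimal_cut_sets(node) -> List[FrozenSet[str]]:
    """Drop any cut set that strictly contains another one."""
    sets = cut_sets(node)
    minimal = [s for s in sets if not any(other < s for other in sets)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


# The authentication example above: one OR gate over four basic events.
auth_tree = ("OR", ["Wrong Password", "Expired Token", "DB Down", "Account Locked"])
single_points = [s for s in minimal_cut_sets(auth_tree) if len(s) == 1]
```

For the authentication tree, all four minimal cut sets have size one, which is the "no redundancy" signal. If "DB Down" were replaced by an AND gate over two hypothetical replicas, that cut set would grow to two events.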
Incorrect — jumping to one cause without category analysis:
### API Latency Spike Analysis
**Root Cause:** N+1 query in user endpoint
**Fix:** Add query optimization
Correct — fishbone analysis across all categories:
### API Latency Spike: Fishbone Analysis

**Code:**
- N+1 query in user endpoint (CONFIRMED via query log)
- Sync blocking call to external API

**Infrastructure:**
- DB connection pool exhausted (CONFIRMED: 0 available connections)
- Network saturation (ruled out: 20% utilization)

**Dependencies:**
- Redis timeout increased (ruled out: within SLA)

**Configuration:**
- Connection pool size too small (CONFIRMED: 10 max, need 50)

**Process:**
- No load testing in CI (process gap)

**Root Causes (cross-category):**
1. N+1 query (Code) + small pool (Config) = exhaustion
2. Missing load tests (Process) = undetected before prod

**Actions:**
- Fix N+1 query immediately
- Increase pool size 10 → 50
- Add load tests to CI
**Problem:** Website was down for 2 hours

**Why 1:** The application server ran out of memory and crashed.
**Why 2:** A memory leak in the image processing service accumulated over time.
**Why 3:** The service wasn't releasing image buffers after processing.
**Why 4:** The cleanup code had a bug introduced in last week's release.
**Why 5:** We don't have automated memory leak detection in our test suite.

**Root Cause:** Missing automated memory leak testing
**Action:** Add memory profiling to CI pipeline, add cleanup tests
Concurrent failures — 5 Whys assumes linear causation
Incorrect — stopping at symptom without root cause:
**Problem:** Website was down for 2 hours
**Why 1:** The application server crashed.
**Action:** Restart the server
Correct — drilling down to root cause with 5 Whys:
**Problem:** Website was down for 2 hours

**Why 1:** The application server ran out of memory and crashed.
  Evidence: Out-of-memory error in logs
**Why 2:** A memory leak in the image processing service accumulated over time.
  Evidence: Memory usage increased 2GB/hour in metrics
**Why 3:** The service wasn't releasing image buffers after processing.
  Evidence: Code review shows missing .dispose() calls
**Why 4:** The cleanup code had a bug introduced in last week's release.
  Evidence: Git blame + diff shows removal of cleanup in PR #234
**Why 5:** We don't have automated memory leak detection in our test suite.
  Evidence: No memory profiling in CI pipeline

**Root Cause:** Missing automated memory leak testing

**Actions:**
- Add memory profiling to CI pipeline
- Add cleanup tests for image processing
- Revert PR #234's cleanup removal
In Agent Teams mode, form an investigation team where RCA agents share hypotheses and evidence in real-time:
TeamCreate(team_name="fix-issue-{number}", description="RCA for issue #{number}")

Agent(subagent_type="debug-investigator", name="root-cause-tracer",
      team_name="fix-issue-{number}",
      prompt="""Trace the root cause for issue #{number}: {issue description}
      Hypotheses: {hypothesis list from Phase 3}
      Test each hypothesis. When you find evidence supporting or refuting a hypothesis,
      message impact-analyst and the relevant domain expert (backend-expert or frontend-expert).
      If you find conflicting evidence, share it with ALL teammates for debate.""")

Agent(subagent_type="debug-investigator", name="impact-analyst",
      team_name="fix-issue-{number}",
      prompt="""Analyze the impact and blast radius for issue #{number}.
      When root-cause-tracer shares evidence, assess how many code paths are affected.
      Message test-planner with affected paths so they can plan regression tests.
      If the impact is larger than expected, message the lead immediately.""")

Agent(subagent_type="backend-system-architect", name="backend-expert",
      team_name="fix-issue-{number}",
      prompt="""Investigate backend aspects of issue #{number}.
      When root-cause-tracer shares backend-related hypotheses, design the fix approach.
      Message frontend-expert if the fix affects API contracts.
      Share fix design with test-planner for test requirements.""")

Agent(subagent_type="frontend-ui-developer", name="frontend-expert",
      team_name="fix-issue-{number}",
      prompt="""Investigate frontend aspects of issue #{number}.
      When root-cause-tracer shares frontend-related hypotheses, design the fix approach.
      If backend-expert changes API contracts, adapt the frontend fix accordingly.
      Share component changes with test-planner.""")

Agent(subagent_type="test-generator", name="test-planner",
      team_name="fix-issue-{number}",
      prompt="""Plan regression tests for issue #{number}.
      When root-cause-tracer confirms the root cause, write a failing test that reproduces it.
      When backend-expert or frontend-expert share fix designs, plan verification tests.
      Start with the regression test BEFORE the fix is applied (TDD approach).""")
Team teardown after fix is implemented and validated:
Launch ALL 5 agents in ONE message with run_in_background=True and max_turns=25.
Agents that edit files SHOULD use isolation: "worktree" to prevent conflicts:
# PARALLEL: all 5 in ONE message
Agent(
  subagent_type="debug-investigator",
  prompt="""# Cache-optimized: stable content first (CC 2.1.73)
  ROOT CAUSE TRACING

  1. Trace the code path that triggers the bug
  2. Identify the exact line/condition causing the failure
  3. Check git blame for when the bug was introduced

  SUMMARY: End with: "RESULT: Root cause is [X] in [file:line] - introduced in [commit]"

  Issue: #$ARGUMENTS
  Investigate the primary hypothesis: {hypothesis_1}
  Evidence files: {relevant_files}
  """,
  isolation="worktree",
  run_in_background=True,
  max_turns=25
)
Agent(
  subagent_type="debug-investigator",
  prompt="""# Cache-optimized: stable content first (CC 2.1.73)
  IMPACT ANALYSIS

  Assess the blast radius of the confirmed root cause:
  1. What other code paths are affected?
  2. Are there similar patterns elsewhere that might have the same bug?
  3. What's the user-facing impact scope?

  SUMMARY: End with: "RESULT: Impact scope: [N] files, [M] code paths affected"

  Issue: #$ARGUMENTS
  Evidence files: {relevant_files}
  """,
  run_in_background=True,
  max_turns=25
)
Agent(
  subagent_type="backend-system-architect",
  prompt="""# Cache-optimized: stable content first (CC 2.1.73)
  BACKEND FIX DESIGN

  Design the fix approach for the backend:
  1. Propose minimal code changes to resolve the root cause
  2. Identify edge cases the fix must handle
  3. Assess risk of regression

  SUMMARY: End with: "RESULT: Fix requires changes to [N] files - risk: [low/medium/high]"

  Issue: #$ARGUMENTS
  """,
  run_in_background=True,
  max_turns=25
)
Agent(
  subagent_type="frontend-ui-developer",
  prompt="""# Cache-optimized: stable content first (CC 2.1.73)
  FRONTEND FIX DESIGN

  Design the fix approach for the frontend (if applicable):
  1. UI/UX impact of the bug and proposed fix
  2. Component changes needed
  3. Accessibility implications

  SUMMARY: End with: "RESULT: Frontend [affected/not affected] - [N] components to update"

  Issue: #$ARGUMENTS
  """,
  run_in_background=True,
  max_turns=25
)
Agent(
  subagent_type="test-generator",
  prompt="""# Cache-optimized: stable content first (CC 2.1.73)
  TEST REQUIREMENTS

  Define the test plan for the fix:
  1. Write a FAILING regression test that reproduces the bug
  2. Identify edge cases that must be covered
  3. Match test types to the fix using the Test Requirements Matrix

  SUMMARY: End with: "RESULT: [N] tests needed - regression test targets [file:function]"

  Issue: #$ARGUMENTS
  """,
  run_in_background=True,
  max_turns=25
)
Each agent outputs structured findings and a SUMMARY line.
# Search for past fixes
mcp__memory__search_nodes(query="fix authentication error")

# Search by error type
mcp__memory__search_nodes(query="TypeError resolution")

# Search by component
mcp__memory__search_nodes(query="UserService bug")