Orchestrator Tick Loop Lifecycle
The orchestrator is Symphony's execution engine — a daemon process that polls the database every 5 seconds, dispatching agents, detecting failures, and advancing phase transitions. This document explains the tick loop's internal state machine, counter lifecycle, exit handler decision tree, and race condition defenses so that contributors can understand the execution flow without reading source code.
Tick Loop State Machine
Each tick executes 8 phases in a fixed, deterministic order. The order is intentional: phase transitions run first to unblock completed work, reconcile runs before dispatch to clean up crashed agents, and retries run before new dispatch to prioritize recovery over fresh work.
Graceful Shutdown: When `stop()` is called, the tick loop drains the current in-flight tick before proceeding with agent shutdown. Each sub-operation checks `shuttingDown` between steps via a `do-while(false)` early-exit pattern, and `awaitShutdown()` waits for the tick completion promise with a timeout cap of `min(5s, 20% of shutdownTimeout)`. Health metrics are written in a `finally` block to guarantee execution even during shutdown.
Phase Reference Table
| # | Phase | Function(s) | Purpose | Source |
|---|---|---|---|---|
| 1 | Phase Transitions | processPhaseTransitions() | Validate completion claims, swap phase labels, dispatch judges for review | phaseTransition.ts:31 → orchestrator.ts |
| 2 | Pipeline Safety | detectPipelineMismatches(), assignMissingPhaseLabels() | Fix data integrity — ensure issues have correct phase labels | pipelineMismatch.ts → orchestrator.ts |
| 3 | Needs-Revision Expiry | processNeedsRevisionExpiry() | Circuit breaker for stuck revisions — expires stale needs-revision labels | phaseTransition.ts → orchestrator.ts |
| 4 | Reconcile | reconcile() | Detect dead agents (PID check), stall timeouts, DB orphans, issue-level orphans | orchestrator.ts |
| 5 | Process Retries | processRetries() | Dispatch due entries from the retry queue with exponential backoff | orchestrator.ts |
| 6 | Dispatch Intake | dispatchIntakeBatches() | Process pending batch intake requests (bulk issue creation) | orchestrator.ts |
| 7 | Dispatch | runDispatchCycle() | Fill available slots: judges, phase agents, workers, scanners | dispatch/cycle.ts |
| 8 | Background Workers | workerRegistry.runDue() | Learning consolidation, artifact pruning, prompt metrics, WAL checkpoints | orchestrator.ts |
| — | Health Metrics | writeHealthSnapshot() | Record tick timing and system health to disk (runs in finally block) | orchestrator.ts |
Why This Order?
- Phase transitions first — unblocks agents waiting for phase advancement. A completed research artifact should advance to architecture before the next dispatch tries to assign a new researcher.
- Pipeline safety second — repairs data inconsistencies before any dispatch decisions are made.
- Reconcile before dispatch — frees slots held by dead agents so dispatch can use them. Without this, a crashed agent would block its slot until the next tick after dispatch.
- Retries before new dispatch — honors backoff commitments. Issues that already failed deserve their scheduled retry before fresh work gets dispatched.
- Intake before dispatch — intake batches may create new issues that become dispatch candidates.
- Background workers and health last — non-critical maintenance that shouldn't delay core dispatch work.
Tick Error Handling
Each sub-operation runs inside its own try/catch. If any sub-operation throws, the error is logged and subsequent sub-operations continue. An outer try/catch acts as a safety net for unexpected errors. A finally block guarantees that health metrics (writeHealthSnapshot()) and tick timing (tickTracker.recordTick()) always execute, even during errors or shutdown.
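This error-isolation structure can be sketched as follows; `runTick` and its argument shapes are illustrative stand-ins, not the actual orchestrator API:

```typescript
// Each sub-operation is isolated in its own try/catch; a throw is recorded
// and the remaining sub-operations still run. The finally block guarantees
// health metrics are written even when sub-operations fail.
type SubOp = { name: string; run: () => void };

function runTick(ops: SubOp[], writeHealth: () => void): string[] {
  const errors: string[] = [];
  try {
    for (const op of ops) {
      try {
        op.run();
      } catch (err) {
        // log and continue with the next sub-operation
        errors.push(`${op.name}: ${(err as Error).message}`);
      }
    }
  } finally {
    writeHealth(); // always executes, even on errors or shutdown
  }
  return errors;
}
```

A failing middle sub-operation does not prevent later ones (or the health write) from running.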
Graceful Shutdown Drain
When stop() is called, the orchestrator uses a two-phase shutdown:
- Tick drain — `awaitShutdown()` waits for the current in-flight tick to complete via `tickCompletionPromise`, with a timeout cap of `min(5s, 20% of shutdownTimeout)`. This ensures in-flight phase transitions and DB writes complete cleanly.
- Agent shutdown — After the tick drains (or times out), the existing agent wait logic sends SIGTERM and polls for process exit.
Inside tick(), each sub-operation is followed by an if (this.shuttingDown) break check inside a do-while(false) block. When shutdown is signaled mid-tick, remaining sub-operations are skipped but the finally block still runs, ensuring health metrics are written and the tick completion promise resolves.
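A minimal sketch of the early-exit pattern, with hypothetical sub-operation names standing in for the real ones:

```typescript
type TickCtx = { shuttingDown: boolean; log: string[] };

// Each sub-operation is followed by a shutdown check; `break` exits the
// do-while(false) block, skipping the remaining sub-operations, while the
// finally block still records health metrics.
function tick(ctx: TickCtx): void {
  try {
    do {
      ctx.log.push('phaseTransitions');
      if (ctx.shuttingDown) break;
      ctx.log.push('reconcile');
      if (ctx.shuttingDown) break;
      ctx.log.push('dispatch');
    } while (false);
  } finally {
    ctx.log.push('healthSnapshot'); // runs whether or not the tick was cut short
  }
}
```

The `do-while(false)` wrapper exists purely so that `break` can jump past the remaining straight-line steps without a chain of nested `if`s.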
Source: orchestrator.ts:tick(), orchestrator.ts:awaitShutdown()
Counter Lifecycle
The orchestrator tracks four in-memory counter maps to detect repeated failures and trigger circuit breakers. All counters are scoped per-issue and stored in the OrchestratorContext object.
| Counter | Increment Condition | Reset Condition | Circuit Break | Scope |
|---|---|---|---|---|
| consecutiveFailures | Agent exits with exitCode !== 0 | Agent exits with exitCode === 0 (any success) | >= maxRetries (default 3) OR totalRuns >= maxRetries | Per-issue, tracks crashes and timeouts |
| consecutiveStaleMates | Worker produces no git changes (detectWorkerStalemate returns true) | Worker produces git changes OR autoCreatePr succeeds | >= maxStaleRuns (default 3) OR totalRuns >= maxRetries | Per-issue, worker agents only |
| consecutiveNoSignals | Agent produces no work product (no PR, no claim, no subtasks) | Agent produces any work product | Does not circuit break — only increases retry backoff | Per-issue, all agent types |
| totalRuns (DB query) | Every agent run (inserted into agent_runs) | Never resets — absolute count | >= maxRetries — absolute safety net | Per-issue, prevents runaway loops |
Counter State Diagrams

State diagrams for `consecutiveFailures`, `consecutiveStaleMates`, and `consecutiveNoSignals` illustrate the increment, reset, and circuit-break transitions summarized in the table above.
Key Constraints
- In-memory only — all counters are lost on orchestrator restart. The `totalRuns` query against the `agent_runs` table provides a persistent safety net.
- Consecutive, not cumulative — a single success resets `consecutiveFailures` and `consecutiveNoSignals` to zero. This prevents one bad run from permanently penalizing an issue.
- Backoff uses consecutive counter — the `attempt` parameter passed to `retryQueue.scheduleRetry()` must be the consecutive counter value, not `totalRuns`. Using `totalRuns` would cause excessive backoff on issues that had early successes.
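The consecutive semantics can be sketched with a small helper; `bumpConsecutive` is an illustrative name, not a real function in the codebase:

```typescript
// Consecutive counter: any success resets to zero, so only an uninterrupted
// failure streak trips the circuit breaker.
function bumpConsecutive(
  counters: Map<number, number>,
  issueId: number,
  succeeded: boolean,
): number {
  const next = succeeded ? 0 : (counters.get(issueId) ?? 0) + 1;
  counters.set(issueId, next);
  return next;
}
```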
Source: exitHandler.ts:280-476, orchestrator.ts:43-58
Exit Handler Decision Tree
When an agent process exits, the exit handler determines what to do based on exit code, agent type, and work product. Every path through the handler updates counters, DB records, and optionally schedules retries.
Exit Path Summary
| Path | Counter(s) Affected | Action | Next Status |
|---|---|---|---|
| Issue done/cancelled (Layer 2) | None | Cleanup worktree, return | done/cancelled |
| Failed (exitCode !== 0) | consecutiveFailures++ | Circuit break or schedule retry | todo |
| Success + agent changed status | consecutiveFailures reset | Respect agent decision | (agent-set) |
| Success + pending completion claim | consecutiveFailures reset | Leave for transition processor | in_progress |
| Success + phase agent | consecutiveFailures reset | Leave for transition processor | in_progress |
| Success + judge no verdict | Judge comment count++ | Move to todo (>= 3) or leave in review | review or todo |
| Success + worker PR created | staleMates reset | Auto-create PR, move to review | review |
| Success + worker stalemate | staleMates++ | Circuit break (>= 3) or retry | todo |
| Success + planner has subtasks | noSignals reset | Return | in_progress |
| Success + no work product | noSignals++ | Retry with backoff or max-runs circuit break | todo |
Source: exitHandler.ts:160-538
Race Condition Defense
Symphony uses four defensive layers to prevent duplicate dispatch, orphaned agents, and stale state. Each layer catches problems that earlier layers might miss.
Layer 1: Atomic Claim (Pre-Dispatch)
Before dispatching an agent, the orchestrator checks the issue's current status fresh from the database and atomically sets it to in_progress. If the status has already changed (another tick claimed it, or a human moved it), dispatch is aborted.
- Protects against: Concurrent dispatch attempts, human status changes before claim
- Implementation: `dispatchSetup.ts:claimIssueForDispatch()`
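As a sketch, the claim can be modeled with an in-memory map standing in for the issues table; in the real system this is a single conditional UPDATE, and the function shape here is an assumption:

```typescript
// Atomic claim: succeed only if the status is still 'todo' at claim time.
// Any intervening change (another tick, a human) makes the claim fail,
// and the caller aborts dispatch.
function claimIssueForDispatch(statuses: Map<number, string>, issueId: number): boolean {
  if (statuses.get(issueId) !== 'todo') return false; // lost the race
  statuses.set(issueId, 'in_progress');
  return true;
}
```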
Layer 2: Post-Exit Failsafe
After an agent exits, the exit handler re-reads the issue from the database before processing. If the issue became done or cancelled during the agent's execution (e.g., a human merged the PR manually), the handler cleans up the worktree and returns early — skipping retry logic entirely.
- Protects against: Issues completed manually while an agent was running
- Implementation: `exitHandler.ts:250-277`
Layer 3: Reconcile GC (Three-Level Orphan Detection)
The reconcile phase runs on every tick and detects orphaned agents at three levels:
| Level | Detection Method | Action | Catches |
|---|---|---|---|
| 1. PID liveness | process.kill(pid, 0) on in-memory agents | Trigger exit handler with exitCode=1 | Crashed agent processes |
| 2. DB orphan | Query agent_runs with status=running, check PID alive, skip if in ctx.running | Mark run as failed, reset issue to todo | Agents that died between ticks, orchestrator restarts |
| 3. Issue orphan | Query issues with status=in_progress, skip if in ctx.running or has DB-level running run | Reset issue to todo | Exit handler failures, partial DB updates |
Additionally, reconcile checks for stall timeouts — agents that have been running longer than stall_timeout_ms are killed via SIGTERM.
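The Level 1 probe relies on standard Unix semantics: sending signal 0 delivers nothing and only checks whether the PID exists. A minimal Node.js sketch:

```typescript
// process.kill(pid, 0) performs an existence check without sending a signal;
// it throws (typically ESRCH) if the process is gone.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch {
    return false;
  }
}
```

Note that on some systems `kill` can throw `EPERM` for a live process owned by another user; since the orchestrator spawned its agents itself, that case does not arise here.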
Synthetic Key Handling
The ctx.running map uses issue IDs as keys for regular agents, but uses synthetic keys like intake:<batchId> for non-issue agents (intake batch processors). The reconcile loop guards against these synthetic keys to prevent handleAgentExit from receiving non-issue keys:
- Detection: `isSyntheticRunningKey()` checks if a key starts with a known prefix (`intake:`, etc.)
- Dead PID handling: If a synthetic agent's PID is dead, `handleSyntheticAgentDeath()` handles cleanup inline — removing from the running map, cleaning temp files, and marking the intake batch as failed
- Stall detection skipped: Synthetic agents have their own exit handlers (in `intakeDispatcher.ts`), so reconcile skips stall detection for them
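A plausible sketch of the prefix check; only the `intake:` prefix is confirmed by this document, so the list shape is an assumption:

```typescript
// Synthetic keys mark non-issue agents in ctx.running (e.g. intake batch
// processors). Only 'intake:' is documented; others would be added here.
const SYNTHETIC_PREFIXES = ['intake:'];

function isSyntheticRunningKey(key: string): boolean {
  return SYNTHETIC_PREFIXES.some((prefix) => key.startsWith(prefix));
}
```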
Source: syntheticKeys.ts, orchestrator.ts:reconcile()
- Protects against: Crashed processes, orchestrator restarts, partial DB updates
- Implementation: `orchestrator.ts:249-345`, `syntheticKeys.ts`
Layer 4: Dispatch Guard (Double-Dispatch Prevention)
After launching an agent subprocess but before recording it in ctx.running, the dispatcher checks if an agent already exists for this issue. If so, it kills the new process, marks the run as failed, and aborts. This prevents the ctx.running map (keyed by issueId) from being overwritten, which would orphan the existing agent.
- Protects against: Double-dispatch in the same tick cycle, race between phase transition judge dispatch and regular dispatch
- Implementation: `agentDispatcher.ts:479-491`
Status Transitions
Issues follow a canonical status transition map with three ownership domains:
Transition Ownership
| Transition | Owner | Trigger |
|---|---|---|
| backlog → todo | Orchestrator | Auto-promote when no todo work exists |
| todo → in_progress | Orchestrator | Atomic claim at dispatch time |
| in_progress → review | Agent (MCP create_pr) or Orchestrator (autoCreatePr) | PR created from worktree changes |
| in_progress → todo | Orchestrator | Agent failure, stalemate, or retry |
| in_progress → blocked | Agent (MCP update_issue_status) | Dependency not met |
| blocked → todo | Orchestrator | autoUnblockParents() when children complete |
| review → done | Human only | Merge PR via UI |
| review → todo | Judge (MCP reject_pr) or Orchestrator | PR rejected or judge exhausted retries |
| done → todo | Human | Reopen issue |
| cancelled → backlog/todo | Human | Restore issue |
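The ownership table implies a transition map that can guard every status write. This sketch is derived from the table above and is illustrative, not the actual `statusTransitions.ts` API:

```typescript
// Canonical transition map: each status lists the statuses it may move to.
const allowedTransitions: Record<string, string[]> = {
  backlog: ['todo'],
  todo: ['in_progress'],
  in_progress: ['review', 'todo', 'blocked'],
  blocked: ['todo'],
  review: ['done', 'todo'],
  done: ['todo'],
  cancelled: ['backlog', 'todo'],
};

// Reject any write whose (from, to) pair is not in the map.
function canTransition(from: string, to: string): boolean {
  return allowedTransitions[from]?.includes(to) ?? false;
}
```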
Source: server/utils/statusTransitions.ts, app/utils/statusTransitions.ts
Retry Queue
Failed or stalled agents are scheduled for retry with exponential backoff. The retry queue is in-memory and processed each tick during the "Process Retries" phase.
Backoff Formula
`delay = min(10_000 * 2^(attempt - 1), maxRetryBackoffMs)`

Where `attempt` is the consecutive failure/stalemate/no-signal count (not total runs).
Backoff Schedule
| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | 10s | 10s |
| 2 | 20s | 30s |
| 3 | 40s | 70s |
| 4 | 80s | 150s |
| 5 | 160s | 310s |
| 6+ | 300s (max, configurable) | 610s+ |
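The formula translates directly to code; `retryDelayMs` is an illustrative name, and the 300s cap matches the default `maxRetryBackoffMs` in the schedule above:

```typescript
// Exponential backoff: 10s base, doubling per consecutive attempt,
// capped at maxRetryBackoffMs (default 300s).
function retryDelayMs(attempt: number, maxRetryBackoffMs = 300_000): number {
  return Math.min(10_000 * 2 ** (attempt - 1), maxRetryBackoffMs);
}
```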
Important: Consecutive Counter, Not Total Runs
The attempt parameter must be the consecutive counter value. Using totalRuns causes excessive backoff — an issue with 10 historical runs but only 1 recent failure would get a 5-minute backoff instead of 10 seconds.
```typescript
// Correct — consecutive failures as attempt
const failures = (ctx.consecutiveFailures.get(issueId) ?? 0) + 1
ctx.retryQueue.scheduleRetry(issueId, identifier, failures, 'timeout')

// Wrong — total runs causes excessive backoff
ctx.retryQueue.scheduleRetry(issueId, identifier, totalRuns, 'timeout')
```

Retry Processing
Each tick, processRetries() pulls due entries from the queue and dispatches them:
1. Remove entry from queue
2. Re-read issue from DB — skip if status is not `todo`
3. Determine phase from issue labels
4. Check concurrency limits (global and per-project)
5. If no slots available, re-queue with `attempt + 1`
6. Dispatch agent with the correct phase and recovery strategy (if any)
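The steps above can be sketched as a single function; all names and callback shapes here are illustrative:

```typescript
type RetryEntry = { issueId: number; attempt: number };

// Process one due retry entry: skip if the issue moved out of todo,
// re-queue with a bumped attempt if no slot is free, else dispatch.
function processRetryEntry(
  entry: RetryEntry,
  getStatus: (issueId: number) => string,
  hasSlot: () => boolean,
  dispatch: (issueId: number) => void,
  requeue: (entry: RetryEntry) => void,
): 'skipped' | 'requeued' | 'dispatched' {
  if (getStatus(entry.issueId) !== 'todo') return 'skipped';
  if (!hasSlot()) {
    requeue({ ...entry, attempt: entry.attempt + 1 }); // back off further
    return 'requeued';
  }
  dispatch(entry.issueId);
  return 'dispatched';
}
```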
Context Recovery Cascade
When agents fail due to token/context limits, a progressive recovery cascade adjusts dispatch context instead of retrying identically. Strategies escalate from budget reduction → output compaction → fresh dispatch with learnings → model escalation. Token errors fast-track to output compaction. Recovery attempts count against the same max_retries budget — no additional retries are created.
See docs/patterns/retry-with-backoff.md § "Context Recovery Cascade" for full details.
Source: contextRecovery.ts, retryQueue.ts, orchestrator.ts:processRetries()
Phase Transition Flow
Phase transitions are the mechanism by which issues advance through the readiness pipeline. Agents submit completion claims; the orchestrator validates and transitions.
Precondition Checks
Before validating an artifact, the transition processor verifies:
- Phase match — the claim's `completed_phase` matches the issue's current `phase:*` label
- Artifact exists — the file at the resolved artifact path exists on disk
- Required sections — the artifact contains all sections defined in the contract's `required_sections`
If preconditions fail, the claim is marked as processed with a rejection reason, and a needs-revision label is added.
Validation Modes
| Mode | Behavior | Used By |
|---|---|---|
| structural | Auto-advance if artifact exists with required sections | phase:research |
| judge | Dispatch judge to review artifact; advance only on approve_phase | phase:architecture |
| trust | Structural checks run as warnings, always advances | (configurable) |
Adaptive Phase Skipping
Agents can recommend skipping phases by including skip_to in their completion claim. For example, a research agent assessing a simple ticket can skip directly to phase:ready. Skipped phases get skip:<phase> labels for traceability.
Source: phaseTransition.ts:31-440, contractParser.ts
Reconcile (GC)
The reconcile phase is the orchestrator's garbage collector — it detects and cleans up agents and issues that are in inconsistent states. It runs on every tick before dispatch.
Three-Level Detection
| Level | What It Checks | Detection | Recovery Action |
|---|---|---|---|
| 1. In-memory PID | Agents in ctx.running | process.kill(pid, 0) — throws if process dead | Trigger exit handler with exitCode=1 |
| 2. DB orphan | agent_runs with status=running not in ctx.running | PID liveness check on DB records | Mark run as failed, reset issue to todo |
| 3. Issue orphan | Issues with status=in_progress and no active agent | No entry in ctx.running AND no running agent_run | Reset issue to todo |
Stall Detection
In addition to orphan detection, reconcile checks for stalled agents — processes that are still alive but have exceeded stall_timeout_ms. Stalled agents are killed via SIGTERM, and their exit is processed as a failure with timedOut: true.
Why Three Levels?
Each level catches a different failure mode:
- Level 1 catches processes that crashed between ticks (most common)
- Level 2 catches processes that died but whose in-memory entry was already cleaned up (e.g., orchestrator restart with a different `ctx.running` state)
- Level 3 is defense-in-depth — catches cases where the `agent_run` was marked as failed but the issue status was never reset (e.g., the `.catch` block in dispatch updated `agentRuns` but threw before updating `issues`)
Source: orchestrator.ts:249-345
Dispatch Priority
The dispatch cycle fills available slots in a strict priority order within each project:
| Priority | Category | Condition | Rationale |
|---|---|---|---|
| 0 | Setup Wizard | wizardCompleted === false | Must configure project before any work |
| 1 | Judge | Issues in review status | Unblocks the pipeline — reviewed PRs become mergeable |
| 2 | Phase Agents | Issues with phase:research, phase:architecture, phase:grooming | Matures tickets toward readiness |
| 3 | Workers / Planners | phase:ready issues or needs-planning issues | Core implementation work |
| 4 | Scanner | No todo work AND backlog < 10 | Background improvement discovery |
Worker Slot Reserve
When maxConcurrency >= 2 and there are dispatchable worker items, 1 slot is reserved exclusively for phase:ready workers. This prevents phase agents (research, architecture, grooming) from consuming all available slots and starving implementation work.
The reserve only activates when worker candidates actually exist — if no phase:ready issues are available, phase agents can use all slots.
Judge Slot Allocation
Judges receive up to 25% of global slots (minimum 1) as a reserved pool. Judges can exceed globalMax by this amount to avoid being blocked behind workers. This ensures the review pipeline doesn't stall when all slots are occupied by workers.
Per-Project Limits
Each project has a configurable maxConcurrency limit. The dispatch cycle respects both global and per-project limits, using the more restrictive of the two. The getAvailableSlotsForProject() function returns an AvailableSlotsResult with both the available count and an optional reason:
```typescript
{ available: min(globalMax - globalRunning, projectMax - projectRunning),
  reason?: 'global_full' | 'project_full' }
```

The retry queue uses the `reason` field for differentiated backoff:

- `project_full`: Short fixed delay (10s) — the project will free up soon
- `global_full`: Normal exponential backoff — the entire system is congested
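A sketch of the slot computation using the result shape described above; the function body is an assumption, only the `AvailableSlotsResult` shape comes from this document:

```typescript
type AvailableSlotsResult = { available: number; reason?: 'global_full' | 'project_full' };

// Available slots are the more restrictive of the global and per-project
// limits; when zero, the reason tells the retry queue which pool is full.
function getAvailableSlotsForProject(
  globalMax: number, globalRunning: number,
  projectMax: number, projectRunning: number,
): AvailableSlotsResult {
  const globalFree = globalMax - globalRunning;
  const projectFree = projectMax - projectRunning;
  const available = Math.max(0, Math.min(globalFree, projectFree));
  if (available > 0) return { available };
  return { available: 0, reason: globalFree <= 0 ? 'global_full' : 'project_full' };
}
```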
Source: dispatchOrchestrator.ts, orchestrator.ts:processRetries(), concurrency.ts
Recovery Scenarios
When the orchestrator crashes or restarts, various in-flight state must be recovered. The table below maps each failure scenario to its recovery mechanism.
| Scenario | What's Lost | Recovery Mechanism | Recovery Time |
|---|---|---|---|
| Graceful shutdown mid-tick | Remaining sub-operations in current tick | shuttingDown breaks skip remaining sub-ops; finally block writes health metrics; tickCompletionPromise resolves to unblock awaitShutdown() | Immediate |
| Tick crash mid-phase-transition | Unprocessed claims in current iteration | Claims are idempotent — picked up on next tick | ~5s (next tick) |
| Tick crash mid-reconcile | Partial orphan detection | Orphans detected again on next tick | ~5s |
| Tick crash mid-dispatch | In-flight slot allocation, partially launched agent | Reconcile detects orphaned in_progress issues | ~5s |
| Orchestrator restart | All ctx.running entries, all counter maps | recoverOrphanedRuns() marks all DB running agent_runs as failed and resets issues to todo | Immediate on boot |
| Agent process crash | Agent's in-memory state | PID liveness check in reconcile triggers exit handler | ~5s (next tick) |
| In-memory counters lost | consecutiveFailures, staleMates, noSignals maps | totalRuns DB query provides absolute safety net — prevents runaway retry loops even without counter history | Immediate |
| Orphaned worktrees | Disk space consumed by unused worktrees | Background ArtifactPruner worker cleans stale worktrees; manual git worktree prune also works | Background worker interval |
| Retry queue lost | Scheduled retries with backoff | Issues are in todo status — dispatch cycle re-picks them on next tick (backoff timing is lost, but work continues) | ~5s |
Startup Recovery
On boot, the orchestrator calls recoverOrphanedRuns() which:
1. Queries all `agent_runs` with `status = 'running'`
2. Marks each as `failed` with error `'Orchestrator restarted'`
3. Resets the associated issue to `todo`
This ensures no issues are permanently stuck in in_progress after a crash.
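An in-memory sketch of the three recovery steps; the real implementation runs queries against the `agent_runs` and `issues` tables, and the types here are illustrative:

```typescript
type AgentRun = { id: number; issueId: number; status: string; error?: string };
type Issue = { id: number; status: string };

// Mark every still-'running' run as failed and reset its issue to todo,
// so nothing stays stuck in in_progress after a restart.
function recoverOrphanedRuns(runs: AgentRun[], issues: Map<number, Issue>): number {
  let recovered = 0;
  for (const run of runs) {
    if (run.status !== 'running') continue;
    run.status = 'failed';
    run.error = 'Orchestrator restarted';
    const issue = issues.get(run.issueId);
    if (issue) issue.status = 'todo';
    recovered++;
  }
  return recovered;
}
```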
Source: orchestrator.ts:224-247
Further Reading
- Architecture — three-process model, database schema, directory structure
- Phase Pipeline — phase contracts, validation modes, artifact system
- Agent Types — worker, judge, planner, researcher profiles
- Dispatcher — candidate selection, eligibility filters
- Agent Lifecycle — prompt composition, worktree setup, MCP tools