
Orchestrator Tick Loop Lifecycle

The orchestrator is Symphony's execution engine — a daemon process that polls the database every 5 seconds, dispatching agents, detecting failures, and advancing phase transitions. This document explains the tick loop's internal state machine, counter lifecycle, exit handler decision tree, and race condition defenses so that contributors can understand the execution flow without reading source code.

Tick Loop State Machine

Each tick executes 8 phases in a fixed, deterministic order. The order is intentional: phase transitions run first to unblock completed work, reconcile runs before dispatch to clean up crashed agents, and retries run before new dispatch to prioritize recovery over fresh work.

Graceful Shutdown: When stop() is called, the tick loop drains the current in-flight tick before proceeding with agent shutdown. Each sub-operation checks shuttingDown between steps via a do-while(false) early-exit pattern, and awaitShutdown() waits for the tick completion promise with a timeout cap of min(5s, 20% of shutdownTimeout). Health metrics are written in a finally block to guarantee execution even during shutdown.

Phase Reference Table

| # | Phase | Function(s) | Purpose | Source |
|---|-------|-------------|---------|--------|
| 1 | Phase Transitions | processPhaseTransitions() | Validate completion claims, swap phase labels, dispatch judges for review | phaseTransition.ts:31, orchestrator.ts |
| 2 | Pipeline Safety | detectPipelineMismatches(), assignMissingPhaseLabels() | Fix data integrity — ensure issues have correct phase labels | pipelineMismatch.ts, orchestrator.ts |
| 3 | Needs-Revision Expiry | processNeedsRevisionExpiry() | Circuit breaker for stuck revisions — expires stale needs-revision labels | phaseTransition.ts, orchestrator.ts |
| 4 | Reconcile | reconcile() | Detect dead agents (PID check), stall timeouts, DB orphans, issue-level orphans | orchestrator.ts |
| 5 | Process Retries | processRetries() | Dispatch due entries from the retry queue with exponential backoff | orchestrator.ts |
| 6 | Dispatch Intake | dispatchIntakeBatches() | Process pending batch intake requests (bulk issue creation) | orchestrator.ts |
| 7 | Dispatch | runDispatchCycle() | Fill available slots: judges, phase agents, workers, scanners | dispatch/cycle.ts |
| 8 | Background Workers | workerRegistry.runDue() | Learning consolidation, artifact pruning, prompt metrics, WAL checkpoints | orchestrator.ts |
|   | Health Metrics | writeHealthSnapshot() | Record tick timing and system health to disk (runs in finally block) | orchestrator.ts |

Why This Order?

  1. Phase transitions first — unblocks agents waiting for phase advancement. A completed research artifact should advance to architecture before the next dispatch tries to assign a new researcher.
  2. Pipeline safety second — repairs data inconsistencies before any dispatch decisions are made.
  3. Reconcile before dispatch — frees slots held by dead agents so dispatch can use them. Without this, a crashed agent would block its slot until the next tick after dispatch.
  4. Retries before new dispatch — honors backoff commitments. Issues that already failed deserve their scheduled retry before fresh work gets dispatched.
  5. Intake before dispatch — intake batches may create new issues that become dispatch candidates.
  6. Background workers and health last — non-critical maintenance that shouldn't delay core dispatch work.

Tick Error Handling

Each sub-operation runs inside its own try/catch. If any sub-operation throws, the error is logged and subsequent sub-operations continue. An outer try/catch acts as a safety net for unexpected errors. A finally block guarantees that health metrics (writeHealthSnapshot()) and tick timing (tickTracker.recordTick()) always execute, even during errors or shutdown.
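A minimal sketch of this isolation pattern, with illustrative names (SubOp, runSafely, tickLog are stand-ins, not the actual orchestrator API):

```typescript
// Each sub-operation is wrapped so one failure cannot abort the tick.
type SubOp = { name: string; run: () => void }

const tickLog: string[] = []

function runSafely(op: SubOp): void {
  try {
    op.run()
    tickLog.push(`${op.name}: ok`)
  } catch (err) {
    // Log and continue: later sub-operations still execute.
    tickLog.push(`${op.name}: error (${(err as Error).message})`)
  }
}

function tick(ops: SubOp[]): void {
  try {
    for (const op of ops) runSafely(op)
  } finally {
    // Health metrics and tick timing always run, even on error or shutdown.
    tickLog.push('health: written')
  }
}
```

The key design point is the combination: per-operation try/catch localizes failures, while the finally block makes the health snapshot unconditional.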

Graceful Shutdown Drain

When stop() is called, the orchestrator uses a two-phase shutdown:

  1. Tick drain — awaitShutdown() waits for the current in-flight tick to complete via tickCompletionPromise, with a timeout cap of min(5s, 20% of shutdownTimeout). This ensures in-flight phase transitions and DB writes complete cleanly.
  2. Agent shutdown — After the tick drains (or times out), the existing agent wait logic sends SIGTERM and polls for process exit.

Inside tick(), each sub-operation is followed by an if (this.shuttingDown) break check inside a do-while(false) block. When shutdown is signaled mid-tick, remaining sub-operations are skipped but the finally block still runs, ensuring health metrics are written and the tick completion promise resolves.
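As a sketch, the pattern looks like this (the sub-operation names are placeholders and shuttingDown stands in for the real flag):

```typescript
let shuttingDown = false
const ran: string[] = []
const step = (name: string): void => { ran.push(name) }

function tickWithShutdownChecks(): void {
  try {
    // do-while(false) runs exactly once and gives every `break` a single
    // exit point, avoiding deeply nested if (!shuttingDown) blocks.
    do {
      step('phaseTransitions')
      if (shuttingDown) break
      step('reconcile')
      if (shuttingDown) break
      step('dispatch')
    } while (false)
  } finally {
    step('health') // guaranteed even when shutdown skips sub-operations
  }
}
```

When the flag flips mid-tick, the remaining steps are skipped but the finally block still runs, matching the drain behavior described above.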

Source: orchestrator.ts:tick(), orchestrator.ts:awaitShutdown()


Counter Lifecycle

The orchestrator tracks four in-memory counter maps to detect repeated failures and trigger circuit breakers. All counters are scoped per-issue and stored in the OrchestratorContext object.

| Counter | Increment Condition | Reset Condition | Circuit Break | Scope |
|---------|--------------------|-----------------|---------------|-------|
| consecutiveFailures | Agent exits with exitCode !== 0 | Agent exits with exitCode === 0 (any success) | >= maxRetries (default 3) OR totalRuns >= maxRetries | Per-issue, tracks crashes and timeouts |
| consecutiveStaleMates | Worker produces no git changes (detectWorkerStalemate returns true) | Worker produces git changes OR autoCreatePr succeeds | >= maxStaleRuns (default 3) OR totalRuns >= maxRetries | Per-issue, worker agents only |
| consecutiveNoSignals | Agent produces no work product (no PR, no claim, no subtasks) | Agent produces any work product | Does not circuit break — only increases retry backoff | Per-issue, all agent types |
| totalRuns (DB query) | Every agent run (inserted into agent_runs) | Never resets — absolute count | >= maxRetries — absolute safety net | Per-issue, prevents runaway loops |

Counter State Diagrams

State diagrams cover consecutiveFailures, consecutiveStaleMates, and consecutiveNoSignals; each follows the increment, reset, and circuit-break rules in the table above.

Key Constraints

  • In-memory only — all counters are lost on orchestrator restart. The totalRuns query against the agent_runs table provides a persistent safety net.
  • Consecutive, not cumulative — a single success resets consecutiveFailures and consecutiveNoSignals to zero. This prevents one bad run from permanently penalizing an issue.
  • Backoff uses consecutive counter — the attempt parameter passed to retryQueue.scheduleRetry() must be the consecutive counter value, not totalRuns. Using totalRuns would cause excessive backoff on issues that had early successes.

Source: exitHandler.ts:280-476, orchestrator.ts:43-58


Exit Handler Decision Tree

When an agent process exits, the exit handler determines what to do based on exit code, agent type, and work product. Every path through the handler updates counters, DB records, and optionally schedules retries.

Exit Path Summary

| Path | Counter(s) Affected | Action | Next Status |
|------|--------------------|--------|-------------|
| Issue done/cancelled (Layer 2) | None | Cleanup worktree, return | done/cancelled |
| Failed (exitCode !== 0) | consecutiveFailures++ | Circuit break or schedule retry | todo |
| Success + agent changed status | consecutiveFailures reset | Respect agent decision | (agent-set) |
| Success + pending completion claim | consecutiveFailures reset | Leave for transition processor | in_progress |
| Success + phase agent | consecutiveFailures reset | Leave for transition processor | in_progress |
| Success + judge no verdict | Judge comment count++ | Move to todo (>= 3) or leave in review | review or todo |
| Success + worker PR created | staleMates reset | Auto-create PR, move to review | review |
| Success + worker stalemate | staleMates++ | Circuit break (>= 3) or retry | todo |
| Success + planner has subtasks | noSignals reset | Return | in_progress |
| Success + no work product | noSignals++ | Retry with backoff or max-runs circuit break | todo |

Source: exitHandler.ts:160-538


Race Condition Defense

Symphony uses four defensive layers to prevent duplicate dispatch, orphaned agents, and stale state. Each layer catches problems that earlier layers might miss.

Layer 1: Atomic Claim (Pre-Dispatch)

Before dispatching an agent, the orchestrator checks the issue's current status fresh from the database and atomically sets it to in_progress. If the status has already changed (another tick claimed it, or a human moved it), dispatch is aborted.

  • Protects against: Concurrent dispatch attempts, human status changes before claim
  • Implementation: dispatchSetup.ts:claimIssueForDispatch()
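A minimal sketch of the claim step, using an in-memory map as a stand-in for the database. In the real claimIssueForDispatch() this check-and-set would be a single conditional UPDATE (e.g. `... SET status='in_progress' WHERE id=? AND status='todo'`), so the read and write cannot interleave:

```typescript
type Status = 'backlog' | 'todo' | 'in_progress' | 'review' | 'done' | 'cancelled'
const issues = new Map<number, Status>()

// Succeed only if the fresh status is still 'todo'; otherwise another
// tick already claimed the issue, or a human moved it.
function claimIssueForDispatch(issueId: number): boolean {
  if (issues.get(issueId) !== 'todo') return false
  issues.set(issueId, 'in_progress')
  return true
}
```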

Layer 2: Post-Exit Failsafe

After an agent exits, the exit handler re-reads the issue from the database before processing. If the issue became done or cancelled during the agent's execution (e.g., a human merged the PR manually), the handler cleans up the worktree and returns early — skipping retry logic entirely.

  • Protects against: Issues completed manually while an agent was running
  • Implementation: exitHandler.ts:250-277

Layer 3: Reconcile GC (Three-Level Orphan Detection)

The reconcile phase runs on every tick and detects orphaned agents at three levels:

| Level | Detection Method | Action | Catches |
|-------|------------------|--------|---------|
| 1. PID liveness | process.kill(pid, 0) on in-memory agents | Trigger exit handler with exitCode=1 | Crashed agent processes |
| 2. DB orphan | Query agent_runs with status=running, check PID alive, skip if in ctx.running | Mark run as failed, reset issue to todo | Agents that died between ticks, orchestrator restarts |
| 3. Issue orphan | Query issues with status=in_progress, skip if in ctx.running or has DB-level running run | Reset issue to todo | Exit handler failures, partial DB updates |

Additionally, reconcile checks for stall timeouts — agents that have been running longer than stall_timeout_ms are killed via SIGTERM.
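The PID liveness probe behind levels 1 and 2 relies on standard signal-0 semantics; a sketch in Node.js terms:

```typescript
// process.kill(pid, 0) delivers no signal; it only checks whether the
// process exists and whether we are allowed to signal it.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0)
    return true
  } catch (err) {
    // EPERM means the process exists but is owned by another user, so it
    // is still alive; ESRCH means it is gone.
    return (err as { code?: string }).code === 'EPERM'
  }
}
```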

Synthetic Key Handling

The ctx.running map uses issue IDs as keys for regular agents, but uses synthetic keys like intake:<batchId> for non-issue agents (intake batch processors). The reconcile loop guards against these synthetic keys to prevent handleAgentExit from receiving non-issue keys:

  1. Detection: isSyntheticRunningKey() checks if a key starts with a known prefix (intake:, etc.)
  2. Dead PID handling: If a synthetic agent's PID is dead, handleSyntheticAgentDeath() handles cleanup inline — removing from the running map, cleaning temp files, and marking the intake batch as failed
  3. Stall detection skipped: Synthetic agents have their own exit handlers (in intakeDispatcher.ts), so reconcile skips stall detection for them
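A sketch of the guard, assuming intake: is the only known prefix (the "etc." above suggests more may exist in the real syntheticKeys.ts):

```typescript
// Synthetic keys mark non-issue agents in ctx.running, e.g. 'intake:<batchId>'.
const SYNTHETIC_PREFIXES = ['intake:']

function isSyntheticRunningKey(key: string | number): boolean {
  return typeof key === 'string' &&
    SYNTHETIC_PREFIXES.some((prefix) => key.startsWith(prefix))
}
```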

Source: syntheticKeys.ts, orchestrator.ts:reconcile()

  • Protects against: Crashed processes, orchestrator restarts, partial DB updates
  • Implementation: orchestrator.ts:249-345, syntheticKeys.ts

Layer 4: Dispatch Guard (Double-Dispatch Prevention)

After launching an agent subprocess but before recording it in ctx.running, the dispatcher checks if an agent already exists for this issue. If so, it kills the new process, marks the run as failed, and aborts. This prevents the ctx.running map (keyed by issueId) from being overwritten, which would orphan the existing agent.

  • Protects against: Double-dispatch in the same tick cycle, race between phase transition judge dispatch and regular dispatch
  • Implementation: agentDispatcher.ts:479-491
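A sketch of the guard with a simplified Agent shape; the function name and return convention are illustrative, not the actual agentDispatcher.ts API:

```typescript
interface Agent { pid: number; kill: () => void }
const running = new Map<number, Agent>()

// Returns false (and kills the duplicate) when an agent already holds
// this issue: overwriting the map entry would orphan the existing agent.
function registerOrAbort(issueId: number, agent: Agent): boolean {
  if (running.has(issueId)) {
    agent.kill()
    return false // caller marks the duplicate run as failed
  }
  running.set(issueId, agent)
  return true
}
```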

Status Transitions

Issues follow a canonical status transition map with three ownership domains:

Transition Ownership

| Transition | Owner | Trigger |
|-----------|-------|---------|
| backlog → todo | Orchestrator | Auto-promote when no todo work exists |
| todo → in_progress | Orchestrator | Atomic claim at dispatch time |
| in_progress → review | Agent (MCP create_pr) or Orchestrator (autoCreatePr) | PR created from worktree changes |
| in_progress → todo | Orchestrator | Agent failure, stalemate, or retry |
| in_progress → blocked | Agent (MCP update_issue_status) | Dependency not met |
| blocked → todo | Orchestrator | autoUnblockParents() when children complete |
| review → done | Human only | Merge PR via UI |
| review → todo | Judge (MCP reject_pr) or Orchestrator | PR rejected or judge exhausted retries |
| done → todo | Human | Reopen issue |
| cancelled → backlog/todo | Human | Restore issue |

Source: server/utils/statusTransitions.ts, app/utils/statusTransitions.ts


Retry Queue

Failed or stalled agents are scheduled for retry with exponential backoff. The retry queue is in-memory and processed each tick during the "Process Retries" phase.

Backoff Formula

delay = min(10_000 * 2^(attempt - 1), maxRetryBackoffMs)

Where attempt is the consecutive failure/stalemate/no-signal count (not total runs).
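Transcribed directly into TypeScript; the 10s base comes from the formula, and the 300s default cap matches the schedule table:

```typescript
// delay = min(10_000 * 2^(attempt - 1), maxRetryBackoffMs)
function retryDelayMs(attempt: number, maxRetryBackoffMs = 300_000): number {
  return Math.min(10_000 * 2 ** (attempt - 1), maxRetryBackoffMs)
}
```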

Backoff Schedule

| Attempt | Delay | Cumulative |
|---------|-------|------------|
| 1 | 10s | 10s |
| 2 | 20s | 30s |
| 3 | 40s | 70s |
| 4 | 80s | 150s |
| 5 | 160s | 310s |
| 6+ | 300s (max, configurable) | 610s+ |

Important: Consecutive Counter, Not Total Runs

The attempt parameter must be the consecutive counter value. Using totalRuns causes excessive backoff — an issue with 10 historical runs but only 1 recent failure would get a 5-minute backoff instead of 10 seconds.

```typescript
// Correct — consecutive failures as attempt
const failures = (ctx.consecutiveFailures.get(issueId) ?? 0) + 1
ctx.retryQueue.scheduleRetry(issueId, identifier, failures, 'timeout')

// Wrong — total runs causes excessive backoff
ctx.retryQueue.scheduleRetry(issueId, identifier, totalRuns, 'timeout')
```

Retry Processing

Each tick, processRetries() pulls due entries from the queue and dispatches them:

  1. Remove entry from queue
  2. Re-read issue from DB — skip if status is not todo
  3. Determine phase from issue labels
  4. Check concurrency limits (global and per-project)
  5. If no slots available, re-queue with attempt + 1
  6. Dispatch agent with the correct phase and recovery strategy (if any)

Context Recovery Cascade

When agents fail due to token/context limits, a progressive recovery cascade adjusts dispatch context instead of retrying identically. Strategies escalate from budget reduction → output compaction → fresh dispatch with learnings → model escalation. Token errors fast-track to output compaction. Recovery attempts count against the same max_retries budget — no additional retries are created.

See docs/patterns/retry-with-backoff.md § "Context Recovery Cascade" for full details.

Source: contextRecovery.ts, retryQueue.ts, orchestrator.ts:processRetries()


Phase Transition Flow

Phase transitions are the mechanism by which issues advance through the readiness pipeline. Agents submit completion claims; the orchestrator validates and transitions.

Precondition Checks

Before validating an artifact, the transition processor verifies:

  1. Phase match — the claim's completed_phase matches the issue's current phase:* label
  2. Artifact exists — the file at the resolved artifact path exists on disk
  3. Required sections — the artifact contains all sections defined in the contract's required_sections

If preconditions fail, the claim is marked as processed with a rejection reason, and a needs-revision label is added.
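The three checks can be sketched as follows; the Claim and Contract shapes are simplified stand-ins, and the real processor resolves the artifact path on disk rather than taking its text inline:

```typescript
interface Claim { completedPhase: string; artifactText: string | null }
interface Contract { requiredSections: string[] }

// Returns null when all preconditions pass, otherwise a rejection reason.
function checkPreconditions(
  claim: Claim,
  issuePhaseLabel: string,
  contract: Contract,
): string | null {
  if (`phase:${claim.completedPhase}` !== issuePhaseLabel) return 'phase mismatch'
  if (claim.artifactText === null) return 'artifact missing'
  for (const section of contract.requiredSections) {
    if (!claim.artifactText.includes(section)) return `missing section: ${section}`
  }
  return null
}
```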

Validation Modes

| Mode | Behavior | Used By |
|------|----------|---------|
| structural | Auto-advance if artifact exists with required sections | phase:research |
| judge | Dispatch judge to review artifact; advance only on approve_phase | phase:architecture |
| trust | Structural checks run as warnings, always advances | (configurable) |

Adaptive Phase Skipping

Agents can recommend skipping phases by including skip_to in their completion claim. For example, a research agent assessing a simple ticket can skip directly to phase:ready. Skipped phases get skip:<phase> labels for traceability.

Source: phaseTransition.ts:31-440, contractParser.ts


Reconcile (GC)

The reconcile phase is the orchestrator's garbage collector — it detects and cleans up agents and issues that are in inconsistent states. It runs on every tick before dispatch.

Three-Level Detection

| Level | What It Checks | Detection | Recovery Action |
|-------|----------------|-----------|-----------------|
| 1. In-memory PID | Agents in ctx.running | process.kill(pid, 0) — throws if process dead | Trigger exit handler with exitCode=1 |
| 2. DB orphan | agent_runs with status=running not in ctx.running | PID liveness check on DB records | Mark run as failed, reset issue to todo |
| 3. Issue orphan | Issues with status=in_progress and no active agent | No entry in ctx.running AND no running agent_run | Reset issue to todo |

Stall Detection

In addition to orphan detection, reconcile checks for stalled agents — processes that are still alive but have exceeded stall_timeout_ms. Stalled agents are killed via SIGTERM, and their exit is processed as a failure with timedOut: true.

Why Three Levels?

Each level catches a different failure mode:

  • Level 1 catches processes that crashed between ticks (most common)
  • Level 2 catches processes that died but whose in-memory entry was already cleaned up (e.g., orchestrator restart with a different ctx.running state)
  • Level 3 is defense-in-depth — catches cases where the agent_run was marked as failed but the issue status was never reset (e.g., the .catch block in dispatch updated agentRuns but threw before updating issues)

Source: orchestrator.ts:249-345


Dispatch Priority

The dispatch cycle fills available slots in a strict priority order within each project:

| Priority | Category | Condition | Rationale |
|----------|----------|-----------|-----------|
| 0 | Setup Wizard | wizardCompleted === false | Must configure project before any work |
| 1 | Judge | Issues in review status | Unblocks the pipeline — reviewed PRs become mergeable |
| 2 | Phase Agents | Issues with phase:research, phase:architecture, phase:grooming | Matures tickets toward readiness |
| 3 | Workers / Planners | phase:ready issues or needs-planning issues | Core implementation work |
| 4 | Scanner | No todo work AND backlog < 10 | Background improvement discovery |

Worker Slot Reserve

When maxConcurrency >= 2 and there are dispatchable worker items, 1 slot is reserved exclusively for phase:ready workers. This prevents phase agents (research, architecture, grooming) from consuming all available slots and starving implementation work.

The reserve only activates when worker candidates actually exist — if no phase:ready issues are available, phase agents can use all slots.
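A sketch of the reserve rule; the function and parameter names are illustrative:

```typescript
// Phase agents may use the free slots minus the worker reserve; the
// reserve exists only when concurrency >= 2 AND worker candidates wait.
function slotsForPhaseAgents(
  maxConcurrency: number,
  freeSlots: number,
  workerCandidates: number,
): number {
  const reserve = maxConcurrency >= 2 && workerCandidates > 0 ? 1 : 0
  return Math.max(0, freeSlots - reserve)
}
```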

Judge Slot Allocation

Judges receive up to 25% of global slots (minimum 1) as a reserved pool. Judges can exceed globalMax by this amount to avoid being blocked behind workers. This ensures the review pipeline doesn't stall when all slots are occupied by workers.

Per-Project Limits

Each project has a configurable maxConcurrency limit. The dispatch cycle respects both global and per-project limits, using the more restrictive of the two. The getAvailableSlotsForProject() function returns an AvailableSlotsResult with both the available count and an optional reason:

```
{ available: min(globalMax - globalRunning, projectMax - projectRunning),
  reason?: 'global_full' | 'project_full' }
```

The retry queue uses the reason field for differentiated backoff:

  • project_full: Short fixed delay (10s) — the project will free up soon
  • global_full: Normal exponential backoff — the entire system is congested
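A sketch of the computation; the tie-break when both limits are exhausted at once is an assumption, since the document does not specify which reason wins:

```typescript
interface AvailableSlotsResult {
  available: number
  reason?: 'global_full' | 'project_full'
}

function getAvailableSlotsForProject(
  globalMax: number,
  globalRunning: number,
  projectMax: number,
  projectRunning: number,
): AvailableSlotsResult {
  const globalFree = globalMax - globalRunning
  const projectFree = projectMax - projectRunning
  const available = Math.max(0, Math.min(globalFree, projectFree))
  if (available > 0) return { available }
  // Report the tighter constraint so the retry queue can pick the right
  // backoff (assumed tie-break: project_full when both are exhausted).
  return { available: 0, reason: projectFree <= globalFree ? 'project_full' : 'global_full' }
}
```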

Source: dispatchOrchestrator.ts, orchestrator.ts:processRetries(), concurrency.ts


Recovery Scenarios

When the orchestrator crashes or restarts, various in-flight state must be recovered. The table below maps each failure scenario to its recovery mechanism.

| Scenario | What's Lost | Recovery Mechanism | Recovery Time |
|----------|-------------|--------------------|---------------|
| Graceful shutdown mid-tick | Remaining sub-operations in current tick | shuttingDown break skips remaining sub-ops; finally block writes health metrics; tickCompletionPromise resolves to unblock awaitShutdown() | Immediate |
| Tick crash mid-phase-transition | Unprocessed claims in current iteration | Claims are idempotent — picked up on next tick | ~5s (next tick) |
| Tick crash mid-reconcile | Partial orphan detection | Orphans detected again on next tick | ~5s |
| Tick crash mid-dispatch | In-flight slot allocation, partially launched agent | Reconcile detects orphaned in_progress issues | ~5s |
| Orchestrator restart | All ctx.running entries, all counter maps | recoverOrphanedRuns() marks all DB running agent_runs as failed and resets issues to todo | Immediate on boot |
| Agent process crash | Agent's in-memory state | PID liveness check in reconcile triggers exit handler | ~5s (next tick) |
| In-memory counters lost | consecutiveFailures, staleMates, noSignals maps | totalRuns DB query provides absolute safety net — prevents runaway retry loops even without counter history | Immediate |
| Orphaned worktrees | Disk space consumed by unused worktrees | Background ArtifactPruner worker cleans stale worktrees; manual git worktree prune also works | Background worker interval |
| Retry queue lost | Scheduled retries with backoff | Issues are in todo status — dispatch cycle re-picks them on next tick (backoff timing is lost, but work continues) | ~5s |

Startup Recovery

On boot, the orchestrator calls recoverOrphanedRuns() which:

  1. Queries all agent_runs with status = 'running'
  2. Marks each as failed with error 'Orchestrator restarted'
  3. Resets the associated issue to todo

This ensures no issues are permanently stuck in in_progress after a crash.
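The recovery loop can be sketched over in-memory stand-ins for the two tables; in the real code these are SQL updates against agent_runs and issues:

```typescript
interface AgentRun { id: number; issueId: number; status: string; error?: string }
const agentRuns: AgentRun[] = []
const issueStatus = new Map<number, string>()

// Mark every 'running' run as failed and push its issue back to todo.
function recoverOrphanedRuns(): number {
  let recovered = 0
  for (const run of agentRuns) {
    if (run.status !== 'running') continue
    run.status = 'failed'
    run.error = 'Orchestrator restarted'
    issueStatus.set(run.issueId, 'todo')
    recovered += 1
  }
  return recovered
}
```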

Source: orchestrator.ts:224-247


Further Reading

  • Architecture — three-process model, database schema, directory structure
  • Phase Pipeline — phase contracts, validation modes, artifact system
  • Agent Types — worker, judge, planner, researcher profiles
  • Dispatcher — candidate selection, eligibility filters
  • Agent Lifecycle — prompt composition, worktree setup, MCP tools