Orchestrator Tick Loop Lifecycle
The orchestrator is Symphony's execution engine — a daemon process that polls the database every 5 seconds, dispatching agents, detecting failures, and advancing phase transitions. This document explains the tick loop's internal state machine, counter lifecycle, exit handler decision tree, and race condition defenses so that contributors can understand the execution flow without reading source code.
Tick Loop State Machine
Each tick executes 8 phases in a fixed, deterministic order. The order is intentional: phase transitions run first to unblock completed work, reconcile runs before dispatch to clean up crashed agents, and retries run before new dispatch to prioritize recovery over fresh work.
Graceful Shutdown: When `stop()` is called, the tick loop drains the current in-flight tick before proceeding with agent shutdown. Each sub-operation checks `shuttingDown` between steps via a `do-while(false)` early-exit pattern, and `awaitShutdown()` waits for the tick completion promise with a timeout cap of `min(5s, 20% of shutdownTimeout)`. Health metrics are written in a `finally` block to guarantee execution even during shutdown.
Phase Reference Table
| # | Phase | Function(s) | Purpose | Source |
|---|---|---|---|---|
| 1 | Phase Transitions | processPhaseTransitions() | Validate completion claims, swap phase labels, dispatch judges for review | phaseTransition.ts:31 → orchestrator.ts |
| 2 | Pipeline Safety | detectPipelineMismatches(), assignMissingPhaseLabels() | Fix data integrity — ensure issues have correct phase labels | pipelineMismatch.ts → orchestrator.ts |
| 3 | Needs-Revision Expiry | processNeedsRevisionExpiry() | Circuit breaker for stuck revisions — expires stale needs-revision labels | phaseTransition.ts → orchestrator.ts |
| 4 | Reconcile | reconcile() | Detect dead agents (PID check), stall timeouts, DB orphans, issue-level orphans | orchestrator.ts |
| 5 | Process Retries | processRetries() | Dispatch due entries from the retry queue with exponential backoff | orchestrator.ts |
| 6 | Dispatch Intake | dispatchIntakeBatches() | Process pending batch intake requests (bulk issue creation) | orchestrator.ts |
| 7 | Dispatch | runDispatchCycle() | Fill available slots: judges, phase agents, workers, scanners | dispatch/cycle.ts |
| 8 | Background Workers | workerRegistry.runDue() | Learning consolidation, artifact pruning, prompt metrics, WAL checkpoints | orchestrator.ts |
| — | Health Metrics | writeHealthSnapshot() | Record tick timing and system health to disk (runs in finally block) | orchestrator.ts |
Why This Order?
- Phase transitions first — unblocks agents waiting for phase advancement. A completed research artifact should advance to architecture before the next dispatch tries to assign a new researcher.
- Pipeline safety second — repairs data inconsistencies before any dispatch decisions are made.
- Reconcile before dispatch — frees slots held by dead agents so dispatch can use them. Without this, a crashed agent would block its slot until the next tick after dispatch.
- Retries before new dispatch — honors backoff commitments. Issues that already failed deserve their scheduled retry before fresh work gets dispatched.
- Intake before dispatch — intake batches may create new issues that become dispatch candidates.
- Background workers and health last — non-critical maintenance that shouldn't delay core dispatch work.
Tick Error Handling
Each sub-operation runs inside its own try/catch. If any sub-operation throws, the error is logged and subsequent sub-operations continue. An outer try/catch acts as a safety net for unexpected errors. A finally block guarantees that health metrics (writeHealthSnapshot()) and tick timing (tickTracker.recordTick()) always execute, even during errors or shutdown.
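This error-isolation structure can be sketched as follows; `runTick` and its argument shapes are illustrative stand-ins, not the actual orchestrator API:

```typescript
// Each sub-operation is isolated in its own try/catch; a throw is recorded
// and the remaining sub-operations still run. The finally block guarantees
// health metrics are written even when sub-operations fail.
type SubOp = { name: string; run: () => void };

function runTick(ops: SubOp[], writeHealth: () => void): string[] {
  const errors: string[] = [];
  try {
    for (const op of ops) {
      try {
        op.run();
      } catch (err) {
        // log and continue with the next sub-operation
        errors.push(`${op.name}: ${(err as Error).message}`);
      }
    }
  } finally {
    writeHealth(); // always executes, even on errors or shutdown
  }
  return errors;
}
```

A failing middle sub-operation does not prevent later ones (or the health write) from running.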
Graceful Shutdown Drain
When stop() is called, the orchestrator uses a two-phase shutdown:
- Tick drain — `awaitShutdown()` waits for the current in-flight tick to complete via `tickCompletionPromise`, with a timeout cap of `min(5s, 20% of shutdownTimeout)`. This ensures in-flight phase transitions and DB writes complete cleanly.
- Agent shutdown — After the tick drains (or times out), the existing agent wait logic sends SIGTERM and polls for process exit.
Inside tick(), each sub-operation is followed by an if (this.shuttingDown) break check inside a do-while(false) block. When shutdown is signaled mid-tick, remaining sub-operations are skipped but the finally block still runs, ensuring health metrics are written and the tick completion promise resolves.
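A minimal sketch of the early-exit pattern, with hypothetical sub-operation names standing in for the real ones:

```typescript
type TickCtx = { shuttingDown: boolean; log: string[] };

// Each sub-operation is followed by a shutdown check; `break` exits the
// do-while(false) block, skipping the remaining sub-operations, while the
// finally block still records health metrics.
function tick(ctx: TickCtx): void {
  try {
    do {
      ctx.log.push('phaseTransitions');
      if (ctx.shuttingDown) break;
      ctx.log.push('reconcile');
      if (ctx.shuttingDown) break;
      ctx.log.push('dispatch');
    } while (false);
  } finally {
    ctx.log.push('healthSnapshot'); // runs whether or not the tick was cut short
  }
}
```

The `do-while(false)` wrapper exists purely so that `break` can jump past the remaining straight-line steps without a chain of nested `if`s.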
Source: orchestrator.ts:tick(), orchestrator.ts:awaitShutdown()
Counter Lifecycle
The orchestrator tracks four in-memory counter maps to detect repeated failures and trigger circuit breakers. All counters are scoped per-issue and stored in the OrchestratorContext object.
| Counter | Increment Condition | Reset Condition | Circuit Break | Scope |
|---|---|---|---|---|
| consecutiveFailures | Agent exits with exitCode !== 0 | Agent exits with exitCode === 0 (any success) | >= maxRetries (default 3) OR totalRuns >= maxRetries | Per-issue, tracks crashes and timeouts |
| consecutiveStaleMates | Worker produces no git changes (detectWorkerStalemate returns true) | Worker produces git changes OR autoCreatePr succeeds | >= maxStaleRuns (default 3) OR totalRuns >= maxRetries | Per-issue, worker agents only |
| consecutiveNoSignals | Agent produces no work product (no PR, no claim, no subtasks) | Agent produces any work product | Does not circuit break — only increases retry backoff | Per-issue, all agent types |
| totalRuns (DB query) | Every agent run (inserted into agent_runs) | Never resets — absolute count | >= maxRetries — absolute safety net | Per-issue, prevents runaway loops |
Counter State Diagrams

State diagrams for `consecutiveFailures`, `consecutiveStaleMates`, and `consecutiveNoSignals` illustrate the increment, reset, and circuit-break transitions summarized in the table above.
Key Constraints
- In-memory only — all counters are lost on orchestrator restart. The `totalRuns` query against the `agent_runs` table provides a persistent safety net.
- Consecutive, not cumulative — a single success resets `consecutiveFailures` and `consecutiveNoSignals` to zero. This prevents one bad run from permanently penalizing an issue.
- Backoff uses consecutive counter — the `attempt` parameter passed to `retryQueue.scheduleRetry()` must be the consecutive counter value, not `totalRuns`. Using `totalRuns` would cause excessive backoff on issues that had early successes.
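The consecutive semantics can be sketched with a small helper; `bumpConsecutive` is an illustrative name, not a real function in the codebase:

```typescript
// Consecutive counter: any success resets to zero, so only an uninterrupted
// failure streak trips the circuit breaker.
function bumpConsecutive(
  counters: Map<number, number>,
  issueId: number,
  succeeded: boolean,
): number {
  const next = succeeded ? 0 : (counters.get(issueId) ?? 0) + 1;
  counters.set(issueId, next);
  return next;
}
```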
Source: exitHandler.ts:280-476, orchestrator.ts:43-58
Exit Handler Decision Tree
When an agent process exits, the exit handler determines what to do based on exit code, agent type, and work product. Every path through the handler updates counters, DB records, and optionally schedules retries.
Exit Path Summary
| Path | Counter(s) Affected | Action | Next Status |
|---|---|---|---|
| Issue done/cancelled (Layer 2) | None | Cleanup worktree, return | done/cancelled |
| Failed (exitCode !== 0) | consecutiveFailures++ | Circuit break or schedule retry | todo |
| Success + agent changed status | consecutiveFailures reset | Respect agent decision | (agent-set) |
| Success + pending completion claim | consecutiveFailures reset | Leave for transition processor | in_progress |
| Success + phase agent | consecutiveFailures reset | Leave for transition processor | in_progress |
| Success + judge no verdict | Judge comment count++ | Move to todo (>= 3) or leave in review | review or todo |
| Success + worker PR created | staleMates reset | Auto-create PR, move to review | review |
| Success + worker stalemate | staleMates++ | Circuit break (>= 3) or retry | todo |
| Success + planner has subtasks | noSignals reset | Return | in_progress |
| Success + no work product | noSignals++ | Retry with backoff or max-runs circuit break | todo |
Source: exitHandler.ts:160-538
Race Condition Defense
Symphony uses four defensive layers to prevent duplicate dispatch, orphaned agents, and stale state. Each layer catches problems that earlier layers might miss.
Layer 1: Atomic Claim (Pre-Dispatch)
Before dispatching an agent, the orchestrator checks the issue's current status fresh from the database and atomically sets it to in_progress. If the status has already changed (another tick claimed it, or a human moved it), dispatch is aborted.
- Protects against: Concurrent dispatch attempts, human status changes before claim
- Implementation: `dispatchSetup.ts:claimIssueForDispatch()`
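As a sketch, the claim can be modeled with an in-memory map standing in for the issues table; in the real system this is a single conditional UPDATE, and the function shape here is an assumption:

```typescript
// Atomic claim: succeed only if the status is still 'todo' at claim time.
// Any intervening change (another tick, a human) makes the claim fail,
// and the caller aborts dispatch.
function claimIssueForDispatch(statuses: Map<number, string>, issueId: number): boolean {
  if (statuses.get(issueId) !== 'todo') return false; // lost the race
  statuses.set(issueId, 'in_progress');
  return true;
}
```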
Layer 2: Post-Exit Failsafe
After an agent exits, the exit handler re-reads the issue from the database before processing. If the issue became done or cancelled during the agent's execution (e.g., a human merged the PR manually), the handler cleans up the worktree and returns early — skipping retry logic entirely.
- Protects against: Issues completed manually while an agent was running
- Implementation: `exitHandler.ts:250-277`
Layer 3: Reconcile GC (Three-Level Orphan Detection)
The reconcile phase runs on every tick and detects orphaned agents at three levels:
| Level | Detection Method | Action | Catches |
|---|---|---|---|
| 1. PID liveness | process.kill(pid, 0) on in-memory agents | Trigger exit handler with exitCode=1 | Crashed agent processes |
| 2. DB orphan | Query agent_runs with status=running, check PID alive, skip if in ctx.running | Mark run as failed, reset issue to todo | Agents that died between ticks, orchestrator restarts |
| 3. Issue orphan | Query issues with status=in_progress, skip if in ctx.running or has DB-level running run | Reset issue to todo | Exit handler failures, partial DB updates |
Additionally, reconcile checks for stall timeouts — agents that have been running longer than stall_timeout_ms are killed via SIGTERM.
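The Level 1 probe relies on standard Unix semantics: sending signal 0 delivers nothing and only checks whether the PID exists. A minimal Node.js sketch:

```typescript
// process.kill(pid, 0) performs an existence check without sending a signal;
// it throws (typically ESRCH) if the process is gone.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch {
    return false;
  }
}
```

Note that on some systems `kill` can throw `EPERM` for a live process owned by another user; since the orchestrator spawned its agents itself, that case does not arise here.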
Synthetic Key Handling
The ctx.running map uses issue IDs as keys for regular agents, but uses synthetic keys like intake:<batchId> for non-issue agents (intake batch processors). The reconcile loop guards against these synthetic keys to prevent handleAgentExit from receiving non-issue keys:
- Detection: `isSyntheticRunningKey()` checks if a key starts with a known prefix (`intake:`, etc.)
- Dead PID handling: If a synthetic agent's PID is dead, `handleSyntheticAgentDeath()` handles cleanup inline — removing from the running map, cleaning temp files, and marking the intake batch as failed
- Stall detection skipped: Synthetic agents have their own exit handlers (in `intakeDispatcher.ts`), so reconcile skips stall detection for them
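A plausible sketch of the prefix check; only the `intake:` prefix is confirmed by this document, so the list shape is an assumption:

```typescript
// Synthetic keys mark non-issue agents in ctx.running (e.g. intake batch
// processors). Only 'intake:' is documented; others would be added here.
const SYNTHETIC_PREFIXES = ['intake:'];

function isSyntheticRunningKey(key: string): boolean {
  return SYNTHETIC_PREFIXES.some((prefix) => key.startsWith(prefix));
}
```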
Source: syntheticKeys.ts, orchestrator.ts:reconcile()
- Protects against: Crashed processes, orchestrator restarts, partial DB updates
- Implementation: `orchestrator.ts:249-345`, `syntheticKeys.ts`
Layer 4: Dispatch Guard (Double-Dispatch Prevention)
After launching an agent subprocess but before recording it in ctx.running, the dispatcher checks if an agent already exists for this issue. If so, it kills the new process, marks the run as failed, and aborts. This prevents the ctx.running map (keyed by issueId) from being overwritten, which would orphan the existing agent.
- Protects against: Double-dispatch in the same tick cycle, race between phase transition judge dispatch and regular dispatch
- Implementation: `agentDispatcher.ts:479-491`
Status Transitions
Issues follow a canonical status transition map with three ownership domains:
Transition Ownership
| Transition | Owner | Trigger |
|---|---|---|
| backlog → todo | Orchestrator | Auto-promote when no todo work exists |
| todo → in_progress | Orchestrator | Atomic claim at dispatch time |
| in_progress → review | Agent (MCP create_pr) or Orchestrator (autoCreatePr) | PR created from worktree changes |
| in_progress → todo | Orchestrator | Agent failure, stalemate, or retry |
| in_progress → blocked | Agent (MCP update_issue_status) | Dependency not met |
| blocked → todo | Orchestrator | autoUnblockParents() when children complete |
| review → done | Human only | Merge PR via UI |
| review → todo | Judge (MCP reject_pr) or Orchestrator | PR rejected or judge exhausted retries |
| done → todo | Human | Reopen issue |
| cancelled → backlog/todo | Human | Restore issue |
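The ownership table implies a transition map that can guard every status write. This sketch is derived from the table above and is illustrative, not the actual `statusTransitions.ts` API:

```typescript
// Canonical transition map: each status lists the statuses it may move to.
const allowedTransitions: Record<string, string[]> = {
  backlog: ['todo'],
  todo: ['in_progress'],
  in_progress: ['review', 'todo', 'blocked'],
  blocked: ['todo'],
  review: ['done', 'todo'],
  done: ['todo'],
  cancelled: ['backlog', 'todo'],
};

// Reject any write whose (from, to) pair is not in the map.
function canTransition(from: string, to: string): boolean {
  return allowedTransitions[from]?.includes(to) ?? false;
}
```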
Source: server/utils/statusTransitions.ts, app/utils/statusTransitions.ts
Retry Queue
Failed or stalled agents are scheduled for retry with exponential backoff. The retry queue is in-memory and processed each tick during the "Process Retries" phase.
Backoff Formula
`delay = min(10_000 * 2^(attempt - 1), maxRetryBackoffMs)`

Where `attempt` is the consecutive failure/stalemate/no-signal count (not total runs).
Backoff Schedule
| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | 10s | 10s |
| 2 | 20s | 30s |
| 3 | 40s | 70s |
| 4 | 80s | 150s |
| 5 | 160s | 310s |
| 6+ | 300s (max, configurable) | 610s+ |
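The formula translates directly to code; `retryDelayMs` is an illustrative name, and the 300s cap matches the default `maxRetryBackoffMs` in the schedule above:

```typescript
// Exponential backoff: 10s base, doubling per consecutive attempt,
// capped at maxRetryBackoffMs (default 300s).
function retryDelayMs(attempt: number, maxRetryBackoffMs = 300_000): number {
  return Math.min(10_000 * 2 ** (attempt - 1), maxRetryBackoffMs);
}
```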
Important: Consecutive Counter, Not Total Runs
The attempt parameter must be the consecutive counter value. Using totalRuns causes excessive backoff — an issue with 10 historical runs but only 1 recent failure would get a 5-minute backoff instead of 10 seconds.
```typescript
// Correct — consecutive failures as attempt
const failures = (ctx.consecutiveFailures.get(issueId) ?? 0) + 1
ctx.retryQueue.scheduleRetry(issueId, identifier, failures, 'timeout')

// Wrong — total runs causes excessive backoff
ctx.retryQueue.scheduleRetry(issueId, identifier, totalRuns, 'timeout')
```

Retry Processing
Each tick, processRetries() pulls due entries from the queue and dispatches them:
1. Remove entry from queue
2. Re-read issue from DB — skip if status is not `todo`
3. Determine phase from issue labels
4. Check concurrency limits (global and per-project)
5. If no slots available, re-queue with `attempt + 1`
6. Dispatch agent with the correct phase and recovery strategy (if any)
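The steps above can be sketched as a single function; all names and callback shapes here are illustrative:

```typescript
type RetryEntry = { issueId: number; attempt: number };

// Process one due retry entry: skip if the issue moved out of todo,
// re-queue with a bumped attempt if no slot is free, else dispatch.
function processRetryEntry(
  entry: RetryEntry,
  getStatus: (issueId: number) => string,
  hasSlot: () => boolean,
  dispatch: (issueId: number) => void,
  requeue: (entry: RetryEntry) => void,
): 'skipped' | 'requeued' | 'dispatched' {
  if (getStatus(entry.issueId) !== 'todo') return 'skipped';
  if (!hasSlot()) {
    requeue({ ...entry, attempt: entry.attempt + 1 }); // back off further
    return 'requeued';
  }
  dispatch(entry.issueId);
  return 'dispatched';
}
```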
Context Recovery Cascade
When agents fail due to token/context limits, a progressive recovery cascade adjusts dispatch context instead of retrying identically. Strategies escalate from budget reduction → output compaction → fresh dispatch with learnings → model escalation. Token errors fast-track to output compaction. Recovery attempts count against the same max_retries budget — no additional retries are created.
See docs/patterns/retry-with-backoff.md § "Context Recovery Cascade" for full details.
Source: contextRecovery.ts, retryQueue.ts, orchestrator.ts:processRetries()
Phase Transition Flow
Phase transitions are the mechanism by which issues advance through the readiness pipeline. Agents submit completion claims; the orchestrator validates and transitions.
Precondition Checks
Before validating an artifact, the transition processor verifies:
- Phase match — the claim's `completed_phase` matches the issue's current `phase:*` label
- Artifact exists — the file at the resolved artifact path exists on disk
- Required sections — the artifact contains all sections defined in the contract's `required_sections`
If preconditions fail, the claim is marked as processed with a rejection reason, and a needs-revision label is added.
Validation Modes
| Mode | Behavior | Used By |
|---|---|---|
| structural | Auto-advance if artifact exists with required sections | phase:research |
| judge | Dispatch judge to review artifact; advance only on approve_phase | phase:architecture |
| trust | Structural checks run as warnings, always advances | (configurable) |
Adaptive Phase Skipping
Agents can recommend skipping phases by including skip_to in their completion claim. For example, a research agent assessing a simple ticket can skip directly to phase:ready. Skipped phases get skip:<phase> labels for traceability.
Source: phaseTransition.ts:31-440, contractParser.ts
Reconcile (GC)
The reconcile phase is the orchestrator's garbage collector — it detects and cleans up agents and issues that are in inconsistent states. It runs on every tick before dispatch.
Three-Level Detection
| Level | What It Checks | Detection | Recovery Action |
|---|---|---|---|
| 1. In-memory PID | Agents in ctx.running | process.kill(pid, 0) — throws if process dead | Trigger exit handler with exitCode=1 |
| 2. DB orphan | agent_runs with status=running not in ctx.running | PID liveness check on DB records | Mark run as failed, reset issue to todo |
| 3. Issue orphan | Issues with status=in_progress and no active agent | No entry in ctx.running AND no running agent_run | Reset issue to todo |
Stall Detection
In addition to orphan detection, reconcile checks for stalled agents — processes that are still alive but have exceeded stall_timeout_ms. Stalled agents are killed via SIGTERM, and their exit is processed as a failure with timedOut: true.
Why Three Levels?
Each level catches a different failure mode:
- Level 1 catches processes that crashed between ticks (most common)
- Level 2 catches processes that died but whose in-memory entry was already cleaned up (e.g., orchestrator restart with a different `ctx.running` state)
- Level 3 is defense-in-depth — catches cases where the `agent_run` was marked as failed but the issue status was never reset (e.g., the `.catch` block in dispatch updated `agentRuns` but threw before updating `issues`)
Source: orchestrator.ts:249-345
Dispatch Priority
The dispatch cycle fills available slots in a strict priority order within each project:
| Priority | Category | Condition | Rationale |
|---|---|---|---|
| 0 | Setup Wizard | wizardCompleted === false | Must configure project before any work |
| 1 | Judge | Issues in review status | Unblocks the pipeline — reviewed PRs become mergeable |
| 2 | Phase Agents | Issues with phase:research, phase:architecture, phase:grooming | Matures tickets toward readiness |
| 3 | Workers / Planners | phase:ready issues or needs-planning issues | Core implementation work |
| 4 | Scanner | No todo work AND backlog < 10 | Background improvement discovery |
Worker Slot Reserve
When maxConcurrency >= 2 and there are dispatchable worker items, 1 slot is reserved exclusively for phase:ready workers. This prevents phase agents (research, architecture, grooming) from consuming all available slots and starving implementation work.
The reserve only activates when worker candidates actually exist — if no phase:ready issues are available, phase agents can use all slots.
Judge Slot Allocation
Judges receive up to 25% of global slots (minimum 1) as a reserved pool. Judges can exceed globalMax by this amount to avoid being blocked behind workers. This ensures the review pipeline doesn't stall when all slots are occupied by workers.
Per-Project Limits
Each project has a configurable maxConcurrency limit. The dispatch cycle respects both global and per-project limits, using the more restrictive of the two. The getAvailableSlotsForProject() function returns an AvailableSlotsResult with both the available count and an optional reason:
```typescript
{ available: min(globalMax - globalRunning, projectMax - projectRunning),
  reason?: 'global_full' | 'project_full' }
```

The retry queue uses the `reason` field for differentiated backoff:

- `project_full`: Short fixed delay (10s) — the project will free up soon
- `global_full`: Normal exponential backoff — the entire system is congested
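A sketch of the slot computation using the result shape described above; the function body is an assumption, only the `AvailableSlotsResult` shape comes from this document:

```typescript
type AvailableSlotsResult = { available: number; reason?: 'global_full' | 'project_full' };

// Available slots are the more restrictive of the global and per-project
// limits; when zero, the reason tells the retry queue which pool is full.
function getAvailableSlotsForProject(
  globalMax: number, globalRunning: number,
  projectMax: number, projectRunning: number,
): AvailableSlotsResult {
  const globalFree = globalMax - globalRunning;
  const projectFree = projectMax - projectRunning;
  const available = Math.max(0, Math.min(globalFree, projectFree));
  if (available > 0) return { available };
  return { available: 0, reason: globalFree <= 0 ? 'global_full' : 'project_full' };
}
```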
Source: dispatchOrchestrator.ts, orchestrator.ts:processRetries(), concurrency.ts
Recovery Scenarios
When the orchestrator crashes or restarts, various in-flight state must be recovered. The table below maps each failure scenario to its recovery mechanism.
| Scenario | What's Lost | Recovery Mechanism | Recovery Time |
|---|---|---|---|
| Graceful shutdown mid-tick | Remaining sub-operations in current tick | shuttingDown breaks skip remaining sub-ops; finally block writes health metrics; tickCompletionPromise resolves to unblock awaitShutdown() | Immediate |
| Tick crash mid-phase-transition | Unprocessed claims in current iteration | Claims are idempotent — picked up on next tick | ~5s (next tick) |
| Tick crash mid-reconcile | Partial orphan detection | Orphans detected again on next tick | ~5s |
| Tick crash mid-dispatch | In-flight slot allocation, partially launched agent | Reconcile detects orphaned in_progress issues | ~5s |
| Orchestrator restart | All ctx.running entries, all counter maps | recoverOrphanedRuns() marks all DB running agent_runs as failed and resets issues to todo | Immediate on boot |
| Agent process crash | Agent's in-memory state | PID liveness check in reconcile triggers exit handler | ~5s (next tick) |
| In-memory counters lost | consecutiveFailures, staleMates, noSignals maps | totalRuns DB query provides absolute safety net — prevents runaway retry loops even without counter history | Immediate |
| Orphaned worktrees | Disk space consumed by unused worktrees | Background ArtifactPruner worker cleans stale worktrees; manual git worktree prune also works | Background worker interval |
| Retry queue lost | Scheduled retries with backoff | Issues are in todo status — dispatch cycle re-picks them on next tick (backoff timing is lost, but work continues) | ~5s |
Startup Recovery
On boot, the orchestrator calls recoverOrphanedRuns() which:
1. Queries all `agent_runs` with `status = 'running'`
2. Marks each as `failed` with error `'Orchestrator restarted'`
3. Resets the associated issue to `todo`
This ensures no issues are permanently stuck in in_progress after a crash.
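An in-memory sketch of the three recovery steps; the real implementation runs queries against the `agent_runs` and `issues` tables, and the types here are illustrative:

```typescript
type AgentRun = { id: number; issueId: number; status: string; error?: string };
type Issue = { id: number; status: string };

// Mark every still-'running' run as failed and reset its issue to todo,
// so nothing stays stuck in in_progress after a restart.
function recoverOrphanedRuns(runs: AgentRun[], issues: Map<number, Issue>): number {
  let recovered = 0;
  for (const run of runs) {
    if (run.status !== 'running') continue;
    run.status = 'failed';
    run.error = 'Orchestrator restarted';
    const issue = issues.get(run.issueId);
    if (issue) issue.status = 'todo';
    recovered++;
  }
  return recovered;
}
```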
Source: orchestrator.ts:224-247
Further Reading
- Architecture — three-process model, database schema, directory structure
- Phase Pipeline — phase contracts, validation modes, artifact system
- Agent Types — worker, judge, planner, researcher profiles
- Dispatcher — candidate selection, eligibility filters
- Agent Lifecycle — prompt composition, worktree setup, MCP tools