# Scheduler v2 — Migration Plan

> **Status**: Proposal, companion to [`README.md`](./README.md) and
> [`current-system-inventory.md`](./current-system-inventory.md).

Sequencing principle: each phase lands independently, keeps the full test
suite green, and shrinks the v1 surface the next phase has to reason about.
Phases 0–2 and phase E are mechanical, verifiable correctness or deletion
work, worth doing even if v2 stalls afterwards (phase E in particular fixes
live bugs in the current scheduler). Phase 3 is the structural cutover;
it is built as the new component set and flipped in one short-lived branch
series — **no long-lived runtime flag** for old-vs-new scheduler (we just
removed one mode flag; we should not mint another).

---

## Phase 0 — Remove push mode

Pure deletion; no behavior change in production (pull has been the default
and only production mode).

1. Delete `push-execution.ts`, `push-notifications.ts`,
   `push-subscriptions.ts`, `push-events.ts`, `push-continuation.ts`.
2. Remove `pullMode` field, `enablePullMode` / `disablePullMode` /
   `isPullModeEnabled`, and every mode branch (inventory §12 lists all
   sites). Inline the pull side.
3. `schedulerRuntimeFingerprint`: keep emitting `runner:scheduler:pull`, or
   bump to a versioned string and accept that pre-existing observations
   miss once. Decide with the memory owners; a miss is safe and costs only
   a one-time re-run.
4. Tests: rewrite the few push-baseline assertions (`scheduler-pull.test.ts`
   toggles modes to compare); delete mode-toggle tests.
5. Docs: update `pull-based-scheduler/README.md` mode-control section (or
   fold the doc into scheduler-v2 once phase 3 lands).
6. Telemetry: retire `scheduler.mode.change`.

Exit: no `push` identifier under `packages/runner/src/scheduler/`; suite
green.

## Phase 1 — Static write surface (P4 prerequisite)

Confirmed (2026-06-11): the pattern builder already produces exactly one
output redirect per node — the transformer cannot bind to multiple outputs.
**No corpus audit and no new builder/transformer enforcement is needed.**
What this phase does is stop *discovering* write sets from runs and freeze
the surface at registration:

1. Compute the write surface at node instantiation, as today's inputs
   already allow: primary result cell + `collectStaticRedirectWriteTargets`
   (fixed writable inputs; skipped when envelopes exist, per the existing
   tiering at `runner.ts:3495-3501`) + declared materializer envelopes.
   Pass it in the registration; nothing about it updates from run logs.
2. Collapse `SchedulerWriteIndex` to: static `outputsByNode` /
   `nodesByOutputEntity` + the materializer envelope index. Delete
   current-known/historical write tracking, declared-write seeding,
   dependents backfill on write growth, structural-ancestor pruning.
3. Remove the `schedulerHistoricalMightWrite` experimental option, the
   legacy `getMightWrite` mode, and historical write storage (deletion
   confirmed; keep `getMightWrite` as a thin "return outputs" shim only if
   a caller remains).
4. Belt-and-braces: a dev-mode assertion that a run's actual writes fall
   inside the registered surface (primary + static targets + envelopes),
   surfacing any side-writer the transformer's capability analysis missed —
   this is diagnostics for declaration gaps, not enforcement of a new rule.

Exit: writer lookup is a static map access; observation payload no longer
needs write sets (coordinate the payload change with phase 7 or ship
dual-write).
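
The frozen write surface and the dev-mode check from step 4 can be pictured with a minimal sketch. It assumes the surface (primary result cell, static redirect targets, envelopes) has already been computed by the caller, including the existing tiering. `StaticWriteIndexSketch`, `writersOf`, and `writesOutsideSurface` are hypothetical names, not the real `SchedulerWriteIndex` API.

```typescript
// Hypothetical shapes for illustration; not the runner's real types.
type EntityId = string;
type NodeId = string;

class StaticWriteIndexSketch {
  private outputsByNode = new Map<NodeId, Set<EntityId>>();
  private nodesByOutputEntity = new Map<EntityId, Set<NodeId>>();

  // Called once at registration. Nothing here is updated from run logs.
  register(node: NodeId, surface: EntityId[]): void {
    this.outputsByNode.set(node, new Set(surface));
    surface.forEach((entity) => {
      const writers = this.nodesByOutputEntity.get(entity) ?? new Set<NodeId>();
      writers.add(node);
      this.nodesByOutputEntity.set(entity, writers);
    });
  }

  // Writer lookup is a plain map access.
  writersOf(entity: EntityId): NodeId[] {
    const writers = this.nodesByOutputEntity.get(entity);
    return writers ? Array.from(writers) : [];
  }

  // Dev-mode diagnostic: actual writes that fall outside the registered surface.
  writesOutsideSurface(node: NodeId, actualWrites: EntityId[]): EntityId[] {
    const surface = this.outputsByNode.get(node) ?? new Set<EntityId>();
    return actualWrites.filter((entity) => !surface.has(entity));
  }
}
```

A non-empty result from `writesOutsideSurface` would be reported as a declaration gap in dev mode; it is a diagnostic, not a rejection of the write.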

## Phase 2 — Tx-carried identity (self-suppression)

Deliberately narrow. The in-process channel CANNOT be deleted here: the
conditional-effect filter depends on `changedWritesHistory`, which that
channel records; removing the filter before phase 3's invalid-at-turn
run-gate exists would regress effect-run counts. Channel deletion happens
inside the phase 3 cutover, where its replacement lands atomically.

1. Make the originating action a first-class transaction attribute:
   `sourceAction?: Action` stamped on the inner `IStorageTransaction`
   (alongside today's informal `debugActionId`, `action-run.ts:337`), set
   for action runs *and* event dispatches. Comparison is **object
   identity** — never the diagnostic action id, which can collide across
   instances (e.g. `pull:${uri}`).
2. Switch self-suppression in notification handling to
   `notification.source?.sourceAction === action`; delete `inFlightSources`
   (the WeakMap, add/remove lifecycle, and its notification check).
3. **Keep the changeGroup skip unchanged.** It is a user-facing suppression
   feature (`cf-code-editor` sinks subscribe with a changeGroup to filter
   their own edits), not scheduler plumbing.
4. Verification: scheduler suite green; a focused fixture that an action
   writing its own read does not retrigger itself, and that a sibling
   action with the same diagnostic id IS still triggered.

Exit: self-suppression is one object-identity comparison; no per-action
in-flight bookkeeping.
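
A minimal sketch of the comparison in step 2 and the fixture in step 4, assuming simplified stand-ins for the action, transaction, and notification types. Only `sourceAction` comes from this plan; `ActionSketch`, `TxSketch`, `ChangeNotificationSketch`, and `isSelfNotification` are hypothetical.

```typescript
// Hypothetical minimal shapes; only `sourceAction` is named in the plan.
interface ActionSketch {
  diagnosticId: string; // for logs only, may collide across instances
  run: () => void;
}
interface TxSketch {
  sourceAction?: ActionSketch;
}
interface ChangeNotificationSketch {
  source?: TxSketch;
}

// Self-suppression is a single object-identity comparison.
function isSelfNotification(
  notification: ChangeNotificationSketch,
  action: ActionSketch,
): boolean {
  return notification.source?.sourceAction === action;
}

// Two distinct action instances that share one diagnostic id.
const actionA: ActionSketch = { diagnosticId: "pull:doc-1", run: () => {} };
const actionB: ActionSketch = { diagnosticId: "pull:doc-1", run: () => {} };
const fromA: ChangeNotificationSketch = { source: { sourceAction: actionA } };

console.log(isSelfNotification(fromA, actionA)); // true: A does not retrigger itself
console.log(isSelfNotification(fromA, actionB)); // false: the sibling is still triggered
```

Comparing by reference is what keeps the sibling with the colliding id triggerable; comparing `diagnosticId` strings would suppress both.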

## Phase E (independent) — lineage + receipts for event-launched work

Implements spec §7.6 (invariants I10, I11). Independent of the v2 cutover:
lands against the current scheduler and should not wait for it. Has a
memory-engine component — coordinate with the memory owners from the start.

**E0 — shared infrastructure.**

1. Durable event identity minted at send: origin tx id (or ingress id) +
   stream link + per-origin sequence; carried on `QueuedEvent` and into the
   handling transaction.
2. Rejection taxonomy: split commit rejections into *retryable* (optimistic
   conflict — retry as today) and *permanent* (precondition failed — drop,
   never retry), surfaced distinctly to the scheduler's retry paths
   (`events.ts` unshift-retry must not fire on permanent rejections).
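
As a rough illustration of E0, the sketch below mints an event id from the three components listed in step 1 and branches the retry path on the rejection kind from step 2. The id string format, the `attempts` counter, and all names ending in `Sketch` are assumptions made for the example, not the real `QueuedEvent` or rejection types.

```typescript
// Hypothetical shapes; the id format and field names are illustrative only.
interface CommitRejectionSketch {
  kind: "retryable" | "permanent"; // optimistic conflict vs precondition failed
  reason: string;
}

interface QueuedEventSketch {
  id: string; // minted once at send, reused across retries
  payload: unknown;
  attempts: number;
}

// Durable identity: origin tx id (or ingress id) + stream link + per-origin sequence.
function mintEventId(originId: string, streamLink: string, seq: number): string {
  return `${originId}/${streamLink}/${seq}`;
}

// Retry path: unshift only on retryable rejections with budget left.
function onHandlingRejected(
  pending: QueuedEventSketch[],
  ev: QueuedEventSketch,
  rejection: CommitRejectionSketch,
  maxAttempts: number,
): "requeued" | "dropped" {
  if (rejection.kind === "permanent") return "dropped";
  if (ev.attempts + 1 >= maxAttempts) return "dropped";
  pending.unshift({ ...ev, attempts: ev.attempts + 1 });
  return "requeued";
}
```

The requeued copy keeps the original `id`, which is what later lets lineage and receipts treat a retry as the same event.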

**E1 — speculation lineage (I10).**

1. Stream `Cell.set` keeps queueing at send time (`cell.ts:1167` —
   unchanged latency); the queued event records its origin tx id.
2. Same-space origins: handling transactions carry an *origin-committed*
   precondition verified by the memory engine (same-session commits are
   processed in order, so the origin's fate is known — the check is free).
   Cross-space origins: the event **parks until the origin commit is
   confirmed** (spec resolved decision 11; same head-parking mechanism as
   time-gated dependencies, latency mirrors the accepted cross-space write
   protocol), then dispatches normally; dropped on origin failure. No
   cross-space server verification.
3. Client lineage registry: origin tx → {queued events, started pieces}.
   On locally-known origin failure: cancel undispatched descendant events,
   cancel+stop descendant pieces (`handleJavaScriptHandlerResult` pull
   path — restoring the cleanup the push branch had, `runner.ts:2724-2729`
   vs `2735`). `navigateTo` keeps `startAfterSuccessfulCommit`.
4. Leave a watch-this comment at the retry-exhaustion sites
   (`watchReactiveActionCommit`, `rescheduleActionForImmediateRetry`) for
   the accepted zombie-piece case (spec resolved decision 9).
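
The client lineage registry in step 3 could look roughly like the sketch below. It only tracks ids and reports what the caller has to cancel or stop; the class and method names are hypothetical, and dropping an origin's entry once it commits is an assumption of this sketch.

```typescript
// Hypothetical registry; ids are plain strings for illustration.
interface DescendantsSketch {
  undispatchedEvents: Set<string>;
  startedPieces: Set<string>;
}

class LineageRegistrySketch {
  private byOrigin = new Map<string, DescendantsSketch>();

  private entry(originTx: string): DescendantsSketch {
    let d = this.byOrigin.get(originTx);
    if (!d) {
      d = { undispatchedEvents: new Set<string>(), startedPieces: new Set<string>() };
      this.byOrigin.set(originTx, d);
    }
    return d;
  }

  recordQueuedEvent(originTx: string, eventId: string): void {
    this.entry(originTx).undispatchedEvents.add(eventId);
  }

  recordStartedPiece(originTx: string, pieceId: string): void {
    this.entry(originTx).startedPieces.add(pieceId);
  }

  // A dispatched event can no longer be cancelled from the queue.
  markDispatched(originTx: string, eventId: string): void {
    this.byOrigin.get(originTx)?.undispatchedEvents.delete(eventId);
  }

  // Origin committed: descendants are legitimate, stop tracking them.
  originCommitted(originTx: string): void {
    this.byOrigin.delete(originTx);
  }

  // Origin failed: report what the caller must cancel and stop.
  originFailed(originTx: string): { cancelEvents: string[]; stopPieces: string[] } {
    const d = this.byOrigin.get(originTx);
    this.byOrigin.delete(originTx);
    return {
      cancelEvents: d ? Array.from(d.undispatchedEvents) : [],
      stopPieces: d ? Array.from(d.startedPieces) : [],
    };
  }
}
```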

**E2 — receipts = result cells (I11).**

1. Make the handler-result cause event-causal: replace the random
   per-invocation `$event: crypto.randomUUID()` (`runner.ts:2995-2998`)
   with the E0 event id, threaded from `QueuedEvent` into the handler
   frame. All handler-frame-minted ids become deterministic per event:
   retries reuse ids (aborted attempts never committed), per-gesture
   uniqueness is preserved (event ids are unique per send). Verify no
   fixture relies on per-attempt-unique ids.
2. Memory engine: create-only precondition on the handling's result cell
   (default-on for all events, decision 14; gated only by the
   transitional `commitPreconditions` protocol flag), with a distinct
   permanent rejection (receipt-exists).
3. Runner: every handling materializes the result cell unconditionally
   (default-on for all events — it is the `{ resultFor: cause }` cell a
   pattern-launching handler already creates; no new document kind and no
   class machinery); on receipt-exists rejection the client drops the
   event (lost race — no retry) and emits telemetry.
4. Single-handler enforcement: replace `queueSchedulerEvent`'s silent
   one-event-per-matching-handler fanout with one handler per stream link
   at registration (dev-mode error on concurrent duplicates; audit
   existing registrations first). Multi-handler dispatch returns later as an
   explicit opt-in feature (the handler id would join the result-cell
   derivation).
5. Future layering deferred (spec open question 2): per-class refinements,
   receipt retention/GC, and CFC exactly-once scope alignment land later;
   E2 ships with no class surface at all.
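
To make the receipt mechanics concrete, here is a small sketch of steps 1 to 3 under simplifying assumptions: an in-memory map stands in for the memory engine, and the result cell id is a plain string derived from the event id. `ReceiptStoreSketch`, `resultCellId`, and `handleEvent` are hypothetical names, and the `resultFor:` prefix is not the real derivation.

```typescript
// Hypothetical in-memory stand-in for the memory engine's create-only check.
class ReceiptStoreSketch {
  private cells = new Map<string, unknown>();

  // Create-only precondition: succeeds only if the cell does not exist yet.
  createOnly(cellId: string, value: unknown): "created" | "receipt-exists" {
    if (this.cells.has(cellId)) return "receipt-exists";
    this.cells.set(cellId, value);
    return "created";
  }
}

// Deterministic per event: retries and redeliveries derive the same id.
// The `resultFor:` prefix is illustrative, not the real derivation.
function resultCellId(eventId: string): string {
  return `resultFor:${eventId}`;
}

// The handler runs, then the commit carries the create-only precondition.
function handleEvent(
  store: ReceiptStoreSketch,
  eventId: string,
  handler: () => unknown,
): "committed" | "dropped" {
  const value = handler();
  const outcome = store.createOnly(resultCellId(eventId), value);
  // receipt-exists is permanent: the lost race is dropped, never retried.
  return outcome === "created" ? "committed" : "dropped";
}
```

Two runtimes handling the same event id against the same store end with exactly one `"committed"`; the other sees `"dropped"` and would emit telemetry instead of retrying.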

**Fixtures (red first, per repo practice):**

- payload-only follow-up from a failed parent commit (escapes today's
  read-dependency rejection): never handled durably;
- handler conflicts then succeeds on retry: follow-up handled exactly once,
  payload from the committed attempt;
- handler exhausts retries: follow-up never handled; result piece stopped
  and unregistered;
- receipt race (multi-runtime, same event id — use the multi-user `cf test`
  harness): exactly one runtime's handler commits; the loser does not
  retry;
- receipt + retryable conflict on another doc: handler retries and commits;
  its own receipt never blocks it;
- cross-space follow-up: parks until origin confirmation, dispatches after;
  dropped (never dispatched) when the origin fails;
- pattern-launching event redelivered to a second runtime: the result-cell
  create collides; exactly one piece exists; the loser does not retry.

Exit: I10 holds for handler-sent events and handler-started pieces; I11
holds for all handled events (receipts are default-on and E2 ships with no
class surface).

## Phase 3 — Node records + liveness refcounts + new pass (the cutover)

Build the v2 components (`registry`, `graph`, `invalidation`, `settle`,
`gates` minimal) in `scheduler/` alongside v1, then flip module-by-module
where separable, or as one reviewed series where not:

1. Introduce `SchedulerNode` records; migrate classification, status, causes,
   parent, budgets into them. The ~25 `create*State()` bundles shrink as each
   consumer reads the record instead.
2. Replace `SchedulerStaleness` + demand walks with `liveRefs` maintenance on
   edge deltas + provisional demand. Delete
   `pullDemandedFirstRunComputations`, `pullDemandedContinuationComputations`,
   `activePullDemandActions`, `scheduledFirstTime`, `isEffectAction`.
3. Replace the settle loop with `pass()` (§7): seeds = invalid∧live∧eligible,
   downstream closure for ordering, run-gate re-check at turn. Delete
   `dirty-dependencies.ts` upstream collector, traversal-root asymmetry,
   `collectStack`, the late-materializer per-effect recheck (folded into
   work-set construction), and the cycle breaker (replaced by §7.7 budgets +
   backoff). **The in-process propagation channel and the conditional-effect
   machinery are deleted here** (moved from phase 2): `write-propagation.ts`,
   `changedWritesHistory`, `conditionallyScheduledEffects` and its run-time
   filter — in the same change-series that lands the invalid-at-turn
   run-gate, with filter parity fixtures (effects skipped-as-unchanged under
   the old filter must still not run) and the synchronous-notification
   assertion across storage providers as gates.
4. Read-delta application replaces resubscribe/unsubscribe-around-runs
   (`pull-subscriptions.ts` resubscribe path, trigger replace memo).
5. Port the run path (`action-run.ts`) minus the deleted steps; keep CFC
   trigger-read consume/restore, retries, observation attach.
6. Test strategy:
   - the existing behavioral suite is the contract — it must pass unchanged
     except where it asserts v1 *internals* (set memberships, filter stats
     wording); rewrite those against the introspection surface;
   - new fixtures: provisional-demand expiry (spec resolved decision 4),
     parent-continuation-as-invalidation, first-run with under-approximated
     declared reads (assert ≤1 extra run and convergence), cycle backoff
     (non-converging pair stays rate-limited, `idle()` still resolves, other
     subgraphs unaffected);
   - benches before/after: `scheduler.bench.ts`,
     `scheduler-demand-roots.bench.ts`, `scheduler-stale-propagation.bench.ts`
     (this one should improve dramatically or become trivial),
     `scheduler-event-preflight.bench.ts`,
     `scheduler-materializer-fanout.bench.ts`, plus the CT-1623 reload
     re-run counts (the historical regression metric for this subsystem).
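
The shape of the new pass in step 3 can be sketched as follows. This is a simplified model, not the spec's `pass()`: it assumes an acyclic graph (cycles are left to the §7.7 budgets and backoff), a hypothetical `PassNodeSketch` record, and synchronous `run` callbacks that may invalidate downstream nodes.

```typescript
// Hypothetical node record; field names other than `liveRefs` are illustrative.
interface PassNodeSketch {
  id: string;
  invalid: boolean;
  liveRefs: number; // > 0 means some live consumer depends on this node
  eligibleAt: number; // earliest time the gates allow a run
  downstream: string[]; // ids of nodes that read this node's outputs
  run: () => void;
}

function passSketch(nodes: Map<string, PassNodeSketch>, now: number): string[] {
  const runnable = (n: PassNodeSketch): boolean =>
    n.invalid && n.liveRefs > 0 && n.eligibleAt <= now;

  // Seeds: invalid AND live AND eligible.
  const seeds = Array.from(nodes.values()).filter(runnable);

  // Downstream closure, ordered by reverse DFS post-order (assumes a DAG).
  const order: string[] = [];
  const seen = new Set<string>();
  const visit = (id: string): void => {
    const n = nodes.get(id);
    if (!n || seen.has(id)) return;
    seen.add(id);
    n.downstream.forEach((next) => visit(next));
    order.push(id);
  };
  seeds.forEach((s) => visit(s.id));
  order.reverse();

  const ran: string[] = [];
  for (const id of order) {
    const n = nodes.get(id);
    // Run-gate re-check at turn: skip anything not invalid, live, and eligible now.
    if (!n || !runnable(n)) continue;
    n.invalid = false;
    n.run();
    ran.push(id);
  }
  return ran;
}
```

Because the gate is re-checked at each node's turn, a downstream node that was valid when the pass started still runs if an upstream run invalidated it, while an invalid node with no live references stays parked.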

Exit: `scheduler.ts` is a facade over the component set; the inventory's
"Delete/Subsume" column is fully realized for §§4–6.

## Phase 4 — Remove the first-run prefetch

Depends on phases 1–3.

1. Stop generating reactive-node `populateDependencies` in the runner
   (`runner.ts:3510-3535`); pass `declaredReads` (the already-computed
   `reads` binding links) in the `NodeSpec` instead.
2. Delete the scheduler-side collection passes
   (`collectInitialExecuteDependencies`, `collectPostEventDependencies`,
   `collectPullSettlePreRunDependencies`, `pendingDependencyCollection`,
   `dependency-collection.ts`).
3. Keep the handler-preflight populate (it moves to `events`).
4. Risk to manage: the prefetch's deep `get()` doubled as a replica warmer —
   it kicked loads of link-target docs before first run. v2 relies on
   (a) piece-level `resume` sync (§9.2) for resumed pieces, (b) fresh pieces
   having their data locally by construction, and (c) the invariant that a
   doc arriving later surfaces as a change and re-runs the reader (the
   #3886 awaitSync lesson). Add an integration fixture: cold-replica fresh
   start where a computation's input doc arrives only after its first run;
   assert convergence without manual nudges.
5. Measure piece-start cost on a large space (the original complaint):
   expect registration to be index-inserts only; compare against v1 numbers
   for the default app.
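
A minimal sketch of what "index-inserts only" registration and the arrival-as-change invariant might look like together. `declaredReads` is the field named in step 1; `NodeSpecSketch`, `ReadIndexSketch`, and `onDocChanged` are hypothetical, and docs are identified by plain strings.

```typescript
// Hypothetical spec and index; `declaredReads` is the only name from the plan.
interface NodeSpecSketch {
  id: string;
  declaredReads: string[]; // doc ids from the already-computed `reads` binding links
}

class ReadIndexSketch {
  private readersByDoc = new Map<string, Set<string>>();

  // Registration is index inserts only: no deep get(), no loads kicked.
  register(spec: NodeSpecSketch): void {
    spec.declaredReads.forEach((doc) => {
      const readers = this.readersByDoc.get(doc) ?? new Set<string>();
      readers.add(spec.id);
      this.readersByDoc.set(doc, readers);
    });
  }

  // A doc arriving later surfaces as a change: return the readers to invalidate.
  onDocChanged(doc: string): string[] {
    const readers = this.readersByDoc.get(doc);
    return readers ? Array.from(readers) : [];
  }
}
```

In the cold-replica fixture from step 4, the input doc's late arrival would go through the same path as any other change, so the reader re-runs without a manual nudge.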

## Phase 5 — Unify time gates

1. Fold `delays.ts` + `delay-control.ts` + the event wake timer into
   `gates`: per-node `debounceReadyAt` / `throttleReadyAt` / `backoffUntil`,
   one wake timer, `eligibleAt()`.
2. Re-express auto-debounce and the §7.7 backoff as policies writing gate
   fields; delete computation trailing-flush seeds (`scheduler-throttle` /
   `scheduler-timing` tests define the observable contract and must pass).
3. Event parking uses the same wake (head event `notBefore` = min
   `eligibleAt` of blocking deps).
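
The unified gate arithmetic can be sketched as below. The three field names and `eligibleAt` come from step 1; treating `eligibleAt` as the maximum of the three fields is an assumption of this sketch, as are the helper names `nextWakeAt` and `headNotBefore` and the use of millisecond timestamps.

```typescript
// Per-node gate fields named in the plan; timestamps are epoch milliseconds (assumed).
interface GateFieldsSketch {
  debounceReadyAt: number;
  throttleReadyAt: number;
  backoffUntil: number;
}

// Assumption: a node is eligible once every gate has opened.
function eligibleAt(g: GateFieldsSketch): number {
  return Math.max(g.debounceReadyAt, g.throttleReadyAt, g.backoffUntil);
}

// One wake timer: the earliest future eligibility across gated nodes.
function nextWakeAt(gated: GateFieldsSketch[], now: number): number | undefined {
  const future = gated.map(eligibleAt).filter((t) => t > now);
  return future.length > 0 ? Math.min(...future) : undefined;
}

// Event parking: head event notBefore = min eligibleAt of blocking deps.
function headNotBefore(blockingDeps: GateFieldsSketch[]): number | undefined {
  return blockingDeps.length > 0
    ? Math.min(...blockingDeps.map(eligibleAt))
    : undefined;
}
```

Policies such as auto-debounce or backoff would then only write the gate fields; the single timer is re-armed from `nextWakeAt` whenever a field changes.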

## Phase 6 — Event preflight closure cache (optional, measured)

Decision (2026-06-11): default stays populate-per-dispatch; caching is an
off-by-default optimization adopted only if measurement justifies it.

1. Bench first: quantify steady-state preflight cost on realistic handlers
   (`scheduler-event-preflight.bench.ts` + a UI-flow trace). If it is not a
   material share of event latency, skip this phase entirely.
2. If adopted: cache the handler read closure from the last dispatch log,
   invalidate on handler re-registration, keep populate-per-dispatch as the
   correctness fallback and as the default until the cache has soak time.
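
If the cache is adopted, its contract could be as small as the sketch below: off by default, filled from the last dispatch log, invalidated on re-registration, and always able to fall back to populate-per-dispatch. The class, its method names, and the string keys are hypothetical.

```typescript
// Hypothetical cache; handler ids and read keys are plain strings here.
class PreflightClosureCacheSketch {
  private closures = new Map<string, string[]>();
  private enabled: boolean;

  constructor(enabled: boolean = false) {
    this.enabled = enabled; // off by default
  }

  // Returns the reads to preflight; populate() is the per-dispatch fallback.
  readsFor(handlerId: string, populate: () => string[]): string[] {
    const cached = this.enabled ? this.closures.get(handlerId) : undefined;
    return cached ?? populate();
  }

  // After a dispatch, remember the read closure seen in its log.
  recordDispatch(handlerId: string, readsFromLog: string[]): void {
    if (this.enabled) this.closures.set(handlerId, readsFromLog.slice());
  }

  // Re-registration invalidates the cached closure.
  onHandlerReregistered(handlerId: string): void {
    this.closures.delete(handlerId);
  }
}
```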

## Phase 7 — Persistence alignment

Coordinate with memory-layer owners (observation rows live in memory v2):

1. Slim the observation payload (§9.3): drop `currentKnownWrites` /
   `declaredWrites`; version the observation shape; readers accept both
   during transition.
2. Move resume to the piece-level phase: runner awaits space sync once, then
   registers nodes in `resume` mode synchronously against the fetched
   snapshot batch (one `listSchedulerActionSnapshots` query per piece rather
   than per action, if the API allows batching — extend if not). Delete the
   per-action token/timeout/canApply apparatus.
3. Re-run `scheduler-observations.test.ts`, `reload-rehydration.test.ts`,
   `scheduler-persistent-state.bench.ts`; extend with: resume-clean piece
   performs zero runs and zero cell-data reads (I2/I7 witness).
4. Separately decide the default-on flip of
   `EXPERIMENTAL_PERSISTENT_SCHEDULER_STATE` (own rollout, not gated on v2).
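
For step 1, a transitional reader that accepts both observation shapes might look like this sketch. Only `currentKnownWrites` and `declaredWrites` are names from this plan; the `version` field, the `reads` field, and the type names are assumptions chosen to keep the example small.

```typescript
// Hypothetical row shapes; only `currentKnownWrites` and `declaredWrites`
// are named in the plan, the other fields are illustrative.
interface ObservationV1Sketch {
  reads: string[];
  currentKnownWrites: string[];
  declaredWrites: string[];
}
interface ObservationV2Sketch {
  version: 2;
  reads: string[];
}

// Transitional reader: accepts both shapes and normalizes to the slim one.
function readObservation(
  row: ObservationV1Sketch | ObservationV2Sketch,
): ObservationV2Sketch {
  if ("version" in row) return row;
  // Legacy row: write sets are ignored because the write surface is static.
  return { version: 2, reads: row.reads };
}
```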

---

## Risk register

| Risk | Phase | Mitigation |
| --- | --- | --- |
| A storage configuration delivers commit notifications asynchronously, so same-pass convergence regresses (still correct, more ticks) | 2 | Test-only synchronicity assertion across providers; accept extra ticks as degraded-but-correct; document the provider requirement in storage interface docs |
| Hidden dependence on prefetch as replica warmer (cold-start empty reads) | 4 | Fixture in 4.4; awaitSync piece gate; arrival-as-change invariant test |
| A side-writer the transformer's capability analysis missed (write outside the registered surface) | 1 | Dev-mode actual-writes-within-surface assertion (phase 1.4) surfaces declaration gaps; idempotency validator covers the contract side |
| Parked cross-space follow-up head-blocks the single global lane for a confirmation round trip | E | Accepted (same slowness class as the cross-space write protocol); the agreed per-space lane split confines it when it bites (spec open question 1) |
| Permanent-vs-retryable rejection taxonomy leaks wrong behavior (a permanent rejection retried, or a conflict dropped) | E | Taxonomy lands first (E0) with focused tests on both retry paths (`events.ts` unshift, `action-run.ts` watch) before lineage/receipts build on it |
| Default-on receipts add one create per handling; high-frequency programmatic event streams could bite | E | Measure commit volume after E2; per-class layering and retention/GC (spec open question 2) are the escape hatch |
| Single-handler enforcement breaks a stream silently relying on multi-match fanout | E | Registration audit before enforcement; dev-mode error first, prod telemetry; multi-handler returns later as an explicit opt-in feature |
| Event-causal handler-frame ids change id derivation for documents created in handlers | E | Uniqueness per gesture is preserved (event ids unique per send); fixture sweep over handler-heavy patterns before the cause swap |
| Conditional-effect parity (effects running more often than v1's watermark filter allowed) | 3 | Run-count parity fixtures on the v1 conditional-effect tests; the §7.2 closure-ordering must land with the watermark deletion, not after |
| Persisted observation misses after fingerprint change | 0, 7 | Versioned fingerprints; a miss only costs one re-run per node |
| changeGroup external consumers (runtime client, toolshed diagnostics) | 2 | Grep + keep as inert diagnostic label until consumers migrate |
| Non-converging patterns that v1's cycle breaker kept visibly fresh now lag behind backoff gates | 3 | Backoff caps (e.g. ≤2s) keep worst-case staleness bounded; non-settling telemetry unchanged; pattern-side fix remains the real remedy |
| Behavioral drift in `idle()` (tests and `cf` CLI lean on it heavily) | 3, 5 | Treat `idle()` semantics (§8.4) as frozen contract; port its tests first |

## Flag end-state

| Flag / API | Disposition |
| --- | --- |
| `pullMode` + `enablePullMode`/`disablePullMode`/`isPullModeEnabled` | Removed (phase 0) |
| `experimental.schedulerHistoricalMightWrite` | Removed (phase 1) |
| `EXPERIMENTAL_PERSISTENT_SCHEDULER_STATE` / `experimental.persistentSchedulerState` | Kept through v2; default-on is a separate rollout (phase 7.4) |
| New old-vs-new scheduler flag | **Not introduced** — cutover happens on a branch series with the test suite as the gate |