# Debugging Settle Waves

Use this guide when the shell or a pattern feels busy after a change, reload,
or user interaction, especially when the page looks responsive but the worker
keeps running for much longer than expected.

The goal is to answer four questions quickly:

1. Is the expensive work on the main thread or in the worker?
2. Is the worker spending time on rendering, storage, traversal, or scheduler
   convergence?
3. Which interaction or write kicked off the fan-out wave?
4. What should be instrumented next if the existing logs are not enough?

A worked example with concrete measurements from one investigation is archived
in [settle-wave-2026-03-findings](archive/settle-wave-2026-03-findings.md).
Console API details for every command used below live in
[console-commands](console-commands.md).

## When To Suspect a Settle Wave

Start here if you see any of the following:

- reload looks visually fine but the worker stays busy
- creating or editing content triggers long waves of background work
- `scheduler` or `traverse` counts jump rapidly after a single write
- UI updates land, but the runtime keeps settling for several more passes
- Chrome traces show long tasks on a dedicated worker thread instead of
  `CrRendererMain`

This guide is worker-first: in practice the most important work often happens
off the main thread. First rule out a true non-idempotent loop with
`await commonfabric.detectNonIdempotent()` — see
[non-idempotent-detection](non-idempotent-detection.md). If `busyTime` is high
but `nonIdempotent` and `cycles` are empty, you are looking at broad fan-out or
slow convergence: continue here.

The escalation order: trace (main thread or worker?) → logger baselines →
settle stats → trigger trace ("which write scheduled this?") → action-run
trace ("which actions actually ran?") → write stack trace ("which callsite
wrote the hot cell?") → focused debug loggers.

## Reproduction Workflow

Use a simple, repeatable shell flow and keep it fixed across runs:

1. Open a local space at `http://localhost:8000/<space-id>`.
2. Perform the interaction you care about (e.g. create a note), return to the
   starting view, and confirm the result persisted.
3. Repeat the same interaction several times in the **same space** — for
   scaling questions, fan-out grows with existing content, so a fresh space
   per run hides the problem. In the integration harness, prefer
   `SPACE_NAME=...` over a random space.

For a scripted browser reproduction instead of manual console work, the
default-app integration flow supports trace capture:

```sh
HEADLESS=true API_URL=http://localhost:8000 FRONTEND_URL=http://localhost:5173 \
CF_CAPTURE_TRIGGER_TRACE=1 \
deno test -A packages/patterns/integration/default-app.test.ts
```

Use `:5173` when you need the shell to serve the worktree's current code, and
keep `API_URL` pointed at Toolshed on `:8000`.

## Trace Workflow

Capture two Chrome performance traces when possible: one for a reload (does
startup or persisted hydration cause a wave?) and one around the interaction
(does a specific write or navigation fan out?). For large spaces, repeat the
same reload 3–5 times and compare medians; individual worker timings vary a
lot from run to run even when the shape is stable.

For each trace, compare total `RunTask` time on `CrRendererMain` versus the
dedicated worker, count worker tasks at or above `50 ms`, inspect large
`RunTask` slices for long microtask drains, and map the hottest worker bundle
locations back to source files.

Rough rule of thumb:

- If main-thread time dominates, start with rendering and DOM work.
- If worker `RunTask` time dominates, treat the trace as a scheduler/runtime
  problem first.
- If `worker-reconciler` is quiet but `scheduler/execute/settle` is large, the
  bottleneck is convergence before rendering.
- If `traverse` is high count but low average latency, it is often a symptom of
  broad fan-out rather than the sole root cause.

## Console Workflow

After the trace, keep the session live and inspect the worker through
`commonfabric.rt` in the page console.

### Baseline, Replay, Compare

```js
// Shown inside a pattern body.
await commonfabric.rt.getLoggerCounts()       // see what counters exist
await commonfabric.rt.resetLoggerBaselines()  // just before the interaction
// ... replay one interaction, let it settle ...
await commonfabric.rt.getLoggerCounts()       // inspect the deltas
```

The most useful worker timing groups to compare first:

- `scheduler/execute`, `scheduler/execute/settle`, `scheduler/execute/event`
- `scheduler/run`, `scheduler/run/action`, `scheduler/run/commit`
- `traverse`
- `storage.cache`
- `worker-reconciler`

How to interpret the deltas:

- Large `scheduler/execute/settle` with multiple `execute()` passes usually
  means repeated convergence work after the initial event.
- A large trigger-trace fan-out plus repeated `schedule-resubscribe` usually
  means a write is matching many existing subscriptions and rebuilding too
  much scheduling state.
- Large `schedule-run-start` relative to one user interaction means one write
  is fanning out into many action runs.
- High `storage.cache` volume matters only if its timing dominates too.
- Small `worker-reconciler` deltas mean the UI flush is downstream, not the
  primary bottleneck.

### Settle Stats

When logger timing is not enough, capture per-`execute()` settle-loop stats:

```js
// Shown inside a pattern body.
await commonfabric.rt.setSettleStatsEnabled(true)
// ... replay the interaction ...
await commonfabric.rt.getSettleStatsHistory()
```

Prefer `getSettleStatsHistory()` — `getSettleStats()` returns only the **last**
`execute()` call, so a trailing empty settle pass can overwrite the
interesting wave. See [console-commands](console-commands.md#worker-settle-stats)
for the payload shape and a live-polling snippet.

### Trigger Trace: Which Write Scheduled This Action?

Use trigger trace when the question is no longer "is there churn?" and becomes
"which exact write scheduled this action again?"

```js
// Shown inside a pattern body.
await commonfabric.rt.setTriggerTraceEnabled(false)  // reset the ring buffer
await commonfabric.rt.setTriggerTraceEnabled(true)
// ... replay the interaction, let it settle ...
await commonfabric.explainTriggerTrace({ rootOnly: true, limit: 8 })
```

`explainTriggerTrace` groups exact `space/entity/path` changes, counts direct
schedules and downstream effects, reads the changed cells back, and adds shape
hints such as `ui-result` and `index-state`. For raw entries and manual
grouping, see [console-commands](console-commands.md#worker-trigger-trace).

### Action-Run Trace: Which Actions Actually Ran?

Use exact action-run tracing when the question becomes "which actions really
ran?" rather than "which were merely scheduled?" — especially when trigger
trace is noisy because one root write schedules many sinks, or when comparing
run N against run N+1 in the same space.

```js
// Shown inside a pattern body.
await commonfabric.rt.setActionRunTraceEnabled(false)  // reset
await commonfabric.rt.setActionRunTraceEnabled(true)
// ... replay the interaction ...
await commonfabric.rt.idle()
const trace = await commonfabric.rt.getActionRunTrace()
```

Group entries by `actionId` and sort by count and total duration (grouping
snippet in [console-commands](console-commands.md#worker-action-run-trace)).
The first run in a space often includes navigation, mount, and
reader-materialization noise — compare later runs against each other.

### Write Stack Trace: Which Callsite Wrote the Hot Cell?

Once trigger trace has told you which cell is noisy, arm the transaction-level
write watcher to capture the exact write callsite:

```js
// Shown inside a pattern body.
await commonfabric.watchWrites({
  space: "did:key:z6Mkm...",
  id: "of:baedrei...",
  path: [],
  match: "exact",
  label: "watched hot cell",
})
// ... replay the interaction ...
const trace = await commonfabric.getWriteStackTrace()
```

Interpreting the captured stacks:

- `Runner.setupInternal` / `Runner.instantiatePatternNode` frames mean piece
  instantiation/setup writes — usually noise, not churn
- `diffAndUpdate`, `applyChangeSet`, or pattern handler frames point to runtime
  state updates after setup — usually the more interesting targets
- a generic `raw:async ...worker-runtime.js` frame is not evidence of one
  specific builtin; keep the next 1–3 frames underneath it to find the actual
  pattern or runtime helper

Disable with `await commonfabric.watchWrites([])`.

## Selective Debug Logging

Prefer the structured traces above. When you need log output, use the focused
loggers before raising the whole `scheduler` module:

- `runner.trigger-flow` — which source action id re-enters `Runner.run()`,
  `setupInternal()`, `instantiatePatternNode()`
- `runner.wish-flow` — is `wish()` launching suggestion patterns or just
  reading hot indexes?
- `scheduler` — settle-loop internals (broad — last resort)

```js
// Shown inside a pattern body.
await commonfabric.rt.setLoggerEnabled(true, "runner.trigger-flow")
await commonfabric.rt.setLoggerLevel("debug", "runner.trigger-flow")
```

In scheduler debug logs, the fan-out shape to look for is: one commit matching
dozens of registered actions; the same action ids recurring across trigger
entries; repeated `schedule-resubscribe` bursts after each run; alternating
change → trigger → run → commit → resubscribe waves. If you see that shape,
the problem is usually not one slow action body — it is the number of affected
actions and the number of times they are revisited.

## Broad Async Readers

Sometimes the hottest action is not the root cause, especially an async reader
over a broad index or collection. Signs: run count stays low but total time is
large or highly variable; one action spends its time in awaited `sync()` or
lookup work; downstream index/grid/summary actions are hot in the same wave.
When that happens, measure both run count and total time (do not optimize only
by count), check whether the hot action is loading a large result set, inspect
the producer side as well as the reader side (index builders that scan all
pieces, views that materialize many previews, broad root-state cells rewritten
wholesale), and compare small versus large spaces before blaming the action
body itself.

## Source Paths Worth Checking First

Start with these locations when traces or logs point to worker churn:

- `packages/runner/src/scheduler.ts` — execute orchestration, queueing, public
  diagnosis API; `packages/runner/src/scheduler/` holds the mode-specific
  settle loops (`pull-execution.ts`, `push-execution.ts`), event dispatch
  (`events.ts`, `pull-events.ts`, `push-events.ts`), action execution and
  resubscribe timing (`action-run.ts`), and trigger matching
  (`trigger-index.ts`, `scheduling-writes.ts`, `dependency-graph.ts`)
- `packages/runtime-client/backends/web-worker/index.ts` — worker message
  entrypoint — and `runtime-processor.ts` — console-facing scheduler IPC
- `packages/runner/src/storage/cache.ts` — socket event dispatch
- `packages/html/src/worker/reconciler.ts` — worker flush scheduling

## Next Steps

If the existing traces and logs still leave ambiguity, add or expose more
instrumentation rather than guessing. For what one full investigation
measured, concluded, and recommended next, see the archived
[March 2026 findings](archive/settle-wave-2026-03-findings.md). The full
`commonfabric.*` API reference is in [console-commands](console-commands.md),
and [non-idempotent-detection](non-idempotent-detection.md) covers ruling out
true loops.