# Traverse optimization journal

Working log for the traverse optimization loop. Every attempt gets an entry —
including rejected ones — so tactics aren't re-tried blind. Measurement machine:
Apple M3 Max, Deno 2.8.1.

## Goal

**≥2× reduction in full notebook-test replay wall time** (baseline ~9.0s median
→ target ≤4.5s), with:

- byte-identical oracles on all fixtures (`deno task test` replay test green
  without golden regeneration),
- no fixture regressing,
- counter movement explaining every wall-time claim.

## Baseline (2026-06-09, Phase 0 HEAD)

| fixture            | full replay | schemaCalls | anyOf   | memoHits |
| ------------------ | ----------- | ----------- | ------- | -------- |
| notebook-test      | ~9.0s       | 278,072     | 337,441 | 13,280   |
| shopping-list-test | ~14ms       | 2,223       | 80      | 87       |
| piece-query-legacy | ~3ms        | 63          | 0       | 22       |

## Profile (notebook-test ×2, sampling 100µs)

**~70% of CPU is structural hashing** (`op_node_hash_update` 44.5%, value-hash
feed* ~13%, hash.ts ~5.6%, LRU lookups 5.8%). Traversal logic itself ~11%.
Attribution of hash-machinery time to call sites:

| share | site                                                                                                         |
| ----- | ------------------------------------------------------------------------------------------------------------ |
| 42.8% | MapSet selector hashing (`(v) => hashStringOf(v)`, traverse.ts:309) via `trackVisitedDoc` per pointer follow |
| 25.8% | `hashSchema(schema)` in the `traverseWithSchema` memo key                                                    |
| 18.5% | `internSchemaPairAsKey` keys for `combineSchema`/`mergeSchemaFlags`/`mergeSchemaOption` caches               |
| 8.4%  | `internSchemaReturningSchemaAndHash` interning freshly minted merged schemas                                 |

Root cause: `value-hash`'s `frozenObjectHashCache` (WeakMap) only engages for
deep-frozen objects, and the hot path mints **fresh** selector/schema objects
per pointer follow (`narrowSchema` returns a new `{path, schema}`, then
`combineOptionalSchema` mints a new schema). Identity never repeats, so every
fresh object pays one full SHA tree-walk; with ~64k pointer follows per replay
round this dominates everything.

Tooling: `profile-driver.ts` / `profile-target.ts` in this directory (CDP
sampling profiler + self-time/attribution reports, no external tools).

## Iterations

### Iteration 1 — kill hashing on the hot path

Two competing candidates, both with byte-identical oracles AND byte-identical
traversal counters (purely representational changes), full runner suite green.

**Candidate A (canonicalize at the seams)** — ~8×: notebook 9175→1142ms,
shopping-list 18→7ms, legacy 3→2ms. Single file (traverse.ts +300/−28). Memos:
narrowAndCombineSelectorForLink (content+identity-token keys),
schemaAtPathCanonical (gated on isInternedSchema input), combinatorRestSchema,
resolveSchemaRefsCanonical, module-level coverage cache + true-schema index +
non-copying MapSet.values().

**Candidate B (identity-keyed caches)** — 7.8×: notebook 9585→1232ms. 3 files
(traverse.ts, cfc.ts, cfc/schema-refs.ts): memoized the producers directly
(cfc.schemaAtPath module-level WeakMap memo, resolveCfcSchemaRefs memo),
identity-first traverseWithSchema memo with internSchema canonicalize-on-miss,
WeakMap pair caches, splitSchemaLogic memo, MapSet.valuesFor(). Hash self-time
62.5% → 8.2% (46× absolute).

**Convergent findings (both independently):**

1. The four profiled hash sites were symptoms; the deepest root was **producers
   minting fresh schema objects per call** — above all `$ref` resolution (13,384
   $refs in the notebook fixture de-canonicalized entire subtrees per visit),
   plus cfc.schemaAtPath's `{...cursor}` returns and the `{anyOf, ...rest}`
   destructure (337k anyOf branches).
2. **Per-instance caches are a trap**: contexts/traversers/cfc instances are
   created per invocation (12,820 per notebook replay), so per-instance memos
   stay permanently cold (the old coverageSelectorCache had this flaw).
   Module-level WeakMap-rooted memos are the working shape.
3. **WeakMap keyed on the incoming selector is nearly useless** at the
   followPointer seam — the hot caller mints a fresh selector per pointer, so
   identity never repeats. Content+identity-token keys (A) or
   canonicalize-on-miss via internSchema (B) are the fixes.

Risk difference: B's traverseWithSchema memo deep-freezes (interns) every schema
entering the query path including doc-embedded ones; A gates memos on
already-interned inputs (no freezing expansion).

**Adversarial verdicts** (independent reviewer agents, instructed to refute):

- A: SAFE TO LAND — all 8 attack vectors refuted (incl. a 600-trial differential
  fuzz of the true-schema index against the old scan). Three non-blocking
  hardenings recommended.
- B: SAFE TO LAND — 6/7 refuted; 1 confirmed-but-unreachable issue (NUL-join key
  collision, PoC'd, needs a literal ` ` in a document key).

**Winner: A** (equivalent speed, single file, no freezing expansion, cleaner
verdict). Landed with all recommended hardenings applied: link-hop memo gated on
`isMemoizableSchemaInput` (immutable identities only), injective length-prefixed
`pathKey()` encoding replacing `"\0"` joins in the new memo keys, frozen marker
constants.

**Post-hardening verification**: notebook 1069ms median (**8.5×** vs 9.0s
baseline), shopping-list 7ms, legacy 2ms; counters byte-identical (schemaCalls
278,072 / memoHits 13,280 unchanged); replay oracle green without golden
regeneration; full runner suite 550 passed; memory package 165 passed.

**Goal status: exceeded (target was 2×; achieved 8.5×).**

Rebase note: after PR #3939 merged and main picked up the value-hash codec
refactor (#3941), the fresh-main baseline measured 10.1s median (hashing got
slightly more expensive), and this change measured 1.21s on top of it — **8.4×
on current main**, oracle still green against unchanged goldens.

### Iteration 2 — server side (toolshed)

Setup: local toolshed (memory v2 server) on offset ports with
`CF_TRAVERSE_CAPTURE` + `--inspect`; load = runner integration tests plus 7-8
rounds of `patterns/integration/reload/default-app-notebook`; sampling via the
new `profile-attach.ts`. New checked-in fixture: `toolshed-reload` (20k server
traversals, 1,922 docs, 1,036 selectors).

**Finding 1 — toolshed is not CPU-bound: 92.6% idle under an aggressive reload
loop.** Of the ~7% busy CPU: deep-freeze registration at `decodeMemoryBoundary`
~19% (every node of every wire message is registered in the frozen cache),
hashing ~13%, wire codec (JsonEncodingContext parse/encode) ~12%, GC ~8%,
traverse proper ~7%, subscription state cloning ~2%. Server-side optimization
buys latency tails/headroom, not throughput.

**Finding 2 — replay-fidelity bug fixed:** live storage deep-freezes every doc
at the wire-decode boundary, but the replay corpus served mutable parsed JSON,
so frozen-identity fast paths could never engage in replay.
`FixtureObjectManager` now deep-freezes corpus values; this alone cut
toolshed-reload replay 652→217ms and goldens are unchanged (freezing is
value-neutral).

**Finding 3 — interned-only memo gates bypassed every seam memo on the server**
(wire-decoded doc schemas are frozen but never interned). Relaxed
`isMemoizableSchemaInput` and the three sibling gates to accept deep-frozen
inputs (frozen ⇒ identity cannot go stale; same safety argument). Correct and
now engaged server-side, but…

**Finding 4 — negative result (recorded so it isn't retried): the gate
relaxation yields ~0% live improvement.** A/B attach-profiles (280s/8 rounds vs
240s/7 rounds, per-second normalized) show identical deep-freeze/hash/codec
rates. True-caller attribution of the residual server hash time: trackVisitedDoc
17%, combineSchema pair keys 11%, codec encode 10%, toDocumentSelector 3%. The
reload workload actively CREATES docs, so most remaining hashes are first-touch
hashes of genuinely novel selectors/schemas — caches can only amortize repeats,
not cold misses.

**Post-merge review simplification (#3948 feedback):** `schemaToken()` was
redundant — an interned (or frozen, after first touch) schema's structural hash
is already WeakMap-cached by value-hash, so `hashSchema()` is an O(1) memo-key
part with no separate token bookkeeping (and structural keys let
distinct-but-equal frozen schemas share memo entries, which tokens could not).
A/B: ~1.6–2.3% slower on the replay medians — in the noise, as predicted by the
reviewer; main's #3956 (frozen-cache-first hash lookup) narrows it further.
Lesson recorded: before inventing identity machinery, check whether the existing
content-hash cache already provides it.

**Stop-rule triggered** (<3% twice: gate relaxation ~0%, and no cheaper
candidate identified within traverse scope). Documented next targets if anyone
wants the remaining ~7%-of-busy: deep-freeze registration cost at
`decodeMemoryBoundary` (register-roots-only / cheaper cache — data-model blast
radius), codec encode hashing, `cloneTrackedGraphState`.

### Iteration 3 — tail latency

Motivation: traverse no longer dominates mean time but shows up in tails. New
instrumentation: `replayFixture(..., { collectLatency: true })` produces
per-invocation p50/p90/p99/p99.9/max plus the slowest invocations with their
counters; the bench prints it under `BENCH_DIAGNOSTICS=1`.

**Tail diagnosis:** p50 ≈ 5µs but p99 was 5ms+ and max 25ms — a 1000× tail,
identical shape on client and server fixtures: anyOf-heavy vnode docs evaluating
up to 8,424 branches per traversal. The worst case was the same doc × selector
paying 25ms twice (client invocations cannot amortize across calls).

**Fixes (oracle byte-identical, counters preserved exactly):**

1. `prepareAnyOf` — branch sort (two hasAsCell passes per node!), rest+option
   merges, and `canBranchMatch`'s static derivations ($ref resolution, type
   normalization, required-applicability) precomputed once per schema identity;
   the per-node prefilter is now a few field reads.
2. Per-visit `docVisits`/`uniquePaths` diagnostics in `traverseWithSchema` built
   a string per visit but only feed the slow-traverse log — now gated behind
   `CF_TRAVERSE_DIAGNOSTICS=1`.
3. ~~`createDataCellURI` memoization~~ — **reverted after CI caught a real
   regression** (folksonomy-aggregator +26%, reproduced locally 1180→1490ms).
   The memo (keyed on frozen value identity + base address/schema) wins on
   replay fixtures where values repeat across invocations, but live dynamic
   patterns mint FRESH deep-frozen values per evaluation: the memo never hits,
   so `isDeepFrozen` subtree walks + key hashing are pure added cost. Negative
   result recorded: identity memos need identity REUSE, not just identity
   stability — fresh-per-evaluation frozen objects have neither. (The CI perf
   check's subtest-level timing was RIGHT after two earlier shard-level flags
   were noise; always A/B the named test locally.) The real fix is upstream: the
   code's own TODO says query traversal should not route through data-URI
   synthesis at all.

**Results (final, fixes 1+2 only):** notebook max 25.2→18.6ms (−26%), p99
5.17→3.70ms (−28%), mean −18%; toolshed-reload max 7.9→6.3ms, p99.9 3.3→2.7ms;
shopping-list unchanged (no anyOf tail); folksonomy (dynamic-schema live
pattern) at main parity. The prepared-anyOf path is additionally gated on
`isInternedSchema(resolved)` — interned identity guarantees reuse;
frozen-but-fresh schemas take the legacy loop verbatim.

**Post-fix profile of the worst invocation is flat**: top item is
`traverseWithSchema` per-call overhead (~0.7µs × 6.3k calls, largely V8 inline
attribution), then the pointer-hop/coverage cluster (narrowAndCombine 4.5%,
include 4.1%, pathStartsWith 3.7%, internPathSelector 2.3%) — a multi-site ~13%
if ever needed. Stopped here per diminishing returns.

### Remaining known hotspots (client side, from iteration 1)

From the post-A profile (flat, ~1.4s total) and the server-fixture profile:

- `addressKey` string churn in dagMemo (8.1% on the DAG-heavy server fixture) +
  traverseDAG per-step address spreads.
- anyOf discriminator index (337k branch evaluations still run canBranchMatch
  per node; precompile per interned schema).
- Diminishing returns expected (<15% each on the notebook metric); next
  iteration only if the loop's <3%-twice stop rule isn't triggered.