# FUSE Reliability Design ## Purpose This document describes an implementation plan for making `@commonfabric/fuse` safe under backend stalls, transport failures, CFC writeback, and platform FUSE quirks while preserving normal filesystem behavior for ordinary programs. The normative API contract lives in `docs/specs/fuse-filesystem/`. This package document is implementation-specific: it names current code seams, proposed actors/FSMs, deadline and backpressure policy, rollout phases, and test targets. ## Current Baseline The implementation already has several reliability foundations: - `mod.ts` tracks pending async FUSE replies with `trackPendingFuseReply()` and defers kernel invalidation until pending replies drain. - `handles.ts` owns per-handle write buffers, truncate state, dirty flags, versioning, range checks, and per-handle CFC authorization. - `mod.ts` schedules safety-net flushes for transports that do not reliably send `flush`/`release`, and records write statistics in `.status`. - `cell-bridge.ts` exposes `.status`, marks the mount disconnected/read-only on transport errors, and reconnects with capped exponential backoff. - `cell-bridge.ts` serializes and coalesces per-piece-property rebuilds, dedupes in-flight hydrations, stages rebuilds under pending roots, and invalidates cache entries after committed cell changes. - `cfc-writeback.ts` persists prepared CFC writeback records, tracks crash-point states, and reconciles records after inode churn or subtree rebuild. The main gap is that several mutating paths still acknowledge FUSE success before the Common Fabric operation has reached a clear commit or acceptance boundary. That improves responsiveness but makes normal tools vulnerable to silent backend failure, transport loss, or CFC finalize failure. ## Target Posture Treat the daemon as a bounded-latency POSIX adapter, not a best-effort RPC proxy. Every FUSE request should enter a tracked lifecycle: copy transient arguments, start a deadline, optionally register interrupt handling, enqueue under bounded concurrency, and reply exactly once. For the default mode, mutating operations are commit-confirmed: FUSE reports success only after local validation, CFC authorization, backend mutation or runtime acceptance, and safe projection invalidation/reconciliation reach the operation's success boundary. Local-ack or offline-queue behavior may exist only as an explicit compatibility mode and must be visible in `.status`. ## Core Invariants 1. **Exactly one reply per request.** Every received low-level request reaches one terminal reply path. Timeout, cancellation, success, and error paths race through one `replyOnce` guard. 2. **Bounded wait.** No request waits indefinitely for backend I/O, reconnect, rebuild, CFC reconciliation, or invalidation. Deadlines resolve to standard errno values. 3. **No default local-ack for mutations.** A mutating syscall reports success only after the operation reaches its configured Common Fabric boundary. 4. **Visible tree equals confirmed projection.** Shared `FsTree` state should not expose optimistic creates, deletes, renames, symlinks, or cell writes. Only handle-private buffers may be speculative. 5. **Per-cell ordering.** Mutations serialize per logical `(space, entity, cell)` unless the backend provides stronger ordered/idempotent write primitives. 6. **Handlers are non-idempotent.** One buffered handler write maps to at most one runtime send. Never auto-retry handler invocations after timeout or reconnect without a future idempotency-key contract. 7. **CFC fails closed.** Missing, stale, or incomplete labels remain incomplete/fail-closed. FUSE produces and reconciles annotations; gVisor owns sandbox observation policy. 8. **Backpressure is explicit.** Queue saturation, degraded state, or backend unavailability become visible errno/status outcomes, not unbounded memory or silent pending success. ## Request Lifecycle Introduce a small `FuseRequestSlot` wrapper in `mod.ts` or a helper module: ```text received -> validating -> queued -> running -> replying -> replied received/validating/queued/running -> timed-out -> replying -> replied ``` Each slot records: - operation name, inode, path/name copies, file handle, and logical ref when known; - start time, deadline, and timeout reason; - whether an interrupt was observed; - reply state and errno/data summary; - associated mutation operation ID, if any. Callbacks must copy names, buffers, and file-info fields they need before returning to libfuse. The slot owns the reply pointer until `replyOnce()` fires. `forget`-style operations remain special: they use no normal reply but still must stay on the high-priority cleanup lane. ### Suggested Deadlines Initial values should be constants surfaced through `.status`: | Operation class | Soft deadline | | ------------------------------------- | --------------- | | cached metadata/read | 50 ms | | cold lookup/hydration | 2 s | | `readdir` page | 5 s | | cell content write/flush | 30 s | | handler runtime acceptance | 30 s | | source pattern update | 60 s | | `fsync` / explicit durability barrier | 60-120 s | | reconnect probe | 5 s per attempt | Linux `request_timeout` and macFUSE `daemon_timeout` should be treated as hard provider safety nets above these internal deadlines. Provider hard timeouts may abort or eject the entire mount, so they are not normal control flow. ## Actor Boundaries The implementation does not need a new framework. Treat existing modules as actors with explicit ownership and bounded mailboxes. | Actor | Existing home | Responsibility | | -------------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | FUSE adapter | `mod.ts` | Low-level callback translation, local deterministic validation, request slots, errno replies, pending-reply tracking. It should not call backend mutation APIs directly once the coordinator exists. | | Handle actor | `handles.ts` | Open-file state, write buffers, truncate state, dirty/version flags, max file size, flush/release dedupe. | | Mutation coordinator | new module, e.g. `mutation-coordinator.ts` | Per-cell queues, operation deadlines, commit ordering, timeout classification, result-to-errno mapping, `.status` operation metrics. | | Projection actor | `cell-bridge.ts` | Space connection, hydration, subscriptions, queued rebuilds, source tree rebuilds, cache invalidation scheduling. | | CFC writeback actor | `cfc-writeback.ts` | Prepare/finalize records, stale-generation detection, recovery persistence, diagnostics, and reconciliation. | | Connection actor | `cell-bridge.ts` initially | `online -> suspect -> readOnly/reconnecting -> online` transitions, reconnect backoff, synced/reconciled gating before writes resume. | ## Supervisor, Worker, and Process Isolation Deno Workers are useful as an inner responsiveness boundary, but they are not a hard recovery boundary for native FFI or FUSE provider wedges. Workers run in the same OS process as their creator. `Worker.terminate()` can stop JavaScript execution and message processing, but it should be treated as best-effort for a worker blocked inside native libfuse/FFI code, a blocking FFI helper thread, or a provider/kernel call that never returns. The first process-boundary architecture should stay Deno-only: ```text cf fuse mount --background -> Deno supervisor process -> Deno FUSE child process that loads libfuse and owns the mount -> optional Deno Worker for libfuse session and hot FUSE state -> child control plane for heartbeat, status, and graceful unmount ``` The supervisor process must not load libfuse or call Deno FFI. Its job is to spawn the FUSE child, track its PID/process group, monitor heartbeat/status, request graceful unmount, escalate to platform abort/force-unmount, and kill or restart the child when needed. The Deno FUSE child owns the libfuse session, callbacks, `FsTree`, `CellBridge`, CFC writeback store, request deadlines, and mutation coordinator. This is already enough to create a real kill/restart boundary: if the child wedges in native/libfuse/provider code, the supervisor remains schedulable because it is a separate OS process. An optional worker inside the child can keep the child process control plane responsive if the FUSE session's JavaScript event loop stalls. It should not be used as the only supervisor for production reliability. If the worker gets wedged in native code, the containing child process may still retain native threads, libfuse state, mount state, or corrupted address-space state. The hard reset boundary is the child process. Design rules: - If the libfuse session runs in a worker, keep the hot filesystem state in that worker too: `FsTree`, callbacks, handle map, mutation coordinator, CFC writeback integration, deadlines, and reply guards. Avoid serializing every filesystem operation across worker messages. - Keep the child-process control plane small: heartbeat, status snapshots, graceful `fuse_session_exit`/unmount request, and final process exit. - Keep the external supervisor outside any process that has loaded libfuse. It owns wall-clock containment, forced unmount/abort, and process kill/restart. - Request deadlines still live inside the FUSE request lifecycle. A supervisor can abort a mount, but it cannot provide clean per-operation errno replies once the session worker is wedged. Initial implementation may remain single-process while the mutation coordinator and request deadlines are built. The first isolation step should then be the Deno supervisor + Deno FUSE child split above, not a worker-only refactor. A worker-only refactor is worthwhile only if JavaScript event-loop coupling is the dominant problem and native/provider wedges are not observed. A Rust component should be a sidecar executable, not a Deno extension, if it is introduced for reliability. A Rust extension loaded into Deno would share the same process-boundary problem as Deno FFI. A Rust sidecar can replace the Deno FUSE child later if hand-maintained FFI structs, signal handling, or libfuse threading inside the Deno child remains fragile, while preserving the same Deno supervisor contract. ## Mutation Operation FSM All mutating operations should use one common FSM: ```text queued -> validating -> cfc-prepared -> applying -> syncing -> projecting -> finalizing -> succeeded validating/applying/syncing/projecting/finalizing -> failed-known(errno) -> timed-out-unknown -> disconnected-readonly ``` `failed-known(errno)` means the operation definitely failed before committing or the backend returned a definite error. `timed-out-unknown` means the daemon can no longer tell whether the backend will eventually commit; return `ETIMEDOUT`, record the operation in `.status`, and never replay automatically unless the operation has an idempotency key. Human and agent diagnostics must preserve that distinction. `.status` should show whether the last mutation was a known failure, a timed-out unknown, or an accepted operation whose downstream reactive effects may still be settling, so normal errno-style failures remain debuggable from both shell clients and cf-harness runs. ### Success Boundaries | Operation | Default success boundary | | ---------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | | scalar / `.json` cell write | local validation and CFC authorization passed; `cell.set()` resolved; backend sync/acceptance completed; projection invalidation or rebuild is scheduled safely. | | `[FS]` projection write | parsed content applied to the corresponding cells; required deletes/updates completed; projection invalidation or rebuild is scheduled safely. | | source write | `piece.setPattern()` resolved; error log updated; source projection finalized or invalidated. | | handler write | runtime accepted the invocation and sync/idle boundary completed. Downstream reactive effects may settle later. | | create/mkdir/unlink/rmdir/rename/symlink | parent/common cell mutation resolved and the shared `FsTree` is updated from confirmed state or safely invalidated for rebuild. | | CFC writeback | trusted prepare was valid when required; mutation succeeded; finalize/reconcile was recorded, or incomplete/fail-closed state was persisted. | `create`/`mkdir` must not depend on the sandboxed program calling `setxattr` afterward. Normal programs should run unchanged against `/fabric`; any required CFC write intent must be supplied by gVisor's kernel-space policy path before FUSE commits the mutation. In the current gVisor FUSE prototype, gVisor blocks unprivileged sandbox writes to protected CFC xattr names at the syscall layer, then its kernel CFC hooks update internal label state and emit ordinary `FUSE_SETXATTR` requests for `trusted.cfc.*` names. FUSE should therefore treat protected CFC xattrs as trusted only under that gVisor mediation assumption; a non-xattr side channel or explicit discriminator would be a future transport change, not current behavior. The local `user.commonfabric.cfc.*` compatibility bridge remains only for testing/integration and is not a sandbox trust boundary. In observe or explicit compatibility mode, a create without complete trusted intent may proceed only with incomplete/fail-closed annotations and diagnostics. Enforcing-mode create is app-visible only after the entry is fully labeled and committed, or it fails with a normal errno. Implementation bridge: the current code has mediated prepare/finalize xattrs and persisted `CfcWritebackStore` records; it does not yet have hidden quarantine dentries or a separate quarantine namespace. The quarantine step must add a private `QuarantineStore` beside the confirmed `FsTree`. Quarantine records must not be inserted into parent child maps, readdir indexes, or kernel-visible inode/dentry state, and normal `lookup`, `readdir`, `readdirplus`, `getattr`, `access`, `open`, and `opendir` must never consult the store. Records should be keyed at least by `{ operationId, parentRef, name, expectedGeneration }`, with operation type, prepared/finalized label state, timestamps, and diagnostics as record data. Only trusted completion/finalize handling, trusted abort handling, startup recovery, and TTL garbage collection may read `QuarantineStore`. Completion must match by operation ID and validate parent ref, target name, operation type, generation, and prepared/finalized labels before publishing anything into the normal projection. FUSE should scan quarantine records on startup and periodically at runtime, aborting records older than the configured TTL; one hour is the initial target unless active trusted completion is in progress. Post-create xattrs are separate modeled metadata operations; they may not retroactively authorize a usable entry or lower confidentiality/integrity labels. ## Backpressure Policy Add bounded queues before backend work: - global active backend mutations: start with 32; - per-space active backend mutations: start with 16; - per-cell active mutation: 1, with bounded pending queue; - high-priority cleanup lane for `forget`, interrupts, `release`, and watchdog cleanup; - bounded rebuild queue per piece prop, continuing to coalesce stale rebuilds; - max buffered handle bytes retains the current virtual file limit. Admission failure should be explicit. For normal blocking filesystem calls, wait within the request deadline; if no slot opens, return `ETIMEDOUT` or `EIO` based on state. Reserve `EAGAIN` for documented retryable races, not ordinary overload. ## Connection and Degraded Modes Connection state should be explicit in `.status` and should control mutation admission: ```text online -> suspect -> readOnly-reconnecting -> reconciling -> online ``` When transport failure is detected, remove write bits as the current `buildStat()` path already does, reject new mutations with `EROFS` or `EIO`, continue serving cached reads where safe, and probe reconnect with capped exponential backoff. Before writes resume, the connection actor should require backend sync plus any needed CFC/writeback reconciliation. ## Invalidation and Platform Policy Keep invalidation out of active callback/reply paths. The existing pending-reply drain before `notify_inval_entry` / `notify_inval_inode` is a core safety rule, especially for FUSE-T. - On Linux, use deferred invalidation when supported, falling back to short TTLs on `ENOSYS` or repeated failures. - On macOS/FUSE-T, assume provider-specific cache and call-order behavior. Prefer short TTLs and deferred/no invalidation over synchronous reverse invalidation while the kernel is waiting on the daemon. - Never store logs, state files, or watchdog artifacts inside the mounted tree; that can deadlock the daemon by re-entering its own mount. ## Watchdog and Abort Strategy The daemon can guarantee bounded replies only while its event loop and native FFI calls continue to make progress. A supervisor should provide outer containment: - heartbeat timestamp from the FUSE process; - last completed request ID and pending request count; - Linux fusectl `waiting` count when available; - last backend success/failure and reconnect state; - controlled unmount/abort/kill path when heartbeat stops and waiting requests remain. On Linux, aborting the FUSE connection is the reliable last-resort escape for a wedged daemon. On macOS, forced unmount and provider-specific timeout/eject behavior are the last resort; this must be documented as weaker than a daemon deadline guarantee. ### Hang Classes The reliability architecture prevents or contains several classes of hangs, but it cannot make every kernel/provider failure recoverable from inside the daemon. | Hang class | Prevented by this design? | Containment | | ------------------------------------------------------------------- | -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Backend promise never resolves | Yes | Request deadlines, mutation timeouts, and connection degraded mode reply with errno instead of waiting forever. | | Deadlock caused by reverse invalidation during a pending FUSE reply | Yes | Keep invalidation deferred until pending replies drain; fall back to short TTLs when notify is unsafe or unsupported. | | Unbounded write/rebuild backlog | Yes | Bounded mutation/rebuild queues and explicit timeout/error outcomes. | | Transport disconnect after a started mutation | Partially | New writes fail fast while disconnected; started operations become known failure or timed-out unknown and are not auto-replayed. | | Daemon event loop still alive but request progress stops | Partially | Supervisor detects heartbeat/request-progress mismatch and aborts/unmounts externally. | | Deno process blocked inside a native libfuse/provider call | Not from inside the process | External supervisor must force unmount/abort/kill. Consider isolating riskier FFI/session work in an expendable child process if this recurs. | | Kernel/provider task wedges and does not return to userspace | Not by daemon architecture alone | Linux fusectl abort may release waiters; macOS depends on provider forced unmount/eject behavior and may require process kill, force unmount, or reboot in worst cases. | Therefore the product guarantee should be phrased as two layers: 1. **Daemon-level guarantee:** if the FUSE daemon remains schedulable and can call reply functions, every request reaches data or errno by its deadline. 2. **Containment guarantee:** if the daemon or provider stops making progress, an external supervisor detects it and attempts platform-specific abort, unmount, and process cleanup within a bounded wall-clock interval. If experiments still produce unkillable kernel waits, the next architecture step is stronger process isolation: run the libfuse session in a small supervised child whose only job is kernel/FUSE I/O, keep backend state and operation logs in the parent or a separate durable process, and treat child replacement/remount as the recovery path. That adds IPC complexity, so it should be triggered by real provider wedges rather than used as the first implementation step. ## Observability Extend `.status` before changing behavior so experiments are visible: ```json { "requests": { "inFlight": 0, "completed": 0, "timedOut": 0, "lastTimeoutAt": null }, "mutations": { "queued": 0, "inFlight": 0, "succeeded": 0, "failed": 0, "unknown": 0, "perCellQueueMax": 0 }, "connection": { "state": "online", "lastError": null, "reconnectAttempts": 0 }, "deadlines": { "cellWriteMs": 30000, "handlerMs": 30000, "sourceWriteMs": 60000 } } ``` The existing write stats and CFC writeback counts should remain, but operation IDs and unknown outcomes need first-class counters so agents and humans can tell the difference between success, known failure, and timed-out unknown state. ## Rollout Plan 1. **Document and instrument current behavior.** Add request/mutation counters, deadline constants, and status fields without changing semantics. 2. **Extract request slots.** Centralize exactly-once replies and timeout paths in `mod.ts`; add tests for reply races using fake callbacks where possible. 3. **Add mutation coordinator for handle writes.** Keep `write(2)` as buffer-copy acknowledgement, but make `flush` and `fsync` wait for commit or return errno. 4. **Move namespace mutations behind the coordinator.** Stop optimistic shared tree updates for create/mkdir/unlink/rmdir/rename/symlink; update or invalidate the tree after backend success. 5. **Integrate CFC writeback.** Use existing prepare/finalize persistence and reconciliation inside the operation FSM, including timeout and stale-generation status. 6. **Add degraded-mode gates.** New writes fail fast while disconnected or reconciling; cached reads continue where safe. 7. **Split supervisor from FUSE child.** Teach `cf fuse mount --background` to run a Deno supervisor process that does not load libfuse, then spawn a Deno FUSE child that owns the mount, heartbeat, and status stream. 8. **Add watchdog escalation.** The supervisor monitors heartbeat/status and performs platform-appropriate graceful unmount, abort/force-unmount, process-group kill, and restart. 9. **Escalate to Rust sidecar only if needed.** If the Deno FFI child remains fragile after the process split, replace the child with a Rust sidecar binary while keeping the same supervisor/watchdog shape. ## Test Plan - Unit-test request slot `replyOnce`, timeout, and interrupt races. - Unit-test mutation coordinator ordering, per-cell serialization, queue saturation, and unknown timeout classification. - Extend `handles.test.ts` for flush/release dedupe and commit-confirmed errors. - Extend `cell-bridge.test.ts` for disconnected/read-only transitions and reconnect gating before writes resume. - Extend `cfc-writeback.test.ts` for operation FSM integration with stale, malformed, missing, committed, failed, and timed-out prepare/finalize paths. - Add fault-injection tests with never-resolving `cell.set()`, rejected backend writes, late success after timeout, transport closed, and subscription storms. - Add platform smoke tests that wrap shell commands in `timeout` and verify they terminate with data or errno rather than hanging. ## Non-Goals - Replacing the Deno/libfuse implementation with Rust or a second daemon. - Making local-ack/offline queueing the default filesystem behavior. - Automatically retrying non-idempotent handler invocations. - Treating `user.commonfabric.cfc.*` as trusted enforcement input. - Guaranteeing recovery from kernel/provider hard lockups without an external supervisor or forced unmount path. ## Open Questions 1. What backend primitive is the canonical commit/acceptance boundary for cell writes: `cell.set()` resolution, `manager.synced()`, subscription observation, or a new write receipt? 2. Should namespace mutations update the nearest common parent value in one cell write to avoid partial copy/delete behavior? 3. What operation ID or idempotency mechanism is needed for safe retries of cell writes, source updates, and future handler invocation contracts? 4. Which deadline defaults should be user-configurable, and which should be hard package constants? 5. How should `.status` expose detailed operation records without leaking CFC or transcript-sensitive metadata?