# Data Model

This document specifies the immutable data representation — what values can be stored and how they are identified.

## Status

Draft — based on codebase investigation.

---

## Current State

### Overview

The system stores **storable values** — data that can be serialized to JSON with some extensions. All persistent data and in-flight messages use this representation.

### Base Types

Storable values are JSON-compatible with specific constraints:

| Type | Notes |
|------|-------|
| `null` | JSON null |
| `boolean` | `true` or `false` |
| `number` | Finite only; `NaN` and `Infinity` rejected |
| `string` | Unicode text |
| `array` | Ordered sequence of storable values |
| `object` | String-keyed map of storable values |

#### Numbers

- Only finite numbers are storable
- `-0` is normalized to `0` during conversion
- `NaN` and `Infinity` throw errors

#### Arrays

- Must be dense (no holes)
- Must not contain `undefined` elements
- Sparse arrays are densified during conversion (`undefined` → `null`)
- Non-index keys (named properties) cause rejection as not-storable

#### Objects

- Plain objects only (no class instances)
- Keys must be strings; symbol keys cause rejection as not-storable
- Values must be storable
- No distinction between regular and null-prototype objects; reconstruction produces regular plain objects

### Special Values

#### `undefined`

`undefined` has special semantics depending on context:

- **Top-level**: Indicates deletion (remove the stored value)
- **Object property**: Treated as absent (the property is omitted)
- **Array element**: Converted to `null` during storage

#### Non-Storable Types

These types cannot be stored directly:

- `bigint` — throws an error
- `symbol` — throws an error
- `function` — throws an error unless it has a `toJSON()` method
- Class instances — throw an error unless they have `toJSON()` or special handling

### Special Object Shapes

Certain object shapes have system-defined semantics.
These use reserved keys that begin with special characters.

#### Reference Sigil: `{ "/": ... }`

Objects with a `"/"` key are references, not literal data. This convention comes from [DAG-JSON](https://ipld.io/specs/codecs/dag-json/spec/).

```json
{
  "/": {
    "link@1": {
      "id": "of:abc123...",
      "path": ["items", "0", "name"]
    }
  }
}
```

See [Identity and References](./3-identity-and-references.md) for details.

#### Stream Marker: `{ $stream: true }`

Objects with exactly `{ $stream: true }` mark stream cell locations. The marker persists to preserve stream identity; event payloads are ephemeral. See [Cells](./4-cells.md) for stream semantics.

#### Error Wrapper: `{ "@Error": {...} }`

Error instances are converted to a storable form using the `@` prefix convention:

```json
{
  "@Error": {
    "name": "TypeError",
    "message": "Cannot read property 'x' of undefined",
    "stack": "TypeError: Cannot read...",
    "cause": null
  }
}
```

Properties captured:

- `name` — error type name
- `message` — error message
- `stack` — stack trace (if available)
- `cause` — nested cause (recursively converted)
- Any custom enumerable properties

This allows errors to round-trip through storage while preserving diagnostic information.

### Circular References

Within a single document, circular references are detected and throw an error. The system does not support storing cyclic data within a document's value. Shared references (the same object appearing multiple times) are preserved correctly.

Cycles *across* documents are supported via explicit links (sigil links). Two cells can reference each other, forming a cycle in the broader data graph. The no-cycles constraint applies only to the serializable content of a single cell.

The within-document prohibition is inherited from JSON's tree structure, not from a deep architectural requirement. If a future storage format supports cyclic references natively (e.g. CBOR with shared references, or a CRDT layer with internal pointers), this constraint could be relaxed.

---

## Hashing and Content Addressing

### Current State

The system uses merkle-tree hashing with CID-formatted output:

1. Content is translated into a binary tree representation
2. Tree nodes are hashed using SHA-256
3. The result is formatted as a CID (Content Identifier)

However, entity data is generally **not content-addressed**. Entity IDs are stable addresses (analogous to IPNS names) that point to the most current version of the data. Hashes are primarily used for:

- Pattern ID generation (derived from the pattern definition)
- Request deduplication
- Causal chain references (hashing the causal tree of what led to the data's existence, not the data content itself)

The `"/"` sigil convention is reused as a general escape mechanism for special object shapes, not specifically tied to IPLD/IPFS semantics. The legacy `{ "/": string }` bare-string link form has been removed from recognition; current link formats use structured objects under the `"/"` key.

### Concerns: IPFS Conventions Without IPFS Benefits

The system uses IPFS-derived conventions — CID formatting and merkle-tree hashing — but does not participate in the IPFS network:

- **No content retrieval by CID**: The system doesn't fetch data from IPFS
- **No pinning**: Content isn't published to or retrieved from the IPFS network
- **No external verification**: CIDs aren't verified against external sources
- **No deduplication across peers**: The distributed storage benefits don't apply

Since entity IDs are addresses (not content hashes), the CID formatting adds encoding complexity without providing interoperability. A simpler hashing scheme would serve the actual use cases (pattern IDs, deduplication, causal chains) equally well.
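The kind of simpler scheme these use cases call for — a single pass over the natural structure, with type tags and sorted object keys — can be sketched as follows. This is illustrative only: FNV-1a stands in for SHA-256 to keep the sketch dependency-free, and the tag strings are placeholders, not a proposed format.

```typescript
// Illustrative canonical hash: one pass, type-tagged, sorted object keys.
// FNV-1a is used here purely for illustration; a real scheme would use SHA-256.
function fnv1a(str: string, h = 0x811c9dc5): number {
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

type Value = null | boolean | number | string | Value[] | { [k: string]: Value };

function canonicalHash(value: Value): number {
  // Each value is hashed as "<type tag>:<content>"; containers recurse.
  if (value === null) return fnv1a("null");
  if (typeof value === "boolean") return fnv1a(`bool:${value}`);
  if (typeof value === "number") return fnv1a(`num:${value}`);
  if (typeof value === "string") return fnv1a(`str:${value}`);
  if (Array.isArray(value)) {
    let h = fnv1a("arr");
    for (const el of value) h = fnv1a(`:${canonicalHash(el)}`, h); // order matters
    return h;
  }
  let h = fnv1a("obj");
  for (const key of Object.keys(value).sort()) { // key order does not matter
    h = fnv1a(`:${key}=${canonicalHash(value[key])}`, h);
  }
  return h;
}
```

Because object keys are sorted before hashing, `{ a: 1, b: 2 }` and `{ b: 2, a: 1 }` hash identically, while array reordering changes the hash — the identity reflects logical content, not encoding order.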
---

## Proposed Directions

### Simplified Canonical Hashing

Replace `merkle-reference` with a simpler canonical hashing approach:

- Traverse the natural data structure directly (no intermediate tree)
- Sort object keys, preserve array order
- Hash type tags + content in a single pass
- No intermediate allocations

The hash should reflect the logical content, not any particular encoding or intermediate representation.

#### Benefits

- Simpler implementation
- Lower overhead (no tree construction)
- Hash reflects the actual data shape
- Easier to reason about which changes affect identity

#### Relationship to Late Serialization

This proposal pairs with [Late Serialization](#late-serialization-rich-types-within-the-runtime): if rich types flow through the runtime, canonical hashing should operate on those types directly (via their deconstructed state for `StorableInstance`s), not on JSON-encoded forms. The hash becomes encoding-independent — the same identity whether later serialized to JSON, CBOR, or Automerge.

#### Open Questions

- What is the exact specification for canonical hashing?
- How should each type be tagged? (null, bool, int, float, string, bytes, array, object, references)
- How do special object shapes (references, streams, errors) participate?
- What is the migration path from current CID-based identifiers?

### Late Serialization: Rich Types Within the Runtime

#### The Principle

Rich types should flow through the runtime as themselves; serialization to wire/storage formats should happen only at boundary crossings.
```
┌─────────────────────────────────────────────────────┐
│                   Runtime Context                   │
│                                                     │
│   Cell ←→ Cell ←→ Error ←→ Cell ←→ [rich types]     │
│                                                     │
└──────────┬──────────────────┬───────────────────────┘
           │                  │
           ▼                  ▼
   ┌──────────────┐    ┌──────────────┐
   │   Storage    │    │   Network    │
   │ (serialize)  │    │ (serialize)  │
   └──────────────┘    └──────────────┘
```

#### Current State: Early Conversion

Today, special JSON forms are created early and travel through the system:

- `normalizeAndDiff()` converts Cells to SigilLinks (`{ "/": {...} }`) immediately
- `convertCellsToLinks()` explicitly replaces Cell references with JSON forms
- `toStorableValue()` wraps Errors as `{ "@Error": {...} }` during data updates
- Stream markers (`{ $stream: true }`) are stored and compared as JSON objects

The JSON forms then propagate through transactions, the reactive system, and query results. Code throughout the system must detect and handle these special shapes via `isSigilLink()`, `isStreamValue()`, `isErrorWrapper()`, etc.

#### Proposed: Defer Conversion to Boundaries

Keep rich types as themselves within the runtime:

- **Cells remain Cells** through the reactive graph and transactions
- **Errors remain Errors** until they cross a serialization boundary
- **Streams are first-class** rather than marker objects
- Serialization becomes a "last mile" concern at specific boundary points

The `StorableValue` type would expand to a union of three categories:

```typescript
type StorableValue =
  // (a) Primitives
  | null
  | boolean
  | number
  | string
  | undefined // currently has special semantics; could become first-class
  | bigint // currently rejected; could become first-class
  // (b) Built-in JS types (cannot be patched with symbols)
  | Error
  | Map<StorableValue, StorableValue>
  | Set<StorableValue>
  | Uint8Array // or other byte-array type
  | Date // or Temporal type
  // (c) Branded storables (our types implementing the protocol)
  | StorableInstance
  // Recursive containers
  | StorableValue[]
  | { [key: string]: StorableValue };
```

Built-in JS types require explicit serialization handling — we cannot (and should not) patch `Error.prototype` with symbol-keyed methods. The serialization context must recognize these types directly.

#### The Storable Protocol

Types *we control* opt into storability by implementing methods keyed by well-known symbols:

```typescript
const DECONSTRUCT = Symbol.for('common.deconstruct');
const RECONSTRUCT = Symbol.for('common.reconstruct');
// If protocol evolution is needed: Symbol.for('common.deconstruct@2')

// Instance protocol: "here's my essential state"
interface StorableInstance {
  [DECONSTRUCT](): unknown;
}

// Class protocol: "here's how to bring one back"
interface StorableClass<T = StorableInstance> {
  [RECONSTRUCT](state: unknown, runtime: Runtime): T;
}
```

`[RECONSTRUCT]` is a dedicated static method rather than using the class constructor for two reasons:

1. **Reconstruction-specific context**: It receives the `Runtime` (and potentially other context), which shouldn't be mandated in a regular constructor's signature.
2. **Instance interning**: It can return existing instances rather than always creating new ones — essential for types like `Cell` where identity matters.
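The interning point can be made concrete with a toy sketch. The `ToyRuntime` and `Handle` names below are hypothetical stand-ins for the real `Runtime` and `Cell` APIs, and the symbols are re-declared so the sketch is self-contained:

```typescript
const DECONSTRUCT = Symbol.for("common.deconstruct");
const RECONSTRUCT = Symbol.for("common.reconstruct");

// Hypothetical stand-in for a Cell-like type with identity.
class Handle {
  constructor(readonly id: string) {}

  [DECONSTRUCT]() {
    return { id: this.id };
  }

  // A static method can return an existing instance; a constructor cannot.
  static [RECONSTRUCT](state: { id: string }, runtime: ToyRuntime): Handle {
    return runtime.getHandle(state.id);
  }
}

// Hypothetical stand-in for the Runtime's instance cache.
class ToyRuntime {
  private cache = new Map<string, Handle>();

  getHandle(id: string): Handle {
    let handle = this.cache.get(id);
    if (!handle) {
      handle = new Handle(id);
      this.cache.set(id, handle);
    }
    return handle;
  }
}
```

Reconstructing the same state twice yields the *same* instance rather than a duplicate — exactly what identity-bearing types need.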
The presence of `[DECONSTRUCT]` doubles as the brand — no separate marker is needed:

```typescript
function isStorable(value: unknown): value is StorableInstance {
  return value != null && typeof value === 'object' && DECONSTRUCT in value;
}
```

Example implementation:

```typescript
class Cell implements StorableInstance {
  [DECONSTRUCT]() {
    return { id: this.entityId, path: this.path, space: this.space };
  }

  static [RECONSTRUCT](state: CellState, runtime: Runtime): Cell {
    return runtime.getCell(state);
  }
}
```

This approach:

- **Open for extension**: New storable types don't require modifying a central type definition
- **Co-located logic**: Each type knows how to deconstruct/reconstruct itself
- **Symbol-based brands**: Unique symbols prevent collision with user data keys and provide reliable runtime type discrimination

#### Deconstructed State and Recursion

The value returned by `[DECONSTRUCT]()` can contain any value that is itself deconstructable — including other `StorableInstance`s, built-in types like `Error` or `Map`, and of course primitives and plain objects/arrays.

The **serialization system handles recursion**, not the individual deconstructor methods. A `[DECONSTRUCT]` implementation simply returns its essential state; it does not (and should not) recursively deconstruct nested values. The deconstructor methods won't have access to the serialization machinery required for that — by design, as it would be a layering violation. Similarly, `[RECONSTRUCT]` receives state where nested values have already been reconstructed by the serialization system.
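This division of labor can be sketched with a hypothetical `deconstructDeep` walker (not the actual serialization machinery): each `[DECONSTRUCT]` returns shallow state, and the walker does all the recursion.

```typescript
const DECONSTRUCT = Symbol.for("common.deconstruct");

interface StorableInstance {
  [DECONSTRUCT](): unknown;
}

function isStorable(value: unknown): value is StorableInstance {
  return value != null && typeof value === "object" && DECONSTRUCT in value;
}

// The system walks the value graph; individual [DECONSTRUCT] methods stay shallow.
function deconstructDeep(value: unknown): unknown {
  if (isStorable(value)) {
    return deconstructDeep(value[DECONSTRUCT]()); // recurse into the returned state
  }
  if (Array.isArray(value)) return value.map(deconstructDeep);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, deconstructDeep(v)]),
    );
  }
  return value; // primitives pass through unchanged
}
```

A storable nested inside another storable's state is still deconstructed, even though neither `[DECONSTRUCT]` implementation knows about the other.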
#### Reconstruction Guarantees

The system aims for an **immutable-forward** design:

- **Plain objects and arrays** are frozen (`Object.freeze()`) upon reconstruction
- **`StorableInstance`s** should ideally be frozen as well — this is the north star, though not yet a strict requirement
- **No distinction** is made between regular and null-prototype plain objects; reconstruction always produces regular plain objects

This immutability guarantee enables safe sharing of reconstructed values and aligns with the reactive system's assumption that values don't mutate in place.

#### Unknown Types

When deserializing, a context may encounter a type tag it doesn't recognize — for example, data written by a newer version of the system. Unknown types should be **passed through** rather than rejected, preserving forward compatibility. This requires a generic `StorableInstance` to hold unrecognized types:

```typescript
class UnknownStorable implements StorableInstance {
  constructor(
    readonly typeTag: string, // e.g., "FutureType@2"
    readonly state: unknown, // the raw state, already recursively processed
  ) {}

  [DECONSTRUCT]() {
    return { type: this.typeTag, state: this.state };
  }

  static [RECONSTRUCT](
    state: { type: string; state: unknown },
    _runtime: Runtime,
  ): UnknownStorable {
    return new UnknownStorable(state.type, state.state);
  }
}
```

The serialization system has special knowledge of `UnknownStorable`: when it encounters an unknown type tag during deserialization, it wraps the original tag and state into `{ type, state }` and passes that to `[RECONSTRUCT]`. When re-serializing, it uses the preserved `typeTag` to produce the original wire format, allowing data to round-trip through systems that don't understand it.

#### Serialization Contexts

Classes provide the *capability* to serialize but don't own the wire format.
A **serialization context** owns the mapping between classes and tags:

```typescript
interface SerializationContext {
  // Maps storable types to wire format tags
  getTagFor(value: StorableInstance): string;
  getClassFor(tag: string): StorableClass;

  // Format-specific wrapping
  wrap(tag: string, state: unknown): SerializedForm;
  unwrap(data: SerializedForm): { tag: string; state: unknown };
}
```

This separation enables:

- **Protocol versioning**: Same class, different tags in v1 vs v2
- **Format flexibility**: JSON context vs CBOR context vs Automerge context
- **Migration paths**: Old context reads the legacy format, new context writes the modern format
- **Testing**: Mock contexts for unit tests

The flow becomes:

```
Serialize:   instance.[DECONSTRUCT]() → state → context.wrap(tag, state) → wire
Deserialize: wire → context.unwrap() → { tag, state } → Class[RECONSTRUCT](state) → instance
```

#### Serialization Boundaries

The boundaries where serialization occurs in the current architecture:

| Boundary | Packages | Direction |
|----------|----------|-----------|
| **Persistence** | `memory` ↔ database | read/write |
| **Iframe sandbox** | `runner` ↔ `iframe-sandbox` | postMessage |
| **Background service** | `shell` ↔ `background-piece-service` | worker messages |
| **Network sync** | `toolshed` ↔ remote peers | WebSocket/HTTP |
| **Cross-space** | space A ↔ space B | if in separate processes |

Each boundary would use a serialization context:

```typescript
// At boundary exit
function serialize(value: StorableValue, context: SerializationContext): SerializedForm {
  if (isStorable(value)) {
    const state = value[DECONSTRUCT]();
    const tag = context.getTagFor(value);
    return context.wrap(tag, state);
  }
  // Handle primitives, arrays, plain objects recursively...
}

// At boundary entry
function deserialize(data: SerializedForm, context: SerializationContext, runtime: Runtime): StorableValue {
  const { tag, state } = context.unwrap(data);
  if (tag) {
    const cls = context.getClassFor(tag);
    return cls[RECONSTRUCT](state, runtime);
  }
  // Handle primitives, arrays, plain objects recursively...
}
```

The `deserialize` function needs runtime context to reconstitute rich types (e.g., looking up existing Cell instances rather than creating duplicates).

#### Benefits

- **Type safety**: Rich types carry more information than JSON shapes
- **Simpler internal code**: No `isSigilLink()` checks scattered throughout
- **Single conversion point**: Easier to maintain, audit, and change
- **Format flexibility**: Different boundaries can use different contexts
- **Better tooling**: Debuggers show actual Cells, not JSON blobs
- **Extensible**: New storable types only need to implement the protocol

#### Relationship to Canonical Hashing

This proposal pairs with [Simplified Canonical Hashing](#simplified-canonical-hashing): canonical hashes can be computed over rich types directly, using deconstructed state for `StorableInstance`s and type-specific handling for built-in JS types. This makes identity hashing independent of any particular wire encoding.

#### Trade-offs

- **Migration complexity**: Existing code assumes JSON forms internally
- **Runtime context required**: Deserialization needs access to the runtime
- **Comparison semantics**: Must define equality for rich types (by identity? by deconstructed state?)
- **Not "zero transformations"**: Late serialization eliminates serialization copies within the runtime, but does not eliminate all transformations. Schema-driven reads still select and shape data (resolving links, projecting fields). Link construction still needs to know which data belongs to which document. CFC validation will require traversing data on write.
  The benefit is eliminating one copy (sometimes more, when pass-through data can be detected as already frozen), not eliminating all data traversal.

#### Open Questions

- What is the migration path from early to late conversion?
- How do rich types participate in change detection and diffing?
- Should cycles in deconstructed state be detected and rejected, or is this left to the serialization system?
- How are serialization contexts configured and selected at each boundary?
- How is the type registry within a context managed? (Static registration? Dynamic discovery? Who owns the registry?)
- What happens when `[DECONSTRUCT]` or `[RECONSTRUCT]` fails partway through? (Might want a `ProblematicStorable` with similar structure/use to `UnknownStorable`.)
- How do schemas integrate with the storable protocol? Each `StorableInstance` type implies a schema for its deconstructed state. The storable layer should provide serialization contexts access to these schemas. What changes to the schema language are required? (See [Schemas](./7-schemas.md).)
- Which built-in JS types should be included?
  - Byte arrays: `Uint8Array`, `ArrayBuffer`, or both?
  - Date/time: `Date`, `Temporal.Instant`, `Temporal.ZonedDateTime`?
  - Are there others beyond Error, Map, Set?
- Should additional JS primitives become first-class?
  - `undefined`: Currently has context-dependent semantics (deletion, absent, null)
  - `bigint`: Currently rejected; useful for large integers
- How does this interact with the proposed CRDT layer (below)?

---

### JSON Encoding for Special Types

This section describes the **JSON-compatible** representation of special types. While the system will likely maintain a JSON encoding indefinitely (especially useful for debugging and interoperability), the intent is for other wire and storage formats to be available which _are not_ themselves layered on top of a translation from "native types" to JSON.
Other encodings like CBOR may represent types more directly — for example, using CBOR's native byte array rather than `{ "/Bytes@1": "base64..." }`.

#### Current State: Three Conventions

The current system uses three different conventions for special object shapes:

| Convention | Example | Used For |
|------------|---------|----------|
| IPLD sigil | `{ "/": { "link@1": {...} } }` | Cell references |
| `@` prefix | `{ "@Error": {...} }` | Error instances |
| `$` prefix | `{ "$stream": true }` | Stream markers |

This inconsistency complicates parsing and adds cognitive overhead.

#### Proposed: Unified `/<Name>@<Version>` Keys

Unify all special types under a single convention: single-key objects where the key follows the pattern `/<Name>@<Version>`:

- `/` — sigil prefix (nodding to IPLD heritage)
- `<Name>` — `UpperCamelCase` type name
- `@<Version>` — version number (natural number, optionally `.<minor>`)

Examples:

```json
{ "/Link@1": { "id": "of:abc...", "path": ["x", "y"], "space": "..." } }
{ "/Error@1": { "name": "TypeError", "message": "...", "stack": "..." } }
{ "/Stream@1": null }
{ "/Map@1": [ ["key1", "value1"], ["key2", "value2"] ] }
{ "/Set@1": [ "a", "b", "c" ] }
{ "/Bytes@1": "base64encoded..." }
{ "/Date@1": "2026-02-05T12:34:56Z" }
{ "/BigInt@1": "12345678901234567890" }
```

**Note:** The `/<Name>@<Version>` convention described here applies specifically to the JSON encoding. Serialization contexts for other formats are free to use whatever representation makes the most sense in their context.

#### Benefits

- **Single convention**: One pattern to recognize and parse
- **Flat structure**: One level of nesting (vs two for the current IPLD style)
- **Self-describing**: Type and version are visible in the key
- **Compact**: Shorter than `{ "/": { "link@1": {...} } }`
- **Versionable**: Built-in version field supports evolution

#### Detection

A value is a special type if:

1. It is a plain object
2. It has exactly one key
3. That key starts with `/`

This simple rule is quick to check and provides maximum flexibility to evolve the key format.

#### Stateless Types

Types that require no reconstruction state use `null` as the value:

```json
{ "/Stream@1": null }
```

This clearly distinguishes "no state needed" from "empty state" (`{}`).

#### Escaping and Literal Values

Two escape mechanisms handle cases where user data might be mistaken for special types:

**`/object` — Single-layer escape, values still interpreted**

Wraps a plain object whose key(s) might look like special types, but the values are still processed normally:

```json
{ "/object": { "/myKey": { "/Link@1": { "id": "..." } } } }
```

Deserializes to `{ "/myKey": <reconstructed Link> }`: the `/object` wrapper is stripped, the inner object's keys are taken literally, but its values go through normal deserialization (the Link is reconstructed).

**`/quote` — Fully literal, no interpretation**

Wraps a value that should be returned exactly as-is, with no deserialization of any nested special forms:

```json
{ "/quote": { "/Link@1": { "id": "..." } } }
```

Deserializes to `{ "/Link@1": { "id": "..." } }` — the inner structure is *not* reconstructed as a Link; it remains a plain object.

Use cases for `/quote`:

- Storing schemas or examples that describe special types without instantiating them
- Metaprogramming and introspection
- Optimization: skip deserialization when the subtree is known to be plain data
- Round-tripping JSON structures that happen to look like special types

**When to use which:**

- `/object`: You have a plain object with a slash-prefixed key, but values should still be interpreted normally
- `/quote`: You want the entire subtree treated as literal JSON data

#### Unknown Type Handling

When a JSON context encounters a `/<Name>@<Version>` key it doesn't recognize, it uses `UnknownStorable` (see [Unknown Types](#unknown-types) in the Storable Protocol section) to preserve the data for round-tripping.
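The detection rule and the `/object` escape can be sketched with two hypothetical helpers for the JSON context (illustrative names, not the actual implementation):

```typescript
// Detection: a plain object with exactly one key, and that key starts with "/".
function isSpecialForm(value: unknown): boolean {
  if (value === null || typeof value !== "object" || Array.isArray(value)) return false;
  const keys = Object.keys(value);
  return keys.length === 1 && keys[0].startsWith("/");
}

// When serializing a plain object that could be mistaken for a special form,
// the JSON context wraps it in a single-layer /object escape. This sketch
// escapes conservatively: any slash-prefixed key triggers the wrapper.
function escapeIfNeeded(plain: { [k: string]: unknown }): { [k: string]: unknown } {
  return Object.keys(plain).some((k) => k.startsWith("/"))
    ? { "/object": plain }
    : plain;
}
```

Note that `{ "/a": 1, "b": 2 }` is *not* a special form (two keys), but this sketch still escapes it — a context could instead escape only exact matches of the detection rule; that trade-off belongs to the context.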
#### Relationship to Serialization Contexts

This wire format is what serialization contexts produce. The context's `wrap()` and `unwrap()` methods would generate and parse these `/<Name>@<Version>` keys, mapping between rich runtime types and their serialized form. The context is also responsible for:

- Applying `/object` or `/quote` escaping when serializing plain objects that happen to have slash-prefixed keys
- Wrapping unknown types using the `typeTag` preserved in `UnknownStorable`

#### Open Questions

- What is the migration path from the current formats?
- Is `.minor` versioning needed, or is major-only sufficient?

---

### CRDT-Based Storage Layer

For collaborative features (multiple users editing shared data), the storage layer could be implemented using CRDTs (Conflict-free Replicated Data Types). Automerge is a candidate implementation.

#### Automerge's Data Model

Automerge is described as "JSON-like" but supports a richer type system:

| Category | Types |
|----------|-------|
| Containers | Map (string keys), List, Text (collaborative UTF-8) |
| Primitives | null, boolean, string, f64, i64, u64, bytes, timestamp, counter |

Notable differences from the current system:

- **Distinct integer types**: i64 and u64 vs JavaScript's single number type
- **Native binary data**: `bytes` as a first-class type
- **Timestamps**: Built in, not a convention
- **Counter**: Special type with additive merge semantics

#### Type System Constraints

Automerge has a **fixed type system by design** — merge semantics, binary format optimization, and cross-language interoperability require known types. Custom types must be handled at an application layer above Automerge.
This means the current special object shapes (`"/"`, `$stream`, `@Error`) would need a mapping layer:

- Store data as Automerge primitives/containers
- Interpret special shapes at a layer above Automerge
- The `bytes` type could store arbitrary data but loses fine-grained merge (the entire blob becomes last-write-wins)

#### Internal Structure

Automerge documents store:

- Full causal history with actor IDs (128-bit)
- Operation sequences forming a change DAG (similar to git)
- Columnar encoding with RLE compression

This enables offline editing with automatic merge on reconnection.

#### Layered CRDT Architecture

CRDTs could operate at multiple levels simultaneously:

- **Space level**: The entire space modeled as a single CRDT document, tracking the evolution of the overall JSON-ish structure
- **Component level**: Individual parts (e.g., a text field) modeled as their own independent CRDT documents with type-specific merge semantics

These layers work in harmony:

- The space treats component CRDTs opaquely — just another value that changes
- Patterns that understand specific CRDT types can work with the component's native semantics (e.g., collaborative text editing with cursor positions)
- The space-level CRDT handles structural changes (adding/removing fields)
- Component-level CRDTs handle content changes within their boundaries

This separation allows general-purpose space sync to coexist with specialized collaborative editing where needed.

#### Trade-offs

| Benefit | Cost |
|---------|------|
| Automatic conflict resolution | Fixed type system requires a mapping layer |
| Offline-first with sync | Causal history grows over time |
| Proven merge semantics | Additional complexity vs simple last-write-wins |
| Cross-language support | Must map custom types to Automerge types |

#### Open Questions

- Which data benefits from CRDT semantics vs simple last-write-wins?
- How do Cells map to Automerge documents?
- Should collaborative text (the Text type) be exposed directly?
- What is the compaction/garbage collection strategy for causal history?

---

## Open Questions

- Should there be additional special object shapes beyond `"/"`, `$stream`, and `@Error`?
- How should versioning of special shapes work?
- What happens when unknown special shapes are encountered?
- Should the `@Error` format capture more or less information?

---

**Next:** [Storage Format](./2-storage-format.md)