LatticeDB
The embedded knowledge graph for AI.
LatticeDB is a single-file database that combines a property graph, vector search, and full-text search. It's built for RAG, agents, and any application where relationships between data matter as much as the data itself.
- One file. Your entire database is a single portable file. No server, no configuration.
- Three search modes. Graph traversal, HNSW vector similarity, and BM25 full-text — in one query.
- Sub-millisecond. 0.13 us node lookups. 0.83 ms vector search at 1M vectors with 100% recall.
// Find chunks similar to a query, traverse to their document, then to the author
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
WHERE chunk.embedding <=> $query_vector < 0.3
AND doc.content @@ "neural networks"
RETURN doc.title, chunk.text, author.name
ORDER BY chunk.embedding <=> $query_vector
LIMIT 10
Features
Graph
- Nodes and edges with labels and arbitrary properties
- Multi-hop traversal, variable-length paths (*1..3)
- ACID transactions with snapshot isolation
- MERGE, WITH, UNWIND, aggregations (count, sum, avg, min, max, collect)
Vector Search
- HNSW approximate nearest neighbor with configurable M, ef
- Built-in hash embeddings or HTTP client for Ollama/OpenAI
- Batch insert for bulk loading
Full-Text Search
- BM25-ranked inverted index with tokenization and stemming
- Fuzzy search with configurable Levenshtein distance
Cypher Query Language
- MATCH, WHERE, RETURN, CREATE, DELETE, SET, REMOVE
- ORDER BY, LIMIT, SKIP, DETACH DELETE
- Vector distance operator: <=>
- Full-text search operator: @@
- Parameters: $name
Operations
- Single-file storage with write-ahead log for crash recovery
- Zero configuration — open a file and start working
- Clean C API; Python and TypeScript bindings wrap it
Use Cases
- RAG Systems — Vector search finds relevant chunks, graph traversal gathers context
- Knowledge Graphs — Linked notes and documents with semantic search
- AI Agents — Persistent memory with relationship awareness
- Local Development — Lightweight alternative to Neo4j or Weaviate for prototyping
Getting Started
Head to the Installation page to install LatticeDB, then follow the Quick Start guide to build your first knowledge graph.
Installation
CLI
curl -fsSL https://raw.githubusercontent.com/jeffhajewski/latticedb/main/dist/install.sh | bash
Python
pip install latticedb
Requires Python 3.9+ and NumPy. The native shared library (liblattice.dylib / liblattice.so) must be available on the system.
TypeScript / Node.js
npm install @hajewski/latticedb
Requires Node.js 18+. The native shared library must be available on the system.
Building from Source
LatticeDB is written in Zig with zero dependencies.
git clone https://github.com/jeffhajewski/latticedb.git
cd latticedb
zig build # build everything
zig build test # run tests
zig build -Doptimize=ReleaseFast # optimized build
Build the shared library for language bindings:
zig build shared
This produces liblattice.dylib (macOS) or liblattice.so (Linux).
See Building from Source for more details.
Quick Start
This guide walks through creating a small knowledge graph with documents and authors, storing embeddings, indexing text, and querying across all three search modes.
Python
from latticedb import Database, hash_embed

with Database("knowledge.db", create=True, enable_vector=True, vector_dimensions=128) as db:
    # --- Build the graph ---
    with db.write() as txn:
        # Create authors
        alice = txn.create_node(labels=["Person"], properties={"name": "Alice", "field": "ML"})
        bob = txn.create_node(labels=["Person"], properties={"name": "Bob", "field": "Systems"})
        txn.create_edge(alice.id, bob.id, "COLLABORATES_WITH")

        # Create documents with chunks
        for title, text, author in [
            ("Attention Is All You Need", "The transformer architecture uses self-attention...", alice),
            ("Scaling Laws for LLMs", "We find that model performance scales predictably...", alice),
            ("Log-Structured Merge Trees", "LSM trees optimize write-heavy workloads...", bob),
        ]:
            doc = txn.create_node(labels=["Document"], properties={"title": title})
            chunk = txn.create_node(labels=["Chunk"], properties={"text": text})
            # Store embedding and index text
            txn.set_vector(chunk.id, "embedding", hash_embed(text, dimensions=128))
            txn.fts_index(chunk.id, text)
            txn.create_edge(chunk.id, doc.id, "PART_OF")
            txn.create_edge(doc.id, author.id, "AUTHORED_BY")
        txn.commit()

    # --- Query: vector search + text match + graph traversal ---
    results = db.query("""
        MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
        WHERE chunk.embedding <=> $query < 0.5
        RETURN doc.title, chunk.text, author.name
        ORDER BY chunk.embedding <=> $query
        LIMIT 5
    """, parameters={"query": hash_embed("transformer attention mechanism", dimensions=128)})
    for row in results:
        print(f"{row['doc.title']} by {row['author.name']}")

    # --- Full-text search ---
    for r in db.fts_search("self-attention transformer"):
        print(f"Node {r.node_id}: score={r.score:.4f}")

    # --- Aggregations ---
    stats = db.query("""
        MATCH (doc:Document)-[:AUTHORED_BY]->(p:Person)
        RETURN p.name, count(doc) AS papers
        ORDER BY papers DESC
    """)
    for row in stats:
        print(f"{row['p.name']}: {row['papers']} papers")
TypeScript
import { Database, hashEmbed } from "@hajewski/latticedb";

const db = new Database("knowledge.db", {
  create: true,
  enableVector: true,
  vectorDimensions: 128,
});
await db.open();

// Build a graph
await db.write(async (txn) => {
  const alice = await txn.createNode({
    labels: ["Person"],
    properties: { name: "Alice", field: "ML" },
  });
  const doc = await txn.createNode({
    labels: ["Document"],
    properties: { title: "Attention Is All You Need" },
  });
  const chunk = await txn.createNode({
    labels: ["Chunk"],
    properties: { text: "The transformer architecture uses self-attention..." },
  });
  await txn.setVector(chunk.id, "embedding", hashEmbed("transformer self-attention", 128));
  await txn.ftsIndex(chunk.id, "The transformer architecture uses self-attention...");
  await txn.createEdge(chunk.id, doc.id, "PART_OF");
  await txn.createEdge(doc.id, alice.id, "AUTHORED_BY");
});

// Query across vector search + graph traversal
const results = await db.query(
  `MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
   WHERE chunk.embedding <=> $query < 0.5
   RETURN doc.title, chunk.text, author.name
   ORDER BY chunk.embedding <=> $query
   LIMIT 5`,
  { query: hashEmbed("attention mechanism", 128) }
);
for (const row of results.rows) {
  console.log(`${row["doc.title"]} by ${row["author.name"]}`);
}
await db.close();
Next Steps
- Core Concepts — understand the data model
- Cypher Overview — learn the query language
- Building a RAG System — end-to-end tutorial
Core Concepts
Property Graph
LatticeDB stores data as a property graph: a collection of nodes connected by edges, where both nodes and edges can have labels and key-value properties.
Nodes
A node represents an entity. Each node has:
- A unique ID (assigned automatically)
- One or more labels that categorize it (e.g., Person, Document, Chunk)
- Zero or more properties — key-value pairs (e.g., name: "Alice", age: 30)
alice = txn.create_node(
    labels=["Person"],
    properties={"name": "Alice", "age": 30},
)
Edges
An edge represents a relationship between two nodes. Each edge has:
- A source node and a target node
- An edge type (e.g., KNOWS, AUTHORED_BY, PART_OF)
- A direction — edges always point from source to target
txn.create_edge(alice.id, bob.id, "KNOWS")
Properties
Properties are typed key-value pairs attached to nodes. Supported types:
| Type | Python | TypeScript | C |
|---|---|---|---|
| Null | None | null | LATTICE_VALUE_NULL |
| Boolean | bool | boolean | LATTICE_VALUE_BOOL |
| Integer | int | number | LATTICE_VALUE_INT |
| Float | float | number | LATTICE_VALUE_FLOAT |
| String | str | string | LATTICE_VALUE_STRING |
| Binary | bytes | Uint8Array | LATTICE_VALUE_BYTES |
Cypher Query Language
LatticeDB uses Cypher — a declarative graph query language. Nodes are written in parentheses, edges in square brackets:
MATCH (person:Person)-[:KNOWS]->(friend:Person)
WHERE person.name = "Alice"
RETURN friend.name
LatticeDB extends Cypher with two operators:
- <=> — vector distance for similarity search
- @@ — full-text search with BM25 scoring
Vector Search
LatticeDB includes an HNSW (Hierarchical Navigable Small World) index for approximate nearest-neighbor search on vector embeddings.
To use vector search:
- Enable vector storage when opening the database
- Attach vector embeddings to nodes
- Search by similarity using vector_search() or the <=> operator in Cypher
# Store an embedding
txn.set_vector(node.id, "embedding", vector)
# Search by similarity
results = db.vector_search(query_vector, k=10)
# Or in Cypher
db.query("MATCH (n:Chunk) WHERE n.embedding <=> $q < 0.5 RETURN n", parameters={"q": query_vector})
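This section does not state which distance metric <=> computes. Assuming cosine distance (a common choice for text embeddings, so treat the metric itself as an assumption), the quantity being thresholded and sorted on can be sketched as:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 for identical direction, up to 2 for
    opposite direction. Assumes both vectors are nonzero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)
```

Smaller values mean more similar vectors, which is why the queries above filter with `< 0.5` and sort ascending.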
Embeddings
LatticeDB provides two ways to generate embeddings:
- hash_embed — a built-in deterministic hash function. No external service needed. Useful for testing and simple keyword-based similarity.
- EmbeddingClient — an HTTP client that connects to Ollama, OpenAI, or any compatible API for production-quality embeddings.
Full-Text Search
LatticeDB includes a BM25-scored inverted index for full-text search. Index text content on nodes, then search across all indexed text.
# Index text
txn.fts_index(node.id, "The transformer architecture uses self-attention")
# Search
results = db.fts_search("transformer attention")
# Fuzzy search (typo-tolerant)
results = db.fts_search_fuzzy("transformr atention")
# Or in Cypher
db.query('MATCH (n) WHERE n.text @@ "transformer attention" RETURN n')
Transactions
All operations in LatticeDB happen within transactions. LatticeDB uses snapshot isolation: each transaction sees a consistent snapshot of the database as of when it started.
- Read transactions can run concurrently — readers never block other readers.
- Write transactions are serialized — only one write transaction can commit at a time.
# Read transaction
with db.read() as txn:
    node = txn.get_node(node_id)

# Write transaction
with db.write() as txn:
    txn.create_node(labels=["Person"], properties={"name": "Alice"})
    txn.commit()
If a write transaction is not explicitly committed, it is rolled back when the context exits.
Single-File Storage
The entire database — nodes, edges, properties, vector index, text index, and write-ahead log — lives in a single file. This makes databases portable and easy to manage: copy, back up, or deploy by moving one file.
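Because everything is one file, ordinary file operations cover backup and deployment. A sketch with illustrative filenames (close the database first, so the WAL has been checkpointed and the file is self-consistent):

```python
import shutil
from pathlib import Path

# Stand-in file for demonstration; in practice this is your database
Path("knowledge.db").touch()

# Back up by copying the single file
Path("backups").mkdir(exist_ok=True)
shutil.copy2("knowledge.db", "backups/knowledge-snapshot.db")
```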
When to Use LatticeDB
LatticeDB is fast, but speed is not the only thing that matters. Here are cases where a different tool is the better choice.
You need multiple applications writing to the same database at the same time
LatticeDB is embedded with a single-writer model. One process opens the file and owns it. If you need many clients connecting over a network, use Neo4j, PostgreSQL, or another client-server database.
Your data is fundamentally tabular
If your data fits naturally into rows and columns — sales records, user accounts, time series — a relational database like SQLite or PostgreSQL will be simpler and just as fast. Graph databases shine when relationships between records are the point, not an afterthought.
You need to scale beyond a single machine
LatticeDB stores everything in one file on one machine. If you need sharding, replication, or distributed queries across billions of nodes, look at Neo4j cluster, Dgraph, or a managed service like Neptune.
You need the full Cypher language
LatticeDB supports most of Cypher but not all of it. Features like OPTIONAL MATCH and CALL procedures are not yet implemented. If your queries depend on these, use Neo4j, which implements the full language.
You need mature tooling and ecosystem
Neo4j has visualization tools, admin dashboards, monitoring, drivers in every language, and years of community resources. PostgreSQL has decades of tooling. LatticeDB is new and lean — which is a strength for embedding, but a weakness if you need a rich operational ecosystem around your database.
Building a RAG System
This guide walks through building a Retrieval-Augmented Generation (RAG) system with LatticeDB. We'll chunk documents, generate embeddings, store them with graph relationships, and query using vector search combined with graph traversal.
Architecture
A typical RAG system with LatticeDB:
- Ingest — Split documents into chunks, generate embeddings, store in the graph
- Link — Create edges between chunks, documents, authors, topics
- Retrieve — Vector search finds relevant chunks, graph traversal gathers context
- Generate — Pass retrieved context to an LLM
Step 1: Set Up the Database
from latticedb import Database, hash_embed
db = Database(
    "rag.db",
    create=True,
    enable_vector=True,
    vector_dimensions=128,  # Match your embedding model's output
)
db.open()
For production, use a real embedding model via the HTTP client:
from latticedb import EmbeddingClient
client = EmbeddingClient(
    "http://localhost:11434",
    model="nomic-embed-text",
)
Step 2: Ingest Documents
Split documents into chunks and store them with graph relationships:
def ingest_document(db, title, author_name, chunks):
    with db.write() as txn:
        # Create the document and author
        doc = txn.create_node(
            labels=["Document"],
            properties={"title": title},
        )
        author = txn.create_node(
            labels=["Person"],
            properties={"name": author_name},
        )
        txn.create_edge(doc.id, author.id, "AUTHORED_BY")

        # Create chunks with embeddings
        prev_chunk = None
        for i, text in enumerate(chunks):
            chunk = txn.create_node(
                labels=["Chunk"],
                properties={"text": text, "position": i},
            )
            # Store embedding
            embedding = hash_embed(text, dimensions=128)
            txn.set_vector(chunk.id, "embedding", embedding)
            # Index for full-text search
            txn.fts_index(chunk.id, text)
            # Link chunk to document
            txn.create_edge(chunk.id, doc.id, "PART_OF")
            # Link sequential chunks
            if prev_chunk is not None:
                txn.create_edge(prev_chunk.id, chunk.id, "NEXT")
            prev_chunk = chunk
        txn.commit()
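ingest_document expects pre-split chunks. A minimal word-window chunker is enough to feed it; this helper is illustrative only (chunk_text, chunk_size, and overlap are not part of the LatticeDB API):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of whole words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from both neighboring chunks.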
Step 3: Add Topic Links
Enrich the graph with topic relationships:
with db.write() as txn:
    ml_topic = txn.create_node(
        labels=["Topic"],
        properties={"name": "Machine Learning"},
    )
    # Link documents to topics
    txn.create_edge(doc.id, ml_topic.id, "ABOUT")
    txn.commit()
Step 4: Query — Vector Search + Graph Context
The key advantage of LatticeDB: retrieve by similarity, then traverse the graph for additional context.
def retrieve_context(db, query_text, k=5):
    """Retrieve relevant chunks with their surrounding context."""
    query_vec = hash_embed(query_text, dimensions=128)
    # Find similar chunks and traverse to their documents and authors
    results = db.query("""
        MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
        WHERE chunk.embedding <=> $query < 0.5
        RETURN chunk.text, doc.title, author.name
        ORDER BY chunk.embedding <=> $query
        LIMIT $k
    """, parameters={"query": query_vec, "k": k})
    return results
Retrieve with Neighboring Chunks
Get surrounding chunks for more context:
results = db.query("""
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)
WHERE chunk.embedding <=> $query < 0.5
WITH chunk, doc
ORDER BY chunk.embedding <=> $query
LIMIT 5
MATCH (prev:Chunk)-[:NEXT]->(chunk)
RETURN prev.text, chunk.text, doc.title
""", parameters={"query": query_vec})
Combine Vector and Full-Text Search
results = db.query("""
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)
WHERE chunk.embedding <=> $query < 0.5
AND chunk.text @@ $keywords
RETURN chunk.text, doc.title
ORDER BY chunk.embedding <=> $query
LIMIT 10
""", parameters={
"query": query_vec,
"keywords": "transformer attention",
})
Step 5: Pass to LLM
context = retrieve_context(db, "How does self-attention work?")
# Build prompt with retrieved context
chunks = [f"[{r['doc.title']}] {r['chunk.text']}" for r in context]
prompt = f"""Answer the question based on the following context:
{chr(10).join(chunks)}
Question: How does self-attention work?"""
# Pass to your LLM of choice
# response = llm.generate(prompt)
Batch Loading
For large datasets, use batch insert:
import numpy as np

with db.write() as txn:
    # Insert 10,000 chunks at once
    vectors = np.array([hash_embed(text, 128) for text in all_chunks], dtype=np.float32)
    node_ids = txn.batch_insert("Chunk", vectors)
    # Set properties and create edges afterward
    for node_id, text in zip(node_ids, all_chunks):
        txn.set_property(node_id, "text", text)
        txn.fts_index(node_id, text)
    txn.commit()
Performance Tips
- Use batch_insert() for bulk loading — significantly faster than individual creates
- Set ef_search based on your recall requirements (64 gives 100% recall at 1M vectors)
- Use cache_size_mb to control memory usage
- Index only the text fields you need to search with fts_index()
- Use parameters ($name) to enable query plan caching
Knowledge Graph Modeling
This guide covers data modeling patterns for knowledge graphs in LatticeDB.
Basic Patterns
Entities and Relationships
Model real-world entities as nodes and their relationships as edges:
with db.write() as txn:
    alice = txn.create_node(labels=["Person"], properties={"name": "Alice"})
    company = txn.create_node(labels=["Company"], properties={"name": "Acme Corp"})
    txn.create_edge(alice.id, company.id, "WORKS_AT")
    txn.commit()
Labels as Types
Use labels to categorize nodes. A node can have multiple labels:
txn.create_node(labels=["Person", "Employee", "Manager"], properties={"name": "Alice"})
Query by label:
MATCH (m:Manager) RETURN m.name
Document Graph
A common pattern for document management:
Document -[:AUTHORED_BY]-> Person
Document -[:ABOUT]-> Topic
Chunk -[:PART_OF]-> Document
Chunk -[:NEXT]-> Chunk
with db.write() as txn:
    doc = txn.create_node(labels=["Document"], properties={"title": "Paper Title"})
    author = txn.create_node(labels=["Person"], properties={"name": "Alice"})
    topic = txn.create_node(labels=["Topic"], properties={"name": "Machine Learning"})
    txn.create_edge(doc.id, author.id, "AUTHORED_BY")
    txn.create_edge(doc.id, topic.id, "ABOUT")

    # Create a chain of chunks
    chunks = []
    for i, text in enumerate(chunk_texts):
        chunk = txn.create_node(labels=["Chunk"], properties={"text": text, "position": i})
        txn.create_edge(chunk.id, doc.id, "PART_OF")
        if chunks:
            txn.create_edge(chunks[-1].id, chunk.id, "NEXT")
        chunks.append(chunk)
    txn.commit()
Multi-Hop Queries
Finding Collaborators
// Direct collaborators
MATCH (a:Person {name: "Alice"})-[:COLLABORATES_WITH]->(b:Person)
RETURN b.name
// Collaborators of collaborators
MATCH (a:Person {name: "Alice"})-[:COLLABORATES_WITH*2]->(c:Person)
RETURN DISTINCT c.name
Path Queries
// How are two people connected?
MATCH (a:Person {name: "Alice"})-[*1..4]->(b:Person {name: "Dave"})
RETURN a.name, b.name
Aggregating Over Relationships
// Authors with the most documents about ML
MATCH (p:Person)<-[:AUTHORED_BY]-(d:Document)-[:ABOUT]->(t:Topic {name: "Machine Learning"})
RETURN p.name, count(d) AS papers
ORDER BY papers DESC
LIMIT 10
Social Network
Person -[:KNOWS]-> Person
Person -[:FOLLOWS]-> Person
Person -[:FOLLOWS]-> Topic
Person -[:POSTED]-> Post
Post -[:TAGGED]-> Topic
// Find friends who share interests
MATCH (me:Person {name: "Alice"})-[:KNOWS]->(friend:Person)
MATCH (me)-[:FOLLOWS]->(topic:Topic)<-[:FOLLOWS]-(friend)
RETURN friend.name, collect(topic.name) AS shared_interests
Modeling Tips
- Use descriptive edge types. AUTHORED_BY is more useful than RELATED_TO.
- Keep properties simple. Store complex data as separate nodes with edges rather than large property values.
- Use labels for querying. Labels enable efficient scans via the label index.
- Model for your queries. Think about what traversals you need and structure edges accordingly.
- Use MERGE for idempotent imports. When loading data from multiple sources, MERGE prevents duplicate nodes.
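To illustrate the last tip: an import written with MERGE matches existing nodes instead of creating duplicates, so re-running it is safe (a sketch; check MERGE's exact match-or-create semantics against the Cypher reference before relying on it):

```cypher
MERGE (p:Person {name: "Alice"})
MERGE (c:Company {name: "Acme Corp"})
MERGE (p)-[:WORKS_AT]->(c)
```

Running this twice leaves one Alice, one Acme Corp, and one WORKS_AT edge.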
Working with Embeddings
LatticeDB supports storing and searching vector embeddings on nodes. This guide covers the embedding options and how to use them.
Overview
To use embeddings:
- Enable vector storage when opening the database
- Generate embeddings (built-in hash or external service)
- Attach embeddings to nodes
- Search by similarity
Enabling Vector Storage
db = Database(
    "mydb.db",
    create=True,
    enable_vector=True,
    vector_dimensions=128,  # Must match your embedding dimensions
)
const db = new Database("mydb.db", {
  create: true,
  enableVector: true,
  vectorDimensions: 128,
});
Hash Embeddings (Built-in)
hash_embed generates deterministic embeddings from text without an external service. It uses feature hashing to map text tokens to a fixed-dimension vector.
When to use: Testing, prototyping, or when keyword-level similarity is sufficient.
Python:
from latticedb import hash_embed
vec = hash_embed("hello world", dimensions=128)
# Returns a numpy array of shape (128,)
TypeScript:
import { hashEmbed } from "@hajewski/latticedb";
const vec = hashEmbed("hello world", 128);
// Returns a Float32Array of length 128
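The exact algorithm behind hash_embed is not specified here. Conceptually, feature hashing maps each token to a bucket and accumulates a signed count, then normalizes; the following pure-Python sketch illustrates the idea and is not the actual implementation:

```python
import hashlib
import math

def toy_hash_embed(text: str, dimensions: int = 128) -> list[float]:
    """Illustrative feature hashing: hash each token to a bucket,
    add a signed count, then L2-normalize the result."""
    vec = [0.0] * dimensions
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        bucket = h % dimensions
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

Because the hash is deterministic, the same text always yields the same vector, which is what makes hash embeddings reproducible in tests.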
HTTP Embedding Client
For production-quality semantic embeddings, use the HTTP client to connect to an embedding service.
Ollama
from latticedb import EmbeddingClient
with EmbeddingClient("http://localhost:11434") as client:
    vec = client.embed("hello world")
import { EmbeddingClient } from "@hajewski/latticedb";
const client = new EmbeddingClient({
  endpoint: "http://localhost:11434",
});
const vec = client.embed("hello world");
client.close();
OpenAI
from latticedb import EmbeddingClient, EmbeddingApiFormat
with EmbeddingClient(
    "https://api.openai.com/v1",
    model="text-embedding-3-small",
    api_format=EmbeddingApiFormat.OPENAI,
    api_key="sk-...",
) as client:
    vec = client.embed("hello world")
import { EmbeddingClient, EmbeddingApiFormat } from "@hajewski/latticedb";
const client = new EmbeddingClient({
  endpoint: "https://api.openai.com/v1",
  model: "text-embedding-3-small",
  apiFormat: EmbeddingApiFormat.OpenAI,
  apiKey: "sk-...",
});
const vec = client.embed("hello world");
client.close();
Storing Embeddings
Attach a vector to a node within a write transaction:
with db.write() as txn:
    node = txn.create_node(labels=["Chunk"], properties={"text": "some text"})
    embedding = hash_embed("some text", dimensions=128)
    txn.set_vector(node.id, "embedding", embedding)
    txn.commit()
Searching
Programmatic API
query_vec = hash_embed("search query", dimensions=128)
results = db.vector_search(query_vec, k=10, ef_search=64)
for r in results:
    print(f"Node {r.node_id}: distance={r.distance:.4f}")
Cypher
MATCH (n:Chunk)
WHERE n.embedding <=> $query < 0.5
RETURN n.text
ORDER BY n.embedding <=> $query
LIMIT 10
Batch Insert
For bulk loading, use batch_insert which is significantly faster than inserting one at a time:
import numpy as np
with db.write() as txn:
    vectors = np.random.rand(10000, 128).astype(np.float32)
    node_ids = txn.batch_insert("Document", vectors)
    txn.commit()
Choosing Dimensions
The vector_dimensions parameter must be set when opening the database and must match the embedding model output:
| Model | Dimensions |
|---|---|
| hash_embed (built-in) | Configurable (default 128) |
| text-embedding-3-small (OpenAI) | 1536 |
| text-embedding-3-large (OpenAI) | 3072 |
| nomic-embed-text (Ollama) | 768 |
| mxbai-embed-large (Ollama) | 1024 |
Higher dimensions capture more semantic nuance but use more memory and slow down search.
Full-Text Search
LatticeDB includes a BM25-scored inverted index for full-text search. This guide covers indexing, searching, and fuzzy matching.
How It Works
LatticeDB's full-text search uses:
- Tokenization — text is split into terms
- Stemming — terms are reduced to their root form
- Inverted index — maps terms to the nodes containing them
- BM25 scoring — ranks results by relevance considering term frequency, document frequency, and document length
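The ranking step can be illustrated with a compact, self-contained BM25 sketch over a toy whitespace tokenizer (k1 and b are the usual defaults; LatticeDB's tokenizer and stemmer will produce different scores):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25
    (toy tokenizer: lowercase whitespace split, no stemming)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n  # average document length
    df = Counter()  # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            f = tf[term]  # term frequency in this document
            if f == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores
```

Rarer terms get a higher idf weight, and the length normalization keeps long documents from winning on term frequency alone.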
Indexing Text
Index text content on a node within a write transaction:
with db.write() as txn:
    node = txn.create_node(labels=["Document"], properties={"title": "My Doc"})
    txn.fts_index(node.id, "The quick brown fox jumps over the lazy dog")
    txn.commit()
await db.write(async (txn) => {
  const node = await txn.createNode({
    labels: ["Document"],
    properties: { title: "My Doc" },
  });
  await txn.ftsIndex(node.id, "The quick brown fox jumps over the lazy dog");
});
Searching
Programmatic API
results = db.fts_search("quick fox", limit=10)
for r in results:
    print(f"Node {r.node_id}: score={r.score:.4f}")
const results = await db.ftsSearch("quick fox", { limit: 10 });
for (const r of results) {
  console.log(`Node ${r.nodeId}: score=${r.score.toFixed(4)}`);
}
Cypher
MATCH (d:Document)
WHERE d.content @@ "quick fox"
RETURN d.title
Fuzzy Search
Fuzzy search tolerates typos using Levenshtein edit distance:
# Finds "machine learning" despite typos
results = db.fts_search_fuzzy("machin lerning", limit=10)
Controlling Sensitivity
results = db.fts_search_fuzzy(
    "machne",
    limit=10,
    max_distance=2,  # Max edit distance (default: 2)
    min_term_length=4,  # Min term length for fuzzy matching (default: 4)
)
const results = await db.ftsSearchFuzzy("machne", {
  limit: 10,
  maxDistance: 2,
  minTermLength: 4,
});
- max_distance — maximum Levenshtein edit distance. Higher values find more matches but may include irrelevant results.
- min_term_length — minimum term length to apply fuzzy matching. Short terms (like "a", "the") are matched exactly.
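For intuition about what max_distance means, here is the standard dynamic-programming computation of Levenshtein distance (illustrative, not LatticeDB's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character insertions,
    deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[len(b)]
```

With the default max_distance of 2, "machne" matches "machine" (one insertion) but not a term three edits away.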
Combining with Vector Search
Use both search modes in a single Cypher query for hybrid retrieval:
MATCH (chunk:Chunk)
WHERE chunk.embedding <=> $query < 0.5
AND chunk.text @@ "neural networks"
RETURN chunk.text
ORDER BY chunk.embedding <=> $query
LIMIT 10
Performance
Full-text search in LatticeDB is fast:
| Operation | Latency |
|---|---|
| FTS search (100 docs) | 19 us |
This is ~300x faster than SQLite FTS5 and competitive with Tantivy, a dedicated Rust search library. See Benchmarks for details.
Transactions and Durability
LatticeDB provides ACID transactions with snapshot isolation.
Transaction Model
- Read transactions see a consistent snapshot of the database. Multiple read transactions can run concurrently.
- Write transactions can modify the database. Only one write transaction can commit at a time (single-writer model).
- Snapshot isolation — each transaction sees the database as it was when the transaction began. Other transactions' uncommitted changes are invisible.
Using Transactions
Python
# Read transaction
with db.read() as txn:
    node = txn.get_node(node_id)
    edges = txn.get_outgoing_edges(node_id)
# Transaction automatically completes when context exits

# Write transaction
with db.write() as txn:
    node = txn.create_node(labels=["Person"], properties={"name": "Alice"})
    txn.commit()
# If commit() is not called, the transaction is rolled back
TypeScript
// Read transaction
await db.read(async (txn) => {
  const node = await txn.getNode(nodeId);
  const edges = await txn.getOutgoingEdges(nodeId);
});

// Write transaction
await db.write(async (txn) => {
  const node = await txn.createNode({
    labels: ["Person"],
    properties: { name: "Alice" },
  });
  // Transaction auto-commits on success, rolls back on error
});
C
// Begin a read transaction
lattice_txn* txn;
lattice_begin(db, LATTICE_TXN_READ_ONLY, &txn);
// ... read operations ...
lattice_commit(txn);
// Begin a write transaction
lattice_begin(db, LATTICE_TXN_READ_WRITE, &txn);
// ... write operations ...
lattice_commit(txn); // or lattice_rollback(txn);
Queries and Transactions
db.query() automatically creates the appropriate transaction:
- Read queries (MATCH ... RETURN) use a read transaction
- Write queries (CREATE, SET, DELETE) use a write transaction
# Implicit read transaction
results = db.query("MATCH (n:Person) RETURN n.name")
# Implicit write transaction
db.query("CREATE (n:Person {name: 'Alice'})")
Durability
LatticeDB uses a write-ahead log (WAL) for crash recovery:
- Changes are written to the WAL before being applied to data pages
- Commit means the WAL record is on disk (a fast sequential write)
- Data pages are written lazily during checkpointing
- On crash, the WAL is replayed to recover committed transactions
This means committed data survives process crashes and power failures.
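The recovery step can be illustrated with a toy log: only transactions whose commit record made it to disk before the crash are replayed (a sketch of the idea, not LatticeDB's on-disk format):

```python
import json

def replay_wal(wal_lines: list[str]) -> dict:
    """Rebuild state from a toy WAL: apply 'set' records only from
    transactions that reached a 'commit' record before the crash."""
    state, pending = {}, {}
    for line in wal_lines:
        rec = json.loads(line)
        if rec["op"] == "set":
            # Buffer changes until the transaction commits
            pending.setdefault(rec["txn"], []).append((rec["key"], rec["value"]))
        elif rec["op"] == "commit":
            for key, value in pending.pop(rec["txn"], []):
                state[key] = value
    return state
```

A transaction that wrote records but never committed contributes nothing to the recovered state, which is exactly the durability guarantee above.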
Concurrency
- Readers never block writers
- Writers never block readers
- Multiple readers can execute simultaneously
- Write transactions are serialized at commit time
This makes LatticeDB well-suited for read-heavy workloads typical of RAG and search applications.
Performance Tuning
This guide covers the key parameters for tuning LatticeDB performance.
Cache Size
The cache_size_mb parameter controls how many database pages are cached in memory. Larger caches reduce disk I/O.
db = Database("mydb.db", cache_size_mb=200) # 200 MB cache
Guidelines:
- Default is 100 MB, which handles most workloads well
- For large databases (1M+ nodes), increase to 200-500 MB
- For memory-constrained environments, reduce to 50 MB or less
- The cache stores fixed-size pages (4 KB each), so 100 MB holds ~25,000 pages
Vector Search: ef_search
The ef_search parameter controls the accuracy/speed tradeoff for HNSW vector search.
| ef_search | Mean Latency (1M vectors) | Recall@10 |
|---|---|---|
| 16 | 506 us | 57% |
| 32 | 1.9 ms | 79% |
| 64 | 990 us | 100% |
| 128 | 3.2 ms | 100% |
| 256 | 11.6 ms | 100% |
Guidelines:
- Default is 64, which achieves 100% recall at 1M vectors
- For latency-sensitive applications, try 32 (79% recall)
- For maximum recall in critical applications, use 128
- Values above 128 rarely improve recall but increase latency
# Programmatic API
results = db.vector_search(query_vec, k=10, ef_search=128)
Batch Insert
When loading large amounts of data, use batch_insert instead of individual creates:
import numpy as np
with db.write() as txn:
    # Fast: ~248 inserts/sec at 1M scale
    vectors = np.random.rand(10000, 128).astype(np.float32)
    node_ids = txn.batch_insert("Document", vectors)
    txn.commit()
Batch insert is significantly faster than individual node creation because it amortizes HNSW index updates.
Query Plan Caching
Use parameterized queries to enable plan caching:
# Good: plan is cached and reused
for name in names:
    db.query("MATCH (n:Person) WHERE n.name = $name RETURN n", parameters={"name": name})

# Bad: new plan compiled for each query
for name in names:
    db.query(f"MATCH (n:Person) WHERE n.name = '{name}' RETURN n")
Monitor cache effectiveness:
stats = db.cache_stats()
print(f"Hit rate: {stats['hits'] / (stats['hits'] + stats['misses']):.1%}")
Indexing Strategy
Full-Text Search
Only index text that you need to search. Indexing unnecessary text wastes memory and slows inserts:
# Index only the searchable text field
txn.fts_index(chunk.id, chunk_text)
# Don't index metadata, IDs, etc.
Labels
Use specific labels for nodes you query frequently. Label scans are fast because they use a dedicated index:
// Fast: scans only Chunk nodes
MATCH (c:Chunk) WHERE c.embedding <=> $q < 0.5 RETURN c
// Slower: scans all nodes
MATCH (n) WHERE n.embedding <=> $q < 0.5 RETURN n
Transaction Scope
Keep write transactions short. Long-running write transactions hold the write lock and block other writes:
# Good: small, focused transactions
for batch in chunks(data, 1000):
    with db.write() as txn:
        for item in batch:
            txn.create_node(labels=["Item"], properties=item)
        txn.commit()

# Bad: one giant transaction
with db.write() as txn:
    for item in all_data:  # millions of items
        txn.create_node(labels=["Item"], properties=item)
    txn.commit()
Memory Usage
Vector storage dominates memory at scale:
| Scale | Memory |
|---|---|
| 1,000 vectors (128d) | 1 MB |
| 10,000 vectors | 10 MB |
| 100,000 vectors | 101 MB |
| 1,000,000 vectors | 1,040 MB |
Plan your vector_dimensions and scale accordingly. Lower dimensions use less memory but capture less semantic information.
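For planning, raw vector storage is roughly n x dimensions x 4 bytes (float32), and the table above suggests the HNSW graph roughly doubles that at scale. A back-of-envelope estimator (the 2x overhead factor is an inference from the table, not a documented constant):

```python
def estimate_vector_memory_mb(n_vectors: int, dimensions: int, overhead: float = 2.0) -> float:
    """Rough memory estimate: float32 vector data times an assumed
    HNSW graph overhead factor, in MB."""
    raw_bytes = n_vectors * dimensions * 4  # 4 bytes per float32 component
    return raw_bytes * overhead / (1024 * 1024)
```

For 1M vectors at 128 dimensions this lands near the ~1 GB shown in the table; treat it as a sizing aid, not a guarantee.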
Cypher Overview
LatticeDB uses the Cypher query language for graph operations. Cypher is a declarative language where you describe patterns to match and operations to perform, rather than specifying how to execute them.
Syntax at a Glance
Nodes are written in parentheses, edges in square brackets:
// Match a pattern
MATCH (person:Person)-[:KNOWS]->(friend:Person)
WHERE person.name = "Alice"
RETURN friend.name
Supported Clauses
| Clause | Purpose |
|---|---|
| MATCH | Find patterns in the graph |
| WHERE | Filter results with conditions |
| RETURN | Specify output columns |
| CREATE | Create nodes and edges |
| SET | Update properties and labels |
| DELETE | Remove nodes and edges |
| REMOVE | Remove properties and labels |
| MERGE | Match or create a pattern |
| WITH | Chain query parts, pipe results |
| UNWIND | Expand lists into rows |
| ORDER BY | Sort results |
| LIMIT | Limit number of results |
| SKIP | Skip first N results |
LatticeDB Extensions
LatticeDB extends Cypher with two operators for search:
Vector Distance (<=>)
Find nodes with similar embeddings:
MATCH (chunk:Chunk)
WHERE chunk.embedding <=> $query_vector < 0.5
RETURN chunk.text
ORDER BY chunk.embedding <=> $query_vector
See Vector Search for details.
Full-Text Search (@@)
Search indexed text content:
MATCH (doc:Document)
WHERE doc.content @@ "neural networks"
RETURN doc.title
See Full-Text Search for details.
Expressions
Operators
| Category | Operators |
|---|---|
| Comparison | =, <>, <, <=, >, >= |
| Logical | AND, OR, NOT, XOR |
| Arithmetic | +, -, *, /, %, ^ |
| String | CONTAINS, STARTS WITH, ENDS WITH |
| Null | IS NULL, IS NOT NULL |
| Search | <=> (vector distance), @@ (full-text) |
Functions
| Function | Description |
|---|---|
| id(node) | Get node ID |
| coalesce(a, b, ...) | Return first non-null value |
| abs(x) | Absolute value |
| size(list) | List length |
| toInteger(x) | Convert to integer |
| toFloat(x) | Convert to float |
See Functions for details.
Aggregations
| Function | Description |
|---|---|
| count(x) | Count values |
| sum(x) | Sum values |
| avg(x) | Average values |
| min(x) | Minimum value |
| max(x) | Maximum value |
| collect(x) | Collect into list |
See Aggregations for details.
Parameters
Use $name syntax to pass values safely:
MATCH (n:Person) WHERE n.name = $name RETURN n
See Parameters for language-specific binding.
MATCH and Patterns
MATCH finds patterns in the graph. It is the primary way to read data.
Node Patterns
Match all nodes:
MATCH (n) RETURN n
Match nodes with a label:
MATCH (p:Person) RETURN p.name
Match nodes with inline property filtering:
MATCH (p:Person {name: "Alice"}) RETURN p
Variables
Parentheses bind a matched node to a variable:
MATCH (person:Person)
RETURN person.name, person.age
Variables are used in WHERE, RETURN, SET, and other clauses to reference matched elements.
Edge Patterns
Match edges between nodes:
-- Outgoing edge
MATCH (a:Person)-[:KNOWS]->(b:Person)
RETURN a.name, b.name
-- Incoming edge
MATCH (a:Person)<-[:KNOWS]-(b:Person)
RETURN a.name, b.name
-- Either direction
MATCH (a:Person)-[:KNOWS]-(b:Person)
RETURN a.name, b.name
Edge Variables
Bind edges to variables to access their properties:
MATCH (a)-[r:KNOWS]->(b)
RETURN a.name, r, b.name
Edge Type Filtering
Match specific edge types:
MATCH (a)-[:KNOWS]->(b) RETURN b
MATCH (a)-[:AUTHORED_BY]->(b) RETURN b
Multi-Hop Patterns
Chain patterns to traverse multiple hops:
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
RETURN chunk.text, doc.title, author.name
Variable-Length Paths
Match paths of varying length:
-- 1 to 3 hops
MATCH (a)-[*1..3]->(b) RETURN b
-- Exactly 2 hops
MATCH (a)-[*2]->(b) RETURN b
-- 2 or more hops
MATCH (a)-[*2..]->(b) RETURN b
-- Any number of hops
MATCH (a)-[*]->(b) RETURN b
See Variable-Length Paths for details.
Multiple MATCH Clauses
Use multiple MATCH clauses to combine patterns:
MATCH (a:Person {name: "Alice"})
MATCH (b:Person {name: "Bob"})
RETURN a, b
WHERE and Filtering
WHERE filters matched patterns based on conditions.
Comparison Operators
MATCH (n:Person) WHERE n.age > 30 RETURN n.name
MATCH (n:Person) WHERE n.age >= 30 RETURN n.name
MATCH (n:Person) WHERE n.age < 30 RETURN n.name
MATCH (n:Person) WHERE n.age <= 30 RETURN n.name
MATCH (n:Person) WHERE n.name = "Alice" RETURN n
MATCH (n:Person) WHERE n.name <> "Alice" RETURN n
Boolean Logic
Combine conditions with AND, OR, NOT, and XOR:
MATCH (n:Person)
WHERE n.age > 25 AND n.field = "ML"
RETURN n.name
MATCH (n:Person)
WHERE n.field = "ML" OR n.field = "AI"
RETURN n.name
MATCH (n:Person)
WHERE NOT n.field = "Systems"
RETURN n.name
Null Checks
MATCH (n:Person) WHERE n.email IS NULL RETURN n.name
MATCH (n:Person) WHERE n.email IS NOT NULL RETURN n.name
String Predicates
MATCH (n:Person) WHERE n.name CONTAINS "li" RETURN n
MATCH (n:Person) WHERE n.name STARTS WITH "A" RETURN n
MATCH (n:Person) WHERE n.name ENDS WITH "ce" RETURN n
Property Existence
Test whether a property exists by checking for null:
MATCH (n:Person) WHERE n.email IS NOT NULL RETURN n
Inline Property Matching
Properties in the MATCH pattern act as equality filters:
-- These are equivalent:
MATCH (n:Person {name: "Alice"}) RETURN n
MATCH (n:Person) WHERE n.name = "Alice" RETURN n
Vector Distance
Filter by vector similarity using <=>:
MATCH (c:Chunk)
WHERE c.embedding <=> $query < 0.5
RETURN c.text
See Vector Search.
Full-Text Search
Filter by text content using @@:
MATCH (d:Document)
WHERE d.content @@ "neural networks"
RETURN d.title
See Full-Text Search.
RETURN and Projections
RETURN specifies which values to include in the query output.
Returning Properties
MATCH (n:Person)
RETURN n.name, n.age
Returning Nodes
Return a node reference:
MATCH (n:Person)
RETURN n
Aliases
Use AS to rename output columns:
MATCH (n:Person)
RETURN n.name AS name, n.age AS age
Expressions
Return computed values:
MATCH (n:Person)
RETURN n.name, n.age + 1
Literals
Return literal values:
MATCH (n:Person)
RETURN n.name, "person" AS type
DISTINCT
Remove duplicate rows:
MATCH (a:Person)-[:KNOWS]->(b:Person)
RETURN DISTINCT b.name
Ordering
Use ORDER BY to sort results:
MATCH (n:Person)
RETURN n.name, n.age
ORDER BY n.age DESC
Sort by multiple columns:
MATCH (n:Person)
RETURN n.name, n.age
ORDER BY n.field, n.age DESC
Pagination
Use LIMIT and SKIP for pagination:
-- First 10 results
MATCH (n:Person) RETURN n.name LIMIT 10
-- Skip first 10, return next 10
MATCH (n:Person) RETURN n.name SKIP 10 LIMIT 10
Aggregations in RETURN
Aggregation functions can be used directly in RETURN:
MATCH (doc:Document)-[:AUTHORED_BY]->(p:Person)
RETURN p.name, count(doc) AS papers
ORDER BY papers DESC
See Aggregations for the full list.
CREATE, SET, DELETE
Mutation clauses modify the graph structure.
CREATE
Create a Node
CREATE (n:Person {name: "Alice", age: 30})
Create a node with multiple labels:
CREATE (n:Person:Employee {name: "Alice"})
Create an Edge
Create an edge between existing nodes:
MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
CREATE (a)-[:KNOWS]->(b)
Create Nodes and Edges Together
CREATE (a:Person {name: "Alice"})-[:KNOWS]->(b:Person {name: "Bob"})
SET
Set Properties
MATCH (n:Person {name: "Alice"})
SET n.age = 31
Set multiple properties:
MATCH (n:Person {name: "Alice"})
SET n.age = 31, n.city = "NYC"
Set Labels
Add labels to a node:
MATCH (n:Person {name: "Alice"})
SET n:Admin:Verified
Remove a Property via SET
Setting a property to NULL removes it:
MATCH (n:Person {name: "Alice"})
SET n.city = NULL
DELETE
Delete a Node
Delete a node (fails if the node has edges):
MATCH (n:Person {name: "Charlie"})
DELETE n
DETACH DELETE
Delete a node and all its edges:
MATCH (n:Person {name: "Charlie"})
DETACH DELETE n
REMOVE
Remove a Property
MATCH (n:Person {name: "Alice"})
REMOVE n.city
Remove a Label
MATCH (n:Person {name: "Alice"})
REMOVE n:Verified
MERGE
MERGE matches an existing pattern or creates it if it doesn't exist. It's an idempotent "get or create" operation.
Basic Usage
MERGE (n:Person {name: "Alice"})
RETURN n
If a Person node with name: "Alice" exists, it is returned. Otherwise, one is created.
MERGE with Properties
MERGE (n:Person {name: "Alice"})
SET n.last_seen = "2024-01-15"
RETURN n
The SET clause runs whether the node was matched or created.
MERGE Edges
MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
MERGE (a)-[:KNOWS]->(b)
This creates the KNOWS edge only if it doesn't already exist.
Use Cases
MERGE is useful when:
- Importing data that may have duplicates
- Ensuring a relationship exists without creating duplicates
- Building graphs incrementally from multiple data sources
WITH and Chaining
WITH pipes the output of one query part into the next. It acts as a boundary between query segments, allowing you to transform, filter, or aggregate intermediate results.
Basic Chaining
MATCH (p:Person)
WITH p.name AS name, p.age AS age
WHERE age > 30
RETURN name
Aggregation then Filtering
WITH is essential for filtering on aggregated values, since WHERE cannot reference aggregation results directly:
MATCH (doc:Document)-[:AUTHORED_BY]->(p:Person)
WITH p.name AS author, count(doc) AS papers
WHERE papers > 5
RETURN author, papers
ORDER BY papers DESC
Variable Scoping
Variables from before WITH are only available after it if they are explicitly passed through:
MATCH (a:Person)-[:KNOWS]->(b:Person)
WITH a, count(b) AS friend_count
RETURN a.name, friend_count
In this example, b is not available after the WITH clause — only a and friend_count are.
Multi-Stage Queries
Chain multiple WITH clauses for complex queries:
MATCH (p:Person)-[:AUTHORED_BY]-(doc:Document)
WITH p, count(doc) AS docs
WITH p, docs WHERE docs > 2
RETURN p.name, docs
UNWIND
UNWIND expands a list into individual rows. Each element of the list becomes a separate row in the output.
Basic Usage
UNWIND [1, 2, 3] AS x
RETURN x
Returns three rows: 1, 2, 3.
Creating Multiple Nodes
Use UNWIND with CREATE to create nodes from a list:
UNWIND ["Alice", "Bob", "Charlie"] AS name
CREATE (n:Person {name: name})
With Parameters
Pass a list as a parameter and unwind it:
UNWIND $names AS name
MATCH (n:Person {name: name})
RETURN n
Combining with MATCH
UNWIND $tags AS tag
MATCH (d:Document)
WHERE d.title CONTAINS tag
RETURN tag, d.title
Aggregations
Aggregation functions compute summary values across groups of rows.
Functions
count
Count the number of values:
MATCH (n:Person) RETURN count(n) AS total
Count with grouping:
MATCH (doc:Document)-[:AUTHORED_BY]->(p:Person)
RETURN p.name, count(doc) AS papers
sum
Sum numeric values:
MATCH (o:Order)
RETURN sum(o.amount) AS total_revenue
avg
Average numeric values:
MATCH (n:Person)
RETURN avg(n.age) AS average_age
min / max
Find minimum or maximum values:
MATCH (n:Person)
RETURN min(n.age) AS youngest, max(n.age) AS oldest
collect
Collect values into a list:
MATCH (p:Person)-[:KNOWS]->(f:Person)
RETURN p.name, collect(f.name) AS friends
Grouping
Non-aggregated columns in RETURN define the grouping:
MATCH (doc:Document)-[:AUTHORED_BY]->(p:Person)
RETURN p.name, count(doc) AS papers, collect(doc.title) AS titles
ORDER BY papers DESC
Here p.name is the grouping key, and count(doc) and collect(doc.title) are computed per group.
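The implicit grouping can be mimicked in plain Python to make the semantics concrete (the rows and values here are hypothetical, not a LatticeDB API):

```python
from collections import defaultdict

# Hypothetical matched rows: one per (doc, author) pattern match
rows = [
    {"author": "Alice", "title": "Attention"},
    {"author": "Alice", "title": "Scaling Laws"},
    {"author": "Bob", "title": "Graph Basics"},
]

# The non-aggregated column (author) becomes the grouping key;
# count() and collect() run once per group.
groups = defaultdict(list)
for row in rows:
    groups[row["author"]].append(row["title"])

result = [
    {"name": name, "papers": len(titles), "titles": titles}
    for name, titles in groups.items()
]
```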
Filtering Aggregated Results
Use WITH to filter on aggregated values:
MATCH (p:Person)-[:KNOWS]->(f:Person)
WITH p, count(f) AS friend_count
WHERE friend_count > 5
RETURN p.name, friend_count
Variable-Length Paths
Variable-length path patterns match paths of varying depth in a single query.
Syntax
MATCH (a)-[*min..max]->(b)
- `*1..3` — match paths of length 1 to 3
- `*2` — match paths of exactly length 2
- `*2..` — match paths of length 2 or more
- `*` — match paths of any length (at least 1)
Examples
Fixed Range
-- Friends of friends (exactly 2 hops)
MATCH (a:Person {name: "Alice"})-[:KNOWS*2]->(b:Person)
RETURN b.name
Bounded Range
-- Reachable within 1 to 3 hops
MATCH (a:Person {name: "Alice"})-[:KNOWS*1..3]->(b:Person)
RETURN DISTINCT b.name
Open-Ended
-- All reachable nodes (2+ hops)
MATCH (a:Person {name: "Alice"})-[*2..]->(b)
RETURN b
Any Length
-- All nodes reachable at any depth
MATCH (a:Person {name: "Alice"})-[*]->(b)
RETURN DISTINCT b
Performance
Variable-length paths are implemented using BFS with bitset-based visited tracking. Performance characteristics:
- Short paths (1-3 hops): microseconds
- Deep traversals scale linearly with the number of reachable nodes
- `DISTINCT` is recommended to avoid duplicate results from multiple paths to the same node
At 10K nodes with 50K edges, a variable path *1..5 completes in ~82 us. See Benchmarks for detailed numbers.
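Conceptually, the traversal behaves like a breadth-first search over an adjacency map with a visited set. This plain-Python sketch computes the distinct set of nodes reachable within a hop range; it illustrates the semantics, not the engine's implementation:

```python
from collections import deque

def reachable(adj, start, min_hops=1, max_hops=3):
    """Distinct nodes reachable from `start` within [min_hops, max_hops] hops."""
    results = set()
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # don't expand beyond the hop limit
        for neighbor in adj.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                if depth + 1 >= min_hops:
                    results.add(neighbor)
                queue.append((neighbor, depth + 1))
    return results
```

Because of the visited set, each node is counted once at its shortest depth, which is why results here are inherently distinct.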
Vector Search (<=>)
The <=> operator computes the distance between a stored vector embedding and a query vector. It integrates HNSW vector search into Cypher queries.
Basic Usage
MATCH (chunk:Chunk)
WHERE chunk.embedding <=> $query < 0.5
RETURN chunk.text
ORDER BY chunk.embedding <=> $query
LIMIT 10
This finds chunks whose embedding is within distance 0.5 of the query vector, sorted by proximity.
How It Works
When the query planner encounters <=> in a WHERE clause, it converts the pattern into a specialized HNSW search operator rather than scanning all nodes. The distance threshold (e.g., < 0.5) filters results after the search.
Distance Metric
LatticeDB uses cosine distance (1 - cosine_similarity). Values range from 0 (identical) to 2 (opposite).
| Distance | Meaning |
|---|---|
| 0.0 | Identical vectors |
| 0.1-0.3 | Very similar |
| 0.3-0.5 | Moderately similar |
| 0.5-1.0 | Dissimilar |
| 1.0-2.0 | Very dissimilar to opposite |
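The metric is simple to verify by hand. A plain-Python reference (not how the engine computes distances internally):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```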
Ordering by Distance
Use ORDER BY to sort results by similarity:
MATCH (n:Document)
WHERE n.embedding <=> $query < 1.0
RETURN n.title
ORDER BY n.embedding <=> $query
LIMIT 10
Combining with Graph Traversal
The real power is combining vector search with graph patterns:
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
WHERE chunk.embedding <=> $query < 0.5
RETURN doc.title, chunk.text, author.name
ORDER BY chunk.embedding <=> $query
LIMIT 10
This finds similar chunks, then traverses to their documents and authors — all in one query.
Combining with Full-Text Search
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)
WHERE chunk.embedding <=> $query < 0.5
AND doc.content @@ "neural networks"
RETURN doc.title, chunk.text
ORDER BY chunk.embedding <=> $query
Parameter Binding
Pass the query vector as a parameter:
Python:
results = db.query(
    "MATCH (n) WHERE n.embedding <=> $q < 0.5 RETURN n",
    parameters={"q": query_vector}
)
TypeScript:
const results = await db.query(
  "MATCH (n) WHERE n.embedding <=> $q < 0.5 RETURN n",
  { q: new Float32Array([0.1, 0.2, ...]) }
);
Full-Text Search (@@)
The @@ operator performs BM25-scored full-text search on indexed text content.
Basic Usage
MATCH (d:Document)
WHERE d.content @@ "neural networks"
RETURN d.title
This searches all nodes that have had their text indexed with fts_index() and returns those matching the search terms.
How It Works
When the query planner encounters @@, it uses the inverted index to find matching nodes directly — no full scan needed. Results are scored using BM25, which considers term frequency, inverse document frequency, and document length.
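The scoring can be sketched in a few lines using the standard BM25 formula with the common defaults k1=1.2 and b=0.75; LatticeDB's exact constants are internal and may differ:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score a single query term for one document (standard BM25 formula)."""
    # Inverse document frequency: rarer terms weigh more
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency, saturated by k1 and normalized by document length via b
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part
```

A document's total score is the sum of its per-term scores across the query terms.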
String Queries
The search query is a space-separated list of terms. All terms must match (implicit AND):
-- Both "neural" and "networks" must appear
MATCH (n) WHERE n.text @@ "neural networks" RETURN n
Using Parameters
MATCH (d:Document)
WHERE d.content @@ $search_text
RETURN d.title
Python:
results = db.query(
    'MATCH (d:Document) WHERE d.content @@ $q RETURN d.title',
    parameters={"q": "machine learning"}
)
TypeScript:
const results = await db.query(
  'MATCH (d:Document) WHERE d.content @@ $q RETURN d.title',
  { q: "machine learning" }
);
Combining with Graph Traversal
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
WHERE chunk.text @@ "transformer attention"
RETURN doc.title, author.name
Combining with Vector Search
MATCH (chunk:Chunk)
WHERE chunk.embedding <=> $query < 0.5
AND chunk.text @@ "transformer"
RETURN chunk.text
ORDER BY chunk.embedding <=> $query
Indexing Text
Text must be indexed before it can be searched. Use fts_index() within a write transaction:
Python:
with db.write() as txn:
    txn.fts_index(node.id, "The transformer architecture uses self-attention mechanisms")
    txn.commit()
TypeScript:
await db.write(async (txn) => {
  await txn.ftsIndex(node.id, "The transformer architecture uses self-attention mechanisms");
});
Programmatic Search
In addition to the @@ operator in Cypher, you can use the dedicated search APIs:
Python:
# Exact search
results = db.fts_search("machine learning", limit=10)
# Fuzzy search (typo-tolerant)
results = db.fts_search_fuzzy("machin lerning", limit=10)
TypeScript:
// Exact search
const results = await db.ftsSearch("machine learning", { limit: 10 });
// Fuzzy search (typo-tolerant)
const results = await db.ftsSearchFuzzy("machin lerning", { limit: 10 });
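Fuzzy matching bounds the Levenshtein edit distance between query terms and indexed terms. A reference implementation of the distance that `max_distance` limits:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute)
    needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca with cb
            ))
        prev = curr
    return prev[-1]
```

With the default `max_distance=2`, "machin" matches "machine" (distance 1) but a term three edits away would not.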
Parameters
Parameters allow you to pass values into queries safely using the $name syntax. This prevents injection attacks and enables query plan caching.
Syntax
Use $ followed by the parameter name in the query:
MATCH (n:Person) WHERE n.name = $name RETURN n
MATCH (n:Person) WHERE n.age > $min_age RETURN n
Python
# String parameter
result = db.query(
    "MATCH (n:Person) WHERE n.name = $name RETURN n",
    parameters={"name": "Alice"},
)

# Numeric parameter
result = db.query(
    "MATCH (n:Person) WHERE n.age > $min_age RETURN n",
    parameters={"min_age": 25},
)

# Vector parameter
from latticedb import hash_embed
result = db.query(
    "MATCH (n) WHERE n.embedding <=> $vec < 0.5 RETURN n",
    parameters={"vec": hash_embed("query text", dimensions=128)},
)
TypeScript
// String parameter
const result = await db.query(
  "MATCH (n:Person) WHERE n.name = $name RETURN n",
  { name: "Alice" }
);

// Numeric parameter
const result = await db.query(
  "MATCH (n:Person) WHERE n.age > $min_age RETURN n",
  { min_age: 25 }
);

// Vector parameter
import { hashEmbed } from "@hajewski/latticedb";
const result = await db.query(
  "MATCH (n) WHERE n.embedding <=> $vec < 0.5 RETURN n",
  { vec: hashEmbed("query text", 128) }
);
C
lattice_query* query;
lattice_query_prepare(db, "MATCH (n) WHERE n.name = $name RETURN n", &query);
// Bind string parameter
lattice_value val = {
    .type = LATTICE_VALUE_STRING,
    .data.string_val = { "Alice", 5 }
};
lattice_query_bind(query, "name", &val);
// Bind vector parameter
float vec[128] = { /* ... */ };
lattice_query_bind_vector(query, "embedding", vec, 128);
Supported Parameter Types
| Type | Python | TypeScript | C |
|---|---|---|---|
| String | str | string | LATTICE_VALUE_STRING |
| Integer | int | number | LATTICE_VALUE_INT |
| Float | float | number | LATTICE_VALUE_FLOAT |
| Boolean | bool | boolean | LATTICE_VALUE_BOOL |
| Null | None | null | LATTICE_VALUE_NULL |
| Vector | numpy.ndarray | Float32Array | float* (via lattice_query_bind_vector) |
Query Caching
Parameterized queries are cached by their query text. The same query with different parameter values reuses the cached plan, improving performance for repeated queries.
# These share the same cached plan:
db.query("MATCH (n) WHERE n.name = $name RETURN n", parameters={"name": "Alice"})
db.query("MATCH (n) WHERE n.name = $name RETURN n", parameters={"name": "Bob"})
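The caching behavior can be pictured as a dictionary keyed by the exact query string (a conceptual sketch, not the engine's cache):

```python
class PlanCache:
    """Toy model: plans are keyed by exact query text, so parameter
    values never affect the cache key."""
    def __init__(self):
        self.plans = {}
        self.hits = 0
        self.misses = 0

    def get_plan(self, query_text):
        if query_text in self.plans:
            self.hits += 1
        else:
            self.misses += 1
            self.plans[query_text] = ("compiled-plan-for", query_text)
        return self.plans[query_text]
```

This is also why string-interpolated queries defeat the cache: every distinct value produces a distinct query text and therefore a new plan.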
Functions
LatticeDB supports the following built-in functions in Cypher expressions.
id()
Returns the internal ID of a node:
MATCH (n:Person)
RETURN id(n), n.name
coalesce()
Returns the first non-null argument:
MATCH (n:Person)
RETURN coalesce(n.nickname, n.name) AS display_name
abs()
Returns the absolute value of a number:
MATCH (n:Person)
RETURN n.name, abs(n.score) AS abs_score
size()
Returns the length of a list:
MATCH (p:Person)-[:KNOWS]->(f:Person)
WITH p, collect(f) AS friends
RETURN p.name, size(friends) AS friend_count
toInteger()
Converts a value to an integer:
MATCH (n:Person)
RETURN n.name, toInteger(n.score) AS int_score
toFloat()
Converts a value to a float:
MATCH (n:Person)
RETURN n.name, toFloat(n.age) AS float_age
Type Coercion
Numeric operations automatically promote integers to floats when mixed:
RETURN 42 + 3.14 -- Returns 45.14 (float)
Null propagates through most operations:
RETURN null + 1 -- Returns null
RETURN null = null -- Returns true (special case)
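In the Python binding, Cypher null maps to `None`. The propagation rule for arithmetic can be mirrored with a small helper (illustrative only, not part of the API):

```python
def null_safe_add(a, b):
    """Mimic Cypher arithmetic: any null (None) operand yields null."""
    if a is None or b is None:
        return None
    return a + b
```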
C API
The C API is LatticeDB's primary interface. All language bindings (Python, TypeScript) wrap this API. The header file is include/lattice.h.
Overview
The API uses opaque handle types and follows a consistent pattern:
- Functions return `lattice_error` (0 = success, negative = error)
- Resources are allocated by the library and freed by the caller
- Strings and result sets must be explicitly freed
Types
Handles
typedef struct lattice_database lattice_database;
typedef struct lattice_txn lattice_txn;
typedef struct lattice_query lattice_query;
typedef struct lattice_result lattice_result;
typedef struct lattice_vector_result lattice_vector_result;
typedef struct lattice_fts_result lattice_fts_result;
typedef struct lattice_edge_result lattice_edge_result;
IDs
typedef uint64_t lattice_node_id;
typedef uint64_t lattice_edge_id;
Error Codes
LATTICE_OK // 0 - Success
LATTICE_ERROR // -1 - Generic error
LATTICE_ERROR_IO // -2 - I/O error
LATTICE_ERROR_CORRUPTION // -3 - Data corruption detected
LATTICE_ERROR_NOT_FOUND // -4 - Resource not found
LATTICE_ERROR_ALREADY_EXISTS // -5 - Resource already exists
LATTICE_ERROR_INVALID_ARG // -6 - Invalid argument
LATTICE_ERROR_TXN_ABORTED // -7 - Transaction aborted
LATTICE_ERROR_LOCK_TIMEOUT // -8 - Lock timeout
LATTICE_ERROR_READ_ONLY // -9 - Write attempted on read-only txn
LATTICE_ERROR_FULL // -10 - Database full
LATTICE_ERROR_VERSION_MISMATCH // -11 - Version mismatch
LATTICE_ERROR_CHECKSUM // -12 - Checksum verification failed
LATTICE_ERROR_OUT_OF_MEMORY // -13 - Out of memory
Value Types
typedef enum {
    LATTICE_VALUE_NULL = 0,
    LATTICE_VALUE_BOOL = 1,
    LATTICE_VALUE_INT = 2,
    LATTICE_VALUE_FLOAT = 3,
    LATTICE_VALUE_STRING = 4,
    LATTICE_VALUE_BYTES = 5,
    LATTICE_VALUE_VECTOR = 6,
    LATTICE_VALUE_LIST = 7,
    LATTICE_VALUE_MAP = 8
} lattice_value_type;
Property Value
typedef struct {
    lattice_value_type type;
    union {
        bool bool_val;
        int64_t int_val;
        double float_val;
        struct { const char* ptr; size_t len; } string_val;
        struct { const uint8_t* ptr; size_t len; } bytes_val;
        struct { const float* ptr; uint32_t dimensions; } vector_val;
    } data;
} lattice_value;
Database Operations
Open
lattice_open_options opts = LATTICE_OPEN_OPTIONS_DEFAULT;
opts.create = true;
opts.enable_vector = true;
opts.vector_dimensions = 128;
lattice_database* db;
lattice_error err = lattice_open("mydb.ltdb", &opts, &db);
Close
lattice_close(db);
Open Options
typedef struct {
    bool create;                // Create if not exists
    bool read_only;             // Open in read-only mode
    uint32_t cache_size_mb;     // Cache size in MB (default: 100)
    uint32_t page_size;         // Page size in bytes (default: 4096)
    bool enable_vector;         // Enable vector storage
    uint16_t vector_dimensions; // Vector dimensions (default: 128)
} lattice_open_options;
Transaction Operations
// Begin a transaction
lattice_txn* txn;
lattice_begin(db, LATTICE_TXN_READ_WRITE, &txn);
// ... do work ...
// Commit or rollback
lattice_commit(txn);
// or: lattice_rollback(txn);
Transaction modes:
- `LATTICE_TXN_READ_ONLY` — read-only, can run concurrently
- `LATTICE_TXN_READ_WRITE` — read-write, serialized
Node Operations
Create a Node
lattice_node_id node_id;
lattice_node_create(txn, "Person", &node_id);
Set / Get Properties
// Set a string property
lattice_value val = {
    .type = LATTICE_VALUE_STRING,
    .data.string_val = { "Alice", 5 }
};
lattice_node_set_property(txn, node_id, "name", &val);
// Get a property
lattice_value out;
lattice_node_get_property(txn, node_id, "name", &out);
Check Existence
bool exists;
lattice_node_exists(txn, node_id, &exists);
Delete a Node
lattice_node_delete(txn, node_id);
Get Labels
char* labels;
lattice_node_get_labels(txn, node_id, &labels);
// labels is a comma-separated string, e.g. "Person,Employee"
// ...
lattice_free_string(labels);
Set a Vector
float vector[128] = { /* ... */ };
lattice_node_set_vector(txn, node_id, "embedding", vector, 128);
Edge Operations
Create / Delete
lattice_edge_id edge_id;
lattice_edge_create(txn, source_id, target_id, "KNOWS", &edge_id);
lattice_edge_delete(txn, source_id, target_id, "KNOWS");
Traverse Edges
lattice_edge_result* edges;
lattice_edge_get_outgoing(txn, node_id, &edges);
uint32_t count = lattice_edge_result_count(edges);
for (uint32_t i = 0; i < count; i++) {
    lattice_node_id source, target;
    const char* type;
    uint32_t type_len;
    lattice_edge_result_get(edges, i, &source, &target, &type, &type_len);
}
lattice_edge_result_free(edges);
Batch Insert
Insert many nodes with vectors in a single call:
lattice_node_with_vector nodes[1000];
for (int i = 0; i < 1000; i++) {
    nodes[i].label = "Document";
    nodes[i].vector = vectors[i]; // float[128]
    nodes[i].dimensions = 128;
}
lattice_node_id ids[1000];
uint32_t count;
lattice_batch_insert(txn, nodes, 1000, ids, &count);
Vector Search
float query[128] = { /* ... */ };
lattice_vector_result* results;
lattice_vector_search(db, query, 128, 10, 64, &results);
uint32_t count = lattice_vector_result_count(results);
for (uint32_t i = 0; i < count; i++) {
    lattice_node_id node_id;
    float distance;
    lattice_vector_result_get(results, i, &node_id, &distance);
    printf("Node %llu: distance=%.4f\n", node_id, distance);
}
lattice_vector_result_free(results);
Parameters:
- `k` — number of nearest neighbors to return
- `ef_search` — HNSW search parameter (0 = default 64). Higher values improve recall at the cost of latency.
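A brute-force scan is the exact baseline that HNSW approximates; raising `ef_search` moves results closer to what this plain-Python reference would return:

```python
import math

def brute_force_knn(query, vectors, k):
    """Exact k-NN by cosine distance over (node_id, vector) pairs;
    the 100%-recall baseline that HNSW approximates."""
    def dist(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)
    # Sort every stored vector by distance to the query, keep the k closest
    scored = sorted(((dist(query, vec), node_id) for node_id, vec in vectors))
    return [(node_id, d) for d, node_id in scored[:k]]
```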
Full-Text Search
Index a Document
const char* text = "The quick brown fox jumps over the lazy dog";
lattice_fts_index(txn, node_id, text, strlen(text));
Search
lattice_fts_result* results;
lattice_fts_search(db, "quick fox", 9, 10, &results);
uint32_t count = lattice_fts_result_count(results);
for (uint32_t i = 0; i < count; i++) {
    lattice_node_id node_id;
    float score;
    lattice_fts_result_get(results, i, &node_id, &score);
    printf("Node %llu: score=%.4f\n", node_id, score);
}
lattice_fts_result_free(results);
Fuzzy Search
lattice_fts_result* results;
lattice_fts_search_fuzzy(db, "quik fox", 8, 10, 2, 4, &results);
// max_distance=2, min_term_length=4
Embeddings
Hash Embeddings (Built-in)
float* vector;
uint32_t dims;
lattice_hash_embed("hello world", 11, 128, &vector, &dims);
// Use vector...
lattice_hash_embed_free(vector, dims);
HTTP Embedding Client
lattice_embedding_config config = {
    .endpoint = "http://localhost:11434",
    .model = NULL,       // use default
    .api_format = LATTICE_EMBEDDING_OLLAMA,
    .api_key = NULL,
    .timeout_ms = 0      // default 30s
};
lattice_embedding_client* client;
lattice_embedding_client_create(&config, &client);
float* vector;
uint32_t dims;
lattice_embedding_client_embed(client, "hello world", 11, &vector, &dims);
// Use vector...
lattice_hash_embed_free(vector, dims);
lattice_embedding_client_free(client);
Query Operations
Queries use a prepare/bind/execute pattern:
// 1. Prepare
lattice_query* query;
lattice_query_prepare(db, "MATCH (n) WHERE n.name = $name RETURN n", &query);
// 2. Bind parameters
lattice_value val = {
    .type = LATTICE_VALUE_STRING,
    .data.string_val = { "Alice", 5 }
};
lattice_query_bind(query, "name", &val);
// For vector parameters:
float vec[128] = { /* ... */ };
lattice_query_bind_vector(query, "embedding", vec, 128);
// 3. Execute
lattice_txn* txn;
lattice_begin(db, LATTICE_TXN_READ_ONLY, &txn);
lattice_result* result;
lattice_query_execute(query, txn, &result);
// 4. Iterate results
while (lattice_result_next(result)) {
    uint32_t cols = lattice_result_column_count(result);
    for (uint32_t i = 0; i < cols; i++) {
        const char* name = lattice_result_column_name(result, i);
        lattice_value val;
        lattice_result_get(result, i, &val);
        // Process val...
    }
}
// 5. Cleanup
lattice_result_free(result);
lattice_commit(txn);
lattice_query_free(query);
Query Cache
// Clear cache
lattice_query_cache_clear(db);
// Get statistics
uint32_t entries;
uint64_t hits, misses;
lattice_query_cache_stats(db, &entries, &hits, &misses);
Utilities
// Get version string
const char* version = lattice_version(); // e.g. "0.1.0"
// Get error message
const char* msg = lattice_error_message(LATTICE_ERROR_NOT_FOUND);
Python API
Python bindings for LatticeDB, an embedded knowledge graph database for AI/RAG applications.
Installation
pip install latticedb
The native shared library (liblattice.dylib / liblattice.so) must be available on the system. Install it via the install script or build from source with zig build shared.
Quick Start
import numpy as np
from latticedb import Database

with Database("knowledge.db", create=True, enable_vector=True, vector_dimensions=4) as db:
    # Create nodes, edges, and index content
    with db.write() as txn:
        alice = txn.create_node(
            labels=["Person"],
            properties={"name": "Alice", "age": 30},
        )
        bob = txn.create_node(
            labels=["Person"],
            properties={"name": "Bob", "age": 25},
        )
        txn.create_edge(alice.id, bob.id, "KNOWS")

        # Index text for full-text search
        txn.fts_index(alice.id, "Alice works on machine learning research")
        txn.fts_index(bob.id, "Bob studies deep learning and neural networks")

        # Store vector embeddings
        txn.set_vector(alice.id, "embedding", np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32))
        txn.set_vector(bob.id, "embedding", np.array([0.0, 1.0, 0.0, 0.0], dtype=np.float32))
        txn.commit()

    # Query with Cypher
    result = db.query("MATCH (n:Person) WHERE n.age > 20 RETURN n.name, n.age")
    for row in result:
        print(row)

    # Vector similarity search
    query_vec = np.array([0.9, 0.1, 0.0, 0.0], dtype=np.float32)
    for r in db.vector_search(query_vec, k=2):
        print(f"Node {r.node_id}: distance={r.distance:.4f}")

    # Full-text search
    for r in db.fts_search("machine learning"):
        print(f"Node {r.node_id}: score={r.score:.4f}")

    # Fuzzy search (typo-tolerant)
    for r in db.fts_search_fuzzy("machin lerning"):
        print(f"Node {r.node_id}: score={r.score:.4f}")
API Reference
Database
Database(
    path: str | Path,
    *,
    create: bool = False,         # Create if doesn't exist
    read_only: bool = False,      # Open in read-only mode
    cache_size_mb: int = 100,     # Page cache size
    enable_vector: bool = False,  # Enable vector storage
    vector_dimensions: int = 128, # Vector dimensions
)
Methods
- `open()` / `close()` - Open/close the database (also works as context manager)
- `read()` - Start a read-only transaction (context manager)
- `write()` - Start a read-write transaction (context manager)
- `query(cypher, parameters=None)` - Execute a Cypher query
- `vector_search(vector, k=10, ef_search=64)` - k-NN vector search
- `fts_search(query, limit=10)` - Full-text search
- `fts_search_fuzzy(query, limit=10, max_distance=0, min_term_length=0)` - Fuzzy full-text search
- `cache_clear()` - Clear the query cache
- `cache_stats()` - Get cache hit/miss statistics
Transaction
Read Operations
- `get_node(node_id)` - Get a node by ID, returns `Node` or `None`
- `node_exists(node_id)` - Check if a node exists
- `get_property(node_id, key)` - Get a property value
- `get_outgoing_edges(node_id)` - Get outgoing edges from a node
- `get_incoming_edges(node_id)` - Get incoming edges to a node
- `is_read_only` / `is_active` - Transaction state
Write Operations
- `create_node(labels=[], properties=None)` - Create a node
- `delete_node(node_id)` - Delete a node
- `set_property(node_id, key, value)` - Set a property on a node
- `set_vector(node_id, key, vector)` - Set a vector embedding
- `batch_insert(label, vectors)` - Batch insert nodes with vectors
- `fts_index(node_id, text)` - Index text for full-text search
- `create_edge(source_id, target_id, edge_type)` - Create an edge
- `delete_edge(source_id, target_id, edge_type)` - Delete an edge
- `commit()` / `rollback()` - Commit or rollback the transaction
Batch Insert
Insert many nodes with vectors in a single efficient call:
import numpy as np
from latticedb import Database

with Database("vectors.db", create=True, enable_vector=True, vector_dimensions=128) as db:
    with db.write() as txn:
        vectors = np.random.rand(1000, 128).astype(np.float32)
        node_ids = txn.batch_insert("Document", vectors)
        print(f"Created {len(node_ids)} nodes")
        txn.commit()
Full-Text Search
Exact Search
results = db.fts_search("machine learning", limit=10)
for r in results:
    print(f"Node {r.node_id}: score={r.score:.4f}")
Fuzzy Search (Typo-Tolerant)
# Finds "machine learning" even with typos
results = db.fts_search_fuzzy("machne lerning", limit=10)

# Control fuzzy matching sensitivity
results = db.fts_search_fuzzy(
    "machne",
    limit=10,
    max_distance=2,     # Max edit distance (default: 2)
    min_term_length=4,  # Min term length for fuzzy matching (default: 4)
)
Embeddings
LatticeDB includes a built-in hash embedding function and an HTTP client for external embedding services.
Hash Embeddings (Built-in)
Deterministic, no external service needed. Useful for testing or simple keyword-based similarity:
from latticedb import hash_embed
vec = hash_embed("hello world", dimensions=128)
print(vec.shape) # (128,)
HTTP Embedding Client
Connect to Ollama, OpenAI, or compatible APIs:
from latticedb import EmbeddingClient, EmbeddingApiFormat

# Ollama (default)
with EmbeddingClient("http://localhost:11434") as client:
    vec = client.embed("hello world")

# OpenAI-compatible API
with EmbeddingClient(
    "https://api.openai.com/v1",
    model="text-embedding-3-small",
    api_format=EmbeddingApiFormat.OPENAI,
    api_key="sk-...",
) as client:
    vec = client.embed("hello world")
Edge Traversal
with db.read() as txn:
    outgoing = txn.get_outgoing_edges(node_id)
    for edge in outgoing:
        print(f"{edge.source_id} --[{edge.edge_type}]--> {edge.target_id}")

    incoming = txn.get_incoming_edges(node_id)
    for edge in incoming:
        print(f"{edge.source_id} --[{edge.edge_type}]--> {edge.target_id}")
Cypher Queries
# Pattern matching
result = db.query("MATCH (n:Person) RETURN n.name")

# With parameters
result = db.query(
    "MATCH (n:Person) WHERE n.name = $name RETURN n",
    parameters={"name": "Alice"},
)

# Vector similarity in Cypher
result = db.query(
    "MATCH (n:Document) WHERE n.embedding <=> $vec < 0.5 RETURN n.title",
    parameters={"vec": query_vector},
)

# Full-text search in Cypher
result = db.query(
    'MATCH (n:Document) WHERE n.content @@ "machine learning" RETURN n.title'
)
# Data mutation
db.query("CREATE (n:Person {name: 'Charlie', age: 35})")
db.query("MATCH (n:Person {name: 'Charlie'}) SET n.age = 36")
db.query("MATCH (n:Person {name: 'Charlie'}) DETACH DELETE n")
Query Cache
# Get cache statistics
stats = db.cache_stats()
print(f"Entries: {stats['entries']}, Hits: {stats['hits']}, Misses: {stats['misses']}")
# Clear the cache
db.cache_clear()
Supported Property Types
- `None` - Null value
- `bool` - Boolean
- `int` - 64-bit integer
- `float` - 64-bit float
- `str` - UTF-8 string
- `bytes` - Binary data
Error Handling
from latticedb import LatticeError, LatticeNotFoundError, LatticeIOError

try:
    with Database("nonexistent.db") as db:
        pass
except LatticeNotFoundError:
    print("Database not found")
except LatticeIOError:
    print("I/O error")
except LatticeError as e:
    print(f"Error: {e}")
Requirements
- Python 3.9+
- NumPy (for vector operations)
- The native LatticeDB library (`liblattice.dylib` / `liblattice.so`)
TypeScript / Node.js API
TypeScript/Node.js bindings for LatticeDB, an embedded knowledge graph database for AI/RAG applications.
Installation
npm install @hajewski/latticedb
The native shared library (liblattice.dylib / liblattice.so) must be available on the system. Install it via the install script or build from source with zig build shared.
Quick Start
import { Database } from "@hajewski/latticedb";
const db = new Database("knowledge.db", {
  create: true,
  enableVector: true,
  vectorDimensions: 4,
});
await db.open();
// Create nodes, edges, and index content
await db.write(async (txn) => {
  const alice = await txn.createNode({
    labels: ["Person"],
    properties: { name: "Alice", age: 30 },
  });
  const bob = await txn.createNode({
    labels: ["Person"],
    properties: { name: "Bob", age: 25 },
  });
  await txn.createEdge(alice.id, bob.id, "KNOWS");

  // Index text for full-text search
  await txn.ftsIndex(alice.id, "Alice works on machine learning research");
  await txn.ftsIndex(bob.id, "Bob studies deep learning and neural networks");

  // Store vector embeddings
  await txn.setVector(alice.id, "embedding", new Float32Array([1.0, 0.0, 0.0, 0.0]));
  await txn.setVector(bob.id, "embedding", new Float32Array([0.0, 1.0, 0.0, 0.0]));
});
// Query with Cypher
const result = await db.query(
  "MATCH (n:Person) WHERE n.age > 20 RETURN n.name, n.age"
);
for (const row of result.rows) {
  console.log(row);
}

// Vector similarity search
const results = await db.vectorSearch(new Float32Array([0.9, 0.1, 0.0, 0.0]), {
  k: 2,
});
for (const r of results) {
  console.log(`Node ${r.nodeId}: distance=${r.distance.toFixed(4)}`);
}

// Full-text search
const ftsResults = await db.ftsSearch("machine learning");
for (const r of ftsResults) {
  console.log(`Node ${r.nodeId}: score=${r.score.toFixed(4)}`);
}

// Fuzzy search (typo-tolerant)
const fuzzyResults = await db.ftsSearchFuzzy("machin lerning");
for (const r of fuzzyResults) {
  console.log(`Node ${r.nodeId}: score=${r.score.toFixed(4)}`);
}
await db.close();
API Reference
Database
const db = new Database(path: string, options?: DatabaseOptions);
interface DatabaseOptions {
  create?: boolean;          // Create if not exists (default: false)
  readOnly?: boolean;        // Open read-only (default: false)
  cacheSizeMb?: number;      // Cache size in MB (default: 100)
  enableVector?: boolean;    // Enable vector storage (default: false)
  vectorDimensions?: number; // Vector dimensions (default: 128)
}
Methods
- `await db.open()` - Open the database connection
- `await db.close()` - Close the database connection
- `await db.read(fn)` - Execute a read-only transaction
- `await db.write(fn)` - Execute a read-write transaction
- `await db.query(cypher, params?)` - Execute a Cypher query
- `await db.vectorSearch(vector, options?)` - k-NN vector search
- `await db.ftsSearch(query, options?)` - Full-text search
- `await db.ftsSearchFuzzy(query, options?)` - Fuzzy full-text search
- `await db.cacheClear()` - Clear the query cache
- `await db.cacheStats()` - Get cache hit/miss statistics
- `db.isOpen()` - Check if the database is open
- `db.getPath()` - Get the database file path
Transaction
Read Operations
- `await txn.getNode(nodeId)` - Get a node by ID, returns `Node` or `null`
- `await txn.nodeExists(nodeId)` - Check if a node exists
- `await txn.getProperty(nodeId, key)` - Get a property value
- `await txn.getOutgoingEdges(nodeId)` - Get outgoing edges from a node
- `await txn.getIncomingEdges(nodeId)` - Get incoming edges to a node
- `txn.isReadOnly()` / `txn.isActive()` - Transaction state
Write Operations
- `await txn.createNode({ labels, properties })` - Create a node
- `await txn.deleteNode(nodeId)` - Delete a node
- `await txn.setProperty(nodeId, key, value)` - Set a property
- `await txn.setVector(nodeId, key, vector)` - Set a vector embedding
- `await txn.batchInsert(label, vectors)` - Batch insert nodes with vectors
- `await txn.ftsIndex(nodeId, text)` - Index text for full-text search
- `await txn.createEdge(sourceId, targetId, edgeType)` - Create an edge
- `await txn.deleteEdge(sourceId, targetId, edgeType)` - Delete an edge
- `txn.commit()` / `txn.rollback()` - Commit or roll back
Batch Insert
Insert many nodes with vectors in a single efficient call:
import { Database } from "@hajewski/latticedb";
const db = new Database("vectors.db", {
  create: true,
  enableVector: true,
  vectorDimensions: 128,
});
await db.open();

await db.write(async (txn) => {
  const vectors = Array.from({ length: 1000 }, () =>
    Float32Array.from({ length: 128 }, () => Math.random())
  );
  const nodeIds = await txn.batchInsert("Document", vectors);
  console.log(`Created ${nodeIds.length} nodes`);
});

await db.close();
Full-Text Search
Exact Search
const results = await db.ftsSearch("machine learning", { limit: 10 });
for (const r of results) {
  console.log(`Node ${r.nodeId}: score=${r.score.toFixed(4)}`);
}
Fuzzy Search (Typo-Tolerant)
// Finds "machine learning" even with typos
const results = await db.ftsSearchFuzzy("machne lerning", { limit: 10 });

// Control fuzzy matching sensitivity
const precise = await db.ftsSearchFuzzy("machne", {
  limit: 10,
  maxDistance: 2,   // Max edit distance (default: 0 = auto)
  minTermLength: 4, // Min term length for fuzzy matching (default: 0 = auto)
});
Embeddings
LatticeDB includes a built-in hash embedding function and an HTTP client for external embedding services.
Hash Embeddings (Built-in)
Deterministic, no external service needed. Useful for testing or simple keyword-based similarity:
import { hashEmbed } from "@hajewski/latticedb";
const vec = hashEmbed("hello world", 128);
console.log(vec.length); // 128
HTTP Embedding Client
Connect to Ollama, OpenAI, or compatible APIs:
import { EmbeddingClient, EmbeddingApiFormat } from "@hajewski/latticedb";
// Ollama (default)
const client = new EmbeddingClient({
  endpoint: "http://localhost:11434",
});
const vec = client.embed("hello world");
client.close();

// OpenAI-compatible API
const openaiClient = new EmbeddingClient({
  endpoint: "https://api.openai.com/v1",
  model: "text-embedding-3-small",
  apiFormat: EmbeddingApiFormat.OpenAI,
  apiKey: "sk-...",
});
const embedding = openaiClient.embed("hello world");
openaiClient.close();
Edge Traversal
await db.read(async (txn) => {
  const outgoing = await txn.getOutgoingEdges(nodeId);
  for (const edge of outgoing) {
    console.log(`${edge.sourceId} --[${edge.type}]--> ${edge.targetId}`);
  }

  const incoming = await txn.getIncomingEdges(nodeId);
  for (const edge of incoming) {
    console.log(`${edge.sourceId} --[${edge.type}]--> ${edge.targetId}`);
  }
});
Cypher Queries
// Pattern matching
const names = await db.query("MATCH (n:Person) RETURN n.name");

// With parameters
const byName = await db.query(
  "MATCH (n:Person) WHERE n.name = $name RETURN n",
  { name: "Alice" }
);

// Vector similarity in Cypher
const byVector = await db.query(
  "MATCH (n:Document) WHERE n.embedding <=> $vec < 0.5 RETURN n.title",
  { vec: new Float32Array([0.1, 0.2, 0.3, 0.4]) }
);

// Full-text search in Cypher
const byText = await db.query(
  'MATCH (n:Document) WHERE n.content @@ "machine learning" RETURN n.title'
);
// Data mutation
await db.query('CREATE (n:Person {name: "Charlie", age: 35})');
await db.query('MATCH (n:Person {name: "Charlie"}) SET n.age = 36');
await db.query('MATCH (n:Person {name: "Charlie"}) DETACH DELETE n');
Query Cache
// Get cache statistics
const stats = await db.cacheStats();
console.log(
`Entries: ${stats.entries}, Hits: ${stats.hits}, Misses: ${stats.misses}`
);
// Clear the cache
await db.cacheClear();
Supported Property Types
- `null` - Null value
- `boolean` - Boolean
- `number` - Integer or float
- `string` - UTF-8 string
- `Uint8Array` - Binary data
Error Handling
import { Database, isLibraryAvailable } from "@hajewski/latticedb";
// Check if native library is available
if (!isLibraryAvailable()) {
  console.error("LatticeDB native library not found");
  process.exit(1);
}

try {
  const db = new Database("test.db", { create: true });
  await db.open();
  // ...
  await db.close();
} catch (error) {
  console.error("Database error:", error);
}
Building from Source
Requires Node.js 18+ and the LatticeDB native library.
# From the latticedb root directory
zig build shared
# Build the TypeScript bindings
cd bindings/typescript
npm install
npm run build
# Run tests
npm test
Requirements
- Node.js 18+
- The native LatticeDB library (`liblattice.dylib` / `liblattice.so`)
Architecture Overview
This section explains how LatticeDB's storage engine works, from the ground up. Each chapter builds on the previous, showing how simple primitives combine to create a durable, transactional database.
The Stack
┌─────────────────────────────────────────────────────────────┐
│ Application │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Transaction Manager │
│ (ACID guarantees, isolation) │
└─────────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌───────────────────┐ ┌─────────────┐ ┌─────────────────────┐
│ B+Tree │ │ WAL │ │ Checkpointer │
│ (ordered data) │ │ (durability)│ │ (recovery bounds) │
└───────────────────┘ └─────────────┘ └─────────────────────┘
│ │ │
└───────────────┼───────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Buffer Pool │
│ (page caching in RAM) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Page Manager │
│ (page allocation, checksums) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Virtual File System │
│ (file I/O) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Operating System │
└─────────────────────────────────────────────────────────────┘
Chapters
Storage Engine
- Virtual File System - Abstracting file I/O for portability and testing
- Page Manager - Fixed-size pages, allocation, and checksums
- Buffer Pool - Caching pages in memory with eviction
- B+Tree - Ordered key-value storage with efficient lookups
Durability & Transactions
- Write-Ahead Log - Durability through logging before data changes
- Transaction Manager - ACID transactions with begin/commit/abort
- Checkpointing - Bounding recovery time by flushing dirty pages
- Recovery - ARIES-style crash recovery with redo/undo
Data Model
- Graph Storage - Nodes, edges, labels, and properties
Search Indexes
- Vector Search - HNSW approximate nearest neighbor search
- Full-Text Search - BM25-scored inverted index with tokenization
Query System
- Query Execution - Volcano iterator model, operators, and planning
Design Principles
Direct Page Manipulation
We don't deserialize pages into objects. Instead, we read/write bytes directly in page buffers. This is "zero-copy" - no intermediate representations, no serialization overhead.
Traditional approach: Our approach:
Page bytes ──► Object ──► Page bytes ──► Direct access
in memory via offsets
│ │
▼ ▼
Modify Modify bytes
│ │
▼ ▼
Object ──► Serialize ──► Page Already in page buffer!
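The contrast is easy to sketch in a few lines of Python (a toy model with a hypothetical field layout, not LatticeDB's actual header format): fields are read and written at fixed offsets in the raw page buffer, so the buffer is always exactly the bytes that go to disk.

```python
import struct

# A 4 KB page buffer; hypothetical layout: a u32 checksum at offset 0,
# a u8 page type at offset 4.
page = bytearray(4096)
struct.pack_into("<I", page, 0, 0xDEADBEEF)  # checksum field
struct.pack_into("<B", page, 4, 2)           # page type

# "Zero-copy" style: read fields straight out of the buffer by offset.
# No intermediate object is ever constructed.
(checksum,) = struct.unpack_from("<I", page, 0)
(page_type,) = struct.unpack_from("<B", page, 4)
assert checksum == 0xDEADBEEF and page_type == 2

# Modifying is just writing bytes back in place - the buffer is already
# in the form that goes to disk, so there is nothing to serialize.
struct.pack_into("<B", page, 4, 1)
```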
Everything is Pages
The entire database is built on fixed-size pages (4KB by default):
- B+Tree nodes are pages
- WAL frames are pages
- The file header is a page
- Free list entries are pages
This uniformity simplifies the system - one caching layer, one I/O path, one checksum format.
Durability Through WAL
Changes are logged before being applied. This means:
- Commit = log is on disk (fast, sequential write)
- Actual data pages can be written lazily
- Crash recovery replays the log
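The log-then-apply idea fits in a toy Python model (illustrative only - not LatticeDB's on-disk format): a change is committed once its log record is durable, and replaying the log after a crash restores any page writes that never happened.

```python
import json

wal = []          # stands in for the sequential log file
pages = {0: b""}  # stands in for lazily-written data pages

def commit(page_id, new_bytes):
    # 1. Log the change first (in a real engine: append + fsync = commit)
    wal.append(json.dumps({"page": page_id, "data": new_bytes.hex()}))
    # 2. The data page may be written later - or not at all before a crash

def recover():
    # Replay the log to rebuild page state after a crash
    for record in wal:
        r = json.loads(record)
        pages[r["page"]] = bytes.fromhex(r["data"])

commit(0, b"hello")
# Simulate a crash before the data page was written: pages[0] is stale...
assert pages[0] == b""
recover()
# ...but replaying the log brings it back.
assert pages[0] == b"hello"
```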
Simple Concurrency Model
Each page has a reader-writer latch. Multiple readers OR one writer. No complex lock hierarchies - simplicity over maximum concurrency.
Virtual File System (VFS)
What It Is
The Virtual File System is an abstraction layer over file I/O operations. Instead of calling OS functions directly, all file operations go through a VFS interface.
Why We Need It
1. Testability
With a VFS, we can create an in-memory implementation for testing:
Production: Testing:
┌─────────────┐ ┌─────────────┐
│ B+Tree │ │ B+Tree │
└──────┬──────┘ └──────┬──────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ PosixVfs │ │ MemoryVfs │
└──────┬──────┘ └──────┬──────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Disk I/O │ │ HashMap │
└─────────────┘ └─────────────┘
Tests run instantly because there's no actual disk I/O.
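The same idea in a few lines of Python (names are illustrative, not the actual binding API): swap the backing store and nothing above the interface changes.

```python
class MemoryFile:
    """Backs a 'file' with a bytearray - instant, no disk I/O."""
    def __init__(self):
        self.data = bytearray()

    def read(self, offset, n):
        return bytes(self.data[offset:offset + n])

    def write(self, offset, data):
        end = offset + len(data)
        if end > len(self.data):
            self.data.extend(b"\x00" * (end - len(self.data)))  # grow file
        self.data[offset:end] = data

    def sync(self):
        pass  # nothing to flush - it's all in RAM

class MemoryVfs:
    def __init__(self):
        self.files = {}

    def open(self, path, create=False):
        if path not in self.files:
            if not create:
                raise FileNotFoundError(path)
            self.files[path] = MemoryFile()
        return self.files[path]

vfs = MemoryVfs()
f = vfs.open("test.db", create=True)
f.write(4096, b"page one")          # positioned write at page 1
assert f.read(4096, 8) == b"page one"
```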
2. Portability
Different operating systems have different file APIs:
- Linux: `pread`, `pwrite`, `fsync`
- Windows: `ReadFile`, `WriteFile`, `FlushFileBuffers`
- Embedded: Custom flash drivers
The VFS hides these differences. We implement one VFS per platform, and the rest of the database doesn't care.
3. Injection of Failures
For testing crash recovery, we need to simulate failures:
const FaultyVfs = struct {
    inner: Vfs,
    fail_after_n_writes: usize,
    write_count: usize,

    pub fn write(self: *Self, offset: u64, data: []const u8) !void {
        self.write_count += 1;
        if (self.write_count > self.fail_after_n_writes) {
            return error.SimulatedCrash;
        }
        return self.inner.write(offset, data);
    }
};
The Interface
File Operations
pub const File = struct {
    ptr: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        read: fn(ptr, offset: u64, buf: []u8) Error!usize,
        write: fn(ptr, offset: u64, data: []const u8) Error!void,
        sync: fn(ptr) Error!void,
        size: fn(ptr) Error!u64,
        close: fn(ptr) void,
        truncate: fn(ptr, size: u64) Error!void,
    };

    // Convenience methods that call vtable
    pub fn read(self: File, offset: u64, buf: []u8) !usize {
        return self.vtable.read(self.ptr, offset, buf);
    }
    // ... etc
};
VFS Operations
pub const Vfs = struct {
    ptr: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        open: fn(ptr, path: []const u8, flags: OpenFlags) Error!File,
        delete: fn(ptr, path: []const u8) Error!void,
        exists: fn(ptr, path: []const u8) bool,
    };
};
Open Flags
pub const OpenFlags = struct {
    read: bool = true,       // Open for reading
    write: bool = false,     // Open for writing
    create: bool = false,    // Create if doesn't exist
    exclusive: bool = false, // Fail if exists (with create)
};
Key Operations
Positioned Read/Write
Unlike streaming I/O, we use positioned reads and writes:
// Read 4096 bytes starting at byte offset 8192
const bytes_read = try file.read(8192, buffer[0..4096]);
// Write 4096 bytes starting at byte offset 8192
try file.write(8192, data[0..4096]);
This maps directly to page-based access:
- Page 0 is at offset 0
- Page 1 is at offset 4096
- Page N is at offset N * 4096
Sync (fsync)
The most important operation for durability:
try file.sync();
This forces all pending writes to physical storage. Without it, data might sit in OS buffers and be lost on power failure.
Application writes: After fsync:
┌─────────────┐ ┌─────────────┐
│ Application │ │ Application │
└──────┬──────┘ └─────────────┘
│ write()
▼
┌─────────────┐ ┌─────────────┐
│ OS Buffer │ │ OS Buffer │
│ (in RAM) │ ◄── Data │ (empty) │
└─────────────┘ sits └──────┬──────┘
╳ here! │ fsync forces
Power │ write to disk
failure ▼
loses it! ┌─────────────────────┐
│ Disk (persistent) │
│ Data is SAFE │
└─────────────────────┘
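On POSIX systems the whole pattern is a couple of syscalls, which Python exposes directly. This sketch (Linux/macOS only) uses `os.pwrite`/`os.pread` for positioned page I/O and `os.fsync` as the durability barrier:

```python
import os
import tempfile

# Open a scratch database file
fd = os.open(os.path.join(tempfile.mkdtemp(), "demo.db"),
             os.O_RDWR | os.O_CREAT, 0o644)

PAGE_SIZE = 4096
page_id = 5

# Positioned write: page N lives at byte offset N * 4096
os.pwrite(fd, b"\x2a" + b"\x00" * (PAGE_SIZE - 1), page_id * PAGE_SIZE)
os.fsync(fd)  # without this, the bytes may still sit in OS buffers

# Positioned read of the same page
data = os.pread(fd, PAGE_SIZE, page_id * PAGE_SIZE)
assert data[0] == 0x2A
os.close(fd)
```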
The POSIX Implementation
For Unix-like systems (Linux, macOS):
pub const PosixVfs = struct {
    allocator: Allocator,

    pub fn open(self: *Self, path: []const u8, flags: OpenFlags) !File {
        var posix_flags: u32 = 0;
        if (flags.read and flags.write) {
            posix_flags |= O_RDWR;
        } else if (flags.write) {
            posix_flags |= O_WRONLY;
        } else {
            posix_flags |= O_RDONLY;
        }
        if (flags.create) posix_flags |= O_CREAT;
        if (flags.exclusive) posix_flags |= O_EXCL;

        const fd = try std.posix.open(path, posix_flags, 0o644);
        // ... wrap in File struct
    }
};
Read Implementation
Uses pread for positioned reading (thread-safe, no seeking):
fn read(self: *PosixFile, offset: u64, buf: []u8) !usize {
    return std.posix.pread(self.fd, buf, offset);
}
Write Implementation
Uses pwrite for positioned writing:
fn write(self: *PosixFile, offset: u64, data: []const u8) !void {
    var written: usize = 0;
    while (written < data.len) {
        const n = try std.posix.pwrite(self.fd, data[written..], offset + written);
        if (n == 0) return error.DiskFull;
        written += n;
    }
}
Sync Implementation
fn sync(self: *PosixFile) !void {
    try std.posix.fsync(self.fd);
}
Error Handling
VFS operations can fail in various ways:
pub const VfsError = error{
    FileNotFound,     // Path doesn't exist
    PermissionDenied, // No access rights
    DiskFull,         // No space left
    IoError,          // Hardware failure
    FileLocked,       // Another process has it
    InvalidPath,      // Bad path string
    AlreadyExists,    // Exclusive create failed
};
These are translated from OS-specific error codes:
fn mapPosixError(err: std.posix.Error) VfsError {
    return switch (err) {
        .ENOENT => VfsError.FileNotFound,
        .EACCES, .EPERM => VfsError.PermissionDenied,
        .ENOSPC => VfsError.DiskFull,
        .EIO => VfsError.IoError,
        // ...
    };
}
Usage Pattern
// Create VFS
var posix_vfs = PosixVfs.init(allocator);
const vfs = posix_vfs.vfs(); // Get interface

// Open file
const file = try vfs.open("database.db", .{
    .read = true,
    .write = true,
    .create = true,
});
defer file.close();

// Read page 5
var buf: [4096]u8 = undefined;
_ = try file.read(5 * 4096, &buf);

// Modify and write back
buf[0] = 42;
try file.write(5 * 4096, &buf);

// Ensure durability
try file.sync();
Why Not Just Use std.fs?
Zig's standard library has file operations, but:
- No positioned I/O - `std.fs.File` uses streaming with seek, which isn't thread-safe
- No vtable pattern - Can't swap implementations
- Different error types - We want database-specific errors
The VFS is a thin layer, but it gives us control where we need it.
Page Manager
What It Is
The Page Manager handles the lowest level of database storage: fixed-size pages. It manages the database file structure, page allocation/deallocation, and data integrity through checksums.
Why Fixed-Size Pages?
Simplicity
Every piece of data lives in a 4KB page. This uniformity simplifies everything:
┌────────────────────────────────────────────────────────────┐
│ Database File │
├────────────┬────────────┬────────────┬────────────┬────────┤
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ ... │
│ (Header) │ (B+Tree) │ (B+Tree) │ (Free) │ │
│ 4096 bytes│ 4096 bytes│ 4096 bytes│ 4096 bytes│ │
└────────────┴────────────┴────────────┴────────────┴────────┘
│ │ │ │
▼ ▼ ▼ ▼
Offset: 0 4096 8192 12288
To read page N: offset = N * 4096
Alignment with Hardware
- SSD pages: Modern SSDs have 4KB or 8KB internal pages
- OS pages: Most operating systems use 4KB virtual memory pages
- Disk sectors: Traditional HDDs use 512B or 4KB sectors
4KB aligns well with all of these, enabling:
- Direct I/O (bypassing OS cache)
- Atomic writes on some hardware
- Efficient memory mapping
Efficient Caching
Fixed-size pages make the buffer pool trivial - every frame is the same size, no fragmentation.
File Structure
Page 0: File Header
The first page is special - it contains metadata about the database:
┌────────────────────────────────────────────────────────────┐
│ File Header (Page 0) │
├────────────────────────────────────────────────────────────┤
│ Offset 0-3: Magic number (0x4C415454 = "LATT") │
│ Offset 4-5: Format version │
│ Offset 6-7: Minimum reader version │
│ Offset 8-11: Page size (4096) │
│ Offset 12-15: Freelist head page │
│ Offset 16-23: Created timestamp │
│ Offset 24-31: Modified timestamp │
│ Offset 32-47: File UUID (16 bytes) │
│ Offset 48+: Reserved / padding │
└────────────────────────────────────────────────────────────┘
Key fields:
- Magic number: Identifies this as a Lattice database file. Opening a random file will fail fast.
- Format version: Allows future changes to the format
- Freelist head: Points to the first free page (for allocation)
- File UUID: Unique identifier, used to match WAL files
Data Pages
Every other page starts with a common header:
┌────────────────────────────────────────────────────────────┐
│ Page Header (8 bytes) │
├────────────────────────────────────────────────────────────┤
│ Offset 0-3: Checksum (CRC32 of bytes 8-4095) │
│ Offset 4: Page type (btree_internal, btree_leaf, etc)│
│ Offset 5: Flags │
│ Offset 6-7: Reserved │
├────────────────────────────────────────────────────────────┤
│ Offset 8-4095: Page-specific data │
└────────────────────────────────────────────────────────────┘
Page Types
pub const PageType = enum(u8) {
    free = 0,           // Unallocated, on freelist
    btree_internal = 1, // B+Tree internal node
    btree_leaf = 2,     // B+Tree leaf node
    overflow = 3,       // Large value overflow
    freelist = 4,       // Freelist continuation
};
Page Allocation
The Freelist
Free pages are linked together in a freelist:
Header Free pages linked together
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│freelist: ├────────────►│ Page 5 ├───►│ Page 3 ├───►│ Page 9 │
│ 5 │ │ next: 3 │ │ next: 9 │ │ next: 0 │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│
NULL (end)
Allocating a Page
pub fn allocatePage(self: *Self) !PageId {
    // 1. Try freelist first
    if (self.header.freelist_page != NULL_PAGE) {
        return self.allocateFromFreelist();
    }

    // 2. Allocate at end of file
    const page_id = self.pageCount();

    // 3. Extend file with zeroed page
    var zeros: [4096]u8 = [_]u8{0} ** 4096;
    const header: *PageHeader = @ptrCast(&zeros);
    header.* = PageHeader.init(.free);
    try self.file.write(page_id * 4096, &zeros);

    return page_id;
}
Allocating from Freelist
fn allocateFromFreelist(self: *Self) !PageId {
    const page_id = self.header.freelist_page;

    // Read the free page to get next pointer
    var buf: [4096]u8 = undefined;
    try self.readPageRaw(page_id, &buf);

    // Next pointer is stored after the 8-byte header
    const next_free = std.mem.readInt(u32, buf[8..12], .little);

    // Update freelist head
    self.header.freelist_page = next_free;
    try self.writeHeader();

    return page_id;
}
Freeing a Page
pub fn freePage(self: *Self, page_id: PageId) !void {
    // 1. Can't free the header page
    if (page_id == 0) return error.InvalidPageId;

    // 2. Set up page as free, pointing to current freelist head
    var buf: [4096]u8 = [_]u8{0} ** 4096;
    const header: *PageHeader = @ptrCast(&buf);
    header.* = PageHeader.init(.free);

    // Store old freelist head as next pointer
    std.mem.writeInt(u32, buf[8..12], self.header.freelist_page, .little);

    // 3. Calculate checksum and write
    header.checksum = calculateChecksum(buf[8..]);
    try self.file.write(page_id * 4096, &buf);

    // 4. Update freelist head to this page
    self.header.freelist_page = page_id;
    try self.writeHeader();
}
This is a LIFO (stack) freelist - last freed is first allocated. Simple and efficient.
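The allocate/free dance is easy to model. This Python toy (illustrative, not the real implementation) captures the LIFO behavior: the header holds the head page id, and each free page stores the next free page id in its body.

```python
NULL_PAGE = 0
free_links = {}  # page_id -> next-free pointer (stands in for page bodies)
header = {"freelist": NULL_PAGE, "page_count": 1}  # page 0 is the header

def allocate():
    if header["freelist"] != NULL_PAGE:
        # Pop from the freelist: follow the stored next pointer
        page_id = header["freelist"]
        header["freelist"] = free_links.pop(page_id)
        return page_id
    # Freelist empty: extend the "file" with a new page
    page_id = header["page_count"]
    header["page_count"] += 1
    return page_id

def free(page_id):
    free_links[page_id] = header["freelist"]  # point at the old head
    header["freelist"] = page_id              # this page is the new head

a, b, c = allocate(), allocate(), allocate()  # pages 1, 2, 3
free(b)
free(c)
assert allocate() == c  # LIFO: the last page freed comes back first
assert allocate() == b
```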
Checksums
Every page has a CRC32 checksum that covers bytes 8-4095 (everything after the checksum field itself).
Why Checksums?
- Detect corruption: Hardware failures, cosmic rays, bugs
- Detect torn writes: Partial page writes from crashes
- Validate reads: Catch errors before using bad data
Checksum Calculation
pub fn calculateChecksum(data: []const u8) u32 {
    var crc = std.hash.Crc32.init();
    crc.update(data);
    return crc.final();
}
Writing with Checksum
pub fn writePage(self: *Self, page_id: PageId, buf: []u8) !void {
    // Calculate checksum of everything after the checksum field
    const header: *PageHeader = @ptrCast(buf.ptr);
    header.checksum = calculateChecksum(buf[8..]);

    // Write to disk
    try self.file.write(page_id * 4096, buf);
}
Reading with Verification
pub fn readPage(self: *Self, page_id: PageId, buf: []u8) !void {
    try self.file.read(page_id * 4096, buf);

    // Verify checksum
    const header: *const PageHeader = @ptrCast(buf.ptr);
    const expected = calculateChecksum(buf[8..]);
    if (header.checksum != expected and header.checksum != 0) {
        return error.ChecksumMismatch;
    }
}
Note: Checksum of 0 is allowed for newly allocated pages that haven't been written yet.
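The same scheme is easy to reproduce in Python with `zlib.crc32` (a sketch of the idea; we assume it matches the CRC32 variant `std.hash.Crc32` computes):

```python
import zlib

PAGE_SIZE = 4096

def write_checksum(page: bytearray) -> None:
    # CRC32 over bytes 8..4095, stored little-endian in bytes 0..3
    crc = zlib.crc32(page[8:])
    page[0:4] = crc.to_bytes(4, "little")

def verify_checksum(page) -> bool:
    stored = int.from_bytes(page[0:4], "little")
    if stored == 0:
        return True  # never-written page, allowed
    return stored == zlib.crc32(page[8:])

page = bytearray(PAGE_SIZE)
page[8:13] = b"hello"
write_checksum(page)
assert verify_checksum(page)

page[100] ^= 0xFF  # simulate corruption or a torn write
assert not verify_checksum(page)
```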
Read-Only Mode
The Page Manager supports opening databases read-only:
var pm = try PageManager.init(allocator, vfs, "db.file", .{
    .read_only = true,
});

// This fails:
_ = try pm.allocatePage();  // error.PermissionDenied
try pm.writePage(1, &buf);  // error.PermissionDenied
Useful for:
- Backup tools
- Read replicas
- Forensic analysis
Error Handling
pub const PageManagerError = error{
    InvalidHeader,    // File header is malformed
    InvalidMagic,     // Not a Lattice database file
    VersionTooNew,    // File created by newer version
    ChecksumMismatch, // Data corruption detected
    InvalidPageId,    // Page ID out of range
    PageNotAllocated, // Accessing unallocated page
    IoError,          // Underlying I/O failed
    PermissionDenied, // Write to read-only database
    FileNotFound,     // Database file doesn't exist
    DiskFull,         // No space for new pages
};
Thread Safety
The Page Manager itself is NOT thread-safe. Each operation directly touches the file. In a multi-threaded context, the Buffer Pool provides thread-safe access by:
- Caching pages in memory
- Using latches (reader-writer locks) per page
- Serializing writes through its own mutex
Usage Pattern
// Open or create database
var pm = try PageManager.init(allocator, vfs, "my.db", .{
    .create = true,
    .page_size = 4096,
});
defer pm.deinit();

// Allocate a page
const page_id = try pm.allocatePage();

// Write data
var buf: [4096]u8 align(4096) = undefined;
const header: *PageHeader = @ptrCast(&buf);
header.* = PageHeader.init(.btree_leaf);
// ... fill in page data ...
try pm.writePage(page_id, &buf);

// Read it back
var read_buf: [4096]u8 align(4096) = undefined;
try pm.readPage(page_id, &read_buf);

// Free the page
try pm.freePage(page_id);

// Ensure durability
try pm.sync();
Key Design Decisions
Fixed 4KB Pages
Chosen for hardware alignment. Could be made configurable (stored in header), but 4KB works well for most cases.
Freelist in Pages
The freelist uses the free pages themselves for storage. No external data structure needed. Elegant and space-efficient.
Checksum After Header
Checksum doesn't cover itself (circular dependency). By placing checksum first, we can compute it over the rest of the page in one pass.
No Internal Fragmentation Tracking
We don't track free space within pages - that's the responsibility of higher layers (B+Tree). The Page Manager only deals with whole pages.
Buffer Pool
What It Is
The Buffer Pool is a cache that keeps frequently accessed pages in memory. Instead of reading from disk every time, we check the buffer pool first.
The Problem It Solves
Disk I/O is slow:
Operation Time
─────────────────────────────
CPU instruction ~1 ns
RAM access ~100 ns
SSD read ~100,000 ns (100 μs)
HDD read ~10,000,000 ns (10 ms)
RAM is 1,000-100,000x faster than disk. If we can keep hot pages in RAM, performance improves dramatically.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Buffer Pool │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Page Table │ │
│ │ (PageId → FrameId mapping) │ │
│ │ │ │
│ │ PageId 5 ──► Frame 2 │ │
│ │ PageId 12 ─► Frame 0 │ │
│ │ PageId 3 ──► Frame 7 │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │Frame 0│ │Frame 1│ │Frame 2│ │Frame 3│ │Frame 4│ │Frame 5│ │
│ │───────│ │───────│ │───────│ │───────│ │───────│ │───────│ │
│ │pg: 12 │ │pg: -- │ │pg: 5 │ │pg: -- │ │pg: 8 │ │pg: 1 │ │
│ │pin: 2 │ │pin: 0 │ │pin: 1 │ │pin: 0 │ │pin: 0 │ │pin: 3 │ │
│ │dirty:Y│ │dirty:N│ │dirty:N│ │dirty:N│ │dirty:Y│ │dirty:N│ │
│ │use: 3 │ │use: 0 │ │use: 1 │ │use: 0 │ │use: 2 │ │use: 5 │ │
│ │[data] │ │[data] │ │[data] │ │[data] │ │[data] │ │[data] │ │
│ └───────┘ └───────┘ └───────┘ └───────┘ └───────┘ └───────┘ │
│ │
│ Free list: [1, 3] (frames not holding any page) │
│ Clock hand: ──────────────────────► │
└─────────────────────────────────────────────────────────────────┘
The Frame
Each frame holds one page:
pub const BufferFrame = struct {
    page_id: PageId,         // Which page is here (0 = empty)
    data: []align(4096) u8,  // The actual 4KB page data
    pin_count: atomic(u32),  // Reference count
    dirty: bool,             // Modified since read?
    usage_count: u8,         // For Clock eviction
    latch: RwLatch,          // Reader-writer lock
};
Pin Count
The pin count is a reference count:
fetchPage() ──► pin_count += 1 "I'm using this page"
unpinPage() ──► pin_count -= 1 "I'm done with this page"
pin_count > 0 ──► Page is in use, cannot be evicted
pin_count = 0 ──► Page can be evicted if needed
Dirty Flag
Read page ──► dirty = false "Matches disk"
Modify page ──► dirty = true "Different from disk"
Write to disk ─► dirty = false "Matches disk again"
Dirty pages MUST be written to disk before eviction. Otherwise we lose data!
Reader-Writer Latch
Multiple readers OR one writer
Read latch: "I'm reading, others can read too"
Write latch: "I'm modifying, exclusive access"
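Python's standard library has no reader-writer lock, but the semantics can be sketched with a condition variable (a minimal illustration of the many-readers-or-one-writer rule, not the latch implementation LatticeDB uses):

```python
import threading

class RwLatch:
    """Many readers OR one writer."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:          # wait out any active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:  # need exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

latch = RwLatch()
latch.acquire_read()
latch.acquire_read()   # two readers coexist
latch.release_read()
latch.release_read()
latch.acquire_write()  # now exclusive
latch.release_write()
```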
Fetching a Page
pub fn fetchPage(self: *Self, page_id: PageId, mode: LatchMode) !*BufferFrame {
    self.mutex.lock();
    defer self.mutex.unlock();

    // 1. Check if already in buffer pool
    if (self.page_table.get(page_id)) |frame_id| {
        const frame = &self.frames[frame_id];
        frame.pin_count.fetchAdd(1, .monotonic);
        frame.usage_count = @min(frame.usage_count + 1, 255);
        acquireLatch(frame, mode);
        return frame;
    }

    // 2. Not in pool - need to load from disk
    const frame = try self.findVictimFrame();

    // 3. If victim is dirty, flush it first
    if (frame.dirty) {
        try self.pm.writePage(frame.page_id, frame.data);
        frame.dirty = false;
    }

    // 4. Update page table
    if (frame.page_id != NULL_PAGE) {
        self.page_table.remove(frame.page_id);
    }
    self.page_table.put(page_id, frame_id);

    // 5. Read page from disk
    try self.pm.readPage(page_id, frame.data);

    // 6. Set up frame
    frame.page_id = page_id;
    frame.pin_count.store(1, .monotonic);
    frame.dirty = false;
    frame.usage_count = 1;
    acquireLatch(frame, mode);
    return frame;
}
The Clock Eviction Algorithm
When the buffer pool is full, we need to evict a page to make room. We use the Clock algorithm (also called "second chance").
Why Clock?
- LRU (Least Recently Used) is optimal but expensive - requires updating timestamps on every access
- Clock approximates LRU cheaply using a usage bit
How It Works
Imagine the frames arranged in a circle with a clock hand:
┌─────┐
┌────│ 3 │────┐
╱ │use:1│ ╲
┌─────┐ └─────┘ ┌─────┐
│ 2 │ │ 4 │
│use:0│ │use:2│
└─────┘ └─────┘
╲ ┌─────┐ ╱
└────│ 1 │────┘
│use:0│◄──── clock hand
│pin:0│
└─────┘
To find a victim:
fn findVictimFrame(self: *Self) !*BufferFrame {
// Try free list first
if (self.free_list.pop()) |frame_id| {
return &self.frames[frame_id];
}
// Clock sweep
var attempts: usize = 0;
while (attempts < self.frame_count * 2) {
const frame = &self.frames[self.clock_hand];
// Move hand
self.clock_hand = (self.clock_hand + 1) % self.frame_count;
attempts += 1;
// Skip pinned pages
if (frame.pin_count.load(.monotonic) > 0) {
continue;
}
// Second chance: if used recently, decrement and skip
if (frame.usage_count > 0) {
frame.usage_count -= 1;
continue;
}
// Found victim!
return frame;
}
return error.BufferPoolFull; // No evictable frame: all pinned or recently used
}
The key insight: Usage count gives pages a "second chance". Recently used pages survive one clock sweep. Only pages that haven't been used in a full rotation get evicted.
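The sweep is easy to model outside the database. Here is a toy simulation in Python (separate from the Zig implementation): each frame is a [pin_count, usage_count] pair, and the hand sweeps until it finds an unpinned frame whose usage count has reached zero.

```python
def clock_victim(frames, hand):
    """frames: list of [pin_count, usage_count]. Returns (victim_index, new_hand),
    or (None, new_hand) if no frame is evictable after two full rotations."""
    n = len(frames)
    for _ in range(2 * n):
        frame = frames[hand]
        idx, hand = hand, (hand + 1) % n   # move the hand
        if frame[0] > 0:                   # pinned: cannot evict
            continue
        if frame[1] > 0:                   # second chance: decrement and skip
            frame[1] -= 1
            continue
        return idx, hand                   # unpinned and unused: victim found
    return None, hand

# Frame 2 is pinned; frames 0 and 1 burn their second chances first.
frames = [[0, 1], [0, 2], [1, 0], [0, 0]]
victim, _ = clock_victim(frames, 0)
print(victim)  # 3 - the only unpinned frame with usage 0
```

Running it shows how recently used frames survive the sweep while frame 3, unpinned and unused, gets evicted.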
Unpinning Pages
When done with a page:
pub fn unpinPage(self: *Self, frame: *BufferFrame, dirty: bool) void {
// Release latch
frame.latch.release();
// Mark dirty if modified
if (dirty) {
frame.dirty = true;
}
// Decrement pin count
_ = frame.pin_count.fetchSub(1, .monotonic);
}
Always unpin! Failure to unpin causes:
- Pages stuck in memory forever
- Buffer pool eventually fills with pinned pages
- BufferPoolFull errors
Flushing Pages
Writing dirty pages to disk:
// Flush one page
pub fn flushPage(self: *Self, page_id: PageId) !void {
const frame = self.getFrame(page_id) orelse return;
if (frame.dirty) {
try self.pm.writePage(frame.page_id, frame.data);
frame.dirty = false;
}
}
// Flush all dirty pages
pub fn flushAll(self: *Self) !void {
for (self.frames) |*frame| {
if (frame.page_id != NULL_PAGE and frame.dirty) {
try self.pm.writePage(frame.page_id, frame.data);
frame.dirty = false;
}
}
}
Thread Safety
The buffer pool is thread-safe:
- Mutex protects page_table, free_list, clock_hand
- Per-frame latches protect page data
- Atomic pin_count for safe reference counting
Thread 1: fetchPage(5) Thread 2: fetchPage(5)
│ │
├─► mutex.lock() ├─► mutex.lock() [WAIT]
│ look up page 5 │
│ pin_count++ │
│ mutex.unlock() ─────────────┼─► mutex.lock() [GOT IT]
│ │ look up page 5 [FOUND]
├─► frame.latch.read() │ pin_count++
│ ... read data ... │ mutex.unlock()
│ │
│ ├─► frame.latch.read()
│ │ ... read data ...
│ │
├─► unpinPage() ├─► unpinPage()
Multiple threads can read the same page concurrently (shared latch).
Memory Alignment
Page buffers are 4KB-aligned:
const data = try allocator.alignedAlloc(u8, 4096, page_size);
Why?
- Direct I/O: Some systems require aligned buffers for O_DIRECT
- SIMD: Aligned data enables vectorized operations
- Cache lines: Better CPU cache utilization
Sizing the Buffer Pool
// 64MB buffer pool = 16,384 pages
var bp = try BufferPool.init(allocator, &pm, 64 * 1024 * 1024);
Guidelines:
- More is better (to a point)
- Working set: Should fit frequently accessed pages
- Available RAM: Leave room for OS and other processes
- Typical: 25-75% of available RAM
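The sizing arithmetic is just pool bytes divided by page size. A quick check in Python (illustrative only):

```python
PAGE_SIZE = 4096

def frame_count(pool_bytes):
    """Number of page frames a buffer pool of pool_bytes can hold."""
    return pool_bytes // PAGE_SIZE

print(frame_count(64 * 1024 * 1024))  # 16384 frames, as in the example above
```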
Usage Pattern
var bp = try BufferPool.init(allocator, &pm, pool_size);
defer bp.deinit(); // Flushes dirty pages
// Read a page
const frame = try bp.fetchPage(page_id, .shared);
defer bp.unpinPage(frame, false);
const value = readValueFromPage(frame.data);
// Modify a page (new name - redeclaring frame in the same scope would not compile)
const wframe = try bp.fetchPage(page_id, .exclusive);
defer bp.unpinPage(wframe, true); // true = dirty
modifyPage(wframe.data);
Key Invariants
- Pin before access: Never access page data without pinning
- Unpin when done: Every fetchPage must have matching unpinPage
- Mark dirty: If you modified the page, set dirty=true when unpinning
- Flush before close: deinit() flushes, or call flushAll() explicitly
B+Tree
What It Is
A B+Tree is a self-balancing tree structure optimized for disk-based storage. It provides O(log N) lookups, inserts, and deletes, with efficient range scans.
Why B+Tree?
The Problem with Binary Trees
A binary search tree with 1 million entries is ~20 levels deep (log2(1M) ≈ 20). Each level requires a disk read. That's 20 disk reads per lookup!
B+Tree: Wide and Shallow
A B+Tree has many keys per node (hundreds), making it very shallow:
Binary Tree (1M entries): B+Tree (1M entries):
depth ~20 depth ~3
○ ○
/ \ / | \
○ ○ ○ ○ ○
/\ /\ / | \ / | \ / | \
○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○
... (20 levels) (leaf level)
20 disk reads 3 disk reads!
With 100 keys per node: log100(1M) ≈ 3 levels.
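The depth arithmetic can be verified with an integer loop (a small Python illustration, not part of LatticeDB):

```python
def tree_depth(entries, fanout):
    """Smallest depth such that fanout**depth >= entries."""
    depth = 1
    capacity = fanout
    while capacity < entries:
        capacity *= fanout
        depth += 1
    return depth

print(tree_depth(1_000_000, 2))    # 20 levels for a binary tree
print(tree_depth(1_000_000, 100))  # 3 levels for a 100-way B+Tree
```

Doubling the fanout barely helps a binary tree; going from 2 to 100 children per node collapses the tree from 20 levels to 3.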
Data Only in Leaves
B+Trees store all data in leaf nodes. Internal nodes only contain keys for routing:
┌─────────────────────┐
│ [30] [60] [90] │ ◄── Internal: keys only
└───┬──────┬──────┬───┘
╱ │ ╲
┌─────────────┘ │ └─────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ 10:val 20:val │◄──►│ 30:val 50:val │◄──►│ 60:val 80:val │
└───────────────┘ └───────────────┘ └───────────────┘
▲ ▲
└─── Leaves: keys AND values ─────────────┘
Linked for range scans
Our Implementation: Direct Page Manipulation
We don't create "node objects" in memory. Instead, we read/write bytes directly in page buffers. The "node" is just a lens over page bytes.
Traditional approach: Our approach:
┌─────────────┐ ┌─────────────┐
│ Page bytes │ │ Page bytes │
└──────┬──────┘ └──────┬──────┘
│ deserialize │
▼ │ direct access
┌─────────────┐ │
│ Node struct │ │
│ - keys[] │ ▼
│ - values[] │ Read/write at
│ - children[]│ calculated offsets
└──────┬──────┘
│ serialize
▼
┌─────────────┐
│ Page bytes │
└─────────────┘
No intermediate objects. No serialization overhead.
Page Layout
Leaf Node
┌────────────────────────────────────────────────────────────────┐
│ Leaf Page (4KB) │
├────────────────────────────────────────────────────────────────┤
│ Bytes 0-7: Page Header (checksum, type=leaf, flags) │
├────────────────────────────────────────────────────────────────┤
│ Bytes 8-9: Entry count (u16) │
│ Bytes 10-13: Prev leaf page (u32) │
│ Bytes 14-17: Next leaf page (u32) │
│ Bytes 18-19: Free space offset (u16) │
├────────────────────────────────────────────────────────────────┤
│ Slot 0: key_offset, key_len, val_offset, val_len (8 bytes) │
│ Slot 1: key_offset, key_len, val_offset, val_len (8 bytes) │
│ Slot 2: ... │
│ ... │
│ [free space grows down ↓] │
│ │
│ [entries grow up ↑] │
│ │
│ "value2" │
│ "key2" │
│ "value1" │
│ "key1" │
└────────────────────────────────────────────────────────────────┘
Byte 4095
Variable-length entries: Keys and values are stored at the end of the page, growing upward. Slots at the front point to them.
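The slotted layout can be sketched in a few lines. This toy Python model (not the Zig page code) keeps slots growing from the front and entry bytes growing from the back of a 4KB buffer; the real implementation keeps slots sorted and binary-searches them.

```python
PAGE_SIZE = 4096

def make_page():
    # (buffer, slot list, free_end): slots grow from the front, data from the back
    return bytearray(PAGE_SIZE), [], PAGE_SIZE

def insert(page, slots, free_end, key: bytes, val: bytes):
    """Write val then key at the tail; record (key_off, key_len, val_off, val_len)."""
    free_end -= len(val)
    val_off = free_end
    page[val_off:val_off + len(val)] = val
    free_end -= len(key)
    key_off = free_end
    page[key_off:key_off + len(key)] = key
    slots.append((key_off, len(key), val_off, len(val)))
    return free_end

def get(page, slots, key: bytes):
    """Linear scan of the slot array (the real page binary-searches sorted slots)."""
    for ko, kl, vo, vl in slots:
        if bytes(page[ko:ko + kl]) == key:
            return bytes(page[vo:vo + vl])
    return None

page, slots, free_end = make_page()
free_end = insert(page, slots, free_end, b"key1", b"value1")
free_end = insert(page, slots, free_end, b"key2", b"value2")
print(get(page, slots, b"key2"))  # b'value2'
```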
Internal Node
┌────────────────────────────────────────────────────────────────┐
│ Internal Page (4KB) │
├────────────────────────────────────────────────────────────────┤
│ Bytes 0-7: Page Header (checksum, type=internal, flags) │
├────────────────────────────────────────────────────────────────┤
│ Bytes 8-9: Key count (u16) │
│ Bytes 10-13: Rightmost child (u32) │
│ Bytes 14-15: Free space offset (u16) │
├────────────────────────────────────────────────────────────────┤
│ Slot 0: child_page, key_offset, key_len (8 bytes) │
│ Slot 1: child_page, key_offset, key_len (8 bytes) │
│ ... │
│ [free space] │
│ │
│ "separator_key2" │
│ "separator_key1" │
└────────────────────────────────────────────────────────────────┘
Point Lookup
Finding a value by key:
pub fn get(self: *Self, key: []const u8) !?[]const u8 {
// 1. Start at root
var page_id = self.root_page;
while (true) {
// 2. Fetch page from buffer pool
const frame = try self.bp.fetchPage(page_id, .shared);
defer self.bp.unpinPage(frame, false);
const page_type = getPageType(frame.data);
if (page_type == .btree_leaf) {
// 3. Binary search in leaf
return LeafNode.search(frame.data, key, self.comparator);
} else {
// 4. Binary search for child pointer, descend
page_id = InternalNode.findChild(frame.data, key, self.comparator);
}
}
}
Example lookup for key "dog":
Root (Internal)
┌─────────────────────┐
│ [cat] [fish] │
└───┬──────────┬──────┘
╱ ╲
"dog" > "cat" ╱ ╲
"dog" < "fish" ╱ ╲
▼
┌─────────────────┐
│ cow:v1 dog:v2 │ ◄── Found! Return v2
└─────────────────┘
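The descent in get() can be modeled with in-memory dicts standing in for pages (a Python sketch, with made-up page contents):

```python
import bisect

# Toy pages keyed by id: internal nodes route by separators, leaves hold sorted pairs.
pages = {
    0: ("internal", ["cat", "fish"], [1, 2, 3]),  # keys < cat -> 1, < fish -> 2, else 3
    1: ("leaf", [("ant", "v0"), ("bee", "v1")]),
    2: ("leaf", [("cow", "v1"), ("dog", "v2")]),
    3: ("leaf", [("goat", "v3")]),
}

def get(root, key):
    page_id = root
    while True:
        page = pages[page_id]
        if page[0] == "leaf":
            entries = page[1]
            i = bisect.bisect_left([k for k, _ in entries], key)
            if i < len(entries) and entries[i][0] == key:
                return entries[i][1]
            return None
        seps, children = page[1], page[2]
        page_id = children[bisect.bisect_right(seps, key)]  # route by separator

print(get(0, "dog"))  # v2 - same path as the diagram: past "cat", before "fish"
```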
Insertion
Simple Case: Space in Leaf
pub fn insert(self: *Self, key: []const u8, value: []const u8) !void {
// 1. Find the leaf
const leaf_page = try self.findLeaf(key);
// 2. Fetch with exclusive latch
const frame = try self.bp.fetchPage(leaf_page, .exclusive);
defer self.bp.unpinPage(frame, true);
// 3. Check if there's space
if (LeafNode.hasSpace(frame.data, key.len, value.len)) {
// 4. Insert directly
LeafNode.insert(frame.data, key, value, self.comparator);
} else {
// 5. Need to split
try self.splitLeafAndInsert(frame, key, value);
}
}
Splitting a Leaf
When a leaf is full:
Before split:
┌─────────────────────────────────────┐
│ a:1 b:2 c:3 d:4 e:5 [FULL] │
└─────────────────────────────────────┘
│
│ Insert "f:6" - no room!
▼
After split:
┌──────────────────────┐ ┌──────────────────────┐
│ a:1 b:2 c:3 │◄──►│ d:4 e:5 f:6 │
└──────────────────────┘ └──────────────────────┘
│ │
└───────────┬────────────────┘
│
▼
"d" promoted to parent
fn splitLeafAndInsert(self: *Self, frame: *BufferFrame, key: []const u8, value: []const u8) !void {
// 1. Allocate new page
const new_page = try self.pm.allocatePage();
const new_frame = try self.bp.fetchPage(new_page, .exclusive);
defer self.bp.unpinPage(new_frame, true);
// 2. Find split point (middle)
const entry_count = LeafNode.getCount(frame.data);
const split_point = entry_count / 2;
// 3. Move upper half to new page
LeafNode.moveEntries(frame.data, new_frame.data, split_point, entry_count);
// 4. Insert new key into the appropriate page
const split_key = LeafNode.getFirstKey(new_frame.data);
if (self.comparator(key, split_key) < 0) {
LeafNode.insert(frame.data, key, value, self.comparator);
} else {
LeafNode.insert(new_frame.data, key, value, self.comparator);
}
// 5. Update sibling pointers
LeafNode.setNext(frame.data, new_page);
LeafNode.setPrev(new_frame.data, frame.page_id);
// 6. Insert the separator key into the parent
try self.insertIntoParent(frame.page_id, split_key, new_page);
}
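The split logic itself is independent of the page layout. A toy Python model over a sorted list of (key, value) pairs shows the same behavior as the diagram:

```python
def split_leaf(entries, key, value):
    """Split a full, sorted leaf and insert (key, value).
    Returns (left, right, separator) - separator is the right page's first key."""
    mid = (len(entries) + 1) // 2
    left, right = entries[:mid], entries[mid:]
    separator = right[0][0]                 # promoted to the parent
    target = left if key < separator else right
    target.append((key, value))
    target.sort()
    return left, right, separator

full = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
left, right, sep = split_leaf(full, "f", 6)
# left = [a,b,c], right = [d,e,f], sep = "d" - matching the diagram above
```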
Growing the Tree
When the root splits, we create a new root:
Before:
┌─────────────┐
│ Root (full)│
└─────────────┘
After root split:
┌─────────────┐
│ New Root │ ◄── New level!
│ [50] │
└──────┬──────┘
╱ ╲
┌──────────┘ └──────────┐
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Old Root │ │ New Page │
│ (left) │ │ (right) │
└─────────────┘ └─────────────┘
The tree grows at the top, not the bottom. All leaves stay at the same depth.
Range Scan
Thanks to linked leaves, range scans are efficient:
pub fn range(self: *Self, start: ?[]const u8, end: ?[]const u8) !Iterator {
// 1. Find starting leaf
const start_page = if (start) |k|
try self.findLeaf(k)
else
try self.findLeftmostLeaf();
return Iterator{
.btree = self,
.current_page = start_page,
.current_slot = 0,
.end_key = end,
};
}
Iterator walks the leaf chain:
Start key: "dog" End key: "hamster"
┌─────────────────────────────────────┐
│ Internal nodes │
└─────────────────────────────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ cat│cow│ │──►│ dog│elk│fox│──►│goat│ham│ │
└─────────────┘ └─────────────┘ └─────────────┘
▲ ▲
│ start │ end
└───────────────────┘
scan range
No need to traverse internal nodes for each entry - just follow the links.
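The iterator's walk over the leaf chain is a plain linked-list traversal. A Python sketch with leaves matching the diagram (this simulates the idea, not the paged implementation):

```python
# Each leaf: (sorted entries, index of next leaf or None)
leaves = [
    ([("cat", 1), ("cow", 2)], 1),
    ([("dog", 3), ("elk", 4), ("fox", 5)], 2),
    ([("goat", 6), ("hamster", 7)], None),
]

def scan(start_leaf, start_key, end_key):
    """Walk the leaf chain from start_leaf, yielding entries in [start_key, end_key]."""
    leaf = start_leaf
    while leaf is not None:
        entries, nxt = leaves[leaf]
        for k, v in entries:
            if k < start_key:
                continue
            if k > end_key:
                return          # past the end key: stop
            yield k, v
        leaf = nxt              # follow the sibling pointer

print([k for k, _ in scan(1, "dog", "hamster")])
# ['dog', 'elk', 'fox', 'goat', 'hamster']
```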
Concurrency
We use latch crabbing (simplified):
Lookup (read-only):
1. Latch root (shared)
2. Latch child (shared)
3. Unlatch root
4. Latch grandchild (shared)
5. Unlatch child
... continue to leaf
Insert (may modify):
1. Latch root (exclusive)
2. If child is safe (won't split), latch child and unlatch root
3. Continue down
A "safe" node is one with enough space that it won't split (for insert) or won't underflow (for delete).
The BTree Struct
pub const BTree = struct {
bp: *BufferPool, // Page cache
root_page: PageId, // Current root
comparator: KeyComparator, // For ordering
page_size: u32, // Usually 4096
allocator: Allocator,
// ...methods
};
This is just ~40 bytes of metadata. The actual data lives in pages on disk.
Key Design Decisions
Variable-Length Keys and Values
We store the actual bytes, not fixed-size slots. This supports:
- Short keys efficiently (no wasted space)
- Long keys (up to page size)
- Any binary data
Separator Keys in Internal Nodes
Internal nodes store separator keys, not full key-value pairs. This maximizes fanout (keys per internal node).
No Deletion (Yet)
Our implementation doesn't handle deletes with rebalancing. For a knowledge graph database, we can use soft deletes (mark as deleted) and periodic compaction.
Page-Based, Not Block-Based
Each node is exactly one page. No spanning, no compression across pages. Simple and predictable.
Example: Full Insert Sequence
Insert "zebra" into a tree:
Step 1: Start at root
┌─────────────────────────────────────────────┐
│ Root (Internal) │
│ [lion] [rabbit] │
│ ↓ ↓ │
│ page 2 page 3 │→ page 4 (rightmost)
└─────────────────────────────────────────────┘
"zebra" > "rabbit" → go right to page 4
Step 2: Descend to leaf (page 4)
┌─────────────────────────────────────────────┐
│ Leaf (Page 4) │
│ snake:v1 tiger:v2 wolf:v3 │
│ [has space for zebra] │
└─────────────────────────────────────────────┘
Step 3: Insert into leaf
┌─────────────────────────────────────────────┐
│ Leaf (Page 4) │
│ snake:v1 tiger:v2 wolf:v3 zebra:v4 │
└─────────────────────────────────────────────┘
Done! No splits needed.
Performance Characteristics
| Operation | Average Case | Worst Case |
|---|---|---|
| Lookup | O(log N) | O(log N) |
| Insert | O(log N) | O(log N) |
| Range(k) | O(log N + k) | O(log N + k) |
Where N is the number of entries and k is the number of entries in range.
Disk I/O per operation: ~3-4 reads for trees up to billions of entries.
Write-Ahead Log (WAL)
The Problem
Imagine you're updating a B+Tree. You need to:
- Modify a leaf page
- Maybe split it (modify parent)
- Maybe split the parent (modify grandparent)
- Update the root
If power fails between steps 2 and 3, your database is corrupted - the tree structure is broken.
Even a single page write isn't safe. A 4KB page write might not be atomic at the hardware level - you could end up with half old data, half new data (a "torn write").
The Insight
Instead of modifying data pages directly, first write a log of what you intend to do. The log is append-only and sequential. Only after the log is safely on disk do you modify the actual data pages.
If you crash:
- Before log write: Nothing happened, no corruption
- After log write, before data write: Replay the log to finish the operation
- After both: Everything is fine
This is the write-ahead rule: log first, then data.
Why Append-Only Sequential Writes Are Special
The WAL relies on a key property: sequential append is much safer than random writes.
Random writes (data pages): Sequential append (WAL):
┌─────┐ ┌─────┐ ┌─────┐ ┌─────────────────────────┐
│ P1 │ │ P2 │ │ P3 │ │ Log Log Log Log ... │
└─────┘ └─────┘ └─────┘ └─────────────────────────┘
↑ ↑ ↑ ↑
│ │ │ │
write write write single write position
With random writes, the disk head jumps around. A crash can leave any combination of pages in inconsistent states.
With sequential append:
- Only one "frontier" where writing happens
- Everything before the frontier is complete
- Everything after doesn't exist yet
- Much simpler to reason about crash states
The System Primitive: fsync
The WAL's safety depends on one critical system call: fsync (or fdatasync).
write(fd, data, len); // Data goes to OS buffer cache
fsync(fd); // Forces data to physical disk platters
Without fsync, your "written" data might sit in RAM for seconds or minutes. A power failure loses it. fsync is a durability barrier - when it returns, data is on persistent storage.
This is expensive (milliseconds, not microseconds), so we batch multiple records into frames before syncing.
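The gap is worth quantifying. A back-of-envelope model in Python, using assumed (not measured) numbers for fsync latency and per-record append cost:

```python
FSYNC_MS = 1.0    # assumed cost of one fsync
RECORD_US = 5.0   # assumed in-memory append cost per record

def commits_per_second(batch):
    """Throughput when `batch` commits share a single fsync."""
    batch_ms = FSYNC_MS + batch * RECORD_US / 1000.0
    return batch / (batch_ms / 1000.0)

print(round(commits_per_second(1)))    # 995: one fsync per commit caps throughput
print(round(commits_per_second(100)))  # 66667: the sync cost is amortized
```

With these assumptions, batching 100 commits per fsync improves throughput by roughly 67x, which is why group commit is standard practice.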
WAL Structure
┌────────────────────────────────────────────────────────────┐
│ WAL File Layout │
├────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Header (4KB) │ │
│ │ • magic: 0x574C4F47 ("WLOG") │ │
│ │ • database_uuid: links WAL to its database │ │
│ │ • frame_count: how many frames written │ │
│ │ • checkpoint_lsn: recovery starting point │ │
│ │ • checksum: detects corruption │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Frame 0 (4KB) │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Frame Header (32 bytes) │ │ │
│ │ │ • frame_number: 0 │ │ │
│ │ │ • record_count: 3 │ │ │
│ │ │ • data_size: 156 │ │ │
│ │ │ • checksum: CRC of data area │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Record: TXN_BEGIN (txn=1, lsn=1) │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Record: INSERT (txn=1, lsn=2, payload=...) │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Record: TXN_COMMIT (txn=1, lsn=3) │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ │ ... unused space ... │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Frame 1 (4KB) │ │
│ │ ... │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
LSN: The Universal Clock
Every record gets a Log Sequence Number (LSN) - a monotonically increasing integer.
LSN 1: TXN_BEGIN (txn 1)
LSN 2: INSERT key="alice" (txn 1)
LSN 3: INSERT key="bob" (txn 1)
LSN 4: TXN_BEGIN (txn 2)
LSN 5: DELETE key="charlie" (txn 2)
LSN 6: TXN_COMMIT (txn 1)
LSN 7: TXN_ABORT (txn 2)
LSN serves multiple purposes:
- Ordering: Total order of all operations
- Recovery point: "Replay from LSN 42"
- Page tracking: Each data page stores the LSN of the last modification
The prev_lsn Chain
Each record stores prev_lsn - the previous LSN for the same transaction:
LSN 1: TXN_BEGIN (txn=1, prev_lsn=0)
LSN 2: INSERT (txn=1, prev_lsn=1) ←─┐
LSN 3: INSERT (txn=2, prev_lsn=0) │
LSN 4: UPDATE (txn=1, prev_lsn=2) ───┘
LSN 5: TXN_COMMIT (txn=1, prev_lsn=4)
This creates a backward chain for each transaction:
Transaction 1: LSN 5 → LSN 4 → LSN 2 → LSN 1
Transaction 2: LSN 3 (just one operation)
Why? Rollback. To abort transaction 1, follow the chain backward, undoing each operation.
Write Flow
Application
│
▼
┌─────────────────┐
│ Transaction │
│ INSERT(k,v) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ WAL Manager │
│ │
│ 1. Assign LSN │
│ 2. Build record│
│ 3. Append to │
│ frame buffer│
└────────┬────────┘
│
┌────────────┴────────────┐
│ │
▼ ▼
Frame buffer full? COMMIT requested?
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Flush frame │ │ Flush frame │
│ to disk │ │ + fsync │
└─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ Now safe to │
│ return to │
│ application │
└─────────────┘
The key insight: we buffer multiple records in memory, only hitting disk when:
- The frame buffer is full (4KB of records accumulated)
- A transaction commits (durability guarantee)
Commit Protocol
When you call COMMIT:
1. Write TXN_COMMIT record to WAL buffer
2. Flush current frame to disk
3. Call fsync() - WAIT for disk acknowledgment
4. Return "committed" to application
After step 3, even if power fails immediately, the commit record is on disk. Recovery will see the commit and know the transaction succeeded.
If we crash before step 3? The commit record might be lost. Recovery won't see a commit for that transaction, so it will be rolled back. No partial commits ever escape to the application.
Recovery: Redo and Undo
On startup after a crash, we need to:
- Find the end of valid log - Scan frames, verify checksums, find last valid record
- Redo committed transactions - Replay all operations from committed transactions
- Undo uncommitted transactions - Roll back any transaction without a commit record
WAL Contents
┌───────────────────────────────────────────┐
│ LSN 1: TXN_BEGIN (txn 1) │
│ LSN 2: INSERT x=1 (txn 1) │
│ LSN 3: TXN_BEGIN (txn 2) │
│ LSN 4: INSERT y=2 (txn 2) │
│ LSN 5: TXN_COMMIT (txn 1) ← committed │
│ LSN 6: INSERT z=3 (txn 2) │
│ ─── CRASH HERE ─── │
└───────────────────────────────────────────┘
Recovery:
• Txn 1: Has COMMIT → REDO (x=1 is permanent)
• Txn 2: No COMMIT → UNDO (y=2 and z=3 are rolled back)
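A minimal recovery sketch in Python (this toy version "undoes" uncommitted work by simply never applying it; a real recovery pass redoes, then undoes with compensation records):

```python
def recover(log):
    """Replay a WAL: apply writes from committed transactions, drop the rest."""
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    state = {}
    for rec in log:
        if rec[0] == "INSERT" and rec[1] in committed:
            _, _, key, val = rec
            state[key] = val
    return state

log = [
    ("BEGIN", 1), ("INSERT", 1, "x", 1),
    ("BEGIN", 2), ("INSERT", 2, "y", 2),
    ("COMMIT", 1),
    ("INSERT", 2, "z", 3),
    # --- crash here: txn 2 never committed ---
]
print(recover(log))  # {'x': 1}
```

Only x=1 survives: txn 1 has a commit record, txn 2 does not.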
Why Frames?
Why not just append individual records?
Reason 1: Atomicity
A 4KB write is more likely to be atomic than arbitrary-sized writes. Many storage systems guarantee 512-byte or 4KB atomic writes. By aligning to page boundaries, we reduce torn write risk.
Reason 2: Batching
Each fsync is expensive (~1-10ms). By batching records into frames, we amortize the sync cost across many operations.
Reason 3: Checksums
Each frame has a checksum. During recovery, we can detect partial/corrupted frames and stop replay at the right point.
The Checksum
We use CRC32 to detect corruption:
Frame on disk:
┌──────────────────────────────────────────┐
│ Header: checksum = 0xABCD1234 │
├──────────────────────────────────────────┤
│ Data: [record1][record2][record3] │
└──────────────────────────────────────────┘
On read:
1. Read frame
2. Compute CRC32(data)
3. Compare with stored checksum
4. If mismatch → corruption detected, stop replay
This catches:
- Torn writes (partial frame written)
- Bit rot (storage degradation)
- Wrong WAL file (UUID also checked)
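The check itself is a few lines. A Python sketch of a frame with a leading CRC32 (a simplified layout, not the actual frame header):

```python
import struct
import zlib

def build_frame(records: bytes) -> bytes:
    """Frame = [crc32 of data, 4 bytes little-endian] + data."""
    return struct.pack("<I", zlib.crc32(records)) + records

def verify_frame(frame: bytes) -> bool:
    (stored,) = struct.unpack_from("<I", frame)
    return stored == zlib.crc32(frame[4:])

frame = build_frame(b"[record1][record2][record3]")
print(verify_frame(frame))  # True

# Flip one byte of the data - CRC32 detects any single-byte error
corrupted = frame[:4] + b"X" + frame[5:]
print(verify_frame(corrupted))  # False
```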
The UUID Link
The WAL header stores the database file's UUID:
Database file header: WAL header:
┌──────────────────┐ ┌──────────────────┐
│ uuid: ABC123... │ ←──→ │ uuid: ABC123... │
└──────────────────┘ └──────────────────┘
This prevents accidentally using database A's WAL with database B. The UUID is generated randomly when the database is created.
Record Types
pub const WalRecordType = enum(u8) {
// Transaction control
txn_begin = 0x01,
txn_commit = 0x02,
txn_abort = 0x03,
// Data modifications
insert = 0x10,
update = 0x11,
delete = 0x12,
// Page-level operations
page_write = 0x20,
// Checkpointing
checkpoint_begin = 0x30,
checkpoint_end = 0x31,
// Savepoints
savepoint = 0x40,
savepoint_rollback = 0x41,
// Compensation (for undo during recovery)
clr = 0x50,
};
WalManager API
// Append a record, get back its LSN
const lsn = try wal.appendRecord(.insert, txn_id, prev_lsn, payload);
// Force all buffered records to disk
try wal.sync();
// Iterate records for recovery
var iter = wal.iterate(start_lsn);
while (try iter.next(&buf)) |record| {
// Process record
}
// Update checkpoint position
try wal.setCheckpointLsn(lsn);
Summary
| Concept | Purpose |
|---|---|
| Write-ahead rule | Log before data = crash safety |
| Sequential append | Simpler crash states than random writes |
| fsync | Durability barrier to physical storage |
| LSN | Universal ordering of all operations |
| prev_lsn chain | Enables transaction rollback |
| Frames | Atomic-ish writes + batching |
| Checksums | Detect corruption and torn writes |
| UUID | Match WAL to correct database file |
The WAL transforms the problem from "make random writes atomic" (very hard) to "make sequential append reliable" (much easier).
Transaction Manager
What is a Transaction?
A transaction is a logical unit of work - a group of operations that should either all succeed or all fail together.
Without transactions: With transactions:
1. Debit Alice $100 ✓ BEGIN
2. ─── CRASH ─── 1. Debit Alice $100
3. Credit Bob $100 ✗ 2. Credit Bob $100
COMMIT
Result: Alice lost $100,
Bob got nothing Result: Either both happen,
or neither happens
The ACID Properties
Transactions guarantee four properties:
| Property | Meaning | How We Achieve It |
|---|---|---|
| Atomicity | All or nothing | WAL + rollback |
| Consistency | Valid state to valid state | Application logic |
| Isolation | Transactions don't interfere | Timestamps + MVCC |
| Durability | Committed = permanent | WAL + fsync |
The Core Data Structures
Transaction (User-Facing Handle)
This is what the application sees:
pub const Transaction = struct {
id: u64, // Unique identifier (1, 2, 3, ...)
state: TxnState, // active, committed, or aborted
mode: TxnMode, // read_only or read_write
isolation: IsolationLevel, // snapshot, read_committed, serializable
start_ts: u64, // When transaction started (for visibility)
commit_ts: u64, // When committed (0 until commit)
};
TxnEntry (Internal State)
The manager keeps more detailed state internally:
const TxnEntry = struct {
txn: Transaction, // The user-facing handle
last_lsn: u64, // LSN of most recent operation
begin_lsn: u64, // LSN of TXN_BEGIN record
savepoints: ArrayList, // Stack of savepoints
undo_log: ArrayList, // Operations to reverse on abort
};
TxnManager (The Coordinator)
pub const TxnManager = struct {
allocator: Allocator,
wal: *WalManager, // For durability
active_txns: HashMap(u64, TxnEntry), // All running transactions
next_txn_id: u64, // Counter for IDs
current_ts: u64, // Global timestamp clock
committed_count: u64, // Statistics
aborted_count: u64,
mutex: Mutex, // Thread safety
};
Transaction Lifecycle
1. BEGIN
When you start a transaction:
Application TxnManager WAL
│ │ │
│ begin(read_write, snapshot) │ │
├─────────────────────────────►│ │
│ │ │
│ │ 1. Lock mutex │
│ │ 2. Assign txn_id = 1 │
│ │ 3. Assign start_ts = 1 │
│ │ 4. appendRecord(TXN_BEGIN) │
│ ├──────────────────────────────►│
│ │ lsn = 1 │
│ │◄──────────────────────────────┤
│ │ │
│ │ 5. Create TxnEntry │
│ │ last_lsn = 1 │
│ │ 6. Store in active_txns │
│ │ 7. Unlock mutex │
│ │ │
│ Transaction { id=1, ... } │ │
│◄─────────────────────────────┤ │
The start_ts is crucial for isolation - it determines what data this transaction can "see".
2. Operations
Each modification goes through logOperation:
Application TxnManager WAL
│ │ │
│ logOperation(INSERT, data) │ │
├─────────────────────────────►│ │
│ │ │
│ │ 1. Check txn.canWrite() │
│ │ 2. Get TxnEntry │
│ │ 3. appendRecord(INSERT, │
│ │ txn_id=1, │
│ │ prev_lsn=1, ◄── last_lsn
│ │ payload=data) │
│ ├──────────────────────────────►│
│ │ lsn = 2 │
│ │◄──────────────────────────────┤
│ │ │
│ │ 4. Update last_lsn = 2 │
│ │ │
│ lsn = 2 │ │
│◄─────────────────────────────┤ │
Notice how prev_lsn points to the previous operation. This builds a backward chain.
3. COMMIT
Commit makes everything permanent:
Application TxnManager WAL Disk
│ │ │ │
│ commit(&txn) │ │ │
├─────────────────────────────►│ │ │
│ │ │ │
│ │ 1. Check txn.state == active │ │
│ │ 2. appendRecord(TXN_COMMIT, │ │
│ │ prev_lsn=last_lsn) │ │
│ ├──────────────────────────────►│ │
│ │ │ │
│ │ 3. wal.sync() │ │
│ ├──────────────────────────────►│ fsync() │
│ │ ├─────────────►│
│ │ │ durable! │
│ │ │◄─────────────┤
│ │◄──────────────────────────────┤ │
│ │ │ │
│ │ 4. Assign commit_ts │ │
│ │ 5. Set txn.state = committed │ │
│ │ 6. Remove from active_txns │ │
│ │ 7. Increment committed_count │ │
│ │ │ │
│ OK │ │ │
│◄─────────────────────────────┤ │ │
The critical point: We only return success to the application AFTER fsync() completes. This is the durability guarantee.
4. ABORT
Abort discards everything:
Application TxnManager WAL
│ │ │
│ abort(&txn) │ │
├─────────────────────────────►│ │
│ │ │
│ │ 1. appendRecord(TXN_ABORT) │
│ ├──────────────────────────────►│
│ │ │
│ │ 2. wal.sync() │
│ │ │
│ │ 3. Set txn.state = aborted │
│ │ 4. Clean up TxnEntry │
│ │ 5. Remove from active_txns │
│ │ │
│ OK │ │
│◄─────────────────────────────┤ │
We log TXN_ABORT so that during crash recovery, we know this transaction was intentionally aborted.
The prev_lsn Chain
Every record points back to the previous record from the same transaction. This creates a linked list through the WAL:
┌─────────────────────────────────────────────────────────────────┐
│ WAL │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LSN 1: TXN_BEGIN (txn=1, prev_lsn=0) ◄─────────────┐ │
│ │ │
│ LSN 2: INSERT (txn=1, prev_lsn=1) ──────────────┘ │
│ ▲ │
│ │ │
│ LSN 3: TXN_BEGIN (txn=2, prev_lsn=0) ◄───────┐ │
│ │ │
│ LSN 4: UPDATE (txn=1, prev_lsn=2) ────────┼───┐ │
│ ▲ │ │ │
│ │ │ │ │
│ LSN 5: DELETE (txn=2, prev_lsn=3) ────────┘ │ │
│ │ │
│ LSN 6: TXN_COMMIT (txn=1, prev_lsn=4) ─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Transaction 1's chain: 6 → 4 → 2 → 1
Transaction 2's chain: 5 → 3
Why is this useful?
- Rollback: To abort transaction 1, follow 6->4->2->1, undoing each operation
- Recovery: After crash, find uncommitted transactions by following chains
- No scanning: Don't need to scan entire WAL to find a transaction's operations
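Following a chain is a simple backward walk. Modeling the WAL above as a dict in Python (illustrative, not the on-disk format):

```python
# lsn -> (txn_id, prev_lsn, op); prev_lsn == 0 marks the start of a chain
wal = {
    1: (1, 0, "TXN_BEGIN"),
    2: (1, 1, "INSERT"),
    3: (2, 0, "TXN_BEGIN"),
    4: (1, 2, "UPDATE"),
    5: (2, 3, "DELETE"),
    6: (1, 4, "TXN_COMMIT"),
}

def chain(last_lsn):
    """Follow prev_lsn backward from a transaction's most recent record."""
    lsns = []
    lsn = last_lsn
    while lsn != 0:
        lsns.append(lsn)
        lsn = wal[lsn][1]
    return lsns

print(chain(6))  # [6, 4, 2, 1] - transaction 1, newest to oldest
print(chain(5))  # [5, 3]       - transaction 2
```

To roll back a transaction, you would undo each operation in exactly this order.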
Savepoints: Partial Rollback
Sometimes you want to undo part of a transaction, not all of it:
BEGIN;
INSERT user Alice; -- LSN 2
SAVEPOINT before_orders; -- Remember this point
INSERT order #1; -- LSN 4
INSERT order #2; -- LSN 5
-- Oops, orders were wrong
ROLLBACK TO before_orders; -- Undo LSN 4, 5
INSERT order #3; -- LSN 7
COMMIT;
How Savepoints Work
Savepoint Stack Undo Log
┌─────────────────┐ ┌─────────────┐
│ "before_orders" │ │ INSERT #1 │ ← position 1
tm.savepoint() ──► │ lsn: 3 │ │ INSERT #2 │ ← position 2
│ undo_pos: 0 │──────────► └─────────────┘
└─────────────────┘ ▲
│
truncate here
on rollback
When we rollback to savepoint:
- Find the savepoint by name
- Log SAVEPOINT_ROLLBACK to WAL
- Truncate undo_log to undo_position
- Remove savepoints created after this one
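These steps fit in a small class. A Python sketch of the savepoint mechanics (the WAL logging step is omitted; names are illustrative):

```python
class Txn:
    def __init__(self):
        self.undo_log = []      # operations to reverse on abort
        self.savepoints = {}    # name -> undo_log position at savepoint time

    def do(self, op):
        self.undo_log.append(op)

    def savepoint(self, name):
        self.savepoints[name] = len(self.undo_log)

    def rollback_to(self, name):
        pos = self.savepoints[name]
        undone = self.undo_log[pos:]
        del self.undo_log[pos:]  # truncate the undo log to the saved position
        # drop savepoints created after this one
        self.savepoints = {n: p for n, p in self.savepoints.items() if p <= pos}
        return undone

t = Txn()
t.do("INSERT user Alice")
t.savepoint("before_orders")
t.do("INSERT order #1")
t.do("INSERT order #2")
print(t.rollback_to("before_orders"))  # ['INSERT order #1', 'INSERT order #2']
print(t.undo_log)                      # ['INSERT user Alice']
```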
Timestamps and Isolation
Every transaction gets two timestamps:
start_ts: Assigned at BEGIN - determines what data is visible
commit_ts: Assigned at COMMIT - marks when changes become visible to others
Snapshot Isolation Example
Timeline:
─────────────────────────────────────────────────────────────────►
│ │ │ │ │
ts=1 ts=2 ts=3 ts=4 ts=5
│ │ │ │ │
Txn A Txn A Txn B Txn A Txn B
BEGIN INSERT BEGIN COMMIT reads
start_ts=1 x=100 start_ts=3 commit_ts=4 x
Question: What does Txn B see when it reads x?
With snapshot isolation: Txn B started at ts=3, before Txn A committed at ts=4. So Txn B does NOT see x=100. It sees whatever x was before Txn A.
This is why we track start_ts and commit_ts - they determine visibility.
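The visibility rule reduces to one comparison. A Python sketch (commit_ts == 0 meaning "not yet committed" follows the Transaction struct above):

```python
def visible(version_commit_ts, reader_start_ts):
    """A version is visible iff it committed at or before the reader's start."""
    return version_commit_ts != 0 and version_commit_ts <= reader_start_ts

# Txn A wrote x=100 and committed at ts=4; Txn B began at ts=3.
print(visible(4, 3))  # False - B does not see x=100
print(visible(4, 5))  # True  - a reader that begins at ts=5 does
print(visible(0, 9))  # False - uncommitted versions are never visible
```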
Read-Only Transactions
Read-only transactions are special:
var txn = try tm.begin(.read_only, .snapshot);
// This fails:
const result = tm.logOperation(&txn, .insert, "data");
// Error: TxnError.ReadOnly
Why have read-only transactions?
- No WAL writes - faster, no disk I/O for reads
- No locks needed - can always proceed
- Never abort due to conflicts - just reads a snapshot
- Helps garbage collection - we know this txn won't modify old versions
Thread Safety
The TxnManager uses a mutex to protect shared state:
pub fn begin(self: *Self, ...) TxnError!Transaction {
self.mutex.lock(); // ← Only one thread at a time
defer self.mutex.unlock(); // ← Released when function exits
// Safe to modify next_txn_id, active_txns, etc.
...
}
This is a simple approach. The mutex is held briefly (microseconds), and the WAL I/O dominates latency anyway.
Statistics
The manager tracks statistics for monitoring:
const stats = tm.getStats();
stats.active_count // Currently running transactions
stats.committed_count // Total successful commits
stats.aborted_count // Total aborts
stats.oldest_active_id // Oldest running transaction
stats.current_ts // Current timestamp
oldest_active_id is important for garbage collection - we can't clean up any data that this transaction might still need to see.
API Summary
var tm = TxnManager.init(allocator, &wal);
// Start a transaction
var txn = try tm.begin(.read_write, .snapshot);
// Log operations
_ = try tm.logOperation(&txn, .insert, payload);
_ = try tm.logOperation(&txn, .update, payload);
// Create savepoint
try tm.savepoint(&txn, "before_danger");
// More operations
_ = try tm.logOperation(&txn, .delete, payload);
// Rollback to savepoint if needed
try tm.rollbackToSavepoint(&txn, "before_danger");
// Commit or abort
try tm.commit(&txn); // or tm.abort(&txn)
Integration
┌─────────────────────────────────────────────────────────────────┐
│ Application │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TxnManager │
│ • Manages transaction lifecycle │
│ • Maintains prev_lsn chains │
│ • Tracks active transactions │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ WalManager │
│ • Durability through logging │
│ • fsync on commit │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ BufferPool + BTree │
│ • Actual data storage │
│ • Page management │
│ • (Will check txn visibility for MVCC) │
└─────────────────────────────────────────────────────────────────┘
The Transaction Manager is the coordinator - it doesn't store data itself, but ensures that all operations follow ACID rules by orchestrating the WAL, tracking state, and enforcing invariants.
Checkpointing
The Problem
Without checkpointing, two bad things happen:
- WAL grows forever - Every transaction appends records, file gets huge
- Recovery takes forever - After crash, must replay entire WAL from the beginning
After 1 year of operation:
WAL: [millions of records from day 1 ... to today]
◄──────────────────────────────────────────►
Recovery replays ALL of this
Could take hours!
The Solution
A checkpoint says: "Everything up to this point is safely on disk in the main database file. We don't need the old WAL records anymore."
Before checkpoint:
┌─────────────────────────────────────────────────────────┐
│ Buffer Pool (RAM) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Page 5 │ │ Page 12 │ │ Page 3 │ │ Page 8 │ │
│ │ DIRTY │ │ DIRTY │ │ clean │ │ DIRTY │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────┘
│
│ Changes only in RAM!
│ If power fails, they're lost
▼
┌─────────────────────────────────────────────────────────┐
│ Database File (Disk) │
│ [old versions of pages] │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ WAL (Disk) │
│ [record][record][record][record][record][record]... │
│ ◄──────────────────────────────────────────────────► │
│ Recovery needs ALL of these │
└─────────────────────────────────────────────────────────┘
After checkpoint:
┌─────────────────────────────────────────────────────────┐
│ Buffer Pool (RAM) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Page 5 │ │ Page 12 │ │ Page 3 │ │ Page 8 │ │
│ │ clean │ │ clean │ │ clean │ │ clean │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────┘
│ │ │
│ │ FLUSHED! │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────┐
│ Database File (Disk) │
│ [current versions of pages] │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ WAL (Disk) │
│ [old records...] │ CHECKPOINT │ [new records...] │
│ ▲ │
│ checkpoint_lsn │
│ │ │
│ Recovery starts HERE ─────► │
│ (old records ignored) │
└─────────────────────────────────────────────────────────┘
How It Works Step by Step
Step 1: Write CHECKPOINT_BEGIN to WAL
WAL: [...existing records...][CHECKPOINT_BEGIN, lsn=1000]
This marks the start of the checkpoint. If we crash during checkpointing, recovery sees this and knows a checkpoint was in progress.
Step 2: Sync WAL
fsync(wal_file)
Ensures CHECKPOINT_BEGIN is on disk before we start flushing pages.
Step 3: Flush Dirty Pages
Scan the buffer pool and write dirty pages to the database file:
for each frame in buffer_pool:
if frame.dirty:
write frame.data to database_file at page offset
frame.dirty = false
Passive mode: Skip pages that are currently pinned (in use by transactions). This avoids blocking active work.
Full mode: Wait for latches if needed. Guarantees all dirty pages are flushed.
Step 4: Sync Database File
fsync(database_file)
Critical! This ensures all the page writes are actually on disk, not sitting in OS buffers.
Step 5: Write CHECKPOINT_END to WAL
WAL: [...][CHECKPOINT_BEGIN][CHECKPOINT_END, lsn=1002]
This marks successful completion. The payload includes stats (pages flushed).
Step 6: Update checkpoint_lsn
wal_header.checkpoint_lsn = 1002 // LSN of CHECKPOINT_END
fsync(wal_file)
This is the key marker for recovery. It says "everything before LSN 1002 is safely on disk."
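The six steps can be condensed into one hedged Python sketch over plain in-memory structures (a list standing in for the WAL, dicts for pages). The fsync points are noted as comments since no real files are involved; all names are illustrative.

```python
def checkpoint(wal_log, frames, disk_pages):
    """Illustrative checkpoint sequence. Returns (checkpoint_lsn, pages_flushed)."""
    wal_log.append("CHECKPOINT_BEGIN")        # step 1
    # step 2: fsync(wal_file)
    flushed = 0
    for frame in frames:                      # step 3: flush dirty pages
        if frame["dirty"]:
            disk_pages[frame["page_id"]] = frame["data"]
            frame["dirty"] = False
            flushed += 1
    # step 4: fsync(database_file)
    wal_log.append("CHECKPOINT_END")          # step 5
    checkpoint_lsn = len(wal_log) - 1         # step 6: record in WAL header
    return checkpoint_lsn, flushed
```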
The Checkpoint Modes
Passive Mode
"Checkpoint what you can without disrupting active work"
for each frame:
if dirty AND not pinned AND can get latch immediately:
flush it
else:
skip it (maybe next time)
Good for: Background checkpointing during normal operation. Doesn't block transactions.
Bad: May leave some dirty pages unflushed.
Full Mode
"Checkpoint everything, wait if necessary"
for each frame:
if dirty:
wait for latch (spin until available)
flush it
Good for: Complete checkpoint before shutdown, or when WAL is getting too large.
Bad: May briefly block transactions waiting for latches.
Truncate Mode
"Full checkpoint, then reset WAL to beginning"
1. Do full checkpoint
2. Truncate WAL file to just the header
3. Reset frame_count to 0
Good for: Reclaiming disk space when WAL has grown large.
Requires: No active readers (they might be reading old WAL frames).
Why CHECKPOINT_BEGIN and CHECKPOINT_END?
Consider what happens if we crash during checkpointing:
Crash scenario 1: After BEGIN, before any flushes
─────────────────────────────────────────────────
WAL: [...records...][CHECKPOINT_BEGIN]
▲
crash
Recovery: Sees incomplete checkpoint (BEGIN but no END).
Ignores it, replays from previous checkpoint_lsn.
Safe!
Crash scenario 2: After some flushes, before END
─────────────────────────────────────────────────
WAL: [...records...][CHECKPOINT_BEGIN]
▲
crash
Database file: Has SOME pages flushed, but not all
Recovery: Sees incomplete checkpoint.
Some pages are updated, some aren't.
Replays WAL from previous checkpoint_lsn.
WAL replay overwrites pages with correct data.
Safe!
Crash scenario 3: After END
───────────────────────────
WAL: [...records...][CHECKPOINT_BEGIN][CHECKPOINT_END]
▲
crash
Recovery: Sees complete checkpoint.
Starts replay from checkpoint_lsn.
Fast recovery!
Statistics
Each checkpoint returns stats:
CheckpointStats{
pages_flushed: 42, // How many dirty pages written
duration_ns: 15_000_000, // 15ms
checkpoint_lsn: 1002, // New checkpoint position
wal_truncated: false, // Whether WAL was reset
}
Useful for monitoring:
- If pages_flushed is always high, consider checkpointing more often
- If duration_ns is high, the disk might be slow
- Track checkpoint_lsn growth over time
When to Checkpoint
// Check if WAL has grown too large (max_wal_frames = 1000)
if (checkpointer.shouldCheckpoint(1000)) {
    _ = try checkpointer.checkpoint(.passive);
}
Common strategies:
- Size-based: When WAL exceeds N frames
- Time-based: Every N minutes
- Transaction-based: After N commits
- On shutdown: Always do full checkpoint before closing
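The first three strategies can be combined into a single trigger predicate. A minimal Python sketch, with all thresholds and names hypothetical:

```python
def should_checkpoint(wal_frames, seconds_since_last, commits_since_last,
                      max_frames=1000, max_seconds=300, max_commits=10_000):
    """Checkpoint if ANY trigger fires: size-based, time-based,
    or transaction-based."""
    return (wal_frames >= max_frames
            or seconds_since_last >= max_seconds
            or commits_since_last >= max_commits)
```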
Integration
┌─────────────────────────────────────────────────────────────────┐
│ Checkpointer │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │BufferPool│ │PageManager│ │WalManager│ │
│ │ │ │ │ │ │ │
│ │• frames │ │• writePage│ │• append │ │
│ │• dirty │ │• sync │ │• sync │ │
│ │• latches │ │ │ │• setLsn │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────┘
The Checkpointer coordinates between:
- BufferPool: Knows which pages are dirty
- PageManager: Writes pages to database file
- WalManager: Records checkpoint markers, updates checkpoint_lsn
API
var checkpointer = Checkpointer.init(allocator, &bp, &pm, &wal);
// Passive checkpoint (background)
const passive_stats = try checkpointer.checkpoint(.passive);
// Full checkpoint (before shutdown)
const full_stats = try checkpointer.checkpoint(.full);
// Truncate checkpoint (reclaim WAL space)
const truncate_stats = try checkpointer.checkpoint(.truncate);
// Check if checkpoint needed
if (checkpointer.shouldCheckpoint(1000)) {
_ = try checkpointer.checkpoint(.passive);
}
// Get last checkpoint stats
if (checkpointer.getLastStats()) |stats| {
    std.debug.print("Flushed {} pages in {}ms\n", .{ stats.pages_flushed, stats.duration_ns / 1_000_000 });
}
Summary
| Concept | Purpose |
|---|---|
| Dirty page flush | Write modified pages to database file |
| checkpoint_lsn | Recovery starting point |
| CHECKPOINT_BEGIN/END | Crash safety during checkpoint |
| Passive mode | Non-blocking background checkpoint |
| Full mode | Complete checkpoint, may block briefly |
| Truncate mode | Reclaim WAL disk space |
Checkpointing is the bridge between WAL-based durability (fast commits) and bounded recovery time (fast restart). Without it, durability requires replaying the entire history on every startup.
Crash Recovery
The Problem
Databases crash. Power fails, operating systems panic, processes get killed. When this happens, the database must be able to restart and return to a consistent state without losing committed data.
Consider what might be in-flight when a crash occurs:
Scenario: Crash during normal operation
Buffer Pool (RAM):
Page 5: dirty, modified by committed txn
Page 8: dirty, modified by uncommitted txn
Page 12: dirty, partially written by in-progress txn
WAL (Disk):
[committed txn records][uncommitted txn records][partial frame?]
Database File (Disk):
[old versions of pages - some stale, some current]
After restart, we need to:
- Redo committed work - If a transaction committed (COMMIT record in WAL) but its changes weren't flushed to the database file, redo them
- Ignore uncommitted work - If a transaction never committed, pretend it never happened
The ARIES Philosophy
Our recovery is inspired by ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), a recovery algorithm developed at IBM. The key insight:
Write-Ahead Logging + Redo at Recovery = Durability
The WAL contains everything we need. After a crash:
- Read the WAL from the last checkpoint
- Determine which transactions committed
- Redo only committed operations
Two-Phase Recovery
Phase 1: Analysis
Scan the WAL to understand what happened:
WAL Contents:
┌────────────────────────────────────────────────────┐
│ LSN 1: TXN_BEGIN (txn=1) → txn 1 started │
│ LSN 2: INSERT (txn=1) → txn 1 modified │
│ LSN 3: TXN_BEGIN (txn=2) → txn 2 started │
│ LSN 4: INSERT (txn=2) → txn 2 modified │
│ LSN 5: TXN_COMMIT (txn=1) → txn 1 COMMITTED│
│ LSN 6: UPDATE (txn=2) → txn 2 modified │
│ ─── CRASH ─── │
└────────────────────────────────────────────────────┘
Analysis Result:
Transaction 1: COMMITTED (has TXN_COMMIT)
Transaction 2: IN_PROGRESS (no commit/abort)
During analysis, we build:
- Transaction table: State of each transaction (committed, aborted, in-progress)
- Redo list: All data operations that might need to be redone
Phase 2: Redo
Apply committed operations to the database:
Redo Process:
┌─────────────────────────────────────────────────────┐
│ For each operation in redo_list: │
│ 1. Check if transaction committed │
│ - If not: skip (discard uncommitted changes) │
│ 2. Apply the operation to database file │
│ 3. Continue to next operation │
└─────────────────────────────────────────────────────┘
Result:
LSN 2 (INSERT, txn=1): REDO (txn 1 committed)
LSN 4 (INSERT, txn=2): SKIP (txn 2 didn't commit)
LSN 6 (UPDATE, txn=2): SKIP (txn 2 didn't commit)
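Both phases fit in a short sketch over a simplified WAL where each record is a (lsn, type, txn_id, payload) tuple. This is illustrative Python, not LatticeDB's implementation:

```python
def recover(wal_records):
    # Phase 1: Analysis - determine each transaction's outcome.
    states = {}        # txn_id -> "in_progress" | "committed" | "aborted"
    redo_list = []
    for lsn, rtype, txn, payload in wal_records:
        if rtype == "begin":
            states[txn] = "in_progress"
        elif rtype == "commit":
            states[txn] = "committed"
        elif rtype == "abort":
            states[txn] = "aborted"
        else:  # insert / update / delete
            redo_list.append((lsn, txn, payload))
    # Phase 2: Redo - apply only operations from committed transactions.
    applied = [p for lsn, txn, p in redo_list if states.get(txn) == "committed"]
    return states, applied
```

Running this on the WAL from the diagram yields exactly the analysis result shown: txn 1 committed (its insert is redone), txn 2 in progress (its operations are skipped).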
Why No Undo?
Traditional ARIES has three phases: Analysis, Redo, Undo. We skip Undo because of how we structure our system:
Our approach: Don't apply changes to data pages until commit
- During a transaction, changes are in the buffer pool (RAM)
- On commit, dirty pages are flushed
- On crash, uncommitted changes in RAM are simply lost
Traditional approach: Apply changes immediately, undo on abort
- Changes written to data pages as transaction runs
- If abort/crash, must read WAL backwards and undo each change
- More complex, but allows larger transactions (not limited by RAM)
For an embedded database like Lattice, the simpler "don't undo" approach works well.
The Checkpoint Starting Point
Recovery doesn't scan the entire WAL - only from the last checkpoint:
WAL Timeline:
─────────────────────────────────────────────────────────────────►
[old records] │ CHECKPOINT │ [records since checkpoint]
▲ ▲
│ │
checkpoint_lsn │
│
Recovery starts here
(everything before is safely on disk)
The checkpoint_lsn in the WAL header marks where recovery begins. The Checkpointer sets this after successfully flushing all dirty pages.
Transaction States
During recovery, each transaction can be in one of three states:
| State | Meaning | WAL Record | Action |
|---|---|---|---|
| committed | Completed successfully | TXN_COMMIT present | Redo operations |
| aborted | Explicitly rolled back | TXN_ABORT present | Ignore operations |
| in_progress | Crash during execution | No commit/abort | Ignore operations |
The distinction between aborted and in_progress is informational - both are handled the same way (ignore their changes).
Record Type Handling
Different WAL records are handled differently:
switch (record.record_type) {
.txn_begin => {
// Track new transaction as in_progress
},
.txn_commit => {
// Mark transaction as committed
},
.txn_abort => {
// Mark transaction as aborted
},
.insert, .update, .delete => {
// Data modification - save for redo phase
},
.page_write => {
// Physical page write - save for redo phase
},
.checkpoint_begin, .checkpoint_end => {
// Informational - no action needed
},
.savepoint, .savepoint_rollback => {
// Track but no special handling
},
.clr => {
// Compensation Log Record - for undo operations
},
}
Physical vs Logical Redo
Our recovery supports two types of operations:
Physical: page_write
Contains the complete page image:
Payload: [page_id: 4 bytes][page_data: 4096 bytes]
Redo: Write entire page directly to database file
Logical: insert, update, delete
Contains the operation parameters:
Payload: [key][value][metadata]
Redo: Re-execute the operation against the B+Tree
For simplicity, our implementation currently relies on page_write for physical durability, with logical operations tracked for statistics.
Corruption Detection
When recovery encounters a checksum mismatch, it must determine whether this is:
- Tail corruption - A torn write at the end of the WAL (safe to proceed)
- Mid-log corruption - Real corruption with valid data after it (unsafe)
Scan-Ahead Detection
On checksum mismatch, we scan ahead to check for valid frames:
Scenario 1: Tail Corruption (Torn Write)
─────────────────────────────────────────
Frame 0: checksum OK
Frame 1: checksum OK
Frame 2: checksum MISMATCH
Frame 3: [nothing valid]
Frame 4: [nothing valid]
→ No valid frames after corruption
→ This is a torn write at end of WAL
→ Safe to proceed with frames 0-1
Scenario 2: Mid-Log Corruption (Real Corruption)
────────────────────────────────────────────────
Frame 0: checksum OK
Frame 1: checksum MISMATCH ← corruption here
Frame 2: checksum OK ← but valid data after!
Frame 3: checksum OK
→ Valid frames exist after corruption
→ This is real data corruption
→ FAIL with MidLogCorruption error
Why This Matters
Mid-log corruption is dangerous because the corrupted frame might contain:
- A TXN_COMMIT record we can't see
- Data modifications needed for consistency
If we skip the corrupted frame and continue, we might:
- Treat a committed transaction as uncommitted (data loss)
- Apply partial transaction state (inconsistency)
By failing on mid-log corruption, we force manual intervention (restore from backup) rather than silently losing data.
Implementation
fn hasValidFramesAfter(wal: *WalManager, corrupted_frame: u64) bool {
    // Check up to 10 frames ahead
    const end = @min(corrupted_frame + 10, wal.frame_count);
    for (corrupted_frame + 1..end) |frame_num| {
        if (frameHasValidChecksum(wal, frame_num)) {
            return true; // Mid-log corruption!
        }
    }
    return false; // Tail corruption, safe to stop
}
Statistics
Recovery returns detailed statistics:
RecoveryStats{
start_lsn: 1000, // Where we started (checkpoint_lsn)
end_lsn: 1523, // Last valid record
records_scanned: 523, // Total records processed
transactions_found: 15, // Distinct transactions
transactions_committed: 12, // Successfully committed
transactions_aborted: 2, // Explicitly aborted
transactions_rolled_back: 1, // In-progress at crash
redo_operations: 156, // Operations redone
duration_ns: 45_000_000, // 45ms
stopped_at_corruption: true, // Hit tail corruption
corrupted_frame: 42, // Frame number with bad checksum
}
These are useful for:
- Monitoring recovery time
- Debugging transaction issues
- Capacity planning
- Detecting disk health issues (frequent tail corruption may indicate hardware problems)
Recovery Flow Diagram
┌─────────────────────────────────────────────────────────────────┐
│ Database Startup │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Open WAL File │
│ Read checkpoint_lsn from header │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ ANALYSIS PHASE │
│ │
│ for each record from checkpoint_lsn to end: │
│ - Track transaction states │
│ - Build redo list for data operations │
│ - Stop at corruption/end of valid log │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ REDO PHASE │
│ │
│ for each operation in redo_list: │
│ if transaction.state == committed: │
│ apply operation to database │
│ else: │
│ discard (transaction didn't commit) │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Sync Database File │
│ Return recovery statistics │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Database Ready for Use │
└─────────────────────────────────────────────────────────────────┘
API
// Simple recovery on startup
const stats = try recoverDatabase(allocator, &wal, &pm);
std.debug.print("Recovered {} transactions, redid {} operations in {}ms\n", .{
stats.transactions_committed,
stats.redo_operations,
stats.duration_ns / 1_000_000,
});
// Or use RecoveryManager directly for more control
var rm = RecoveryManager.init(allocator);
const stats = try rm.recover(&wal, &pm);
Integration with Startup
Typical database startup sequence:
1. Open database file (PageManager)
2. Open WAL file (WalManager)
3. Check if recovery needed (WAL has records past checkpoint?)
4. Run recovery
5. Checkpoint to clean slate
6. Ready for operations
Summary
| Concept | Purpose |
|---|---|
| Analysis phase | Determine transaction outcomes |
| Redo phase | Apply committed operations |
| checkpoint_lsn | Recovery starting point |
| Transaction states | committed, aborted, in_progress |
| Checksum verification | Detect corruption/partial writes |
| Tail corruption | Torn write at end, safe to tolerate |
| Mid-log corruption | Real corruption, fail to prevent data loss |
| Scan-ahead detection | Distinguish tail from mid-log corruption |
| No Undo | Uncommitted changes not written to pages |
Recovery transforms a potentially inconsistent crash state into a consistent database by leveraging the WAL as the authoritative record of committed transactions.
Graph Storage
The Property Graph Model
Lattice implements a Labeled Property Graph - the same model used by Neo4j, Amazon Neptune, and other graph databases. It consists of:
Node:
- id: unique identifier (u64)
- labels: set of strings (e.g., "Person", "Employee")
- properties: key-value pairs (e.g., name: "Alice", age: 30)
Edge:
- source: node id
- target: node id
- type: string (e.g., "KNOWS", "WORKS_AT")
- properties: key-value pairs (e.g., since: 2020)
This model is expressive enough to represent almost any domain while remaining simple to query and traverse.
The Storage Challenge
How do you store a graph in a B+Tree (which is fundamentally a key-value store)?
The key insight: decompose the graph into multiple B+Trees, each optimized for a specific access pattern.
┌─────────────────────────────────────────────────────────────────┐
│ Graph Storage Layer │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ SYMBOLS │ │ NODES │ │ EDGES │ │
│ │ B+Tree │ │ B+Tree │ │ B+Tree │ │
│ │ │ │ │ │ │ │
│ │ string → id │ │ node_id → │ │ composite │ │
│ │ id → string │ │ NodeData │ │ key → data │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ LABEL_INDEX │ │
│ │ B+Tree │ │
│ │ │ │
│ │ (label,node) │ │
│ │ → ∅ │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
String Interning (Symbol Table)
Graphs have lots of repeated strings: label names, property keys, edge types. Storing "Person" thousands of times wastes space. Instead, we intern strings.
String Interning:
"Person" ──────► 1000
"Employee" ─────► 1001
"name" ──────► 1002
"KNOWS" ──────► 1003
Now instead of storing "Person" (6 bytes),
we store 1000 (2 bytes)
The Symbol Table uses two B+Trees:
SYMBOLS (forward): SYMBOLS_REVERSE:
"Person" → 1000 1000 → "Person"
"Employee" → 1001 1001 → "Employee"
"name" → 1002 1002 → "name"
"KNOWS" → 1003 1003 → "KNOWS"
Symbol IDs are u16 (0-65535):
- 0: Reserved (null)
- 1-999: Reserved for system use
- 1000-65535: User-defined symbols
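The interning logic itself is simple. A Python sketch with two dicts standing in for the forward and reverse B+Trees (illustrative only):

```python
class SymbolTable:
    USER_BASE = 1000  # IDs 0-999 are reserved

    def __init__(self):
        self.forward = {}   # string -> symbol id
        self.reverse = {}   # symbol id -> string

    def intern(self, s):
        """Return the existing ID, or assign the next free one."""
        if s in self.forward:
            return self.forward[s]
        sid = self.USER_BASE + len(self.forward)
        self.forward[s] = sid
        self.reverse[sid] = s
        return sid
```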
API
var symbols = SymbolTable.init(allocator, &forward_tree, &reverse_tree);
// Intern a string (creates if not exists, returns existing if present)
const person_id = try symbols.intern("Person"); // 1000
// Lookup without creating
const id = try symbols.lookup("Person"); // 1000
// Looking up an unknown string fails:
_ = symbols.lookup("Unknown"); // error: SymbolError.NotFound
// Resolve ID back to string
const name = try symbols.resolve(person_id); // "Person"
defer symbols.freeString(name); // Must free allocated string
Node Storage
Nodes are stored in a B+Tree with simple u64 keys:
NODES B+Tree:
Key: node_id (u64, little-endian)
Value: NodeData (serialized)
NodeData Format
┌────────────────────────────────────────────────────────────┐
│ num_labels: u16 │
│ labels: [symbol_id: u16] × num_labels │
│ num_properties: u16 │
│ properties: [PropertyEntry] × num_properties │
└────────────────────────────────────────────────────────────┘
PropertyEntry:
┌────────────────────────────────────────────────────────────┐
│ key_id: u16 (interned string) │
│ value_type: u8 │
│ value_data: variable │
└────────────────────────────────────────────────────────────┘
Value Types:
0 = Null
1 = Bool (1 byte: 0 or 1)
2 = Int64 (8 bytes, little-endian)
3 = Float64 (8 bytes, IEEE 754)
4 = String (u32 length + bytes)
5 = Bytes (u32 length + bytes)
6 = List (TODO)
7 = Map (TODO)
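The PropertyEntry layout above maps directly onto a binary encoding. A hedged Python sketch using the struct module (illustrative; not LatticeDB's serializer):

```python
import struct

def encode_property(key_id, value):
    """PropertyEntry: key_id (u16 LE) + value_type (u8) + value_data."""
    if value is None:
        return struct.pack("<HB", key_id, 0)               # Null
    if isinstance(value, bool):                            # check before int!
        return struct.pack("<HBB", key_id, 1, int(value))  # Bool
    if isinstance(value, int):
        return struct.pack("<HBq", key_id, 2, value)       # Int64
    if isinstance(value, float):
        return struct.pack("<HBd", key_id, 3, value)       # Float64
    data = value.encode() if isinstance(value, str) else value
    vtype = 4 if isinstance(value, str) else 5             # String / Bytes
    return struct.pack("<HBI", key_id, vtype, len(data)) + data
```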
API
var store = NodeStore.init(allocator, &nodes_tree);
// Create a node
const labels = [_]SymbolId{ person_id, employee_id };
const properties = [_]Property{
.{ .key_id = name_id, .value = .{ .string_val = "Alice" } },
.{ .key_id = age_id, .value = .{ .int_val = 30 } },
};
const node_id = try store.create(&labels, &properties);
// Get a node
var node = try store.get(node_id);
defer node.deinit(allocator);
// Check existence
if (store.exists(node_id)) { ... }
// Update a node
try store.update(node_id, &new_labels, &new_properties);
// Delete a node
try store.delete(node_id);
Edge Storage
Edges are more complex because we need efficient traversal in both directions:
- "Find all people Alice knows" (outgoing)
- "Find all people who know Alice" (incoming)
The Double-Write Pattern
For edge (Alice)-[:KNOWS]->(Bob), we store two entries:
Entry 1 (Outgoing from Alice):
Key: (Alice, OUTGOING, KNOWS, Bob)
Value: edge properties
Entry 2 (Incoming to Bob):
Key: (Bob, INCOMING, KNOWS, Alice)
Value: edge properties (same data)
This doubles write cost but enables O(1) traversal in either direction.
Key Format
Edge Key (19 bytes, big-endian for lexicographic ordering):
┌──────────────────────────────────────────────────────────┐
│ source_id: u64 │ direction: u8 │ type_id: u16 │ target_id: u64 │
│ (8 bytes) │ (1 byte) │ (2 bytes) │ (8 bytes) │
└──────────────────────────────────────────────────────────┘
Direction:
0 = Outgoing (source → target)
1 = Incoming (target ← source)
Big-endian encoding ensures keys sort correctly for range scans:
- All edges from node X are contiguous
- Within that, all outgoing edges are together
- Within that, edges of same type are together
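The encoding and its ordering guarantee can be demonstrated in a few lines of Python (a sketch of the key layout described above, not LatticeDB's code):

```python
import struct

def edge_key(source_id, direction, type_id, target_id):
    """19-byte big-endian composite key (u64 + u8 + u16 + u64).
    Byte-wise comparison then matches numeric comparison, so all edges
    from a node, then per direction, then per type, are contiguous."""
    return struct.pack(">QBHQ", source_id, direction, type_id, target_id)
```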
Why This Key Order?
The key (source, direction, type, target) is optimized for common queries:
Query: "All outgoing edges from Alice"
Scan: (Alice, 0, *, *)
Keys are contiguous!
Query: "All KNOWS edges from Alice"
Scan: (Alice, 0, KNOWS, *)
Even more specific prefix!
Query: "Does Alice know Bob?"
Point lookup: (Alice, 0, KNOWS, Bob)
Direct key access!
API
var store = EdgeStore.init(allocator, &edges_tree);
// Create an edge
const properties = [_]Property{
.{ .key_id = since_id, .value = .{ .int_val = 2020 } },
};
try store.create(alice_id, bob_id, knows_id, &properties);
// Get an edge
var edge = try store.get(alice_id, bob_id, knows_id);
defer edge.deinit(allocator);
// Check existence
if (store.exists(alice_id, bob_id, knows_id)) { ... }
// Delete an edge (removes both outgoing and incoming entries)
try store.delete(alice_id, bob_id, knows_id);
// Iterate all outgoing edges from a node
var iter = try store.getOutgoing(alice_id);
defer iter.deinit();
while (try iter.next()) |edge| {
defer edge.deinit(allocator);
// Process edge...
}
// Iterate incoming edges
var incoming = try store.getIncoming(bob_id);
defer incoming.deinit();
// Filter by edge type
var knows_edges = try store.getOutgoingByType(alice_id, knows_id);
defer knows_edges.deinit();
// Count edges without allocating
const out_count = try store.countOutgoing(alice_id);
const in_count = try store.countIncoming(bob_id);
Label Index
For queries like MATCH (n:Person), we need to find all nodes with a given label efficiently. The Label Index provides this.
LABEL_INDEX B+Tree:
Key: (label_id: u16, node_id: u64) - big-endian
Value: empty (existence only)
How It Works
When creating node Alice with labels [Person, Employee]:
Insert: (Person, Alice) → ∅
Insert: (Employee, Alice) → ∅
Query "all Person nodes":
Range scan: (Person, 0) to (Person, MAX)
Returns: Alice, Bob, Carol, ...
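The range scan can be sketched over a sorted list of (label_id, node_id) pairs standing in for the B+Tree (illustrative Python):

```python
from bisect import bisect_left

def nodes_by_label(sorted_keys, label_id):
    """Range scan from (label_id, 0): collect node ids until the
    label changes."""
    i = bisect_left(sorted_keys, (label_id, 0))
    out = []
    while i < len(sorted_keys) and sorted_keys[i][0] == label_id:
        out.append(sorted_keys[i][1])
        i += 1
    return out
```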
API
var index = LabelIndex.init(allocator, &label_tree);
// Add labels when creating a node
try index.addLabels(&[_]SymbolId{ person_id, employee_id }, node_id);
// Check if node has a label
if (index.hasLabel(person_id, node_id)) { ... }
// Remove a label from a node
try index.remove(person_id, node_id);
// Get all nodes with a label (allocates result slice)
const person_nodes = try index.getNodesByLabel(person_id);
defer allocator.free(person_nodes);
for (person_nodes) |node_id| {
// Process each Person node...
}
// Lazy iteration (memory-efficient for large result sets)
var iter = try index.iterNodesByLabel(person_id);
defer iter.deinit();
while (try iter.next()) |node_id| {
// Process one node at a time...
}
// Count nodes without allocating
const person_count = try index.countNodesByLabel(person_id);
Putting It Together
A complete graph operation involves multiple B+Trees:
Creating node Alice:Person with name="Alice":
1. Symbol Table:
- intern("Person") → 1000
- intern("name") → 1001
2. Node Store:
- Allocate node_id = 1
- Serialize: [1 label: 1000][1 prop: 1001="Alice"]
- Insert: 1 → serialized_data
3. Label Index:
- Insert: (1000, 1) → ∅
Creating edge (Alice)-[:KNOWS]->(Bob):
1. Symbol Table:
- intern("KNOWS") → 1002
2. Edge Store:
- Insert: (1, OUT, 1002, 2) → properties
- Insert: (2, IN, 1002, 1) → properties
Performance Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| Create node | O(log n) | One B+Tree insert + label index inserts |
| Get node | O(log n) | Single B+Tree lookup |
| Delete node | O(log n) | One B+Tree delete |
| Create edge | O(log n) | Two B+Tree inserts (double-write) |
| Get edge | O(log n) | Single B+Tree lookup |
| Delete edge | O(log n) | Two B+Tree deletes (double-delete) |
| Check label | O(log n) | Single B+Tree lookup |
| Remove label | O(log n) | Single B+Tree delete |
| All nodes with label | O(log n + k) | Range scan, k = result count |
| All edges from node | O(log n + k) | Range scan, k = edge count |
Where n = total items in the respective B+Tree.
Current Limitations
- No MVCC: All operations see latest data (no snapshot isolation yet)
- Property updates overwrite: No partial property updates
- No underflow handling: Deleted entries leave wasted space until compaction
Future Enhancements
- Property indexes: Secondary indexes on property values
- Edge type index: Fast lookup by edge type across all nodes
- MVCC integration: Transaction-aware reads and writes
- B+Tree underflow handling: Merge/redistribute pages after deletions
Summary
| Component | B+Tree Key | Purpose |
|---|---|---|
| Symbol Table (forward) | string | String → ID mapping |
| Symbol Table (reverse) | symbol_id | ID → String mapping |
| Node Store | node_id | Node data storage |
| Edge Store | (src, dir, type, tgt) | Edge data + traversal |
| Label Index | (label_id, node_id) | Label-based queries |
The graph storage layer transforms a B+Tree (a key-value store) into a full property graph database through careful key design and the double-write pattern for edges.
Vector Search
Overview
Lattice provides approximate nearest neighbor (ANN) search using the HNSW (Hierarchical Navigable Small World) algorithm. This enables semantic search over high-dimensional embedding vectors—a core requirement for AI/RAG applications.
┌─────────────────────────────────────────────────────────────────┐
│ Vector Search Stack │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ EmbeddingClient │ Optional: HTTP calls to embedding APIs │
│ │ (optional) │ Ollama, OpenAI, etc. │
│ └────────┬────────┘ │
│ │ []f32 │
│ ▼ │
│ ┌─────────────────┐ │
│ │ HnswIndex │ Multi-layer graph for ANN search │
│ │ │ O(log n) search complexity │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ VectorStorage │ Page-based vector data storage │
│ │ │ Integrates with BufferPool │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Embedding Generation
Lattice uses a bring-your-own-embeddings approach by default. You can:
- Generate embeddings externally and pass them directly
- Use the optional EmbeddingClient to call HTTP embedding APIs
Option 1: Bring Your Own Embeddings
// You generate embeddings however you want
const embedding: []const f32 = your_embedding_function("Hello world");
// Insert directly into HNSW index
try hnsw_index.insert(doc_id, embedding);
Option 2: HTTP Embedding Client
The EmbeddingClient calls external HTTP endpoints to generate embeddings. It's disabled by default—you must explicitly create and configure it.
const lattice = @import("lattice");
// Ollama (local) - simplest config
var client = lattice.EmbeddingClient.init(allocator, .{
.endpoint = "http://localhost:11434/api/embeddings",
});
defer client.deinit();
// Generate embedding
const vector = try client.embed("Hello, world!");
defer allocator.free(vector);
Configuration Options
const config = lattice.EmbeddingConfig{
// Required: HTTP endpoint URL
.endpoint = "http://localhost:11434/api/embeddings",
// Model name (default: "nomic-embed-text")
.model = "nomic-embed-text",
// API format: .ollama (default) or .openai
.api_format = .ollama,
// Optional API key for authenticated endpoints
.api_key = null,
// Request timeout in milliseconds (default: 30000)
.timeout_ms = 30_000,
};
Supported API Formats
| Format | Request | Response |
|---|---|---|
| .ollama | {"model": "...", "prompt": "..."} | {"embedding": [...]} |
| .openai | {"model": "...", "input": "..."} | {"data": [{"embedding": [...]}]} |
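To make the two wire formats concrete, here is a small Python sketch that builds and parses bodies matching the table above (illustrative only; the real client is the Zig EmbeddingClient):

```python
import json

def build_request(api_format, model, text):
    """Request body for the .ollama or .openai format."""
    if api_format == "ollama":
        return json.dumps({"model": model, "prompt": text})
    return json.dumps({"model": model, "input": text})

def parse_response(api_format, body):
    """Extract the embedding vector from a response body."""
    data = json.loads(body)
    if api_format == "ollama":
        return data["embedding"]
    return data["data"][0]["embedding"]
```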
Examples
Ollama (local)
var client = EmbeddingClient.init(allocator, .{
.endpoint = "http://localhost:11434/api/embeddings",
.model = "nomic-embed-text",
});
OpenAI
var client = EmbeddingClient.init(allocator, .{
.endpoint = "https://api.openai.com/v1/embeddings",
.model = "text-embedding-3-small",
.api_format = .openai,
.api_key = "sk-...",
});
Local OpenAI-compatible (llama.cpp, vLLM, etc.)
var client = EmbeddingClient.init(allocator, .{
.endpoint = "http://localhost:8080/v1/embeddings",
.model = "local-model",
.api_format = .openai,
});
HNSW Index
HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for approximate nearest neighbor search. It achieves O(log n) search complexity with high recall.
How HNSW Works
Layer 2: [A] ─────────────────────── [D]
│ │
Layer 1: [A] ──── [B] ──── [C] ──── [D] ──── [E]
│ │ │ │ │
Layer 0: [A]─[F]─[B]─[G]─[C]─[H]─[D]─[I]─[E]─[J]
(dense connections at layer 0)
- Upper layers have exponentially fewer nodes (sparse)
- Search starts at top layer, greedily descends
- Layer 0 uses beam search for final candidates
- Each node maintains bidirectional connections to neighbors
Configuration
const config = lattice.HnswConfig{
.m = 16, // Connections per node (layers 1+)
.m_max0 = 32, // Connections at layer 0
.ef_construction = 200, // Search width during insert
.ef_search = 64, // Search width during query
    .ml = 0.36067977,       // Level multiplier (1/ln(M) = 1/ln(16))
.metric = .cosine, // Distance metric
};
| Parameter | Default | Description |
|---|---|---|
| m | 16 | Max connections per node (higher = better recall, more memory) |
| m_max0 | 32 | Max connections at layer 0 (typically 2×m) |
| ef_construction | 200 | Beam width during insert (higher = better graph quality) |
| ef_search | 64 | Beam width during search (higher = better recall, slower) |
| metric | .cosine | Distance metric: .euclidean, .cosine, .inner_product |
API Usage
const lattice = @import("lattice");
// Initialize vector storage
var vector_storage = try lattice.VectorStorage.init(allocator, buffer_pool, 384);
defer vector_storage.deinit();
// Initialize HNSW index
var hnsw = lattice.HnswIndex.init(allocator, buffer_pool, &vector_storage, .{
.metric = .cosine,
.ef_search = 100,
});
defer hnsw.deinit();
// Insert vectors
try hnsw.insert(1, embedding1);
try hnsw.insert(2, embedding2);
try hnsw.insert(3, embedding3);
// Search for 10 nearest neighbors
const results = try hnsw.search(query_vector, 10, null);
defer hnsw.freeResults(results);
for (results) |result| {
std.debug.print("ID: {}, Distance: {d:.4}\n", .{ result.id, result.distance });
}
Distance Metrics
| Metric | Formula | Best For |
|---|---|---|
| .euclidean | √Σ(aᵢ - bᵢ)² | General purpose |
| .cosine | 1 - (a·b)/(‖a‖‖b‖) | Text embeddings (normalized) |
| .inner_product | -Σ(aᵢ × bᵢ) | When vectors are pre-normalized |
For text embeddings from most models (OpenAI, Cohere, etc.), use .cosine.
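For reference, the three metrics can be sketched in Python (illustrative only; Lattice's actual implementations are SIMD-optimized Zig):

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared component differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for identical directions, up to 2 for opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def inner_product_distance(a, b):
    # Negated dot product so "smaller distance = closer" still holds
    return -sum(x * y for x, y in zip(a, b))

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 1.0
```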
Vector Storage
Vectors are stored in pages managed by the buffer pool, separate from the HNSW graph structure.
┌────────────────────────────────────────────────────────────────┐
│ Vector Page (4096 bytes) │
├────────────────────────────────────────────────────────────────┤
│ Header (24 bytes) │
│ page_type: u8 = 0x04 (vector_data) │
│ dimensions: u16 │
│ vector_count: u16 │
│ next_page: u32 (overflow chain) │
├────────────────────────────────────────────────────────────────┤
│ Slot 0: [vector_id: u64][f32 × dimensions] │
│ Slot 1: [vector_id: u64][f32 × dimensions] │
│ ... │
└────────────────────────────────────────────────────────────────┘
How many vectors fit in a page depends on the dimension count:
- 384-dim (1536 bytes): 2 vectors/page
- 768-dim (3072 bytes): 1 vector/page
- 1536-dim (6144 bytes): spans multiple pages
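A quick capacity calculation, using the layout above (4096-byte page, 24-byte header, 8-byte vector_id per slot):

```python
PAGE_SIZE = 4096
HEADER_SIZE = 24    # vector page header, per the layout above
ID_SIZE = 8         # u64 vector_id stored with each slot

def vectors_per_page(dimensions):
    slot_size = ID_SIZE + 4 * dimensions  # each component is an f32 (4 bytes)
    return (PAGE_SIZE - HEADER_SIZE) // slot_size

for dim in (384, 768, 1536):
    print(dim, vectors_per_page(dim))  # 384 -> 2, 768 -> 1, 1536 -> 0 (spans pages)
```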
Performance Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| Insert | O(log n) | Builds graph connections |
| Search (k-NN) | O(log n) | Independent of k for small k |
| Delete | O(m × log n) | Removes connections |
| Memory | O(n × m × layers) | ~1KB per vector at m=16 |
Tuning Guidelines
For higher recall:
- Increase ef_search (e.g., 100-200)
- Increase m (e.g., 32-64)
- Use more ef_construction (e.g., 400)
For faster search:
- Decrease ef_search (e.g., 32)
- Accept lower recall
Typical configuration for 1M vectors:
.m = 16,
.m_max0 = 32,
.ef_construction = 200,
.ef_search = 64, // Adjust based on recall/speed tradeoff
Complete Example
const std = @import("std");
const lattice = @import("lattice");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
// Setup storage (abbreviated)
var buffer_pool = try lattice.BufferPool.init(allocator, page_manager, 1000);
defer buffer_pool.deinit();
// Initialize embedding client (optional)
var embedder = lattice.EmbeddingClient.init(allocator, .{
.endpoint = "http://localhost:11434/api/embeddings",
});
defer embedder.deinit();
// Initialize vector storage and HNSW index
var vector_storage = try lattice.VectorStorage.init(allocator, &buffer_pool, 384);
defer vector_storage.deinit();
var hnsw = lattice.HnswIndex.init(allocator, &buffer_pool, &vector_storage, .{
.metric = .cosine,
});
defer hnsw.deinit();
// Index some documents
const docs = [_][]const u8{
"Zig is a systems programming language",
"Rust focuses on memory safety",
"Go emphasizes simplicity and concurrency",
};
for (docs, 0..) |doc, i| {
const embedding = try embedder.embed(doc);
defer allocator.free(embedding);
try hnsw.insert(i, embedding);
}
// Search
const query_embedding = try embedder.embed("memory safe language");
defer allocator.free(query_embedding);
const results = try hnsw.search(query_embedding, 3, null);
defer hnsw.freeResults(results);
for (results) |result| {
std.debug.print("Doc {}: distance {d:.4}\n", .{ result.id, result.distance });
}
}
SIMD-Optimized Distance Functions
Distance calculations use Zig's @Vector for portable SIMD acceleration. This provides 4-8x speedup over scalar implementations for typical embedding dimensions (384, 768, 1536).
// All distance functions are SIMD-optimized
const dist = lattice.vector.distance.euclideanDistance(vec_a, vec_b);
const cos_dist = lattice.vector.distance.cosineDistance(vec_a, vec_b);
const ip_dist = lattice.vector.distance.innerProductDistance(vec_a, vec_b);
// Additional utilities
const dot = lattice.vector.distance.dotProduct(vec_a, vec_b);
const magnitude = lattice.vector.distance.norm(vec);
lattice.vector.distance.normalize(vec); // in-place
Implementation details:
- Processes 8 floats at a time (AVX-256 compatible)
- Handles non-aligned remainder with scalar fallback
- Works on ARM NEON (128-bit, processed as 2×4)
Current Limitations
- In-memory graph: Node connections stored in memory (persisted to pages on write)
- No incremental persistence: Full graph rebuild on restart
- Single-threaded: No concurrent insert/search yet
Future Enhancements
- Persistent graph: Load HNSW structure from disk
- Concurrent access: Reader-writer locks for parallel search
- Product quantization: Compress vectors for larger datasets
- Filtered search: Combine with label/property predicates
Full-Text Search (BM25)
This document explains how Lattice's full-text search works, from text input to ranked results.
Overview
Full-text search allows you to find documents containing specific words or phrases, ranked by relevance. Lattice implements the BM25 (Best Match 25) ranking algorithm, the same algorithm used by Elasticsearch and other production search engines.
The FTS system consists of five components:
┌─────────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────┐
│ Tokenizer │ ──► │ Dictionary │ ──► │ Posting List │ ──► │ Scorer │
│ │ │ (B+Tree) │ │ (Pages) │ │ (BM25) │
└─────────────┘ └────────────┘ └──────────────┘ └─────────┘
│ │ │ │
▼ ▼ ▼ ▼
"hello world" "hello" → 1 doc_ids with ranked
→ ["hello", "world" → 2 term frequencies results
"world"]
1. Tokenizer
The tokenizer breaks text into searchable tokens.
What It Does
Given input text:
"The quick brown fox jumps over the lazy dog"
The tokenizer produces:
["quick", "brown", "fox", "jumps", "lazy", "dog"]
Notice "The", "the", and "over" are missing—they're stop words.
How It Works
Input: "Hello, World! This is a TEST."
│
▼
┌─────────────────────────────────────┐
│ 1. Scan for word boundaries │
│ Split on non-alphanumeric chars │
│ "Hello" "World" "This" "is" "a" "TEST"
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. Apply length filter │
│ min_length=2, max_length=64 │
│ "Hello" "World" "This" "is" "TEST"
│ ("a" removed - too short) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Normalize to lowercase │
│ "hello" "world" "this" "is" "test"
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Filter stop words │
│ "hello" "world" "test" │
│ ("this", "is" removed) │
└─────────────────────────────────────┘
│
▼
Output: [Token{text="hello", position=0},
Token{text="world", position=1},
Token{text="test", position=2}]
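The four steps above can be sketched in Python (illustrative; the stop-word list is abbreviated, and the real tokenizer also supports stemming and accent removal):

```python
import re

# Abbreviated English stop-word list for illustration
STOP_WORDS = {"the", "and", "is", "a", "to", "of", "in", "that", "it", "this", "over"}

def tokenize(text, min_len=2, max_len=64):
    tokens = []
    position = 0
    for word in re.split(r"[^0-9A-Za-z]+", text):  # 1. split on non-alphanumeric
        if not (min_len <= len(word) <= max_len):  # 2. length filter
            continue
        word = word.lower()                        # 3. normalize to lowercase
        if word in STOP_WORDS:                     # 4. filter stop words
            continue
        tokens.append((word, position))
        position += 1
    return tokens

print(tokenize("Hello, World! This is a TEST."))
# [('hello', 0), ('world', 1), ('test', 2)]
```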
Stop Words
Stop words are common words that add little search value. Lattice supports stop word filtering for 12 languages:
| Language | Example Stop Words |
|---|---|
| English | the, and, is, a, to, of, in, that, it |
| German | der, die, das, und, ist, in, zu, den |
| French | le, la, les, de, et, en, que, un |
| Spanish | el, la, los, de, en, que, y, es |
| Italian | il, la, lo, i, di, e, che, un |
| Portuguese | o, a, os, de, e, que, em, um |
| Dutch | de, het, een, en, van, in, is |
| Swedish | och, i, att, det, en, som, av |
| Norwegian | og, i, det, er, en, at, til |
| Danish | og, i, at, det, en, er, til |
| Finnish | ja, on, ei, ole, oli, se, han |
| Russian | и, в, не, на, что, он, с, как |
Set the language in your tokenizer config:
var tokenizer = Tokenizer.init(allocator, text, .{
.remove_stop_words = true,
.language = .german, // Use German stop words
});
Configuration
pub const TokenizerConfig = struct {
min_token_length: u8 = 2, // Skip tokens shorter than this
max_token_length: u8 = 64, // Skip tokens longer than this
lowercase: bool = true, // Convert to lowercase
remove_accents: bool = true, // Remove diacritics (planned)
remove_stop_words: bool = true, // Filter common words
use_stemming: bool = false, // Apply Porter stemmer
language: Language = .english, // Language for stop words
};
Porter Stemmer
When use_stemming is enabled, tokens are reduced to their root forms:
Input Token → Stemmed
─────────────────────────
"running" → "run"
"connected" → "connect"
"optimization" → "optim"
"databases" → "databas"
Why stem? Stemming improves recall by matching morphological variants:
- Query "run" matches documents containing "running", "runs", "runner"
- Query "connect" matches "connected", "connecting", "connection"
Note: Currently only English stemming is supported. For other languages, words are returned unchanged when stemming is enabled. Future versions may add Snowball stemmers for additional languages.
Use normalizeAndStem() or normalizeAndStemWithLanguage() for manual stemming:
var buf: [64]u8 = undefined;
// English stemming
const en_stemmed = tokenizer_mod.normalizeAndStemWithLanguage("RUNNING", &buf, true, .english);
// en_stemmed = "run"
// German (no stemmer available, returns lowercased)
const de_stemmed = tokenizer_mod.normalizeAndStemWithLanguage("RUNNING", &buf, true, .german);
// de_stemmed = "running"
2. Dictionary
The dictionary maps tokens to integer IDs and tracks statistics.
What It Does
Token → TokenId DocFreq PostingPage
─────────────────────────────────────────────────
"hello" → 1 5 42
"world" → 2 3 43
"database" → 3 12 44
"search" → 4 8 45
Why TokenIds?
Storing the full token string everywhere would waste space. Instead:
- Dictionary stores: "database" → TokenId 3
- Posting lists store: TokenId 3 (4 bytes instead of 8)
- Reduced I/O and memory usage
Storage Format
The dictionary uses a B+Tree with:
- Key: Token string (e.g., "hello")
- Value: DictionaryEntry (24 bytes)
DictionaryEntry (24 bytes, extern struct):
┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ total_freq │ token_id │ doc_freq │ posting_page │ _padding │
│ (8 bytes) │ (4 bytes) │ (4 bytes) │ (4 bytes) │ (4 bytes) │
└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘
Fields are ordered largest-first to minimize internal padding (u64 requires 8-byte alignment).
| Field | Type | Description |
|---|---|---|
| total_freq | u64 | Total occurrences across all documents |
| token_id | u32 | Unique identifier (1 to ~4 billion) |
| doc_freq | u32 | Number of documents containing this token |
| posting_page | PageId | First page of the posting list |
| _padding | u32 | Explicit padding for 8-byte struct alignment |
Operations
getOrCreate(token) - Get existing TokenId or create new one:
Input: "hello"
│
▼
┌─────────────────────────────────────┐
│ B+Tree lookup for "hello" │
│ │
│ Found? → Return existing token_id │
│ │
│ Not found? → │
│ 1. Assign next_token_id (e.g., 5) │
│ 2. Insert into B+Tree │
│ 3. Return 5 │
└─────────────────────────────────────┘
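In Python terms, with a dict standing in for the on-disk B+Tree (field names follow the DictionaryEntry layout above):

```python
class Dictionary:
    def __init__(self):
        self.entries = {}        # token -> entry (stand-in for the B+Tree)
        self.next_token_id = 1   # token ids start at 1

    def get_or_create(self, token):
        entry = self.entries.get(token)
        if entry is None:
            # Not found: assign the next id and insert a fresh entry
            entry = {"token_id": self.next_token_id, "doc_freq": 0,
                     "total_freq": 0, "posting_page": 0}
            self.entries[token] = entry
            self.next_token_id += 1
        return entry["token_id"]

d = Dictionary()
print(d.get_or_create("hello"))  # 1 (new)
print(d.get_or_create("world"))  # 2 (new)
print(d.get_or_create("hello"))  # 1 (existing)
```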
3. Posting Lists
A posting list stores which documents contain a specific token.
What It Does
For the token "database" (TokenId 3):
Posting List for "database":
┌────────┬───────────┐
│ DocId │ TermFreq │
├────────┼───────────┤
│ 15 │ 3 │ ← Document 15 contains "database" 3 times
│ 42 │ 1 │ ← Document 42 contains "database" 1 time
│ 89 │ 7 │ ← Document 89 contains "database" 7 times
│ 156 │ 2 │
│ 203 │ 1 │
│ ... │ ... │
└────────┴───────────┘
Page Layout
Each posting list page is 4096 bytes:
Posting Page (4096 bytes):
┌─────────────────────────────────────────────────────────────┐
│ PageHeader (8 bytes) │
│ page_type = FTS_POSTING │
├─────────────────────────────────────────────────────────────┤
│ PostingPageHeader (20 bytes) │
│ token_id: u32 - Which token this list is for │
│ num_entries: u32 - Number of postings in this page │
│ next_page: u32 - Overflow page (0 = none) │
│ num_skip_ptrs: u16 - Number of skip pointers │
│ flags: u16 - 0x01 = has positions │
│ data_start: u32 - Byte offset where data begins │
├─────────────────────────────────────────────────────────────┤
│ Skip Pointers (16 bytes each, optional) │
│ [doc_id, byte_offset, entry_count] × N │
├─────────────────────────────────────────────────────────────┤
│ Posting Data (variable length, varint encoded) │
│ [doc_id: varint] [term_freq: varint] │
│ [doc_id: varint] [term_freq: varint] │
│ ... │
└─────────────────────────────────────────────────────────────┘
PostingPageHeader Explained
The PostingPageHeader is metadata at the start of each posting page:
| Field | Bytes | Purpose |
|---|---|---|
| token_id | 4 | Identifies which token this posting list belongs to |
| num_entries | 4 | How many (doc_id, term_freq) pairs are in this page |
| next_page | 4 | PageId of overflow page if list doesn't fit (0 = no overflow) |
| num_skip_ptrs | 2 | Count of skip pointers for fast seeking |
| flags | 2 | Bit flags: 0x01 = positions stored for phrase queries |
| data_start | 4 | Byte offset where posting entries begin (after skip pointers) |
Varint Encoding
Document IDs and frequencies are stored using variable-length integers (varints) for compression:
Value Binary Varint Bytes Savings
─────────────────────────────────────────────────────────────
127 0111 1111 [0x7F] 1 byte (vs 8)
128 1000 0000 [0x80, 0x01] 2 bytes
16383 0011 1111 1111 1111 [0xFF, 0x7F] 2 bytes
1,000,000 ... [0xC0, 0x84, 0x3D] 3 bytes
Encoding algorithm:
while value >= 0x80:
output byte = (value & 0x7F) | 0x80 // Low 7 bits + continuation flag
value = value >> 7
output byte = value // Final byte (no continuation)
Example: Encoding 300 (binary: 1 0010 1100)
300 = 0b100101100
Step 1: 300 >= 128, so:
output[0] = (300 & 0x7F) | 0x80 = 0x2C | 0x80 = 0xAC
300 >> 7 = 2
Step 2: 2 < 128, so:
output[1] = 2 = 0x02
Result: [0xAC, 0x02] (2 bytes instead of 8)
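The encode/decode pair, sketched in Python for illustration:

```python
def encode_varint(value):
    # Low 7 bits per byte; MSB set means "more bytes follow"
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte, no continuation bit
    return bytes(out)

def decode_varint(data, offset=0):
    # Reassemble 7-bit groups, least significant first
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, offset
        shift += 7

print(encode_varint(300).hex())  # ac02
print(decode_varint(b"\xac\x02"))  # (300, 2)
```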
Overflow Pages
When a posting list exceeds one page, it chains to overflow pages:
Page 42 (first page) Page 87 (overflow)
┌────────────────────┐ ┌────────────────────┐
│ Header │ │ Header │
│ next_page = 87 ──┼───────►│ next_page = 0 │
│ num_entries = 200│ │ num_entries = 50 │
├────────────────────┤ ├────────────────────┤
│ Entries 1-200 │ │ Entries 201-250 │
└────────────────────┘ └────────────────────┘
Skip Pointers
Skip pointers enable O(log n) seeking within posting lists, dramatically speeding up multi-term AND queries.
Structure:
SkipPointer (16 bytes):
┌────────────┬─────────────┬─────────────┐
│ doc_id │ byte_offset │ entry_count │
│ (8 bytes) │ (4 bytes) │ (4 bytes) │
└────────────┴─────────────┴─────────────┘
How they work:
Skip pointers are created every 128 entries (SKIP_INTERVAL). They record:
- doc_id: The document ID at that entry
- byte_offset: Where to jump in the posting data
- entry_count: Number of entries before this pointer
Posting list with 500 entries:
Skip Pointers:
[0] doc_id=1280, offset=512, count=128 ← entry 128
[1] doc_id=2560, offset=1024, count=256 ← entry 256
[2] doc_id=3840, offset=1536, count=384 ← entry 384
Seeking to doc_id 3000:
1. Binary search skip pointers: find [1] (doc_id=2560 < 3000)
2. Jump to offset 1024
3. Linear scan from entry 256 to find doc_id 3000
Result: Skipped 256 entries instead of scanning all 300
Multi-term intersection optimization:
For AND queries like "database optimization":
- Sort terms by doc_freq (smallest first)
- Iterate through smallest posting list
- For each doc_id, use skipTo() on the other lists
- Skip pointers let large lists jump ahead efficiently
Query: "the database" (AND)
"the": 10,000 documents
"database": 100 documents ← Driver (smallest)
Without skip pointers: 10,000 + 100 = 10,100 iterations
With skip pointers: 100 + ~100 seeks = ~200 operations
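A Python sketch of the seek operation (illustrative: a list of tuples stands in for the on-page varint data, and skip pointers record entry indices rather than byte offsets for simplicity):

```python
import bisect

SKIP_INTERVAL = 128

def seek(postings, skip_pointers, target_doc_id):
    """postings: sorted list of (doc_id, term_freq).
    skip_pointers: (doc_id, entry_index) sampled every SKIP_INTERVAL entries.
    Returns the index of the first posting with doc_id >= target."""
    start = 0
    # 1. Binary search skip pointers for the last one below the target
    ids = [p[0] for p in skip_pointers]
    i = bisect.bisect_left(ids, target_doc_id)
    if i > 0:
        start = skip_pointers[i - 1][1]  # 2. jump ahead
    # 3. Linear scan from the jump point
    while start < len(postings) and postings[start][0] < target_doc_id:
        start += 1
    return start

# 500 postings with doc_ids 10, 20, ..., 5000
postings = [(d, 1) for d in range(10, 5010, 10)]
skips = [(postings[i][0], i) for i in range(SKIP_INTERVAL, len(postings), SKIP_INTERVAL)]
idx = seek(postings, skips, 3000)
print(postings[idx])  # (3000, 1) — found after jumping past 256 entries
```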
4. Document Length Storage
BM25 scoring requires knowing each document's length. This is stored in a separate B+Tree:
DocId → Length (tokens)
────────────────────────
1 → 150
2 → 45
3 → 892
... → ...
Why store lengths?
BM25 penalizes long documents to prevent them from dominating results just because they contain more words. A 10,000-word document mentioning "database" once is less relevant than a 100-word document mentioning it once.
Statistics tracked:
- total_docs: Total documents indexed
- total_tokens: Sum of all document lengths
- avg_doc_length: Average tokens per document (for normalization)
5. BM25 Scoring
BM25 calculates a relevance score for each document.
The Formula
Score(D, Q) = Σ IDF(term) × TF_norm(term, D)
for each term in query Q
Where:
IDF (Inverse Document Frequency):
IDF(term) = log((N - df + 0.5) / (df + 0.5) + 1)
N = total number of documents
df = number of documents containing this term
Rare terms get higher IDF scores. If "quantum" appears in 5 of 10,000 documents, it's more significant than "the" appearing in 9,500.
TF_norm (Normalized Term Frequency):
TF_norm = (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × (dl / avgdl)))
tf = term frequency in this document
k1 = saturation parameter (default 1.2)
b = length normalization (default 0.75)
dl = document length
avgdl = average document length
Parameters
| Parameter | Default | Effect |
|---|---|---|
| k1 | 1.2 | Controls term frequency saturation. Higher = more weight to repeated terms |
| b | 0.75 | Length normalization. 0 = no normalization, 1 = full normalization |
Scoring Example
Corpus: 1000 documents, average length 200 tokens
Query: "database optimization"
Document 42:
- Length: 150 tokens
- "database" appears 3 times
- "optimization" appears 1 time
Term: "database"
- doc_freq = 50 (appears in 50 docs)
- IDF = log((1000 - 50 + 0.5) / (50 + 0.5) + 1) = log(19.82) ≈ 2.99
- tf = 3
- dl/avgdl = 150/200 = 0.75
- TF_norm = (3 × 2.2) / (3 + 1.2 × (1 - 0.75 + 0.75 × 0.75))
= 6.6 / (3 + 1.2 × 0.8125)
= 6.6 / 3.975
≈ 1.66
- Score for "database" = 2.99 × 1.66 ≈ 4.96
Term: "optimization"
- doc_freq = 10 (rarer term)
- IDF = log((1000 - 10 + 0.5) / (10 + 0.5) + 1) ≈ 4.56
- tf = 1
- TF_norm = (1 × 2.2) / (1 + 1.2 × 0.8125) ≈ 1.11
- Score for "optimization" = 4.56 × 1.11 ≈ 5.08
Total Score for Document 42 = 4.96 + 5.08 ≈ 10.04
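The same arithmetic as a Python function (illustrative; defaults k1=1.2 and b=0.75 as above — small differences from hand-rounded intermediate values are expected):

```python
import math

def bm25_term_score(tf, doc_freq, doc_len, total_docs, avg_doc_len, k1=1.2, b=0.75):
    # IDF: rarer terms score higher
    idf = math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    # TF normalization: saturates with tf, penalizes long documents
    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc_len / avg_doc_len)))
    return idf * tf_norm

# Document 42 (150 tokens) in a 1000-doc corpus with average length 200
s_db = bm25_term_score(tf=3, doc_freq=50, doc_len=150, total_docs=1000, avg_doc_len=200)
s_opt = bm25_term_score(tf=1, doc_freq=10, doc_len=150, total_docs=1000, avg_doc_len=200)
print(round(s_db, 2), round(s_opt, 2), round(s_db + s_opt, 2))  # 4.96 5.08 10.04
```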
6. Search Flow
Single-Term Search
Query: "database"
│
▼
┌─────────────────────────────────────┐
│ 1. Tokenize query │
│ → ["database"] │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. Dictionary lookup │
│ "database" → TokenId 3 │
│ doc_freq = 50 │
│ posting_page = 44 │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Iterate posting list (page 44) │
│ For each (doc_id, term_freq): │
│ - Get doc_length │
│ - Calculate BM25 score │
│ - Add to results │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Sort by score descending │
│ Return top K results │
└─────────────────────────────────────┘
Multi-Term Search (AND Semantics)
Query: "database optimization"
│
▼
┌─────────────────────────────────────┐
│ 1. Tokenize query │
│ → ["database", "optimization"] │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. For each term, iterate posting │
│ list and accumulate scores │
│ │
│ doc_scores[doc_id] += score │
│ doc_term_count[doc_id] += 1 │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Filter: keep only docs where │
│ term_count == num_query_terms │
│ (AND semantics) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Sort by accumulated score │
│ Return top K results │
└─────────────────────────────────────┘
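The accumulate-then-filter steps above can be sketched in Python (illustrative; the score callback stands in for the BM25 scorer):

```python
def search_and(posting_lists, score):
    """posting_lists: {term: {doc_id: term_freq}}.
    score: callback computing a per-(term, doc) relevance score."""
    doc_scores = {}
    doc_term_count = {}
    num_terms = len(posting_lists)
    # Accumulate scores and count how many query terms each doc matched
    for term, postings in posting_lists.items():
        for doc_id, tf in postings.items():
            doc_scores[doc_id] = doc_scores.get(doc_id, 0.0) + score(term, doc_id, tf)
            doc_term_count[doc_id] = doc_term_count.get(doc_id, 0) + 1
    # AND semantics: keep only docs that matched every query term
    hits = [(d, s) for d, s in doc_scores.items() if doc_term_count[d] == num_terms]
    return sorted(hits, key=lambda x: -x[1])

postings = {
    "database":     {15: 3, 42: 1, 89: 7},
    "optimization": {42: 2, 89: 1},
}
# Dummy scorer: term frequency as the score
print(search_and(postings, lambda term, doc, tf: float(tf)))
# [(89, 8.0), (42, 3.0)] — doc 15 is dropped (missing "optimization")
```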
OR Search
OR search returns documents matching any query term:
// Returns docs containing "mysql" OR "postgres" OR both
const results = try fts.searchOr("mysql postgres", 10);
// Or use explicit mode selection
const results2 = try fts.searchWithMode("mysql postgres", .@"or", 10);
Query: "mysql postgres" (OR mode)
│
▼
┌─────────────────────────────────────┐
│ 1. Tokenize query │
│ → ["mysql", "postgres"] │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. For each term, iterate posting │
│ list and accumulate scores │
│ │
│ doc_scores[doc_id] += score │
│ (no term count filtering) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Sort by accumulated score │
│ Docs with both terms rank higher │
└─────────────────────────────────────┘
NOT Search (Exclusions)
Prefix terms with - to exclude documents containing them:
// Find "database" docs that don't mention "mysql"
const results = try fts.searchWithMode("database -mysql", .@"and", 10);
// Multiple exclusions
const results2 = try fts.searchWithMode("database -mysql -oracle", .@"and", 10);
Query: "database -mysql"
│
▼
┌─────────────────────────────────────┐
│ 1. Parse query │
│ terms = ["database"] │
│ excluded = ["mysql"] │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. Search for positive terms │
│ → candidate documents │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Build exclusion set │
│ (all doc_ids containing "mysql") │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Filter candidates │
│ Remove docs in exclusion set │
└─────────────────────────────────────┘
Phrase Search
Phrase search finds documents where terms appear adjacent and in order:
// Enable position storage in config
var fts = FtsIndex.init(allocator, &bp, &dict_tree, &lengths_tree, .{
.store_positions = true,
});
// Search for exact phrase
const results = try fts.searchPhrase("quick brown fox", 10);
Query: "quick brown fox" (phrase)
Document 1: "The quick brown fox jumps" ← MATCHES (adjacent: pos 1,2,3)
Document 2: "A quick red brown fox" ← NO MATCH (not adjacent)
Document 3: "The brown quick fox" ← NO MATCH (wrong order)
How it works:
┌─────────────────────────────────────┐
│ 1. Get posting lists with positions │
│ │
│ "quick": doc1[pos=1], doc2[pos=1]│
│ "brown": doc1[pos=2], doc2[pos=3]│
│ "fox": doc1[pos=3], doc2[pos=4]│
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. For each candidate document: │
│ Check position adjacency │
│ │
│ doc1: quick@1, brown@2, fox@3 │
│ 1+1=2 ✓, 1+2=3 ✓ → MATCH │
│ │
│ doc2: quick@1, brown@3, fox@4 │
│ 1+1=2 ≠ 3 → NO MATCH │
└─────────────────────────────────────┘
Note: Phrase queries require store_positions = true in FtsConfig. Without positions, searchPhrase() falls back to AND semantics.
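The adjacency check can be sketched in Python (illustrative; each set holds one term's positions within a single document):

```python
def phrase_match(term_positions):
    """term_positions: one set of token positions per phrase term, in phrase order.
    True if some starting position has every following term at consecutive offsets."""
    for start in term_positions[0]:
        if all(start + offset in positions
               for offset, positions in enumerate(term_positions[1:], start=1)):
            return True
    return False

# Document 1: quick@1 brown@2 fox@3 -> adjacent and in order
print(phrase_match([{1}, {2}, {3}]))  # True
# Document 2: quick@1 brown@3 fox@4 -> gap between quick and brown
print(phrase_match([{1}, {3}, {4}]))  # False
```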
Quoted Phrase Syntax
You can also use quoted strings in regular search() and searchWithMode() calls to automatically detect phrase queries:
// These are equivalent:
const results1 = try fts.searchPhrase("quick brown", 10);
const results2 = try fts.searchWithMode("\"quick brown\"", .@"and", 10);
Combining phrases with terms and exclusions:
// Phrase + additional term (AND mode)
// Matches documents with "quick brown" phrase AND term "jumps"
const results = try fts.searchWithMode("\"quick brown\" jumps", .@"and", 10);
// Phrase + exclusion
// Matches documents with "quick brown" phrase but NOT containing "fox"
const results = try fts.searchWithMode("\"quick brown\" -fox", .@"and", 10);
// Multiple phrases
// Matches documents with both phrases
const results = try fts.searchWithMode("\"quick brown\" \"lazy dog\"", .@"and", 10);
// Phrase with OR mode
// Matches documents with "quick brown" phrase OR term "rabbit"
const results = try fts.searchWithMode("\"quick brown\" rabbit", .@"or", 10);
Single-word quotes:
Single words in quotes are treated as regular terms (since a one-word phrase is just a term):
// These are equivalent:
const results1 = try fts.search("database", 10);
const results2 = try fts.searchWithMode("\"database\"", .@"and", 10);
Fuzzy Search
Fuzzy search finds documents even when query terms contain typos. It uses Levenshtein (edit) distance to match terms within a configurable threshold.
const fuzzy = @import("fts/fuzzy.zig");
// Search with typo tolerance (max 2 edits)
const results = try fts.searchFuzzy("databse", .{
.max_distance = 2, // Allow up to 2 edits
.min_term_length = 4, // Only fuzzy-match terms >= 4 chars
}, 10);
defer fts.freeResults(results);
// Matches documents containing "database" (edit distance 1)
How it works:
Query: "databse" (typo)
↓
Scan dictionary for terms within edit distance 2:
- "database" (distance=1) ✓
- "datastore" (distance=3) ✗
↓
Search matching terms with distance penalty:
- Score("database") × penalty(1) = Score × 0.75
↓
Return ranked results
Distance penalty formula:
penalty = 1.0 - (distance / max_distance)²
Examples (max_distance = 2):
- distance 0: 1.00 (exact match)
- distance 1: 0.75 (25% penalty)
- distance 2: 0.00 (filtered out)
Levenshtein distance counts the minimum single-character edits needed:
- Insertion: "helo" → "hello" (distance 1)
- Deletion: "hello" → "helo" (distance 1)
- Substitution: "hello" → "hallo" (distance 1)
Note: Transpositions ("recieve" → "receive") count as 2 edits in standard Levenshtein.
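Both the edit distance and the penalty can be sketched in Python for illustration:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def penalty(distance, max_distance=2):
    # Quadratic falloff: exact matches keep full score
    return 1.0 - (distance / max_distance) ** 2

print(levenshtein("databse", "database"))  # 1
print(penalty(1))                          # 0.75
```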
Prefix/Wildcard Search
Prefix search finds documents containing terms that start with a given prefix. Use * as a suffix wildcard.
const prefix = @import("fts/prefix.zig");
// Search for terms starting with "optim" (matches "optimize", "optimization", "optimizer")
const results = try fts.searchWithPrefix("optim*", .{
.min_prefix_length = 2, // Minimum prefix length (prevents "a*" explosion)
.max_expansions = 50, // Maximum terms to expand
}, 10);
defer fts.freeResults(results);
How it works:
Query: "optim*"
↓
Calculate range bounds: ["optim", "optin")
↓
B+Tree range scan to find matching terms:
- "optimization" ✓
- "optimize" ✓
- "optimizer" ✓
↓
Search all matching terms (like OR query)
↓
Return ranked results
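Computing the range bounds is nearly a one-liner (illustrative; this sketch assumes the last prefix character is not at its maximum value):

```python
def prefix_bounds(prefix):
    """[lower, upper) bounds for a B+Tree range scan over terms with this prefix."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # "optim" -> "optin"
    return prefix, upper

lo, hi = prefix_bounds("optim")
print(lo, hi)  # optim optin
for term in ["optimal", "optimize", "option", "order"]:
    print(term, lo <= term < hi)  # True, True, False, False
```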
Combining prefix with regular terms:
// AND semantics: documents must contain "systems" AND a term starting with "data"
const results = try fts.searchWithPrefix("systems data*", .{}, 10);
// Matches documents with both "systems" and "database"/"datastore"/"data"
Constraints:
- Suffix wildcards only (optim*), not prefix wildcards (*tion)
- No middle wildcards (da*base)
- Minimum prefix length configurable (default 2 chars)
- Maximum expansions capped to prevent explosion
Search Highlighting
Search highlighting returns text snippets with matched terms marked, enabling UI display of search results with context.
const highlight = @import("fts/highlight.zig");
const text = "The database stores data efficiently for optimal performance.";
const query_terms = [_][]const u8{ "database", "data" };
const result = try highlight.highlight(
allocator,
text,
&query_terms,
.{ .use_stemming = false }, // TokenizerConfig
.{
.context_chars = 80, // Characters of context around matches
.max_snippets = 3, // Maximum snippets to return
.merge_distance = 40, // Merge snippets closer than this
.prefix_marker = "<em>", // Marker before matched terms
.suffix_marker = "</em>", // Marker after matched terms
.ellipsis = "...", // Added when text is truncated
},
);
defer highlight.freeResult(allocator, result);
// result.snippets[0].text = "The <em>database</em> stores <em>data</em> efficiently..."
// result.total_matches = 2
How it works:
Query terms: ["database", "data"]
Document: "The database stores data efficiently for optimal performance."
↓
Re-tokenize document with same config:
- "database" at offset 4 → stems to "databas" or "database"
- "data" at offset 20 → stems to "data"
↓
Match against query terms (stemmed comparison):
- Match found at positions 4-12 and 20-24
↓
Group matches into snippets with context:
- Single snippet with both matches (close together)
↓
Insert markers around matches:
- "The <em>database</em> stores <em>data</em> efficiently..."
With stemming:
const text = "The runners were running fast in the marathon.";
const query_terms = [_][]const u8{"run"}; // Already stemmed query
const result = try highlight.highlight(
allocator,
text,
&query_terms,
.{ .use_stemming = true }, // Enable stemming
.{ .prefix_marker = "**", .suffix_marker = "**" },
);
// "running" stems to "run" → match
// "runners" stems to "runner" → no match
// result.snippets[0].text = "The runners were **running** fast..."
Key design decisions:
- Re-tokenization approach: Document text is re-tokenized at query time (no storage overhead)
- Stemmed matching: Query terms are matched against stemmed document tokens
- Original text preserved: Markers wrap the original text, not stemmed forms
- Configurable markers: Default
<em>/</em>but customizable for any format - Snippet merging: Close matches combined into single snippets
Finding matches only (without snippets):
const matches = try highlight.findMatches(
allocator,
text,
&query_terms,
.{ .use_stemming = false },
);
defer highlight.freeMatches(allocator, matches);
for (matches) |match| {
// match.start = byte offset of match start
// match.end = byte offset of match end (exclusive)
const matched_text = text[match.start..match.end];
}
7. Indexing Flow
indexDocument(doc_id, text)
Input: doc_id=42, text="The quick database optimization guide"
│
▼
┌─────────────────────────────────────┐
│ 1. Tokenize text │
│ → ["quick", "database", │
│ "optimization", "guide"] │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. Count term frequencies │
│ term_freqs = { │
│ "quick": 1, │
│ "database": 1, │
│ "optimization": 1, │
│ "guide": 1 │
│ } │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. For each term: │
│ a. getOrCreate in dictionary │
│ b. Create posting page if needed │
│ c. Append posting entry │
│ d. Increment doc_freq │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Store document length │
│ doc_lengths[42] = 4 │
│ Update avg_doc_length │
└─────────────────────────────────────┘
│
▼
Output: 4 (number of tokens indexed)
8. Data Structures Summary
In-Memory
| Structure | Purpose |
|---|---|
| Tokenizer | Streaming text tokenization |
| PostingIterator | Iterate posting list entries |
| Bm25Scorer | Calculate relevance scores |
On-Disk (B+Trees)
| B+Tree | Key | Value |
|---|---|---|
| Dictionary | Token string | DictionaryEntry (24 bytes) |
| DocLengths | DocId (8 bytes) | Length (4 bytes) |
On-Disk (Pages)
| Page Type | Content |
|---|---|
| FTS_POSTING | Posting list entries (varint encoded) |
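Varint encoding keeps posting entries compact. A sketch of the common LEB128-style scheme (the exact byte layout LatticeDB uses is an assumption here); small values such as doc-id deltas fit in a single byte:

```python
def varint_encode(n: int) -> bytes:
    """7 payload bits per byte; high bit set means more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(data: bytes, pos: int = 0):
    """Return (value, position after the varint)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7

assert varint_encode(5) == b"\x05"        # one byte for small values
assert varint_encode(300) == b"\xac\x02"  # two bytes for 300
assert varint_decode(varint_encode(300)) == (300, 2)
```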
9. Limitations and Future Work
Current Limitations
- English-only stemming - Porter stemmer for English only (other languages skip stemming)
Implemented Features
- Phrase queries - `searchPhrase("quick brown fox")` with position verification
- Boolean queries - AND (default), OR (`searchOr`), NOT (`-term` syntax)
- Porter stemming - `use_stemming=true` reduces words to roots (English)
- Position indexing - `store_positions=true` for phrase queries
- Skip pointers - O(log n) posting list intersection for multi-term queries
- Quoted phrase syntax - Parse `"exact phrase"` in regular `search()` calls
- Fuzzy search - Levenshtein distance for typo tolerance via `searchFuzzy()`
- Multi-language stop words - 12 languages supported (EN, DE, FR, ES, IT, PT, NL, SV, NO, DA, FI, RU)
- Prefix/wildcard search - `searchWithPrefix("optim*")` expands to matching terms
- Search highlighting - `highlight()` returns snippets with matched terms marked
- Document deletion - `removeDocument()` with reverse index properly cleans posting lists and stats
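Fuzzy search matches terms within a Levenshtein edit distance. A sketch of the classic dynamic-programming distance (illustrative; the Zig implementation in src/fts/fuzzy.zig may differ in details):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions that turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete from `a`
                curr[j - 1] + 1,            # insert into `a`
                prev[j - 1] + (ca != cb),   # substitute (free on match)
            ))
        prev = curr
    return prev[-1]

# The typo "databse" is one edit away from "database"
assert levenshtein("databse", "database") == 1
```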
Planned Features
- Multi-language stemmers - Snowball stemmers for German, French, etc.
10. Usage Example
const std = @import("std");
const lattice = @import("lattice");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const allocator = gpa.allocator();
// Initialize storage (simplified; assumes a PageManager `pm` and the
// BufferPool/BTree/FtsIndex types are in scope)
var bp = try BufferPool.init(allocator, &pm, 64 * 4096);
var dict_tree = try BTree.init(allocator, &bp);
var lengths_tree = try BTree.init(allocator, &bp);
// Create FTS index with phrase query support
var fts = FtsIndex.init(allocator, &bp, &dict_tree, &lengths_tree, .{
.store_positions = true, // Enable phrase queries
});
// Index documents
_ = try fts.indexDocument(1, "Introduction to database systems");
_ = try fts.indexDocument(2, "Advanced database optimization techniques");
_ = try fts.indexDocument(3, "Web development with JavaScript");
_ = try fts.indexDocument(4, "Database performance and MySQL tuning");
// Basic AND search (all terms must match)
const results1 = try fts.search("database optimization", 10);
defer fts.freeResults(results1);
// Returns doc 2 (contains both terms)
// OR search (any term matches)
const results2 = try fts.searchOr("mysql postgres", 10);
defer fts.freeResults(results2);
// Returns doc 4 (contains "mysql")
// NOT search (exclusions with -prefix)
const results3 = try fts.searchWithMode("database -mysql", .@"and", 10);
defer fts.freeResults(results3);
// Returns docs 1, 2 (contain "database" but not "mysql")
// Phrase search (exact sequence)
const results4 = try fts.searchPhrase("database systems", 10);
defer fts.freeResults(results4);
// Returns doc 1 (has "database systems" as adjacent phrase)
// Quoted phrase syntax (alternative to searchPhrase)
// Phrase "database optimization" + term "advanced"
const results5 = try fts.searchWithMode("\"database optimization\" advanced", .@"and", 10);
defer fts.freeResults(results5);
// Returns doc 2 (has phrase "database optimization" AND term "advanced")
// Fuzzy search (typo tolerance)
const results6 = try fts.searchFuzzy("databse", .{
.max_distance = 2,
.min_term_length = 4,
}, 10);
defer fts.freeResults(results6);
// Returns docs with "database" (edit distance 1 from "databse")
for (results1) |result| {
std.debug.print("Doc {}: score {d:.2}\n", .{ result.doc_id, result.score });
}
}
11. Serialization Pattern
All on-disk structures use extern struct with compile-time size assertions:
/// Good: Self-documenting, compile-time verified
pub const PostingPageHeader = extern struct {
token_id: TokenId,
num_entries: u32,
next_page: PageId,
num_skip_pointers: u16,
flags: u16,
data_start: u32,
comptime {
std.debug.assert(@sizeOf(PostingPageHeader) == 20);
}
pub fn read(data: []const u8) PostingPageHeader {
return std.mem.bytesAsValue(PostingPageHeader, data[OFFSET..][0..@sizeOf(PostingPageHeader)]).*;
}
pub fn write(self: *const PostingPageHeader, data: []u8) void {
@memcpy(data[OFFSET..][0..@sizeOf(PostingPageHeader)], std.mem.asBytes(self));
}
};
Why this pattern?
- No magic numbers - Field layout is defined by the struct, not manual offsets
- Compile-time verification - `comptime` block catches size mismatches immediately
- Self-documenting - Struct fields serve as documentation
- Consistent with codebase - WAL, PageHeader, FileHeader all use this pattern
Alignment considerations:
When structs contain u64 fields, use explicit padding and order fields largest-first:
pub const DictionaryEntry = extern struct {
total_freq: u64, // 8 bytes - largest first (requires 8-byte alignment)
token_id: u32, // 4 bytes
doc_freq: u32, // 4 bytes
posting_page: u32, // 4 bytes
_padding: u32 = 0, // 4 bytes - explicit trailing padding
comptime {
std.debug.assert(@sizeOf(DictionaryEntry) == 24);
}
};
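The 24-byte invariant can be cross-checked outside Zig; here the same layout is expressed with Python's struct module (little-endian assumed for illustration):

```python
import struct

# u64 total_freq, u32 token_id, u32 doc_freq, u32 posting_page, u32 padding
DICTIONARY_ENTRY = struct.Struct("<QIIII")
assert DICTIONARY_ENTRY.size == 24  # mirrors the Zig comptime assertion

# Round-trip a sample entry through the fixed layout
packed = DICTIONARY_ENTRY.pack(100, 7, 3, 42, 0)
total_freq, token_id, doc_freq, posting_page, _ = DICTIONARY_ENTRY.unpack(packed)
```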
12. File Reference
| File | Purpose |
|---|---|
| src/fts/tokenizer.zig | Text tokenization, normalization, language config |
| src/fts/stopwords.zig | Multi-language stop word lists (12 languages) |
| src/fts/dictionary.zig | Token → TokenId mapping via B+Tree, range iteration |
| src/fts/posting.zig | Posting list storage with varint encoding, entry removal |
| src/fts/scorer.zig | BM25 scoring, document length storage |
| src/fts/index.zig | Main FtsIndex coordinator, boolean/phrase/fuzzy/prefix search |
| src/fts/stemmer.zig | Porter stemmer algorithm (English), language routing |
| src/fts/fuzzy.zig | Levenshtein distance, fuzzy term expansion |
| src/fts/prefix.zig | Prefix/wildcard search, upper bound calculation |
| src/fts/highlight.zig | Search result highlighting, snippet extraction |
| src/fts/reverse_index.zig | doc_id → terms mapping for document deletion |
Query Execution
Overview
Lattice executes Cypher queries using the Volcano iterator model—a pull-based execution engine where operators form a tree and data flows upward one row at a time. This design enables lazy evaluation, memory efficiency, and composable query plans.
┌─────────────────────────────────────────────────────────────────┐
│ Query Execution Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ "MATCH (p:Person) WHERE p.age > 30 RETURN p.name LIMIT 10" │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Lexer │ Tokens │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Parser │ AST │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Semantic │ Validated AST │
│ │ Analyzer │ + Variable bindings │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Planner │ Operator Tree │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Executor │ Result Rows │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
The Volcano Iterator Model
Every operator implements three methods:
pub const Operator = struct {
vtable: *const VTable,
ptr: *anyopaque,
pub const VTable = struct {
open: fn (*anyopaque, *ExecutionContext) OperatorError!void,
next: fn (*anyopaque, *ExecutionContext) OperatorError!?*Row,
close: fn (*anyopaque, *ExecutionContext) void,
deinit: fn (*anyopaque, Allocator) void,
};
};
- `open()`: Initialize the operator, acquire resources (iterators, memory)
- `next()`: Return the next row, or `null` if exhausted
- `close()`: Release resources (unpin pages, close iterators)
- `deinit()`: Free the operator itself
Pull-Based Execution
Data flows upward through the operator tree. The root operator "pulls" from its children:
┌─────────┐
│ Limit │ ◄── Executor calls next() here
└────┬────┘
│ pulls from
┌────┴────┐
│ Project │
└────┬────┘
│ pulls from
┌────┴────┐
│ Filter │
└────┬────┘
│ pulls from
┌────┴────┐
│LabelScan│ ◄── Reads from storage
└─────────┘
When the executor calls root.next():
- Limit calls Project.next()
- Project calls Filter.next()
- Filter calls LabelScan.next()
- LabelScan reads from B+Tree, returns row
- Filter evaluates predicate—if false, pulls another row
- Project evaluates expressions, transforms row
- Limit checks count, returns row (or null if limit reached)
Lazy Evaluation
Work happens only when next() is called. For LIMIT 10, we might scan millions of nodes but only process 10 matching rows. Unused rows are never materialized.
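The pull model maps naturally onto generators. A Python sketch of a Limit → Filter → LabelScan plan (hypothetical data; the real operators use the vtable interface shown earlier):

```python
def label_scan(nodes, label, stats):
    # Leaf operator: pulls one node at a time from "storage"
    for node in nodes:
        stats["scanned"] += 1
        if label in node["labels"]:
            yield node

def filter_op(rows, predicate):
    # Keeps pulling until a row satisfies the predicate
    for row in rows:
        if predicate(row):
            yield row

def limit_op(rows, n):
    # Stops pulling after n rows
    it = iter(rows)
    for _ in range(n):
        row = next(it, None)
        if row is None:
            return
        yield row

nodes = [{"labels": {"Person"}, "age": age} for age in range(100)]
stats = {"scanned": 0}
plan = limit_op(filter_op(label_scan(nodes, "Person", stats),
                          lambda r: r["age"] > 30), 10)
results = list(plan)
# 10 rows returned, but only 41 of the 100 nodes were ever scanned
```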
Rows and Slots
A Row is a fixed-size tuple flowing between operators:
pub const Row = struct {
slots: [16]SlotValue, // Variable bindings
distances: [16]f32, // Vector search distances
scores: [16]f32, // FTS relevance scores
populated: u16, // Bitmask of populated slots
};
Slot Values
Each slot holds one of:
pub const SlotValue = union(enum) {
empty: void, // Unset
node_ref: NodeId, // Reference to a node (just the ID)
edge_ref: EdgeId, // Reference to an edge
property: PropertyValue, // Actual value (int, string, bool, etc.)
};
Why Slots Instead of Named Variables?
Performance. Looking up slots[3] is O(1). The planner assigns each variable to a numbered slot:
MATCH (p:Person)-[r:KNOWS]->(f:Person)
│ │ │
slot 0 slot 1 slot 2
The execution context maintains the mapping from names to slots for expression evaluation.
Lazy Materialization
Rows store references (NodeId, EdgeId), not full objects. Properties are fetched on-demand during expression evaluation. This avoids loading data that's never accessed.
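A sketch of the idea: rows carry only a NodeId, and a property store is consulted only for the fields an expression actually touches (hypothetical store, for illustration):

```python
class PropertyStore:
    """Stand-in for on-disk node properties; counts on-demand fetches."""
    def __init__(self, props):
        self.props = props   # node_id -> {key: value}
        self.fetches = 0

    def get(self, node_id, key):
        self.fetches += 1
        return self.props[node_id].get(key)

store = PropertyStore({1: {"name": "Alice", "age": 31},
                       2: {"name": "Bob", "age": 25}})

rows = [{"slot0": 1}, {"slot0": 2}]  # rows hold NodeIds, not full objects
names = [store.get(row["slot0"], "name") for row in rows]
# Only "name" was fetched; "age" was never loaded
```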
Operators
Scan Operators
AllNodesScan: Iterates every node in the database.
open(): Create B+Tree iterator (null start key = first entry)
next(): Read entry, extract NodeId from key, set output slot
close(): Release iterator, unpin pages
LabelScan: Iterates nodes with a specific label.
open(): Create LabelIndex iterator for the label
next(): Get next NodeId from index, set output slot
close(): Release iterator
Both produce rows with a single slot containing a node reference.
Filter Operator
Evaluates a predicate and passes through only matching rows.
fn next(self: *Filter, ctx: *ExecutionContext) !?*Row {
while (true) {
const row = try self.input.next(ctx) orelse return null;
const result = try self.evaluator.evaluate(self.predicate, row, ctx);
if (result.isTruthy()) {
return row;
}
// Predicate failed, try next row
}
}
Supports short-circuit evaluation for AND/OR.
Project Operator
Transforms rows by evaluating expressions for each output column.
RETURN p.name, p.age + 1, "literal"
│ │ │
expr[0] expr[1] expr[2]
fn next(self: *Project, ctx: *ExecutionContext) !?*Row {
const input_row = try self.input.next(ctx) orelse return null;
self.output_row.clear();
for (self.items) |item| {
const value = try self.evaluator.evaluate(item.expr, input_row, ctx);
self.output_row.setSlot(item.output_slot, resultToSlotValue(value));
}
return self.output_row;
}
Expand Operator
Traverses edges from input nodes. This implements pattern matching like (a)-[r]->(b).
State:
source_slot = slot containing the source node
target_slot = slot to write target node
edge_slot = slot to write edge (optional)
edge_iterator = current iterator over edges
current_input = row being expanded
next():
loop:
if edge_iterator.next() returns edge:
output_row = copy input row
output_row[target_slot] = edge.target
output_row[edge_slot] = edge (if requested)
return output_row
// Exhausted edges for current input, get next input
current_input = input.next()
if null: return null
source = current_input[source_slot]
edge_iterator = edgeStore.getOutgoing(source)
This is a one-to-many operator: one input row can produce multiple output rows.
Supports three directions:
- outgoing: `(a)-[r]->(b)` — traverse outgoing edges
- incoming: `(a)<-[r]-(b)` — traverse incoming edges
- both: `(a)-[r]-(b)` — traverse both directions
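The one-to-many expansion can be sketched as a generator over an adjacency map (hypothetical data; the real operator drives an edge iterator per input row):

```python
def expand(rows, adjacency, source_slot, target_slot):
    """Outgoing-direction expand: one output row per edge of each input row."""
    for row in rows:
        source = row[source_slot]
        for target in adjacency.get(source, []):
            out = dict(row)           # copy input bindings
            out[target_slot] = target
            yield out

adjacency = {1: [2, 3], 2: [3]}       # node -> outgoing neighbors
rows = [{"a": 1}]                     # one input row bound to node 1
out = list(expand(rows, adjacency, "a", "b"))
# One input row produced two output rows: b=2 and b=3
```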
Limit and Skip
Limit: Returns at most N rows.
fn next(self: *Limit, ctx: *ExecutionContext) !?*Row {
if (self.returned >= self.count) return null;
const row = try self.input.next(ctx) orelse return null;
self.returned += 1;
return row;
}
Skip: Discards the first N rows.
fn next(self: *Skip, ctx: *ExecutionContext) !?*Row {
while (self.skipped < self.count) {
_ = try self.input.next(ctx) orelse return null;
self.skipped += 1;
}
return try self.input.next(ctx);
}
Sort Operator (Blocking)
Sort must see all input before producing any output—it's a blocking operator.
fn open(self: *Sort, ctx: *ExecutionContext) !void {
try self.input.open(ctx);
// Materialize all input rows
while (try self.input.next(ctx)) |row| {
try self.rows.append(row.*);
}
// Sort in memory
self.sortRows();
}
fn next(self: *Sort, ctx: *ExecutionContext) !?*Row {
if (self.index >= self.rows.items.len) return null;
defer self.index += 1;
return &self.rows.items[self.index];
}
This breaks the streaming model but is necessary for ordering.
Vector Search Operator
Performs k-NN search using the HNSW index.
fn open(self: *VectorSearch, ctx: *ExecutionContext) !void {
// Execute search upfront
self.results = try self.index.search(self.query_vector, self.k, null);
}
fn next(self: *VectorSearch, ctx: *ExecutionContext) !?*Row {
while (self.current < self.results.len) {
const result = self.results[self.current];
self.current += 1;
// Apply distance threshold
if (self.threshold) |t| {
if (result.distance > t) continue;
}
self.row.setSlot(self.output_slot, .{ .node_ref = result.node_id });
self.row.setDistance(self.output_slot, result.distance);
return self.row;
}
return null;
}
The distance is stored in the row for use in expressions or sorting.
FTS Search Operator
Performs full-text search with BM25 scoring.
fn open(self: *FtsSearch, ctx: *ExecutionContext) !void {
self.results = try self.index.search(self.query_text, self.limit);
}
fn next(self: *FtsSearch, ctx: *ExecutionContext) !?*Row {
// Similar to VectorSearch but with scores instead of distances
// ...
self.row.setScore(self.output_slot, result.score);
return self.row;
}
Mutation Operators
Mutation operators modify the graph structure. Unlike read operators, they have side effects.
CreateNode: Creates new nodes with labels and properties.
fn next(self: *CreateNode, ctx: *ExecutionContext) !?*Row {
const node_id = try ctx.storage.createNode(self.labels);
for (self.properties) |prop| {
const value = try self.evaluator.evaluate(prop.value, null, ctx);
try ctx.storage.setProperty(node_id, prop.key, value);
}
self.output_row.setSlot(self.output_slot, .{ .node_ref = node_id });
return &self.output_row;
}
DeleteNode: Removes nodes from the graph. With DETACH, also removes connected edges.
fn next(self: *DeleteNode, ctx: *ExecutionContext) !?*Row {
const row = try self.input.next(ctx) orelse return null;
const node_id = row.slots[self.target_slot].node_ref;
if (self.detach) {
try ctx.storage.detachDelete(node_id);
} else {
try ctx.storage.deleteNode(node_id);
}
return row; // Pass through for chaining
}
SetProperty: Updates properties on nodes or edges.
MATCH (n:Person {name: "Alice"}) SET n.age = 31, n.city = "NYC"
fn next(self: *SetProperty, ctx: *ExecutionContext) !?*Row {
const row = try self.input.next(ctx) orelse return null;
const target = row.slots[self.target_slot];
const value = try self.evaluator.evaluate(self.value_expr, row, ctx);
// SET n.prop = NULL removes the property
if (value == .null_val) {
try ctx.storage.removeProperty(target, self.property_name);
} else {
try ctx.storage.setProperty(target, self.property_name, value);
}
return row;
}
SetLabels: Adds labels to nodes.
MATCH (n:Person {name: "Alice"}) SET n:Admin:Verified
RemoveProperty: Explicitly removes properties.
MATCH (n:Person {name: "Alice"}) REMOVE n.city
RemoveLabel: Removes labels from nodes.
MATCH (n:Person {name: "Alice"}) REMOVE n:Verified
Expression Evaluation
The expression evaluator handles predicates and projections.
pub fn evaluate(expr: *const Expression, row: *const Row, ctx: *ExecutionContext) !EvalResult {
return switch (expr.*) {
.literal => |lit| evaluateLiteral(lit),
.variable => |v| evaluateVariable(v, row, ctx),
.property_access => |pa| evaluatePropertyAccess(pa, row, ctx),
.binary => |b| evaluateBinary(b, row, ctx),
.unary => |u| evaluateUnary(u, row, ctx),
.function_call => |f| evaluateFunction(f, row, ctx),
// ...
};
}
Supported Operations
Comparison: =, <>, <, <=, >, >=
Logical: AND, OR, NOT, XOR (with short-circuit evaluation)
Arithmetic: +, -, *, /, %, ^
String: CONTAINS, STARTS WITH, ENDS WITH
Null handling: IS NULL, IS NOT NULL
Functions: id(), coalesce(), abs(), size(), toInteger(), toFloat()
Vector Search: <=> (vector distance operator)
-- k-NN search with distance threshold
MATCH (c:Chunk) WHERE c.embedding <=> $query < 0.5 RETURN c
Full-Text Search: @@ (FTS match operator)
-- BM25-scored full-text search
MATCH (d:Doc) WHERE d.text @@ $search RETURN d
MATCH (d:Doc) WHERE d.text @@ "neural networks" RETURN d
These operators are recognized by the planner and converted to specialized search operators that use the HNSW and FTS indexes directly, rather than scanning all nodes.
Type Coercion
Numeric operations promote integers to floats when mixed:
42 + 3.14 → 45.14 (float)
Null propagates through most operations:
null + 1 → null
null = null → true (special case for equality)
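A sketch of these coercion rules as an evaluator might apply them (illustrative; the special-cased null equality follows the text above):

```python
NULL = None  # stands in for Cypher NULL

def add(a, b):
    # Null propagates through arithmetic
    if a is NULL or b is NULL:
        return NULL
    # Mixed int/float promotes to float
    if isinstance(a, float) or isinstance(b, float):
        return float(a) + float(b)
    return a + b

def equals(a, b):
    # Special case: null = null evaluates to true
    if a is NULL and b is NULL:
        return True
    if a is NULL or b is NULL:
        return NULL
    return a == b

assert abs(add(42, 3.14) - 45.14) < 1e-12
assert add(NULL, 1) is NULL
assert equals(NULL, NULL) is True
```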
The Query Planner
The planner transforms an AST into an operator tree by walking the query clauses.
Planning Algorithm
pub fn plan(query: *Query) !Operator {
var current: ?Operator = null;
for (query.clauses) |clause| {
current = switch (clause) {
.match => try planMatch(clause.match, current),
.where => try planWhere(clause.where, current),
.create => try planCreate(clause.create, current),
.delete => try planDelete(clause.delete, current),
.set => try planSet(clause.set, current),
.remove => try planRemove(clause.remove, current),
.return_ => try planReturn(clause.return_, current),
.order_by => try planOrderBy(clause.order_by, current),
.limit => try planLimit(clause.limit, current),
.skip => try planSkip(clause.skip, current),
};
}
return current.?;
}
Example: Simple Query
MATCH (p:Person) WHERE p.age > 30 RETURN p.name LIMIT 10
Produces:
Limit(count=10)
└── Project([p.name → slot 0])
└── Filter(p.age > 30)
└── LabelScan(label="Person", output=slot 0)
Example: Edge Traversal
MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name
Produces:
Project([a.name → slot 0, b.name → slot 1])
└── Expand(source=0, target=1, type="KNOWS", dir=outgoing)
└── LabelScan(label="Person", output=slot 0)
Variable Binding
The planner allocates slots for variables:
| Variable | Slot | Kind |
|---|---|---|
| a | 0 | node |
| b | 1 | node |
| r | 2 | edge |
These bindings are registered in the ExecutionContext for expression evaluation.
Execution Context
Runtime state for query execution:
pub const ExecutionContext = struct {
allocator: Allocator,
row_arena: ArenaAllocator, // Fast row allocation
parameters: StringHashMap(Value), // Query parameters ($name)
variables: StringHashMap(u8), // Variable → slot mapping
};
Row Arena
Rows are allocated from an arena that's reset between queries. This avoids per-row allocation overhead:
// During query execution
const row = try ctx.allocRow(); // Fast bump allocation
// After query completes
ctx.resetRowArena(); // Frees all rows at once
Query Parameters
Parameters allow safe value injection:
MATCH (p:Person) WHERE p.age > $min_age RETURN p
try ctx.setParameter("min_age", .{ .int_val = 30 });
Operator Composition
Operators are composable—any operator can wrap any other:
// Filter wrapping Expand wrapping LabelScan
Filter
└── Expand
└── LabelScan
// Limit wrapping Sort wrapping Filter
Limit
└── Sort
└── Filter
└── ...
This enables flexible query plans without special-casing combinations.
Memory Management
Operator Lifecycle
1. Planner creates operators (heap allocated)
2. Executor calls open() on root (cascades down)
3. Executor calls next() repeatedly (rows flow up)
4. Executor calls close() on root (cascades down)
5. Executor calls deinit() on root (frees operators bottom-up)
Iterator Cleanup
Operators holding B+Tree or index iterators must release them in close():
fn close(self: *LabelScan, ctx: *ExecutionContext) void {
if (self.iterator) |*iter| {
iter.deinit(); // Unpins pages in buffer pool
self.iterator = null;
}
}
Failing to do this leaks pinned pages.
Future Optimizations
The current planner uses simple heuristics. Future improvements could include:
Predicate Pushdown
Move filters closer to scans:
Before: Filter(a.x > 10) → Expand → LabelScan
After: Expand → Filter(a.x > 10) → LabelScan
Index Selection
Choose between AllNodesScan vs LabelScan vs property index based on selectivity.
Join Ordering
For multi-pattern queries, order joins by estimated cardinality.
Vectorized Execution
Process rows in batches instead of one-at-a-time for better cache utilization.
Database Query API
The Database.query() method provides the primary interface for executing Cypher queries. It orchestrates all components of the query pipeline.
Basic Usage
// Open database
var db = try Database.open(allocator, "mydb.ltdb", .{ .create = true });
defer db.close();
// Create some data
_ = try db.createNode(&[_][]const u8{"Person"});
_ = try db.createNode(&[_][]const u8{"Person"});
// Execute a query
var result = try db.query("MATCH (n:Person) RETURN n");
defer result.deinit(); // Always clean up results
// Process results
std.debug.print("Found {} rows\n", .{result.rowCount()});
for (result.rows) |row| {
for (row.values) |val| {
switch (val) {
.node_id => |id| std.debug.print("Node: {}\n", .{id}),
.string_val => |s| std.debug.print("String: {s}\n", .{s}),
.int_val => |i| std.debug.print("Int: {}\n", .{i}),
.null_val => std.debug.print("null\n", .{}),
else => {},
}
}
}
Result Types
/// A single value in a query result
pub const ResultValue = union(enum) {
null_val: void, // NULL
bool_val: bool, // true/false
int_val: i64, // integers
float_val: f64, // floating point
string_val: []const u8, // strings
node_id: NodeId, // node reference
};
/// A row in a query result
pub const ResultRow = struct {
values: []ResultValue, // One value per column
};
/// Complete query result
pub const QueryResult = struct {
columns: [][]const u8, // Column names from RETURN clause
rows: []ResultRow, // Result rows
allocator: Allocator,
pub fn deinit(self: *QueryResult) void; // Free all memory
pub fn rowCount(self: *const QueryResult) usize;
pub fn columnCount(self: *const QueryResult) usize;
};
Error Handling
The query() method returns QueryError on failure:
pub const QueryError = error{
ParseError, // Syntax error in Cypher
SemanticError, // Undefined variable, type mismatch
PlanError, // Query cannot be planned
ExecutionError, // Runtime error during execution
OutOfMemory,
};
Example error handling:
var result = db.query("MATCH (n RETURN n"); // Missing )
if (result) |*r| {
defer r.deinit();
// process results
} else |err| switch (err) {
QueryError.ParseError => std.debug.print("Syntax error\n", .{}),
QueryError.SemanticError => std.debug.print("Semantic error\n", .{}),
else => std.debug.print("Query failed: {}\n", .{err}),
}
Internal Pipeline
When you call db.query(cypher), the following happens:
1. Parser.init(allocator, cypher)
└── parser.parse() → AST (or ParseError)
2. SemanticAnalyzer.init(allocator)
└── analyzer.analyze(ast) → AnalysisResult (or SemanticError)
3. QueryPlanner.init(allocator, storage_context)
└── planner.plan(ast, analysis) → Operator tree (or PlanError)
4. ExecutionContext.init(allocator)
└── Register variable bindings from planner
5. execute(allocator, root_operator, context)
└── Pull rows through operator tree → executor.QueryResult
6. convertResult(exec_result, planner)
└── Transform to user-friendly QueryResult
The StorageContext connects the planner to database storage:
const storage_ctx = StorageContext{
.node_tree = &self.node_tree, // B+Tree for nodes
.label_index = &self.label_index, // Label → NodeId index
.edge_store = &self.edge_store, // Edge storage
.symbol_table = &self.symbol_table, // String interning
};
Memory Management
- QueryResult owns all memory: Call `deinit()` when done
- Strings are copied: Safe to use after query components are freed
- Row arena: Internal rows use arena allocation, reset between queries
Files
| File | Purpose |
|---|---|
| src/storage/database.zig | Database.query() API, QueryResult types |
| src/query/executor.zig | Operator interface, Row, ExecutionContext |
| src/query/expression.zig | Expression evaluator |
| src/query/planner.zig | AST to operator tree transformation |
| src/query/operators/scan.zig | AllNodesScan, LabelScan |
| src/query/operators/filter.zig | Filter operator |
| src/query/operators/project.zig | Project operator |
| src/query/operators/expand.zig | Edge traversal |
| src/query/operators/vector.zig | HNSW k-NN search |
| src/query/operators/fts.zig | Full-text search |
| src/query/operators/limit.zig | Limit, Skip, Sort |
| src/query/operators/mutation.zig | CREATE, DELETE, SET, REMOVE |
Benchmarks
Benchmarked on Apple M1, single-threaded, with auto-scaled buffer pool.
Core Operations
| Operation | Latency | Throughput | Target | Status |
|---|---|---|---|---|
| Node lookup | 0.13 us | 7.9M ops/sec | < 1 us | PASS |
| Node creation | 0.65 us | 1.5M ops/sec | — | — |
| Edge traversal | 9 us | 111K ops/sec | — | — |
| Full-text search (100 docs) | 19 us | 53K ops/sec | — | — |
| 10-NN vector search (1M vectors) | 0.83 ms | 1.2K ops/sec | < 10 ms @ 1M | PASS |
Vector Search (HNSW) at Scale
128-dimensional cosine vectors, M=16, ef_construction=200, ef_search=64, k=10.
| Scale | Mean Latency | P99 Latency | Recall@10 | Memory |
|---|---|---|---|---|
| 1,000 | 65 us | 70 us | 100% | 1 MB |
| 10,000 | 174 us | 695 us | 99% | 10 MB |
| 100,000 | 438 us | 1.2 ms | 99% | 101 MB |
| 1,000,000 | 832 us | 1.8 ms | 100% | 1,040 MB |
Search latency scales sub-linearly (O(log N)) with 99-100% recall@10. Uses heuristic neighbor selection (HNSW paper Algorithm 4) for diverse graph connectivity, connection page packing for ~4.5x memory reduction, and pre-normalized dot product for fast cosine distance.
ef_search Sensitivity (1M vectors)
| ef_search | Mean Latency | Recall@10 |
|---|---|---|
| 16 | 506 us | 57% |
| 32 | 1.9 ms | 79% |
| 64 | 990 us | 100% |
| 128 | 3.2 ms | 100% |
| 256 | 11.6 ms | 100% |
Optimization History
Baseline (pre-optimization)
| Scale | Insert Rate | Search Mean | Recall@10 |
|---|---|---|---|
| 1K | ~91/sec | 1.7ms | 100% |
| 10K | ~42/sec | 3.8ms | 99% |
| 100K | ~23/sec | 4.5ms | 99% |
| 1M | ~14/sec | 6.4ms | 100% |
Post-optimization (Phase 2)
Optimizations applied: last_page tracking, pre-sized search structures, stack-buffer connection I/O, cached vectors in heuristic pruning, and pre-normalize + dot product for cosine.
| Scale | Insert Rate | Search Mean | Recall@10 |
|---|---|---|---|
| 1K | ~954/sec | 65us | 100% |
| 10K | ~726/sec | 174us | 99% |
| 100K | ~526/sec | 438us | 99% |
| 1M | ~248/sec | 832us | 100% |
Improvement Summary
| Scale | Insert Speedup | Search Speedup |
|---|---|---|
| 1K | 10.5x | 26x |
| 10K | 17.5x | 22x |
| 100K | 22.8x | 10x |
| 1M | 17.7x | 7.7x |
Reproducing
zig build benchmark # Core operation benchmarks
zig build vector-benchmark -- --quick # Vector benchmarks (1K/10K/100K, ~7 min)
zig build vector-benchmark # Full vector benchmarks including 1M (~70 min)
zig build graph-benchmark -- --quick # Graph traversal benchmarks
Competitive Analysis
Point Lookups
| System | Latency | Type | Source |
|---|---|---|---|
| LatticeDB | 0.13 us | Embedded | zig build benchmark |
| RocksDB (in-memory) | 0.14 us | Embedded | RocksDB wiki |
| SQLite (in-memory) | ~0.2 us | Embedded | Turso blog |
| SQLite (WAL, disk) | 3 us (p90) | Embedded | marending.dev |
| Neo4j | 28 ms (p99) | Server | Memgraph comparison |
LatticeDB's B+Tree achieves sub-microsecond cached lookups, matching RocksDB in-memory and outperforming SQLite on disk by 23x.
Vector Search
| System | Latency (10-NN) | Scale | Type | Source |
|---|---|---|---|---|
| LatticeDB | 0.83 ms mean, 100% recall | 1M | Embedded | zig build vector-benchmark |
| FAISS HNSW (single-thread) | 0.5-3 ms | 1M | Library | FAISS wiki |
| Weaviate | 1.4 ms mean, 3.1 ms P99 | 1M | Server | Weaviate benchmarks |
| Qdrant | ~1-2 ms | 1M | Server | Qdrant benchmarks |
| Milvus + SQ8 | 2.2 ms P99 | 1M | Server | VectorDBBench |
| pgvector HNSW | ~5 ms @ 99% recall | 1M | Extension | Jonathan Katz |
| LanceDB | 3-5 ms | 1M | Embedded | LanceDB blog |
| Chroma | 4-5 ms mean | 1M | Embedded | Chroma docs |
| Pinecone P2 | ~15 ms (incl. network) | 1M | Cloud | Pinecone blog |
| sqlite-vec (brute force) | 17 ms | 1M | Extension | Alex Garcia |
LatticeDB at 1M achieves 0.83 ms mean with 100% recall@10 — within the range of single-threaded FAISS HNSW and competitive with server-based systems like Weaviate and Qdrant (which add network overhead in practice).
Graph Traversal
| System | 2-hop (100K nodes) | Type | Source |
|---|---|---|---|
| LatticeDB | 39 us | Embedded | zig build sqlite-benchmark |
| SQLite (recursive CTE) | 548 us | Embedded | zig build sqlite-benchmark |
| Kuzu | 19 ms | Embedded | The Data Quarry |
| Neo4j | 10 ms (1M nodes) | Server | Neo4j blog |
LatticeDB vs SQLite
Social network graph with power-law degree distribution, adjacency cache pre-warmed.
Small Scale (10K nodes, 50K edges)
| Workload | LatticeDB | SQLite | Speedup |
|---|---|---|---|
| 1-hop traversal | 560 ns | 13.0 us | 23x |
| 2-hop traversal | 3.0 us | 37.5 us | 13x |
| 3-hop traversal | 19.1 us | 178.5 us | 9x |
| Variable path (1..5) | 82.4 us | 4.3 ms | 52x |
Medium Scale (100K nodes, 500K edges)
| Workload | LatticeDB | SQLite | Speedup |
|---|---|---|---|
| 1-hop traversal | 8.0 us | 290.0 us | 36x |
| 2-hop traversal | 38.7 us | 548.3 us | 14x |
| 3-hop traversal | 197.3 us | 1.2 ms | 6x |
| Variable path (1..5) | 134.4 us | 10.1 ms | 75x |
Depth-Limited Traversal (10K nodes, 50K edges)
| Depth | LatticeDB | SQLite | Speedup |
|---|---|---|---|
| 10 | 311 us | 121 ms | 390x |
| 15 | 380 us | 271 ms | 713x |
| 25 | 318 us | 587 ms | 1,848x |
| 50 | 500 us | 1.4 s | 2,819x |
LatticeDB uses BFS with adjacency cache and bitset visited tracking. SQLite uses a recursive CTE with UNION deduplication. Both compute identical reachable node sets (~8K nodes). The gap widens at deeper depths as SQLite's CTE overhead grows with each recursion level.
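The BFS side of the comparison can be sketched as follows (Python for illustration; LatticeDB's implementation uses a bitset rather than a hash set for visited tracking):

```python
from collections import deque

def reachable(adjacency, start, max_depth):
    """Depth-limited BFS; the visited set deduplicates, which is what
    the recursive CTE pays for with UNION at every level."""
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited

# Chain 0 -> 1 -> 2 -> 3: depth 2 from node 0 reaches {0, 1, 2}
assert reachable({0: [1], 1: [2], 2: [3]}, 0, 2) == {0, 1, 2}
```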
Full-Text Search (BM25)
| System | Search Latency | Type | Source |
|---|---|---|---|
| LatticeDB | 19 us | Embedded | zig build benchmark |
| SQLite FTS5 | < 6 ms | Embedded | SQLite Cloud |
| Elasticsearch | 1-10 ms | Server | Various |
| Tantivy | 10-100 us | Library | Various |
LatticeDB's inverted index with BM25 scoring is ~300x faster than SQLite FTS5 and competitive with Tantivy (a dedicated Rust search library).
Building from Source
LatticeDB is written in Zig with zero external dependencies.
Prerequisites
- Zig (0.15.x or later)
Clone and Build
git clone https://github.com/jeffhajewski/latticedb.git
cd latticedb
zig build # build everything
Build Targets
zig build # Build everything
zig build library # Build static library only
zig build cli # Build CLI tool only
zig build amalgamation # Create single-file distribution
zig build shared # Build shared library for bindings
Optimized Builds
zig build release-safe # Release build with safety checks
zig build release-fast # Optimized release build
zig build -Doptimize=ReleaseFast # Alternative optimized build
Building Language Bindings
Python
# Build the shared library first
zig build shared
# The Python bindings use ctypes to load the shared library
cd bindings/python
pip install -e .
TypeScript
# Build the shared library first
zig build shared
# Build the TypeScript bindings
cd bindings/typescript
npm install
npm run build
Project Structure
src/
├── core/ # Core types and utilities
├── storage/ # B+Tree, page management, WAL
├── vector/ # HNSW index, vector operations
├── fts/ # Full-text search, tokenizer
├── query/ # Cypher parser, planner, executor
├── transaction/ # Transaction management, MVCC
├── concurrency/ # Locking, latches
├── api/ # C API bindings
└── cli/ # CLI tool
include/
└── lattice.h # C API header
bindings/
├── python/ # Python bindings
└── typescript/ # TypeScript/Node.js bindings
Running Tests
LatticeDB has a comprehensive test suite covering unit tests, integration tests, concurrency tests, crash recovery tests, and benchmarks.
Test Commands
zig build test # Run unit tests
zig build integration-test # Run integration tests
zig build concurrency-test # Run concurrency tests
zig build crash-test # Run crash recovery tests
Benchmarks
zig build benchmark # Core operation benchmarks
zig build vector-benchmark -- --quick # Vector benchmarks (1K/10K/100K, ~7 min)
zig build vector-benchmark # Full vector benchmarks including 1M (~70 min)
zig build graph-benchmark -- --quick # Graph traversal benchmarks
Test Structure
tests/
├── unit/ # Unit tests for individual modules
├── integration/ # End-to-end integration tests
├── fuzz/ # Fuzzing targets for parser and serialization
├── crash/ # Crash recovery tests (kill process mid-transaction)
├── concurrency/ # Multi-threaded concurrency tests
└── performance/ # Performance benchmarks
Testing Standards
- Aim for 100% branch coverage on core modules
- Fuzzing is mandatory for the parser and serialization code
- Crash recovery is tested by killing the process mid-transaction and verifying data integrity
- Concurrency tests cover all multi-threaded code paths
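The durability idea behind the crash tests can be illustrated generically: if each log record is length-prefixed and checksummed, a reader can detect the torn tail left by a mid-transaction kill and replay only the intact prefix. The following is a minimal sketch of that pattern, not LatticeDB's actual WAL format:

```python
import struct
import zlib

def append_record(buf: bytearray, payload: bytes) -> None:
    """Frame a record as [length][crc32][payload] and append it to the log."""
    buf += struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def recover(buf: bytes) -> list:
    """Replay complete records; stop at the first truncated or corrupt frame."""
    records, off = [], 0
    while off + 8 <= len(buf):
        length, crc = struct.unpack_from("<II", buf, off)
        payload = buf[off + 8 : off + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break                      # torn tail: discard everything from here on
        records.append(payload)
        off += 8 + length
    return records

log = bytearray()
append_record(log, b"txn1: set a=1")
append_record(log, b"txn2: set b=2")
torn = bytes(log)[:-5]                 # simulate a crash mid-write of txn2
assert recover(torn) == [b"txn1: set a=1"]
assert recover(bytes(log)) == [b"txn1: set a=1", b"txn2: set b=2"]
```

The crash tests verify this end to end: kill the process, reopen the file, and check that every committed transaction survives and no partial one is visible.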
TypeScript Binding Tests
cd bindings/typescript
npm test
Python Binding Tests
cd bindings/python
pytest
Contributing
Contributions to LatticeDB are welcome.
Getting Started
- Fork the repository on GitHub
- Clone your fork and create a branch
- Make your changes
- Run the test suite: zig build test
- Submit a pull request
Development Setup
git clone https://github.com/YOUR_USERNAME/latticedb.git
cd latticedb
zig build test # Verify everything builds and passes
Code Style
- Follow existing Zig conventions in the codebase
- All allocation goes through explicit allocator parameters
- Fail fast: detect and report corruption, don't hide it
- Keep the C API as the contract: all bindings wrap include/lattice.h
Testing
All changes should include appropriate tests:
- Unit tests for new functions or modules
- Integration tests for end-to-end behavior changes
- Fuzz tests for parser or serialization changes
- Crash tests for durability-related changes
Run the full test suite before submitting:
zig build test
zig build integration-test
Pull Requests
- Keep PRs focused on a single change
- Include a clear description of what changed and why
- Ensure all tests pass
- Add tests for new functionality
Reporting Issues
File issues on GitHub with:
- A clear description of the problem
- Steps to reproduce
- Expected vs actual behavior
- LatticeDB version and platform