LatticeDB
The embedded knowledge graph for AI.
LatticeDB is a single-file database that combines a property graph, vector search, and full-text search. It's built for RAG, agents, and any application where relationships between data matter as much as the data itself.
- One file. Your entire database is a single portable file. No server, no configuration.
- Three search modes. Graph traversal, HNSW vector similarity, and BM25 full-text — in one query.
- Sub-millisecond. 0.13 us node lookups. 0.83 ms vector search at 1M vectors with 100% recall.
// Find chunks similar to a query, traverse to their document, then to the author
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
WHERE chunk.embedding <=> $query_vector < 0.3
AND doc.content @@ "neural networks"
RETURN doc.title, chunk.text, author.name
ORDER BY chunk.embedding <=> $query_vector
LIMIT 10
Features
Graph
- Nodes and edges with labels and arbitrary properties
- Multi-hop traversal, variable-length paths (*1..3)
- ACID transactions with snapshot isolation
- MERGE, WITH, UNWIND, aggregations (count, sum, avg, min, max, collect)
Vector Search
- HNSW approximate nearest neighbor with configurable M, ef
- Built-in hash embeddings or HTTP client for Ollama/OpenAI
- Batch insert for bulk loading
Full-Text Search
- BM25-ranked inverted index with tokenization and stemming
- Fuzzy search with configurable Levenshtein distance
Cypher Query Language
- MATCH, WHERE, RETURN, CREATE, DELETE, SET, REMOVE
- ORDER BY, LIMIT, SKIP, DETACH DELETE
- Vector distance operator: <=>
- Full-text search operator: @@
- Parameters: $name
Operations
- Single-file storage with write-ahead log for crash recovery
- Zero configuration — open a file and start working
- Clean C API; Python and TypeScript bindings wrap it
Use Cases
- RAG Systems — Vector search finds relevant chunks, graph traversal gathers context
- Knowledge Graphs — Linked notes and documents with semantic search
- AI Agents — Persistent memory with relationship awareness
- Local Development — Lightweight alternative to Neo4j or Weaviate for prototyping
Getting Started
Head to the Installation page to install LatticeDB, then follow the Quick Start guide to build your first knowledge graph.
Installation
CLI
curl -fsSL https://raw.githubusercontent.com/jeffhajewski/latticedb/main/dist/install.sh | bash
Python
pip install latticedb
Requires Python 3.9+ and NumPy. The native shared library (liblattice.dylib / liblattice.so) must be available on the system.
TypeScript / Node.js
npm install @hajewski/latticedb
Requires Node.js 18+. The native shared library must be available on the system.
Building from Source
LatticeDB is written in Zig with zero dependencies.
git clone https://github.com/jeffhajewski/latticedb.git
cd latticedb
zig build # build everything
zig build test # run tests
zig build -Doptimize=ReleaseFast # optimized build
Build the shared library for language bindings:
zig build shared
This produces liblattice.dylib (macOS) or liblattice.so (Linux).
See Building from Source for more details.
Quick Start
This guide walks through creating a small knowledge graph with documents and authors, storing embeddings, indexing text, and querying across all three search modes.
Python
from latticedb import Database, hash_embed

with Database("knowledge.db", create=True, enable_vector=True, vector_dimensions=128) as db:
    # --- Build the graph ---
    with db.write() as txn:
        # Create authors
        alice = txn.create_node(labels=["Person"], properties={"name": "Alice", "field": "ML"})
        bob = txn.create_node(labels=["Person"], properties={"name": "Bob", "field": "Systems"})
        txn.create_edge(alice.id, bob.id, "COLLABORATES_WITH")

        # Create documents with chunks
        for title, text, author in [
            ("Attention Is All You Need", "The transformer architecture uses self-attention...", alice),
            ("Scaling Laws for LLMs", "We find that model performance scales predictably...", alice),
            ("Log-Structured Merge Trees", "LSM trees optimize write-heavy workloads...", bob),
        ]:
            doc = txn.create_node(labels=["Document"], properties={"title": title})
            chunk = txn.create_node(labels=["Chunk"], properties={"text": text})
            # Store embedding and index text
            txn.set_vector(chunk.id, "embedding", hash_embed(text, dimensions=128))
            txn.fts_index(chunk.id, text)
            txn.create_edge(chunk.id, doc.id, "PART_OF")
            txn.create_edge(doc.id, author.id, "AUTHORED_BY")
        txn.commit()

    # --- Query: vector search + text match + graph traversal ---
    results = db.query("""
        MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
        WHERE chunk.embedding <=> $query < 0.5
        RETURN doc.title, chunk.text, author.name
        ORDER BY chunk.embedding <=> $query
        LIMIT 5
    """, parameters={"query": hash_embed("transformer attention mechanism", dimensions=128)})
    for row in results:
        print(f"{row['doc.title']} by {row['author.name']}")

    # --- Full-text search ---
    for r in db.fts_search("self-attention transformer"):
        print(f"Node {r.node_id}: score={r.score:.4f}")

    # --- Aggregations ---
    stats = db.query("""
        MATCH (doc:Document)-[:AUTHORED_BY]->(p:Person)
        RETURN p.name, count(doc) AS papers
        ORDER BY papers DESC
    """)
    for row in stats:
        print(f"{row['p.name']}: {row['papers']} papers")
TypeScript
import { Database, hashEmbed } from "@hajewski/latticedb";

const db = new Database("knowledge.db", {
  create: true,
  enableVector: true,
  vectorDimensions: 128,
});
await db.open();

// Build a graph
await db.write(async (txn) => {
  const alice = await txn.createNode({
    labels: ["Person"],
    properties: { name: "Alice", field: "ML" },
  });
  const doc = await txn.createNode({
    labels: ["Document"],
    properties: { title: "Attention Is All You Need" },
  });
  const chunk = await txn.createNode({
    labels: ["Chunk"],
    properties: { text: "The transformer architecture uses self-attention..." },
  });
  await txn.setVector(chunk.id, "embedding", hashEmbed("transformer self-attention", 128));
  await txn.ftsIndex(chunk.id, "The transformer architecture uses self-attention...");
  await txn.createEdge(chunk.id, doc.id, "PART_OF");
  await txn.createEdge(doc.id, alice.id, "AUTHORED_BY");
});

// Query across vector search + graph traversal
const results = await db.query(
  `MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
   WHERE chunk.embedding <=> $query < 0.5
   RETURN doc.title, chunk.text, author.name
   ORDER BY chunk.embedding <=> $query
   LIMIT 5`,
  { query: hashEmbed("attention mechanism", 128) }
);
for (const row of results.rows) {
  console.log(`${row["doc.title"]} by ${row["author.name"]}`);
}
await db.close();
Next Steps
- Core Concepts — understand the data model
- Cypher Overview — learn the query language
- Building a RAG System — end-to-end tutorial
Core Concepts
Property Graph
LatticeDB stores data as a property graph: a collection of nodes connected by edges, where both nodes and edges can have labels and key-value properties.
Nodes
A node represents an entity. Each node has:
- A unique ID (assigned automatically)
- One or more labels that categorize it (e.g., Person, Document, Chunk)
- Zero or more properties — key-value pairs (e.g., name: "Alice", age: 30)
alice = txn.create_node(
    labels=["Person"],
    properties={"name": "Alice", "age": 30},
)
Edges
An edge represents a relationship between two nodes. Each edge has:
- A source node and a target node
- An edge type (e.g., KNOWS, AUTHORED_BY, PART_OF)
- A direction — edges always point from source to target
txn.create_edge(alice.id, bob.id, "KNOWS")
Properties
Properties are typed key-value pairs attached to nodes. Supported types:
| Type | Python | TypeScript | C |
|---|---|---|---|
| Null | None | null | LATTICE_VALUE_NULL |
| Boolean | bool | boolean | LATTICE_VALUE_BOOL |
| Integer | int | number | LATTICE_VALUE_INT |
| Float | float | number | LATTICE_VALUE_FLOAT |
| String | str | string | LATTICE_VALUE_STRING |
| Binary | bytes | Uint8Array | LATTICE_VALUE_BYTES |
Cypher Query Language
LatticeDB uses Cypher — a declarative graph query language. Nodes are written in parentheses, edges in square brackets:
MATCH (person:Person)-[:KNOWS]->(friend:Person)
WHERE person.name = "Alice"
RETURN friend.name
LatticeDB extends Cypher with two operators:
- <=> — vector distance for similarity search
- @@ — full-text search with BM25 scoring
Vector Search
LatticeDB includes an HNSW (Hierarchical Navigable Small World) index for approximate nearest-neighbor search on vector embeddings.
To use vector search:
- Enable vector storage when opening the database
- Attach vector embeddings to nodes
- Search by similarity using vector_search() or the <=> operator in Cypher
# Store an embedding
txn.set_vector(node.id, "embedding", vector)
# Search by similarity
results = db.vector_search(query_vector, k=10)
# Or in Cypher
db.query("MATCH (n:Chunk) WHERE n.embedding <=> $q < 0.5 RETURN n", parameters={"q": query_vector})
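This section does not state which distance metric <=> computes. Assuming cosine distance (a common choice for text embeddings, so treat the metric itself as an assumption), the quantity being thresholded and sorted on can be sketched as:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 for identical direction, up to 2 for
    opposite direction. Assumes both vectors are nonzero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)
```

Smaller values mean more similar vectors, which is why the queries above filter with `< 0.5` and sort ascending.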
Embeddings
LatticeDB provides two ways to generate embeddings:
- hash_embed — a built-in deterministic hash function. No external service needed. Useful for testing and simple keyword-based similarity.
- EmbeddingClient — an HTTP client that connects to Ollama, OpenAI, or any compatible API for production-quality embeddings.
Full-Text Search
LatticeDB includes a BM25-scored inverted index for full-text search. Index text content on nodes, then search across all indexed text.
# Index text
txn.fts_index(node.id, "The transformer architecture uses self-attention")
# Search
results = db.fts_search("transformer attention")
# Fuzzy search (typo-tolerant)
results = db.fts_search_fuzzy("transformr atention")
# Or in Cypher
db.query('MATCH (n) WHERE n.text @@ "transformer attention" RETURN n')
Transactions
All operations in LatticeDB happen within transactions. LatticeDB uses snapshot isolation: each transaction sees a consistent snapshot of the database as of when it started.
- Read transactions can run concurrently — readers never block other readers.
- Write transactions are serialized — only one write transaction can commit at a time.
# Read transaction
with db.read() as txn:
    node = txn.get_node(node_id)

# Write transaction
with db.write() as txn:
    txn.create_node(labels=["Person"], properties={"name": "Alice"})
    txn.commit()
If a write transaction is not explicitly committed, it is rolled back when the context exits.
Single-File Storage
The entire database — nodes, edges, properties, vector index, text index, and write-ahead log — lives in a single file. This makes databases portable and easy to manage: copy, back up, or deploy by moving one file.
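Because everything is one file, ordinary file operations cover backup and deployment. A sketch with illustrative filenames (close the database first, so the WAL has been checkpointed and the file is self-consistent):

```python
import shutil
from pathlib import Path

# Stand-in file for demonstration; in practice this is your database
Path("knowledge.db").touch()

# Back up by copying the single file
Path("backups").mkdir(exist_ok=True)
shutil.copy2("knowledge.db", "backups/knowledge-snapshot.db")
```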
When to Use LatticeDB
LatticeDB is fast, but speed is not the only thing that matters. Here are cases where a different tool is the better choice.
You need multiple applications writing to the same database at the same time
LatticeDB is embedded with a single-writer model. One process opens the file and owns it. If you need many clients connecting over a network, use Neo4j, PostgreSQL, or another client-server database.
Your data is fundamentally tabular
If your data fits naturally into rows and columns — sales records, user accounts, time series — a relational database like SQLite or PostgreSQL will be simpler and just as fast. Graph databases shine when relationships between records are the point, not an afterthought.
You need to scale beyond a single machine
LatticeDB stores everything in one file on one machine. If you need sharding, replication, or distributed queries across billions of nodes, look at Neo4j cluster, Dgraph, or a managed service like Neptune.
You need the full Cypher language
LatticeDB supports most of Cypher but not all of it. Features like OPTIONAL MATCH and CALL procedures are not yet implemented. If your queries depend on these, use Neo4j, which implements the full language.
You need mature tooling and ecosystem
Neo4j has visualization tools, admin dashboards, monitoring, drivers in every language, and years of community resources. PostgreSQL has decades of tooling. LatticeDB is new and lean — which is a strength for embedding, but a weakness if you need a rich operational ecosystem around your database.
Building a RAG System
This guide walks through building a Retrieval-Augmented Generation (RAG) system with LatticeDB. We'll chunk documents, generate embeddings, store them with graph relationships, and query using vector search combined with graph traversal.
Architecture
A typical RAG system with LatticeDB:
- Ingest — Split documents into chunks, generate embeddings, store in the graph
- Link — Create edges between chunks, documents, authors, topics
- Retrieve — Vector search finds relevant chunks, graph traversal gathers context
- Generate — Pass retrieved context to an LLM
Step 1: Set Up the Database
from latticedb import Database, hash_embed
db = Database(
    "rag.db",
    create=True,
    enable_vector=True,
    vector_dimensions=128,  # Match your embedding model's output
)
db.open()
For production, use a real embedding model via the HTTP client:
from latticedb import EmbeddingClient
client = EmbeddingClient(
    "http://localhost:11434",
    model="nomic-embed-text",
)
Step 2: Ingest Documents
Split documents into chunks and store them with graph relationships:
def ingest_document(db, title, author_name, chunks):
    with db.write() as txn:
        # Create the document and author
        doc = txn.create_node(
            labels=["Document"],
            properties={"title": title},
        )
        author = txn.create_node(
            labels=["Person"],
            properties={"name": author_name},
        )
        txn.create_edge(doc.id, author.id, "AUTHORED_BY")

        # Create chunks with embeddings
        prev_chunk = None
        for i, text in enumerate(chunks):
            chunk = txn.create_node(
                labels=["Chunk"],
                properties={"text": text, "position": i},
            )
            # Store embedding
            embedding = hash_embed(text, dimensions=128)
            txn.set_vector(chunk.id, "embedding", embedding)
            # Index for full-text search
            txn.fts_index(chunk.id, text)
            # Link chunk to document
            txn.create_edge(chunk.id, doc.id, "PART_OF")
            # Link sequential chunks
            if prev_chunk is not None:
                txn.create_edge(prev_chunk.id, chunk.id, "NEXT")
            prev_chunk = chunk
        txn.commit()
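ingest_document expects pre-split chunks. A minimal word-window chunker is enough to feed it; this helper is illustrative only (chunk_text, chunk_size, and overlap are not part of the LatticeDB API):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of whole words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from both neighboring chunks.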
Step 3: Add Topic Links
Enrich the graph with topic relationships:
with db.write() as txn:
    ml_topic = txn.create_node(
        labels=["Topic"],
        properties={"name": "Machine Learning"},
    )
    # Link documents to topics
    txn.create_edge(doc.id, ml_topic.id, "ABOUT")
    txn.commit()
Step 4: Query — Vector Search + Graph Context
The key advantage of LatticeDB: retrieve by similarity, then traverse the graph for additional context.
def retrieve_context(db, query_text, k=5):
    """Retrieve relevant chunks with their surrounding context."""
    query_vec = hash_embed(query_text, dimensions=128)
    # Find similar chunks and traverse to their documents and authors
    results = db.query("""
        MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
        WHERE chunk.embedding <=> $query < 0.5
        RETURN chunk.text, doc.title, author.name
        ORDER BY chunk.embedding <=> $query
        LIMIT $k
    """, parameters={"query": query_vec, "k": k})
    return results
Retrieve with Neighboring Chunks
Get surrounding chunks for more context:
results = db.query("""
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)
WHERE chunk.embedding <=> $query < 0.5
WITH chunk, doc
ORDER BY chunk.embedding <=> $query
LIMIT 5
MATCH (prev:Chunk)-[:NEXT]->(chunk)
RETURN prev.text, chunk.text, doc.title
""", parameters={"query": query_vec})
Combine Vector and Full-Text Search
results = db.query("""
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)
WHERE chunk.embedding <=> $query < 0.5
AND chunk.text @@ $keywords
RETURN chunk.text, doc.title
ORDER BY chunk.embedding <=> $query
LIMIT 10
""", parameters={
"query": query_vec,
"keywords": "transformer attention",
})
Step 5: Pass to LLM
context = retrieve_context(db, "How does self-attention work?")
# Build prompt with retrieved context
chunks = [f"[{r['doc.title']}] {r['chunk.text']}" for r in context]
prompt = f"""Answer the question based on the following context:
{chr(10).join(chunks)}
Question: How does self-attention work?"""
# Pass to your LLM of choice
# response = llm.generate(prompt)
Batch Loading
For large datasets, use batch insert:
import numpy as np

with db.write() as txn:
    # Insert 10,000 chunks at once
    vectors = np.array([hash_embed(text, 128) for text in all_chunks], dtype=np.float32)
    node_ids = txn.batch_insert("Chunk", vectors)
    # Set properties and create edges afterward
    for node_id, text in zip(node_ids, all_chunks):
        txn.set_property(node_id, "text", text)
        txn.fts_index(node_id, text)
    txn.commit()
Performance Tips
- Use batch_insert() for bulk loading — significantly faster than individual creates
- Set ef_search based on your recall requirements (64 gives 100% recall at 1M vectors)
- Use cache_size_mb to control memory usage
- Index only the text fields you need to search with fts_index()
- Use parameters ($name) to enable query plan caching
Knowledge Graph Modeling
This guide covers data modeling patterns for knowledge graphs in LatticeDB.
Basic Patterns
Entities and Relationships
Model real-world entities as nodes and their relationships as edges:
with db.write() as txn:
    alice = txn.create_node(labels=["Person"], properties={"name": "Alice"})
    company = txn.create_node(labels=["Company"], properties={"name": "Acme Corp"})
    txn.create_edge(alice.id, company.id, "WORKS_AT")
    txn.commit()
Labels as Types
Use labels to categorize nodes. A node can have multiple labels:
txn.create_node(labels=["Person", "Employee", "Manager"], properties={"name": "Alice"})
Query by label:
MATCH (m:Manager) RETURN m.name
Document Graph
A common pattern for document management:
Document -[:AUTHORED_BY]-> Person
Document -[:ABOUT]-> Topic
Chunk -[:PART_OF]-> Document
Chunk -[:NEXT]-> Chunk
with db.write() as txn:
    doc = txn.create_node(labels=["Document"], properties={"title": "Paper Title"})
    author = txn.create_node(labels=["Person"], properties={"name": "Alice"})
    topic = txn.create_node(labels=["Topic"], properties={"name": "Machine Learning"})
    txn.create_edge(doc.id, author.id, "AUTHORED_BY")
    txn.create_edge(doc.id, topic.id, "ABOUT")

    # Create a chain of chunks
    chunks = []
    for i, text in enumerate(chunk_texts):
        chunk = txn.create_node(labels=["Chunk"], properties={"text": text, "position": i})
        txn.create_edge(chunk.id, doc.id, "PART_OF")
        if chunks:
            txn.create_edge(chunks[-1].id, chunk.id, "NEXT")
        chunks.append(chunk)
    txn.commit()
Multi-Hop Queries
Finding Collaborators
// Direct collaborators
MATCH (a:Person {name: "Alice"})-[:COLLABORATES_WITH]->(b:Person)
RETURN b.name
// Collaborators of collaborators
MATCH (a:Person {name: "Alice"})-[:COLLABORATES_WITH*2]->(c:Person)
RETURN DISTINCT c.name
Path Queries
// How are two people connected?
MATCH (a:Person {name: "Alice"})-[*1..4]->(b:Person {name: "Dave"})
RETURN a.name, b.name
Aggregating Over Relationships
// Authors with the most documents about ML
MATCH (p:Person)<-[:AUTHORED_BY]-(d:Document)-[:ABOUT]->(t:Topic {name: "Machine Learning"})
RETURN p.name, count(d) AS papers
ORDER BY papers DESC
LIMIT 10
Social Network
Person -[:KNOWS]-> Person
Person -[:FOLLOWS]-> Person
Person -[:FOLLOWS]-> Topic
Person -[:POSTED]-> Post
Post -[:TAGGED]-> Topic
// Find friends who share interests
MATCH (me:Person {name: "Alice"})-[:KNOWS]->(friend:Person)
MATCH (me)-[:FOLLOWS]->(topic:Topic)<-[:FOLLOWS]-(friend)
RETURN friend.name, collect(topic.name) AS shared_interests
Modeling Tips
- Use descriptive edge types. AUTHORED_BY is more useful than RELATED_TO.
- Keep properties simple. Store complex data as separate nodes with edges rather than large property values.
- Use labels for querying. Labels enable efficient scans via the label index.
- Model for your queries. Think about what traversals you need and structure edges accordingly.
- Use MERGE for idempotent imports. When loading data from multiple sources, MERGE prevents duplicate nodes.
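To illustrate the last tip: an import written with MERGE matches existing nodes instead of creating duplicates, so re-running it is safe (a sketch; check MERGE's exact match-or-create semantics against the Cypher reference before relying on it):

```cypher
MERGE (p:Person {name: "Alice"})
MERGE (c:Company {name: "Acme Corp"})
MERGE (p)-[:WORKS_AT]->(c)
```

Running this twice leaves one Alice, one Acme Corp, and one WORKS_AT edge.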
Working with Embeddings
LatticeDB supports storing and searching vector embeddings on nodes. This guide covers the embedding options and how to use them.
Overview
To use embeddings:
- Enable vector storage when opening the database
- Generate embeddings (built-in hash or external service)
- Attach embeddings to nodes
- Search by similarity
Enabling Vector Storage
db = Database(
    "mydb.db",
    create=True,
    enable_vector=True,
    vector_dimensions=128,  # Must match your embedding dimensions
)
const db = new Database("mydb.db", {
  create: true,
  enableVector: true,
  vectorDimensions: 128,
});
Hash Embeddings (Built-in)
hash_embed generates deterministic embeddings from text without an external service. It uses feature hashing to map text tokens to a fixed-dimension vector.
When to use: Testing, prototyping, or when keyword-level similarity is sufficient.
Python:
from latticedb import hash_embed
vec = hash_embed("hello world", dimensions=128)
# Returns a numpy array of shape (128,)
TypeScript:
import { hashEmbed } from "@hajewski/latticedb";
const vec = hashEmbed("hello world", 128);
// Returns a Float32Array of length 128
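The exact algorithm behind hash_embed is not specified here. Conceptually, feature hashing maps each token to a bucket and accumulates a signed count, then normalizes; the following pure-Python sketch illustrates the idea and is not the actual implementation:

```python
import hashlib
import math

def toy_hash_embed(text: str, dimensions: int = 128) -> list[float]:
    """Illustrative feature hashing: hash each token to a bucket,
    add a signed count, then L2-normalize the result."""
    vec = [0.0] * dimensions
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        bucket = h % dimensions
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

Because the hash is deterministic, the same text always yields the same vector, which is what makes hash embeddings reproducible in tests.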
HTTP Embedding Client
For production-quality semantic embeddings, use the HTTP client to connect to an embedding service.
Ollama
from latticedb import EmbeddingClient
with EmbeddingClient("http://localhost:11434") as client:
    vec = client.embed("hello world")
import { EmbeddingClient } from "@hajewski/latticedb";
const client = new EmbeddingClient({
  endpoint: "http://localhost:11434",
});
const vec = client.embed("hello world");
client.close();
OpenAI
from latticedb import EmbeddingClient, EmbeddingApiFormat
with EmbeddingClient(
    "https://api.openai.com/v1",
    model="text-embedding-3-small",
    api_format=EmbeddingApiFormat.OPENAI,
    api_key="sk-...",
) as client:
    vec = client.embed("hello world")
import { EmbeddingClient, EmbeddingApiFormat } from "@hajewski/latticedb";
const client = new EmbeddingClient({
  endpoint: "https://api.openai.com/v1",
  model: "text-embedding-3-small",
  apiFormat: EmbeddingApiFormat.OpenAI,
  apiKey: "sk-...",
});
const vec = client.embed("hello world");
client.close();
Storing Embeddings
Attach a vector to a node within a write transaction:
with db.write() as txn:
    node = txn.create_node(labels=["Chunk"], properties={"text": "some text"})
    embedding = hash_embed("some text", dimensions=128)
    txn.set_vector(node.id, "embedding", embedding)
    txn.commit()
Searching
Programmatic API
query_vec = hash_embed("search query", dimensions=128)
results = db.vector_search(query_vec, k=10, ef_search=64)
for r in results:
    print(f"Node {r.node_id}: distance={r.distance:.4f}")
Cypher
MATCH (n:Chunk)
WHERE n.embedding <=> $query < 0.5
RETURN n.text
ORDER BY n.embedding <=> $query
LIMIT 10
Batch Insert
For bulk loading, use batch_insert which is significantly faster than inserting one at a time:
import numpy as np
with db.write() as txn:
    vectors = np.random.rand(10000, 128).astype(np.float32)
    node_ids = txn.batch_insert("Document", vectors)
    txn.commit()
Choosing Dimensions
The vector_dimensions parameter must be set when opening the database and must match the embedding model output:
| Model | Dimensions |
|---|---|
| hash_embed (built-in) | Configurable (default 128) |
| text-embedding-3-small (OpenAI) | 1536 |
| text-embedding-3-large (OpenAI) | 3072 |
| nomic-embed-text (Ollama) | 768 |
| mxbai-embed-large (Ollama) | 1024 |
Higher dimensions capture more semantic nuance but use more memory and slow down search.
Full-Text Search
LatticeDB includes a BM25-scored inverted index for full-text search. This guide covers indexing, searching, and fuzzy matching.
How It Works
LatticeDB's full-text search uses:
- Tokenization — text is split into terms
- Stemming — terms are reduced to their root form
- Inverted index — maps terms to the nodes containing them
- BM25 scoring — ranks results by relevance considering term frequency, document frequency, and document length
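The ranking step can be illustrated with a compact, self-contained BM25 sketch over a toy whitespace tokenizer (k1 and b are the usual defaults; LatticeDB's tokenizer and stemmer will produce different scores):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25
    (toy tokenizer: lowercase whitespace split, no stemming)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n  # average document length
    df = Counter()  # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            f = tf[term]  # term frequency in this document
            if f == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores
```

Rarer terms get a higher idf weight, and the length normalization keeps long documents from winning on term frequency alone.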
Indexing Text
Index text content on a node within a write transaction:
with db.write() as txn:
    node = txn.create_node(labels=["Document"], properties={"title": "My Doc"})
    txn.fts_index(node.id, "The quick brown fox jumps over the lazy dog")
    txn.commit()
await db.write(async (txn) => {
  const node = await txn.createNode({
    labels: ["Document"],
    properties: { title: "My Doc" },
  });
  await txn.ftsIndex(node.id, "The quick brown fox jumps over the lazy dog");
});
Searching
Programmatic API
results = db.fts_search("quick fox", limit=10)
for r in results:
    print(f"Node {r.node_id}: score={r.score:.4f}")
const results = await db.ftsSearch("quick fox", { limit: 10 });
for (const r of results) {
  console.log(`Node ${r.nodeId}: score=${r.score.toFixed(4)}`);
}
Cypher
MATCH (d:Document)
WHERE d.content @@ "quick fox"
RETURN d.title
Fuzzy Search
Fuzzy search tolerates typos using Levenshtein edit distance:
# Finds "machine learning" despite typos
results = db.fts_search_fuzzy("machin lerning", limit=10)
Controlling Sensitivity
results = db.fts_search_fuzzy(
    "machne",
    limit=10,
    max_distance=2,  # Max edit distance (default: 2)
    min_term_length=4,  # Min term length for fuzzy matching (default: 4)
)
const results = await db.ftsSearchFuzzy("machne", {
  limit: 10,
  maxDistance: 2,
  minTermLength: 4,
});
- max_distance — maximum Levenshtein edit distance. Higher values find more matches but may include irrelevant results.
- min_term_length — minimum term length to apply fuzzy matching. Short terms (like "a", "the") are matched exactly.
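For intuition about what max_distance means, here is the standard dynamic-programming computation of Levenshtein distance (illustrative, not LatticeDB's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character insertions,
    deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[len(b)]
```

With the default max_distance of 2, "machne" matches "machine" (one insertion) but not a term three edits away.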
Combining with Vector Search
Use both search modes in a single Cypher query for hybrid retrieval:
MATCH (chunk:Chunk)
WHERE chunk.embedding <=> $query < 0.5
AND chunk.text @@ "neural networks"
RETURN chunk.text
ORDER BY chunk.embedding <=> $query
LIMIT 10
Performance
Full-text search in LatticeDB is fast:
| Operation | Latency |
|---|---|
| FTS search (100 docs) | 19 us |
This is ~300x faster than SQLite FTS5 and competitive with Tantivy, a dedicated Rust search library. See Benchmarks for details.
Transactions and Durability
LatticeDB provides ACID transactions with snapshot isolation.
Transaction Model
- Read transactions see a consistent snapshot of the database. Multiple read transactions can run concurrently.
- Write transactions can modify the database. Only one write transaction can commit at a time (single-writer model).
- Snapshot isolation — each transaction sees the database as it was when the transaction began. Other transactions' uncommitted changes are invisible.
Using Transactions
Python
# Read transaction
with db.read() as txn:
    node = txn.get_node(node_id)
    edges = txn.get_outgoing_edges(node_id)
# Transaction automatically completes when context exits

# Write transaction
with db.write() as txn:
    node = txn.create_node(labels=["Person"], properties={"name": "Alice"})
    txn.commit()
# If commit() is not called, the transaction is rolled back
TypeScript
// Read transaction
await db.read(async (txn) => {
  const node = await txn.getNode(nodeId);
  const edges = await txn.getOutgoingEdges(nodeId);
});

// Write transaction
await db.write(async (txn) => {
  const node = await txn.createNode({
    labels: ["Person"],
    properties: { name: "Alice" },
  });
  // Transaction auto-commits on success, rolls back on error
});
C
// Begin a read transaction
lattice_txn* txn;
lattice_begin(db, LATTICE_TXN_READ_ONLY, &txn);
// ... read operations ...
lattice_commit(txn);
// Begin a write transaction
lattice_begin(db, LATTICE_TXN_READ_WRITE, &txn);
// ... write operations ...
lattice_commit(txn); // or lattice_rollback(txn);
Queries and Transactions
db.query() automatically creates the appropriate transaction:
- Read queries (MATCH ... RETURN) use a read transaction
- Write queries (CREATE, SET, DELETE) use a write transaction
# Implicit read transaction
results = db.query("MATCH (n:Person) RETURN n.name")
# Implicit write transaction
db.query("CREATE (n:Person {name: 'Alice'})")
Durability
LatticeDB uses a write-ahead log (WAL) for crash recovery:
- Changes are written to the WAL before being applied to data pages
- Commit means the WAL record is on disk (a fast sequential write)
- Data pages are written lazily during checkpointing
- On crash, the WAL is replayed to recover committed transactions
This means committed data survives process crashes and power failures.
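The recovery step can be illustrated with a toy log: only transactions whose commit record made it to disk before the crash are replayed (a sketch of the idea, not LatticeDB's on-disk format):

```python
import json

def replay_wal(wal_lines: list[str]) -> dict:
    """Rebuild state from a toy WAL: apply 'set' records only from
    transactions that reached a 'commit' record before the crash."""
    state, pending = {}, {}
    for line in wal_lines:
        rec = json.loads(line)
        if rec["op"] == "set":
            # Buffer changes until the transaction commits
            pending.setdefault(rec["txn"], []).append((rec["key"], rec["value"]))
        elif rec["op"] == "commit":
            for key, value in pending.pop(rec["txn"], []):
                state[key] = value
    return state
```

A transaction that wrote records but never committed contributes nothing to the recovered state, which is exactly the durability guarantee above.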
Concurrency
- Readers never block writers
- Writers never block readers
- Multiple readers can execute simultaneously
- Write transactions are serialized at commit time
This makes LatticeDB well-suited for read-heavy workloads typical of RAG and search applications.
Performance Tuning
This guide covers the key parameters for tuning LatticeDB performance.
Cache Size
The cache_size_mb parameter controls how many database pages are cached in memory. Larger caches reduce disk I/O.
db = Database("mydb.db", cache_size_mb=200) # 200 MB cache
Guidelines:
- Default is 100 MB, which handles most workloads well
- For large databases (1M+ nodes), increase to 200-500 MB
- For memory-constrained environments, reduce to 50 MB or less
- The cache stores fixed-size pages (4 KB each), so 100 MB holds ~25,000 pages
Vector Search: ef_search
The ef_search parameter controls the accuracy/speed tradeoff for HNSW vector search.
| ef_search | Mean Latency (1M vectors) | Recall@10 |
|---|---|---|
| 16 | 506 us | 57% |
| 32 | 1.9 ms | 79% |
| 64 | 990 us | 100% |
| 128 | 3.2 ms | 100% |
| 256 | 11.6 ms | 100% |
Guidelines:
- Default is 64, which achieves 100% recall at 1M vectors
- For latency-sensitive applications, try 32 (79% recall)
- For maximum recall in critical applications, use 128
- Values above 128 rarely improve recall but increase latency
# Programmatic API
results = db.vector_search(query_vec, k=10, ef_search=128)
Batch Insert
When loading large amounts of data, use batch_insert instead of individual creates:
import numpy as np
with db.write() as txn:
    # Fast: ~248 inserts/sec at 1M scale
    vectors = np.random.rand(10000, 128).astype(np.float32)
    node_ids = txn.batch_insert("Document", vectors)
    txn.commit()
Batch insert is significantly faster than individual node creation because it amortizes HNSW index updates.
Query Plan Caching
Use parameterized queries to enable plan caching:
# Good: plan is cached and reused
for name in names:
    db.query("MATCH (n:Person) WHERE n.name = $name RETURN n", parameters={"name": name})

# Bad: new plan compiled for each query
for name in names:
    db.query(f"MATCH (n:Person) WHERE n.name = '{name}' RETURN n")
Monitor cache effectiveness:
stats = db.cache_stats()
print(f"Hit rate: {stats['hits'] / (stats['hits'] + stats['misses']):.1%}")
Indexing Strategy
Full-Text Search
Only index text that you need to search. Indexing unnecessary text wastes memory and slows inserts:
# Index only the searchable text field
txn.fts_index(chunk.id, chunk_text)
# Don't index metadata, IDs, etc.
Labels
Use specific labels for nodes you query frequently. Label scans are fast because they use a dedicated index:
// Fast: scans only Chunk nodes
MATCH (c:Chunk) WHERE c.embedding <=> $q < 0.5 RETURN c
// Slower: scans all nodes
MATCH (n) WHERE n.embedding <=> $q < 0.5 RETURN n
Transaction Scope
Keep write transactions short. Long-running write transactions hold the write lock and block other writes:
# Good: small, focused transactions
for batch in chunks(data, 1000):
    with db.write() as txn:
        for item in batch:
            txn.create_node(labels=["Item"], properties=item)
        txn.commit()

# Bad: one giant transaction
with db.write() as txn:
    for item in all_data:  # millions of items
        txn.create_node(labels=["Item"], properties=item)
    txn.commit()
Memory Usage
Vector storage dominates memory at scale:
| Scale | Memory |
|---|---|
| 1,000 vectors (128d) | 1 MB |
| 10,000 vectors | 10 MB |
| 100,000 vectors | 101 MB |
| 1,000,000 vectors | 1,040 MB |
Plan your vector_dimensions and scale accordingly. Lower dimensions use less memory but capture less semantic information.
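For planning, raw vector storage is roughly n x dimensions x 4 bytes (float32), and the table above suggests the HNSW graph roughly doubles that at scale. A back-of-envelope estimator (the 2x overhead factor is an inference from the table, not a documented constant):

```python
def estimate_vector_memory_mb(n_vectors: int, dimensions: int, overhead: float = 2.0) -> float:
    """Rough memory estimate: float32 vector data times an assumed
    HNSW graph overhead factor, in MB."""
    raw_bytes = n_vectors * dimensions * 4  # 4 bytes per float32 component
    return raw_bytes * overhead / (1024 * 1024)
```

For 1M vectors at 128 dimensions this lands near the ~1 GB shown in the table; treat it as a sizing aid, not a guarantee.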
Cypher Overview
LatticeDB uses the Cypher query language for graph operations. Cypher is a declarative language where you describe patterns to match and operations to perform, rather than specifying how to execute them.
Syntax at a Glance
Nodes are written in parentheses, edges in square brackets:
// Match a pattern
MATCH (person:Person)-[:KNOWS]->(friend:Person)
WHERE person.name = "Alice"
RETURN friend.name
Supported Clauses
| Clause | Purpose |
|---|---|
| MATCH | Find patterns in the graph |
| WHERE | Filter results with conditions |
| RETURN | Specify output columns |
| CREATE | Create nodes and edges |
| SET | Update properties and labels |
| DELETE | Remove nodes and edges |
| REMOVE | Remove properties and labels |
| MERGE | Match or create a pattern |
| WITH | Chain query parts, pipe results |
| UNWIND | Expand lists into rows |
| ORDER BY | Sort results |
| LIMIT | Limit number of results |
| SKIP | Skip first N results |
LatticeDB Extensions
LatticeDB extends Cypher with two operators for search:
Vector Distance (<=>)
Find nodes with similar embeddings:
MATCH (chunk:Chunk)
WHERE chunk.embedding <=> $query_vector < 0.5
RETURN chunk.text
ORDER BY chunk.embedding <=> $query_vector
See Vector Search for details.
Full-Text Search (@@)
Search indexed text content:
MATCH (doc:Document)
WHERE doc.content @@ "neural networks"
RETURN doc.title
See Full-Text Search for details.
Expressions
Operators
| Category | Operators |
|---|---|
| Comparison | =, <>, <, <=, >, >= |
| Logical | AND, OR, NOT, XOR |
| Arithmetic | +, -, *, /, %, ^ |
| String | CONTAINS, STARTS WITH, ENDS WITH |
| Null | IS NULL, IS NOT NULL |
| Search | <=> (vector distance), @@ (full-text) |
Functions
| Function | Description |
|---|---|
| id(node) | Get node ID |
| coalesce(a, b, ...) | Return first non-null value |
| abs(x) | Absolute value |
| size(list) | List length |
| toInteger(x) | Convert to integer |
| toFloat(x) | Convert to float |
See Functions for details.
Aggregations
| Function | Description |
|---|---|
| count(x) | Count values |
| sum(x) | Sum values |
| avg(x) | Average values |
| min(x) | Minimum value |
| max(x) | Maximum value |
| collect(x) | Collect into list |
See Aggregations for details.
Parameters
Use $name syntax to pass values safely:
MATCH (n:Person) WHERE n.name = $name RETURN n
See Parameters for language-specific binding.
MATCH and Patterns
MATCH finds patterns in the graph. It is the primary way to read data.
Node Patterns
Match all nodes:
MATCH (n) RETURN n
Match nodes with a label:
MATCH (p:Person) RETURN p.name
Match nodes with inline property filtering:
MATCH (p:Person {name: "Alice"}) RETURN p
Variables
Parentheses bind a matched node to a variable:
MATCH (person:Person)
RETURN person.name, person.age
Variables are used in WHERE, RETURN, SET, and other clauses to reference matched elements.
Edge Patterns
Match edges between nodes:
-- Outgoing edge
MATCH (a:Person)-[:KNOWS]->(b:Person)
RETURN a.name, b.name
-- Incoming edge
MATCH (a:Person)<-[:KNOWS]-(b:Person)
RETURN a.name, b.name
-- Either direction
MATCH (a:Person)-[:KNOWS]-(b:Person)
RETURN a.name, b.name
Edge Variables
Bind edges to variables to access their properties:
MATCH (a)-[r:KNOWS]->(b)
RETURN a.name, r, b.name
Edge Type Filtering
Match specific edge types:
MATCH (a)-[:KNOWS]->(b) RETURN b
MATCH (a)-[:AUTHORED_BY]->(b) RETURN b
Multi-Hop Patterns
Chain patterns to traverse multiple hops:
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
RETURN chunk.text, doc.title, author.name
Variable-Length Paths
Match paths of varying length:
-- 1 to 3 hops
MATCH (a)-[*1..3]->(b) RETURN b
-- Exactly 2 hops
MATCH (a)-[*2]->(b) RETURN b
-- 2 or more hops
MATCH (a)-[*2..]->(b) RETURN b
-- Any number of hops
MATCH (a)-[*]->(b) RETURN b
See Variable-Length Paths for details.
Multiple MATCH Clauses
Use multiple MATCH clauses to combine patterns:
MATCH (a:Person {name: "Alice"})
MATCH (b:Person {name: "Bob"})
RETURN a, b
WHERE and Filtering
WHERE filters matched patterns based on conditions.
Comparison Operators
MATCH (n:Person) WHERE n.age > 30 RETURN n.name
MATCH (n:Person) WHERE n.age >= 30 RETURN n.name
MATCH (n:Person) WHERE n.age < 30 RETURN n.name
MATCH (n:Person) WHERE n.age <= 30 RETURN n.name
MATCH (n:Person) WHERE n.name = "Alice" RETURN n
MATCH (n:Person) WHERE n.name <> "Alice" RETURN n
Boolean Logic
Combine conditions with AND, OR, NOT, and XOR:
MATCH (n:Person)
WHERE n.age > 25 AND n.field = "ML"
RETURN n.name
MATCH (n:Person)
WHERE n.field = "ML" OR n.field = "AI"
RETURN n.name
MATCH (n:Person)
WHERE NOT n.field = "Systems"
RETURN n.name
Null Checks
MATCH (n:Person) WHERE n.email IS NULL RETURN n.name
MATCH (n:Person) WHERE n.email IS NOT NULL RETURN n.name
String Predicates
MATCH (n:Person) WHERE n.name CONTAINS "li" RETURN n
MATCH (n:Person) WHERE n.name STARTS WITH "A" RETURN n
MATCH (n:Person) WHERE n.name ENDS WITH "ce" RETURN n
Property Existence
Test whether a property exists by checking for null:
MATCH (n:Person) WHERE n.email IS NOT NULL RETURN n
Inline Property Matching
Properties in the MATCH pattern act as equality filters:
-- These are equivalent:
MATCH (n:Person {name: "Alice"}) RETURN n
MATCH (n:Person) WHERE n.name = "Alice" RETURN n
Vector Distance
Filter by vector similarity using <=>:
MATCH (c:Chunk)
WHERE c.embedding <=> $query < 0.5
RETURN c.text
See Vector Search.
Full-Text Search
Filter by text content using @@:
MATCH (d:Document)
WHERE d.content @@ "neural networks"
RETURN d.title
See Full-Text Search.
RETURN and Projections
RETURN specifies which values to include in the query output.
Returning Properties
MATCH (n:Person)
RETURN n.name, n.age
Returning Nodes
Return a node reference:
MATCH (n:Person)
RETURN n
Aliases
Use AS to rename output columns:
MATCH (n:Person)
RETURN n.name AS name, n.age AS age
Expressions
Return computed values:
MATCH (n:Person)
RETURN n.name, n.age + 1
Literals
Return literal values:
MATCH (n:Person)
RETURN n.name, "person" AS type
DISTINCT
Remove duplicate rows:
MATCH (a:Person)-[:KNOWS]->(b:Person)
RETURN DISTINCT b.name
Ordering
Use ORDER BY to sort results:
MATCH (n:Person)
RETURN n.name, n.age
ORDER BY n.age DESC
Sort by multiple columns:
MATCH (n:Person)
RETURN n.name, n.age
ORDER BY n.field, n.age DESC
Pagination
Use LIMIT and SKIP for pagination:
-- First 10 results
MATCH (n:Person) RETURN n.name LIMIT 10
-- Skip first 10, return next 10
MATCH (n:Person) RETURN n.name SKIP 10 LIMIT 10
Aggregations in RETURN
Aggregation functions can be used directly in RETURN:
MATCH (doc:Document)-[:AUTHORED_BY]->(p:Person)
RETURN p.name, count(doc) AS papers
ORDER BY papers DESC
See Aggregations for the full list.
CREATE, SET, DELETE
Mutation clauses modify the graph structure.
CREATE
Create a Node
CREATE (n:Person {name: "Alice", age: 30})
Create a node with multiple labels:
CREATE (n:Person:Employee {name: "Alice"})
Create an Edge
Create an edge between existing nodes:
MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
CREATE (a)-[:KNOWS]->(b)
Create Nodes and Edges Together
CREATE (a:Person {name: "Alice"})-[:KNOWS]->(b:Person {name: "Bob"})
SET
Set Properties
MATCH (n:Person {name: "Alice"})
SET n.age = 31
Set multiple properties:
MATCH (n:Person {name: "Alice"})
SET n.age = 31, n.city = "NYC"
Set Labels
Add labels to a node:
MATCH (n:Person {name: "Alice"})
SET n:Admin:Verified
Remove a Property via SET
Setting a property to NULL removes it:
MATCH (n:Person {name: "Alice"})
SET n.city = NULL
DELETE
Delete a Node
Delete a node (fails if the node has edges):
MATCH (n:Person {name: "Charlie"})
DELETE n
DETACH DELETE
Delete a node and all its edges:
MATCH (n:Person {name: "Charlie"})
DETACH DELETE n
REMOVE
Remove a Property
MATCH (n:Person {name: "Alice"})
REMOVE n.city
Remove a Label
MATCH (n:Person {name: "Alice"})
REMOVE n:Verified
MERGE
MERGE matches an existing pattern or creates it if it doesn't exist. It's an idempotent "get or create" operation.
Basic Usage
MERGE (n:Person {name: "Alice"})
RETURN n
If a Person node with name: "Alice" exists, it is returned. Otherwise, one is created.
MERGE with Properties
MERGE (n:Person {name: "Alice"})
SET n.last_seen = "2024-01-15"
RETURN n
The SET clause runs whether the node was matched or created.
MERGE Edges
MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
MERGE (a)-[:KNOWS]->(b)
This creates the KNOWS edge only if it doesn't already exist.
Use Cases
MERGE is useful when:
- Importing data that may have duplicates
- Ensuring a relationship exists without creating duplicates
- Building graphs incrementally from multiple data sources
WITH and Chaining
WITH pipes the output of one query part into the next. It acts as a boundary between query segments, allowing you to transform, filter, or aggregate intermediate results.
Basic Chaining
MATCH (p:Person)
WITH p.name AS name, p.age AS age
WHERE age > 30
RETURN name
Aggregation then Filtering
WITH is essential for filtering on aggregated values, since WHERE cannot reference aggregation results directly:
MATCH (doc:Document)-[:AUTHORED_BY]->(p:Person)
WITH p.name AS author, count(doc) AS papers
WHERE papers > 5
RETURN author, papers
ORDER BY papers DESC
Variable Scoping
Variables from before WITH are only available after it if they are explicitly passed through:
MATCH (a:Person)-[:KNOWS]->(b:Person)
WITH a, count(b) AS friend_count
RETURN a.name, friend_count
In this example, b is not available after the WITH clause — only a and friend_count are.
Multi-Stage Queries
Chain multiple WITH clauses for complex queries:
MATCH (p:Person)-[:AUTHORED_BY]-(doc:Document)
WITH p, count(doc) AS docs
WITH p, docs WHERE docs > 2
RETURN p.name, docs
UNWIND
UNWIND expands a list into individual rows. Each element of the list becomes a separate row in the output.
Basic Usage
UNWIND [1, 2, 3] AS x
RETURN x
Returns three rows: 1, 2, 3.
Creating Multiple Nodes
Use UNWIND with CREATE to create nodes from a list:
UNWIND ["Alice", "Bob", "Charlie"] AS name
CREATE (n:Person {name: name})
With Parameters
Pass a list as a parameter and unwind it:
UNWIND $names AS name
MATCH (n:Person {name: name})
RETURN n
Combining with MATCH
UNWIND $tags AS tag
MATCH (d:Document)
WHERE d.title CONTAINS tag
RETURN tag, d.title
Aggregations
Aggregation functions compute summary values across groups of rows.
Functions
count
Count the number of values:
MATCH (n:Person) RETURN count(n) AS total
Count with grouping:
MATCH (doc:Document)-[:AUTHORED_BY]->(p:Person)
RETURN p.name, count(doc) AS papers
sum
Sum numeric values:
MATCH (o:Order)
RETURN sum(o.amount) AS total_revenue
avg
Average numeric values:
MATCH (n:Person)
RETURN avg(n.age) AS average_age
min / max
Find minimum or maximum values:
MATCH (n:Person)
RETURN min(n.age) AS youngest, max(n.age) AS oldest
collect
Collect values into a list:
MATCH (p:Person)-[:KNOWS]->(f:Person)
RETURN p.name, collect(f.name) AS friends
Grouping
Non-aggregated columns in RETURN define the grouping:
MATCH (doc:Document)-[:AUTHORED_BY]->(p:Person)
RETURN p.name, count(doc) AS papers, collect(doc.title) AS titles
ORDER BY papers DESC
Here p.name is the grouping key, and count(doc) and collect(doc.title) are computed per group.
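The implicit grouping can be mimicked in plain Python to make the semantics concrete (the rows and values here are hypothetical, not a LatticeDB API):

```python
from collections import defaultdict

# Hypothetical matched rows: one per (doc, author) pattern match
rows = [
    {"author": "Alice", "title": "Attention"},
    {"author": "Alice", "title": "Scaling Laws"},
    {"author": "Bob", "title": "Graph Basics"},
]

# The non-aggregated column (author) becomes the grouping key;
# count() and collect() run once per group.
groups = defaultdict(list)
for row in rows:
    groups[row["author"]].append(row["title"])

result = [
    {"name": name, "papers": len(titles), "titles": titles}
    for name, titles in groups.items()
]
```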
Filtering Aggregated Results
Use WITH to filter on aggregated values:
MATCH (p:Person)-[:KNOWS]->(f:Person)
WITH p, count(f) AS friend_count
WHERE friend_count > 5
RETURN p.name, friend_count
Variable-Length Paths
Variable-length path patterns match paths of varying depth in a single query.
Syntax
MATCH (a)-[*min..max]->(b)
- `*1..3` — match paths of length 1 to 3
- `*2` — match paths of exactly length 2
- `*2..` — match paths of length 2 or more
- `*` — match paths of any length (at least 1)
Examples
Fixed Range
-- Friends of friends (exactly 2 hops)
MATCH (a:Person {name: "Alice"})-[:KNOWS*2]->(b:Person)
RETURN b.name
Bounded Range
-- Reachable within 1 to 3 hops
MATCH (a:Person {name: "Alice"})-[:KNOWS*1..3]->(b:Person)
RETURN DISTINCT b.name
Open-Ended
-- All reachable nodes (2+ hops)
MATCH (a:Person {name: "Alice"})-[*2..]->(b)
RETURN b
Any Length
-- All nodes reachable at any depth
MATCH (a:Person {name: "Alice"})-[*]->(b)
RETURN DISTINCT b
Performance
Variable-length paths are implemented using BFS with bitset-based visited tracking. Performance characteristics:
- Short paths (1-3 hops): microseconds
- Deep traversals scale linearly with the number of reachable nodes
- `DISTINCT` is recommended to avoid duplicate results from multiple paths to the same node
At 10K nodes with 50K edges, a variable path *1..5 completes in ~82 us. See Benchmarks for detailed numbers.
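Conceptually, the traversal behaves like a breadth-first search over an adjacency map with a visited set. This plain-Python sketch computes the distinct set of nodes reachable within a hop range; it illustrates the semantics, not the engine's implementation:

```python
from collections import deque

def reachable(adj, start, min_hops=1, max_hops=3):
    """Distinct nodes reachable from `start` within [min_hops, max_hops] hops."""
    results = set()
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # don't expand beyond the hop limit
        for neighbor in adj.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                if depth + 1 >= min_hops:
                    results.add(neighbor)
                queue.append((neighbor, depth + 1))
    return results
```

Because of the visited set, each node is counted once at its shortest depth, which is why results here are inherently distinct.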
Vector Search (<=>)
The <=> operator computes the distance between a stored vector embedding and a query vector. It integrates HNSW vector search into Cypher queries.
Basic Usage
MATCH (chunk:Chunk)
WHERE chunk.embedding <=> $query < 0.5
RETURN chunk.text
ORDER BY chunk.embedding <=> $query
LIMIT 10
This finds chunks whose embedding is within distance 0.5 of the query vector, sorted by proximity.
How It Works
When the query planner encounters <=> in a WHERE clause, it converts the pattern into a specialized HNSW search operator rather than scanning all nodes. The distance threshold (e.g., < 0.5) filters results after the search.
Distance Metric
LatticeDB uses cosine distance (1 - cosine_similarity). Values range from 0 (identical) to 2 (opposite).
| Distance | Meaning |
|---|---|
| 0.0 | Identical vectors |
| 0.1-0.3 | Very similar |
| 0.3-0.5 | Moderately similar |
| 0.5-1.0 | Dissimilar |
| 1.0-2.0 | Very dissimilar to opposite |
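The metric is simple to verify by hand. A plain-Python reference (not how the engine computes distances internally):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```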
Ordering by Distance
Use ORDER BY to sort results by similarity:
MATCH (n:Document)
WHERE n.embedding <=> $query < 1.0
RETURN n.title
ORDER BY n.embedding <=> $query
LIMIT 10
Combining with Graph Traversal
The real power is combining vector search with graph patterns:
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
WHERE chunk.embedding <=> $query < 0.5
RETURN doc.title, chunk.text, author.name
ORDER BY chunk.embedding <=> $query
LIMIT 10
This finds similar chunks, then traverses to their documents and authors — all in one query.
Combining with Full-Text Search
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)
WHERE chunk.embedding <=> $query < 0.5
AND doc.content @@ "neural networks"
RETURN doc.title, chunk.text
ORDER BY chunk.embedding <=> $query
Parameter Binding
Pass the query vector as a parameter:
Python:
results = db.query(
    "MATCH (n) WHERE n.embedding <=> $q < 0.5 RETURN n",
    parameters={"q": query_vector}
)
TypeScript:
const results = await db.query(
  "MATCH (n) WHERE n.embedding <=> $q < 0.5 RETURN n",
  { q: new Float32Array([0.1, 0.2, ...]) }
);
Full-Text Search (@@)
The @@ operator performs BM25-scored full-text search on indexed text content.
Basic Usage
MATCH (d:Document)
WHERE d.content @@ "neural networks"
RETURN d.title
This searches all nodes that have had their text indexed with fts_index() and returns those matching the search terms.
How It Works
When the query planner encounters @@, it uses the inverted index to find matching nodes directly — no full scan needed. Results are scored using BM25, which considers term frequency, inverse document frequency, and document length.
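The scoring can be sketched in a few lines using the standard BM25 formula with the common defaults k1=1.2 and b=0.75; LatticeDB's exact constants are internal and may differ:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score a single query term for one document (standard BM25 formula)."""
    # Inverse document frequency: rarer terms weigh more
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency, saturated by k1 and normalized by document length via b
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part
```

A document's total score is the sum of its per-term scores across the query terms.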
String Queries
The search query is a space-separated list of terms. All terms must match (implicit AND):
-- Both "neural" and "networks" must appear
MATCH (n) WHERE n.text @@ "neural networks" RETURN n
Using Parameters
MATCH (d:Document)
WHERE d.content @@ $search_text
RETURN d.title
Python:
results = db.query(
    'MATCH (d:Document) WHERE d.content @@ $q RETURN d.title',
    parameters={"q": "machine learning"}
)
TypeScript:
const results = await db.query(
  'MATCH (d:Document) WHERE d.content @@ $q RETURN d.title',
  { q: "machine learning" }
);
Combining with Graph Traversal
MATCH (chunk:Chunk)-[:PART_OF]->(doc:Document)-[:AUTHORED_BY]->(author:Person)
WHERE chunk.text @@ "transformer attention"
RETURN doc.title, author.name
Combining with Vector Search
MATCH (chunk:Chunk)
WHERE chunk.embedding <=> $query < 0.5
AND chunk.text @@ "transformer"
RETURN chunk.text
ORDER BY chunk.embedding <=> $query
Indexing Text
Text must be indexed before it can be searched. Use fts_index() within a write transaction:
Python:
with db.write() as txn:
    txn.fts_index(node.id, "The transformer architecture uses self-attention mechanisms")
    txn.commit()
TypeScript:
await db.write(async (txn) => {
  await txn.ftsIndex(node.id, "The transformer architecture uses self-attention mechanisms");
});
Programmatic Search
In addition to the @@ operator in Cypher, you can use the dedicated search APIs:
Python:
# Exact search
results = db.fts_search("machine learning", limit=10)
# Fuzzy search (typo-tolerant)
results = db.fts_search_fuzzy("machin lerning", limit=10)
TypeScript:
// Exact search
const results = await db.ftsSearch("machine learning", { limit: 10 });
// Fuzzy search (typo-tolerant)
const results = await db.ftsSearchFuzzy("machin lerning", { limit: 10 });
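Fuzzy matching bounds the Levenshtein edit distance between query terms and indexed terms. A reference implementation of the distance that `max_distance` limits:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute)
    needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca with cb
            ))
        prev = curr
    return prev[-1]
```

With the default `max_distance=2`, "machin" matches "machine" (distance 1) but a term three edits away would not.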
Parameters
Parameters allow you to pass values into queries safely using the $name syntax. This prevents injection attacks and enables query plan caching.
Syntax
Use $ followed by the parameter name in the query:
MATCH (n:Person) WHERE n.name = $name RETURN n
MATCH (n:Person) WHERE n.age > $min_age RETURN n
Python
# String parameter
result = db.query(
    "MATCH (n:Person) WHERE n.name = $name RETURN n",
    parameters={"name": "Alice"},
)

# Numeric parameter
result = db.query(
    "MATCH (n:Person) WHERE n.age > $min_age RETURN n",
    parameters={"min_age": 25},
)

# Vector parameter
from latticedb import hash_embed
result = db.query(
    "MATCH (n) WHERE n.embedding <=> $vec < 0.5 RETURN n",
    parameters={"vec": hash_embed("query text", dimensions=128)},
)
TypeScript
// String parameter
const result = await db.query(
  "MATCH (n:Person) WHERE n.name = $name RETURN n",
  { name: "Alice" }
);

// Numeric parameter
const result = await db.query(
  "MATCH (n:Person) WHERE n.age > $min_age RETURN n",
  { min_age: 25 }
);

// Vector parameter
import { hashEmbed } from "@hajewski/latticedb";
const result = await db.query(
  "MATCH (n) WHERE n.embedding <=> $vec < 0.5 RETURN n",
  { vec: hashEmbed("query text", 128) }
);
C
lattice_query* query;
lattice_query_prepare(db, "MATCH (n) WHERE n.name = $name RETURN n", &query);
// Bind string parameter
lattice_value val = {
    .type = LATTICE_VALUE_STRING,
    .data.string_val = { "Alice", 5 }
};
lattice_query_bind(query, "name", &val);
// Bind vector parameter
float vec[128] = { /* ... */ };
lattice_query_bind_vector(query, "embedding", vec, 128);
Supported Parameter Types
| Type | Python | TypeScript | C |
|---|---|---|---|
| String | str | string | LATTICE_VALUE_STRING |
| Integer | int | number | LATTICE_VALUE_INT |
| Float | float | number | LATTICE_VALUE_FLOAT |
| Boolean | bool | boolean | LATTICE_VALUE_BOOL |
| Null | None | null | LATTICE_VALUE_NULL |
| Vector | numpy.ndarray | Float32Array | float* (via lattice_query_bind_vector) |
Query Caching
Parameterized queries are cached by their query text. The same query with different parameter values reuses the cached plan, improving performance for repeated queries.
# These share the same cached plan:
db.query("MATCH (n) WHERE n.name = $name RETURN n", parameters={"name": "Alice"})
db.query("MATCH (n) WHERE n.name = $name RETURN n", parameters={"name": "Bob"})
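The caching behavior can be pictured as a dictionary keyed by the exact query string (a conceptual sketch, not the engine's cache):

```python
class PlanCache:
    """Toy model: plans are keyed by exact query text, so parameter
    values never affect the cache key."""
    def __init__(self):
        self.plans = {}
        self.hits = 0
        self.misses = 0

    def get_plan(self, query_text):
        if query_text in self.plans:
            self.hits += 1
        else:
            self.misses += 1
            self.plans[query_text] = ("compiled-plan-for", query_text)
        return self.plans[query_text]
```

This is also why string-interpolated queries defeat the cache: every distinct value produces a distinct query text and therefore a new plan.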
Functions
LatticeDB supports the following built-in functions in Cypher expressions.
id()
Returns the internal ID of a node:
MATCH (n:Person)
RETURN id(n), n.name
coalesce()
Returns the first non-null argument:
MATCH (n:Person)
RETURN coalesce(n.nickname, n.name) AS display_name
abs()
Returns the absolute value of a number:
MATCH (n:Person)
RETURN n.name, abs(n.score) AS abs_score
size()
Returns the length of a list:
MATCH (p:Person)-[:KNOWS]->(f:Person)
WITH p, collect(f) AS friends
RETURN p.name, size(friends) AS friend_count
toInteger()
Converts a value to an integer:
MATCH (n:Person)
RETURN n.name, toInteger(n.score) AS int_score
toFloat()
Converts a value to a float:
MATCH (n:Person)
RETURN n.name, toFloat(n.age) AS float_age
Type Coercion
Numeric operations automatically promote integers to floats when mixed:
RETURN 42 + 3.14 -- Returns 45.14 (float)
Null propagates through most operations:
RETURN null + 1 -- Returns null
RETURN null = null -- Returns true (special case)
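In the Python binding, Cypher null maps to `None`. The propagation rule for arithmetic can be mirrored with a small helper (illustrative only, not part of the API):

```python
def null_safe_add(a, b):
    """Mimic Cypher arithmetic: any null (None) operand yields null."""
    if a is None or b is None:
        return None
    return a + b
```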
C API
The C API is LatticeDB's primary interface. All language bindings (Python, TypeScript) wrap this API. The header file is include/lattice.h.
Overview
The API uses opaque handle types and follows a consistent pattern:
- Functions return `lattice_error` (0 = success, negative = error)
- Resources are allocated by the library and freed by the caller
- Strings and result sets must be explicitly freed
Types
Handles
typedef struct lattice_database lattice_database;
typedef struct lattice_txn lattice_txn;
typedef struct lattice_query lattice_query;
typedef struct lattice_result lattice_result;
typedef struct lattice_vector_result lattice_vector_result;
typedef struct lattice_fts_result lattice_fts_result;
typedef struct lattice_edge_result lattice_edge_result;
IDs
typedef uint64_t lattice_node_id;
typedef uint64_t lattice_edge_id;
Error Codes
LATTICE_OK // 0 - Success
LATTICE_ERROR // -1 - Generic error
LATTICE_ERROR_IO // -2 - I/O error
LATTICE_ERROR_CORRUPTION // -3 - Data corruption detected
LATTICE_ERROR_NOT_FOUND // -4 - Resource not found
LATTICE_ERROR_ALREADY_EXISTS // -5 - Resource already exists
LATTICE_ERROR_INVALID_ARG // -6 - Invalid argument
LATTICE_ERROR_TXN_ABORTED // -7 - Transaction aborted
LATTICE_ERROR_LOCK_TIMEOUT // -8 - Lock timeout
LATTICE_ERROR_READ_ONLY // -9 - Write attempted on read-only txn
LATTICE_ERROR_FULL // -10 - Database full
LATTICE_ERROR_VERSION_MISMATCH // -11 - Version mismatch
LATTICE_ERROR_CHECKSUM // -12 - Checksum verification failed
LATTICE_ERROR_OUT_OF_MEMORY // -13 - Out of memory
Value Types
typedef enum {
    LATTICE_VALUE_NULL = 0,
    LATTICE_VALUE_BOOL = 1,
    LATTICE_VALUE_INT = 2,
    LATTICE_VALUE_FLOAT = 3,
    LATTICE_VALUE_STRING = 4,
    LATTICE_VALUE_BYTES = 5,
    LATTICE_VALUE_VECTOR = 6,
    LATTICE_VALUE_LIST = 7,
    LATTICE_VALUE_MAP = 8
} lattice_value_type;
Property Value
typedef struct {
    lattice_value_type type;
    union {
        bool bool_val;
        int64_t int_val;
        double float_val;
        struct { const char* ptr; size_t len; } string_val;
        struct { const uint8_t* ptr; size_t len; } bytes_val;
        struct { const float* ptr; uint32_t dimensions; } vector_val;
    } data;
} lattice_value;
Database Operations
Open
lattice_open_options opts = LATTICE_OPEN_OPTIONS_DEFAULT;
opts.create = true;
opts.enable_vector = true;
opts.vector_dimensions = 128;
lattice_database* db;
lattice_error err = lattice_open("mydb.ltdb", &opts, &db);
Close
lattice_close(db);
Open Options
typedef struct {
    bool create;                // Create if not exists
    bool read_only;             // Open in read-only mode
    uint32_t cache_size_mb;     // Cache size in MB (default: 100)
    uint32_t page_size;         // Page size in bytes (default: 4096)
    bool enable_vector;         // Enable vector storage
    uint16_t vector_dimensions; // Vector dimensions (default: 128)
} lattice_open_options;
Transaction Operations
// Begin a transaction
lattice_txn* txn;
lattice_begin(db, LATTICE_TXN_READ_WRITE, &txn);
// ... do work ...
// Commit or rollback
lattice_commit(txn);
// or: lattice_rollback(txn);
Transaction modes:
- `LATTICE_TXN_READ_ONLY` — read-only, can run concurrently
- `LATTICE_TXN_READ_WRITE` — read-write, serialized
Node Operations
Create a Node
lattice_node_id node_id;
lattice_node_create(txn, "Person", &node_id);
Set / Get Properties
// Set a string property
lattice_value val = {
    .type = LATTICE_VALUE_STRING,
    .data.string_val = { "Alice", 5 }
};
lattice_node_set_property(txn, node_id, "name", &val);
// Get a property
lattice_value out;
lattice_node_get_property(txn, node_id, "name", &out);
Check Existence
bool exists;
lattice_node_exists(txn, node_id, &exists);
Delete a Node
lattice_node_delete(txn, node_id);
Get Labels
char* labels;
lattice_node_get_labels(txn, node_id, &labels);
// labels is a comma-separated string, e.g. "Person,Employee"
// ...
lattice_free_string(labels);
Set a Vector
float vector[128] = { /* ... */ };
lattice_node_set_vector(txn, node_id, "embedding", vector, 128);
Edge Operations
Create / Delete
lattice_edge_id edge_id;
lattice_edge_create(txn, source_id, target_id, "KNOWS", &edge_id);
lattice_edge_delete(txn, source_id, target_id, "KNOWS");
Traverse Edges
lattice_edge_result* edges;
lattice_edge_get_outgoing(txn, node_id, &edges);
uint32_t count = lattice_edge_result_count(edges);
for (uint32_t i = 0; i < count; i++) {
    lattice_node_id source, target;
    const char* type;
    uint32_t type_len;
    lattice_edge_result_get(edges, i, &source, &target, &type, &type_len);
}
lattice_edge_result_free(edges);
Batch Insert
Insert many nodes with vectors in a single call:
lattice_node_with_vector nodes[1000];
for (int i = 0; i < 1000; i++) {
    nodes[i].label = "Document";
    nodes[i].vector = vectors[i]; // float[128]
    nodes[i].dimensions = 128;
}
lattice_node_id ids[1000];
uint32_t count;
lattice_batch_insert(txn, nodes, 1000, ids, &count);
Vector Search
float query[128] = { /* ... */ };
lattice_vector_result* results;
lattice_vector_search(db, query, 128, 10, 64, &results);
uint32_t count = lattice_vector_result_count(results);
for (uint32_t i = 0; i < count; i++) {
    lattice_node_id node_id;
    float distance;
    lattice_vector_result_get(results, i, &node_id, &distance);
    printf("Node %llu: distance=%.4f\n", node_id, distance);
}
lattice_vector_result_free(results);
Parameters:
- `k` — number of nearest neighbors to return
- `ef_search` — HNSW search parameter (0 = default 64). Higher values improve recall at the cost of latency.
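A brute-force scan is the exact baseline that HNSW approximates; raising `ef_search` moves results closer to what this plain-Python reference would return:

```python
import math

def brute_force_knn(query, vectors, k):
    """Exact k-NN by cosine distance over (node_id, vector) pairs;
    the 100%-recall baseline that HNSW approximates."""
    def dist(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)
    # Sort every stored vector by distance to the query, keep the k closest
    scored = sorted(((dist(query, vec), node_id) for node_id, vec in vectors))
    return [(node_id, d) for d, node_id in scored[:k]]
```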
Full-Text Search
Index a Document
const char* text = "The quick brown fox jumps over the lazy dog";
lattice_fts_index(txn, node_id, text, strlen(text));
Search
lattice_fts_result* results;
lattice_fts_search(db, "quick fox", 9, 10, &results);
uint32_t count = lattice_fts_result_count(results);
for (uint32_t i = 0; i < count; i++) {
    lattice_node_id node_id;
    float score;
    lattice_fts_result_get(results, i, &node_id, &score);
    printf("Node %llu: score=%.4f\n", node_id, score);
}
lattice_fts_result_free(results);
Fuzzy Search
lattice_fts_result* results;
lattice_fts_search_fuzzy(db, "quik fox", 8, 10, 2, 4, &results);
// max_distance=2, min_term_length=4
Embeddings
Hash Embeddings (Built-in)
float* vector;
uint32_t dims;
lattice_hash_embed("hello world", 11, 128, &vector, &dims);
// Use vector...
lattice_hash_embed_free(vector, dims);
HTTP Embedding Client
lattice_embedding_config config = {
    .endpoint = "http://localhost:11434",
    .model = NULL,       // use default
    .api_format = LATTICE_EMBEDDING_OLLAMA,
    .api_key = NULL,
    .timeout_ms = 0      // default 30s
};
lattice_embedding_client* client;
lattice_embedding_client_create(&config, &client);
float* vector;
uint32_t dims;
lattice_embedding_client_embed(client, "hello world", 11, &vector, &dims);
// Use vector...
lattice_hash_embed_free(vector, dims);
lattice_embedding_client_free(client);
Query Operations
Queries use a prepare/bind/execute pattern:
// 1. Prepare
lattice_query* query;
lattice_query_prepare(db, "MATCH (n) WHERE n.name = $name RETURN n", &query);
// 2. Bind parameters
lattice_value val = {
    .type = LATTICE_VALUE_STRING,
    .data.string_val = { "Alice", 5 }
};
lattice_query_bind(query, "name", &val);
// For vector parameters:
float vec[128] = { /* ... */ };
lattice_query_bind_vector(query, "embedding", vec, 128);
// 3. Execute
lattice_txn* txn;
lattice_begin(db, LATTICE_TXN_READ_ONLY, &txn);
lattice_result* result;
lattice_query_execute(query, txn, &result);
// 4. Iterate results
while (lattice_result_next(result)) {
    uint32_t cols = lattice_result_column_count(result);
    for (uint32_t i = 0; i < cols; i++) {
        const char* name = lattice_result_column_name(result, i);
        lattice_value val;
        lattice_result_get(result, i, &val);
        // Process val...
    }
}
// 5. Cleanup
lattice_result_free(result);
lattice_commit(txn);
lattice_query_free(query);
Query Cache
// Clear cache
lattice_query_cache_clear(db);
// Get statistics
uint32_t entries;
uint64_t hits, misses;
lattice_query_cache_stats(db, &entries, &hits, &misses);
Utilities
// Get version string
const char* version = lattice_version(); // e.g. "0.1.0"
// Get error message
const char* msg = lattice_error_message(LATTICE_ERROR_NOT_FOUND);
Python API
Python bindings for LatticeDB, an embedded knowledge graph database for AI/RAG applications.
Installation
pip install latticedb
The native shared library (liblattice.dylib / liblattice.so) must be available on the system. Install it via the install script or build from source with zig build shared.
Quick Start
import numpy as np
from latticedb import Database

with Database("knowledge.db", create=True, enable_vector=True, vector_dimensions=4) as db:
    # Create nodes, edges, and index content
    with db.write() as txn:
        alice = txn.create_node(
            labels=["Person"],
            properties={"name": "Alice", "age": 30},
        )
        bob = txn.create_node(
            labels=["Person"],
            properties={"name": "Bob", "age": 25},
        )
        txn.create_edge(alice.id, bob.id, "KNOWS")

        # Index text for full-text search
        txn.fts_index(alice.id, "Alice works on machine learning research")
        txn.fts_index(bob.id, "Bob studies deep learning and neural networks")

        # Store vector embeddings
        txn.set_vector(alice.id, "embedding", np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32))
        txn.set_vector(bob.id, "embedding", np.array([0.0, 1.0, 0.0, 0.0], dtype=np.float32))
        txn.commit()

    # Query with Cypher
    result = db.query("MATCH (n:Person) WHERE n.age > 20 RETURN n.name, n.age")
    for row in result:
        print(row)

    # Vector similarity search
    query_vec = np.array([0.9, 0.1, 0.0, 0.0], dtype=np.float32)
    for r in db.vector_search(query_vec, k=2):
        print(f"Node {r.node_id}: distance={r.distance:.4f}")

    # Full-text search
    for r in db.fts_search("machine learning"):
        print(f"Node {r.node_id}: score={r.score:.4f}")

    # Fuzzy search (typo-tolerant)
    for r in db.fts_search_fuzzy("machin lerning"):
        print(f"Node {r.node_id}: score={r.score:.4f}")
API Reference
Database
Database(
    path: str | Path,
    *,
    create: bool = False,         # Create if doesn't exist
    read_only: bool = False,      # Open in read-only mode
    cache_size_mb: int = 100,     # Page cache size
    enable_vector: bool = False,  # Enable vector storage
    vector_dimensions: int = 128, # Vector dimensions
)
Methods
- `open()` / `close()` - Open/close the database (also works as context manager)
- `read()` - Start a read-only transaction (context manager)
- `write()` - Start a read-write transaction (context manager)
- `query(cypher, parameters=None)` - Execute a Cypher query
- `vector_search(vector, k=10, ef_search=64)` - k-NN vector search
- `fts_search(query, limit=10)` - Full-text search
- `fts_search_fuzzy(query, limit=10, max_distance=0, min_term_length=0)` - Fuzzy full-text search
- `cache_clear()` - Clear the query cache
- `cache_stats()` - Get cache hit/miss statistics
Transaction
Read Operations
- `get_node(node_id)` - Get a node by ID, returns `Node` or `None`
- `node_exists(node_id)` - Check if a node exists
- `get_property(node_id, key)` - Get a property value
- `get_outgoing_edges(node_id)` - Get outgoing edges from a node
- `get_incoming_edges(node_id)` - Get incoming edges to a node
- `is_read_only` / `is_active` - Transaction state
Write Operations
- `create_node(labels=[], properties=None)` - Create a node
- `delete_node(node_id)` - Delete a node
- `set_property(node_id, key, value)` - Set a property on a node
- `set_vector(node_id, key, vector)` - Set a vector embedding
- `batch_insert(label, vectors)` - Batch insert nodes with vectors
- `fts_index(node_id, text)` - Index text for full-text search
- `create_edge(source_id, target_id, edge_type)` - Create an edge
- `delete_edge(source_id, target_id, edge_type)` - Delete an edge
- `commit()` / `rollback()` - Commit or rollback the transaction
Batch Insert
Insert many nodes with vectors in a single efficient call:
import numpy as np
from latticedb import Database

with Database("vectors.db", create=True, enable_vector=True, vector_dimensions=128) as db:
    with db.write() as txn:
        vectors = np.random.rand(1000, 128).astype(np.float32)
        node_ids = txn.batch_insert("Document", vectors)
        print(f"Created {len(node_ids)} nodes")
        txn.commit()
Full-Text Search
Exact Search
results = db.fts_search("machine learning", limit=10)
for r in results:
    print(f"Node {r.node_id}: score={r.score:.4f}")
Fuzzy Search (Typo-Tolerant)
# Finds "machine learning" even with typos
results = db.fts_search_fuzzy("machne lerning", limit=10)

# Control fuzzy matching sensitivity
results = db.fts_search_fuzzy(
    "machne",
    limit=10,
    max_distance=2,     # Max edit distance (default: 2)
    min_term_length=4,  # Min term length for fuzzy matching (default: 4)
)
Embeddings
LatticeDB includes a built-in hash embedding function and an HTTP client for external embedding services.
Hash Embeddings (Built-in)
Deterministic, no external service needed. Useful for testing or simple keyword-based similarity:
from latticedb import hash_embed
vec = hash_embed("hello world", dimensions=128)
print(vec.shape) # (128,)
HTTP Embedding Client
Connect to Ollama, OpenAI, or compatible APIs:
from latticedb import EmbeddingClient, EmbeddingApiFormat

# Ollama (default)
with EmbeddingClient("http://localhost:11434") as client:
    vec = client.embed("hello world")

# OpenAI-compatible API
with EmbeddingClient(
    "https://api.openai.com/v1",
    model="text-embedding-3-small",
    api_format=EmbeddingApiFormat.OPENAI,
    api_key="sk-...",
) as client:
    vec = client.embed("hello world")
Edge Traversal
with db.read() as txn:
    outgoing = txn.get_outgoing_edges(node_id)
    for edge in outgoing:
        print(f"{edge.source_id} --[{edge.edge_type}]--> {edge.target_id}")

    incoming = txn.get_incoming_edges(node_id)
    for edge in incoming:
        print(f"{edge.source_id} --[{edge.edge_type}]--> {edge.target_id}")
Cypher Queries
# Pattern matching
result = db.query("MATCH (n:Person) RETURN n.name")

# With parameters
result = db.query(
    "MATCH (n:Person) WHERE n.name = $name RETURN n",
    parameters={"name": "Alice"},
)

# Vector similarity in Cypher
result = db.query(
    "MATCH (n:Document) WHERE n.embedding <=> $vec < 0.5 RETURN n.title",
    parameters={"vec": query_vector},
)

# Full-text search in Cypher
result = db.query(
    'MATCH (n:Document) WHERE n.content @@ "machine learning" RETURN n.title'
)
# Data mutation
db.query("CREATE (n:Person {name: 'Charlie', age: 35})")
db.query("MATCH (n:Person {name: 'Charlie'}) SET n.age = 36")
db.query("MATCH (n:Person {name: 'Charlie'}) DETACH DELETE n")
Query Cache
# Get cache statistics
stats = db.cache_stats()
print(f"Entries: {stats['entries']}, Hits: {stats['hits']}, Misses: {stats['misses']}")
# Clear the cache
db.cache_clear()
Supported Property Types
- `None` - Null value
- `bool` - Boolean
- `int` - 64-bit integer
- `float` - 64-bit float
- `str` - UTF-8 string
- `bytes` - Binary data
Error Handling
from latticedb import LatticeError, LatticeNotFoundError, LatticeIOError

try:
    with Database("nonexistent.db") as db:
        pass
except LatticeNotFoundError:
    print("Database not found")
except LatticeIOError:
    print("I/O error")
except LatticeError as e:
    print(f"Error: {e}")
Requirements
- Python 3.9+
- NumPy (for vector operations)
- The native LatticeDB library (`liblattice.dylib` / `liblattice.so`)
TypeScript / Node.js API
TypeScript/Node.js bindings for LatticeDB, an embedded knowledge graph database for AI/RAG applications.
Installation
npm install @hajewski/latticedb
The native shared library (liblattice.dylib / liblattice.so) must be available on the system. Install it via the install script or build from source with zig build shared.
Quick Start
import { Database } from "@hajewski/latticedb";
const db = new Database("knowledge.db", {
  create: true,
  enableVector: true,
  vectorDimensions: 4,
});
await db.open();
// Create nodes, edges, and index content
await db.write(async (txn) => {
  const alice = await txn.createNode({
    labels: ["Person"],
    properties: { name: "Alice", age: 30 },
  });
  const bob = await txn.createNode({
    labels: ["Person"],
    properties: { name: "Bob", age: 25 },
  });
  await txn.createEdge(alice.id, bob.id, "KNOWS");

  // Index text for full-text search
  await txn.ftsIndex(alice.id, "Alice works on machine learning research");
  await txn.ftsIndex(bob.id, "Bob studies deep learning and neural networks");

  // Store vector embeddings
  await txn.setVector(alice.id, "embedding", new Float32Array([1.0, 0.0, 0.0, 0.0]));
  await txn.setVector(bob.id, "embedding", new Float32Array([0.0, 1.0, 0.0, 0.0]));
});
// Query with Cypher
const result = await db.query(
  "MATCH (n:Person) WHERE n.age > 20 RETURN n.name, n.age"
);
for (const row of result.rows) {
  console.log(row);
}

// Vector similarity search
const results = await db.vectorSearch(new Float32Array([0.9, 0.1, 0.0, 0.0]), {
  k: 2,
});
for (const r of results) {
  console.log(`Node ${r.nodeId}: distance=${r.distance.toFixed(4)}`);
}

// Full-text search
const ftsResults = await db.ftsSearch("machine learning");
for (const r of ftsResults) {
  console.log(`Node ${r.nodeId}: score=${r.score.toFixed(4)}`);
}

// Fuzzy search (typo-tolerant)
const fuzzyResults = await db.ftsSearchFuzzy("machin lerning");
for (const r of fuzzyResults) {
  console.log(`Node ${r.nodeId}: score=${r.score.toFixed(4)}`);
}
await db.close();
API Reference
Database
const db = new Database(path: string, options?: DatabaseOptions);
interface DatabaseOptions {
  create?: boolean;          // Create if not exists (default: false)
  readOnly?: boolean;        // Open read-only (default: false)
  cacheSizeMb?: number;      // Cache size in MB (default: 100)
  enableVector?: boolean;    // Enable vector storage (default: false)
  vectorDimensions?: number; // Vector dimensions (default: 128)
}
Methods
- `await db.open()` - Open the database connection
- `await db.close()` - Close the database connection
- `await db.read(fn)` - Execute a read-only transaction
- `await db.write(fn)` - Execute a read-write transaction
- `await db.query(cypher, params?)` - Execute a Cypher query
- `await db.vectorSearch(vector, options?)` - k-NN vector search
- `await db.ftsSearch(query, options?)` - Full-text search
- `await db.ftsSearchFuzzy(query, options?)` - Fuzzy full-text search
- `await db.cacheClear()` - Clear the query cache
- `await db.cacheStats()` - Get cache hit/miss statistics
- `db.isOpen()` - Check if the database is open
- `db.getPath()` - Get the database file path
Transaction
Read Operations
- `await txn.getNode(nodeId)` - Get a node by ID, returns `Node` or `null`
- `await txn.nodeExists(nodeId)` - Check if a node exists
- `await txn.getProperty(nodeId, key)` - Get a property value
- `await txn.getOutgoingEdges(nodeId)` - Get outgoing edges from a node
- `await txn.getIncomingEdges(nodeId)` - Get incoming edges to a node
- `txn.isReadOnly()` / `txn.isActive()` - Transaction state
Write Operations
- `await txn.createNode({ labels, properties })` - Create a node
- `await txn.deleteNode(nodeId)` - Delete a node
- `await txn.setProperty(nodeId, key, value)` - Set a property
- `await txn.setVector(nodeId, key, vector)` - Set a vector embedding
- `await txn.batchInsert(label, vectors)` - Batch insert nodes with vectors
- `await txn.ftsIndex(nodeId, text)` - Index text for full-text search
- `await txn.createEdge(sourceId, targetId, edgeType)` - Create an edge
- `await txn.deleteEdge(sourceId, targetId, edgeType)` - Delete an edge
- `txn.commit()` / `txn.rollback()` - Commit or roll back
Batch Insert
Insert many nodes with vectors in a single efficient call:
import { Database } from "@hajewski/latticedb";
const db = new Database("vectors.db", {
  create: true,
  enableVector: true,
  vectorDimensions: 128,
});
await db.open();

await db.write(async (txn) => {
  const vectors = Array.from({ length: 1000 }, () =>
    Float32Array.from({ length: 128 }, () => Math.random())
  );
  const nodeIds = await txn.batchInsert("Document", vectors);
  console.log(`Created ${nodeIds.length} nodes`);
});

await db.close();
Full-Text Search
Exact Search
const results = await db.ftsSearch("machine learning", { limit: 10 });
for (const r of results) {
  console.log(`Node ${r.nodeId}: score=${r.score.toFixed(4)}`);
}
Fuzzy Search (Typo-Tolerant)
// Finds "machine learning" even with typos
const results = await db.ftsSearchFuzzy("machne lerning", { limit: 10 });

// Control fuzzy matching sensitivity
const precise = await db.ftsSearchFuzzy("machne", {
  limit: 10,
  maxDistance: 2,   // Max edit distance (default: 0 = auto)
  minTermLength: 4, // Min term length for fuzzy matching (default: 0 = auto)
});
Embeddings
LatticeDB includes a built-in hash embedding function and an HTTP client for external embedding services.
Hash Embeddings (Built-in)
Deterministic, no external service needed. Useful for testing or simple keyword-based similarity:
import { hashEmbed } from "@hajewski/latticedb";
const vec = hashEmbed("hello world", 128);
console.log(vec.length); // 128
HTTP Embedding Client
Connect to Ollama, OpenAI, or compatible APIs:
import { EmbeddingClient, EmbeddingApiFormat } from "@hajewski/latticedb";
// Ollama (default)
const client = new EmbeddingClient({
  endpoint: "http://localhost:11434",
});
const vec = client.embed("hello world");
client.close();

// OpenAI-compatible API
const openaiClient = new EmbeddingClient({
  endpoint: "https://api.openai.com/v1",
  model: "text-embedding-3-small",
  apiFormat: EmbeddingApiFormat.OpenAI,
  apiKey: "sk-...",
});
const embedding = openaiClient.embed("hello world");
openaiClient.close();
Edge Traversal
await db.read(async (txn) => {
  const outgoing = await txn.getOutgoingEdges(nodeId);
  for (const edge of outgoing) {
    console.log(`${edge.sourceId} --[${edge.type}]--> ${edge.targetId}`);
  }

  const incoming = await txn.getIncomingEdges(nodeId);
  for (const edge of incoming) {
    console.log(`${edge.sourceId} --[${edge.type}]--> ${edge.targetId}`);
  }
});
Cypher Queries
// Pattern matching
const names = await db.query("MATCH (n:Person) RETURN n.name");

// With parameters
const byName = await db.query(
  "MATCH (n:Person) WHERE n.name = $name RETURN n",
  { name: "Alice" }
);

// Vector similarity in Cypher
const byVector = await db.query(
  "MATCH (n:Document) WHERE n.embedding <=> $vec < 0.5 RETURN n.title",
  { vec: new Float32Array([0.1, 0.2, 0.3, 0.4]) }
);

// Full-text search in Cypher
const byText = await db.query(
  'MATCH (n:Document) WHERE n.content @@ "machine learning" RETURN n.title'
);
// Data mutation
await db.query('CREATE (n:Person {name: "Charlie", age: 35})');
await db.query('MATCH (n:Person {name: "Charlie"}) SET n.age = 36');
await db.query('MATCH (n:Person {name: "Charlie"}) DETACH DELETE n');
Query Cache
// Get cache statistics
const stats = await db.cacheStats();
console.log(
`Entries: ${stats.entries}, Hits: ${stats.hits}, Misses: ${stats.misses}`
);
// Clear the cache
await db.cacheClear();
Supported Property Types
- `null` - Null value
- `boolean` - Boolean
- `number` - Integer or float
- `string` - UTF-8 string
- `Uint8Array` - Binary data
Error Handling
import { Database, isLibraryAvailable } from "@hajewski/latticedb";
// Check if native library is available
if (!isLibraryAvailable()) {
  console.error("LatticeDB native library not found");
  process.exit(1);
}

try {
  const db = new Database("test.db", { create: true });
  await db.open();
  // ...
  await db.close();
} catch (error) {
  console.error("Database error:", error);
}
Building from Source
Requires Node.js 18+ and the LatticeDB native library.
# From the latticedb root directory
zig build shared
# Build the TypeScript bindings
cd bindings/typescript
npm install
npm run build
# Run tests
npm test
Requirements
- Node.js 18+
- The native LatticeDB library (`liblattice.dylib` / `liblattice.so`)
Architecture Overview
This section explains how LatticeDB's storage engine works, from the ground up. Each chapter builds on the previous, showing how simple primitives combine to create a durable, transactional database.
The Stack
┌─────────────────────────────────────────────────────────────┐
│ Application │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Transaction Manager │
│ (ACID guarantees, isolation) │
└─────────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌───────────────────┐ ┌─────────────┐ ┌─────────────────────┐
│ B+Tree │ │ WAL │ │ Checkpointer │
│ (ordered data) │ │ (durability)│ │ (recovery bounds) │
└───────────────────┘ └─────────────┘ └─────────────────────┘
│ │ │
└───────────────┼───────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Buffer Pool │
│ (page caching in RAM) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Page Manager │
│ (page allocation, checksums) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Virtual File System │
│ (file I/O) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Operating System │
└─────────────────────────────────────────────────────────────┘
Chapters
Storage Engine
- Virtual File System - Abstracting file I/O for portability and testing
- Page Manager - Fixed-size pages, allocation, and checksums
- Buffer Pool - Caching pages in memory with eviction
- B+Tree - Ordered key-value storage with efficient lookups
Durability & Transactions
- Write-Ahead Log - Durability through logging before data changes
- Transaction Manager - ACID transactions with begin/commit/abort
- Checkpointing - Bounding recovery time by flushing dirty pages
- Recovery - ARIES-style crash recovery with redo/undo
Data Model
- Graph Storage - Nodes, edges, labels, and properties
Search Indexes
- Vector Search - HNSW approximate nearest neighbor search
- Full-Text Search - BM25-scored inverted index with tokenization
Query System
- Query Execution - Volcano iterator model, operators, and planning
Design Principles
Direct Page Manipulation
We don't deserialize pages into objects. Instead, we read/write bytes directly in page buffers. This is "zero-copy" - no intermediate representations, no serialization overhead.
Traditional approach: Our approach:
Page bytes ──► Object ──► Page bytes ──► Direct access
in memory via offsets
│ │
▼ ▼
Modify Modify bytes
│ │
▼ ▼
Object ──► Serialize ──► Page Already in page buffer!
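The contrast is easy to sketch in a few lines of Python (a toy model with a hypothetical field layout, not LatticeDB's actual header format): fields are read and written at fixed offsets in the raw page buffer, so the buffer is always exactly the bytes that go to disk.

```python
import struct

# A 4 KB page buffer; hypothetical layout: a u32 checksum at offset 0,
# a u8 page type at offset 4.
page = bytearray(4096)
struct.pack_into("<I", page, 0, 0xDEADBEEF)  # checksum field
struct.pack_into("<B", page, 4, 2)           # page type

# "Zero-copy" style: read fields straight out of the buffer by offset.
# No intermediate object is ever constructed.
(checksum,) = struct.unpack_from("<I", page, 0)
(page_type,) = struct.unpack_from("<B", page, 4)
assert checksum == 0xDEADBEEF and page_type == 2

# Modifying is just writing bytes back in place - the buffer is already
# in the form that goes to disk, so there is nothing to serialize.
struct.pack_into("<B", page, 4, 1)
```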
Everything is Pages
The entire database is built on fixed-size pages (4KB by default):
- B+Tree nodes are pages
- WAL frames are pages
- The file header is a page
- Free list entries are pages
This uniformity simplifies the system - one caching layer, one I/O path, one checksum format.
Durability Through WAL
Changes are logged before being applied. This means:
- Commit = log is on disk (fast, sequential write)
- Actual data pages can be written lazily
- Crash recovery replays the log
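The log-then-apply idea fits in a toy Python model (illustrative only - not LatticeDB's on-disk format): a change is committed once its log record is durable, and replaying the log after a crash restores any page writes that never happened.

```python
import json

wal = []          # stands in for the sequential log file
pages = {0: b""}  # stands in for lazily-written data pages

def commit(page_id, new_bytes):
    # 1. Log the change first (in a real engine: append + fsync = commit)
    wal.append(json.dumps({"page": page_id, "data": new_bytes.hex()}))
    # 2. The data page may be written later - or not at all before a crash

def recover():
    # Replay the log to rebuild page state after a crash
    for record in wal:
        r = json.loads(record)
        pages[r["page"]] = bytes.fromhex(r["data"])

commit(0, b"hello")
# Simulate a crash before the data page was written: pages[0] is stale...
assert pages[0] == b""
recover()
# ...but replaying the log brings it back.
assert pages[0] == b"hello"
```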
Simple Concurrency Model
Each page has a reader-writer latch. Multiple readers OR one writer. No complex lock hierarchies - simplicity over maximum concurrency.
Virtual File System (VFS)
What It Is
The Virtual File System is an abstraction layer over file I/O operations. Instead of calling OS functions directly, all file operations go through a VFS interface.
Why We Need It
1. Testability
With a VFS, we can create an in-memory implementation for testing:
Production: Testing:
┌─────────────┐ ┌─────────────┐
│ B+Tree │ │ B+Tree │
└──────┬──────┘ └──────┬──────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ PosixVfs │ │ MemoryVfs │
└──────┬──────┘ └──────┬──────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Disk I/O │ │ HashMap │
└─────────────┘ └─────────────┘
Tests run instantly because there's no actual disk I/O.
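The same idea in a few lines of Python (names are illustrative, not the actual binding API): swap the backing store and nothing above the interface changes.

```python
class MemoryFile:
    """Backs a 'file' with a bytearray - instant, no disk I/O."""
    def __init__(self):
        self.data = bytearray()

    def read(self, offset, n):
        return bytes(self.data[offset:offset + n])

    def write(self, offset, data):
        end = offset + len(data)
        if end > len(self.data):
            self.data.extend(b"\x00" * (end - len(self.data)))  # grow file
        self.data[offset:end] = data

    def sync(self):
        pass  # nothing to flush - it's all in RAM

class MemoryVfs:
    def __init__(self):
        self.files = {}

    def open(self, path, create=False):
        if path not in self.files:
            if not create:
                raise FileNotFoundError(path)
            self.files[path] = MemoryFile()
        return self.files[path]

vfs = MemoryVfs()
f = vfs.open("test.db", create=True)
f.write(4096, b"page one")          # positioned write at page 1
assert f.read(4096, 8) == b"page one"
```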
2. Portability
Different operating systems have different file APIs:
- Linux: `pread`, `pwrite`, `fsync`
- Windows: `ReadFile`, `WriteFile`, `FlushFileBuffers`
- Embedded: Custom flash drivers
The VFS hides these differences. We implement one VFS per platform, and the rest of the database doesn't care.
3. Injection of Failures
For testing crash recovery, we need to simulate failures:
const FaultyVfs = struct {
    inner: Vfs,
    fail_after_n_writes: usize,
    write_count: usize,

    pub fn write(self: *Self, offset: u64, data: []const u8) !void {
        self.write_count += 1;
        if (self.write_count > self.fail_after_n_writes) {
            return error.SimulatedCrash;
        }
        return self.inner.write(offset, data);
    }
};
The Interface
File Operations
pub const File = struct {
    ptr: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        read: fn(ptr, offset: u64, buf: []u8) Error!usize,
        write: fn(ptr, offset: u64, data: []const u8) Error!void,
        sync: fn(ptr) Error!void,
        size: fn(ptr) Error!u64,
        close: fn(ptr) void,
        truncate: fn(ptr, size: u64) Error!void,
    };

    // Convenience methods that call vtable
    pub fn read(self: File, offset: u64, buf: []u8) !usize {
        return self.vtable.read(self.ptr, offset, buf);
    }
    // ... etc
};
VFS Operations
pub const Vfs = struct {
    ptr: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        open: fn(ptr, path: []const u8, flags: OpenFlags) Error!File,
        delete: fn(ptr, path: []const u8) Error!void,
        exists: fn(ptr, path: []const u8) bool,
    };
};
Open Flags
pub const OpenFlags = struct {
    read: bool = true,       // Open for reading
    write: bool = false,     // Open for writing
    create: bool = false,    // Create if doesn't exist
    exclusive: bool = false, // Fail if exists (with create)
};
Key Operations
Positioned Read/Write
Unlike streaming I/O, we use positioned reads and writes:
// Read 4096 bytes starting at byte offset 8192
const bytes_read = try file.read(8192, buffer[0..4096]);
// Write 4096 bytes starting at byte offset 8192
try file.write(8192, data[0..4096]);
This maps directly to page-based access:
- Page 0 is at offset 0
- Page 1 is at offset 4096
- Page N is at offset N * 4096
Sync (fsync)
The most important operation for durability:
try file.sync();
This forces all pending writes to physical storage. Without it, data might sit in OS buffers and be lost on power failure.
Application writes: After fsync:
┌─────────────┐ ┌─────────────┐
│ Application │ │ Application │
└──────┬──────┘ └─────────────┘
│ write()
▼
┌─────────────┐ ┌─────────────┐
│ OS Buffer │ │ OS Buffer │
│ (in RAM) │ ◄── Data │ (empty) │
└─────────────┘ sits └──────┬──────┘
╳ here! │ fsync forces
Power │ write to disk
failure ▼
loses it! ┌─────────────────────┐
│ Disk (persistent) │
│ Data is SAFE │
└─────────────────────┘
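On POSIX systems the whole pattern is a couple of syscalls, which Python exposes directly. This sketch (Linux/macOS only) uses `os.pwrite`/`os.pread` for positioned page I/O and `os.fsync` as the durability barrier:

```python
import os
import tempfile

# Open a scratch database file
fd = os.open(os.path.join(tempfile.mkdtemp(), "demo.db"),
             os.O_RDWR | os.O_CREAT, 0o644)

PAGE_SIZE = 4096
page_id = 5

# Positioned write: page N lives at byte offset N * 4096
os.pwrite(fd, b"\x2a" + b"\x00" * (PAGE_SIZE - 1), page_id * PAGE_SIZE)
os.fsync(fd)  # without this, the bytes may still sit in OS buffers

# Positioned read of the same page
data = os.pread(fd, PAGE_SIZE, page_id * PAGE_SIZE)
assert data[0] == 0x2A
os.close(fd)
```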
The POSIX Implementation
For Unix-like systems (Linux, macOS):
pub const PosixVfs = struct {
    allocator: Allocator,

    pub fn open(self: *Self, path: []const u8, flags: OpenFlags) !File {
        var posix_flags: u32 = 0;
        if (flags.read and flags.write) {
            posix_flags |= O_RDWR;
        } else if (flags.write) {
            posix_flags |= O_WRONLY;
        } else {
            posix_flags |= O_RDONLY;
        }
        if (flags.create) posix_flags |= O_CREAT;
        if (flags.exclusive) posix_flags |= O_EXCL;

        const fd = try std.posix.open(path, posix_flags, 0o644);
        // ... wrap in File struct
    }
};
Read Implementation
Uses pread for positioned reading (thread-safe, no seeking):
fn read(self: *PosixFile, offset: u64, buf: []u8) !usize {
    return std.posix.pread(self.fd, buf, offset);
}
Write Implementation
Uses pwrite for positioned writing:
fn write(self: *PosixFile, offset: u64, data: []const u8) !void {
    var written: usize = 0;
    while (written < data.len) {
        const n = try std.posix.pwrite(self.fd, data[written..], offset + written);
        if (n == 0) return error.DiskFull;
        written += n;
    }
}
Sync Implementation
fn sync(self: *PosixFile) !void {
    try std.posix.fsync(self.fd);
}
Error Handling
VFS operations can fail in various ways:
pub const VfsError = error{
    FileNotFound,     // Path doesn't exist
    PermissionDenied, // No access rights
    DiskFull,         // No space left
    IoError,          // Hardware failure
    FileLocked,       // Another process has it
    InvalidPath,      // Bad path string
    AlreadyExists,    // Exclusive create failed
};
These are translated from OS-specific error codes:
fn mapPosixError(err: std.posix.Error) VfsError {
    return switch (err) {
        .ENOENT => VfsError.FileNotFound,
        .EACCES, .EPERM => VfsError.PermissionDenied,
        .ENOSPC => VfsError.DiskFull,
        .EIO => VfsError.IoError,
        // ...
    };
}
Usage Pattern
// Create VFS
var posix_vfs = PosixVfs.init(allocator);
const vfs = posix_vfs.vfs(); // Get interface

// Open file
const file = try vfs.open("database.db", .{
    .read = true,
    .write = true,
    .create = true,
});
defer file.close();

// Read page 5
var buf: [4096]u8 = undefined;
_ = try file.read(5 * 4096, &buf);

// Modify and write back
buf[0] = 42;
try file.write(5 * 4096, &buf);

// Ensure durability
try file.sync();
Why Not Just Use std.fs?
Zig's standard library has file operations, but:
- No positioned I/O - `std.fs.File` uses streaming with seek, which isn't thread-safe
- No vtable pattern - Can't swap implementations
- Different error types - We want database-specific errors
The VFS is a thin layer, but it gives us control where we need it.
Page Manager
What It Is
The Page Manager handles the lowest level of database storage: fixed-size pages. It manages the database file structure, page allocation/deallocation, and data integrity through checksums.
Why Fixed-Size Pages?
Simplicity
Every piece of data lives in a 4KB page. This uniformity simplifies everything:
┌────────────────────────────────────────────────────────────┐
│ Database File │
├────────────┬────────────┬────────────┬────────────┬────────┤
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ ... │
│ (Header) │ (B+Tree) │ (B+Tree) │ (Free) │ │
│ 4096 bytes│ 4096 bytes│ 4096 bytes│ 4096 bytes│ │
└────────────┴────────────┴────────────┴────────────┴────────┘
│ │ │ │
▼ ▼ ▼ ▼
Offset: 0 4096 8192 12288
To read page N: offset = N * 4096
Alignment with Hardware
- SSD pages: Modern SSDs have 4KB or 8KB internal pages
- OS pages: Most operating systems use 4KB virtual memory pages
- Disk sectors: Traditional HDDs use 512B or 4KB sectors
4KB aligns well with all of these, enabling:
- Direct I/O (bypassing OS cache)
- Atomic writes on some hardware
- Efficient memory mapping
Efficient Caching
Fixed-size pages make the buffer pool trivial - every frame is the same size, no fragmentation.
File Structure
Page 0: File Header
The first page is special - it contains metadata about the database:
┌────────────────────────────────────────────────────────────┐
│ File Header (Page 0) │
├────────────────────────────────────────────────────────────┤
│ Offset 0-3: Magic number (0x4C415454 = "LATT") │
│ Offset 4-5: Format version │
│ Offset 6-7: Minimum reader version │
│ Offset 8-11: Page size (4096) │
│ Offset 12-15: Freelist head page │
│ Offset 16-23: Created timestamp │
│ Offset 24-31: Modified timestamp │
│ Offset 32-47: File UUID (16 bytes) │
│ Offset 48+: Reserved / padding │
└────────────────────────────────────────────────────────────┘
Key fields:
- Magic number: Identifies this as a Lattice database file. Opening a random file will fail fast.
- Format version: Allows future changes to the format
- Freelist head: Points to the first free page (for allocation)
- File UUID: Unique identifier, used to match WAL files
Data Pages
Every other page starts with a common header:
┌────────────────────────────────────────────────────────────┐
│ Page Header (8 bytes) │
├────────────────────────────────────────────────────────────┤
│ Offset 0-3: Checksum (CRC32 of bytes 8-4095) │
│ Offset 4: Page type (btree_internal, btree_leaf, etc)│
│ Offset 5: Flags │
│ Offset 6-7: Reserved │
├────────────────────────────────────────────────────────────┤
│ Offset 8-4095: Page-specific data │
└────────────────────────────────────────────────────────────┘
Page Types
pub const PageType = enum(u8) {
    free = 0,           // Unallocated, on freelist
    btree_internal = 1, // B+Tree internal node
    btree_leaf = 2,     // B+Tree leaf node
    overflow = 3,       // Large value overflow
    freelist = 4,       // Freelist continuation
};
Page Allocation
The Freelist
Free pages are linked together in a freelist:
Header Free pages linked together
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│freelist: ├────────────►│ Page 5 ├───►│ Page 3 ├───►│ Page 9 │
│ 5 │ │ next: 3 │ │ next: 9 │ │ next: 0 │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│
NULL (end)
Allocating a Page
pub fn allocatePage(self: *Self) !PageId {
    // 1. Try freelist first
    if (self.header.freelist_page != NULL_PAGE) {
        return self.allocateFromFreelist();
    }

    // 2. Allocate at end of file
    const page_id = self.pageCount();

    // 3. Extend file with zeroed page
    var zeros: [4096]u8 = [_]u8{0} ** 4096;
    const header: *PageHeader = @ptrCast(&zeros);
    header.* = PageHeader.init(.free);
    try self.file.write(page_id * 4096, &zeros);

    return page_id;
}
Allocating from Freelist
fn allocateFromFreelist(self: *Self) !PageId {
    const page_id = self.header.freelist_page;

    // Read the free page to get next pointer
    var buf: [4096]u8 = undefined;
    try self.readPageRaw(page_id, &buf);

    // Next pointer is stored after the 8-byte header
    const next_free = std.mem.readInt(u32, buf[8..12], .little);

    // Update freelist head
    self.header.freelist_page = next_free;
    try self.writeHeader();

    return page_id;
}
Freeing a Page
pub fn freePage(self: *Self, page_id: PageId) !void {
    // 1. Can't free the header page
    if (page_id == 0) return error.InvalidPageId;

    // 2. Set up page as free, pointing to current freelist head
    var buf: [4096]u8 = [_]u8{0} ** 4096;
    const header: *PageHeader = @ptrCast(&buf);
    header.* = PageHeader.init(.free);

    // Store old freelist head as next pointer
    std.mem.writeInt(u32, buf[8..12], self.header.freelist_page, .little);

    // 3. Calculate checksum and write
    header.checksum = calculateChecksum(buf[8..]);
    try self.file.write(page_id * 4096, &buf);

    // 4. Update freelist head to this page
    self.header.freelist_page = page_id;
    try self.writeHeader();
}
This is a LIFO (stack) freelist - last freed is first allocated. Simple and efficient.
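The allocate/free dance is easy to model. This Python toy (illustrative, not the real implementation) captures the LIFO behavior: the header holds the head page id, and each free page stores the next free page id in its body.

```python
NULL_PAGE = 0
free_links = {}  # page_id -> next-free pointer (stands in for page bodies)
header = {"freelist": NULL_PAGE, "page_count": 1}  # page 0 is the header

def allocate():
    if header["freelist"] != NULL_PAGE:
        # Pop from the freelist: follow the stored next pointer
        page_id = header["freelist"]
        header["freelist"] = free_links.pop(page_id)
        return page_id
    # Freelist empty: extend the "file" with a new page
    page_id = header["page_count"]
    header["page_count"] += 1
    return page_id

def free(page_id):
    free_links[page_id] = header["freelist"]  # point at the old head
    header["freelist"] = page_id              # this page is the new head

a, b, c = allocate(), allocate(), allocate()  # pages 1, 2, 3
free(b)
free(c)
assert allocate() == c  # LIFO: the last page freed comes back first
assert allocate() == b
```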
Checksums
Every page has a CRC32 checksum that covers bytes 8-4095 (everything after the checksum field itself).
Why Checksums?
- Detect corruption: Hardware failures, cosmic rays, bugs
- Detect torn writes: Partial page writes from crashes
- Validate reads: Catch errors before using bad data
Checksum Calculation
pub fn calculateChecksum(data: []const u8) u32 {
    var crc = std.hash.Crc32.init();
    crc.update(data);
    return crc.final();
}
Writing with Checksum
pub fn writePage(self: *Self, page_id: PageId, buf: []u8) !void {
    // Calculate checksum of everything after the checksum field
    const header: *PageHeader = @ptrCast(buf.ptr);
    header.checksum = calculateChecksum(buf[8..]);

    // Write to disk
    try self.file.write(page_id * 4096, buf);
}
Reading with Verification
pub fn readPage(self: *Self, page_id: PageId, buf: []u8) !void {
    try self.file.read(page_id * 4096, buf);

    // Verify checksum
    const header: *const PageHeader = @ptrCast(buf.ptr);
    const expected = calculateChecksum(buf[8..]);
    if (header.checksum != expected and header.checksum != 0) {
        return error.ChecksumMismatch;
    }
}
Note: Checksum of 0 is allowed for newly allocated pages that haven't been written yet.
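The same scheme is easy to reproduce in Python with `zlib.crc32` (a sketch of the idea; we assume it matches the CRC32 variant `std.hash.Crc32` computes):

```python
import zlib

PAGE_SIZE = 4096

def write_checksum(page: bytearray) -> None:
    # CRC32 over bytes 8..4095, stored little-endian in bytes 0..3
    crc = zlib.crc32(page[8:])
    page[0:4] = crc.to_bytes(4, "little")

def verify_checksum(page) -> bool:
    stored = int.from_bytes(page[0:4], "little")
    if stored == 0:
        return True  # never-written page, allowed
    return stored == zlib.crc32(page[8:])

page = bytearray(PAGE_SIZE)
page[8:13] = b"hello"
write_checksum(page)
assert verify_checksum(page)

page[100] ^= 0xFF  # simulate corruption or a torn write
assert not verify_checksum(page)
```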
Read-Only Mode
The Page Manager supports opening databases read-only:
var pm = try PageManager.init(allocator, vfs, "db.file", .{
    .read_only = true,
});

// This fails:
_ = try pm.allocatePage();  // error.PermissionDenied
try pm.writePage(1, &buf);  // error.PermissionDenied
Useful for:
- Backup tools
- Read replicas
- Forensic analysis
Error Handling
pub const PageManagerError = error{
    InvalidHeader,    // File header is malformed
    InvalidMagic,     // Not a Lattice database file
    VersionTooNew,    // File created by newer version
    ChecksumMismatch, // Data corruption detected
    InvalidPageId,    // Page ID out of range
    PageNotAllocated, // Accessing unallocated page
    IoError,          // Underlying I/O failed
    PermissionDenied, // Write to read-only database
    FileNotFound,     // Database file doesn't exist
    DiskFull,         // No space for new pages
};
Thread Safety
The Page Manager itself is NOT thread-safe. Each operation directly touches the file. In a multi-threaded context, the Buffer Pool provides thread-safe access by:
- Caching pages in memory
- Using latches (reader-writer locks) per page
- Serializing writes through its own mutex
Usage Pattern
// Open or create database
var pm = try PageManager.init(allocator, vfs, "my.db", .{
    .create = true,
    .page_size = 4096,
});
defer pm.deinit();

// Allocate a page
const page_id = try pm.allocatePage();

// Write data
var buf: [4096]u8 align(4096) = undefined;
const header: *PageHeader = @ptrCast(&buf);
header.* = PageHeader.init(.btree_leaf);
// ... fill in page data ...
try pm.writePage(page_id, &buf);

// Read it back
var read_buf: [4096]u8 align(4096) = undefined;
try pm.readPage(page_id, &read_buf);

// Free the page
try pm.freePage(page_id);

// Ensure durability
try pm.sync();
Key Design Decisions
Fixed 4KB Pages
Chosen for hardware alignment. Could be made configurable (stored in header), but 4KB works well for most cases.
Freelist in Pages
The freelist uses the free pages themselves for storage. No external data structure needed. Elegant and space-efficient.
Checksum After Header
Checksum doesn't cover itself (circular dependency). By placing checksum first, we can compute it over the rest of the page in one pass.
No Internal Fragmentation Tracking
We don't track free space within pages - that's the responsibility of higher layers (B+Tree). The Page Manager only deals with whole pages.
Buffer Pool
What It Is
The Buffer Pool is a cache that keeps frequently accessed pages in memory. Instead of reading from disk every time, we check the buffer pool first.
The Problem It Solves
Disk I/O is slow:
Operation Time
─────────────────────────────
CPU instruction ~1 ns
RAM access ~100 ns
SSD read ~100,000 ns (100 μs)
HDD read ~10,000,000 ns (10 ms)
RAM is 1,000-100,000x faster than disk. If we can keep hot pages in RAM, performance improves dramatically.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Buffer Pool │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Page Table │ │
│ │ (PageId → FrameId mapping) │ │
│ │ │ │
│ │ PageId 5 ──► Frame 2 │ │
│ │ PageId 12 ─► Frame 0 │ │
│ │ PageId 3 ──► Frame 7 │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │Frame 0│ │Frame 1│ │Frame 2│ │Frame 3│ │Frame 4│ │Frame 5│ │
│ │───────│ │───────│ │───────│ │───────│ │───────│ │───────│ │
│ │pg: 12 │ │pg: -- │ │pg: 5 │ │pg: -- │ │pg: 8 │ │pg: 1 │ │
│ │pin: 2 │ │pin: 0 │ │pin: 1 │ │pin: 0 │ │pin: 0 │ │pin: 3 │ │
│ │dirty:Y│ │dirty:N│ │dirty:N│ │dirty:N│ │dirty:Y│ │dirty:N│ │
│ │use: 3 │ │use: 0 │ │use: 1 │ │use: 0 │ │use: 2 │ │use: 5 │ │
│ │[data] │ │[data] │ │[data] │ │[data] │ │[data] │ │[data] │ │
│ └───────┘ └───────┘ └───────┘ └───────┘ └───────┘ └───────┘ │
│ │
│ Free list: [1, 3] (frames not holding any page) │
│ Clock hand: ──────────────────────► │
└─────────────────────────────────────────────────────────────────┘
The Frame
Each frame holds one page:
pub const BufferFrame = struct {
    page_id: PageId,         // Which page is here (0 = empty)
    data: []align(4096) u8,  // The actual 4KB page data
    pin_count: atomic(u32),  // Reference count
    dirty: bool,             // Modified since read?
    usage_count: u8,         // For Clock eviction
    latch: RwLatch,          // Reader-writer lock
};
Pin Count
The pin count is a reference count:
fetchPage() ──► pin_count += 1 "I'm using this page"
unpinPage() ──► pin_count -= 1 "I'm done with this page"
pin_count > 0 ──► Page is in use, cannot be evicted
pin_count = 0 ──► Page can be evicted if needed
Dirty Flag
Read page ──► dirty = false "Matches disk"
Modify page ──► dirty = true "Different from disk"
Write to disk ─► dirty = false "Matches disk again"
Dirty pages MUST be written to disk before eviction. Otherwise we lose data!
Reader-Writer Latch
Multiple readers OR one writer
Read latch: "I'm reading, others can read too"
Write latch: "I'm modifying, exclusive access"
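Python's standard library has no reader-writer lock, but the semantics can be sketched with a condition variable (a minimal illustration of the many-readers-or-one-writer rule, not the latch implementation LatticeDB uses):

```python
import threading

class RwLatch:
    """Many readers OR one writer."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:          # wait out any active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:  # need exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

latch = RwLatch()
latch.acquire_read()
latch.acquire_read()   # two readers coexist
latch.release_read()
latch.release_read()
latch.acquire_write()  # now exclusive
latch.release_write()
```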
Fetching a Page
pub fn fetchPage(self: *Self, page_id: PageId, mode: LatchMode) !*BufferFrame {
    self.mutex.lock();
    defer self.mutex.unlock();

    // 1. Check if already in buffer pool
    if (self.page_table.get(page_id)) |frame_id| {
        const frame = &self.frames[frame_id];
        frame.pin_count.fetchAdd(1, .monotonic);
        frame.usage_count = @min(frame.usage_count + 1, 255);
        acquireLatch(frame, mode);
        return frame;
    }

    // 2. Not in pool - need to load from disk
    const frame = try self.findVictimFrame();

    // 3. If victim is dirty, flush it first
    if (frame.dirty) {
        try self.pm.writePage(frame.page_id, frame.data);
        frame.dirty = false;
    }

    // 4. Update page table
    if (frame.page_id != NULL_PAGE) {
        self.page_table.remove(frame.page_id);
    }
    self.page_table.put(page_id, frame_id);

    // 5. Read page from disk
    try self.pm.readPage(page_id, frame.data);

    // 6. Set up frame
    frame.page_id = page_id;
    frame.pin_count.store(1, .monotonic);
    frame.dirty = false;
    frame.usage_count = 1;
    acquireLatch(frame, mode);
    return frame;
}
The Clock Eviction Algorithm
When the buffer pool is full, we need to evict a page to make room. We use the Clock algorithm (also called "second chance").
Why Clock?
- LRU (Least Recently Used) is optimal but expensive - requires updating timestamps on every access
- Clock approximates LRU cheaply using a usage bit
How It Works
Imagine the frames arranged in a circle with a clock hand:
┌─────┐
┌────│ 3 │────┐
╱ │use:1│ ╲
┌─────┐ └─────┘ ┌─────┐
│ 2 │ │ 4 │
│use:0│ │use:2│
└─────┘ └─────┘
╲ ┌─────┐ ╱
└────│ 1 │────┘
│use:0│◄──── clock hand
│pin:0│
└─────┘
To find a victim:
fn findVictimFrame(self: *Self) !*BufferFrame {
// Try free list first
if (self.free_list.pop()) |frame_id| {
return &self.frames[frame_id];
}
// Clock sweep
var attempts: usize = 0;
while (attempts < self.frame_count * 2) {
const frame = &self.frames[self.clock_hand];
// Move hand
self.clock_hand = (self.clock_hand + 1) % self.frame_count;
attempts += 1;
// Skip pinned pages
if (frame.pin_count.load(.monotonic) > 0) {
continue;
}
// Second chance: if used recently, decrement and skip
if (frame.usage_count > 0) {
frame.usage_count -= 1;
continue;
}
// Found victim!
return frame;
}
return error.BufferPoolFull; // No evictable frame: all pinned or recently used
}
The key insight: Usage count gives pages a "second chance". Recently used pages survive one clock sweep. Only pages that haven't been used in a full rotation get evicted.
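The sweep is easy to model outside the database. Here is a toy simulation in Python (separate from the Zig implementation): each frame is a [pin_count, usage_count] pair, and the hand sweeps until it finds an unpinned frame whose usage count has reached zero.

```python
def clock_victim(frames, hand):
    """frames: list of [pin_count, usage_count]. Returns (victim_index, new_hand),
    or (None, new_hand) if no frame is evictable after two full rotations."""
    n = len(frames)
    for _ in range(2 * n):
        frame = frames[hand]
        idx, hand = hand, (hand + 1) % n   # move the hand
        if frame[0] > 0:                   # pinned: cannot evict
            continue
        if frame[1] > 0:                   # second chance: decrement and skip
            frame[1] -= 1
            continue
        return idx, hand                   # unpinned and unused: victim found
    return None, hand

# Frame 2 is pinned; frames 0 and 1 burn their second chances first.
frames = [[0, 1], [0, 2], [1, 0], [0, 0]]
victim, _ = clock_victim(frames, 0)
print(victim)  # 3 - the only unpinned frame with usage 0
```

Running it shows how recently used frames survive the sweep while frame 3, unpinned and unused, gets evicted.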
Unpinning Pages
When done with a page:
pub fn unpinPage(self: *Self, frame: *BufferFrame, dirty: bool) void {
// Release latch
frame.latch.release();
// Mark dirty if modified
if (dirty) {
frame.dirty = true;
}
// Decrement pin count
_ = frame.pin_count.fetchSub(1, .monotonic);
}
Always unpin! Failure to unpin causes:
- Pages stuck in memory forever
- Buffer pool eventually fills with pinned pages
- BufferPoolFull errors
Flushing Pages
Writing dirty pages to disk:
// Flush one page
pub fn flushPage(self: *Self, page_id: PageId) !void {
const frame = self.getFrame(page_id) orelse return;
if (frame.dirty) {
try self.pm.writePage(frame.page_id, frame.data);
frame.dirty = false;
}
}
// Flush all dirty pages
pub fn flushAll(self: *Self) !void {
for (self.frames) |*frame| {
if (frame.page_id != NULL_PAGE and frame.dirty) {
try self.pm.writePage(frame.page_id, frame.data);
frame.dirty = false;
}
}
}
Thread Safety
The buffer pool is thread-safe:
- Mutex protects page_table, free_list, clock_hand
- Per-frame latches protect page data
- Atomic pin_count for safe reference counting
Thread 1: fetchPage(5) Thread 2: fetchPage(5)
│ │
├─► mutex.lock() ├─► mutex.lock() [WAIT]
│ look up page 5 │
│ pin_count++ │
│ mutex.unlock() ─────────────┼─► mutex.lock() [GOT IT]
│ │ look up page 5 [FOUND]
├─► frame.latch.read() │ pin_count++
│ ... read data ... │ mutex.unlock()
│ │
│ ├─► frame.latch.read()
│ │ ... read data ...
│ │
├─► unpinPage() ├─► unpinPage()
Multiple threads can read the same page concurrently (shared latch).
Memory Alignment
Page buffers are 4KB-aligned:
const data = try allocator.alignedAlloc(u8, 4096, page_size);
Why?
- Direct I/O: Some systems require aligned buffers for O_DIRECT
- SIMD: Aligned data enables vectorized operations
- Cache lines: Better CPU cache utilization
Sizing the Buffer Pool
// 64MB buffer pool = 16,384 pages
var bp = try BufferPool.init(allocator, &pm, 64 * 1024 * 1024);
Guidelines:
- More is better (to a point)
- Working set: Should fit frequently accessed pages
- Available RAM: Leave room for OS and other processes
- Typical: 25-75% of available RAM
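The sizing arithmetic is just pool bytes divided by page size. A quick check in Python (illustrative only):

```python
PAGE_SIZE = 4096

def frame_count(pool_bytes):
    """Number of page frames a buffer pool of pool_bytes can hold."""
    return pool_bytes // PAGE_SIZE

print(frame_count(64 * 1024 * 1024))  # 16384 frames, as in the example above
```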
Usage Pattern
var bp = try BufferPool.init(allocator, &pm, pool_size);
defer bp.deinit(); // Flushes dirty pages
// Read a page
const frame = try bp.fetchPage(page_id, .shared);
defer bp.unpinPage(frame, false);
const value = readValueFromPage(frame.data);
// Modify a page (new name - redeclaring frame in the same scope would not compile)
const wframe = try bp.fetchPage(page_id, .exclusive);
defer bp.unpinPage(wframe, true); // true = dirty
modifyPage(wframe.data);
Key Invariants
- Pin before access: Never access page data without pinning
- Unpin when done: Every fetchPage must have matching unpinPage
- Mark dirty: If you modified the page, set dirty=true when unpinning
- Flush before close: deinit() flushes, or call flushAll() explicitly
B+Tree
What It Is
A B+Tree is a self-balancing tree structure optimized for disk-based storage. It provides O(log N) lookups, inserts, and deletes, with efficient range scans.
Why B+Tree?
The Problem with Binary Trees
A binary search tree with 1 million entries is ~20 levels deep (log2(1M) ≈ 20). Each level requires a disk read. That's 20 disk reads per lookup!
B+Tree: Wide and Shallow
A B+Tree has many keys per node (hundreds), making it very shallow:
Binary Tree (1M entries): B+Tree (1M entries):
depth ~20 depth ~3
○ ○
/ \ / | \
○ ○ ○ ○ ○
/\ /\ / | \ / | \ / | \
○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○
... (20 levels) (leaf level)
20 disk reads 3 disk reads!
With 100 keys per node: log100(1M) ≈ 3 levels.
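The depth arithmetic can be verified with an integer loop (a small Python illustration, not part of LatticeDB):

```python
def tree_depth(entries, fanout):
    """Smallest depth such that fanout**depth >= entries."""
    depth = 1
    capacity = fanout
    while capacity < entries:
        capacity *= fanout
        depth += 1
    return depth

print(tree_depth(1_000_000, 2))    # 20 levels for a binary tree
print(tree_depth(1_000_000, 100))  # 3 levels for a 100-way B+Tree
```

Doubling the fanout barely helps a binary tree; going from 2 to 100 children per node collapses the tree from 20 levels to 3.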
Data Only in Leaves
B+Trees store all data in leaf nodes. Internal nodes only contain keys for routing:
┌─────────────────────┐
│ [30] [60] [90] │ ◄── Internal: keys only
└───┬──────┬──────┬───┘
╱ │ ╲
┌─────────────┘ │ └─────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ 10:val 20:val │◄──►│ 30:val 50:val │◄──►│ 60:val 80:val │
└───────────────┘ └───────────────┘ └───────────────┘
▲ ▲
└─── Leaves: keys AND values ─────────────┘
Linked for range scans
Our Implementation: Direct Page Manipulation
We don't create "node objects" in memory. Instead, we read/write bytes directly in page buffers. The "node" is just a lens over page bytes.
Traditional approach: Our approach:
┌─────────────┐ ┌─────────────┐
│ Page bytes │ │ Page bytes │
└──────┬──────┘ └──────┬──────┘
│ deserialize │
▼ │ direct access
┌─────────────┐ │
│ Node struct │ │
│ - keys[] │ ▼
│ - values[] │ Read/write at
│ - children[]│ calculated offsets
└──────┬──────┘
│ serialize
▼
┌─────────────┐
│ Page bytes │
└─────────────┘
No intermediate objects. No serialization overhead.
Page Layout
Leaf Node
┌────────────────────────────────────────────────────────────────┐
│ Leaf Page (4KB) │
├────────────────────────────────────────────────────────────────┤
│ Bytes 0-7: Page Header (checksum, type=leaf, flags) │
├────────────────────────────────────────────────────────────────┤
│ Bytes 8-9: Entry count (u16) │
│ Bytes 10-13: Prev leaf page (u32) │
│ Bytes 14-17: Next leaf page (u32) │
│ Bytes 18-19: Free space offset (u16) │
├────────────────────────────────────────────────────────────────┤
│ Slot 0: key_offset, key_len, val_offset, val_len (8 bytes) │
│ Slot 1: key_offset, key_len, val_offset, val_len (8 bytes) │
│ Slot 2: ... │
│ ... │
│ [free space grows down ↓] │
│ │
│ [entries grow up ↑] │
│ │
│ "value2" │
│ "key2" │
│ "value1" │
│ "key1" │
└────────────────────────────────────────────────────────────────┘
Byte 4095
Variable-length entries: Keys and values are stored at the end of the page, growing upward. Slots at the front point to them.
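The slotted layout can be sketched in a few lines. This toy Python model (not the Zig page code) keeps slots growing from the front and entry bytes growing from the back of a 4KB buffer; the real implementation keeps slots sorted and binary-searches them.

```python
PAGE_SIZE = 4096

def make_page():
    # (buffer, slot list, free_end): slots grow from the front, data from the back
    return bytearray(PAGE_SIZE), [], PAGE_SIZE

def insert(page, slots, free_end, key: bytes, val: bytes):
    """Write val then key at the tail; record (key_off, key_len, val_off, val_len)."""
    free_end -= len(val)
    val_off = free_end
    page[val_off:val_off + len(val)] = val
    free_end -= len(key)
    key_off = free_end
    page[key_off:key_off + len(key)] = key
    slots.append((key_off, len(key), val_off, len(val)))
    return free_end

def get(page, slots, key: bytes):
    """Linear scan of the slot array (the real page binary-searches sorted slots)."""
    for ko, kl, vo, vl in slots:
        if bytes(page[ko:ko + kl]) == key:
            return bytes(page[vo:vo + vl])
    return None

page, slots, free_end = make_page()
free_end = insert(page, slots, free_end, b"key1", b"value1")
free_end = insert(page, slots, free_end, b"key2", b"value2")
print(get(page, slots, b"key2"))  # b'value2'
```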
Internal Node
┌────────────────────────────────────────────────────────────────┐
│ Internal Page (4KB) │
├────────────────────────────────────────────────────────────────┤
│ Bytes 0-7: Page Header (checksum, type=internal, flags) │
├────────────────────────────────────────────────────────────────┤
│ Bytes 8-9: Key count (u16) │
│ Bytes 10-13: Rightmost child (u32) │
│ Bytes 14-15: Free space offset (u16) │
├────────────────────────────────────────────────────────────────┤
│ Slot 0: child_page, key_offset, key_len (8 bytes) │
│ Slot 1: child_page, key_offset, key_len (8 bytes) │
│ ... │
│ [free space] │
│ │
│ "separator_key2" │
│ "separator_key1" │
└────────────────────────────────────────────────────────────────┘
Point Lookup
Finding a value by key:
pub fn get(self: *Self, key: []const u8) !?[]const u8 {
// 1. Start at root
var page_id = self.root_page;
while (true) {
// 2. Fetch page from buffer pool
const frame = try self.bp.fetchPage(page_id, .shared);
defer self.bp.unpinPage(frame, false);
const page_type = getPageType(frame.data);
if (page_type == .btree_leaf) {
// 3. Binary search in leaf
return LeafNode.search(frame.data, key, self.comparator);
} else {
// 4. Binary search for child pointer, descend
page_id = InternalNode.findChild(frame.data, key, self.comparator);
}
}
}
Example lookup for key "dog":
Root (Internal)
┌─────────────────────┐
│ [cat] [fish] │
└───┬──────────┬──────┘
╱ ╲
"dog" > "cat" ╱ ╲
"dog" < "fish" ╱ ╲
▼
┌─────────────────┐
│ cow:v1 dog:v2 │ ◄── Found! Return v2
└─────────────────┘
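The descent in get() can be modeled with in-memory dicts standing in for pages (a Python sketch, with made-up page contents):

```python
import bisect

# Toy pages keyed by id: internal nodes route by separators, leaves hold sorted pairs.
pages = {
    0: ("internal", ["cat", "fish"], [1, 2, 3]),  # keys < cat -> 1, < fish -> 2, else 3
    1: ("leaf", [("ant", "v0"), ("bee", "v1")]),
    2: ("leaf", [("cow", "v1"), ("dog", "v2")]),
    3: ("leaf", [("goat", "v3")]),
}

def get(root, key):
    page_id = root
    while True:
        page = pages[page_id]
        if page[0] == "leaf":
            entries = page[1]
            i = bisect.bisect_left([k for k, _ in entries], key)
            if i < len(entries) and entries[i][0] == key:
                return entries[i][1]
            return None
        seps, children = page[1], page[2]
        page_id = children[bisect.bisect_right(seps, key)]  # route by separator

print(get(0, "dog"))  # v2 - same path as the diagram: past "cat", before "fish"
```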
Insertion
Simple Case: Space in Leaf
pub fn insert(self: *Self, key: []const u8, value: []const u8) !void {
// 1. Find the leaf
const leaf_page = try self.findLeaf(key);
// 2. Fetch with exclusive latch
const frame = try self.bp.fetchPage(leaf_page, .exclusive);
defer self.bp.unpinPage(frame, true);
// 3. Check if there's space
if (LeafNode.hasSpace(frame.data, key.len, value.len)) {
// 4. Insert directly
LeafNode.insert(frame.data, key, value, self.comparator);
} else {
// 5. Need to split
try self.splitLeafAndInsert(frame, key, value);
}
}
Splitting a Leaf
When a leaf is full:
Before split:
┌─────────────────────────────────────┐
│ a:1 b:2 c:3 d:4 e:5 [FULL] │
└─────────────────────────────────────┘
│
│ Insert "f:6" - no room!
▼
After split:
┌──────────────────────┐ ┌──────────────────────┐
│ a:1 b:2 c:3 │◄──►│ d:4 e:5 f:6 │
└──────────────────────┘ └──────────────────────┘
│ │
└───────────┬────────────────┘
│
▼
"d" promoted to parent
fn splitLeafAndInsert(self: *Self, frame: *BufferFrame, key: []const u8, value: []const u8) !void {
// 1. Allocate new page
const new_page = try self.pm.allocatePage();
const new_frame = try self.bp.fetchPage(new_page, .exclusive);
defer self.bp.unpinPage(new_frame, true);
// 2. Find split point (middle)
const entry_count = LeafNode.getCount(frame.data);
const split_point = entry_count / 2;
// 3. Move upper half to new page
LeafNode.moveEntries(frame.data, new_frame.data, split_point, entry_count);
// 4. Insert new key into the appropriate page
const split_key = LeafNode.getFirstKey(new_frame.data);
if (self.comparator(key, split_key) < 0) {
LeafNode.insert(frame.data, key, value, self.comparator);
} else {
LeafNode.insert(new_frame.data, key, value, self.comparator);
}
// 5. Update sibling pointers
LeafNode.setNext(frame.data, new_page);
LeafNode.setPrev(new_frame.data, frame.page_id);
// 6. Insert the separator key into the parent
try self.insertIntoParent(frame.page_id, split_key, new_page);
}
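The split logic itself is independent of the page layout. A toy Python model over a sorted list of (key, value) pairs shows the same behavior as the diagram:

```python
def split_leaf(entries, key, value):
    """Split a full, sorted leaf and insert (key, value).
    Returns (left, right, separator) - separator is the right page's first key."""
    mid = (len(entries) + 1) // 2
    left, right = entries[:mid], entries[mid:]
    separator = right[0][0]                 # promoted to the parent
    target = left if key < separator else right
    target.append((key, value))
    target.sort()
    return left, right, separator

full = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
left, right, sep = split_leaf(full, "f", 6)
# left = [a,b,c], right = [d,e,f], sep = "d" - matching the diagram above
```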
Growing the Tree
When the root splits, we create a new root:
Before:
┌─────────────┐
│ Root (full)│
└─────────────┘
After root split:
┌─────────────┐
│ New Root │ ◄── New level!
│ [50] │
└──────┬──────┘
╱ ╲
┌──────────┘ └──────────┐
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Old Root │ │ New Page │
│ (left) │ │ (right) │
└─────────────┘ └─────────────┘
The tree grows at the top, not the bottom. All leaves stay at the same depth.
Range Scan
Thanks to linked leaves, range scans are efficient:
pub fn range(self: *Self, start: ?[]const u8, end: ?[]const u8) !Iterator {
// 1. Find starting leaf
const start_page = if (start) |k|
try self.findLeaf(k)
else
try self.findLeftmostLeaf();
return Iterator{
.btree = self,
.current_page = start_page,
.current_slot = 0,
.end_key = end,
};
}
Iterator walks the leaf chain:
Start key: "dog" End key: "hamster"
┌─────────────────────────────────────┐
│ Internal nodes │
└─────────────────────────────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ cat│cow│ │──►│ dog│elk│fox│──►│goat│ham│ │
└─────────────┘ └─────────────┘ └─────────────┘
▲ ▲
│ start │ end
└───────────────────┘
scan range
No need to traverse internal nodes for each entry - just follow the links.
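The iterator's walk over the leaf chain is a plain linked-list traversal. A Python sketch with leaves matching the diagram (this simulates the idea, not the paged implementation):

```python
# Each leaf: (sorted entries, index of next leaf or None)
leaves = [
    ([("cat", 1), ("cow", 2)], 1),
    ([("dog", 3), ("elk", 4), ("fox", 5)], 2),
    ([("goat", 6), ("hamster", 7)], None),
]

def scan(start_leaf, start_key, end_key):
    """Walk the leaf chain from start_leaf, yielding entries in [start_key, end_key]."""
    leaf = start_leaf
    while leaf is not None:
        entries, nxt = leaves[leaf]
        for k, v in entries:
            if k < start_key:
                continue
            if k > end_key:
                return          # past the end key: stop
            yield k, v
        leaf = nxt              # follow the sibling pointer

print([k for k, _ in scan(1, "dog", "hamster")])
# ['dog', 'elk', 'fox', 'goat', 'hamster']
```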
Concurrency
We use latch crabbing (simplified):
Lookup (read-only):
1. Latch root (shared)
2. Latch child (shared)
3. Unlatch root
4. Latch grandchild (shared)
5. Unlatch child
... continue to leaf
Insert (may modify):
1. Latch root (exclusive)
2. If child is safe (won't split), latch child and unlatch root
3. Continue down
A "safe" node is one with enough space that it won't split (for insert) or won't underflow (for delete).
The BTree Struct
pub const BTree = struct {
bp: *BufferPool, // Page cache
root_page: PageId, // Current root
comparator: KeyComparator, // For ordering
page_size: u32, // Usually 4096
allocator: Allocator,
// ...methods
};
This is just ~40 bytes of metadata. The actual data lives in pages on disk.
Key Design Decisions
Variable-Length Keys and Values
We store the actual bytes, not fixed-size slots. This supports:
- Short keys efficiently (no wasted space)
- Long keys (up to page size)
- Any binary data
Separator Keys in Internal Nodes
Internal nodes store separator keys, not full key-value pairs. This maximizes fanout (keys per internal node).
No Deletion (Yet)
Our implementation doesn't handle deletes with rebalancing. For a knowledge graph database, we can use soft deletes (mark as deleted) and periodic compaction.
Page-Based, Not Block-Based
Each node is exactly one page. No spanning, no compression across pages. Simple and predictable.
Example: Full Insert Sequence
Insert "zebra" into a tree:
Step 1: Start at root
┌─────────────────────────────────────────────┐
│ Root (Internal) │
│ [lion] [rabbit] │
│ ↓ ↓ │
│ page 2 page 3 │→ page 4 (rightmost)
└─────────────────────────────────────────────┘
"zebra" > "rabbit" → go right to page 4
Step 2: Descend to leaf (page 4)
┌─────────────────────────────────────────────┐
│ Leaf (Page 4) │
│ snake:v1 tiger:v2 wolf:v3 │
│ [has space for zebra] │
└─────────────────────────────────────────────┘
Step 3: Insert into leaf
┌─────────────────────────────────────────────┐
│ Leaf (Page 4) │
│ snake:v1 tiger:v2 wolf:v3 zebra:v4 │
└─────────────────────────────────────────────┘
Done! No splits needed.
Performance Characteristics
| Operation | Average Case | Worst Case |
|---|---|---|
| Lookup | O(log N) | O(log N) |
| Insert | O(log N) | O(log N) |
| Range(k) | O(log N + k) | O(log N + k) |
Where N is the number of entries and k is the number of entries in range.
Disk I/O per operation: ~3-4 reads for trees up to billions of entries.
Write-Ahead Log (WAL)
The Problem
Imagine you're updating a B+Tree. You need to:
- Modify a leaf page
- Maybe split it (modify parent)
- Maybe split the parent (modify grandparent)
- Update the root
If power fails between steps 2 and 3, your database is corrupted - the tree structure is broken.
Even a single page write isn't safe. A 4KB page write might not be atomic at the hardware level - you could end up with half old data, half new data (a "torn write").
The Insight
Instead of modifying data pages directly, first write a log of what you intend to do. The log is append-only and sequential. Only after the log is safely on disk do you modify the actual data pages.
If you crash:
- Before log write: Nothing happened, no corruption
- After log write, before data write: Replay the log to finish the operation
- After both: Everything is fine
This is the write-ahead rule: log first, then data.
Why Append-Only Sequential Writes Are Special
The WAL relies on a key property: sequential append is much safer than random writes.
Random writes (data pages): Sequential append (WAL):
┌─────┐ ┌─────┐ ┌─────┐ ┌─────────────────────────┐
│ P1 │ │ P2 │ │ P3 │ │ Log Log Log Log ... │
└─────┘ └─────┘ └─────┘ └─────────────────────────┘
↑ ↑ ↑ ↑
│ │ │ │
write write write single write position
With random writes, the disk head jumps around. A crash can leave any combination of pages in inconsistent states.
With sequential append:
- Only one "frontier" where writing happens
- Everything before the frontier is complete
- Everything after doesn't exist yet
- Much simpler to reason about crash states
The System Primitive: fsync
The WAL's safety depends on one critical system call: fsync (or fdatasync).
write(fd, data, len); // Data goes to OS buffer cache
fsync(fd); // Forces data to physical disk platters
Without fsync, your "written" data might sit in RAM for seconds or minutes. A power failure loses it. fsync is a durability barrier - when it returns, data is on persistent storage.
This is expensive (milliseconds, not microseconds), so we batch multiple records into frames before syncing.
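The gap is worth quantifying. A back-of-envelope model in Python, using assumed (not measured) numbers for fsync latency and per-record append cost:

```python
FSYNC_MS = 1.0    # assumed cost of one fsync
RECORD_US = 5.0   # assumed in-memory append cost per record

def commits_per_second(batch):
    """Throughput when `batch` commits share a single fsync."""
    batch_ms = FSYNC_MS + batch * RECORD_US / 1000.0
    return batch / (batch_ms / 1000.0)

print(round(commits_per_second(1)))    # 995: one fsync per commit caps throughput
print(round(commits_per_second(100)))  # 66667: the sync cost is amortized
```

With these assumptions, batching 100 commits per fsync improves throughput by roughly 67x, which is why group commit is standard practice.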
WAL Structure
┌────────────────────────────────────────────────────────────┐
│ WAL File Layout │
├────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Header (4KB) │ │
│ │ • magic: 0x574C4F47 ("WLOG") │ │
│ │ • database_uuid: links WAL to its database │ │
│ │ • frame_count: how many frames written │ │
│ │ • checkpoint_lsn: recovery starting point │ │
│ │ • checksum: detects corruption │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Frame 0 (4KB) │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Frame Header (32 bytes) │ │ │
│ │ │ • frame_number: 0 │ │ │
│ │ │ • record_count: 3 │ │ │
│ │ │ • data_size: 156 │ │ │
│ │ │ • checksum: CRC of data area │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Record: TXN_BEGIN (txn=1, lsn=1) │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Record: INSERT (txn=1, lsn=2, payload=...) │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Record: TXN_COMMIT (txn=1, lsn=3) │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ │ ... unused space ... │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Frame 1 (4KB) │ │
│ │ ... │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
LSN: The Universal Clock
Every record gets a Log Sequence Number (LSN) - a monotonically increasing integer.
LSN 1: TXN_BEGIN (txn 1)
LSN 2: INSERT key="alice" (txn 1)
LSN 3: INSERT key="bob" (txn 1)
LSN 4: TXN_BEGIN (txn 2)
LSN 5: DELETE key="charlie" (txn 2)
LSN 6: TXN_COMMIT (txn 1)
LSN 7: TXN_ABORT (txn 2)
LSN serves multiple purposes:
- Ordering: Total order of all operations
- Recovery point: "Replay from LSN 42"
- Page tracking: Each data page stores the LSN of the last modification
The prev_lsn Chain
Each record stores prev_lsn - the previous LSN for the same transaction:
LSN 1: TXN_BEGIN (txn=1, prev_lsn=0)
LSN 2: INSERT (txn=1, prev_lsn=1) ←─┐
LSN 3: INSERT (txn=2, prev_lsn=0) │
LSN 4: UPDATE (txn=1, prev_lsn=2) ───┘
LSN 5: TXN_COMMIT (txn=1, prev_lsn=4)
This creates a backward chain for each transaction:
Transaction 1: LSN 5 → LSN 4 → LSN 2 → LSN 1
Transaction 2: LSN 3 (just one operation)
Why? Rollback. To abort transaction 1, follow the chain backward, undoing each operation.
Write Flow
Application
│
▼
┌─────────────────┐
│ Transaction │
│ INSERT(k,v) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ WAL Manager │
│ │
│ 1. Assign LSN │
│ 2. Build record│
│ 3. Append to │
│ frame buffer│
└────────┬────────┘
│
┌────────────┴────────────┐
│ │
▼ ▼
Frame buffer full? COMMIT requested?
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Flush frame │ │ Flush frame │
│ to disk │ │ + fsync │
└─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ Now safe to │
│ return to │
│ application │
└─────────────┘
The key insight: we buffer multiple records in memory, only hitting disk when:
- The frame buffer is full (4KB of records accumulated)
- A transaction commits (durability guarantee)
Commit Protocol
When you call COMMIT:
1. Write TXN_COMMIT record to WAL buffer
2. Flush current frame to disk
3. Call fsync() - WAIT for disk acknowledgment
4. Return "committed" to application
After step 3, even if power fails immediately, the commit record is on disk. Recovery will see the commit and know the transaction succeeded.
If we crash before step 3? The commit record might be lost. Recovery won't see a commit for that transaction, so it will be rolled back. No partial commits ever escape to the application.
Recovery: Redo and Undo
On startup after a crash, we need to:
- Find the end of valid log - Scan frames, verify checksums, find last valid record
- Redo committed transactions - Replay all operations from committed transactions
- Undo uncommitted transactions - Roll back any transaction without a commit record
WAL Contents
┌───────────────────────────────────────────┐
│ LSN 1: TXN_BEGIN (txn 1) │
│ LSN 2: INSERT x=1 (txn 1) │
│ LSN 3: TXN_BEGIN (txn 2) │
│ LSN 4: INSERT y=2 (txn 2) │
│ LSN 5: TXN_COMMIT (txn 1) ← committed │
│ LSN 6: INSERT z=3 (txn 2) │
│ ─── CRASH HERE ─── │
└───────────────────────────────────────────┘
Recovery:
• Txn 1: Has COMMIT → REDO (x=1 is permanent)
• Txn 2: No COMMIT → UNDO (y=2 and z=3 are rolled back)
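A minimal recovery sketch in Python (this toy version "undoes" uncommitted work by simply never applying it; a real recovery pass redoes, then undoes with compensation records):

```python
def recover(log):
    """Replay a WAL: apply writes from committed transactions, drop the rest."""
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    state = {}
    for rec in log:
        if rec[0] == "INSERT" and rec[1] in committed:
            _, _, key, val = rec
            state[key] = val
    return state

log = [
    ("BEGIN", 1), ("INSERT", 1, "x", 1),
    ("BEGIN", 2), ("INSERT", 2, "y", 2),
    ("COMMIT", 1),
    ("INSERT", 2, "z", 3),
    # --- crash here: txn 2 never committed ---
]
print(recover(log))  # {'x': 1}
```

Only x=1 survives: txn 1 has a commit record, txn 2 does not.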
Why Frames?
Why not just append individual records?
Reason 1: Atomicity
A 4KB write is more likely to be atomic than arbitrary-sized writes. Many storage systems guarantee 512-byte or 4KB atomic writes. By aligning to page boundaries, we reduce torn write risk.
Reason 2: Batching
Each fsync is expensive (~1-10ms). By batching records into frames, we amortize the sync cost across many operations.
Reason 3: Checksums
Each frame has a checksum. During recovery, we can detect partial/corrupted frames and stop replay at the right point.
The Checksum
We use CRC32 to detect corruption:
Frame on disk:
┌──────────────────────────────────────────┐
│ Header: checksum = 0xABCD1234 │
├──────────────────────────────────────────┤
│ Data: [record1][record2][record3] │
└──────────────────────────────────────────┘
On read:
1. Read frame
2. Compute CRC32(data)
3. Compare with stored checksum
4. If mismatch → corruption detected, stop replay
This catches:
- Torn writes (partial frame written)
- Bit rot (storage degradation)
- Wrong WAL file (UUID also checked)
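The check itself is a few lines. A Python sketch of a frame with a leading CRC32 (a simplified layout, not the actual frame header):

```python
import struct
import zlib

def build_frame(records: bytes) -> bytes:
    """Frame = [crc32 of data, 4 bytes little-endian] + data."""
    return struct.pack("<I", zlib.crc32(records)) + records

def verify_frame(frame: bytes) -> bool:
    (stored,) = struct.unpack_from("<I", frame)
    return stored == zlib.crc32(frame[4:])

frame = build_frame(b"[record1][record2][record3]")
print(verify_frame(frame))  # True

# Flip one byte of the data - CRC32 detects any single-byte error
corrupted = frame[:4] + b"X" + frame[5:]
print(verify_frame(corrupted))  # False
```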
The UUID Link
The WAL header stores the database file's UUID:
Database file header: WAL header:
┌──────────────────┐ ┌──────────────────┐
│ uuid: ABC123... │ ←──→ │ uuid: ABC123... │
└──────────────────┘ └──────────────────┘
This prevents accidentally using database A's WAL with database B. The UUID is generated randomly when the database is created.
Record Types
pub const WalRecordType = enum(u8) {
// Transaction control
txn_begin = 0x01,
txn_commit = 0x02,
txn_abort = 0x03,
// Data modifications
insert = 0x10,
update = 0x11,
delete = 0x12,
// Page-level operations
page_write = 0x20,
// Checkpointing
checkpoint_begin = 0x30,
checkpoint_end = 0x31,
// Savepoints
savepoint = 0x40,
savepoint_rollback = 0x41,
// Compensation (for undo during recovery)
clr = 0x50,
};
WalManager API
// Append a record, get back its LSN
const lsn = try wal.appendRecord(.insert, txn_id, prev_lsn, payload);
// Force all buffered records to disk
try wal.sync();
// Iterate records for recovery
var iter = wal.iterate(start_lsn);
while (try iter.next(&buf)) |record| {
// Process record
}
// Update checkpoint position
try wal.setCheckpointLsn(lsn);
Summary
| Concept | Purpose |
|---|---|
| Write-ahead rule | Log before data = crash safety |
| Sequential append | Simpler crash states than random writes |
| fsync | Durability barrier to physical storage |
| LSN | Universal ordering of all operations |
| prev_lsn chain | Enables transaction rollback |
| Frames | Atomic-ish writes + batching |
| Checksums | Detect corruption and torn writes |
| UUID | Match WAL to correct database file |
The WAL transforms the problem from "make random writes atomic" (very hard) to "make sequential append reliable" (much easier).
Transaction Manager
What is a Transaction?
A transaction is a logical unit of work - a group of operations that should either all succeed or all fail together.
Without transactions: With transactions:
1. Debit Alice $100 ✓ BEGIN
2. ─── CRASH ─── 1. Debit Alice $100
3. Credit Bob $100 ✗ 2. Credit Bob $100
COMMIT
Result: Alice lost $100,
Bob got nothing Result: Either both happen,
or neither happens
The ACID Properties
Transactions guarantee four properties:
| Property | Meaning | How We Achieve It |
|---|---|---|
| Atomicity | All or nothing | WAL + rollback |
| Consistency | Valid state to valid state | Application logic |
| Isolation | Transactions don't interfere | Timestamps + MVCC |
| Durability | Committed = permanent | WAL + fsync |
The Core Data Structures
Transaction (User-Facing Handle)
This is what the application sees:
pub const Transaction = struct {
id: u64, // Unique identifier (1, 2, 3, ...)
state: TxnState, // active, committed, or aborted
mode: TxnMode, // read_only or read_write
isolation: IsolationLevel, // snapshot, read_committed, serializable
start_ts: u64, // When transaction started (for visibility)
commit_ts: u64, // When committed (0 until commit)
};
TxnEntry (Internal State)
The manager keeps more detailed state internally:
const TxnEntry = struct {
txn: Transaction, // The user-facing handle
last_lsn: u64, // LSN of most recent operation
begin_lsn: u64, // LSN of TXN_BEGIN record
savepoints: ArrayList, // Stack of savepoints
undo_log: ArrayList, // Operations to reverse on abort
};
TxnManager (The Coordinator)
pub const TxnManager = struct {
allocator: Allocator,
wal: *WalManager, // For durability
active_txns: HashMap(u64, TxnEntry), // All running transactions
next_txn_id: u64, // Counter for IDs
current_ts: u64, // Global timestamp clock
committed_count: u64, // Statistics
aborted_count: u64,
mutex: Mutex, // Thread safety
};
Transaction Lifecycle
1. BEGIN
When you start a transaction:
Application TxnManager WAL
│ │ │
│ begin(read_write, snapshot) │ │
├─────────────────────────────►│ │
│ │ │
│ │ 1. Lock mutex │
│ │ 2. Assign txn_id = 1 │
│ │ 3. Assign start_ts = 1 │
│ │ 4. appendRecord(TXN_BEGIN) │
│ ├──────────────────────────────►│
│ │ lsn = 1 │
│ │◄──────────────────────────────┤
│ │ │
│ │ 5. Create TxnEntry │
│ │ last_lsn = 1 │
│ │ 6. Store in active_txns │
│ │ 7. Unlock mutex │
│ │ │
│ Transaction { id=1, ... } │ │
│◄─────────────────────────────┤ │
The start_ts is crucial for isolation - it determines what data this transaction can "see".
2. Operations
Each modification goes through logOperation:
Application TxnManager WAL
│ │ │
│ logOperation(INSERT, data) │ │
├─────────────────────────────►│ │
│ │ │
│ │ 1. Check txn.canWrite() │
│ │ 2. Get TxnEntry │
│ │ 3. appendRecord(INSERT, │
│ │ txn_id=1, │
│ │ prev_lsn=1, ◄── last_lsn
│ │ payload=data) │
│ ├──────────────────────────────►│
│ │ lsn = 2 │
│ │◄──────────────────────────────┤
│ │ │
│ │ 4. Update last_lsn = 2 │
│ │ │
│ lsn = 2 │ │
│◄─────────────────────────────┤ │
Notice how prev_lsn points to the previous operation. This builds a backward chain.
3. COMMIT
Commit makes everything permanent:
Application TxnManager WAL Disk
│ │ │ │
│ commit(&txn) │ │ │
├─────────────────────────────►│ │ │
│ │ │ │
│ │ 1. Check txn.state == active │ │
│ │ 2. appendRecord(TXN_COMMIT, │ │
│ │ prev_lsn=last_lsn) │ │
│ ├──────────────────────────────►│ │
│ │ │ │
│ │ 3. wal.sync() │ │
│ ├──────────────────────────────►│ fsync() │
│ │ ├─────────────►│
│ │ │ durable! │
│ │ │◄─────────────┤
│ │◄──────────────────────────────┤ │
│ │ │ │
│ │ 4. Assign commit_ts │ │
│ │ 5. Set txn.state = committed │ │
│ │ 6. Remove from active_txns │ │
│ │ 7. Increment committed_count │ │
│ │ │ │
│ OK │ │ │
│◄─────────────────────────────┤ │ │
The critical point: We only return success to the application AFTER fsync() completes. This is the durability guarantee.
4. ABORT
Abort discards everything:
Application TxnManager WAL
│ │ │
│ abort(&txn) │ │
├─────────────────────────────►│ │
│ │ │
│ │ 1. appendRecord(TXN_ABORT) │
│ ├──────────────────────────────►│
│ │ │
│ │ 2. wal.sync() │
│ │ │
│ │ 3. Set txn.state = aborted │
│ │ 4. Clean up TxnEntry │
│ │ 5. Remove from active_txns │
│ │ │
│ OK │ │
│◄─────────────────────────────┤ │
We log TXN_ABORT so that during crash recovery, we know this transaction was intentionally aborted.
The prev_lsn Chain
Every record points back to the previous record from the same transaction. This creates a linked list through the WAL:
┌─────────────────────────────────────────────────────────────────┐
│ WAL │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LSN 1: TXN_BEGIN (txn=1, prev_lsn=0) ◄─────────────┐ │
│ │ │
│ LSN 2: INSERT (txn=1, prev_lsn=1) ──────────────┘ │
│ ▲ │
│ │ │
│ LSN 3: TXN_BEGIN (txn=2, prev_lsn=0) ◄───────┐ │
│ │ │
│ LSN 4: UPDATE (txn=1, prev_lsn=2) ────────┼───┐ │
│ ▲ │ │ │
│ │ │ │ │
│ LSN 5: DELETE (txn=2, prev_lsn=3) ────────┘ │ │
│ │ │
│ LSN 6: TXN_COMMIT (txn=1, prev_lsn=4) ─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Transaction 1's chain: 6 → 4 → 2 → 1
Transaction 2's chain: 5 → 3
Why is this useful?
- Rollback: To abort transaction 1, follow 6->4->2->1, undoing each operation
- Recovery: After crash, find uncommitted transactions by following chains
- No scanning: Don't need to scan entire WAL to find a transaction's operations
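Following a chain is a simple backward walk. Modeling the WAL above as a dict in Python (illustrative, not the on-disk format):

```python
# lsn -> (txn_id, prev_lsn, op); prev_lsn == 0 marks the start of a chain
wal = {
    1: (1, 0, "TXN_BEGIN"),
    2: (1, 1, "INSERT"),
    3: (2, 0, "TXN_BEGIN"),
    4: (1, 2, "UPDATE"),
    5: (2, 3, "DELETE"),
    6: (1, 4, "TXN_COMMIT"),
}

def chain(last_lsn):
    """Follow prev_lsn backward from a transaction's most recent record."""
    lsns = []
    lsn = last_lsn
    while lsn != 0:
        lsns.append(lsn)
        lsn = wal[lsn][1]
    return lsns

print(chain(6))  # [6, 4, 2, 1] - transaction 1, newest to oldest
print(chain(5))  # [5, 3]       - transaction 2
```

To roll back a transaction, you would undo each operation in exactly this order.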
Savepoints: Partial Rollback
Sometimes you want to undo part of a transaction, not all of it:
BEGIN;
INSERT user Alice; -- LSN 2
SAVEPOINT before_orders; -- Remember this point
INSERT order #1; -- LSN 4
INSERT order #2; -- LSN 5
-- Oops, orders were wrong
ROLLBACK TO before_orders; -- Undo LSN 4, 5
INSERT order #3; -- LSN 7
COMMIT;
How Savepoints Work
Savepoint Stack Undo Log
┌─────────────────┐ ┌─────────────┐
│ "before_orders" │ │ INSERT #1 │ ← position 1
tm.savepoint() ──► │ lsn: 3 │ │ INSERT #2 │ ← position 2
│ undo_pos: 0 │──────────► └─────────────┘
└─────────────────┘ ▲
│
truncate here
on rollback
When we rollback to savepoint:
- Find the savepoint by name
- Log SAVEPOINT_ROLLBACK to WAL
- Truncate undo_log to undo_position
- Remove savepoints created after this one
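These steps fit in a small class. A Python sketch of the savepoint mechanics (the WAL logging step is omitted; names are illustrative):

```python
class Txn:
    def __init__(self):
        self.undo_log = []      # operations to reverse on abort
        self.savepoints = {}    # name -> undo_log position at savepoint time

    def do(self, op):
        self.undo_log.append(op)

    def savepoint(self, name):
        self.savepoints[name] = len(self.undo_log)

    def rollback_to(self, name):
        pos = self.savepoints[name]
        undone = self.undo_log[pos:]
        del self.undo_log[pos:]  # truncate the undo log to the saved position
        # drop savepoints created after this one
        self.savepoints = {n: p for n, p in self.savepoints.items() if p <= pos}
        return undone

t = Txn()
t.do("INSERT user Alice")
t.savepoint("before_orders")
t.do("INSERT order #1")
t.do("INSERT order #2")
print(t.rollback_to("before_orders"))  # ['INSERT order #1', 'INSERT order #2']
print(t.undo_log)                      # ['INSERT user Alice']
```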
Timestamps and Isolation
Every transaction gets two timestamps:
start_ts: Assigned at BEGIN - determines what data is visible
commit_ts: Assigned at COMMIT - marks when changes become visible to others
Snapshot Isolation Example
Timeline:
─────────────────────────────────────────────────────────────────►
│ │ │ │ │
ts=1 ts=2 ts=3 ts=4 ts=5
│ │ │ │ │
Txn A Txn A Txn B Txn A Txn B
BEGIN INSERT BEGIN COMMIT reads
start_ts=1 x=100 start_ts=3 commit_ts=4 x
Question: What does Txn B see when it reads x?
With snapshot isolation: Txn B started at ts=3, before Txn A committed at ts=4. So Txn B does NOT see x=100. It sees whatever x was before Txn A.
This is why we track start_ts and commit_ts - they determine visibility.
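The visibility rule reduces to one comparison. A Python sketch (commit_ts == 0 meaning "not yet committed" follows the Transaction struct above):

```python
def visible(version_commit_ts, reader_start_ts):
    """A version is visible iff it committed at or before the reader's start."""
    return version_commit_ts != 0 and version_commit_ts <= reader_start_ts

# Txn A wrote x=100 and committed at ts=4; Txn B began at ts=3.
print(visible(4, 3))  # False - B does not see x=100
print(visible(4, 5))  # True  - a reader that begins at ts=5 does
print(visible(0, 9))  # False - uncommitted versions are never visible
```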
Read-Only Transactions
Read-only transactions are special:
var txn = try tm.begin(.read_only, .snapshot);
// This fails:
const result = tm.logOperation(&txn, .insert, "data");
// Error: TxnError.ReadOnly
Why have read-only transactions?
- No WAL writes - faster, no disk I/O for reads
- No locks needed - can always proceed
- Never abort due to conflicts - just reads a snapshot
- Helps garbage collection - we know this txn won't modify old versions
Thread Safety
The TxnManager uses a mutex to protect shared state:
pub fn begin(self: *Self, ...) TxnError!Transaction {
self.mutex.lock(); // ← Only one thread at a time
defer self.mutex.unlock(); // ← Released when function exits
// Safe to modify next_txn_id, active_txns, etc.
...
}
This is a simple approach. The mutex is held briefly (microseconds), and the WAL I/O dominates latency anyway.
Statistics
The manager tracks statistics for monitoring:
const stats = tm.getStats();
stats.active_count // Currently running transactions
stats.committed_count // Total successful commits
stats.aborted_count // Total aborts
stats.oldest_active_id // Oldest running transaction
stats.current_ts // Current timestamp
oldest_active_id is important for garbage collection - we can't clean up any data that this transaction might still need to see.
API Summary
var tm = TxnManager.init(allocator, &wal);
// Start a transaction
var txn = try tm.begin(.read_write, .snapshot);
// Log operations
_ = try tm.logOperation(&txn, .insert, payload);
_ = try tm.logOperation(&txn, .update, payload);
// Create savepoint
try tm.savepoint(&txn, "before_danger");
// More operations
_ = try tm.logOperation(&txn, .delete, payload);
// Rollback to savepoint if needed
try tm.rollbackToSavepoint(&txn, "before_danger");
// Commit or abort
try tm.commit(&txn); // or tm.abort(&txn)
Integration
┌─────────────────────────────────────────────────────────────────┐
│ Application │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TxnManager │
│ • Manages transaction lifecycle │
│ • Maintains prev_lsn chains │
│ • Tracks active transactions │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ WalManager │
│ • Durability through logging │
│ • fsync on commit │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ BufferPool + BTree │
│ • Actual data storage │
│ • Page management │
│ • (Will check txn visibility for MVCC) │
└─────────────────────────────────────────────────────────────────┘
The Transaction Manager is the coordinator - it doesn't store data itself, but ensures that all operations follow ACID rules by orchestrating the WAL, tracking state, and enforcing invariants.
Checkpointing
The Problem
Without checkpointing, two bad things happen:
- WAL grows forever - Every transaction appends records, file gets huge
- Recovery takes forever - After crash, must replay entire WAL from the beginning
After 1 year of operation:
WAL: [millions of records from day 1 ... to today]
◄──────────────────────────────────────────►
Recovery replays ALL of this
Could take hours!
The Solution
A checkpoint says: "Everything up to this point is safely on disk in the main database file. We don't need the old WAL records anymore."
Before checkpoint:
┌─────────────────────────────────────────────────────────┐
│ Buffer Pool (RAM) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Page 5 │ │ Page 12 │ │ Page 3 │ │ Page 8 │ │
│ │ DIRTY │ │ DIRTY │ │ clean │ │ DIRTY │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────┘
│
│ Changes only in RAM!
│ If power fails, they're lost
▼
┌─────────────────────────────────────────────────────────┐
│ Database File (Disk) │
│ [old versions of pages] │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ WAL (Disk) │
│ [record][record][record][record][record][record]... │
│ ◄──────────────────────────────────────────────────► │
│ Recovery needs ALL of these │
└─────────────────────────────────────────────────────────┘
After checkpoint:
┌─────────────────────────────────────────────────────────┐
│ Buffer Pool (RAM) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Page 5 │ │ Page 12 │ │ Page 3 │ │ Page 8 │ │
│ │ clean │ │ clean │ │ clean │ │ clean │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────┘
│ │ │
│ │ FLUSHED! │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────┐
│ Database File (Disk) │
│ [current versions of pages] │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ WAL (Disk) │
│ [old records...] │ CHECKPOINT │ [new records...] │
│ ▲ │
│ checkpoint_lsn │
│ │ │
│ Recovery starts HERE ─────► │
│ (old records ignored) │
└─────────────────────────────────────────────────────────┘
How It Works Step by Step
Step 1: Write CHECKPOINT_BEGIN to WAL
WAL: [...existing records...][CHECKPOINT_BEGIN, lsn=1000]
This marks the start of the checkpoint. If we crash during checkpointing, recovery sees this and knows a checkpoint was in progress.
Step 2: Sync WAL
fsync(wal_file)
Ensures CHECKPOINT_BEGIN is on disk before we start flushing pages.
Step 3: Flush Dirty Pages
Scan the buffer pool and write dirty pages to the database file:
for each frame in buffer_pool:
if frame.dirty:
write frame.data to database_file at page offset
frame.dirty = false
Passive mode: Skip pages that are currently pinned (in use by transactions). This avoids blocking active work.
Full mode: Wait for latches if needed. Guarantees all dirty pages are flushed.
Step 4: Sync Database File
fsync(database_file)
Critical! This ensures all the page writes are actually on disk, not sitting in OS buffers.
Step 5: Write CHECKPOINT_END to WAL
WAL: [...][CHECKPOINT_BEGIN][CHECKPOINT_END, lsn=1002]
This marks successful completion. The payload includes stats (pages flushed).
Step 6: Update checkpoint_lsn
wal_header.checkpoint_lsn = 1002 // LSN of CHECKPOINT_END
fsync(wal_file)
This is the key marker for recovery. It says "everything before LSN 1002 is safely on disk."
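The six steps can be condensed into one hedged Python sketch over plain in-memory structures (a list standing in for the WAL, dicts for pages). The fsync points are noted as comments since no real files are involved; all names are illustrative.

```python
def checkpoint(wal_log, frames, disk_pages):
    """Illustrative checkpoint sequence. Returns (checkpoint_lsn, pages_flushed)."""
    wal_log.append("CHECKPOINT_BEGIN")        # step 1
    # step 2: fsync(wal_file)
    flushed = 0
    for frame in frames:                      # step 3: flush dirty pages
        if frame["dirty"]:
            disk_pages[frame["page_id"]] = frame["data"]
            frame["dirty"] = False
            flushed += 1
    # step 4: fsync(database_file)
    wal_log.append("CHECKPOINT_END")          # step 5
    checkpoint_lsn = len(wal_log) - 1         # step 6: record in WAL header
    return checkpoint_lsn, flushed
```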
The Checkpoint Modes
Passive Mode
"Checkpoint what you can without disrupting active work"
for each frame:
if dirty AND not pinned AND can get latch immediately:
flush it
else:
skip it (maybe next time)
Good for: Background checkpointing during normal operation. Doesn't block transactions.
Bad: May leave some dirty pages unflushed.
Full Mode
"Checkpoint everything, wait if necessary"
for each frame:
if dirty:
wait for latch (spin until available)
flush it
Good for: Complete checkpoint before shutdown, or when WAL is getting too large.
Bad: May briefly block transactions waiting for latches.
Truncate Mode
"Full checkpoint, then reset WAL to beginning"
1. Do full checkpoint
2. Truncate WAL file to just the header
3. Reset frame_count to 0
Good for: Reclaiming disk space when WAL has grown large.
Requires: No active readers (they might be reading old WAL frames).
Why CHECKPOINT_BEGIN and CHECKPOINT_END?
Consider what happens if we crash during checkpointing:
Crash scenario 1: After BEGIN, before any flushes
─────────────────────────────────────────────────
WAL: [...records...][CHECKPOINT_BEGIN]
▲
crash
Recovery: Sees incomplete checkpoint (BEGIN but no END).
Ignores it, replays from previous checkpoint_lsn.
Safe!
Crash scenario 2: After some flushes, before END
─────────────────────────────────────────────────
WAL: [...records...][CHECKPOINT_BEGIN]
▲
crash
Database file: Has SOME pages flushed, but not all
Recovery: Sees incomplete checkpoint.
Some pages are updated, some aren't.
Replays WAL from previous checkpoint_lsn.
WAL replay overwrites pages with correct data.
Safe!
Crash scenario 3: After END
───────────────────────────
WAL: [...records...][CHECKPOINT_BEGIN][CHECKPOINT_END]
▲
crash
Recovery: Sees complete checkpoint.
Starts replay from checkpoint_lsn.
Fast recovery!
Statistics
Each checkpoint returns stats:
CheckpointStats{
pages_flushed: 42, // How many dirty pages written
duration_ns: 15_000_000, // 15ms
checkpoint_lsn: 1002, // New checkpoint position
wal_truncated: false, // Whether WAL was reset
}
Useful for monitoring:
- If pages_flushed is always high, consider checkpointing more often
- If duration_ns is high, the disk might be slow
- Track checkpoint_lsn growth over time
When to Checkpoint
// Check if WAL has grown too large (max_wal_frames = 1000)
if (checkpointer.shouldCheckpoint(1000)) {
    _ = try checkpointer.checkpoint(.passive);
}
Common strategies:
- Size-based: When WAL exceeds N frames
- Time-based: Every N minutes
- Transaction-based: After N commits
- On shutdown: Always do full checkpoint before closing
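The first three strategies can be combined into a single trigger predicate. A minimal Python sketch, with all thresholds and names hypothetical:

```python
def should_checkpoint(wal_frames, seconds_since_last, commits_since_last,
                      max_frames=1000, max_seconds=300, max_commits=10_000):
    """Checkpoint if ANY trigger fires: size-based, time-based,
    or transaction-based."""
    return (wal_frames >= max_frames
            or seconds_since_last >= max_seconds
            or commits_since_last >= max_commits)
```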
Integration
┌─────────────────────────────────────────────────────────────────┐
│ Checkpointer │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │BufferPool│ │PageManager│ │WalManager│ │
│ │ │ │ │ │ │ │
│ │• frames │ │• writePage│ │• append │ │
│ │• dirty │ │• sync │ │• sync │ │
│ │• latches │ │ │ │• setLsn │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────┘
The Checkpointer coordinates between:
- BufferPool: Knows which pages are dirty
- PageManager: Writes pages to database file
- WalManager: Records checkpoint markers, updates checkpoint_lsn
API
var checkpointer = Checkpointer.init(allocator, &bp, &pm, &wal);
// Passive checkpoint (background)
const passive_stats = try checkpointer.checkpoint(.passive);
// Full checkpoint (before shutdown)
const full_stats = try checkpointer.checkpoint(.full);
// Truncate checkpoint (reclaim WAL space)
const truncate_stats = try checkpointer.checkpoint(.truncate);
// Check if checkpoint needed
if (checkpointer.shouldCheckpoint(1000)) {
_ = try checkpointer.checkpoint(.passive);
}
// Get last checkpoint stats
if (checkpointer.getLastStats()) |stats| {
    std.debug.print("Flushed {} pages in {}ms\n", .{ stats.pages_flushed, stats.duration_ns / 1_000_000 });
}
Summary
| Concept | Purpose |
|---|---|
| Dirty page flush | Write modified pages to database file |
| checkpoint_lsn | Recovery starting point |
| CHECKPOINT_BEGIN/END | Crash safety during checkpoint |
| Passive mode | Non-blocking background checkpoint |
| Full mode | Complete checkpoint, may block briefly |
| Truncate mode | Reclaim WAL disk space |
Checkpointing is the bridge between WAL-based durability (fast commits) and bounded recovery time (fast restart). Without it, durability requires replaying the entire history on every startup.
Crash Recovery
The Problem
Databases crash. Power fails, operating systems panic, processes get killed. When this happens, the database must be able to restart and return to a consistent state without losing committed data.
Consider what might be in-flight when a crash occurs:
Scenario: Crash during normal operation
Buffer Pool (RAM):
Page 5: dirty, modified by committed txn
Page 8: dirty, modified by uncommitted txn
Page 12: dirty, partially written by in-progress txn
WAL (Disk):
[committed txn records][uncommitted txn records][partial frame?]
Database File (Disk):
[old versions of pages - some stale, some current]
After restart, we need to:
- Redo committed work - If a transaction committed (COMMIT record in WAL) but its changes weren't flushed to the database file, redo them
- Ignore uncommitted work - If a transaction never committed, pretend it never happened
The ARIES Philosophy
Our recovery is inspired by ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), a recovery algorithm developed at IBM. The key insight:
Write-Ahead Logging + Redo at Recovery = Durability
The WAL contains everything we need. After a crash:
- Read the WAL from the last checkpoint
- Determine which transactions committed
- Redo only committed operations
Two-Phase Recovery
Phase 1: Analysis
Scan the WAL to understand what happened:
WAL Contents:
┌────────────────────────────────────────────────────┐
│ LSN 1: TXN_BEGIN (txn=1) → txn 1 started │
│ LSN 2: INSERT (txn=1) → txn 1 modified │
│ LSN 3: TXN_BEGIN (txn=2) → txn 2 started │
│ LSN 4: INSERT (txn=2) → txn 2 modified │
│ LSN 5: TXN_COMMIT (txn=1) → txn 1 COMMITTED│
│ LSN 6: UPDATE (txn=2) → txn 2 modified │
│ ─── CRASH ─── │
└────────────────────────────────────────────────────┘
Analysis Result:
Transaction 1: COMMITTED (has TXN_COMMIT)
Transaction 2: IN_PROGRESS (no commit/abort)
During analysis, we build:
- Transaction table: State of each transaction (committed, aborted, in-progress)
- Redo list: All data operations that might need to be redone
Phase 2: Redo
Apply committed operations to the database:
Redo Process:
┌─────────────────────────────────────────────────────┐
│ For each operation in redo_list: │
│ 1. Check if transaction committed │
│ - If not: skip (discard uncommitted changes) │
│ 2. Apply the operation to database file │
│ 3. Continue to next operation │
└─────────────────────────────────────────────────────┘
Result:
LSN 2 (INSERT, txn=1): REDO (txn 1 committed)
LSN 4 (INSERT, txn=2): SKIP (txn 2 didn't commit)
LSN 6 (UPDATE, txn=2): SKIP (txn 2 didn't commit)
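Both phases fit in a short sketch over a simplified WAL where each record is a (lsn, type, txn_id, payload) tuple. This is illustrative Python, not LatticeDB's implementation:

```python
def recover(wal_records):
    # Phase 1: Analysis - determine each transaction's outcome.
    states = {}        # txn_id -> "in_progress" | "committed" | "aborted"
    redo_list = []
    for lsn, rtype, txn, payload in wal_records:
        if rtype == "begin":
            states[txn] = "in_progress"
        elif rtype == "commit":
            states[txn] = "committed"
        elif rtype == "abort":
            states[txn] = "aborted"
        else:  # insert / update / delete
            redo_list.append((lsn, txn, payload))
    # Phase 2: Redo - apply only operations from committed transactions.
    applied = [p for lsn, txn, p in redo_list if states.get(txn) == "committed"]
    return states, applied
```

Running this on the WAL from the diagram yields exactly the analysis result shown: txn 1 committed (its insert is redone), txn 2 in progress (its operations are skipped).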
Why No Undo?
Traditional ARIES has three phases: Analysis, Redo, Undo. We skip Undo because of how we structure our system:
Our approach: Don't apply changes to data pages until commit
- During a transaction, changes are in the buffer pool (RAM)
- On commit, dirty pages are flushed
- On crash, uncommitted changes in RAM are simply lost
Traditional approach: Apply changes immediately, undo on abort
- Changes written to data pages as transaction runs
- If abort/crash, must read WAL backwards and undo each change
- More complex, but allows larger transactions (not limited by RAM)
For an embedded database like Lattice, the simpler "don't undo" approach works well.
The Checkpoint Starting Point
Recovery doesn't scan the entire WAL - only from the last checkpoint:
WAL Timeline:
─────────────────────────────────────────────────────────────────►
[old records] │ CHECKPOINT │ [records since checkpoint]
▲ ▲
│ │
checkpoint_lsn │
│
Recovery starts here
(everything before is safely on disk)
The checkpoint_lsn in the WAL header marks where recovery begins. The Checkpointer sets this after successfully flushing all dirty pages.
Transaction States
During recovery, each transaction can be in one of three states:
| State | Meaning | WAL Record | Action |
|---|---|---|---|
| committed | Completed successfully | TXN_COMMIT present | Redo operations |
| aborted | Explicitly rolled back | TXN_ABORT present | Ignore operations |
| in_progress | Crash during execution | No commit/abort | Ignore operations |
The distinction between aborted and in_progress is informational - both are handled the same way (ignore their changes).
Record Type Handling
Different WAL records are handled differently:
switch (record.record_type) {
.txn_begin => {
// Track new transaction as in_progress
},
.txn_commit => {
// Mark transaction as committed
},
.txn_abort => {
// Mark transaction as aborted
},
.insert, .update, .delete => {
// Data modification - save for redo phase
},
.page_write => {
// Physical page write - save for redo phase
},
.checkpoint_begin, .checkpoint_end => {
// Informational - no action needed
},
.savepoint, .savepoint_rollback => {
// Track but no special handling
},
.clr => {
// Compensation Log Record - for undo operations
},
}
Physical vs Logical Redo
Our recovery supports two types of operations:
Physical: page_write
Contains the complete page image:
Payload: [page_id: 4 bytes][page_data: 4096 bytes]
Redo: Write entire page directly to database file
Logical: insert, update, delete
Contains the operation parameters:
Payload: [key][value][metadata]
Redo: Re-execute the operation against the B+Tree
For simplicity, our implementation currently relies on page_write for physical durability, with logical operations tracked for statistics.
Corruption Detection
When recovery encounters a checksum mismatch, it must determine whether this is:
- Tail corruption - A torn write at the end of the WAL (safe to proceed)
- Mid-log corruption - Real corruption with valid data after it (unsafe)
Scan-Ahead Detection
On checksum mismatch, we scan ahead to check for valid frames:
Scenario 1: Tail Corruption (Torn Write)
─────────────────────────────────────────
Frame 0: checksum OK
Frame 1: checksum OK
Frame 2: checksum MISMATCH
Frame 3: [nothing valid]
Frame 4: [nothing valid]
→ No valid frames after corruption
→ This is a torn write at end of WAL
→ Safe to proceed with frames 0-1
Scenario 2: Mid-Log Corruption (Real Corruption)
────────────────────────────────────────────────
Frame 0: checksum OK
Frame 1: checksum MISMATCH ← corruption here
Frame 2: checksum OK ← but valid data after!
Frame 3: checksum OK
→ Valid frames exist after corruption
→ This is real data corruption
→ FAIL with MidLogCorruption error
Why This Matters
Mid-log corruption is dangerous because the corrupted frame might contain:
- A TXN_COMMIT record we can't see
- Data modifications needed for consistency
If we skip the corrupted frame and continue, we might:
- Treat a committed transaction as uncommitted (data loss)
- Apply partial transaction state (inconsistency)
By failing on mid-log corruption, we force manual intervention (restore from backup) rather than silently losing data.
Implementation
fn hasValidFramesAfter(wal: *WalManager, corrupted_frame: u64) bool {
    // Check up to 10 frames ahead
    const end = @min(corrupted_frame + 10, wal.frame_count);
    for (corrupted_frame + 1..end) |frame_num| {
        if (frameHasValidChecksum(wal, frame_num)) {
            return true; // Mid-log corruption!
        }
    }
    return false; // Tail corruption, safe to stop
}
Statistics
Recovery returns detailed statistics:
RecoveryStats{
start_lsn: 1000, // Where we started (checkpoint_lsn)
end_lsn: 1523, // Last valid record
records_scanned: 523, // Total records processed
transactions_found: 15, // Distinct transactions
transactions_committed: 12, // Successfully committed
transactions_aborted: 2, // Explicitly aborted
transactions_rolled_back: 1, // In-progress at crash
redo_operations: 156, // Operations redone
duration_ns: 45_000_000, // 45ms
stopped_at_corruption: true, // Hit tail corruption
corrupted_frame: 42, // Frame number with bad checksum
}
These are useful for:
- Monitoring recovery time
- Debugging transaction issues
- Capacity planning
- Detecting disk health issues (frequent tail corruption may indicate hardware problems)
Recovery Flow Diagram
┌─────────────────────────────────────────────────────────────────┐
│ Database Startup │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Open WAL File │
│ Read checkpoint_lsn from header │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ ANALYSIS PHASE │
│ │
│ for each record from checkpoint_lsn to end: │
│ - Track transaction states │
│ - Build redo list for data operations │
│ - Stop at corruption/end of valid log │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ REDO PHASE │
│ │
│ for each operation in redo_list: │
│ if transaction.state == committed: │
│ apply operation to database │
│ else: │
│ discard (transaction didn't commit) │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Sync Database File │
│ Return recovery statistics │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Database Ready for Use │
└─────────────────────────────────────────────────────────────────┘
API
// Simple recovery on startup
const stats = try recoverDatabase(allocator, &wal, &pm);
std.debug.print("Recovered {} transactions, redid {} operations in {}ms\n", .{
stats.transactions_committed,
stats.redo_operations,
stats.duration_ns / 1_000_000,
});
// Or use RecoveryManager directly for more control
var rm = RecoveryManager.init(allocator);
const stats = try rm.recover(&wal, &pm);
Integration with Startup
Typical database startup sequence:
1. Open database file (PageManager)
2. Open WAL file (WalManager)
3. Check if recovery needed (WAL has records past checkpoint?)
4. Run recovery
5. Checkpoint to clean slate
6. Ready for operations
Summary
| Concept | Purpose |
|---|---|
| Analysis phase | Determine transaction outcomes |
| Redo phase | Apply committed operations |
| checkpoint_lsn | Recovery starting point |
| Transaction states | committed, aborted, in_progress |
| Checksum verification | Detect corruption/partial writes |
| Tail corruption | Torn write at end, safe to tolerate |
| Mid-log corruption | Real corruption, fail to prevent data loss |
| Scan-ahead detection | Distinguish tail from mid-log corruption |
| No Undo | Uncommitted changes not written to pages |
Recovery transforms a potentially inconsistent crash state into a consistent database by leveraging the WAL as the authoritative record of committed transactions.
Graph Storage
The Property Graph Model
Lattice implements a Labeled Property Graph - the same model used by Neo4j, Amazon Neptune, and other graph databases. It consists of:
Node:
- id: unique identifier (u64)
- labels: set of strings (e.g., "Person", "Employee")
- properties: key-value pairs (e.g., name: "Alice", age: 30)
Edge:
- source: node id
- target: node id
- type: string (e.g., "KNOWS", "WORKS_AT")
- properties: key-value pairs (e.g., since: 2020)
This model is expressive enough to represent almost any domain while remaining simple to query and traverse.
The Storage Challenge
How do you store a graph in a B+Tree (which is fundamentally a key-value store)?
The key insight: decompose the graph into multiple B+Trees, each optimized for a specific access pattern.
┌─────────────────────────────────────────────────────────────────┐
│ Graph Storage Layer │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ SYMBOLS │ │ NODES │ │ EDGES │ │
│ │ B+Tree │ │ B+Tree │ │ B+Tree │ │
│ │ │ │ │ │ │ │
│ │ string → id │ │ node_id → │ │ composite │ │
│ │ id → string │ │ NodeData │ │ key → data │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ LABEL_INDEX │ │
│ │ B+Tree │ │
│ │ │ │
│ │ (label,node) │ │
│ │ → ∅ │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
String Interning (Symbol Table)
Graphs have lots of repeated strings: label names, property keys, edge types. Storing "Person" thousands of times wastes space. Instead, we intern strings.
String Interning:
"Person" ──────► 1000
"Employee" ─────► 1001
"name" ──────► 1002
"KNOWS" ──────► 1003
Now instead of storing "Person" (6 bytes),
we store 1000 (2 bytes)
The Symbol Table uses two B+Trees:
SYMBOLS (forward): SYMBOLS_REVERSE:
"Person" → 1000 1000 → "Person"
"Employee" → 1001 1001 → "Employee"
"name" → 1002 1002 → "name"
"KNOWS" → 1003 1003 → "KNOWS"
Symbol IDs are u16 (0-65535):
- 0: Reserved (null)
- 1-999: Reserved for system use
- 1000-65535: User-defined symbols
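The interning logic itself is simple. A Python sketch with two dicts standing in for the forward and reverse B+Trees (illustrative only):

```python
class SymbolTable:
    USER_BASE = 1000  # IDs 0-999 are reserved

    def __init__(self):
        self.forward = {}   # string -> symbol id
        self.reverse = {}   # symbol id -> string

    def intern(self, s):
        """Return the existing ID, or assign the next free one."""
        if s in self.forward:
            return self.forward[s]
        sid = self.USER_BASE + len(self.forward)
        self.forward[s] = sid
        self.reverse[sid] = s
        return sid
```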
API
var symbols = SymbolTable.init(allocator, &forward_tree, &reverse_tree);
// Intern a string (creates if not exists, returns existing if present)
const person_id = try symbols.intern("Person"); // 1000
// Lookup without creating
const id = try symbols.lookup("Person"); // 1000
// Looking up an unknown string fails:
_ = symbols.lookup("Unknown"); // error: SymbolError.NotFound
// Resolve ID back to string
const name = try symbols.resolve(person_id); // "Person"
defer symbols.freeString(name); // Must free allocated string
Node Storage
Nodes are stored in a B+Tree with simple u64 keys:
NODES B+Tree:
Key: node_id (u64, little-endian)
Value: NodeData (serialized)
NodeData Format
┌────────────────────────────────────────────────────────────┐
│ num_labels: u16 │
│ labels: [symbol_id: u16] × num_labels │
│ num_properties: u16 │
│ properties: [PropertyEntry] × num_properties │
└────────────────────────────────────────────────────────────┘
PropertyEntry:
┌────────────────────────────────────────────────────────────┐
│ key_id: u16 (interned string) │
│ value_type: u8 │
│ value_data: variable │
└────────────────────────────────────────────────────────────┘
Value Types:
0 = Null
1 = Bool (1 byte: 0 or 1)
2 = Int64 (8 bytes, little-endian)
3 = Float64 (8 bytes, IEEE 754)
4 = String (u32 length + bytes)
5 = Bytes (u32 length + bytes)
6 = List (TODO)
7 = Map (TODO)
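The PropertyEntry layout above maps directly onto a binary encoding. A hedged Python sketch using the struct module (illustrative; not LatticeDB's serializer):

```python
import struct

def encode_property(key_id, value):
    """PropertyEntry: key_id (u16 LE) + value_type (u8) + value_data."""
    if value is None:
        return struct.pack("<HB", key_id, 0)               # Null
    if isinstance(value, bool):                            # check before int!
        return struct.pack("<HBB", key_id, 1, int(value))  # Bool
    if isinstance(value, int):
        return struct.pack("<HBq", key_id, 2, value)       # Int64
    if isinstance(value, float):
        return struct.pack("<HBd", key_id, 3, value)       # Float64
    data = value.encode() if isinstance(value, str) else value
    vtype = 4 if isinstance(value, str) else 5             # String / Bytes
    return struct.pack("<HBI", key_id, vtype, len(data)) + data
```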
API
var store = NodeStore.init(allocator, &nodes_tree);
// Create a node
const labels = [_]SymbolId{ person_id, employee_id };
const properties = [_]Property{
.{ .key_id = name_id, .value = .{ .string_val = "Alice" } },
.{ .key_id = age_id, .value = .{ .int_val = 30 } },
};
const node_id = try store.create(&labels, &properties);
// Get a node
var node = try store.get(node_id);
defer node.deinit(allocator);
// Check existence
if (store.exists(node_id)) { ... }
// Update a node
try store.update(node_id, &new_labels, &new_properties);
// Delete a node
try store.delete(node_id);
Edge Storage
Edges are more complex because we need efficient traversal in both directions:
- "Find all people Alice knows" (outgoing)
- "Find all people who know Alice" (incoming)
The Double-Write Pattern
For edge (Alice)-[:KNOWS]->(Bob), we store two entries:
Entry 1 (Outgoing from Alice):
Key: (Alice, OUTGOING, KNOWS, Bob)
Value: edge properties
Entry 2 (Incoming to Bob):
Key: (Bob, INCOMING, KNOWS, Alice)
Value: edge properties (same data)
This doubles write cost but enables O(1) traversal in either direction.
Key Format
Edge Key (19 bytes, big-endian for lexicographic ordering):
┌──────────────────────────────────────────────────────────┐
│ source_id: u64 │ direction: u8 │ type_id: u16 │ target_id: u64 │
│ (8 bytes) │ (1 byte) │ (2 bytes) │ (8 bytes) │
└──────────────────────────────────────────────────────────┘
Direction:
0 = Outgoing (source → target)
1 = Incoming (target ← source)
Big-endian encoding ensures keys sort correctly for range scans:
- All edges from node X are contiguous
- Within that, all outgoing edges are together
- Within that, edges of same type are together
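The encoding and its ordering guarantee can be demonstrated in a few lines of Python (a sketch of the key layout described above, not LatticeDB's code):

```python
import struct

def edge_key(source_id, direction, type_id, target_id):
    """19-byte big-endian composite key (u64 + u8 + u16 + u64).
    Byte-wise comparison then matches numeric comparison, so all edges
    from a node, then per direction, then per type, are contiguous."""
    return struct.pack(">QBHQ", source_id, direction, type_id, target_id)
```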
Why This Key Order?
The key (source, direction, type, target) is optimized for common queries:
Query: "All outgoing edges from Alice"
Scan: (Alice, 0, *, *)
Keys are contiguous!
Query: "All KNOWS edges from Alice"
Scan: (Alice, 0, KNOWS, *)
Even more specific prefix!
Query: "Does Alice know Bob?"
Point lookup: (Alice, 0, KNOWS, Bob)
Direct key access!
API
var store = EdgeStore.init(allocator, &edges_tree);
// Create an edge
const properties = [_]Property{
.{ .key_id = since_id, .value = .{ .int_val = 2020 } },
};
try store.create(alice_id, bob_id, knows_id, &properties);
// Get an edge
var edge = try store.get(alice_id, bob_id, knows_id);
defer edge.deinit(allocator);
// Check existence
if (store.exists(alice_id, bob_id, knows_id)) { ... }
// Delete an edge (removes both outgoing and incoming entries)
try store.delete(alice_id, bob_id, knows_id);
// Iterate all outgoing edges from a node
var iter = try store.getOutgoing(alice_id);
defer iter.deinit();
while (try iter.next()) |edge| {
defer edge.deinit(allocator);
// Process edge...
}
// Iterate incoming edges
var incoming = try store.getIncoming(bob_id);
defer incoming.deinit();
// Filter by edge type
var knows_edges = try store.getOutgoingByType(alice_id, knows_id);
defer knows_edges.deinit();
// Count edges without allocating
const out_count = try store.countOutgoing(alice_id);
const in_count = try store.countIncoming(bob_id);
Label Index
For queries like MATCH (n:Person), we need to find all nodes with a given label efficiently. The Label Index provides this.
LABEL_INDEX B+Tree:
Key: (label_id: u16, node_id: u64) - big-endian
Value: empty (existence only)
How It Works
When creating node Alice with labels [Person, Employee]:
Insert: (Person, Alice) → ∅
Insert: (Employee, Alice) → ∅
Query "all Person nodes":
Range scan: (Person, 0) to (Person, MAX)
Returns: Alice, Bob, Carol, ...
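The range scan can be sketched over a sorted list of (label_id, node_id) pairs standing in for the B+Tree (illustrative Python):

```python
from bisect import bisect_left

def nodes_by_label(sorted_keys, label_id):
    """Range scan from (label_id, 0): collect node ids until the
    label changes."""
    i = bisect_left(sorted_keys, (label_id, 0))
    out = []
    while i < len(sorted_keys) and sorted_keys[i][0] == label_id:
        out.append(sorted_keys[i][1])
        i += 1
    return out
```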
API
var index = LabelIndex.init(allocator, &label_tree);
// Add labels when creating a node
try index.addLabels(&[_]SymbolId{ person_id, employee_id }, node_id);
// Check if node has a label
if (index.hasLabel(person_id, node_id)) { ... }
// Remove a label from a node
try index.remove(person_id, node_id);
// Get all nodes with a label (allocates result slice)
const person_nodes = try index.getNodesByLabel(person_id);
defer allocator.free(person_nodes);
for (person_nodes) |node_id| {
// Process each Person node...
}
// Lazy iteration (memory-efficient for large result sets)
var iter = try index.iterNodesByLabel(person_id);
defer iter.deinit();
while (try iter.next()) |node_id| {
// Process one node at a time...
}
// Count nodes without allocating
const person_count = try index.countNodesByLabel(person_id);
Putting It Together
A complete graph operation involves multiple B+Trees:
Creating node Alice:Person with name="Alice":
1. Symbol Table:
- intern("Person") → 1000
- intern("name") → 1001
2. Node Store:
- Allocate node_id = 1
- Serialize: [1 label: 1000][1 prop: 1001="Alice"]
- Insert: 1 → serialized_data
3. Label Index:
- Insert: (1000, 1) → ∅
Creating edge (Alice)-[:KNOWS]->(Bob):
1. Symbol Table:
- intern("KNOWS") → 1002
2. Edge Store:
- Insert: (1, OUT, 1002, 2) → properties
- Insert: (2, IN, 1002, 1) → properties
Performance Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| Create node | O(log n) | One B+Tree insert + label index inserts |
| Get node | O(log n) | Single B+Tree lookup |
| Delete node | O(log n) | One B+Tree delete |
| Create edge | O(log n) | Two B+Tree inserts (double-write) |
| Get edge | O(log n) | Single B+Tree lookup |
| Delete edge | O(log n) | Two B+Tree deletes (double-delete) |
| Check label | O(log n) | Single B+Tree lookup |
| Remove label | O(log n) | Single B+Tree delete |
| All nodes with label | O(log n + k) | Range scan, k = result count |
| All edges from node | O(log n + k) | Range scan, k = edge count |
Where n = total items in the respective B+Tree.
Current Limitations
- No MVCC: All operations see latest data (no snapshot isolation yet)
- Property updates overwrite: No partial property updates
- No underflow handling: Deleted entries leave wasted space until compaction
Future Enhancements
- Property indexes: Secondary indexes on property values
- Edge type index: Fast lookup by edge type across all nodes
- MVCC integration: Transaction-aware reads and writes
- B+Tree underflow handling: Merge/redistribute pages after deletions
Summary
| Component | B+Tree Key | Purpose |
|---|---|---|
| Symbol Table (forward) | string | String → ID mapping |
| Symbol Table (reverse) | symbol_id | ID → String mapping |
| Node Store | node_id | Node data storage |
| Edge Store | (src, dir, type, tgt) | Edge data + traversal |
| Label Index | (label_id, node_id) | Label-based queries |
The graph storage layer transforms a B+Tree (a key-value store) into a full property graph database through careful key design and the double-write pattern for edges.
Vector Search
Overview
Lattice provides approximate nearest neighbor (ANN) search using the HNSW (Hierarchical Navigable Small World) algorithm. This enables semantic search over high-dimensional embedding vectors—a core requirement for AI/RAG applications.
┌─────────────────────────────────────────────────────────────────┐
│ Vector Search Stack │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ EmbeddingClient │ Optional: HTTP calls to embedding APIs │
│ │ (optional) │ Ollama, OpenAI, etc. │
│ └────────┬────────┘ │
│ │ []f32 │
│ ▼ │
│ ┌─────────────────┐ │
│ │ HnswIndex │ Multi-layer graph for ANN search │
│ │ │ O(log n) search complexity │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ VectorStorage │ Page-based vector data storage │
│ │ │ Integrates with BufferPool │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Embedding Generation
Lattice uses a bring-your-own-embeddings approach by default. You can:
- Generate embeddings externally and pass them directly
- Use the optional EmbeddingClient to call HTTP embedding APIs
Option 1: Bring Your Own Embeddings
// You generate embeddings however you want
const embedding: []const f32 = your_embedding_function("Hello world");
// Insert directly into HNSW index
try hnsw_index.insert(doc_id, embedding);
Option 2: HTTP Embedding Client
The EmbeddingClient calls external HTTP endpoints to generate embeddings. It's disabled by default—you must explicitly create and configure it.
const lattice = @import("lattice");
// Ollama (local) - simplest config
var client = lattice.EmbeddingClient.init(allocator, .{
.endpoint = "http://localhost:11434/api/embeddings",
});
defer client.deinit();
// Generate embedding
const vector = try client.embed("Hello, world!");
defer allocator.free(vector);
Configuration Options
const config = lattice.EmbeddingConfig{
// Required: HTTP endpoint URL
.endpoint = "http://localhost:11434/api/embeddings",
// Model name (default: "nomic-embed-text")
.model = "nomic-embed-text",
// API format: .ollama (default) or .openai
.api_format = .ollama,
// Optional API key for authenticated endpoints
.api_key = null,
// Request timeout in milliseconds (default: 30000)
.timeout_ms = 30_000,
};
Supported API Formats
| Format | Request | Response |
|---|---|---|
| .ollama | {"model": "...", "prompt": "..."} | {"embedding": [...]} |
| .openai | {"model": "...", "input": "..."} | {"data": [{"embedding": [...]}]} |
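To make the two wire formats concrete, here is a small Python sketch that builds and parses bodies matching the table above (illustrative only; the real client is the Zig EmbeddingClient):

```python
import json

def build_request(api_format, model, text):
    """Request body for the .ollama or .openai format."""
    if api_format == "ollama":
        return json.dumps({"model": model, "prompt": text})
    return json.dumps({"model": model, "input": text})

def parse_response(api_format, body):
    """Extract the embedding vector from a response body."""
    data = json.loads(body)
    if api_format == "ollama":
        return data["embedding"]
    return data["data"][0]["embedding"]
```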
Examples
Ollama (local)
var client = EmbeddingClient.init(allocator, .{
.endpoint = "http://localhost:11434/api/embeddings",
.model = "nomic-embed-text",
});
OpenAI
var client = EmbeddingClient.init(allocator, .{
.endpoint = "https://api.openai.com/v1/embeddings",
.model = "text-embedding-3-small",
.api_format = .openai,
.api_key = "sk-...",
});
Local OpenAI-compatible (llama.cpp, vLLM, etc.)
var client = EmbeddingClient.init(allocator, .{
.endpoint = "http://localhost:8080/v1/embeddings",
.model = "local-model",
.api_format = .openai,
});
HNSW Index
HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for approximate nearest neighbor search. It achieves O(log n) search complexity with high recall.
How HNSW Works
Layer 2: [A] ─────────────────────── [D]
│ │
Layer 1: [A] ──── [B] ──── [C] ──── [D] ──── [E]
│ │ │ │ │
Layer 0: [A]─[F]─[B]─[G]─[C]─[H]─[D]─[I]─[E]─[J]
(dense connections at layer 0)
- Upper layers have exponentially fewer nodes (sparse)
- Search starts at top layer, greedily descends
- Layer 0 uses beam search for final candidates
- Each node maintains bidirectional connections to neighbors
Configuration
const config = lattice.HnswConfig{
.m = 16, // Connections per node (layers 1+)
.m_max0 = 32, // Connections at layer 0
.ef_construction = 200, // Search width during insert
.ef_search = 64, // Search width during query
    .ml = 0.36067977,       // Level multiplier (1/ln(M) = 1/ln(16))
.metric = .cosine, // Distance metric
};
| Parameter | Default | Description |
|---|---|---|
| m | 16 | Max connections per node (higher = better recall, more memory) |
| m_max0 | 32 | Max connections at layer 0 (typically 2×m) |
| ef_construction | 200 | Beam width during insert (higher = better graph quality) |
| ef_search | 64 | Beam width during search (higher = better recall, slower) |
| metric | .cosine | Distance metric: .euclidean, .cosine, .inner_product |
API Usage
const lattice = @import("lattice");
// Initialize vector storage
var vector_storage = try lattice.VectorStorage.init(allocator, buffer_pool, 384);
defer vector_storage.deinit();
// Initialize HNSW index
var hnsw = lattice.HnswIndex.init(allocator, buffer_pool, &vector_storage, .{
.metric = .cosine,
.ef_search = 100,
});
defer hnsw.deinit();
// Insert vectors
try hnsw.insert(1, embedding1);
try hnsw.insert(2, embedding2);
try hnsw.insert(3, embedding3);
// Search for 10 nearest neighbors
const results = try hnsw.search(query_vector, 10, null);
defer hnsw.freeResults(results);
for (results) |result| {
std.debug.print("ID: {}, Distance: {d:.4}\n", .{ result.id, result.distance });
}
Distance Metrics
| Metric | Formula | Best For |
|---|---|---|
| .euclidean | √Σ(aᵢ - bᵢ)² | General purpose |
| .cosine | 1 - (a·b)/(‖a‖‖b‖) | Text embeddings (normalized) |
| .inner_product | -Σ(aᵢ × bᵢ) | When vectors are pre-normalized |
For text embeddings from most models (OpenAI, Cohere, etc.), use .cosine.
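For reference, the three metrics can be sketched in Python (illustrative only; Lattice's actual implementations are SIMD-optimized Zig):

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared component differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for identical directions, up to 2 for opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def inner_product_distance(a, b):
    # Negated dot product so "smaller distance = closer" still holds
    return -sum(x * y for x, y in zip(a, b))

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 1.0
```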
Vector Storage
Vectors are stored in pages managed by the buffer pool, separate from the HNSW graph structure.
┌────────────────────────────────────────────────────────────────┐
│ Vector Page (4096 bytes) │
├────────────────────────────────────────────────────────────────┤
│ Header (24 bytes) │
│ page_type: u8 = 0x04 (vector_data) │
│ dimensions: u16 │
│ vector_count: u16 │
│ next_page: u32 (overflow chain) │
├────────────────────────────────────────────────────────────────┤
│ Slot 0: [vector_id: u64][f32 × dimensions] │
│ Slot 1: [vector_id: u64][f32 × dimensions] │
│ ... │
└────────────────────────────────────────────────────────────────┘
How many vectors fit in a page depends on the dimension count:
- 384-dim (1536 bytes): 2 vectors/page
- 768-dim (3072 bytes): 1 vector/page
- 1536-dim (6144 bytes): spans multiple pages
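A quick capacity calculation, using the layout above (4096-byte page, 24-byte header, 8-byte vector_id per slot):

```python
PAGE_SIZE = 4096
HEADER_SIZE = 24    # vector page header, per the layout above
ID_SIZE = 8         # u64 vector_id stored with each slot

def vectors_per_page(dimensions):
    slot_size = ID_SIZE + 4 * dimensions  # each component is an f32 (4 bytes)
    return (PAGE_SIZE - HEADER_SIZE) // slot_size

for dim in (384, 768, 1536):
    print(dim, vectors_per_page(dim))  # 384 -> 2, 768 -> 1, 1536 -> 0 (spans pages)
```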
Performance Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| Insert | O(log n) | Builds graph connections |
| Search (k-NN) | O(log n) | Independent of k for small k |
| Delete | O(m × log n) | Removes connections |
| Memory | O(n × m × layers) | ~1KB per vector at m=16 |
Tuning Guidelines
For higher recall:
- Increase ef_search (e.g., 100-200)
- Increase m (e.g., 32-64)
- Use more ef_construction (e.g., 400)
For faster search:
- Decrease ef_search (e.g., 32)
- Accept lower recall
Typical configuration for 1M vectors:
.m = 16,
.m_max0 = 32,
.ef_construction = 200,
.ef_search = 64, // Adjust based on recall/speed tradeoff
Complete Example
const std = @import("std");
const lattice = @import("lattice");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
// Setup storage (abbreviated)
var buffer_pool = try lattice.BufferPool.init(allocator, page_manager, 1000);
defer buffer_pool.deinit();
// Initialize embedding client (optional)
var embedder = lattice.EmbeddingClient.init(allocator, .{
.endpoint = "http://localhost:11434/api/embeddings",
});
defer embedder.deinit();
// Initialize vector storage and HNSW index
var vector_storage = try lattice.VectorStorage.init(allocator, &buffer_pool, 384);
defer vector_storage.deinit();
var hnsw = lattice.HnswIndex.init(allocator, &buffer_pool, &vector_storage, .{
.metric = .cosine,
});
defer hnsw.deinit();
// Index some documents
const docs = [_][]const u8{
"Zig is a systems programming language",
"Rust focuses on memory safety",
"Go emphasizes simplicity and concurrency",
};
for (docs, 0..) |doc, i| {
const embedding = try embedder.embed(doc);
defer allocator.free(embedding);
try hnsw.insert(i, embedding);
}
// Search
const query_embedding = try embedder.embed("memory safe language");
defer allocator.free(query_embedding);
const results = try hnsw.search(query_embedding, 3, null);
defer hnsw.freeResults(results);
for (results) |result| {
std.debug.print("Doc {}: distance {d:.4}\n", .{ result.id, result.distance });
}
}
SIMD-Optimized Distance Functions
Distance calculations use Zig's @Vector for portable SIMD acceleration. This provides 4-8x speedup over scalar implementations for typical embedding dimensions (384, 768, 1536).
// All distance functions are SIMD-optimized
const dist = lattice.vector.distance.euclideanDistance(vec_a, vec_b);
const cos_dist = lattice.vector.distance.cosineDistance(vec_a, vec_b);
const ip_dist = lattice.vector.distance.innerProductDistance(vec_a, vec_b);
// Additional utilities
const dot = lattice.vector.distance.dotProduct(vec_a, vec_b);
const magnitude = lattice.vector.distance.norm(vec);
lattice.vector.distance.normalize(vec); // in-place
Implementation details:
- Processes 8 floats at a time (AVX-256 compatible)
- Handles non-aligned remainder with scalar fallback
- Works on ARM NEON (128-bit, processed as 2×4)
Current Limitations
- In-memory graph: Node connections stored in memory (persisted to pages on write)
- No incremental persistence: Full graph rebuild on restart
- Single-threaded: No concurrent insert/search yet
Future Enhancements
- Persistent graph: Load HNSW structure from disk
- Concurrent access: Reader-writer locks for parallel search
- Product quantization: Compress vectors for larger datasets
- Filtered search: Combine with label/property predicates
Full-Text Search (BM25)
This document explains how Lattice's full-text search works, from text input to ranked results.
Overview
Full-text search allows you to find documents containing specific words or phrases, ranked by relevance. Lattice implements the BM25 (Best Match 25) ranking algorithm, the same algorithm used by Elasticsearch and other production search engines.
The FTS system consists of five components:
┌─────────────┐ ┌────────────┐ ┌──────────────┐ ┌─────────┐
│ Tokenizer │ ──► │ Dictionary │ ──► │ Posting List │ ──► │ Scorer │
│ │ │ (B+Tree) │ │ (Pages) │ │ (BM25) │
└─────────────┘ └────────────┘ └──────────────┘ └─────────┘
│ │ │ │
▼ ▼ ▼ ▼
"hello world" "hello" → 1 doc_ids with ranked
→ ["hello", "world" → 2 term frequencies results
"world"]
1. Tokenizer
The tokenizer breaks text into searchable tokens.
What It Does
Given input text:
"The quick brown fox jumps over the lazy dog"
The tokenizer produces:
["quick", "brown", "fox", "jumps", "lazy", "dog"]
Notice "The", "the", and "over" are missing—they're stop words.
How It Works
Input: "Hello, World! This is a TEST."
│
▼
┌─────────────────────────────────────┐
│ 1. Scan for word boundaries │
│ Split on non-alphanumeric chars │
│ "Hello" "World" "This" "is" "a" "TEST"
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. Apply length filter │
│ min_length=2, max_length=64 │
│ "Hello" "World" "This" "is" "TEST"
│ ("a" removed - too short) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Normalize to lowercase │
│ "hello" "world" "this" "is" "test"
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Filter stop words │
│ "hello" "world" "test" │
│ ("this", "is" removed) │
└─────────────────────────────────────┘
│
▼
Output: [Token{text="hello", position=0},
Token{text="world", position=1},
Token{text="test", position=2}]
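The four steps above can be sketched in Python (illustrative; the stop-word list is abbreviated, and the real tokenizer also supports stemming and accent removal):

```python
import re

# Abbreviated English stop-word list for illustration
STOP_WORDS = {"the", "and", "is", "a", "to", "of", "in", "that", "it", "this", "over"}

def tokenize(text, min_len=2, max_len=64):
    tokens = []
    position = 0
    for word in re.split(r"[^0-9A-Za-z]+", text):  # 1. split on non-alphanumeric
        if not (min_len <= len(word) <= max_len):  # 2. length filter
            continue
        word = word.lower()                        # 3. normalize to lowercase
        if word in STOP_WORDS:                     # 4. filter stop words
            continue
        tokens.append((word, position))
        position += 1
    return tokens

print(tokenize("Hello, World! This is a TEST."))
# [('hello', 0), ('world', 1), ('test', 2)]
```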
Stop Words
Stop words are common words that add little search value. Lattice supports stop word filtering for 12 languages:
| Language | Example Stop Words |
|---|---|
| English | the, and, is, a, to, of, in, that, it |
| German | der, die, das, und, ist, in, zu, den |
| French | le, la, les, de, et, en, que, un |
| Spanish | el, la, los, de, en, que, y, es |
| Italian | il, la, lo, i, di, e, che, un |
| Portuguese | o, a, os, de, e, que, em, um |
| Dutch | de, het, een, en, van, in, is |
| Swedish | och, i, att, det, en, som, av |
| Norwegian | og, i, det, er, en, at, til |
| Danish | og, i, at, det, en, er, til |
| Finnish | ja, on, ei, ole, oli, se, han |
| Russian | и, в, не, на, что, он, с, как |
Set the language in your tokenizer config:
var tokenizer = Tokenizer.init(allocator, text, .{
.remove_stop_words = true,
.language = .german, // Use German stop words
});
Configuration
pub const TokenizerConfig = struct {
min_token_length: u8 = 2, // Skip tokens shorter than this
max_token_length: u8 = 64, // Skip tokens longer than this
lowercase: bool = true, // Convert to lowercase
remove_accents: bool = true, // Remove diacritics (planned)
remove_stop_words: bool = true, // Filter common words
use_stemming: bool = false, // Apply Porter stemmer
language: Language = .english, // Language for stop words
};
Porter Stemmer
When use_stemming is enabled, tokens are reduced to their root forms:
Input Token → Stemmed
─────────────────────────
"running" → "run"
"connected" → "connect"
"optimization" → "optim"
"databases" → "databas"
Why stem? Stemming improves recall by matching morphological variants:
- Query "run" matches documents containing "running", "runs", "runner"
- Query "connect" matches "connected", "connecting", "connection"
Note: Currently only English stemming is supported. For other languages, words are returned unchanged when stemming is enabled. Future versions may add Snowball stemmers for additional languages.
Use normalizeAndStem() or normalizeAndStemWithLanguage() for manual stemming:
var buf: [64]u8 = undefined;
// English stemming
const en_stemmed = tokenizer_mod.normalizeAndStemWithLanguage("RUNNING", &buf, true, .english);
// en_stemmed = "run"
// German (no stemmer available, returns lowercased)
const de_stemmed = tokenizer_mod.normalizeAndStemWithLanguage("RUNNING", &buf, true, .german);
// de_stemmed = "running"
2. Dictionary
The dictionary maps tokens to integer IDs and tracks statistics.
What It Does
Token → TokenId DocFreq PostingPage
─────────────────────────────────────────────────
"hello" → 1 5 42
"world" → 2 3 43
"database" → 3 12 44
"search" → 4 8 45
Why TokenIds?
Storing the full token string everywhere would waste space. Instead:
- Dictionary stores: "database" → TokenId 3
- Posting lists store: TokenId 3 (4 bytes instead of 8)
- Reduced I/O and memory usage
Storage Format
The dictionary uses a B+Tree with:
- Key: Token string (e.g., "hello")
- Value: DictionaryEntry (24 bytes)
DictionaryEntry (24 bytes, extern struct):
┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ total_freq │ token_id │ doc_freq │ posting_page │ _padding │
│ (8 bytes) │ (4 bytes) │ (4 bytes) │ (4 bytes) │ (4 bytes) │
└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘
Fields are ordered largest-first to minimize internal padding (u64 requires 8-byte alignment).
| Field | Type | Description |
|---|---|---|
| total_freq | u64 | Total occurrences across all documents |
| token_id | u32 | Unique identifier (1 to ~4 billion) |
| doc_freq | u32 | Number of documents containing this token |
| posting_page | PageId | First page of the posting list |
| _padding | u32 | Explicit padding for 8-byte struct alignment |
Operations
getOrCreate(token) - Get existing TokenId or create new one:
Input: "hello"
│
▼
┌─────────────────────────────────────┐
│ B+Tree lookup for "hello" │
│ │
│ Found? → Return existing token_id │
│ │
│ Not found? → │
│ 1. Assign next_token_id (e.g., 5) │
│ 2. Insert into B+Tree │
│ 3. Return 5 │
└─────────────────────────────────────┘
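In Python terms, with a dict standing in for the on-disk B+Tree (field names follow the DictionaryEntry layout above):

```python
class Dictionary:
    def __init__(self):
        self.entries = {}        # token -> entry (stand-in for the B+Tree)
        self.next_token_id = 1   # token ids start at 1

    def get_or_create(self, token):
        entry = self.entries.get(token)
        if entry is None:
            # Not found: assign the next id and insert a fresh entry
            entry = {"token_id": self.next_token_id, "doc_freq": 0,
                     "total_freq": 0, "posting_page": 0}
            self.entries[token] = entry
            self.next_token_id += 1
        return entry["token_id"]

d = Dictionary()
print(d.get_or_create("hello"))  # 1 (new)
print(d.get_or_create("world"))  # 2 (new)
print(d.get_or_create("hello"))  # 1 (existing)
```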
3. Posting Lists
A posting list stores which documents contain a specific token.
What It Does
For the token "database" (TokenId 3):
Posting List for "database":
┌────────┬───────────┐
│ DocId │ TermFreq │
├────────┼───────────┤
│ 15 │ 3 │ ← Document 15 contains "database" 3 times
│ 42 │ 1 │ ← Document 42 contains "database" 1 time
│ 89 │ 7 │ ← Document 89 contains "database" 7 times
│ 156 │ 2 │
│ 203 │ 1 │
│ ... │ ... │
└────────┴───────────┘
Page Layout
Each posting list page is 4096 bytes:
Posting Page (4096 bytes):
┌─────────────────────────────────────────────────────────────┐
│ PageHeader (8 bytes) │
│ page_type = FTS_POSTING │
├─────────────────────────────────────────────────────────────┤
│ PostingPageHeader (20 bytes) │
│ token_id: u32 - Which token this list is for │
│ num_entries: u32 - Number of postings in this page │
│ next_page: u32 - Overflow page (0 = none) │
│ num_skip_ptrs: u16 - Number of skip pointers │
│ flags: u16 - 0x01 = has positions │
│ data_start: u32 - Byte offset where data begins │
├─────────────────────────────────────────────────────────────┤
│ Skip Pointers (16 bytes each, optional) │
│ [doc_id, byte_offset, entry_count] × N │
├─────────────────────────────────────────────────────────────┤
│ Posting Data (variable length, varint encoded) │
│ [doc_id: varint] [term_freq: varint] │
│ [doc_id: varint] [term_freq: varint] │
│ ... │
└─────────────────────────────────────────────────────────────┘
PostingPageHeader Explained
The PostingPageHeader is metadata at the start of each posting page:
| Field | Bytes | Purpose |
|---|---|---|
| token_id | 4 | Identifies which token this posting list belongs to |
| num_entries | 4 | How many (doc_id, term_freq) pairs are in this page |
| next_page | 4 | PageId of overflow page if list doesn't fit (0 = no overflow) |
| num_skip_ptrs | 2 | Count of skip pointers for fast seeking |
| flags | 2 | Bit flags: 0x01 = positions stored for phrase queries |
| data_start | 4 | Byte offset where posting entries begin (after skip pointers) |
Varint Encoding
Document IDs and frequencies are stored using variable-length integers (varints) for compression:
Value Binary Varint Bytes Savings
─────────────────────────────────────────────────────────────
127 0111 1111 [0x7F] 1 byte (vs 8)
128 1000 0000 [0x80, 0x01] 2 bytes
16383 0011 1111 1111 1111 [0xFF, 0x7F] 2 bytes
1,000,000 ... [0xC0, 0x84, 0x3D] 3 bytes
Encoding algorithm:
while value >= 0x80:
output byte = (value & 0x7F) | 0x80 // Low 7 bits + continuation flag
value = value >> 7
output byte = value // Final byte (no continuation)
Example: Encoding 300 (binary: 1 0010 1100)
300 = 0b100101100
Step 1: 300 >= 128, so:
output[0] = (300 & 0x7F) | 0x80 = 0x2C | 0x80 = 0xAC
300 >> 7 = 2
Step 2: 2 < 128, so:
output[1] = 2 = 0x02
Result: [0xAC, 0x02] (2 bytes instead of 8)
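The encode/decode pair, sketched in Python for illustration:

```python
def encode_varint(value):
    # Low 7 bits per byte; MSB set means "more bytes follow"
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte, no continuation bit
    return bytes(out)

def decode_varint(data, offset=0):
    # Reassemble 7-bit groups, least significant first
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, offset
        shift += 7

print(encode_varint(300).hex())  # ac02
print(decode_varint(b"\xac\x02"))  # (300, 2)
```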
Overflow Pages
When a posting list exceeds one page, it chains to overflow pages:
Page 42 (first page) Page 87 (overflow)
┌────────────────────┐ ┌────────────────────┐
│ Header │ │ Header │
│ next_page = 87 ──┼───────►│ next_page = 0 │
│ num_entries = 200│ │ num_entries = 50 │
├────────────────────┤ ├────────────────────┤
│ Entries 1-200 │ │ Entries 201-250 │
└────────────────────┘ └────────────────────┘
Skip Pointers
Skip pointers enable O(log n) seeking within posting lists, dramatically speeding up multi-term AND queries.
Structure:
SkipPointer (16 bytes):
┌────────────┬─────────────┬─────────────┐
│ doc_id │ byte_offset │ entry_count │
│ (8 bytes) │ (4 bytes) │ (4 bytes) │
└────────────┴─────────────┴─────────────┘
How they work:
Skip pointers are created every 128 entries (SKIP_INTERVAL). They record:
- doc_id: The document ID at that entry
- byte_offset: Where to jump in the posting data
- entry_count: Number of entries before this pointer
Posting list with 500 entries:
Skip Pointers:
[0] doc_id=1280, offset=512, count=128 ← entry 128
[1] doc_id=2560, offset=1024, count=256 ← entry 256
[2] doc_id=3840, offset=1536, count=384 ← entry 384
Seeking to doc_id 3000:
1. Binary search skip pointers: find [1] (doc_id=2560 < 3000)
2. Jump to offset 1024
3. Linear scan from entry 256 to find doc_id 3000
Result: Skipped 256 entries instead of scanning all 300
Multi-term intersection optimization:
For AND queries like "database optimization":
- Sort terms by doc_freq (smallest first)
- Iterate through smallest posting list
- For each doc_id, use skipTo() on the other lists
- Skip pointers let large lists jump ahead efficiently
Query: "the database" (AND)
"the": 10,000 documents
"database": 100 documents ← Driver (smallest)
Without skip pointers: 10,000 + 100 = 10,100 iterations
With skip pointers: 100 + ~100 seeks = ~200 operations
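A Python sketch of the seek operation (illustrative: a list of tuples stands in for the on-page varint data, and skip pointers record entry indices rather than byte offsets for simplicity):

```python
import bisect

SKIP_INTERVAL = 128

def seek(postings, skip_pointers, target_doc_id):
    """postings: sorted list of (doc_id, term_freq).
    skip_pointers: (doc_id, entry_index) sampled every SKIP_INTERVAL entries.
    Returns the index of the first posting with doc_id >= target."""
    start = 0
    # 1. Binary search skip pointers for the last one below the target
    ids = [p[0] for p in skip_pointers]
    i = bisect.bisect_left(ids, target_doc_id)
    if i > 0:
        start = skip_pointers[i - 1][1]  # 2. jump ahead
    # 3. Linear scan from the jump point
    while start < len(postings) and postings[start][0] < target_doc_id:
        start += 1
    return start

# 500 postings with doc_ids 10, 20, ..., 5000
postings = [(d, 1) for d in range(10, 5010, 10)]
skips = [(postings[i][0], i) for i in range(SKIP_INTERVAL, len(postings), SKIP_INTERVAL)]
idx = seek(postings, skips, 3000)
print(postings[idx])  # (3000, 1) — found after jumping past 256 entries
```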
4. Document Length Storage
BM25 scoring requires knowing each document's length. This is stored in a separate B+Tree:
DocId → Length (tokens)
────────────────────────
1 → 150
2 → 45
3 → 892
... → ...
Why store lengths?
BM25 penalizes long documents to prevent them from dominating results just because they contain more words. A 10,000-word document mentioning "database" once is less relevant than a 100-word document mentioning it once.
Statistics tracked:
- total_docs: Total documents indexed
- total_tokens: Sum of all document lengths
- avg_doc_length: Average tokens per document (for normalization)
5. BM25 Scoring
BM25 calculates a relevance score for each document.
The Formula
Score(D, Q) = Σ IDF(term) × TF_norm(term, D)
for each term in query Q
Where:
IDF (Inverse Document Frequency):
IDF(term) = log((N - df + 0.5) / (df + 0.5) + 1)
N = total number of documents
df = number of documents containing this term
Rare terms get higher IDF scores. If "quantum" appears in 5 of 10,000 documents, it's more significant than "the" appearing in 9,500.
TF_norm (Normalized Term Frequency):
TF_norm = (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × (dl / avgdl)))
tf = term frequency in this document
k1 = saturation parameter (default 1.2)
b = length normalization (default 0.75)
dl = document length
avgdl = average document length
Parameters
| Parameter | Default | Effect |
|---|---|---|
| k1 | 1.2 | Controls term frequency saturation. Higher = more weight to repeated terms |
| b | 0.75 | Length normalization. 0 = no normalization, 1 = full normalization |
Scoring Example
Corpus: 1000 documents, average length 200 tokens
Query: "database optimization"
Document 42:
- Length: 150 tokens
- "database" appears 3 times
- "optimization" appears 1 time
Term: "database"
- doc_freq = 50 (appears in 50 docs)
- IDF = log((1000 - 50 + 0.5) / (50 + 0.5) + 1) = log(19.82) ≈ 2.99
- tf = 3
- dl/avgdl = 150/200 = 0.75
- TF_norm = (3 × 2.2) / (3 + 1.2 × (1 - 0.75 + 0.75 × 0.75))
= 6.6 / (3 + 1.2 × 0.8125)
= 6.6 / 3.975
≈ 1.66
- Score for "database" = 2.99 × 1.66 ≈ 4.96
Term: "optimization"
- doc_freq = 10 (rarer term)
- IDF = log((1000 - 10 + 0.5) / (10 + 0.5) + 1) ≈ 4.56
- tf = 1
- TF_norm = (1 × 2.2) / (1 + 1.2 × 0.8125) ≈ 1.11
- Score for "optimization" = 4.56 × 1.11 ≈ 5.08
Total Score for Document 42 = 4.96 + 5.08 ≈ 10.04
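The same arithmetic as a Python function (illustrative; defaults k1=1.2 and b=0.75 as above — small differences from hand-rounded intermediate values are expected):

```python
import math

def bm25_term_score(tf, doc_freq, doc_len, total_docs, avg_doc_len, k1=1.2, b=0.75):
    # IDF: rarer terms score higher
    idf = math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    # TF normalization: saturates with tf, penalizes long documents
    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc_len / avg_doc_len)))
    return idf * tf_norm

# Document 42 (150 tokens) in a 1000-doc corpus with average length 200
s_db = bm25_term_score(tf=3, doc_freq=50, doc_len=150, total_docs=1000, avg_doc_len=200)
s_opt = bm25_term_score(tf=1, doc_freq=10, doc_len=150, total_docs=1000, avg_doc_len=200)
print(round(s_db, 2), round(s_opt, 2), round(s_db + s_opt, 2))  # 4.96 5.08 10.04
```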
6. Search Flow
Single-Term Search
Query: "database"
│
▼
┌─────────────────────────────────────┐
│ 1. Tokenize query │
│ → ["database"] │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. Dictionary lookup │
│ "database" → TokenId 3 │
│ doc_freq = 50 │
│ posting_page = 44 │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Iterate posting list (page 44) │
│ For each (doc_id, term_freq): │
│ - Get doc_length │
│ - Calculate BM25 score │
│ - Add to results │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Sort by score descending │
│ Return top K results │
└─────────────────────────────────────┘
Multi-Term Search (AND Semantics)
Query: "database optimization"
│
▼
┌─────────────────────────────────────┐
│ 1. Tokenize query │
│ → ["database", "optimization"] │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. For each term, iterate posting │
│ list and accumulate scores │
│ │
│ doc_scores[doc_id] += score │
│ doc_term_count[doc_id] += 1 │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Filter: keep only docs where │
│ term_count == num_query_terms │
│ (AND semantics) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Sort by accumulated score │
│ Return top K results │
└─────────────────────────────────────┘
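The accumulate-then-filter steps above can be sketched in Python (illustrative; the score callback stands in for the BM25 scorer):

```python
def search_and(posting_lists, score):
    """posting_lists: {term: {doc_id: term_freq}}.
    score: callback computing a per-(term, doc) relevance score."""
    doc_scores = {}
    doc_term_count = {}
    num_terms = len(posting_lists)
    # Accumulate scores and count how many query terms each doc matched
    for term, postings in posting_lists.items():
        for doc_id, tf in postings.items():
            doc_scores[doc_id] = doc_scores.get(doc_id, 0.0) + score(term, doc_id, tf)
            doc_term_count[doc_id] = doc_term_count.get(doc_id, 0) + 1
    # AND semantics: keep only docs that matched every query term
    hits = [(d, s) for d, s in doc_scores.items() if doc_term_count[d] == num_terms]
    return sorted(hits, key=lambda x: -x[1])

postings = {
    "database":     {15: 3, 42: 1, 89: 7},
    "optimization": {42: 2, 89: 1},
}
# Dummy scorer: term frequency as the score
print(search_and(postings, lambda term, doc, tf: float(tf)))
# [(89, 8.0), (42, 3.0)] — doc 15 is dropped (missing "optimization")
```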
OR Search
OR search returns documents matching any query term:
// Returns docs containing "mysql" OR "postgres" OR both
const results = try fts.searchOr("mysql postgres", 10);
// Or use explicit mode selection
const results2 = try fts.searchWithMode("mysql postgres", .@"or", 10);
Query: "mysql postgres" (OR mode)
│
▼
┌─────────────────────────────────────┐
│ 1. Tokenize query │
│ → ["mysql", "postgres"] │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. For each term, iterate posting │
│ list and accumulate scores │
│ │
│ doc_scores[doc_id] += score │
│ (no term count filtering) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Sort by accumulated score │
│ Docs with both terms rank higher │
└─────────────────────────────────────┘
NOT Search (Exclusions)
Prefix terms with - to exclude documents containing them:
// Find "database" docs that don't mention "mysql"
const results = try fts.searchWithMode("database -mysql", .@"and", 10);
// Multiple exclusions
const results2 = try fts.searchWithMode("database -mysql -oracle", .@"and", 10);
Query: "database -mysql"
│
▼
┌─────────────────────────────────────┐
│ 1. Parse query │
│ terms = ["database"] │
│ excluded = ["mysql"] │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. Search for positive terms │
│ → candidate documents │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Build exclusion set │
│ (all doc_ids containing "mysql") │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Filter candidates │
│ Remove docs in exclusion set │
└─────────────────────────────────────┘
Phrase Search
Phrase search finds documents where terms appear adjacent and in order:
// Enable position storage in config
var fts = FtsIndex.init(allocator, &bp, &dict_tree, &lengths_tree, .{
.store_positions = true,
});
// Search for exact phrase
const results = try fts.searchPhrase("quick brown fox", 10);
Query: "quick brown fox" (phrase)
Document 1: "The quick brown fox jumps" ← MATCHES (adjacent: pos 1,2,3)
Document 2: "A quick red brown fox" ← NO MATCH (not adjacent)
Document 3: "The brown quick fox" ← NO MATCH (wrong order)
How it works:
┌─────────────────────────────────────┐
│ 1. Get posting lists with positions │
│ │
│ "quick": doc1[pos=1], doc2[pos=1]│
│ "brown": doc1[pos=2], doc2[pos=3]│
│ "fox": doc1[pos=3], doc2[pos=4]│
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. For each candidate document: │
│ Check position adjacency │
│ │
│ doc1: quick@1, brown@2, fox@3 │
│ 1+1=2 ✓, 1+2=3 ✓ → MATCH │
│ │
│ doc2: quick@1, brown@3, fox@4 │
│ 1+1=2 ≠ 3 → NO MATCH │
└─────────────────────────────────────┘
Note: Phrase queries require store_positions = true in FtsConfig. Without positions, searchPhrase() falls back to AND semantics.
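The adjacency check can be sketched in Python (illustrative; each set holds one term's positions within a single document):

```python
def phrase_match(term_positions):
    """term_positions: one set of token positions per phrase term, in phrase order.
    True if some starting position has every following term at consecutive offsets."""
    for start in term_positions[0]:
        if all(start + offset in positions
               for offset, positions in enumerate(term_positions[1:], start=1)):
            return True
    return False

# Document 1: quick@1 brown@2 fox@3 -> adjacent and in order
print(phrase_match([{1}, {2}, {3}]))  # True
# Document 2: quick@1 brown@3 fox@4 -> gap between quick and brown
print(phrase_match([{1}, {3}, {4}]))  # False
```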
Quoted Phrase Syntax
You can also use quoted strings in regular search() and searchWithMode() calls to automatically detect phrase queries:
// These are equivalent:
const results1 = try fts.searchPhrase("quick brown", 10);
const results2 = try fts.searchWithMode("\"quick brown\"", .@"and", 10);
Combining phrases with terms and exclusions:
// Phrase + additional term (AND mode)
// Matches documents with "quick brown" phrase AND term "jumps"
const results = try fts.searchWithMode("\"quick brown\" jumps", .@"and", 10);
// Phrase + exclusion
// Matches documents with "quick brown" phrase but NOT containing "fox"
const results = try fts.searchWithMode("\"quick brown\" -fox", .@"and", 10);
// Multiple phrases
// Matches documents with both phrases
const results = try fts.searchWithMode("\"quick brown\" \"lazy dog\"", .@"and", 10);
// Phrase with OR mode
// Matches documents with "quick brown" phrase OR term "rabbit"
const results = try fts.searchWithMode("\"quick brown\" rabbit", .@"or", 10);
Single-word quotes:
Single words in quotes are treated as regular terms (since a one-word phrase is just a term):
// These are equivalent:
const results1 = try fts.search("database", 10);
const results2 = try fts.searchWithMode("\"database\"", .@"and", 10);
Fuzzy Search
Fuzzy search finds documents even when query terms contain typos. It uses Levenshtein (edit) distance to match terms within a configurable threshold.
const fuzzy = @import("fts/fuzzy.zig");
// Search with typo tolerance (max 2 edits)
const results = try fts.searchFuzzy("databse", .{
.max_distance = 2, // Allow up to 2 edits
.min_term_length = 4, // Only fuzzy-match terms >= 4 chars
}, 10);
defer fts.freeResults(results);
// Matches documents containing "database" (edit distance 1)
How it works:
Query: "databse" (typo)
↓
Scan dictionary for terms within edit distance 2:
- "database" (distance=1) ✓
- "datastore" (distance=3) ✗
↓
Search matching terms with distance penalty:
- Score("database") × penalty(1) = Score × 0.75
↓
Return ranked results
Distance penalty formula:
penalty = 1.0 - (distance / max_distance)²
Examples (max_distance = 2):
- distance 0: 1.00 (exact match)
- distance 1: 0.75 (25% penalty)
- distance 2: 0.00 (filtered out)
Levenshtein distance counts the minimum single-character edits needed:
- Insertion: "helo" → "hello" (distance 1)
- Deletion: "hello" → "helo" (distance 1)
- Substitution: "hello" → "hallo" (distance 1)
Note: Transpositions ("recieve" → "receive") count as 2 edits in standard Levenshtein.
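Both the edit distance and the penalty can be sketched in Python for illustration:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def penalty(distance, max_distance=2):
    # Quadratic falloff: exact matches keep full score
    return 1.0 - (distance / max_distance) ** 2

print(levenshtein("databse", "database"))  # 1
print(penalty(1))                          # 0.75
```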
Prefix/Wildcard Search
Prefix search finds documents containing terms that start with a given prefix. Use * as a suffix wildcard.
const prefix = @import("fts/prefix.zig");
// Search for terms starting with "optim" (matches "optimize", "optimization", "optimizer")
const results = try fts.searchWithPrefix("optim*", .{
.min_prefix_length = 2, // Minimum prefix length (prevents "a*" explosion)
.max_expansions = 50, // Maximum terms to expand
}, 10);
defer fts.freeResults(results);
How it works:
Query: "optim*"
↓
Calculate range bounds: ["optim", "optin")
↓
B+Tree range scan to find matching terms:
- "optimization" ✓
- "optimize" ✓
- "optimizer" ✓
↓
Search all matching terms (like OR query)
↓
Return ranked results
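Computing the range bounds is nearly a one-liner (illustrative; this sketch assumes the last prefix character is not at its maximum value):

```python
def prefix_bounds(prefix):
    """[lower, upper) bounds for a B+Tree range scan over terms with this prefix."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # "optim" -> "optin"
    return prefix, upper

lo, hi = prefix_bounds("optim")
print(lo, hi)  # optim optin
for term in ["optimal", "optimize", "option", "order"]:
    print(term, lo <= term < hi)  # True, True, False, False
```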
Combining prefix with regular terms:
// AND semantics: documents must contain "systems" AND a term starting with "data"
const results = try fts.searchWithPrefix("systems data*", .{}, 10);
// Matches documents with both "systems" and "database"/"datastore"/"data"
Constraints:
- Suffix wildcards only (optim*), not prefix wildcards (*tion)
- No middle wildcards (da*base)
- Minimum prefix length configurable (default 2 chars)
- Maximum expansions capped to prevent explosion
Search Highlighting
Search highlighting returns text snippets with matched terms marked, enabling UI display of search results with context.
const highlight = @import("fts/highlight.zig");
const text = "The database stores data efficiently for optimal performance.";
const query_terms = [_][]const u8{ "database", "data" };
const result = try highlight.highlight(
allocator,
text,
&query_terms,
.{ .use_stemming = false }, // TokenizerConfig
.{
.context_chars = 80, // Characters of context around matches
.max_snippets = 3, // Maximum snippets to return
.merge_distance = 40, // Merge snippets closer than this
.prefix_marker = "<em>", // Marker before matched terms
.suffix_marker = "</em>", // Marker after matched terms
.ellipsis = "...", // Added when text is truncated
},
);
defer highlight.freeResult(allocator, result);
// result.snippets[0].text = "The <em>database</em> stores <em>data</em> efficiently..."
// result.total_matches = 2
How it works:
Query terms: ["database", "data"]
Document: "The database stores data efficiently for optimal performance."
↓
Re-tokenize document with same config:
- "database" at offset 4 → stems to "databas" or "database"
- "data" at offset 20 → stems to "data"
↓
Match against query terms (stemmed comparison):
- Match found at positions 4-12 and 20-24
↓
Group matches into snippets with context:
- Single snippet with both matches (close together)
↓
Insert markers around matches:
- "The <em>database</em> stores <em>data</em> efficiently..."
With stemming:
const text = "The runners were running fast in the marathon.";
const query_terms = [_][]const u8{"run"}; // Already stemmed query
const result = try highlight.highlight(
allocator,
text,
&query_terms,
.{ .use_stemming = true }, // Enable stemming
.{ .prefix_marker = "**", .suffix_marker = "**" },
);
// "running" stems to "run" → match
// "runners" stems to "runner" → no match
// result.snippets[0].text = "The runners were **running** fast..."
Key design decisions:
- Re-tokenization approach: Document text is re-tokenized at query time (no storage overhead)
- Stemmed matching: Query terms are matched against stemmed document tokens
- Original text preserved: Markers wrap the original text, not stemmed forms
- Configurable markers: Default
<em>/</em>but customizable for any format - Snippet merging: Close matches combined into single snippets
Finding matches only (without snippets):
const matches = try highlight.findMatches(
allocator,
text,
&query_terms,
.{ .use_stemming = false },
);
defer highlight.freeMatches(allocator, matches);
for (matches) |match| {
// match.start = byte offset of match start
// match.end = byte offset of match end (exclusive)
const matched_text = text[match.start..match.end];
}
7. Indexing Flow
indexDocument(doc_id, text)
Input: doc_id=42, text="The quick database optimization guide"
│
▼
┌─────────────────────────────────────┐
│ 1. Tokenize text │
│ → ["quick", "database", │
│ "optimization", "guide"] │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. Count term frequencies │
│ term_freqs = { │
│ "quick": 1, │
│ "database": 1, │
│ "optimization": 1, │
│ "guide": 1 │
│ } │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. For each term: │
│ a. getOrCreate in dictionary │
│ b. Create posting page if needed │
│ c. Append posting entry │
│ d. Increment doc_freq │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Store document length │
│ doc_lengths[42] = 4 │
│ Update avg_doc_length │
└─────────────────────────────────────┘
│
▼
Output: 4 (number of tokens indexed)
8. Data Structures Summary
In-Memory
| Structure | Purpose |
|---|---|
| Tokenizer | Streaming text tokenization |
| PostingIterator | Iterate posting list entries |
| Bm25Scorer | Calculate relevance scores |
On-Disk (B+Trees)
| B+Tree | Key | Value |
|---|---|---|
| Dictionary | Token string | DictionaryEntry (24 bytes) |
| DocLengths | DocId (8 bytes) | Length (4 bytes) |
On-Disk (Pages)
| Page Type | Content |
|---|---|
| FTS_POSTING | Posting list entries (varint encoded) |
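Varint encoding keeps posting entries compact. A sketch of the common LEB128-style scheme (the exact byte layout LatticeDB uses is an assumption here); small values such as doc-id deltas fit in a single byte:

```python
def varint_encode(n: int) -> bytes:
    """7 payload bits per byte; high bit set means more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(data: bytes, pos: int = 0):
    """Return (value, position after the varint)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7

assert varint_encode(5) == b"\x05"        # one byte for small values
assert varint_encode(300) == b"\xac\x02"  # two bytes for 300
assert varint_decode(varint_encode(300)) == (300, 2)
```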
9. Limitations and Future Work
Current Limitations
- English-only stemming - Porter stemmer for English only (other languages skip stemming)
Implemented Features
- Phrase queries - `searchPhrase("quick brown fox")` with position verification
- Boolean queries - AND (default), OR (`searchOr`), NOT (`-term` syntax)
- Porter stemming - `use_stemming=true` reduces words to roots (English)
- Position indexing - `store_positions=true` for phrase queries
- Skip pointers - O(log n) posting list intersection for multi-term queries
- Quoted phrase syntax - Parse `"exact phrase"` in regular `search()` calls
- Fuzzy search - Levenshtein distance for typo tolerance via `searchFuzzy()`
- Multi-language stop words - 12 languages supported (EN, DE, FR, ES, IT, PT, NL, SV, NO, DA, FI, RU)
- Prefix/wildcard search - `searchWithPrefix("optim*")` expands to matching terms
- Search highlighting - `highlight()` returns snippets with matched terms marked
- Document deletion - `removeDocument()` with reverse index properly cleans posting lists and stats
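Fuzzy search matches terms within a Levenshtein edit distance. A sketch of the classic dynamic-programming distance (illustrative; the Zig implementation in src/fts/fuzzy.zig may differ in details):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions that turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete from `a`
                curr[j - 1] + 1,            # insert into `a`
                prev[j - 1] + (ca != cb),   # substitute (free on match)
            ))
        prev = curr
    return prev[-1]

# The typo "databse" is one edit away from "database"
assert levenshtein("databse", "database") == 1
```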
Planned Features
- Multi-language stemmers - Snowball stemmers for German, French, etc.
10. Usage Example
const std = @import("std");
const lattice = @import("lattice");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const allocator = gpa.allocator();
// Initialize storage (simplified; assumes a PageManager `pm` and the
// BufferPool/BTree/FtsIndex types are in scope)
var bp = try BufferPool.init(allocator, &pm, 64 * 4096);
var dict_tree = try BTree.init(allocator, &bp);
var lengths_tree = try BTree.init(allocator, &bp);
// Create FTS index with phrase query support
var fts = FtsIndex.init(allocator, &bp, &dict_tree, &lengths_tree, .{
.store_positions = true, // Enable phrase queries
});
// Index documents
_ = try fts.indexDocument(1, "Introduction to database systems");
_ = try fts.indexDocument(2, "Advanced database optimization techniques");
_ = try fts.indexDocument(3, "Web development with JavaScript");
_ = try fts.indexDocument(4, "Database performance and MySQL tuning");
// Basic AND search (all terms must match)
const results1 = try fts.search("database optimization", 10);
defer fts.freeResults(results1);
// Returns doc 2 (contains both terms)
// OR search (any term matches)
const results2 = try fts.searchOr("mysql postgres", 10);
defer fts.freeResults(results2);
// Returns doc 4 (contains "mysql")
// NOT search (exclusions with -prefix)
const results3 = try fts.searchWithMode("database -mysql", .@"and", 10);
defer fts.freeResults(results3);
// Returns docs 1, 2 (contain "database" but not "mysql")
// Phrase search (exact sequence)
const results4 = try fts.searchPhrase("database systems", 10);
defer fts.freeResults(results4);
// Returns doc 1 (has "database systems" as adjacent phrase)
// Quoted phrase syntax (alternative to searchPhrase)
// Phrase "database optimization" + term "advanced"
const results5 = try fts.searchWithMode("\"database optimization\" advanced", .@"and", 10);
defer fts.freeResults(results5);
// Returns doc 2 (has phrase "database optimization" AND term "advanced")
// Fuzzy search (typo tolerance)
const results6 = try fts.searchFuzzy("databse", .{
.max_distance = 2,
.min_term_length = 4,
}, 10);
defer fts.freeResults(results6);
// Returns docs with "database" (edit distance 1 from "databse")
for (results1) |result| {
std.debug.print("Doc {}: score {d:.2}\n", .{ result.doc_id, result.score });
}
}
11. Serialization Pattern
All on-disk structures use extern struct with compile-time size assertions:
/// Good: Self-documenting, compile-time verified
pub const PostingPageHeader = extern struct {
token_id: TokenId,
num_entries: u32,
next_page: PageId,
num_skip_pointers: u16,
flags: u16,
data_start: u32,
comptime {
std.debug.assert(@sizeOf(PostingPageHeader) == 20);
}
pub fn read(data: []const u8) PostingPageHeader {
return std.mem.bytesAsValue(PostingPageHeader, data[OFFSET..][0..@sizeOf(PostingPageHeader)]).*;
}
pub fn write(self: *const PostingPageHeader, data: []u8) void {
@memcpy(data[OFFSET..][0..@sizeOf(PostingPageHeader)], std.mem.asBytes(self));
}
};
Why this pattern?
- No magic numbers - Field layout is defined by the struct, not manual offsets
- Compile-time verification - `comptime` block catches size mismatches immediately
- Self-documenting - Struct fields serve as documentation
- Consistent with codebase - WAL, PageHeader, FileHeader all use this pattern
Alignment considerations:
When structs contain u64 fields, use explicit padding and order fields largest-first:
pub const DictionaryEntry = extern struct {
total_freq: u64, // 8 bytes - largest first (requires 8-byte alignment)
token_id: u32, // 4 bytes
doc_freq: u32, // 4 bytes
posting_page: u32, // 4 bytes
_padding: u32 = 0, // 4 bytes - explicit trailing padding
comptime {
std.debug.assert(@sizeOf(DictionaryEntry) == 24);
}
};
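The 24-byte invariant can be cross-checked outside Zig; here the same layout is expressed with Python's struct module (little-endian assumed for illustration):

```python
import struct

# u64 total_freq, u32 token_id, u32 doc_freq, u32 posting_page, u32 padding
DICTIONARY_ENTRY = struct.Struct("<QIIII")
assert DICTIONARY_ENTRY.size == 24  # mirrors the Zig comptime assertion

# Round-trip a sample entry through the fixed layout
packed = DICTIONARY_ENTRY.pack(100, 7, 3, 42, 0)
total_freq, token_id, doc_freq, posting_page, _ = DICTIONARY_ENTRY.unpack(packed)
```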
12. File Reference
| File | Purpose |
|---|---|
| src/fts/tokenizer.zig | Text tokenization, normalization, language config |
| src/fts/stopwords.zig | Multi-language stop word lists (12 languages) |
| src/fts/dictionary.zig | Token → TokenId mapping via B+Tree, range iteration |
| src/fts/posting.zig | Posting list storage with varint encoding, entry removal |
| src/fts/scorer.zig | BM25 scoring, document length storage |
| src/fts/index.zig | Main FtsIndex coordinator, boolean/phrase/fuzzy/prefix search |
| src/fts/stemmer.zig | Porter stemmer algorithm (English), language routing |
| src/fts/fuzzy.zig | Levenshtein distance, fuzzy term expansion |
| src/fts/prefix.zig | Prefix/wildcard search, upper bound calculation |
| src/fts/highlight.zig | Search result highlighting, snippet extraction |
| src/fts/reverse_index.zig | doc_id → terms mapping for document deletion |
Query Execution
Overview
Lattice executes Cypher queries using the Volcano iterator model—a pull-based execution engine where operators form a tree and data flows upward one row at a time. This design enables lazy evaluation, memory efficiency, and composable query plans.
┌─────────────────────────────────────────────────────────────────┐
│ Query Execution Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ "MATCH (p:Person) WHERE p.age > 30 RETURN p.name LIMIT 10" │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Lexer │ Tokens │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Parser │ AST │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Semantic │ Validated AST │
│ │ Analyzer │ + Variable bindings │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Planner │ Operator Tree │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Executor │ Result Rows │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
The Volcano Iterator Model
Every operator implements three methods:
pub const Operator = struct {
vtable: *const VTable,
ptr: *anyopaque,
pub const VTable = struct {
open: fn (*anyopaque, *ExecutionContext) OperatorError!void,
next: fn (*anyopaque, *ExecutionContext) OperatorError!?*Row,
close: fn (*anyopaque, *ExecutionContext) void,
deinit: fn (*anyopaque, Allocator) void,
};
};
- `open()`: Initialize the operator, acquire resources (iterators, memory)
- `next()`: Return the next row, or `null` if exhausted
- `close()`: Release resources (unpin pages, close iterators)
- `deinit()`: Free the operator itself
Pull-Based Execution
Data flows upward through the operator tree. The root operator "pulls" from its children:
┌─────────┐
│ Limit │ ◄── Executor calls next() here
└────┬────┘
│ pulls from
┌────┴────┐
│ Project │
└────┬────┘
│ pulls from
┌────┴────┐
│ Filter │
└────┬────┘
│ pulls from
┌────┴────┐
│LabelScan│ ◄── Reads from storage
└─────────┘
When the executor calls root.next():
- Limit calls Project.next()
- Project calls Filter.next()
- Filter calls LabelScan.next()
- LabelScan reads from B+Tree, returns row
- Filter evaluates predicate—if false, pulls another row
- Project evaluates expressions, transforms row
- Limit checks count, returns row (or null if limit reached)
Lazy Evaluation
Work happens only when next() is called. For LIMIT 10, we might scan millions of nodes but only process 10 matching rows. Unused rows are never materialized.
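The pull model maps naturally onto generators. A Python sketch of a Limit → Filter → LabelScan plan (hypothetical data; the real operators use the vtable interface shown earlier):

```python
def label_scan(nodes, label, stats):
    # Leaf operator: pulls one node at a time from "storage"
    for node in nodes:
        stats["scanned"] += 1
        if label in node["labels"]:
            yield node

def filter_op(rows, predicate):
    # Keeps pulling until a row satisfies the predicate
    for row in rows:
        if predicate(row):
            yield row

def limit_op(rows, n):
    # Stops pulling after n rows
    it = iter(rows)
    for _ in range(n):
        row = next(it, None)
        if row is None:
            return
        yield row

nodes = [{"labels": {"Person"}, "age": age} for age in range(100)]
stats = {"scanned": 0}
plan = limit_op(filter_op(label_scan(nodes, "Person", stats),
                          lambda r: r["age"] > 30), 10)
results = list(plan)
# 10 rows returned, but only 41 of the 100 nodes were ever scanned
```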
Rows and Slots
A Row is a fixed-size tuple flowing between operators:
pub const Row = struct {
slots: [16]SlotValue, // Variable bindings
distances: [16]f32, // Vector search distances
scores: [16]f32, // FTS relevance scores
populated: u16, // Bitmask of populated slots
};
Slot Values
Each slot holds one of:
pub const SlotValue = union(enum) {
empty: void, // Unset
node_ref: NodeId, // Reference to a node (just the ID)
edge_ref: EdgeId, // Reference to an edge
property: PropertyValue, // Actual value (int, string, bool, etc.)
};
Why Slots Instead of Named Variables?
Performance. Looking up slots[3] is O(1). The planner assigns each variable to a numbered slot:
MATCH (p:Person)-[r:KNOWS]->(f:Person)
│ │ │
slot 0 slot 1 slot 2
The execution context maintains the mapping from names to slots for expression evaluation.
Lazy Materialization
Rows store references (NodeId, EdgeId), not full objects. Properties are fetched on-demand during expression evaluation. This avoids loading data that's never accessed.
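A sketch of the idea: rows carry only a NodeId, and a property store is consulted only for the fields an expression actually touches (hypothetical store, for illustration):

```python
class PropertyStore:
    """Stand-in for on-disk node properties; counts on-demand fetches."""
    def __init__(self, props):
        self.props = props   # node_id -> {key: value}
        self.fetches = 0

    def get(self, node_id, key):
        self.fetches += 1
        return self.props[node_id].get(key)

store = PropertyStore({1: {"name": "Alice", "age": 31},
                       2: {"name": "Bob", "age": 25}})

rows = [{"slot0": 1}, {"slot0": 2}]  # rows hold NodeIds, not full objects
names = [store.get(row["slot0"], "name") for row in rows]
# Only "name" was fetched; "age" was never loaded
```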
Operators
Scan Operators
AllNodesScan: Iterates every node in the database.
open(): Create B+Tree iterator (null start key = first entry)
next(): Read entry, extract NodeId from key, set output slot
close(): Release iterator, unpin pages
LabelScan: Iterates nodes with a specific label.
open(): Create LabelIndex iterator for the label
next(): Get next NodeId from index, set output slot
close(): Release iterator
Both produce rows with a single slot containing a node reference.
Filter Operator
Evaluates a predicate and passes through only matching rows.
fn next(self: *Filter, ctx: *ExecutionContext) !?*Row {
while (true) {
const row = try self.input.next(ctx) orelse return null;
const result = try self.evaluator.evaluate(self.predicate, row, ctx);
if (result.isTruthy()) {
return row;
}
// Predicate failed, try next row
}
}
Supports short-circuit evaluation for AND/OR.
Project Operator
Transforms rows by evaluating expressions for each output column.
RETURN p.name, p.age + 1, "literal"
│ │ │
expr[0] expr[1] expr[2]
fn next(self: *Project, ctx: *ExecutionContext) !?*Row {
const input_row = try self.input.next(ctx) orelse return null;
self.output_row.clear();
for (self.items) |item| {
const value = try self.evaluator.evaluate(item.expr, input_row, ctx);
self.output_row.setSlot(item.output_slot, resultToSlotValue(value));
}
return self.output_row;
}
Expand Operator
Traverses edges from input nodes. This implements pattern matching like (a)-[r]->(b).
State:
source_slot = slot containing the source node
target_slot = slot to write target node
edge_slot = slot to write edge (optional)
edge_iterator = current iterator over edges
current_input = row being expanded
next():
loop:
if edge_iterator.next() returns edge:
output_row = copy input row
output_row[target_slot] = edge.target
output_row[edge_slot] = edge (if requested)
return output_row
// Exhausted edges for current input, get next input
current_input = input.next()
if null: return null
source = current_input[source_slot]
edge_iterator = edgeStore.getOutgoing(source)
This is a one-to-many operator: one input row can produce multiple output rows.
Supports three directions:
- outgoing: `(a)-[r]->(b)` — traverse outgoing edges
- incoming: `(a)<-[r]-(b)` — traverse incoming edges
- both: `(a)-[r]-(b)` — traverse both directions
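The one-to-many expansion can be sketched as a generator over an adjacency map (hypothetical data; the real operator drives an edge iterator per input row):

```python
def expand(rows, adjacency, source_slot, target_slot):
    """Outgoing-direction expand: one output row per edge of each input row."""
    for row in rows:
        source = row[source_slot]
        for target in adjacency.get(source, []):
            out = dict(row)           # copy input bindings
            out[target_slot] = target
            yield out

adjacency = {1: [2, 3], 2: [3]}       # node -> outgoing neighbors
rows = [{"a": 1}]                     # one input row bound to node 1
out = list(expand(rows, adjacency, "a", "b"))
# One input row produced two output rows: b=2 and b=3
```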
Limit and Skip
Limit: Returns at most N rows.
fn next(self: *Limit, ctx: *ExecutionContext) !?*Row {
if (self.returned >= self.count) return null;
const row = try self.input.next(ctx) orelse return null;
self.returned += 1;
return row;
}
Skip: Discards the first N rows.
fn next(self: *Skip, ctx: *ExecutionContext) !?*Row {
while (self.skipped < self.count) {
_ = try self.input.next(ctx) orelse return null;
self.skipped += 1;
}
return try self.input.next(ctx);
}
Sort Operator (Blocking)
Sort must see all input before producing any output—it's a blocking operator.
fn open(self: *Sort, ctx: *ExecutionContext) !void {
try self.input.open(ctx);
// Materialize all input rows
while (try self.input.next(ctx)) |row| {
try self.rows.append(row.*);
}
// Sort in memory
self.sortRows();
}
fn next(self: *Sort, ctx: *ExecutionContext) !?*Row {
if (self.index >= self.rows.items.len) return null;
defer self.index += 1;
return &self.rows.items[self.index];
}
This breaks the streaming model but is necessary for ordering.
Vector Search Operator
Performs k-NN search using the HNSW index.
fn open(self: *VectorSearch, ctx: *ExecutionContext) !void {
// Execute search upfront
self.results = try self.index.search(self.query_vector, self.k, null);
}
fn next(self: *VectorSearch, ctx: *ExecutionContext) !?*Row {
while (self.current < self.results.len) {
const result = self.results[self.current];
self.current += 1;
// Apply distance threshold
if (self.threshold) |t| {
if (result.distance > t) continue;
}
self.row.setSlot(self.output_slot, .{ .node_ref = result.node_id });
self.row.setDistance(self.output_slot, result.distance);
return self.row;
}
return null;
}
The distance is stored in the row for use in expressions or sorting.
FTS Search Operator
Performs full-text search with BM25 scoring.
fn open(self: *FtsSearch, ctx: *ExecutionContext) !void {
self.results = try self.index.search(self.query_text, self.limit);
}
fn next(self: *FtsSearch, ctx: *ExecutionContext) !?*Row {
// Similar to VectorSearch but with scores instead of distances
// ...
self.row.setScore(self.output_slot, result.score);
return self.row;
}
Mutation Operators
Mutation operators modify the graph structure. Unlike read operators, they have side effects.
CreateNode: Creates new nodes with labels and properties.
fn next(self: *CreateNode, ctx: *ExecutionContext) !?*Row {
const node_id = try ctx.storage.createNode(self.labels);
for (self.properties) |prop| {
const value = try self.evaluator.evaluate(prop.value, null, ctx);
try ctx.storage.setProperty(node_id, prop.key, value);
}
self.output_row.setSlot(self.output_slot, .{ .node_ref = node_id });
return &self.output_row;
}
DeleteNode: Removes nodes from the graph. With DETACH, also removes connected edges.
fn next(self: *DeleteNode, ctx: *ExecutionContext) !?*Row {
const row = try self.input.next(ctx) orelse return null;
const node_id = row.slots[self.target_slot].node_ref;
if (self.detach) {
try ctx.storage.detachDelete(node_id);
} else {
try ctx.storage.deleteNode(node_id);
}
return row; // Pass through for chaining
}
SetProperty: Updates properties on nodes or edges.
MATCH (n:Person {name: "Alice"}) SET n.age = 31, n.city = "NYC"
fn next(self: *SetProperty, ctx: *ExecutionContext) !?*Row {
const row = try self.input.next(ctx) orelse return null;
const target = row.slots[self.target_slot];
const value = try self.evaluator.evaluate(self.value_expr, row, ctx);
// SET n.prop = NULL removes the property
if (value == .null_val) {
try ctx.storage.removeProperty(target, self.property_name);
} else {
try ctx.storage.setProperty(target, self.property_name, value);
}
return row;
}
SetLabels: Adds labels to nodes.
MATCH (n:Person {name: "Alice"}) SET n:Admin:Verified
RemoveProperty: Explicitly removes properties.
MATCH (n:Person {name: "Alice"}) REMOVE n.city
RemoveLabel: Removes labels from nodes.
MATCH (n:Person {name: "Alice"}) REMOVE n:Verified
Expression Evaluation
The expression evaluator handles predicates and projections.
pub fn evaluate(expr: *const Expression, row: *const Row, ctx: *ExecutionContext) !EvalResult {
return switch (expr.*) {
.literal => |lit| evaluateLiteral(lit),
.variable => |v| evaluateVariable(v, row, ctx),
.property_access => |pa| evaluatePropertyAccess(pa, row, ctx),
.binary => |b| evaluateBinary(b, row, ctx),
.unary => |u| evaluateUnary(u, row, ctx),
.function_call => |f| evaluateFunction(f, row, ctx),
// ...
};
}
Supported Operations
Comparison: =, <>, <, <=, >, >=
Logical: AND, OR, NOT, XOR (with short-circuit evaluation)
Arithmetic: +, -, *, /, %, ^
String: CONTAINS, STARTS WITH, ENDS WITH
Null handling: IS NULL, IS NOT NULL
Functions: id(), coalesce(), abs(), size(), toInteger(), toFloat()
Vector Search: <=> (vector distance operator)
-- k-NN search with distance threshold
MATCH (c:Chunk) WHERE c.embedding <=> $query < 0.5 RETURN c
Full-Text Search: @@ (FTS match operator)
-- BM25-scored full-text search
MATCH (d:Doc) WHERE d.text @@ $search RETURN d
MATCH (d:Doc) WHERE d.text @@ "neural networks" RETURN d
These operators are recognized by the planner and converted to specialized search operators that use the HNSW and FTS indexes directly, rather than scanning all nodes.
Type Coercion
Numeric operations promote integers to floats when mixed:
42 + 3.14 → 45.14 (float)
Null propagates through most operations:
null + 1 → null
null = null → true (special case for equality)
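A sketch of these coercion rules as an evaluator might apply them (illustrative; the special-cased null equality follows the text above):

```python
NULL = None  # stands in for Cypher NULL

def add(a, b):
    # Null propagates through arithmetic
    if a is NULL or b is NULL:
        return NULL
    # Mixed int/float promotes to float
    if isinstance(a, float) or isinstance(b, float):
        return float(a) + float(b)
    return a + b

def equals(a, b):
    # Special case: null = null evaluates to true
    if a is NULL and b is NULL:
        return True
    if a is NULL or b is NULL:
        return NULL
    return a == b

assert abs(add(42, 3.14) - 45.14) < 1e-12
assert add(NULL, 1) is NULL
assert equals(NULL, NULL) is True
```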
The Query Planner
The planner transforms an AST into an operator tree by walking the query clauses.
Planning Algorithm
pub fn plan(query: *Query) !Operator {
var current: ?Operator = null;
for (query.clauses) |clause| {
current = switch (clause) {
.match => try planMatch(clause.match, current),
.where => try planWhere(clause.where, current),
.create => try planCreate(clause.create, current),
.delete => try planDelete(clause.delete, current),
.set => try planSet(clause.set, current),
.remove => try planRemove(clause.remove, current),
.return_ => try planReturn(clause.return_, current),
.order_by => try planOrderBy(clause.order_by, current),
.limit => try planLimit(clause.limit, current),
.skip => try planSkip(clause.skip, current),
};
}
return current.?;
}
Example: Simple Query
MATCH (p:Person) WHERE p.age > 30 RETURN p.name LIMIT 10
Produces:
Limit(count=10)
└── Project([p.name → slot 0])
└── Filter(p.age > 30)
└── LabelScan(label="Person", output=slot 0)
Example: Edge Traversal
MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name
Produces:
Project([a.name → slot 0, b.name → slot 1])
└── Expand(source=0, target=1, type="KNOWS", dir=outgoing)
└── LabelScan(label="Person", output=slot 0)
Variable Binding
The planner allocates slots for variables:
| Variable | Slot | Kind |
|---|---|---|
| a | 0 | node |
| b | 1 | node |
| r | 2 | edge |
These bindings are registered in the ExecutionContext for expression evaluation.
Execution Context
Runtime state for query execution:
pub const ExecutionContext = struct {
allocator: Allocator,
row_arena: ArenaAllocator, // Fast row allocation
parameters: StringHashMap(Value), // Query parameters ($name)
variables: StringHashMap(u8), // Variable → slot mapping
};
Row Arena
Rows are allocated from an arena that's reset between queries. This avoids per-row allocation overhead:
// During query execution
const row = try ctx.allocRow(); // Fast bump allocation
// After query completes
ctx.resetRowArena(); // Frees all rows at once
Query Parameters
Parameters allow safe value injection:
MATCH (p:Person) WHERE p.age > $min_age RETURN p
try ctx.setParameter("min_age", .{ .int_val = 30 });
Operator Composition
Operators are composable—any operator can wrap any other:
// Filter wrapping Expand wrapping LabelScan
Filter
└── Expand
└── LabelScan
// Limit wrapping Sort wrapping Filter
Limit
└── Sort
└── Filter
└── ...
This enables flexible query plans without special-casing combinations.
Memory Management
Operator Lifecycle
1. Planner creates operators (heap allocated)
2. Executor calls open() on root (cascades down)
3. Executor calls next() repeatedly (rows flow up)
4. Executor calls close() on root (cascades down)
5. Executor calls deinit() on root (frees operators bottom-up)
Iterator Cleanup
Operators holding B+Tree or index iterators must release them in close():
fn close(self: *LabelScan, ctx: *ExecutionContext) void {
if (self.iterator) |*iter| {
iter.deinit(); // Unpins pages in buffer pool
self.iterator = null;
}
}
Failing to do this leaks pinned pages.
Future Optimizations
The current planner uses simple heuristics. Future improvements could include:
Predicate Pushdown
Move filters closer to scans:
Before: Filter(a.x > 10) → Expand → LabelScan
After: Expand → Filter(a.x > 10) → LabelScan
Index Selection
Choose between AllNodesScan vs LabelScan vs property index based on selectivity.
Join Ordering
For multi-pattern queries, order joins by estimated cardinality.
Vectorized Execution
Process rows in batches instead of one-at-a-time for better cache utilization.
Database Query API
The Database.query() method provides the primary interface for executing Cypher queries. It orchestrates all components of the query pipeline.
Basic Usage
// Open database
var db = try Database.open(allocator, "mydb.ltdb", .{ .create = true });
defer db.close();
// Create some data
_ = try db.createNode(&[_][]const u8{"Person"});
_ = try db.createNode(&[_][]const u8{"Person"});
// Execute a query
var result = try db.query("MATCH (n:Person) RETURN n");
defer result.deinit(); // Always clean up results
// Process results
std.debug.print("Found {} rows\n", .{result.rowCount()});
for (result.rows) |row| {
for (row.values) |val| {
switch (val) {
.node_id => |id| std.debug.print("Node: {}\n", .{id}),
.string_val => |s| std.debug.print("String: {s}\n", .{s}),
.int_val => |i| std.debug.print("Int: {}\n", .{i}),
.null_val => std.debug.print("null\n", .{}),
else => {},
}
}
}
Result Types
/// A single value in a query result
pub const ResultValue = union(enum) {
null_val: void, // NULL
bool_val: bool, // true/false
int_val: i64, // integers
float_val: f64, // floating point
string_val: []const u8, // strings
node_id: NodeId, // node reference
};
/// A row in a query result
pub const ResultRow = struct {
values: []ResultValue, // One value per column
};
/// Complete query result
pub const QueryResult = struct {
columns: [][]const u8, // Column names from RETURN clause
rows: []ResultRow, // Result rows
allocator: Allocator,
pub fn deinit(self: *QueryResult) void; // Free all memory
pub fn rowCount(self: *const QueryResult) usize;
pub fn columnCount(self: *const QueryResult) usize;
};
Error Handling
The query() method returns QueryError on failure:
pub const QueryError = error{
ParseError, // Syntax error in Cypher
SemanticError, // Undefined variable, type mismatch
PlanError, // Query cannot be planned
ExecutionError, // Runtime error during execution
OutOfMemory,
};
Example error handling:
var result = db.query("MATCH (n RETURN n"); // Missing )
if (result) |*r| {
defer r.deinit();
// process results
} else |err| switch (err) {
QueryError.ParseError => std.debug.print("Syntax error\n", .{}),
QueryError.SemanticError => std.debug.print("Semantic error\n", .{}),
else => std.debug.print("Query failed: {}\n", .{err}),
}
Internal Pipeline
When you call db.query(cypher), the following happens:
1. Parser.init(allocator, cypher)
└── parser.parse() → AST (or ParseError)
2. SemanticAnalyzer.init(allocator)
└── analyzer.analyze(ast) → AnalysisResult (or SemanticError)
3. QueryPlanner.init(allocator, storage_context)
└── planner.plan(ast, analysis) → Operator tree (or PlanError)
4. ExecutionContext.init(allocator)
└── Register variable bindings from planner
5. execute(allocator, root_operator, context)
└── Pull rows through operator tree → executor.QueryResult
6. convertResult(exec_result, planner)
└── Transform to user-friendly QueryResult
The StorageContext connects the planner to database storage:
const storage_ctx = StorageContext{
.node_tree = &self.node_tree, // B+Tree for nodes
.label_index = &self.label_index, // Label → NodeId index
.edge_store = &self.edge_store, // Edge storage
.symbol_table = &self.symbol_table, // String interning
};
Memory Management
- QueryResult owns all memory: Call `deinit()` when done
- Strings are copied: Safe to use after query components are freed
- Row arena: Internal rows use arena allocation, reset between queries
Files
| File | Purpose |
|---|---|
| src/storage/database.zig | Database.query() API, QueryResult types |
| src/query/executor.zig | Operator interface, Row, ExecutionContext |
| src/query/expression.zig | Expression evaluator |
| src/query/planner.zig | AST to operator tree transformation |
| src/query/operators/scan.zig | AllNodesScan, LabelScan |
| src/query/operators/filter.zig | Filter operator |
| src/query/operators/project.zig | Project operator |
| src/query/operators/expand.zig | Edge traversal |
| src/query/operators/vector.zig | HNSW k-NN search |
| src/query/operators/fts.zig | Full-text search |
| src/query/operators/limit.zig | Limit, Skip, Sort |
| src/query/operators/mutation.zig | CREATE, DELETE, SET, REMOVE |
Benchmarks
Benchmarked on Apple M1, single-threaded, with auto-scaled buffer pool.
Core Operations
| Operation | Latency | Throughput | Target | Status |
|---|---|---|---|---|
| Node lookup | 0.13 us | 7.9M ops/sec | < 1 us | PASS |
| Node creation | 0.65 us | 1.5M ops/sec | — | — |
| Edge traversal | 9 us | 111K ops/sec | — | — |
| Full-text search (100 docs) | 19 us | 53K ops/sec | — | — |
| 10-NN vector search (1M vectors) | 0.83 ms | 1.2K ops/sec | < 10 ms @ 1M | PASS |
Vector Search (HNSW) at Scale
128-dimensional cosine vectors, M=16, ef_construction=200, ef_search=64, k=10.
| Scale | Mean Latency | P99 Latency | Recall@10 | Memory |
|---|---|---|---|---|
| 1,000 | 65 us | 70 us | 100% | 1 MB |
| 10,000 | 174 us | 695 us | 99% | 10 MB |
| 100,000 | 438 us | 1.2 ms | 99% | 101 MB |
| 1,000,000 | 832 us | 1.8 ms | 100% | 1,040 MB |
Search latency scales sub-linearly (O(log N)) with 99-100% recall@10. Uses heuristic neighbor selection (HNSW paper Algorithm 4) for diverse graph connectivity, connection page packing for ~4.5x memory reduction, and pre-normalized dot product for fast cosine distance.
ef_search Sensitivity (1M vectors)
| ef_search | Mean Latency | Recall@10 |
|---|---|---|
| 16 | 506 us | 57% |
| 32 | 1.9 ms | 79% |
| 64 | 990 us | 100% |
| 128 | 3.2 ms | 100% |
| 256 | 11.6 ms | 100% |
Optimization History
Baseline (pre-optimization)
| Scale | Insert Rate | Search Mean | Recall@10 |
|---|---|---|---|
| 1K | ~91/sec | 1.7ms | 100% |
| 10K | ~42/sec | 3.8ms | 99% |
| 100K | ~23/sec | 4.5ms | 99% |
| 1M | ~14/sec | 6.4ms | 100% |
Post-optimization (Phase 2)
Optimizations applied: last_page tracking, pre-sized search structures, stack-buffer connection I/O, cached vectors in heuristic pruning, and pre-normalize + dot product for cosine.
| Scale | Insert Rate | Search Mean | Recall@10 |
|---|---|---|---|
| 1K | ~954/sec | 65us | 100% |
| 10K | ~726/sec | 174us | 99% |
| 100K | ~526/sec | 438us | 99% |
| 1M | ~248/sec | 832us | 100% |
Improvement Summary
| Scale | Insert Speedup | Search Speedup |
|---|---|---|
| 1K | 10.5x | 26x |
| 10K | 17.5x | 22x |
| 100K | 22.8x | 10x |
| 1M | 17.7x | 7.7x |
Reproducing
zig build benchmark # Core operation benchmarks
zig build vector-benchmark -- --quick # Vector benchmarks (1K/10K/100K, ~7 min)
zig build vector-benchmark # Full vector benchmarks including 1M (~70 min)
zig build graph-benchmark -- --quick # Graph traversal benchmarks
Competitive Analysis
Point Lookups
| System | Latency | Type | Source |
|---|---|---|---|
| LatticeDB | 0.13 us | Embedded | zig build benchmark |
| RocksDB (in-memory) | 0.14 us | Embedded | RocksDB wiki |
| SQLite (in-memory) | ~0.2 us | Embedded | Turso blog |
| SQLite (WAL, disk) | 3 us (p90) | Embedded | marending.dev |
| Neo4j | 28 ms (p99) | Server | Memgraph comparison |
LatticeDB's B+Tree achieves sub-microsecond cached lookups, matching RocksDB in-memory and outperforming SQLite on disk by 23x.
Vector Search
| System | Latency (10-NN) | Scale | Type | Source |
|---|---|---|---|---|
| LatticeDB | 0.83 ms mean, 100% recall | 1M | Embedded | zig build vector-benchmark |
| FAISS HNSW (single-thread) | 0.5-3 ms | 1M | Library | FAISS wiki |
| Weaviate | 1.4 ms mean, 3.1 ms P99 | 1M | Server | Weaviate benchmarks |
| Qdrant | ~1-2 ms | 1M | Server | Qdrant benchmarks |
| Milvus + SQ8 | 2.2 ms P99 | 1M | Server | VectorDBBench |
| pgvector HNSW | ~5 ms @ 99% recall | 1M | Extension | Jonathan Katz |
| LanceDB | 3-5 ms | 1M | Embedded | LanceDB blog |
| Chroma | 4-5 ms mean | 1M | Embedded | Chroma docs |
| Pinecone P2 | ~15 ms (incl. network) | 1M | Cloud | Pinecone blog |
| sqlite-vec (brute force) | 17 ms | 1M | Extension | Alex Garcia |
LatticeDB at 1M achieves 0.83 ms mean with 100% recall@10 — within the range of single-threaded FAISS HNSW and competitive with server-based systems like Weaviate and Qdrant (which add network overhead in practice).
Graph Traversal
| System | 2-hop (100K nodes) | Type | Source |
|---|---|---|---|
| LatticeDB | 39 us | Embedded | zig build sqlite-benchmark |
| SQLite (recursive CTE) | 548 us | Embedded | zig build sqlite-benchmark |
| Kuzu | 19 ms | Embedded | The Data Quarry |
| Neo4j | 10 ms (1M nodes) | Server | Neo4j blog |
LatticeDB vs SQLite
Social network graph with power-law degree distribution, adjacency cache pre-warmed.
Small Scale (10K nodes, 50K edges)
| Workload | LatticeDB | SQLite | Speedup |
|---|---|---|---|
| 1-hop traversal | 560 ns | 13.0 us | 23x |
| 2-hop traversal | 3.0 us | 37.5 us | 13x |
| 3-hop traversal | 19.1 us | 178.5 us | 9x |
| Variable path (1..5) | 82.4 us | 4.3 ms | 52x |
Medium Scale (100K nodes, 500K edges)
| Workload | LatticeDB | SQLite | Speedup |
|---|---|---|---|
| 1-hop traversal | 8.0 us | 290.0 us | 36x |
| 2-hop traversal | 38.7 us | 548.3 us | 14x |
| 3-hop traversal | 197.3 us | 1.2 ms | 6x |
| Variable path (1..5) | 134.4 us | 10.1 ms | 75x |
Depth-Limited Traversal (10K nodes, 50K edges)
| Depth | LatticeDB | SQLite | Speedup |
|---|---|---|---|
| 10 | 311 us | 121 ms | 390x |
| 15 | 380 us | 271 ms | 713x |
| 25 | 318 us | 587 ms | 1,848x |
| 50 | 500 us | 1.4 s | 2,819x |
LatticeDB uses BFS with adjacency cache and bitset visited tracking. SQLite uses a recursive CTE with UNION deduplication. Both compute identical reachable node sets (~8K nodes). The gap widens at deeper depths as SQLite's CTE overhead grows with each recursion level.
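The BFS side of the comparison can be sketched as follows (Python for illustration; LatticeDB's implementation uses a bitset rather than a hash set for visited tracking):

```python
from collections import deque

def reachable(adjacency, start, max_depth):
    """Depth-limited BFS; the visited set deduplicates, which is what
    the recursive CTE pays for with UNION at every level."""
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited

# Chain 0 -> 1 -> 2 -> 3: depth 2 from node 0 reaches {0, 1, 2}
assert reachable({0: [1], 1: [2], 2: [3]}, 0, 2) == {0, 1, 2}
```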
Full-Text Search (BM25)
| System | Search Latency | Type | Source |
|---|---|---|---|
| LatticeDB | 19 us | Embedded | zig build benchmark |
| SQLite FTS5 | < 6 ms | Embedded | SQLite Cloud |
| Elasticsearch | 1-10 ms | Server | Various |
| Tantivy | 10-100 us | Library | Various |
LatticeDB's inverted index with BM25 scoring is ~300x faster than SQLite FTS5 and competitive with Tantivy (a dedicated Rust search library).
Building from Source
LatticeDB is written in Zig with zero external dependencies.
Prerequisites
- Zig (0.15.x or later)
Clone and Build
git clone https://github.com/jeffhajewski/latticedb.git
cd latticedb
zig build # build everything
Build Targets
zig build # Build everything
zig build library # Build static library only
zig build cli # Build CLI tool only
zig build amalgamation # Create single-file distribution
zig build shared # Build shared library for bindings
Optimized Builds
zig build release-safe # Release build with safety checks
zig build release-fast # Optimized release build
zig build -Doptimize=ReleaseFast # Alternative optimized build
Building Language Bindings
Python
# Build the shared library first
zig build shared
# The Python bindings use ctypes to load the shared library
cd bindings/python
pip install -e .
TypeScript
# Build the shared library first
zig build shared
# Build the TypeScript bindings
cd bindings/typescript
npm install
npm run build
Project Structure
src/
├── core/ # Core types and utilities
├── storage/ # B+Tree, page management, WAL
├── vector/ # HNSW index, vector operations
├── fts/ # Full-text search, tokenizer
├── query/ # Cypher parser, planner, executor
├── transaction/ # Transaction management, MVCC
├── concurrency/ # Locking, latches
├── api/ # C API bindings
└── cli/ # CLI tool
include/
└── lattice.h # C API header
bindings/
├── python/ # Python bindings
└── typescript/ # TypeScript/Node.js bindings
Running Tests
LatticeDB has a comprehensive test suite covering unit tests, integration tests, concurrency tests, crash recovery tests, and benchmarks.
Test Commands
zig build test # Run unit tests
zig build integration-test # Run integration tests
zig build concurrency-test # Run concurrency tests
zig build crash-test # Run crash recovery tests
Benchmarks
zig build benchmark # Core operation benchmarks
zig build vector-benchmark -- --quick # Vector benchmarks (1K/10K/100K, ~7 min)
zig build vector-benchmark # Full vector benchmarks including 1M (~70 min)
zig build graph-benchmark -- --quick # Graph traversal benchmarks
Test Structure
tests/
├── unit/ # Unit tests for individual modules
├── integration/ # End-to-end integration tests
├── fuzz/ # Fuzzing targets for parser and serialization
├── crash/ # Crash recovery tests (kill process mid-transaction)
├── concurrency/ # Multi-threaded concurrency tests
└── performance/ # Performance benchmarks
Testing Standards
- Aim for 100% branch coverage on core modules
- Fuzzing is mandatory for the parser and serialization code
- Crash recovery is tested by killing the process mid-transaction and verifying data integrity
- Concurrency tests cover all multi-threaded code paths
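The durability idea behind the crash tests can be illustrated generically: if each log record is length-prefixed and checksummed, a reader can detect the torn tail left by a mid-transaction kill and replay only the intact prefix. The following is a minimal sketch of that pattern, not LatticeDB's actual WAL format:

```python
import struct
import zlib

def append_record(buf: bytearray, payload: bytes) -> None:
    """Frame a record as [length][crc32][payload] and append it to the log."""
    buf += struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def recover(buf: bytes) -> list:
    """Replay complete records; stop at the first truncated or corrupt frame."""
    records, off = [], 0
    while off + 8 <= len(buf):
        length, crc = struct.unpack_from("<II", buf, off)
        payload = buf[off + 8 : off + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break                      # torn tail: discard everything from here on
        records.append(payload)
        off += 8 + length
    return records

log = bytearray()
append_record(log, b"txn1: set a=1")
append_record(log, b"txn2: set b=2")
torn = bytes(log)[:-5]                 # simulate a crash mid-write of txn2
assert recover(torn) == [b"txn1: set a=1"]
assert recover(bytes(log)) == [b"txn1: set a=1", b"txn2: set b=2"]
```

The crash tests verify this end to end: kill the process, reopen the file, and check that every committed transaction survives and no partial one is visible.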
TypeScript Binding Tests
cd bindings/typescript
npm test
Python Binding Tests
cd bindings/python
pytest
Contributing
Contributions to LatticeDB are welcome.
Getting Started
- Fork the repository on GitHub
- Clone your fork and create a branch
- Make your changes
- Run the test suite: zig build test
- Submit a pull request
Development Setup
git clone https://github.com/YOUR_USERNAME/latticedb.git
cd latticedb
zig build test # Verify everything builds and passes
Code Style
- Follow existing Zig conventions in the codebase
- All allocation goes through explicit allocator parameters
- Fail fast: detect and report corruption, don't hide it
- Keep the C API as the contract: all bindings wrap include/lattice.h
Testing
All changes should include appropriate tests:
- Unit tests for new functions or modules
- Integration tests for end-to-end behavior changes
- Fuzz tests for parser or serialization changes
- Crash tests for durability-related changes
Run the full test suite before submitting:
zig build test
zig build integration-test
Pull Requests
- Keep PRs focused on a single change
- Include a clear description of what changed and why
- Ensure all tests pass
- Add tests for new functionality
Reporting Issues
File issues on GitHub with:
- A clear description of the problem
- Steps to reproduce
- Expected vs actual behavior
- LatticeDB version and platform