B+Tree

What It Is

A B+Tree is a self-balancing tree structure optimized for disk-based storage. It provides O(log N) lookups, inserts, and deletes, with efficient range scans.

Why B+Tree?

The Problem with Binary Trees

A binary search tree with 1 million entries is ~20 levels deep (log2(1M) ≈ 20). Each level requires a disk read. That's 20 disk reads per lookup!

B+Tree: Wide and Shallow

A B+Tree has many keys per node (hundreds), making it very shallow:

Binary Tree (1M entries):          B+Tree (1M entries):
        depth ~20                       depth ~3

          ○                               ○
         / \                         /    |    \
        ○   ○                       ○     ○     ○
       /\   /\                    / | \ / | \ / | \
      ○ ○ ○  ○                   ○  ○  ○ ○  ○  ○ ○ ○
     ... (20 levels)             (leaf level)

    20 disk reads                   3 disk reads!

With 100 keys per node: log100(1M) ≈ 3 levels.

Data Only in Leaves

B+Trees store all data in leaf nodes. Internal nodes only contain keys for routing:

                    ┌─────────────────────┐
                    │   [30]  [60]  [90]  │  ◄── Internal: keys only
                    └───┬──────┬──────┬───┘
                       ╱       │       ╲
        ┌─────────────┘        │        └─────────────┐
        ▼                      ▼                      ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│ 10:val 20:val │◄──►│ 30:val 50:val │◄──►│ 60:val 80:val │
└───────────────┘    └───────────────┘    └───────────────┘
        ▲                                         ▲
        └─── Leaves: keys AND values ─────────────┘
             Linked for range scans

Our Implementation: Direct Page Manipulation

We don't create "node objects" in memory. Instead, we read/write bytes directly in page buffers. The "node" is just a lens over page bytes.

Traditional approach:                Our approach:

┌─────────────┐                     ┌─────────────┐
│ Page bytes  │                     │ Page bytes  │
└──────┬──────┘                     └──────┬──────┘
       │ deserialize                       │
       ▼                                   │ direct access
┌─────────────┐                            │
│ Node struct │                            │
│ - keys[]    │                            ▼
│ - values[]  │                     Read/write at
│ - children[]│                     calculated offsets
└──────┬──────┘
       │ serialize
       ▼
┌─────────────┐
│ Page bytes  │
└─────────────┘

No intermediate objects. No serialization overhead.

Page Layout

Leaf Node

┌────────────────────────────────────────────────────────────────┐
│                         Leaf Page (4KB)                         │
├────────────────────────────────────────────────────────────────┤
│ Bytes 0-7:   Page Header (checksum, type=leaf, flags)          │
├────────────────────────────────────────────────────────────────┤
│ Bytes 8-9:   Entry count (u16)                                 │
│ Bytes 10-11: Flags (u16)                                       │
│ Bytes 12-15: Next leaf page (u32)                              │
│ Bytes 16-19: Prev leaf page (u32)                              │
├────────────────────────────────────────────────────────────────┤
│ Slot 0: entry_offset (u16)                                      │
│ Slot 1: entry_offset (u16)                                      │
│ Slot 2: ...                                                    │
│         ...                                                    │
│                    [free space grows down ↓]                   │
│                                                                │
│                    [entry bytes grow down ↓ from page end]      │
│                                                                │
│ [key_len][value_len][key2][value2-or-overflow-descriptor]      │
│ [key_len][value_len][key1][value1-or-overflow-descriptor]      │
└────────────────────────────────────────────────────────────────┘
                                                          Byte 4095

Variable-length entries: Keys and values are stored at the end of the page, growing downward from high offsets. Slots at the front point to the entry records.

Large values are stored through generic B+Tree overflow records. A leaf entry uses the normal key bytes plus a compact overflow descriptor; the descriptor contains the total value length, first overflow page, page count, checksum, and a small local prefix. Point reads and range scans materialize the full value transparently, and deletes free the overflow chain.

Values up to 64 MiB are accepted by the B+Tree layer when the key and overflow descriptor can fit in an empty leaf. Inline-capable values may still be stored as overflow records when keeping them inline would make future 4 KiB leaf splits impossible to divide into two valid ordered pages.

Internal Node

┌────────────────────────────────────────────────────────────────┐
│                       Internal Page (4KB)                       │
├────────────────────────────────────────────────────────────────┤
│ Bytes 0-7:   Page Header (checksum, type=internal, flags)      │
├────────────────────────────────────────────────────────────────┤
│ Bytes 8-9:   Key count (u16)                                   │
│ Bytes 10-11: Level (u16; 0 = parent of leaves)                 │
│ Bytes 12-15: Rightmost child (u32)                             │
├────────────────────────────────────────────────────────────────┤
│ Slot 0: key_offset (u16), child_page (u32)                     │
│ Slot 1: key_offset (u16), child_page (u32)                     │
│         ...                                                    │
│                    [free space]                                │
│                                                                │
│ "separator_key2"                                               │
│ "separator_key1"                                               │
└────────────────────────────────────────────────────────────────┘

Point Lookup

Finding a value by key:

pub fn get(self: *Self, key: []const u8) !?[]const u8 {
    // 1. Start at root
    var page_id = self.root_page;

    while (true) {
        // 2. Fetch page from buffer pool
        const frame = try self.bp.fetchPage(page_id, .shared);
        defer self.bp.unpinPage(frame, false);

        const page_type = getPageType(frame.data);

        if (page_type == .btree_leaf) {
            // 3. Binary search in leaf
            return LeafNode.search(frame.data, key, self.comparator);
        } else {
            // 4. Binary search for child pointer, descend
            page_id = InternalNode.findChild(frame.data, key, self.comparator);
        }
    }
}

Example lookup for key "dog":

                        Root (Internal)
                    ┌─────────────────────┐
                    │  [cat]     [fish]   │
                    └───┬──────────┬──────┘
                       ╱           ╲
     "dog" > "cat"    ╱             ╲
     "dog" < "fish"  ╱               ╲
                    ▼
             ┌─────────────────┐
             │ cow:v1  dog:v2  │ ◄── Found! Return v2
             └─────────────────┘

Insertion

Simple Case: Space in Leaf

pub fn insert(self: *Self, key: []const u8, value: []const u8) !void {
    // 1. Find the leaf
    const leaf_page = try self.findLeaf(key);

    // 2. Fetch with exclusive latch
    const frame = try self.bp.fetchPage(leaf_page, .exclusive);
    defer self.bp.unpinPage(frame, true);

    // 3. Prepare either inline bytes or an overflow descriptor
    const prepared = try self.prepareStoredValue(key, value);

    // 4. Check if the prepared entry fits
    if (LeafNode.hasSpace(frame.data, key.len, prepared.value.len)) {
        // 5. Insert directly
        LeafNode.insert(frame.data, key, prepared.value, self.comparator);
    } else {
        // 6. Need to split
        try self.splitLeafAndInsert(frame, key, prepared);
    }
}

Splitting a Leaf

When a leaf is full:

Before split:
┌─────────────────────────────────────┐
│  a:1  b:2  c:3  d:4  e:5  [FULL]   │
└─────────────────────────────────────┘
                 │
                 │ Insert "f:6" - no room!
                 ▼
After split:
┌──────────────────────┐    ┌──────────────────────┐
│  a:1  b:2  c:3       │◄──►│  d:4  e:5  f:6       │
└──────────────────────┘    └──────────────────────┘
         │                            │
         └───────────┬────────────────┘
                     │
                     ▼
              "d" promoted to parent

fn splitLeaf(self: *Self, frame: *BufferFrame, key: []const u8, value: []const u8) !void {
    // 1. Allocate new page
    const new_page = try self.pm.allocatePage();
    const new_frame = try self.bp.fetchPage(new_page, .exclusive);

    // 2. Find a byte-balanced split point that keeps both leaves valid.
    const split_point = try self.chooseLeafSplitPoint(frame.data, key, value);

    // 3. Move upper half to new page
    LeafNode.moveEntries(frame.data, new_frame.data, split_point, entry_count);

    // 4. Insert new key into appropriate page
    if (compare(key, split_key) < 0) {
        LeafNode.insert(frame.data, key, value);
    } else {
        LeafNode.insert(new_frame.data, key, value);
    }

    // 5. Update sibling pointers
    LeafNode.setNext(frame.data, new_page);
    LeafNode.setPrev(new_frame.data, frame.page_id);

    // 6. Get separator key and insert into parent
    const separator = LeafNode.getFirstKey(new_frame.data);
    try self.insertIntoParent(frame.page_id, separator, new_page);
}

The split point is chosen by serialized byte size, not just entry count. This matters when a leaf mixes small keys with large values: both halves must fit in the configured page size after the new entry is included.

Growing the Tree

When the root splits, we create a new root:

Before:
          ┌─────────────┐
          │  Root (full)│
          └─────────────┘

After root split:
          ┌─────────────┐
          │  New Root   │  ◄── New level!
          │    [50]     │
          └──────┬──────┘
                ╱ ╲
    ┌──────────┘   └──────────┐
    ▼                         ▼
┌─────────────┐      ┌─────────────┐
│  Old Root   │      │  New Page   │
│  (left)     │      │  (right)    │
└─────────────┘      └─────────────┘

The tree grows at the top, not the bottom. All leaves stay at the same depth.

Range Scan

Thanks to linked leaves, range scans are efficient:

pub fn range(self: *Self, start: ?[]const u8, end: ?[]const u8) !Iterator {
    // 1. Find starting leaf
    const start_page = if (start) |k|
        try self.findLeaf(k)
    else
        try self.findLeftmostLeaf();

    return Iterator{
        .btree = self,
        .current_page = start_page,
        .current_slot = 0,
        .end_key = end,
    };
}

Iterator walks the leaf chain:

Start key: "dog"     End key: "hamster"

   ┌─────────────────────────────────────┐
   │           Internal nodes            │
   └─────────────────────────────────────┘
                     │
                     ▼
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ cat│cow│    │──►│ dog│elk│fox│──►│goat│ham│    │
└─────────────┘   └─────────────┘   └─────────────┘
                    ▲                   ▲
                    │ start             │ end
                    └───────────────────┘
                          scan range

No need to traverse internal nodes for each entry - just follow the links.

Concurrency

We use latch crabbing (simplified):

Lookup (read-only):
    1. Latch root (shared)
    2. Latch child (shared)
    3. Unlatch root
    4. Latch grandchild (shared)
    5. Unlatch child
    ... continue to leaf

Insert (may modify):
    1. Latch root (exclusive)
    2. If child is safe (won't split), latch child and unlatch root
    3. Continue down

A "safe" node is one with enough space that it won't split (for insert) or won't underflow (for delete).

The BTree Struct

pub const BTree = struct {
    bp: *BufferPool,           // Page cache
    root_page: PageId,         // Current root
    comparator: KeyComparator, // For ordering
    page_size: u32,            // Usually 4096
    allocator: Allocator,

    // ...methods
};

This is just ~40 bytes of metadata. The actual data lives in pages on disk.

Key Design Decisions

Variable-Length Keys and Values

We store the actual bytes, not fixed-size slots. This supports:

Short keys efficiently (no wasted space)
Long keys (up to page size)
Any binary data

Separator Keys in Internal Nodes

Internal nodes store separator keys, not full key-value pairs. This maximizes fanout (keys per internal node).

Delete Compaction Without Rebalancing

Deletes remove the slot, rebuild the surviving leaf entries contiguously, and free any overflow chain owned by the removed entry. The implementation still does not merge or redistribute underfull pages, so a delete-heavy workload can leave low-density leaves, but the freed space within each leaf is reusable immediately by later inserts or updates.

Page-Based, With Explicit Overflow

Each B+Tree node is exactly one page. Large values span linked overflow pages through an explicit descriptor stored in the leaf; ordinary internal and leaf pages remain fixed-size and directly addressable.

Example: Full Insert Sequence

Insert "zebra" into a tree:

Step 1: Start at root
┌─────────────────────────────────────────────┐
│  Root (Internal)                            │
│  [lion]           [rabbit]                  │
│    ↓                 ↓                      │
│  page 2           page 3                    │→ page 4 (rightmost)
└─────────────────────────────────────────────┘
"zebra" > "rabbit" → go right to page 4

Step 2: Descend to leaf (page 4)
┌─────────────────────────────────────────────┐
│  Leaf (Page 4)                              │
│  snake:v1  tiger:v2  wolf:v3                │
│  [has space for zebra]                      │
└─────────────────────────────────────────────┘

Step 3: Insert into leaf
┌─────────────────────────────────────────────┐
│  Leaf (Page 4)                              │
│  snake:v1  tiger:v2  wolf:v3  zebra:v4     │
└─────────────────────────────────────────────┘
Done! No splits needed.

Performance Characteristics

Operation	Average Case	Worst Case
Lookup	O(log N)	O(log N)
Insert	O(log N)	O(log N)
Range(k)	O(log N + k)	O(log N + k)

Where N is the number of entries and k is the number of entries in range.

Disk I/O per operation: ~3-4 reads for trees up to billions of entries.

LatticeDB Documentation