B+Tree

What It Is

A B+Tree is a self-balancing tree structure optimized for disk-based storage. It provides O(log N) lookups, inserts, and deletes, with efficient range scans.

Why B+Tree?

The Problem with Binary Trees

A binary search tree with 1 million entries is ~20 levels deep (log2(1M) ≈ 20). Each level requires a disk read. That's 20 disk reads per lookup!

B+Tree: Wide and Shallow

A B+Tree has many keys per node (hundreds), making it very shallow:

Binary Tree (1M entries):          B+Tree (1M entries):
        depth ~20                       depth ~3

          ○                               ○
         / \                         /    |    \
        ○   ○                       ○     ○     ○
       /\   /\                    / | \ / | \ / | \
      ○ ○ ○  ○                   ○  ○  ○ ○  ○  ○ ○ ○
     ... (20 levels)             (leaf level)

    20 disk reads                   3 disk reads!

With 100 keys per node: log100(1M) ≈ 3 levels.

Data Only in Leaves

B+Trees store all data in leaf nodes. Internal nodes only contain keys for routing:

                    ┌─────────────────────┐
                    │   [30]  [60]  [90]  │  ◄── Internal: keys only
                    └───┬──────┬──────┬───┘
                       ╱       │       ╲
        ┌─────────────┘        │        └─────────────┐
        ▼                      ▼                      ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│ 10:val 20:val │◄──►│ 30:val 50:val │◄──►│ 60:val 80:val │
└───────────────┘    └───────────────┘    └───────────────┘
        ▲                                         ▲
        └─── Leaves: keys AND values ─────────────┘
             Linked for range scans

Our Implementation: Direct Page Manipulation

We don't create "node objects" in memory. Instead, we read/write bytes directly in page buffers. The "node" is just a lens over page bytes.

Traditional approach:                Our approach:

┌─────────────┐                     ┌─────────────┐
│ Page bytes  │                     │ Page bytes  │
└──────┬──────┘                     └──────┬──────┘
       │ deserialize                       │
       ▼                                   │ direct access
┌─────────────┐                            │
│ Node struct │                            │
│ - keys[]    │                            ▼
│ - values[]  │                     Read/write at
│ - children[]│                     calculated offsets
└──────┬──────┘
       │ serialize
       ▼
┌─────────────┐
│ Page bytes  │
└─────────────┘

No intermediate objects. No serialization overhead.

Page Layout

Leaf Node

┌────────────────────────────────────────────────────────────────┐
│                         Leaf Page (4KB)                         │
├────────────────────────────────────────────────────────────────┤
│ Bytes 0-7:   Page Header (checksum, type=leaf, flags)          │
├────────────────────────────────────────────────────────────────┤
│ Bytes 8-9:   Entry count (u16)                                 │
│ Bytes 10-13: Prev leaf page (u32)                              │
│ Bytes 14-17: Next leaf page (u32)                              │
│ Bytes 18-19: Free space offset (u16)                           │
├────────────────────────────────────────────────────────────────┤
│ Slot 0: key_offset, key_len, val_offset, val_len (8 bytes)     │
│ Slot 1: key_offset, key_len, val_offset, val_len (8 bytes)     │
│ Slot 2: ...                                                    │
│         ...                                                    │
│                    [free space grows down ↓]                   │
│                                                                │
│                    [entries grow up ↑]                         │
│                                                                │
│ "value2"                                                       │
│ "key2"                                                         │
│ "value1"                                                       │
│ "key1"                                                         │
└────────────────────────────────────────────────────────────────┘
                                                          Byte 4095

Variable-length entries: Keys and values are stored at the end of the page, growing upward. Slots at the front point to them.

Internal Node

┌────────────────────────────────────────────────────────────────┐
│                       Internal Page (4KB)                       │
├────────────────────────────────────────────────────────────────┤
│ Bytes 0-7:   Page Header (checksum, type=internal, flags)      │
├────────────────────────────────────────────────────────────────┤
│ Bytes 8-9:   Key count (u16)                                   │
│ Bytes 10-13: Rightmost child (u32)                             │
│ Bytes 14-15: Free space offset (u16)                           │
├────────────────────────────────────────────────────────────────┤
│ Slot 0: child_page, key_offset, key_len (8 bytes)              │
│ Slot 1: child_page, key_offset, key_len (8 bytes)              │
│         ...                                                    │
│                    [free space]                                │
│                                                                │
│ "separator_key2"                                               │
│ "separator_key1"                                               │
└────────────────────────────────────────────────────────────────┘

Point Lookup

Finding a value by key:

pub fn get(self: *Self, key: []const u8) !?[]const u8 {
    // 1. Start at root
    var page_id = self.root_page;

    while (true) {
        // 2. Fetch page from buffer pool
        const frame = try self.bp.fetchPage(page_id, .shared);
        defer self.bp.unpinPage(frame, false);

        const page_type = getPageType(frame.data);

        if (page_type == .btree_leaf) {
            // 3. Binary search in leaf
            return LeafNode.search(frame.data, key, self.comparator);
        } else {
            // 4. Binary search for child pointer, descend
            page_id = InternalNode.findChild(frame.data, key, self.comparator);
        }
    }
}

Example lookup for key "dog":

                        Root (Internal)
                    ┌─────────────────────┐
                    │  [cat]     [fish]   │
                    └───┬──────────┬──────┘
                       ╱           ╲
     "dog" > "cat"    ╱             ╲
     "dog" < "fish"  ╱               ╲
                    ▼
             ┌─────────────────┐
             │ cow:v1  dog:v2  │ ◄── Found! Return v2
             └─────────────────┘

Insertion

Simple Case: Space in Leaf

pub fn insert(self: *Self, key: []const u8, value: []const u8) !void {
    // 1. Find the leaf
    const leaf_page = try self.findLeaf(key);

    // 2. Fetch with exclusive latch
    const frame = try self.bp.fetchPage(leaf_page, .exclusive);
    defer self.bp.unpinPage(frame, true);

    // 3. Check if there's space
    if (LeafNode.hasSpace(frame.data, key.len, value.len)) {
        // 4. Insert directly
        LeafNode.insert(frame.data, key, value, self.comparator);
    } else {
        // 5. Need to split
        try self.splitLeafAndInsert(frame, key, value);
    }
}

Splitting a Leaf

When a leaf is full:

Before split:
┌─────────────────────────────────────┐
│  a:1  b:2  c:3  d:4  e:5  [FULL]   │
└─────────────────────────────────────┘
                 │
                 │ Insert "f:6" - no room!
                 ▼
After split:
┌──────────────────────┐    ┌──────────────────────┐
│  a:1  b:2  c:3       │◄──►│  d:4  e:5  f:6       │
└──────────────────────┘    └──────────────────────┘
         │                            │
         └───────────┬────────────────┘
                     │
                     ▼
              "d" promoted to parent
fn splitLeaf(self: *Self, frame: *BufferFrame, key: []const u8, value: []const u8) !void {
    // 1. Allocate new page
    const new_page = try self.pm.allocatePage();
    const new_frame = try self.bp.fetchPage(new_page, .exclusive);

    // 2. Find split point (middle)
    const entry_count = LeafNode.getCount(frame.data);
    const split_point = entry_count / 2;

    // 3. Move upper half to new page
    LeafNode.moveEntries(frame.data, new_frame.data, split_point, entry_count);

    // 4. Insert new key into appropriate page
    if (compare(key, split_key) < 0) {
        LeafNode.insert(frame.data, key, value);
    } else {
        LeafNode.insert(new_frame.data, key, value);
    }

    // 5. Update sibling pointers
    LeafNode.setNext(frame.data, new_page);
    LeafNode.setPrev(new_frame.data, frame.page_id);

    // 6. Get separator key and insert into parent
    const separator = LeafNode.getFirstKey(new_frame.data);
    try self.insertIntoParent(frame.page_id, separator, new_page);
}

Growing the Tree

When the root splits, we create a new root:

Before:
          ┌─────────────┐
          │  Root (full)│
          └─────────────┘

After root split:
          ┌─────────────┐
          │  New Root   │  ◄── New level!
          │    [50]     │
          └──────┬──────┘
                ╱ ╲
    ┌──────────┘   └──────────┐
    ▼                         ▼
┌─────────────┐      ┌─────────────┐
│  Old Root   │      │  New Page   │
│  (left)     │      │  (right)    │
└─────────────┘      └─────────────┘

The tree grows at the top, not the bottom. All leaves stay at the same depth.

Range Scan

Thanks to linked leaves, range scans are efficient:

pub fn range(self: *Self, start: ?[]const u8, end: ?[]const u8) !Iterator {
    // 1. Find starting leaf
    const start_page = if (start) |k|
        try self.findLeaf(k)
    else
        try self.findLeftmostLeaf();

    return Iterator{
        .btree = self,
        .current_page = start_page,
        .current_slot = 0,
        .end_key = end,
    };
}

Iterator walks the leaf chain:

Start key: "dog"     End key: "hamster"

   ┌─────────────────────────────────────┐
   │           Internal nodes            │
   └─────────────────────────────────────┘
                     │
                     ▼
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ cat│cow│    │──►│ dog│elk│fox│──►│goat│ham│    │
└─────────────┘   └─────────────┘   └─────────────┘
                    ▲                   ▲
                    │ start             │ end
                    └───────────────────┘
                          scan range

No need to traverse internal nodes for each entry - just follow the links.

Concurrency

We use latch crabbing (simplified):

Lookup (read-only):
    1. Latch root (shared)
    2. Latch child (shared)
    3. Unlatch root
    4. Latch grandchild (shared)
    5. Unlatch child
    ... continue to leaf

Insert (may modify):
    1. Latch root (exclusive)
    2. If child is safe (won't split), latch child and unlatch root
    3. Continue down

A "safe" node is one with enough space that it won't split (for insert) or won't underflow (for delete).

The BTree Struct

pub const BTree = struct {
    bp: *BufferPool,           // Page cache
    root_page: PageId,         // Current root
    comparator: KeyComparator, // For ordering
    page_size: u32,            // Usually 4096
    allocator: Allocator,

    // ...methods
};

This is just ~40 bytes of metadata. The actual data lives in pages on disk.

Key Design Decisions

Variable-Length Keys and Values

We store the actual bytes, not fixed-size slots. This supports:

  • Short keys efficiently (no wasted space)
  • Long keys (up to page size)
  • Any binary data

Separator Keys in Internal Nodes

Internal nodes store separator keys, not full key-value pairs. This maximizes fanout (keys per internal node).

No Deletion (Yet)

Our implementation doesn't handle deletes with rebalancing. For a knowledge graph database, we can use soft deletes (mark as deleted) and periodic compaction.

Page-Based, Not Block-Based

Each node is exactly one page. No spanning, no compression across pages. Simple and predictable.

Example: Full Insert Sequence

Insert "zebra" into a tree:

Step 1: Start at root
┌─────────────────────────────────────────────┐
│  Root (Internal)                            │
│  [lion]           [rabbit]                  │
│    ↓                 ↓                      │
│  page 2           page 3                    │→ page 4 (rightmost)
└─────────────────────────────────────────────┘
"zebra" > "rabbit" → go right to page 4

Step 2: Descend to leaf (page 4)
┌─────────────────────────────────────────────┐
│  Leaf (Page 4)                              │
│  snake:v1  tiger:v2  wolf:v3                │
│  [has space for zebra]                      │
└─────────────────────────────────────────────┘

Step 3: Insert into leaf
┌─────────────────────────────────────────────┐
│  Leaf (Page 4)                              │
│  snake:v1  tiger:v2  wolf:v3  zebra:v4     │
└─────────────────────────────────────────────┘
Done! No splits needed.

Performance Characteristics

OperationAverage CaseWorst Case
LookupO(log N)O(log N)
InsertO(log N)O(log N)
Range(k)O(log N + k)O(log N + k)

Where N is the number of entries and k is the number of entries in range.

Disk I/O per operation: ~3-4 reads for trees up to billions of entries.