
# Architecture

BucketDB is designed around a **Write-Forward** paradigm with **immutable blocks**, optimized for environments where horizontal read-scalability and zero-downtime schema migrations are prioritized over high-throughput single-record transactional processing.

This document provides a deep dive into the internal mechanics of BucketDB, explaining how data flows from the application down to the underlying storage drivers.

***

## 1. Core Concepts

### Immutable Blocks

Unlike traditional databases that update files or blocks in place, BucketDB never modifies an existing data block. Every flush of data creates a *new* block. Data blocks are strictly formatted with fixed-width rows and a variable-length heap for dynamic data (like strings or JSON). Because blocks are immutable, readers (queries) never encounter a partially updated or locked file.
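
The fixed-width row plus heap layout can be illustrated with a small sketch. The 12-byte row format, the 4-byte row-count header, and the names `encodeBlock` and `readRow` are assumptions made for this example; they are not BucketDB's actual binary codec.

```typescript
// Hypothetical layout (not BucketDB's real format):
//   [4-byte row count][fixed-width rows][variable-length heap]
// Each row is 12 bytes: id, heap offset, heap length (all uint32).
const ROW_WIDTH = 12;

interface UserRow {
  id: number;
  name: string;
}

function encodeBlock(rows: UserRow[]): Uint8Array {
  const encoder = new TextEncoder();
  const encoded = rows.map((r) => ({ id: r.id, bytes: encoder.encode(r.name) }));
  const heapSize = encoded.reduce((n, e) => n + e.bytes.length, 0);
  const rowRegion = rows.length * ROW_WIDTH;
  const block = new Uint8Array(4 + rowRegion + heapSize);
  const view = new DataView(block.buffer);
  view.setUint32(0, rows.length);
  let heapOffset = 0;
  encoded.forEach((e, i) => {
    const base = 4 + i * ROW_WIDTH;
    view.setUint32(base, e.id);
    view.setUint32(base + 4, heapOffset);
    view.setUint32(base + 8, e.bytes.length);
    // Dynamic data lives in the heap; the row only stores a pointer to it.
    block.set(e.bytes, 4 + rowRegion + heapOffset);
    heapOffset += e.bytes.length;
  });
  return block;
}

function readRow(block: Uint8Array, index: number): UserRow {
  const view = new DataView(block.buffer, block.byteOffset, block.byteLength);
  const count = view.getUint32(0);
  if (index < 0 || index >= count) throw new RangeError("row index out of range");
  // Fixed-width rows make the row position a simple multiplication.
  const base = 4 + index * ROW_WIDTH;
  const heapStart = 4 + count * ROW_WIDTH + view.getUint32(base + 4);
  const bytes = block.subarray(heapStart, heapStart + view.getUint32(base + 8));
  return { id: view.getUint32(base), name: new TextDecoder().decode(bytes) };
}
```

Because every row has the same width, a reader can jump straight to row `i` without scanning, and the block never needs to change after it is written.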

### Write-Forward Log (WAL)

When a node executes a write operation (insert, update, delete), it does not immediately rewrite the data blocks. Instead, it packages the operations into a batch and writes that batch to a distributed Write-Forward Log (stored under the `committed/` prefix). The write is fast because it never reads or rewrites existing blocks, and the batch is durably committed once the storage driver acknowledges it.
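
As a rough sketch of this idea, the example below collects operations in memory and writes them as a single object under `committed/`. The `MemoryDriver` and `Batch` classes, the JSON encoding, and the `flush(sequence)` signature are hypothetical stand-ins chosen for illustration; the real batch API and binary codec may differ.

```typescript
// Hypothetical operation shapes for illustration only.
type Operation =
  | { kind: "insert"; table: string; row: Record<string, unknown> }
  | { kind: "update"; table: string; id: string; changes: Record<string, unknown> }
  | { kind: "delete"; table: string; id: string };

// A toy in-memory key-value driver standing in for S3 or the file system.
class MemoryDriver {
  private objects = new Map<string, string>();
  put(key: string, body: string): void {
    this.objects.set(key, body);
  }
  get(key: string): string | undefined {
    return this.objects.get(key);
  }
  keys(prefix: string): string[] {
    return Array.from(this.objects.keys()).filter((k) => k.startsWith(prefix)).sort();
  }
}

class Batch {
  private ops: Operation[] = [];
  constructor(
    private driver: MemoryDriver,
    private dbPrefix: string,
    private nodeId: number,
  ) {}

  add(op: Operation): this {
    this.ops.push(op);
    return this;
  }

  // One object write per flush: data blocks are not read or modified here.
  flush(sequence: number): string {
    const key = `${this.dbPrefix}/committed/${this.nodeId}-${sequence}`;
    this.driver.put(key, JSON.stringify(this.ops));
    this.ops = [];
    return key;
  }
}
```

The point of the sketch is that a flush touches only the `committed/` prefix; merging into `data/` is deferred to the background service.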

### Timeshare & MISC Coordination

S3 and compatible storage layers provide no cross-object transactions, and some implementations are only eventually consistent. If multiple nodes merged the Write-Forward Log into the immutable blocks at the same time, they would race against each other and could publish conflicting block sets.

BucketDB solves this using a **Timeshare** authority model.

* Nodes are assigned an ID (e.g., 0 to 255).
* Write authority is passed around the nodes based on a strict time window (default 5000ms).
* **MISC (Monotonically Increasing System Clocks)**: BucketDB uses a combination of high-resolution time (HRT) and the Node ID to generate universally unique, perfectly sortable 64-bit integers.
* Only the node currently holding authority is permitted to act as the "Leader" and run the background Write-Forward Service.
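
The two mechanisms above can be sketched as follows. This is a simplified illustration under stated assumptions: authority is assumed to rotate round-robin by node index, and the MISC value is built from a millisecond tick in the high bits and an 8-bit node ID in the low bits, kept within JavaScript's safe-integer range instead of a true 64-bit value. The names `authorityHolder` and `MiscClock` are hypothetical.

```typescript
const WINDOW_MS = 5000; // default time window mentioned above

// Assumption: authority rotates round-robin across `nodeCount` nodes.
function authorityHolder(nowMs: number, nodeCount: number, windowMs: number = WINDOW_MS): number {
  return Math.floor(nowMs / windowMs) % nodeCount;
}

class MiscClock {
  private lastTick = 0;

  constructor(private readonly nodeId: number) {
    if (!Number.isInteger(nodeId) || nodeId < 0 || nodeId > 255) {
      throw new RangeError("nodeId must be an integer between 0 and 255");
    }
  }

  // Returns a value that is strictly increasing on this node and cannot
  // collide with another node's value, because the low 8 bits hold the ID.
  next(nowMs: number = Date.now()): number {
    // If the wall clock stalls or moves backwards, advance by one tick anyway.
    const tick = Math.max(Math.floor(nowMs), this.lastTick + 1);
    this.lastTick = tick;
    return tick * 256 + this.nodeId;
  }
}
```

With three nodes, `authorityHolder` selects node 0 for the first 5000 ms window, node 1 for the next, and so on. Sorting MISC values numerically orders them by time first and by node ID second.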

***

## 2. Storage Topography

Whether using the S3 Driver, File Driver, or Memory Driver, the internal key-value layout remains consistent:

```
driver-root/
  └── {dbPrefix}/          ← Database root (e.g., "prod_db")
      ├── block0           ← Root metadata (schemas, pointers)
      ├── data/            ← Immutable data blocks
      │   ├── 0001a2b3c4d5e6f7
      │   └── ...
      ├── indexes/         ← B+Tree blocks
      │   └── (indexName)/(blockId)
      ├── committed/       ← Write-Forward Log (batches awaiting merge)
      │   ├── (inverted_misc_key_1)
      │   └── ...
      └── blobs/           ← Managed Unstructured Blob storage
          └── sha256hash1
```

***

## 3. Data Flow Pipelines

### The Write Path (Batch Flush)

When a user calls `batch.flush()`, the following occurs:

```mermaid
flowchart TD
    App[Application] --> |insert/update/delete| BatchCache[Local Memory Batch]
    BatchCache --> |"flush()"| Codec[Binary Codec]
    Codec --> |Encode to Row & Heap| Payload[Batch Payload]
    Payload --> |Generate MISC| MISC[MISC Clock]
    MISC --> |Assign Sequence Number| S3Key[S3 committed/nodeId-sequenceNum]
    S3Key --> Storage[(Storage Driver)]
```

*Note: We use sequential naming for WAL files to allow fast sequence-guessing via S3 `get` operations, avoiding expensive `list` calls. Absolute ordering is maintained internally via the MISC payload.*
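
A minimal sketch of sequence-guessing, assuming the `committed/nodeId-sequenceNum` key shape from the diagram above and a hypothetical `get` function that returns `undefined` for a missing key:

```typescript
type Getter = (key: string) => string | undefined;

interface ProbeResult {
  batches: string[];
  nextSequence: number;
}

// Probe consecutive sequence numbers with point reads until one is missing.
// No list call is needed because the next key name is predictable.
function loadNewBatches(
  get: Getter,
  dbPrefix: string,
  nodeId: number,
  nextSequence: number,
): ProbeResult {
  const batches: string[] = [];
  let seq = nextSequence;
  for (;;) {
    const body = get(`${dbPrefix}/committed/${nodeId}-${seq}`);
    if (body === undefined) break; // first gap: nothing newer from this node yet
    batches.push(body);
    seq += 1;
  }
  return { batches, nextSequence: seq };
}
```

The caller would keep `nextSequence` per node and resume probing from there on the next refresh.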

### The Write-Forward Path (Background Service)

This runs *only* on the node currently holding write authority.

```mermaid
flowchart TD
    Leader[Leader Node] --> |Guess Sequence Numbers| Storage[(Storage Driver)]
    Storage --> |Load Unprocessed Batches| Cache[Committed Cache]
    Cache --> |Read Affected Blocks| DataBlocks[data/ blocks]
    DataBlocks --> |Merge Rows| NewBlocks[New Immutable Blocks]
    NewBlocks --> |Write| Storage
    Storage --> |Atomic ETag Swap| Block0[Update Block 0 Index]
    Block0 --> |Delete Processed Log| Clean[Delete committed/ keys]
```
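
The merge step in this pipeline can be sketched with in-memory maps. Everything here is a simplification for illustration: the `Store` shape, the single active block, the `upsert`/`delete` operations, and the assumption that committed keys sort in commit order are not taken from BucketDB's implementation.

```typescript
interface Row {
  id: string;
  [field: string]: unknown;
}

type Op = { kind: "upsert"; row: Row } | { kind: "delete"; id: string };

interface Store {
  blocks: Map<string, Row[]>; // stands in for the data/ prefix
  committed: Map<string, Op[]>; // stands in for the committed/ prefix
  block0: { activeBlock: string | null };
}

function writeForward(store: Store, newBlockId: string): void {
  // Assumption: lexicographic key order matches commit order.
  const keys = Array.from(store.committed.keys()).sort();
  if (keys.length === 0) return;

  const active = store.block0.activeBlock;
  const current = active === null ? [] : (store.blocks.get(active) ?? []);

  // Merge pending operations over a copy of the current rows.
  const merged = new Map<string, Row>();
  for (const row of current) merged.set(row.id, row);
  for (const key of keys) {
    for (const op of store.committed.get(key) ?? []) {
      if (op.kind === "upsert") merged.set(op.row.id, op.row);
      else merged.delete(op.id);
    }
  }

  // 1. Write a brand-new block; the old block is left untouched.
  store.blocks.set(newBlockId, Array.from(merged.values()));
  // 2. Publish it by swapping the pointer in Block 0.
  store.block0 = { activeBlock: newBlockId };
  // 3. Only then delete the processed log entries.
  for (const key of keys) store.committed.delete(key);
}
```

The ordering matters: readers following the old Block 0 pointer still see a complete old block, and log entries disappear only after the new block is published.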

### The Query Path

Queries must account for both the optimized on-disk state (data blocks and indexes) and the batches still pending in the Write-Forward Log.

```mermaid
sequenceDiagram
    participant App
    participant QueryEngine
    participant Indexes
    participant Cache as Committed Cache
    participant Driver as Storage

    App->>QueryEngine: db.query('users').eq('role', 'admin').execute()
    QueryEngine->>Indexes: Check CoW B+Tree for 'role'
    Indexes-->>QueryEngine: Yield block pointers
    QueryEngine->>Cache: Refresh pending commits (if stale > 1s)
    Cache->>Driver: Scan committed/ prefix
    Driver-->>Cache: Return new batches
    QueryEngine->>QueryEngine: Apply Query Optimizer (100-row sampling)
    QueryEngine->>QueryEngine: Overlay pending cache on top of Index results
    QueryEngine-->>App: Return merged rows
```
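
The final overlay step can be illustrated with a short sketch. The `PendingOp` shape and the `queryByRole` function are hypothetical; the sketch assumes pending operations are already in commit order and that an upsert carries the full row.

```typescript
interface UserRow {
  id: string;
  role: string;
}

type PendingOp = { kind: "upsert"; row: UserRow } | { kind: "delete"; id: string };

// `indexed` is what the index lookup returned from merged blocks;
// `pending` is the not-yet-merged content of the committed cache.
function queryByRole(indexed: UserRow[], pending: PendingOp[], role: string): UserRow[] {
  const result = new Map<string, UserRow>();
  for (const row of indexed) {
    if (row.role === role) result.set(row.id, row);
  }
  for (const op of pending) {
    if (op.kind === "delete") {
      result.delete(op.id);
    } else if (op.row.role === role) {
      result.set(op.row.id, op.row); // new or updated row that matches
    } else {
      result.delete(op.row.id); // updated row no longer matches the filter
    }
  }
  return Array.from(result.values());
}
```

Pending writes take precedence over block contents, so a query reflects committed batches even before the leader has merged them.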

***

## 4. Advanced Components

### Block 0 Management

`Block 0` is the absolute source of truth for the database. It contains:

* The Schema Registry (serialized).
* The Function Registry (serialized).
* The Block Index (pointers to the active `data/` blocks).
* Metadata regarding processed MISC commits.

Updates to Block 0 use **Conditional S3 ETags**. If the leader tries to write a new Block 0 but the ETag has changed (due to a rare race condition or clock skew), the write is rejected and the service must retry.
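
The retry behaviour can be sketched as a compare-and-swap loop. The `ConditionalStore` class below is a toy stand-in for a storage driver that supports conditional writes; its method names, the JSON encoding, and the `maxAttempts` limit are assumptions for this example.

```typescript
// Toy object store: the ETag changes on every successful write.
class ConditionalStore {
  private body = "{}";
  private version = 0;

  read(): { body: string; etag: string } {
    return { body: this.body, etag: String(this.version) };
  }

  // Succeeds only if the caller saw the latest version.
  writeIfMatch(body: string, etag: string): boolean {
    if (etag !== String(this.version)) return false;
    this.body = body;
    this.version += 1;
    return true;
  }
}

type Block0State = Record<string, unknown>;

// Read, modify, conditionally write; on rejection, re-read and try again.
// Returns the number of attempts that were needed.
function updateBlock0(
  store: ConditionalStore,
  mutate: (state: Block0State) => Block0State,
  maxAttempts: number = 5,
): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { body, etag } = store.read();
    const next = JSON.stringify(mutate(JSON.parse(body) as Block0State));
    if (store.writeIfMatch(next, etag)) return attempt;
  }
  throw new Error("Block 0 update failed after retries");
}
```

Re-reading before each retry ensures the mutation is applied on top of whatever the competing writer published, so neither update is lost.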

### B+Tree Indexes

As of v0.2.0, BucketDB features native B+Tree indexes.

* **Copy-on-Write (CoW)**: Similar to standard blocks, B+Tree nodes are never modified in place. An insert that splits a leaf node will result in new block files being written to the `indexes/` prefix, bubbling up to a new Root Pointer.
* The Root Pointer is atomically saved inside Block 0 during the Write-Forward phase.
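
Copy-on-write path copying can be shown on a much smaller structure. The sketch below uses a plain binary search tree as a stand-in for B+Tree blocks, so it does not show node splits; it only illustrates how an insert produces a new root while untouched subtrees are shared with the previous version.

```typescript
interface TreeNode {
  readonly key: number;
  readonly left: TreeNode | null;
  readonly right: TreeNode | null;
}

// Returns a new root. Only nodes on the path to the insert point are copied.
function insert(root: TreeNode | null, key: number): TreeNode {
  if (root === null) return { key, left: null, right: null };
  if (key === root.key) return root;
  return key < root.key
    ? { key: root.key, left: insert(root.left, key), right: root.right }
    : { key: root.key, left: root.left, right: insert(root.right, key) };
}

function contains(root: TreeNode | null, key: number): boolean {
  let node = root;
  while (node !== null) {
    if (key === node.key) return true;
    node = key < node.key ? node.left : node.right;
  }
  return false;
}
```

A reader holding the old root keeps seeing a consistent tree, which mirrors how queries holding an old Root Pointer are unaffected until Block 0 is swapped.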

### Managed Blobs

Unstructured data (such as images) is stored under the `blobPrefix`. The `createBlobTarget` API returns a stream; once the stream finishes, it yields a `BlobReference` object containing the SHA-256 hash. This hash is stored in the variable-length heap of the document that references the blob. When documents are deleted, their blob references are appended to a Write-Forward tombstone log (`_deleted_blobs`), which the `db.gc()` function later processes to physically remove the orphaned files without requiring full-table scans.
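
The tombstone approach can be sketched with an in-memory store. The `BlobStore` class and its methods are hypothetical and do not mirror the real `createBlobTarget` or `db.gc()` signatures; the sketch assumes blobs are keyed by the SHA-256 hash of their content and uses Node's built-in `crypto` module.

```typescript
import { createHash } from "crypto";

class BlobStore {
  private blobs = new Map<string, Uint8Array>(); // stands in for blobs/
  private deleted: string[] = []; // stands in for _deleted_blobs

  // Store the bytes under their SHA-256 hash and return the hash.
  put(data: Uint8Array): string {
    const hash = createHash("sha256").update(data).digest("hex");
    this.blobs.set(hash, data);
    return hash;
  }

  has(hash: string): boolean {
    return this.blobs.has(hash);
  }

  // Called when a document holding this reference is deleted.
  markDeleted(hash: string): void {
    this.deleted.push(hash);
  }

  // Walk only the tombstone log; documents are never scanned.
  // Returns the number of blobs physically removed.
  gc(): number {
    let removed = 0;
    for (const hash of this.deleted) {
      if (this.blobs.delete(hash)) removed += 1;
    }
    this.deleted = [];
    return removed;
  }
}
```

The cost of garbage collection is proportional to the number of tombstones, not to the size of the tables.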

