
# Cluster & High Availability

BucketDB is designed to scale horizontally across multiple stateless nodes without requiring a consensus protocol such as Raft or Paxos. It achieves this with a timesharing authority model in which every commit is stamped with a **MISC** (Monotonically Increasing System Clock) value.

## The Timesharing Model

In a distributed system, multiple nodes might try to merge writes from the Write-Forward log into the immutable data blocks simultaneously. This would cause race conditions and data corruption. To prevent this, only *one* node at a time is allowed to act as the "Leader".

### How Authority is Assigned

Instead of electing a leader via a consensus algorithm, BucketDB assigns leadership strictly based on time and a node's assigned ID.

1. When configuring a cluster, each node is assigned a unique `nodeId` (from 0 to 255).
2. The `writeForwardInterval` (default `5000` milliseconds) determines how often the timesharing window rotates.
3. The cluster dynamically coordinates based on the number of active nodes. The current window index modulo the active `ringSize` dictates which node holds authority.

```mermaid
stateDiagram-v2
    [*] --> Node0
    Node0 --> Node1: Window + 5000ms
    Node1 --> Node2: Window + 5000ms
    Node2 --> Node0: Window + 5000ms
```
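The rotation above can be sketched as a small calculation. This is a minimal illustration, not BucketDB's implementation: the function names are hypothetical, and it assumes that windows are aligned to the Unix epoch and that a node's slot in the ring equals its `nodeId`.

```typescript
// Hypothetical helpers; only nodeId, writeForwardInterval and ringSize come from the text.
const WRITE_FORWARD_INTERVAL_MS = 5000; // documented default

// Index of the timesharing window that contains `nowMs`.
// Assumption: windows are aligned to the Unix epoch.
function windowIndex(nowMs: number, intervalMs: number = WRITE_FORWARD_INTERVAL_MS): number {
  return Math.floor(nowMs / intervalMs);
}

// Ring slot that holds authority during the window that contains `nowMs`.
function authoritySlot(
  nowMs: number,
  ringSize: number,
  intervalMs: number = WRITE_FORWARD_INTERVAL_MS
): number {
  if (ringSize <= 0 || Math.floor(ringSize) !== ringSize) {
    throw new RangeError("ringSize must be a positive integer");
  }
  return windowIndex(nowMs, intervalMs) % ringSize;
}

// Assumption: a node's ring slot is the same number as its nodeId.
function holdsAuthority(nodeId: number, nowMs: number, ringSize: number): boolean {
  return authoritySlot(nowMs, ringSize) === nodeId;
}
```

With a ring of three nodes and the default interval, a timestamp of 12,000 ms falls in window 2, so slot 2 holds authority; at 15,000 ms the ring wraps back to slot 0.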

### The Role of the Leader

The node currently holding authority becomes the Leader. Its sole responsibility is to run the background **Write-Forward Service**.

* It scans the `committed/` prefix in the storage driver.
* It pulls down any pending batches (from *any* node).
* It merges those rows into new immutable data blocks.
* It atomically swaps the block pointers in `Block 0`.
* It deletes the pending batches from the log.

*Note: ALL nodes can accept write requests and flush them to the `committed/` log at any time. The timesharing mechanism only restricts who is allowed to write those logs forward into permanent blocks.*
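The leader's steps can be sketched against an in-memory stand-in for the storage driver. Everything here is hypothetical (`MemoryStore`, `writeForward`, the row and batch shapes) and simplified: it assumes batches are applied in MISC order with later values winning, and it writes all pending rows into a single new block rather than re-merging existing blocks.

```typescript
// Hypothetical in-memory stand-in for a storage driver; none of these names are BucketDB's API.
type Row = { key: string; value: string };
type Batch = { misc: number; rows: Row[] };

class MemoryStore {
  committed: Record<string, Batch> = {}; // objects under the committed/ prefix
  blocks: Record<string, Row[]> = {};    // immutable data blocks
  block0 = { etag: 0, blockIds: [] as string[] };

  // Conditional write: succeeds only if the caller read the current ETag.
  swapBlock0(expectedEtag: number, blockIds: string[]): boolean {
    if (this.block0.etag !== expectedEtag) return false;
    this.block0 = { etag: expectedEtag + 1, blockIds: blockIds };
    return true;
  }
}

// One write-forward pass. Returns true if a new block was published.
function writeForward(store: MemoryStore, newBlockId: string): boolean {
  const seen = store.block0; // pointers and ETag as of the start of the pass

  // Scan committed/ and pull every pending batch, oldest MISC first.
  const names = Object.keys(store.committed);
  if (names.length === 0) return false;
  const batches = names
    .map((name) => store.committed[name])
    .sort((a, b) => a.misc - b.misc);

  // Merge rows into one new block; a later MISC overwrites an earlier one (assumption).
  const latest: Record<string, string> = {};
  for (const batch of batches) {
    for (const row of batch.rows) latest[row.key] = row.value;
  }
  store.blocks[newBlockId] = Object.keys(latest).map((key) => ({ key: key, value: latest[key] }));

  // Publish the block by swapping the pointers in Block 0.
  if (!store.swapBlock0(seen.etag, seen.blockIds.concat(newBlockId))) {
    delete store.blocks[newBlockId]; // another node published first; discard the work
    return false;
  }

  // Delete only the batches that were merged in this pass.
  for (const name of names) delete store.committed[name];
  return true;
}
```

Deleting batches only after the pointer swap succeeds means a failed pass leaves the log untouched, so the next leader can retry the same batches.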

***

## Dynamic Scaling and Replicas

Because the cluster coordinates dynamically, you can scale your database tier automatically.

### Overlaps and Gaps (Split-Brain)

When a cluster scales dynamically (e.g., a node crashes or a new one boots up), nodes will not discover the change at the exact same millisecond.

* **The Gap:** For a few seconds, there may be a leadership window that no node thinks it owns. The Write-Forward log simply builds up slightly, which is harmless.
* **The Overlap:** Two nodes might temporarily both believe they are the leader of the current window.

In traditional leader-based databases, an overlap like this can cause catastrophic split-brain corruption. In BucketDB, **it is completely harmless**.

`Block 0` updates are guarded by strict Storage Driver ETag conditional writes. If two nodes run the Write-Forward process simultaneously, the slower node receives an ETag mismatch when it tries to save the new block pointers, silently discards its redundant work, and backs off. The Timesharing ring is therefore an optimization that prevents wasted compute, not a hard concurrency lock.
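The overlap case can be illustrated with a toy store that supports conditional writes. This is a sketch under stated assumptions: `ConditionalStore`, `read`, and `putIfMatch` are hypothetical names that mimic a conditional PUT keyed on an ETag, not the Storage Driver's real interface.

```typescript
// Hypothetical object store with conditional writes; names are illustrative only.
type Block0 = { etag: string; blockIds: string[] };

class ConditionalStore {
  private current: Block0 = { etag: "v0", blockIds: [] };
  private version = 0;

  read(): Block0 {
    return { etag: this.current.etag, blockIds: this.current.blockIds.slice() };
  }

  // Writes only if `ifMatch` equals the stored ETag, mimicking a conditional PUT.
  putIfMatch(ifMatch: string, blockIds: string[]): boolean {
    if (ifMatch !== this.current.etag) return false; // ETag mismatch
    this.version += 1;
    this.current = { etag: "v" + this.version, blockIds: blockIds.slice() };
    return true;
  }
}

const store = new ConditionalStore();

// Both nodes believe they lead the same window and read the same Block 0.
const seenByA = store.read();
const seenByB = store.read();

// Node A finishes first and publishes its block.
const aPublished = store.putIfMatch(seenByA.etag, seenByA.blockIds.concat("block-A"));

// Node B is slower; its ETag is now stale, so the write is rejected and it backs off.
const bPublished = store.putIfMatch(seenByB.etag, seenByB.blockIds.concat("block-B"));

console.log(aPublished, bPublished, store.read().blockIds);
```

Node A's write succeeds, node B's is rejected, and `Block 0` ends up pointing only at `block-A`. Node B's work is wasted but nothing is corrupted.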

### Performance Implications During Gaps

A leadership gap may affect query performance.

When a query is executed, BucketDB must account for writes that have been committed but not yet merged into blocks. To do this, the local `QueryEngine` overlays the pending commits from the `committed/` prefix on top of the block data. If the prefix temporarily grows larger than normal because no leader ran during a window, each query has to process more batches. This may cause a noticeable latency spike in read queries for the duration of the gap.
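A rough sketch of the overlay step shows why read cost grows with the backlog. The shapes and the `overlayPending` function are hypothetical, since the internals of `QueryEngine` are not described here, and the sketch assumes pending batches are applied in MISC order.

```typescript
// Hypothetical shapes; the real QueryEngine internals are not described in this page.
type MergedRow = { key: string; value: string };
type PendingBatch = { misc: number; rows: MergedRow[] };

// Returns the rows a query would see: block rows with pending commits applied on top.
function overlayPending(blockRows: MergedRow[], pending: PendingBatch[]): MergedRow[] {
  const view: Record<string, string> = {};
  for (const row of blockRows) view[row.key] = row.value;

  // Apply pending batches oldest first so the newest commit wins (assumption).
  const ordered = pending.slice().sort((a, b) => a.misc - b.misc);
  for (const batch of ordered) {
    // This loop runs once per pending batch, so a longer backlog means slower reads.
    for (const row of batch.rows) view[row.key] = row.value;
  }
  return Object.keys(view).map((key) => ({ key: key, value: view[key] }));
}
```

The work per query is proportional to the number of rows in unmerged batches, which is why a missed window shows up as read latency rather than as incorrect results.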

## Read Replicas (Edge Nodes)

A key benefit of BucketDB's timesharing model is the ability to run pure **Read Replicas**.

Because any node can read data and append to the `committed/` log, you can spin up 50 nodes globally (e.g., on edge locations) to serve lightning-fast reads. However, you don't want an edge node in Tokyo merging S3 blocks for a bucket hosted in Virginia.

To solve this, either assign your edge nodes a `nodeId` that falls outside the active cluster's designated Timeshare ring, or do not run the `WriteForwardService` on them. They will serve reads from the in-memory cache and append local writes to the `committed/` log, but will never wake up to perform heavy background block merges, which drastically reduces cloud egress costs and processing latency.
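The two options can be pictured with a small planning check. This is only an illustration: the `NodePlan` shape, the node names, and `canHoldAuthority` are hypothetical, and it assumes the ring covers `nodeId` values from 0 up to `ringSize - 1`.

```typescript
// Hypothetical description of a node; only nodeId and WriteForwardService come from the text.
type NodePlan = { name: string; nodeId: number; runsWriteForward: boolean };

// Assumption: the ring covers nodeIds 0 .. ringSize - 1.
function canHoldAuthority(node: NodePlan, ringSize: number): boolean {
  const inRing = node.nodeId >= 0 && node.nodeId < ringSize;
  return inRing && node.runsWriteForward;
}

const ringSize = 3;
const plan: NodePlan[] = [
  { name: "virginia-0", nodeId: 0, runsWriteForward: true },
  { name: "virginia-1", nodeId: 1, runsWriteForward: true },
  { name: "virginia-2", nodeId: 2, runsWriteForward: true },
  { name: "tokyo-edge", nodeId: 200, runsWriteForward: true },   // outside the ring
  { name: "sydney-edge", nodeId: 201, runsWriteForward: false }, // service not started
];

// Names of the nodes that can ever become Leader under this plan.
const leaders = plan.filter((n) => canHoldAuthority(n, ringSize)).map((n) => n.name);
console.log(leaders);
```

Under these assumptions only the three Virginia nodes can become Leader; both edge nodes still accept reads and writes.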

