
CRDTs in KiraDB

A CRDT is a data structure whose merge operation is commutative, associative, and idempotent. The math guarantees that any two replicas which have seen the same set of updates converge to the same state — regardless of message order, duplication, or partition length.

This doc covers the five CRDTs implemented in kiradb-crdt (GCounter, PNCounter, LWWRegister, MVRegister, and ORSet), plus the VectorClock building block underneath MVRegister.

For the user-facing commands, see the CRDT.* section in the command reference.


Why CRDTs alongside Raft?

Raft (Phase 4) gives strong consistency through a single leader: every write goes through one node, gets ordered, gets replicated. Beautiful for state where order matters (cluster membership, schema, leadership of a key range).

But Raft has a cost: every write funnels through the leader and waits on a quorum round trip, and writes stall outright during partitions and leader elections.

For workloads like counters on a viral tweet, online-presence sets in a chat room, or config-flag values changed by an admin, you don’t need a global order. You need every node to accept writes locally, and the math to guarantee convergence when nodes can talk again.

That’s CRDTs. Both models live in the same database; you pick per data type, not for the whole system.


The three magic properties

A CRDT’s merge(A, B) function must satisfy:

Property       Meaning
Commutative    merge(A, B) == merge(B, A) — order of arrival doesn’t matter
Associative    merge(merge(A, B), C) == merge(A, merge(B, C)) — grouping doesn’t matter
Idempotent     merge(A, A) == A — duplicate gossip is a no-op

If all three hold, any two replicas seeing the same updates eventually converge — no coordination needed.
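
A quick sanity check in toy form: plain max() is the prototypical CRDT merge, and all three laws hold for it (a standalone sketch, not KiraDB code):

public class MergeLaws {
    public static void main(String[] args) {
        long a = 5, b = 8, c = 7;
        // max() satisfies all three laws; the element-wise merges below
        // inherit them slot by slot.
        assert Math.max(a, b) == Math.max(b, a);                            // commutative
        assert Math.max(Math.max(a, b), c) == Math.max(a, Math.max(b, c)); // associative
        assert Math.max(a, a) == a;                                        // idempotent
        System.out.println("all three laws hold for max");                 // run with: java -ea MergeLaws
    }
}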

KiraDB’s CRDTs are all state-based (CvRDTs): nodes gossip the whole state and merge on receive. The alternative (op-based) is bandwidth-cheaper but requires causal broadcast and exactly-once delivery — both hard. State-based wins on robustness; the bandwidth cost is rarely the bottleneck.


GCounter — grow-only counter

State: Map<NodeId, Long> — one slot per writing node. Value: sum of all slots. Merge: element-wise max per slot.

Node A's view:        { node-a: 5, node-b: 3, node-c: 7 }      value = 15
Node B's view:        { node-a: 4, node-b: 8, node-c: 7 }      value = 19
After merge:          { node-a: 5, node-b: 8, node-c: 7 }      value = 20

Why max-merge is safe: only node N ever increments slot N. So if two replicas disagree on slot N, the higher value strictly contains more information — no node could have written less.
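
A minimal sketch of the shape (illustrative only; the real class is GCounter.java, whose API may differ in detail):

import java.util.HashMap;
import java.util.Map;

final class GCounterSketch {
    private final String localNodeId;              // bound at construction, never passed to increment()
    private final Map<String, Long> slots = new HashMap<>();

    GCounterSketch(String localNodeId) { this.localNodeId = localNodeId; }

    void increment(long delta) {                   // only ever touches our own slot
        slots.merge(localNodeId, delta, Long::sum);
    }

    long value() {                                 // sum of all slots
        return slots.values().stream().mapToLong(Long::longValue).sum();
    }

    void merge(GCounterSketch other) {             // element-wise max per slot
        other.slots.forEach((node, v) -> slots.merge(node, v, Math::max));
    }
}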

File: kiradb-crdt/src/main/java/io/kiradb/crdt/GCounter.java. The API enforces the invariant: increment() takes no node-id argument; the instance is bound to its own id at construction.

Use cases inside KiraDB: rate limiter buckets, flag impression/conversion counters.


PNCounter — positive/negative counter

State: two GCounters — one for additions (P), one for subtractions (N). Value: P.value() - N.value().

This is the standard “decompose what you can’t do into two things you can do” pattern. A single counter that can go both up and down would not be safe under max-merge: a decrement lowers a slot, and merge’s element-wise max would simply resurrect the higher pre-decrement value. By restricting each underlying counter to monotonic growth, the merge math stays trivially correct.

Increment by 5:  P slot += 5
Decrement by 2:  N slot += 2
Value:           P.sum() - N.sum() = 5 - 2 = 3
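
In sketch form, reusing the GCounterSketch from the previous section (names are illustrative):

final class PNCounterSketch {
    private final GCounterSketch p;   // additions only
    private final GCounterSketch n;   // subtractions only

    PNCounterSketch(String localNodeId) {
        p = new GCounterSketch(localNodeId);
        n = new GCounterSketch(localNodeId);
    }

    void increment(long delta) { p.increment(delta); }
    void decrement(long delta) { n.increment(delta); }   // a decrement is growth of N
    long value() { return p.value() - n.value(); }

    void merge(PNCounterSketch other) {   // delegates to the already-correct max-merge
        p.merge(other.p);
        n.merge(other.n);
    }
}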

File: PNCounter.java. Three lines of merge, each calling already-correct GCounter.merge.


LWWRegister — last-write-wins

State: (value, timestamp, writerNodeId). Merge: higher timestamp wins. Ties broken by lexicographically larger node id.

private static boolean winsOver(long tA, String idA, long tB, String idB) {
    if (tA > tB) return true;          // strictly newer timestamp wins
    if (tA < tB) return false;
    return idA.compareTo(idB) > 0;     // tie: lexicographically larger node id wins
}

The honest disclaimer: wall clocks lie. NTP drift, leap seconds, VM pauses. A write from a clock-skewed node can silently discard a more-recently-written value. Do not use LWW for data where loss is unacceptable; use MVRegister instead.

Where LWW is fine in KiraDB: feature flag values (admins update slowly; “last admin wins” matches the mental model). Configuration values used through the Phase 7 path.

A real bug that hit us, fixed in Phase 7: if a single node performs two set() calls within the same millisecond, the original implementation rejected the second because idA.compareTo(idA) == 0 (not strictly greater). Fix: local writes always win over their own prior state via a monotonically-advancing timestamp. The LWW tiebreaker only meaningfully applies between different nodes. Regression test: LWWRegisterTest.rapidSameNodeRewritesAllSucceed.
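
A sketch of the fixed write path (illustrative; the real class is LWWRegister.java):

final class LWWRegisterSketch<V> {
    private final String localNodeId;
    private V value;
    private long timestamp;
    private String writerNodeId = "";

    LWWRegisterSketch(String localNodeId) { this.localNodeId = localNodeId; }

    void set(V newValue) {
        // The Phase 7 fix: never reuse our own prior stamp, so two set() calls
        // in the same millisecond can't lose the tiebreak against themselves.
        timestamp = Math.max(System.currentTimeMillis(), timestamp + 1);
        value = newValue;
        writerNodeId = localNodeId;
    }

    void merge(LWWRegisterSketch<V> other) {
        if (winsOver(other.timestamp, other.writerNodeId, timestamp, writerNodeId)) {
            value = other.value;
            timestamp = other.timestamp;
            writerNodeId = other.writerNodeId;
        }
    }

    private static boolean winsOver(long tA, String idA, long tB, String idB) {
        if (tA > tB) return true;
        if (tA < tB) return false;
        return idA.compareTo(idB) > 0;
    }
}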


VectorClock — causality without wall clocks

A vector clock is a Map<NodeId, Long> that captures the causal order between distributed events. Three rules:

Event                          Rule
Local event on node N          clock[N] += 1
Send a message from N          Attach a copy of N’s clock
Receive at N with attached V   clock[i] = max(clock[i], V[i]) for every i, then clock[N] += 1

Comparison between two clocks A and B yields one of four outcomes: A happened-before B (A[i] <= B[i] for every i, strictly less somewhere), B happened-before A (the mirror case), A == B (identical slots), or A and B are concurrent (each is ahead on at least one slot).

The trick: every node only ever increments its own slot. So if two clocks disagree on slot N, the higher value strictly knows more about what happened on N. Nothing is lost.
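
A compact sketch (illustrative; the real class is VectorClock.java — the dominates() and copy() helpers here are assumptions that the MVRegister sketch below reuses):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

enum Ordering { BEFORE, AFTER, EQUAL, CONCURRENT }

final class VectorClockSketch {
    private final Map<String, Long> slots = new HashMap<>();

    void tick(String nodeId) {                           // local event: bump own slot
        slots.merge(nodeId, 1L, Long::sum);
    }

    void receive(String nodeId, VectorClockSketch attached) {
        attached.slots.forEach((n, v) -> slots.merge(n, v, Math::max));  // element-wise max
        tick(nodeId);                                    // then count the receive itself
    }

    Ordering compare(VectorClockSketch other) {
        boolean ahead = false, behind = false;
        Set<String> nodes = new HashSet<>(slots.keySet());
        nodes.addAll(other.slots.keySet());
        for (String n : nodes) {
            long mine = slots.getOrDefault(n, 0L);
            long theirs = other.slots.getOrDefault(n, 0L);
            if (mine > theirs) ahead = true;
            if (mine < theirs) behind = true;
        }
        if (ahead && behind) return Ordering.CONCURRENT;
        if (ahead) return Ordering.AFTER;                // we strictly dominate
        if (behind) return Ordering.BEFORE;
        return Ordering.EQUAL;
    }

    boolean dominates(VectorClockSketch other) { return compare(other) == Ordering.AFTER; }

    VectorClockSketch copy() {
        VectorClockSketch c = new VectorClockSketch();
        c.slots.putAll(slots);
        return c;
    }
}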

File: VectorClock.java. Used by MVRegister.


MVRegister — multi-value (Dynamo-paper style)

State: a set of (value, vectorClock) pairs. Write: bump local VC, drop pairs strictly dominated by the new VC, append (newValue, copy of VC). Merge: union the pairs, then drop those dominated by some other pair in the union. Read: return the surviving values — usually 1, more than 1 if there are unresolved concurrent writes.

The classic shopping-cart use case:

T1: A writes "socks"   →  A = { ("socks", {A:1}) }
T2: B writes "shirt"   →  B = { ("shirt", {B:1}) }   (concurrent — A and B haven't synced)
T3: A and B sync. Combined: { ("socks", {A:1}), ("shirt", {B:1}) }
    Neither dominates → both survive.
T4: read() → ["socks", "shirt"]. App applies its rule (union for carts), writes back "socks+shirt".
T5: A writes "socks+shirt"; A's VC is now {A:1, B:1}, then increments A → {A:2, B:1}.
    All prior pairs are now dominated → dropped.
    Final: { ("socks+shirt", {A:2, B:1}) }.

Important nomenclature: this design comes from the original Dynamo paper (2007). Amazon DynamoDB, the AWS product (2012), actually uses LWW by default and hides conflicts. When people say “Dynamo-style consistency” they almost always mean the paper, not the AWS service. Riak is the closest open-source heir to the paper.

File: MVRegister.java.


ORSet — observed-remove set

The trick: every add(x) produces a fresh UUID tag. The element is “in the set” iff at least one of its tags is alive. remove(x) only kills the tags it has currently observed on this replica.

This means a concurrent add(x) on another replica (with a fresh UUID the remover never saw) survives the remove — matching real-world intent in shopping carts and presence sets (“better resurrect than lose”).

A:  add("apple")            → liveAdds: {apple → {tag-1}}
B:  observes A's add via merge.
A:  remove("apple")         → A: liveAdds={}, tombstones={tag-1}
B:  add("apple")            → B: liveAdds: {apple → {tag-1, tag-2}}   (fresh UUID!)

After A.merge(B) and B.merge(A):
  liveAdds = {apple → {tag-1, tag-2}}
  tombstones = {tag-1}
  Strip tombstoned tags: {apple → {tag-2}}
  → "apple" survives in both replicas. ✓

The defining test that locks this in: ORSetTest.concurrentAddSurvivesConcurrentRemove.

Known limitation: tombstones grow forever. Real production CRDTs garbage-collect via causal stability — once every replica has acknowledged seeing tag T, the tombstone for T is safe to drop. This requires a per-replica gossip “I have seen up to dot D” digest plus a stability tracker. Filed in Phase 13 — Deferred Hardening Backlog in CLAUDE.md.

File: ORSet.java.


CrdtStore — the orchestrator

CrdtStore (in kiradb-crdt/.../CrdtStore.java) is the layer that owns named CRDT instances and persists their state via the StorageEngine. Each CRDT type lives under a reserved key prefix:

crdt:g:<name>     →  GCounter state
crdt:pn:<name>    →  PNCounter state
crdt:lww:<name>   →  LWWRegister state
crdt:mv:<name>    →  MVRegister state
crdt:or:<name>    →  ORSet state

CRDT instances are cached in-memory per name (lazy-loaded on first read). After every mutation, the new state is serialized and written back to storage. ConcurrentHashMap.computeIfAbsent ensures two threads racing on the same name converge on a single instance — no torn state.
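
The load-or-create path, sketched (illustrative; the real CrdtStore API may differ — StorageEngine is reduced to a hypothetical byte[] get/put pair, and GCounterSketch.deserialize is an assumed helper):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CrdtStoreSketch {
    interface StorageEngine {                    // hypothetical reduction of the real interface
        byte[] get(String key);
        void put(String key, byte[] value);
    }

    private final Map<String, GCounterSketch> gCounters = new ConcurrentHashMap<>();
    private final StorageEngine storage;
    private final String localNodeId;

    CrdtStoreSketch(StorageEngine storage, String localNodeId) {
        this.storage = storage;
        this.localNodeId = localNodeId;
    }

    GCounterSketch gCounter(String name) {
        // computeIfAbsent runs the loader at most once per key, so two threads
        // racing on the same name converge on a single instance.
        return gCounters.computeIfAbsent(name, n -> {
            byte[] state = storage.get("crdt:g:" + n);                 // reserved key prefix
            return state == null
                    ? new GCounterSketch(localNodeId)                  // first use: fresh counter
                    : GCounterSketch.deserialize(state, localNodeId);  // assumed helper
        });
    }
}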

The identity invariant is critical: localNodeId is supplied at construction and must be stable across restarts. A drifting node id splits one logical node into two — counters double, set elements duplicate. KiraDB resolves the id from -Dkiradb.node.id, falling back to the hostname.
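
The resolution order, sketched (NodeIdResolver is a hypothetical helper, not the real class name):

import java.net.InetAddress;
import java.net.UnknownHostException;

final class NodeIdResolver {
    static String resolve() {
        String fromProperty = System.getProperty("kiradb.node.id");  // explicit id wins
        if (fromProperty != null) return fromProperty;
        try {
            return InetAddress.getLocalHost().getHostName();         // hostname fallback
        } catch (UnknownHostException e) {
            throw new IllegalStateException("cannot resolve hostname; set -Dkiradb.node.id", e);
        }
    }
}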


What’s deferred

These items are tracked in the Phase 13 hardening backlog in CLAUDE.md, each with an explicit Trigger condition for when to pull them forward.


Reading list