# RFC-CBDD: High-Performance Embedded Document Database for .NET

**Status:** Draft
**Version:** 0.1.0
**Date:** February 2026
**Authors:** CBDD Development Team

---

## Abstract

This document specifies **CBDD**, a high-performance embedded document-oriented database engine for .NET 10. CBDD is designed from the ground up for **zero-allocation performance**, leveraging modern .NET features including `Span<T>`, Memory-Mapped Files, and Source Generators. The database implements a custom **C-BSON** (Compressed BSON) format that achieves 30-60% storage reduction compared to standard BSON while maintaining full type compatibility.

Key innovations include:

- **Zero-allocation I/O** via `Span<byte>` and `stackalloc`
- **C-BSON format** with field name compression (2-byte IDs vs. variable-length strings)
- **Page-based storage** with memory-mapped file I/O
- **Multiple index types:** B+Tree, R-Tree (geospatial), and HNSW (vector similarity)
- **ACID transactions** with Write-Ahead Logging and Snapshot Isolation
- **Compile-time code generation** for zero-reflection serialization

---

## 1. Introduction

### 1.1 Motivation

Most embedded databases for .NET fall into two categories:

1. **Native wrappers** (SQLite, RocksDB): High performance but with interop overhead and GC pressure from marshalling
2. **Managed implementations** (LiteDB, others): Pure C# but burdened by reflection, excessive allocations, and legacy design

CBDD bridges this gap by leveraging **.NET 10's modern performance features** while remaining a pure managed implementation:

- `Span<T>` and `Memory<T>` for zero-copy I/O
- Source Generators for zero-reflection serialization
- Memory-Mapped Files for OS-level page caching
- Stack allocation (`stackalloc`) for ephemeral buffers

### 1.2 Scope

This RFC specifies:

- **Storage Engine:** Page file format, WAL protocol, transaction semantics
- **C-BSON Format:** Wire format, schema management, key compression
- **Indexing:** B+Tree, R-Tree, and HNSW implementations
- **Query Engine:** LINQ provider and hybrid execution model
- **Code Generation:** Mapper generation rules and attribute support

### 1.3 Terminology

**MUST**, **SHOULD**, **MAY**: As defined in RFC 2119

- **Page:** Fixed-size block of data (default 16KB)
- **Slot:** Variable-size entry within a slotted page
- **C-BSON:** Compressed BSON with field ID compression
- **WAL:** Write-Ahead Log for durability
- **MBR:** Minimum Bounding Rectangle (for R-Tree)
- **HNSW:** Hierarchical Navigable Small World (vector search algorithm)

---

## 2. Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                    CBDD Architecture                    │
├─────────────────────────────────────────────────────────┤
│  ┌───────────────┐  ┌───────────────┐  ┌─────────────┐  │
│  │ LINQ Provider │  │  Source Gen   │  │ Collections │  │
│  │  (Queryable)  │  │   (Mappers)   │  │ (DbContext) │  │
│  └───────┬───────┘  └───────┬───────┘  └──────┬──────┘  │
│          │                  │                 │         │
│  ┌───────▼────────────────────────────────────▼──────┐  │
│  │        Index Layer (B+Tree, R-Tree, HNSW)         │  │
│  └───────┬────────────────────────────────────┬──────┘  │
│          │                                    │         │
│  ┌───────▼────────────────────────────────────▼──────┐  │
│  │       Storage Engine (Pages, Transactions)        │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────┐ │  │
│  │  │   PageFile   │  │     WAL      │  │ FreeList │ │  │
│  │  └──────────────┘  └──────────────┘  └──────────┘ │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │         C-BSON (Span-based Reader/Writer)         │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │    OS Memory-Mapped Files (Kernel Page Cache)     │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
```

### 2.1 Storage Layer

The storage layer manages page-based I/O using memory-mapped files:

- **PageFile:** Fixed-size pages (8KB, 16KB, or 32KB)
- **Page Types:** Header, Data, Index, Vector, Spatial, Dictionary, Schema, Overflow, Free
- **Free List:** Linked list of reusable pages

### 2.2 C-BSON Layer

Provides zero-allocation BSON serialization/deserialization:

- **BsonSpanWriter:** Writes C-BSON to `Span<byte>`
- **BsonSpanReader:** Reads C-BSON from `ReadOnlySpan<byte>`
- **Schema Management:** Field name → ID mapping

Full specification: See `C-BSON.md`

### 2.3 Index Layer

Three specialized index types:

- **B+Tree:** General-purpose sorted index (range queries, equality)
- **R-Tree:** Geospatial index (proximity, bounding box queries)
- **HNSW:** Vector similarity search (k-NN, ANN)

### 2.4 Query Layer

LINQ-to-CBDD provider:

- Translates LINQ expressions to index operations
- Hybrid execution: Index-based filtering + in-memory LINQ to Objects
- Supports `Where`, `OrderBy`, `Skip`, `Take`, `GroupBy`, `Join`, aggregations

### 2.5 Transaction Layer

ACID guarantees via:

- **Atomicity:** All-or-nothing commits
- **Consistency:** Schema validation
- **Isolation:** Snapshot isolation (MVCC-like)
- **Durability:** Write-Ahead Logging (WAL)

---

## 3. Storage Engine Specification

### 3.1 Page File Format

#### 3.1.1 Page Sizes

CBDD supports three predefined page sizes:

| Configuration | Page Size | Use Case                        |
|:--------------|:----------|:--------------------------------|
| **Small**     | 8 KB      | Embedded, tiny documents        |
| **Default**   | 16 KB     | General purpose (InnoDB-like)   |
| **Large**     | 32 KB     | Big documents (MongoDB-like)    |

**Rationale:** 16 KB matches InnoDB's default page size and is a whole multiple of the 4 KB memory page typical on Linux/x86-64, balancing fragmentation against per-page overhead.

#### 3.1.2 File Layout

```
┌─────────────────────────────────────────────────────────┐
│  Page 0: File Header                                    │
│    [PageHeader (32)]                                    │
│    [Database Version (4)]                               │
│    [Page Size (4)]                                      │
│    [First Free Page ID (4)]  ← Free list head           │
│    [Dictionary Root Page ID (4)]                        │
│    [Reserved (remaining)]                               │
├─────────────────────────────────────────────────────────┤
│  Page 1: Collection Metadata                            │
│    [SlottedPageHeader (24)]                             │
│    [Collection Schemas...]                              │
├─────────────────────────────────────────────────────────┤
│  Page 2+: Data, Index, Vector, Spatial, Dictionary...   │
└─────────────────────────────────────────────────────────┘
```

### 3.2 Page Header Format

All pages start with a **32-byte header**:

```
Offset  Size  Field                 Description
------  ----  --------------------  --------------------------
0       4     PageId                Page number (0-indexed)
4       1     PageType              Enum (see §3.3)
5       2     FreeBytes             Unused space in page
7       4     NextPageId            Linked list pointer
11      8     TransactionId         Last modifying transaction
19      4     Checksum              CRC32 of page data
23      4     DictionaryRootPageId  (Page 0 only)
27      5     Reserved              Future use
```

**Total:** 32 bytes

**Implementation:**
```csharp
using System.Runtime.InteropServices;

// Explicit layout pins every field to the byte offset listed in the table above.
// PageType must be a byte-backed enum so that it occupies exactly one byte.
[StructLayout(LayoutKind.Explicit, Size = 32)]
public struct PageHeader
{
    [FieldOffset(0)]  public uint PageId;
    [FieldOffset(4)]  public PageType PageType;
    [FieldOffset(5)]  public ushort FreeBytes;
    [FieldOffset(7)]  public uint NextPageId;
    [FieldOffset(11)] public ulong TransactionId;
    [FieldOffset(19)] public uint Checksum;
    [FieldOffset(23)] public uint DictionaryRootPageId;
    // Bytes 27..31 are reserved; Size = 32 keeps them in the struct.
}
```
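
As an illustration of how such a header could be stamped into and read back from a raw page buffer without copying, here is a minimal sketch. It assumes a byte-backed `PageType` enum (only a few members are declared here) and a little-endian host; `PageHeaderDemo`, `WriteHeader`, and `ReadHeader` are hypothetical names, not part of the specification.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Assumption: PageType is byte-backed so it fits the single byte at offset 4.
public enum PageType : byte { Empty = 0, Header = 1, Data = 3, Free = 10 }

[StructLayout(LayoutKind.Explicit, Size = 32)]
public struct PageHeader
{
    [FieldOffset(0)]  public uint PageId;
    [FieldOffset(4)]  public PageType PageType;
    [FieldOffset(5)]  public ushort FreeBytes;
    [FieldOffset(7)]  public uint NextPageId;
    [FieldOffset(11)] public ulong TransactionId;
    [FieldOffset(19)] public uint Checksum;
    [FieldOffset(23)] public uint DictionaryRootPageId;
}

public static class PageHeaderDemo
{
    // Size of the struct as laid out in memory (expected to equal the 32-byte header).
    public static int Size => Unsafe.SizeOf<PageHeader>();

    // Hypothetical helper: reinterprets the first 32 bytes of the page as a header
    // and mutates it in place, so no intermediate buffer is allocated.
    public static void WriteHeader(Span<byte> page, uint pageId, PageType type, ushort freeBytes)
    {
        ref PageHeader header = ref MemoryMarshal.AsRef<PageHeader>(page);
        header.PageId = pageId;
        header.PageType = type;
        header.FreeBytes = freeBytes;
    }

    // Hypothetical helper: copies the header out of the page buffer.
    public static PageHeader ReadHeader(ReadOnlySpan<byte> page)
        => MemoryMarshal.Read<PageHeader>(page);
}
```

Because the struct mirrors the on-disk offsets, the same code would work over a span obtained from a memory-mapped view; the unaligned fields (offsets 5, 7, 11, ...) rely on the target CPU tolerating unaligned access.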

### 3.3 Page Types

| Value | Type        | Purpose                                  |
|:------|:------------|:-----------------------------------------|
| 0     | Empty       | Uninitialized                            |
| 1     | Header      | Page 0 (file header)                     |
| 2     | Collection  | Schema and collection metadata           |
| 3     | Data        | Document storage (slotted page)          |
| 4     | Index       | B+Tree node                              |
| 5     | FreeList    | Deprecated (unused)                      |
| 6     | Overflow    | Continuation of large documents          |
| 7     | Dictionary  | String interning for C-BSON keys         |
| 8     | Schema      | Schema versioning                        |
| 9     | Vector      | HNSW index node                          |
| 10    | Free        | Reusable page (linked via NextPageId)    |
| 11    | Spatial     | R-Tree node                              |

### 3.4 Slotted Page Structure

Data pages use a **slotted page** design for variable-size documents:

```
┌─────────────────────────────────────────────────────────┐
│  [SlottedPageHeader (24)]                               │
│    PageId, PageType, SlotCount, FreeSpaceStart,         │
│    FreeSpaceEnd, NextOverflowPage, TransactionId        │
├─────────────────────────────────────────────────────────┤
│  [Slot Array (grows down)]                              │
│  ┌──────────────────────────────────┐                   │
│  │ Slot 0: [Offset, Length, Flags]  │  (8 bytes)        │
│  │ Slot 1: [Offset, Length, Flags]  │                   │
│  │ Slot 2: [Offset, Length, Flags]  │                   │
│  │ ...                              │                   │
│  └──────────────────────────────────┘                   │
│                  ↓                                      │
│            [Free Space]                                 │
│                  ↑                                      │
│  [Data Area (grows up)]                                 │
│  ┌──────────────────────────────────┐                   │
│  │ ... Document N C-BSON bytes ...  │                   │
│  │ ... Document 2 C-BSON bytes ...  │                   │
│  │ ... Document 1 C-BSON bytes ...  │                   │
│  │ ... Document 0 C-BSON bytes ...  │                   │
│  └──────────────────────────────────┘                   │
└─────────────────────────────────────────────────────────┘
```

#### SlottedPageHeader (24 bytes)

```
Offset  Size  Field               Description
------  ----  ------------------  --------------------------
0       4     PageId
4       1     PageType            (= 3 for Data pages)
5       3     (padding)           Alignment gap implied by the next offset
8       2     SlotCount           Number of slots
10      2     FreeSpaceStart      Offset where slots end
12      2     FreeSpaceEnd        Offset where data begins
14      4     NextOverflowPage    For large documents
18      4     TransactionId
22      2     Reserved
```

#### SlotEntry (8 bytes)

```
Offset  Size  Field               Description
------  ----  ------------------  --------------------------
0       2     Offset              Byte offset to document data
2       2     Length              Document length in bytes
4       4     Flags               SlotFlags enum
```

**SlotFlags:**
- `None = 0`: Active slot
- `Deleted = 1`: Slot marked for reuse
- `HasOverflow = 2`: Document continues in overflow pages
- `Compressed = 4`: Reserved for future compression
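
The bookkeeping behind a slotted insert can be sketched as follows. This is an illustrative sketch rather than the engine's code: it assumes little-endian fields at the header offsets listed above, ignores overflow pages and slot reuse, and `SlottedPageSketch`, `Init`, and `Insert` are hypothetical names.

```csharp
using System;
using System.Buffers.Binary;

public static class SlottedPageSketch
{
    private const int HeaderSize = 24, SlotSize = 8;
    private const int SlotCountOffset = 8, FreeStartOffset = 10, FreeEndOffset = 12;

    // Prepares an empty page: the slot array starts right after the header,
    // and the data area starts at the very end of the page.
    public static void Init(Span<byte> page)
    {
        page.Clear();
        BinaryPrimitives.WriteUInt16LittleEndian(page.Slice(FreeStartOffset), HeaderSize);
        BinaryPrimitives.WriteUInt16LittleEndian(page.Slice(FreeEndOffset), (ushort)page.Length);
    }

    // Returns the new slot index, or -1 when the document plus its slot entry does not fit.
    public static int Insert(Span<byte> page, ReadOnlySpan<byte> document)
    {
        ushort slotCount = BinaryPrimitives.ReadUInt16LittleEndian(page.Slice(SlotCountOffset));
        ushort freeStart = BinaryPrimitives.ReadUInt16LittleEndian(page.Slice(FreeStartOffset));
        ushort freeEnd = BinaryPrimitives.ReadUInt16LittleEndian(page.Slice(FreeEndOffset));
        if (freeEnd - freeStart < document.Length + SlotSize) return -1;

        // Document bytes are placed at the high end of the free space.
        ushort dataOffset = (ushort)(freeEnd - document.Length);
        document.CopyTo(page.Slice(dataOffset));

        // The slot entry is appended at the low end: [Offset:2][Length:2][Flags:4].
        Span<byte> slot = page.Slice(freeStart, SlotSize);
        BinaryPrimitives.WriteUInt16LittleEndian(slot, dataOffset);
        BinaryPrimitives.WriteUInt16LittleEndian(slot.Slice(2), (ushort)document.Length);
        BinaryPrimitives.WriteUInt32LittleEndian(slot.Slice(4), 0); // SlotFlags.None

        BinaryPrimitives.WriteUInt16LittleEndian(page.Slice(SlotCountOffset), (ushort)(slotCount + 1));
        BinaryPrimitives.WriteUInt16LittleEndian(page.Slice(FreeStartOffset), (ushort)(freeStart + SlotSize));
        BinaryPrimitives.WriteUInt16LittleEndian(page.Slice(FreeEndOffset), dataOffset);
        return slotCount;
    }
}
```

Inserting three 64-byte documents into a fresh 16 KB page under these assumptions leaves `FreeSpaceStart` at 24 + 3 × 8 = 48 and `FreeSpaceEnd` at 16384 − 3 × 64 = 16192.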

### 3.5 DocumentLocation

Documents are addressed by **(PageId, SlotIndex)** tuple:

```csharp
public readonly struct DocumentLocation
{
    public uint PageId { get; init; }       // 4 bytes
    public ushort SlotIndex { get; init; }  // 2 bytes
}
// Total: 6 bytes when serialized
```

**Used in:**
- Index entries (key → location mapping)
- Overflow page chains
- Internal references

### 3.6 Free Page Management

Deleted pages form a **linked list** for reuse:

1. `PageHeader.PageType = Free`
2. `PageHeader.NextPageId` points to next free page
3. Page 0's `NextPageId` points to **head of free list**

**Allocation:**
```
if (freeListHead != 0):
    pageId = freeListHead
    freeListHead = ReadPage(pageId).NextPageId
    UpdatePage0(freeListHead)
else:
    pageId = nextPageId++
    ExpandFile()
```

**Deallocation:**
```
WritePage(pageId, Free with next=freeListHead)
freeListHead = pageId
UpdatePage0(freeListHead)
```
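
The allocation and deallocation steps above can be sketched in C# with an in-memory stand-in for the page file. Only the `NextPageId` links of free pages are modelled; `FreeListSketch` and its members are hypothetical names, and the choice of 2 as the first fresh page ID assumes pages 0 and 1 are reserved as in the file layout.

```csharp
using System;
using System.Collections.Generic;

public sealed class FreeListSketch
{
    // Stand-in for PageHeader.NextPageId of each page currently on the free list.
    private readonly Dictionary<uint, uint> _nextPageId = new Dictionary<uint, uint>();

    // 0 doubles as the "empty list" sentinel: page 0 is the file header and is never freed.
    private uint _freeListHead;

    // Assumption: pages 0 (header) and 1 (collection metadata) are always in use.
    private uint _nextNewPageId = 2;

    public uint Allocate()
    {
        if (_freeListHead != 0)
        {
            // Pop the head of the free list.
            uint pageId = _freeListHead;
            _freeListHead = _nextPageId[pageId];
            _nextPageId.Remove(pageId);
            return pageId;
        }

        // No reusable page: hand out a fresh ID (a real engine would also grow the file).
        return _nextNewPageId++;
    }

    public void Free(uint pageId)
    {
        if (pageId < 2) throw new ArgumentOutOfRangeException(nameof(pageId));

        // Push onto the head of the free list.
        _nextPageId[pageId] = _freeListHead;
        _freeListHead = pageId;
    }
}
```

The list behaves as a stack, so the most recently freed page is the first to be reused.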

---

## 4. C-BSON Format Specification

C-BSON (Compressed BSON) is CBDD's wire format. See the dedicated **`C-BSON.md`** document for the complete specification.

**Key points:**

- **Element header:** `[1 byte type][2 byte field ID]` (vs. BSON's `[1 byte type][N byte name\0]`)
- **Schema-based:** Field names mapped to `ushort` IDs via `ConcurrentDictionary`
- **Storage savings:** 30-60% reduction for typical schemas
- **Type compatible:** Uses standard BSON type codes and value encoding

**Example:**

```
Standard BSON element: [0x02]['e','m','a','i','l','\0'][value] = 7 bytes overhead
C-BSON element:        [0x02][0x03, 0x00][value]               = 3 bytes overhead
Savings: 4 of 7 header bytes (≈57%)
```
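
The per-element header arithmetic can be checked with a few lines of C#. This sketch counts header bytes only (type byte plus name or ID), not value bytes, so it says nothing about whole-document savings; `ElementOverhead` and its members are hypothetical names.

```csharp
using System;
using System.Text;

public static class ElementOverhead
{
    // Standard BSON header: 1 type byte + UTF-8 field name + 1 NUL terminator.
    public static int Bson(string fieldName)
        => 1 + Encoding.UTF8.GetByteCount(fieldName) + 1;

    // C-BSON header: 1 type byte + 2-byte field ID, regardless of name length.
    public static int CBson() => 1 + 2;

    // Share of header bytes saved by replacing the name with an ID.
    public static double SavingsPercent(string fieldName)
        => 100.0 * (Bson(fieldName) - CBson()) / Bson(fieldName);
}
```

For `"email"` this gives 7 versus 3 header bytes; shorter names such as `"_id"` (5 versus 3) save proportionally less.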

---

## 5. Indexing Specifications

### 5.1 B+Tree Index

#### 5.1.1 Node Structure

B+Tree nodes are stored in **Index pages (PageType = 4)**:

```
┌─────────────────────────────────────────────────────────┐
│  [PageHeader (32)]                                      │
├─────────────────────────────────────────────────────────┤
│  [BTreeNodeHeader (20)]                                 │
│    PageId, IsLeaf, EntryCount, ParentPageId,            │
│    NextLeafPageId, PrevLeafPageId                       │
├─────────────────────────────────────────────────────────┤
│  [Entries...]                                           │
│  For Leaf Nodes:                                        │
│  ┌────────────────────────────────────┐                 │
│  │ IndexKey (variable)                │                 │
│  │ DocumentLocation (6 bytes)         │                 │
│  └────────────────────────────────────┘                 │
│  For Internal Nodes:                                    │
│  ┌────────────────────────────────────┐                 │
│  │ IndexKey (variable)                │                 │
│  │ ChildPageId (4 bytes)              │                 │
│  └────────────────────────────────────┘                 │
└─────────────────────────────────────────────────────────┘
```

#### 5.1.2 BTreeNodeHeader (20 bytes)

```csharp
// The fields sum to 19 bytes; with default sequential layout, one byte of
// alignment padding after IsLeaf brings the struct to 20 bytes.
public struct BTreeNodeHeader
{
    public uint PageId;          // 4 bytes
    public bool IsLeaf;          // 1 byte
    public ushort EntryCount;    // 2 bytes
    public uint ParentPageId;    // 4 bytes
    public uint NextLeafPageId;  // 4 bytes (leaf only)
    public uint PrevLeafPageId;  // 4 bytes (leaf only)
}
```

#### 5.1.3 IndexKey

Supports composite keys with multiple types:

```csharp
public struct IndexKey : IComparable<IndexKey>
{
    public object[] Values { get; set; }  // Multi-column support
    public int CompareTo(IndexKey other) { ... }
}
```

**Serialization:**
- Each value serialized as C-BSON element
- Supports: String, Int32, Int64, Double, ObjectId, DateTime, Guid
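
One plausible way to order composite keys is column-by-column (lexicographic) comparison. The sketch below is an assumption about how `CompareTo` might behave, not the specified algorithm: it treats nulls as smallest, compares strings ordinally, requires matching types per column, and sorts a key that is a prefix of another first. `CompositeKeyComparer` is a hypothetical name.

```csharp
using System;

public static class CompositeKeyComparer
{
    public static int Compare(object[] left, object[] right)
    {
        int shared = Math.Min(left.Length, right.Length);
        for (int i = 0; i < shared; i++)
        {
            object a = left[i], b = right[i];

            // Assumption: null sorts before any non-null value.
            if (a is null || b is null)
            {
                if (a is null && b is null) continue;
                return a is null ? -1 : 1;
            }

            // Strings use ordinal comparison so ordering does not depend on culture;
            // other supported types fall back to their IComparable implementation.
            int c = a is string s && b is string t
                ? string.CompareOrdinal(s, t)
                : ((IComparable)a).CompareTo(b);

            if (c != 0) return c;
        }

        // All shared columns are equal: the shorter key (a prefix) sorts first.
        return left.Length.CompareTo(right.Length);
    }
}
```

Note that `object[]` boxes value types, so an allocation-sensitive implementation would more likely compare the serialized key bytes directly.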

#### 5.1.4 Operations

**Insert:** O(log n)
- Traverse to leaf
- Insert key-location pair
- Split if full (B+Tree standard split)

**Search:** O(log n)
- Binary search in nodes
- Equality: Single lookup
- Range: Scan linked leaf nodes

**Delete:** O(log n)
- Mark entry as deleted
- Lazy compaction on split

### 5.2 R-Tree Index (Geospatial)

#### 5.2.1 Node Structure

R-Tree nodes use **Spatial pages (PageType = 11)**:

```
┌─────────────────────────────────────────────────────────┐
│  [PageHeader (32)]                                      │
├─────────────────────────────────────────────────────────┤
│  [SpatialPageHeader (16)]                               │
│    IsLeaf, Level, EntryCount, ParentPageId              │
├─────────────────────────────────────────────────────────┤
│  [Entries...] (38 bytes each)                           │
│  ┌──────────────────────────────────────┐               │
│  │ MBR (GeoBox): 4 × double = 32 bytes  │               │
│  │   MinLat, MinLon, MaxLat, MaxLon     │               │
│  │ Pointer: DocumentLocation = 6 bytes  │               │
│  └──────────────────────────────────────┘               │
│  (For internal nodes: Pointer = ChildPageId)            │
└─────────────────────────────────────────────────────────┘
```

#### 5.2.2 GeoBox (Minimum Bounding Rectangle)

```csharp
public struct GeoBox
{
    public double MinLat { get; set; }
    public double MinLon { get; set; }
    public double MaxLat { get; set; }
    public double MaxLon { get; set; }

    public bool Intersects(GeoBox other) { ... }
    public bool Contains((double, double) point) { ... }
    public GeoBox ExpandTo(GeoBox other) { ... }
}
```
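
The elided method bodies could look roughly like the following sketch. It treats latitude/longitude as a flat plane and does not handle boxes that cross the antimeridian; `GeoBoxSketch`, `Area`, and `Enlargement` are hypothetical additions used to illustrate how MBR growth might be measured when choosing a subtree.

```csharp
using System;

public struct GeoBoxSketch
{
    public double MinLat, MinLon, MaxLat, MaxLon;

    public GeoBoxSketch(double minLat, double minLon, double maxLat, double maxLon)
    {
        MinLat = minLat; MinLon = minLon; MaxLat = maxLat; MaxLon = maxLon;
    }

    // Two boxes intersect when their ranges overlap on both axes (edges count as overlap).
    public bool Intersects(GeoBoxSketch other) =>
        MinLat <= other.MaxLat && MaxLat >= other.MinLat &&
        MinLon <= other.MaxLon && MaxLon >= other.MinLon;

    // A point is contained when it lies within the closed ranges on both axes.
    public bool Contains((double Lat, double Lon) point) =>
        point.Lat >= MinLat && point.Lat <= MaxLat &&
        point.Lon >= MinLon && point.Lon <= MaxLon;

    // Smallest box covering both this box and the other one.
    public GeoBoxSketch ExpandTo(GeoBoxSketch other) => new GeoBoxSketch(
        Math.Min(MinLat, other.MinLat), Math.Min(MinLon, other.MinLon),
        Math.Max(MaxLat, other.MaxLat), Math.Max(MaxLon, other.MaxLon));

    // Area in squared degrees; only meaningful for relative comparisons.
    public double Area => (MaxLat - MinLat) * (MaxLon - MinLon);

    // Extra area needed to absorb the other box: a candidate cost for subtree selection.
    public double Enlargement(GeoBoxSketch other) => ExpandTo(other).Area - Area;
}
```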

#### 5.2.3 Operations

**Insert:** O(log n)
- Choose subtree with minimal MBR expansion
- Insert and update MBRs up the tree
- Split using quadratic algorithm

**Search (Proximity):**
```
Query: Find points within radius R of (lat, lon)
1. Convert to GeoBox: (lat-R, lon-R) to (lat+R, lon+R), with R expressed in degrees
2. Traverse R-Tree, pruning non-intersecting branches
3. For leaf entries: Calculate exact distance
4. Return sorted by distance
```

**Search (Bounding Box):**
```
Query: Find points within box (minLat, minLon, maxLat, maxLon)
1. Create GeoBox
2. Traverse R-Tree, returning intersecting leaf entries
```

### 5.3 HNSW Index (Vector Similarity)

#### 5.3.1 Node Structure

HNSW nodes use **Vector pages (PageType = 9)**:

```
┌─────────────────────────────────────────────────────────┐
│  [PageHeader (32)]                                      │
├─────────────────────────────────────────────────────────┤
│  [VectorPageHeader (16)]                                │
│    Dimensions, MaxM, NodeSize, NodeCount                │
├─────────────────────────────────────────────────────────┤
│  [Nodes...] (variable size)                             │
│  ┌──────────────────────────────────────┐               │
│  │ DocumentLocation (6 bytes)           │               │
│  │ MaxLevel (1 byte)                    │               │
│  │ Vector (dimensions × 4 bytes)        │               │
│  │ Links Level 0 (2M × 6 bytes)         │               │
│  │ Links Level 1-15 (M × 6 bytes each)  │               │
│  └──────────────────────────────────────┘               │
└─────────────────────────────────────────────────────────┘
```

#### 5.3.2 HNSW Parameters

- **M:** Max bidirectional links per level (typically 16)
- **Dimensions:** Vector dimensionality (e.g., 1536 for OpenAI embeddings)
- **ef_construction:** Quality parameter during build (typically 200)
- **ef_search:** Quality parameter during search (typically 50)

#### 5.3.3 Vector Similarity Metrics

```csharp
public enum VectorMetric
{
    Cosine,      // cos(θ) = dot(a,b) / (||a|| × ||b||)
    Euclidean,   // √Σ(ai - bi)²
    DotProduct   // Σ(ai × bi)
}
```
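
The three formulas in the comments translate directly into a scalar loop. The sketch below is illustrative (no SIMD, single precision) and `VectorMath.Score` is a hypothetical name; note that cosine and dot product are similarities (higher means closer) while Euclidean is a distance (lower means closer), and that cosine is undefined (NaN here) for a zero vector.

```csharp
using System;

public enum VectorMetric { Cosine, Euclidean, DotProduct }

public static class VectorMath
{
    public static float Score(ReadOnlySpan<float> a, ReadOnlySpan<float> b, VectorMetric metric)
    {
        if (a.Length != b.Length) throw new ArgumentException("Dimension mismatch.");

        // One pass accumulates every quantity the three metrics need.
        float dot = 0f, normA = 0f, normB = 0f, squaredDiff = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
            float d = a[i] - b[i];
            squaredDiff += d * d;
        }

        switch (metric)
        {
            case VectorMetric.Cosine: return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
            case VectorMetric.Euclidean: return MathF.Sqrt(squaredDiff);
            case VectorMetric.DotProduct: return dot;
            default: throw new ArgumentOutOfRangeException(nameof(metric));
        }
    }
}
```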

#### 5.3.4 Operations

**Insert:** O(log n) expected
- Assign random level (exponential distribution)
- Starting from top level, greedily descend
- At each level, add bidirectional links to nearest M neighbors
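
The random level assignment in the first step is commonly drawn as `floor(-ln(u) × mL)` with `mL = 1 / ln(M)`, following the HNSW paper cited in the references. The sketch below assumes that formula and caps the result at 15 to match the node layout (links for levels 1-15); `HnswLevels.RandomLevel` is a hypothetical name.

```csharp
using System;

public static class HnswLevels
{
    public static int RandomLevel(Random rng, int m, int maxLevel = 15)
    {
        // Normalization factor: with mL = 1/ln(M), a node reaches level >= 1 with probability 1/M.
        double mL = 1.0 / Math.Log(m);

        // NextDouble() returns [0, 1); shifting to (0, 1] keeps Math.Log away from zero.
        double u = 1.0 - rng.NextDouble();

        int level = (int)Math.Floor(-Math.Log(u) * mL);
        return Math.Min(level, maxLevel);
    }
}
```

With M = 16, roughly 15 out of 16 nodes stay on level 0, which is what keeps the upper layers sparse.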

**Search (k-NN):**
```
Query: Find k nearest neighbors to query vector
1. Start at entry point (top level)
2. Greedily search to local minimum at each level
3. At level 0, maintain priority queue of k candidates
4. Expand candidate set with ef_search parameter
5. Return top k by similarity
```

---

## 6. Transaction and WAL Specification

### 6.1 Transaction Model

CBDD implements **Snapshot Isolation** (SI):

- **Read transactions:** See consistent snapshot as of transaction start
- **Write transactions:** Accumulate changes in-memory, commit atomically
- **Conflict detection:** Last-write-wins (optimistic concurrency)

### 6.2 Write-Ahead Log (WAL)

CBDD implements a **full WAL** for durability and crash recovery:

#### WAL Entry Format

```
┌─────────────────────────────────────────────────────────┐
│                    WAL Entry Format                     │
├─────────────────────────────────────────────────────────┤
│  [Record Type: 1 byte]                                  │
│    0x01 = Begin                                         │
│    0x02 = Write                                         │
│    0x03 = Commit                                        │
│    0x04 = Abort                                         │
│    0x05 = Checkpoint                                    │
├─────────────────────────────────────────────────────────┤
│  For Begin/Commit/Abort:                                │
│    [Transaction ID: 8 bytes]                            │
│    [Timestamp: 8 bytes (Unix ms)]                       │
│    Total: 17 bytes                                      │
├─────────────────────────────────────────────────────────┤
│  For Write:                                             │
│    [Transaction ID: 8 bytes]                            │
│    [PageId: 4 bytes]                                    │
│    [After Image Length: 4 bytes]                        │
│    [After Image: variable bytes]                        │
│    Total: 17 + AfterImage.Length                        │
└─────────────────────────────────────────────────────────┘
```

#### WAL Protocol

**Write Path:**
```
1. Begin Transaction → WriteBeginRecord(txnId)
2. Modify Data       → WriteDataRecord(txnId, pageId, afterImage)
3. Commit            → WriteCommitRecord(txnId) + Flush()
```

**Recovery Path:**
```csharp
var records = wal.ReadAll();
foreach (var record in records)
{
    if (record.Type == WalRecordType.Write && IsCommitted(record.TransactionId))
    {
        pageFile.WritePage(record.PageId, record.AfterImage);
    }
}
```
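
The recovery loop relies on an `IsCommitted` check that the snippet leaves undefined. One way to provide it is a two-pass replay: first collect the IDs of transactions that have a Commit record, then apply only their writes in log order. The sketch below models records in memory; `WalRecord`, `WalRecovery`, and the `writePage` callback are hypothetical stand-ins for the real WAL reader and page file.

```csharp
using System;
using System.Collections.Generic;

public enum WalRecordType : byte { Begin = 1, Write = 2, Commit = 3, Abort = 4, Checkpoint = 5 }

// Hypothetical in-memory shape of a decoded WAL record.
public sealed class WalRecord
{
    public WalRecordType Type { get; }
    public ulong TransactionId { get; }
    public uint PageId { get; }
    public byte[] AfterImage { get; }

    public WalRecord(WalRecordType type, ulong transactionId, uint pageId, byte[] afterImage)
    {
        Type = type;
        TransactionId = transactionId;
        PageId = pageId;
        AfterImage = afterImage;
    }
}

public static class WalRecovery
{
    public static void Replay(IReadOnlyList<WalRecord> log, Action<uint, byte[]> writePage)
    {
        // Pass 1: a transaction counts as committed only if its Commit record reached the log.
        var committed = new HashSet<ulong>();
        foreach (var record in log)
        {
            if (record.Type == WalRecordType.Commit) committed.Add(record.TransactionId);
        }

        // Pass 2: redo committed writes in log order, so the latest after-image of a page wins.
        foreach (var record in log)
        {
            if (record.Type == WalRecordType.Write && committed.Contains(record.TransactionId))
            {
                writePage(record.PageId, record.AfterImage);
            }
        }
    }
}
```

Writes from transactions without a Commit record (crashed or aborted) are simply skipped, which is what makes an after-image-only log sufficient for atomicity in this scheme.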

#### Implementation Details

**Zero-Allocation Writes:**
```csharp
// Requires: using System; using System.Buffers;

// Synchronous path: the 17-byte Begin record is built on the stack.
// BitConverter writes in host byte order, which is little-endian on the platforms in §11.2.
Span<byte> record = stackalloc byte[17];
record[0] = (byte)WalRecordType.Begin;
BitConverter.TryWriteBytes(record[1..9], transactionId);
BitConverter.TryWriteBytes(record[9..17], timestamp);
_walStream.Write(record);

// Asynchronous path: a stackalloc span cannot live across an await, so rent from the pool.
byte[] rented = ArrayPool<byte>.Shared.Rent(totalSize);
try
{
    // ... write the record into rented[0..totalSize]
    await _walStream.WriteAsync(rented.AsMemory(0, totalSize), ct);
}
finally
{
    ArrayPool<byte>.Shared.Return(rented);
}
```

**Durability Guarantee:**
```csharp
public void Flush()
{
    _walStream?.Flush(flushToDisk: true);  // Force OS fsync
}
```

**Checkpoint and Truncate:**
```csharp
// After applying WAL to pages:
pageFile.Flush();  // Ensure pages on disk
wal.Truncate();    // Remove committed WAL records
wal.Flush();       // Sync truncation
```

### 6.3 ACID Guarantees

- **Atomicity:** WAL ensures all-or-nothing commits
- **Consistency:** Schema validation before commit
- **Isolation:** Snapshot isolation (MVCC-like)
- **Durability:** WAL flush on commit

---

## 7. Query Processing

### 7.1 LINQ Provider

CBDD implements `IQueryable<T>` via a custom query provider:

```csharp
var results = db.Users.AsQueryable()
    .Where(u => u.Age > 25 && u.City == "NYC")
    .OrderBy(u => u.Name)
    .Take(10)
    .ToList();
```

**Translation:**
1. Expression tree → B+Tree query plan
2. Index selection: Choose index on `Age` or `City`
3. Index scan: Retrieve candidate DocumentLocations
4. Post-filter: Apply remaining predicates in-memory
5. Materialize: Read documents, deserialize to objects

### 7.2 Index Selection Algorithm

```
For WHERE clause:
1. Extract equality and range predicates
2. Score each index by predicate coverage
3. Select index with highest score
4. Use index scan if selective, else full scan
```

**Example:**
```sql
WHERE Age > 25 AND City = "NYC"
```

Index options:
- Index on `Age` → Range scan (potentially many results)
- Index on `City` → Equality lookup (likely fewer results)
- **Choose:** `City` index, post-filter `Age > 25`
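
A scoring rule consistent with this example could be sketched as below. The weights (equality = 2, range = 1) are an assumption made for illustration, since the specification only says indexes are scored by predicate coverage; `FieldPredicate`, `PredicateKind`, and `IndexSelector` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum PredicateKind { Equality, Range }

// Hypothetical description of one extracted WHERE predicate.
public sealed class FieldPredicate
{
    public string Field { get; }
    public PredicateKind Kind { get; }

    public FieldPredicate(string field, PredicateKind kind)
    {
        Field = field;
        Kind = kind;
    }
}

public static class IndexSelector
{
    // Assumed weights: an equality lookup is expected to be more selective than a range scan.
    private static int Score(PredicateKind kind) => kind == PredicateKind.Equality ? 2 : 1;

    // Returns the indexed field to scan, or null to signal a full collection scan.
    public static string ChooseIndex(IEnumerable<FieldPredicate> predicates, ISet<string> indexedFields)
    {
        return predicates
            .Where(p => indexedFields.Contains(p.Field))  // only predicates backed by an index
            .OrderByDescending(p => Score(p.Kind))        // best-scoring predicate first
            .Select(p => p.Field)
            .FirstOrDefault();                            // null when no index applies
    }
}
```

For `Age > 25 AND City = "NYC"` with both fields indexed, this picks `City`; the `Age` predicate is then left for the in-memory post-filter.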

### 7.3 Hybrid Execution Model

CBDD combines **index-based** and **in-memory** execution:

1. **Index Phase:** Use B+Tree/R-Tree to filter candidates
2. **Materialization:** Read documents from pages
3. **LINQ to Objects:** Apply complex predicates, projections, aggregations

**Benefits:**
- Leverage index selectivity
- Support full LINQ semantics
- Avoid building a complex query execution engine

---

## 8. Source Generation Protocol

### 8.1 Mapper Generation

CBDD uses **Roslyn Source Generators** to produce zero-reflection mappers:

**Input:**
```csharp
public class User
{
    public ObjectId Id { get; set; }
    public string Name { get; set; }
    public int Age { get; set; }
}

public partial class MyDbContext : DocumentDbContext
{
    public DocumentCollection<ObjectId, User> Users { get; set; }
}
```

**Generated:**
```csharp
namespace MyApp.Mappers
{
    public class UserMapper : ObjectIdMapperBase<User>
    {
        public override string CollectionName => "users";

        public override int Serialize(User entity, BsonSpanWriter writer)
        {
            var start = writer.BeginDocument();
            writer.WriteObjectId("_id", entity.Id);
            writer.WriteString("name", entity.Name);
            writer.WriteInt32("age", entity.Age);
            writer.EndDocument(start);
            return writer.Position;
        }

        public override User Deserialize(BsonSpanReader reader)
        {
            var user = new User();
            reader.ReadDocumentSize();
            while (reader.Remaining > 0)
            {
                var type = reader.ReadBsonType();
                if (type == BsonType.EndOfDocument) break;
                var name = reader.ReadElementHeader();
                switch (name)
                {
                    case "_id": user.Id = reader.ReadObjectId(); break;
                    case "name": user.Name = reader.ReadString(); break;
                    case "age": user.Age = reader.ReadInt32(); break;
                    default: reader.SkipValue(type); break;
                }
            }
            return user;
        }
    }
}
```

### 8.2 Attribute Support

See **Data Annotations Support** section in README and related documentation.

Supported attributes:
- `[Table(Name, Schema)]` → Collection name mapping
- `[Column(Name, TypeName)]` → Field name and special type mapping
- `[Key]` → Primary key identification
- `[NotMapped]` → Exclusion from serialization
- `[Required]`, `[StringLength]`, `[Range]` → Validation

### 8.3 Code Generation Rules

1. **Lowercase policy:** BSON field names are lowercase by default
2. **Attribute override:** `[BsonProperty]`, `[JsonPropertyName]`, `[Column]` override default names
3. **Nested objects:** Recursively analyzed and mapped
4. **Collections:** Arrays and `IEnumerable<T>` mapped to BSON arrays
5. **Value types:** Primitives, enums, `DateTime`, `ObjectId`, `Guid` handled natively

---

## 9. Security Considerations

### 9.1 Data Integrity

- **Checksums:** CRC32 on page headers (planned: extend to full pages)
- **WAL:** Ensures consistency even on crash
- **Schema validation:** Prevents type mismatches

### 9.2 Concurrent Access

**Multi-Threading:** ✅ **Fully Supported**
- **Thread-safe writes:** Multiple threads can write concurrently within the same process
- **Internal synchronization:** `SemaphoreSlim` for critical sections, `ConcurrentDictionary` for shared state
- **WAL coordination:** Commit lock ensures WAL writes are serialized
- **Page cache:** Thread-safe access to memory-mapped pages

**Multi-Process:** ❌ **Not Supported**
- **Exclusive file lock:** `FileShare.None` prevents multiple processes from opening the same database
- **Rationale:** Simplifies consistency guarantees, avoids complex inter-process coordination
- **Future:** May support via cooperative locking or server mode (HTTP API)

### 9.3 Injection Attacks

- **No SQL injection:** No query language, only type-safe LINQ
- **Schema-validated:** All operations type-checked at compile-time

---

## 10. Performance Considerations

### 10.1 Zero-Allocation Design

**Stack allocation:**
```csharp
Span<byte> buffer = stackalloc byte[16384];  // Page buffer on stack
var writer = new BsonSpanWriter(buffer, keyMap);
```

**No boxing:**
- Value types remain unboxed throughout serialization
- No reflection or dynamic invocation

**Pooling:**
```csharp
var buffer = ArrayPool<byte>.Shared.Rent(pageSize);
try { /* use buffer */ }
finally { ArrayPool<byte>.Shared.Return(buffer); }
```

### 10.2 Memory-Mapped Files

**Benefits:**
- OS kernel manages page cache
- Zero-copy reads (map file → process memory)
- Prefetching and read-ahead by OS

**Tradeoffs:**
- Limited to single process (file lock)
- Windows vs. Linux differences in `mmap` behavior

### 10.3 Cache Efficiency

**Compact C-BSON:**
- More documents per 16KB page
- Better CPU cache utilization
- Reduced TLB misses

---

## 11. Implementation Notes

### 11.1 .NET 10 Requirements

CBDD targets **.NET 10** to leverage:
- `Span<T>` and `ref struct` for zero-copy I/O
- Source Generators (Roslyn)
- `MemoryMarshal` for efficient struct serialization
- Improved JIT optimizations

**Future:** Evaluate `.NET Standard 2.1` for broader compatibility.

### 11.2 Platform Considerations

- **Windows:** Full support via MemoryMappedFile
- **Linux/macOS:** Supported, potential differences in `mmap` behavior
- **Endianness:** Little-endian assumed (matches x86/x64/ARM)

### 11.3 Future Compatibility

**Planned enhancements:**
- **Compression:** Page-level or document-level LZ4
- **Encryption:** AES-256 for data-at-rest
- **Multi-process:** Via cooperative locking or server mode

---

## 12. References

### 12.1 BSON and Formats

- [BSON Specification v1.1](http://bsonspec.org/)
- [MongoDB BSON Types](https://www.mongodb.com/docs/manual/reference/bson-types/)

### 12.2 Database Internals

- *Database Internals* by Alex Petrov (O'Reilly, 2019)
- [B+ Trees (Wikipedia)](https://en.wikipedia.org/wiki/B%2B_tree)
- [R-Trees (Guttman, 1984)](http://www-db.deis.unibo.it/courses/SI-LS/papers/Gut84.pdf)

### 12.3 Vector Search

- [HNSW Algorithm (Malkov & Yashunin, 2018)](https://arxiv.org/abs/1603.09320)
- [Faiss Library (Facebook AI)](https://github.com/facebookresearch/faiss)

### 12.4 Standards

- [RFC 2119: Key words for RFCs](https://www.ietf.org/rfc/rfc2119.txt)
- [IEEE 754: Floating Point Arithmetic](https://ieeexplore.ieee.org/document/8766229)

---

## Appendix A: Page Format Diagrams

### A.1 Slotted Page Visual

```
Offsets 0..16383 (16 KB page)

      0 ┌────────────────────────────────────────────────┐
        │ [Header: 24 bytes]                             │
     24 ├────────────────────────────────────────────────┤
        │ [Slot 0: Offset=16320, Len=64, Flags=0]        │
        │ [Slot 1: Offset=16256, Len=64, Flags=0]        │
        │ [Slot 2: Offset=16192, Len=64, Flags=0]        │
     48 │ ← FreeSpaceStart                               │
        │                                                │
        │ ~~~~~~~~~~~~~~~~ Free Space ~~~~~~~~~~~~~~~~~~ │
        │                                                │
  16192 │ ← FreeSpaceEnd                                 │
        │ [Document 2 data: 64 bytes]                    │
        │ [Document 1 data: 64 bytes]                    │
        │ [Document 0 data: 64 bytes]                    │
  16384 └────────────────────────────────────────────────┘
```

### A.2 B+Tree Node Visual

```
Internal Node:
┌──────────────────────────────────────────────────────┐
│ [PageHeader 32] [BTreeNodeHeader 20]                 │
├──────────────────────────────────────────────────────┤
│ Key1 | ChildPtr1                                     │
│ Key2 | ChildPtr2                                     │
│ Key3 | ChildPtr3                                     │
│ ...                                                  │
└──────────────────────────────────────────────────────┘

Leaf Node:
┌──────────────────────────────────────────────────────┐
│ [PageHeader 32] [BTreeNodeHeader 20]                 │
├──────────────────────────────────────────────────────┤
│ Key1 | DocumentLocation1                             │
│ Key2 | DocumentLocation2                             │
│ Key3 | DocumentLocation3                             │
│ ...                                                  │
│ NextLeafPageId → [next leaf]                         │
└──────────────────────────────────────────────────────┘
```

---

## Appendix B: C-BSON Hex Dump

See `C-BSON.md`, Section "Hex Dump Examples" for detailed wire format examples.

---

## Appendix C: Transaction State Machine

```
        ┌─────────────┐
        │    Idle     │
        └──────┬──────┘
               │ BeginTransaction()
               ▼
        ┌─────────────┐
        │   Active    │ ← Accumulate writes in memory
        └──┬───────┬──┘
           │       │ Rollback()
           │       └────────────┐
           │ Commit()           │
           ▼                    ▼
    ┌─────────────┐      ┌─────────────┐
    │ Committing  │      │   Aborted   │
    │ (WAL flush) │      └─────────────┘
    └──────┬──────┘
           │ Flush complete
           ▼
    ┌─────────────┐
    │  Committed  │
    └─────────────┘
```

---

**End of RFC-CBDD Specification**