Files
CBDD/RFC.md

1007 lines
40 KiB
Markdown
Executable File
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# RFC-CBDD: High-Performance Embedded Document Database for .NET
**Status:** Draft
**Version:** 0.1.0
**Date:** February 2026
**Authors:** CBDD Development Team
---
## Abstract
This document specifies **CBDD**, a high-performance embedded document-oriented database engine for .NET 10. CBDD is designed from the ground up for **zero-allocation performance**, leveraging modern .NET features including `Span<T>`, Memory-Mapped Files, and Source Generators. The database implements a custom **C-BSON** (Compressed BSON) format that achieves 30-60% storage reduction compared to standard BSON while maintaining full type compatibility.
Key innovations include:
- **Zero-allocation I/O** via `Span<byte>` and `stackalloc`
- **C-BSON format** with field name compression (2-byte IDs vs. variable-length strings)
- **Page-based storage** with memory-mapped file I/O
- **Multiple index types:** B+Tree, R-Tree (geospatial), and HNSW (vector similarity)
- **ACID transactions** with Write-Ahead Logging and Snapshot Isolation
- **Compile-time code generation** for zero-reflection serialization
---
## 1. Introduction
### 1.1 Motivation
Most embedded databases for .NET fall into two categories:
1. **Native wrappers** (SQLite, RocksDB): High performance but with interop overhead and GC pressure from marshalling
2. **Managed implementations** (LiteDB, others): Pure C# but burdened by reflection, excessive allocations, and legacy design
CBDD bridges this gap by leveraging **.NET 10's modern performance features** while remaining a pure managed implementation:
- `Span<T>` and `Memory<T>` for zero-copy I/O
- Source Generators for zero-reflection serialization
- Memory-Mapped Files for OS-level page caching
- Stack allocation (`stackalloc`) for ephemeral buffers
### 1.2 Scope
This RFC specifies:
- **Storage Engine:** Page file format, WAL protocol, transaction semantics
- **C-BSON Format:** Wire format, schema management, key compression
- **Indexing:** B+Tree, R-Tree, and HNSW implementations
- **Query Engine:** LINQ provider and hybrid execution model
- **Code Generation:** Mapper generation rules and attribute support
### 1.3 Terminology
**MUST**, **SHOULD**, **MAY**: As defined in RFC 2119
- **Page:** Fixed-size block of data (default 16KB)
- **Slot:** Variable-size entry within a slotted page
- **C-BSON:** Compressed BSON with field ID compression
- **WAL:** Write-Ahead Log for durability
- **MBR:** Minimum Bounding Rectangle (for R-Tree)
- **HNSW:** Hierarchical Navigable Small World (vector search algorithm)
---
## 2. Architecture Overview
```
┌─────────────────────────────────────────────────────────┐
│ CBDD Architecture │
├─────────────────────────────────────────────────────────┤
│ ┌───────────────┐ ┌───────────────┐ ┌─────────────┐ │
│ │ LINQ Provider │ │ Source Gen │ │ Collections │ │
│ │ (Queryable) │ │ (Mappers) │ │ (DbContext)│ │
│ └───────┬───────┘ └───────┬───────┘ └──────┬──────┘ │
│ │ │ │ │
│ ┌───────▼─────────────────────────────────────▼──────┐ │
│ │ Index Layer (B-Tree, R-Tree, HNSW) │ │
│ └───────┬─────────────────────────────────────┬──────┘ │
│ │ │ │
│ ┌───────▼──────────────────────────────────────▼─────┐ │
│ │ Storage Engine (Pages, Transactions) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │ │
│ │ │ PageFile │ │ WAL │ │ FreeList │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────┘ │ │
│ └────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ C-BSON (Span-based Reader/Writer) │ │
│ └────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ OS Memory-Mapped Files (Kernel Page Cache) │ │
│ └────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
### 2.1 Storage Layer
The storage layer manages page-based I/O using memory-mapped files:
- **PageFile:** Fixed-size pages (8KB, 16KB, or 32KB)
- **Page Types:** Header, Data, Index, Vector, Spatial, Dictionary, Schema, Overflow, Free
- **Free List:** Linked list of reusable pages
### 2.2 C-BSON Layer
Provides zero-allocation BSON serialization/deserialization:
- **BsonSpanWriter:** Writes C-BSON to `Span<byte>`
- **BsonSpanReader:** Reads C-BSON from `ReadOnlySpan<byte>`
- **Schema Management:** Field name → ID mapping
Full specification: See `C-BSON.md`
### 2.3 Index Layer
Three specialized index types:
- **B+Tree:** General-purpose sorted index (range queries, equality)
- **R-Tree:** Geospatial index (proximity, bounding box queries)
- **HNSW:** Vector similarity search (k-NN, ANN)
### 2.4 Query Layer
LINQ-to-CBDD provider:
- Translates LINQ expressions to index operations
- Hybrid execution: Index-based filtering + in-memory LINQ to Objects
- Supports `Where`, `OrderBy`, `Skip`, `Take`, `GroupBy`, `Join`, aggregations
### 2.5 Transaction Layer
ACID guarantees via:
- **Atomicity:** All-or-nothing commits
- **Consistency:** Schema validation
- **Isolation:** Snapshot isolation (MVCC-like)
- **Durability:** Write-Ahead Logging (WAL)
---
## 3. Storage Engine Specification
### 3.1 Page File Format
#### 3.1.1 Page Sizes
CBDD supports 3 predefined page sizes:
| Configuration | Page Size | Use Case |
|:--------------|:----------|:--------------------------------|
| **Small** | 8 KB | Embedded, tiny documents |
| **Default** | 16 KB | General purpose (InnoDB-like) |
| **Large** | 32 KB | Big documents (MongoDB-like) |
**Rationale:** 16KB aligns with Linux page cache and InnoDB defaults, balancing fragmentation vs. overhead.
#### 3.1.2 File Layout
```
┌─────────────────────────────────────────────────────────┐
│ Page 0: File Header │
│ [PageHeader (32)] │
│ [Database Version (4)] │
│ [Page Size (4)] │
│ [First Free Page ID (4)] ← Free list head │
│ [Dictionary Root Page ID (4)] │
│ [Reserved (remaining)] │
├─────────────────────────────────────────────────────────┤
│ Page 1: Collection Metadata │
│ [SlottedPageHeader (24)] │
│ [Collection Schemas...] │
├─────────────────────────────────────────────────────────┤
│ Page 2+: Data, Index, Vector, Spatial, Dictionary... │
└─────────────────────────────────────────────────────────┘
```
### 3.2 Page Header Format
All pages start with a **32-byte header**:
```
Offset Size Field Description
------ ---- ------------------ --------------------------
0 4 PageId Page number (0-indexed)
4 1 PageType Enum (see §3.3)
5 2 FreeBytes Unused space in page
7 4 NextPageId Linked list pointer
11 8 TransactionId Last modifying transaction
19 4 Checksum CRC32 of page data
23 4 DictionaryRootPageId (Page 0 only)
27 5 Reserved Future use
```
**Total:** 32 bytes
**Implementation:**
```csharp
[StructLayout(LayoutKind.Explicit, Size = 32)]
public struct PageHeader
{
[FieldOffset(0)] public uint PageId;
[FieldOffset(4)] public PageType PageType;
[FieldOffset(5)] public ushort FreeBytes;
[FieldOffset(7)] public uint NextPageId;
[FieldOffset(11)] public ulong TransactionId;
[FieldOffset(19)] public uint Checksum;
[FieldOffset(23)] public uint DictionaryRootPageId;
}
```
### 3.3 Page Types
| Value | Type | Purpose |
|:------|:------------|:-----------------------------------------|
| 0 | Empty | Uninitialized |
| 1 | Header | Page 0 (file header) |
| 2 | Collection | Schema and collection metadata |
| 3 | Data | Document storage (slotted page) |
| 4 | Index | B+Tree node |
| 5 | FreeList | Deprecated (unused) |
| 6 | Overflow | Continuation of large documents |
| 7 | Dictionary | String interning for C-BSON keys |
| 8 | Schema | Schema versioning |
| 9 | Vector | HNSW index node |
| 10 | Free | Reusable page (linked via NextPageId) |
| 11 | Spatial | R-Tree node |
### 3.4 Slotted Page Structure
Data pages use a **slotted page** design for variable-size documents:
```
┌─────────────────────────────────────────────────────────┐
│ [SlottedPageHeader (24)] │
│ PageId, PageType, SlotCount, FreeSpaceStart, │
│ FreeSpaceEnd, NextOverflowPage, TransactionId │
├─────────────────────────────────────────────────────────┤
│ [Slot Array (grows down)] │
│ ┌──────────────────────────────────┐ │
│ │ Slot 0: [Offset, Length, Flags] │ (8 bytes) │
│ │ Slot 1: [Offset, Length, Flags] │ │
│ │ Slot 2: [Offset, Length, Flags] │ │
│ │ ... │ │
│ └──────────────────────────────────┘ │
│ ↓ │
│ [Free Space] │
│ ↑ │
│ [Data Area (grows up)] │
│ ┌──────────────────────────────────┐ │
│ │ ... Document N C-BSON bytes ... │ │
│ │ ... Document 2 C-BSON bytes ... │ │
│ │ ... Document 1 C-BSON bytes ... │ │
│ │ ... Document 0 C-BSON bytes ... │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
#### SlottedPageHeader (24 bytes)
```
Offset Size Field Description
------ ---- ------------------ --------------------------
0 4 PageId
4 1 PageType (= 3 for Data pages)
8 2 SlotCount Number of slots
10 2 FreeSpaceStart Offset where slots end
12 2 FreeSpaceEnd Offset where data begins
14 4 NextOverflowPage For large documents
18 4 TransactionId
22 2 Reserved
```
#### SlotEntry (8 bytes)
```
Offset Size Field Description
------ ---- ------------------ --------------------------
0 2 Offset Byte offset to document data
2 2 Length Document length in bytes
4 4 Flags SlotFlags enum
```
**SlotFlags:**
- `None = 0`: Active slot
- `Deleted = 1`: Slot marked for reuse
- `HasOverflow = 2`: Document continues in overflow pages
- `Compressed = 4`: Reserved for future compression
### 3.5 DocumentLocation
Documents are addressed by **(PageId, SlotIndex)** tuple:
```csharp
public readonly struct DocumentLocation
{
public uint PageId { get; init; } // 4 bytes
public ushort SlotIndex { get; init; } // 2 bytes
}
// Total: 6 bytes when serialized
```
**Used in:**
- Index entries (key → location mapping)
- Overflow page chains
- Internal references
### 3.6 Free Page Management
Deleted pages form a **linked list** for reuse:
1. `PageHeader.PageType = Free`
2. `PageHeader.NextPageId` points to next free page
3. Page 0's `NextPageId` points to **head of free list**
**Allocation:**
```
if (freeListHead != 0):
pageId = freeListHead
freeListHead = ReadPage(pageId).NextPageId
UpdatePage0(freeListHead)
else:
pageId = nextPageId++
ExpandFile()
```
**Deallocation:**
```
WritePage(pageId, Free with next=freeListHead)
freeListHead = pageId
UpdatePage0(freeListHead)
```
---
## 4. C-BSON Format Specification
C-BSON (Compressed BSON) is CBDD's wire format. See the dedicated **`C-BSON.md`** document for complete specification.
**Key points:**
- **Element header:** `[1 byte type][2 byte field ID]` (vs. BSON's `[1 byte type][N byte name\0]`)
- **Schema-based:** Field names mapped to `ushort` IDs via `ConcurrentDictionary`
- **Storage savings:** 30-60% reduction for typical schemas
- **Type compatible:** Uses standard BSON type codes and value encoding
**Example:**
```
Standard BSON element: [0x02]['em','ai','l','\0'][value] = 6 bytes overhead
C-BSON element: [0x02][0x03, 0x00][value] = 3 bytes overhead
Savings: 50%
```
---
## 5. Indexing Specifications
### 5.1 B+Tree Index
#### 5.1.1 Node Structure
B+Tree nodes are stored in **Index pages (PageType = 4)**:
```
┌─────────────────────────────────────────────────────────┐
│ [PageHeader (32)] │
├─────────────────────────────────────────────────────────┤
│ [BTreeNodeHeader (20)] │
│ PageId, IsLeaf, EntryCount, ParentPageId, │
│ NextLeafPageId, PrevLeafPageId │
├─────────────────────────────────────────────────────────┤
│ [Entries...] │
│ For Leaf Nodes: │
│ ┌────────────────────────────────────┐ │
│ │ IndexKey (variable) │ │
│ │ DocumentLocation (6 bytes) │ │
│ └────────────────────────────────────┘ │
│ For Internal Nodes: │
│ ┌────────────────────────────────────┐ │
│ │ IndexKey (variable) │ │
│ │ ChildPageId (4 bytes) │ │
│ └────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
#### 5.1.2 BTreeNodeHeader (20 bytes)
```csharp
public struct BTreeNodeHeader
{
public uint PageId; // 4 bytes
public bool IsLeaf; // 1 byte
public ushort EntryCount; // 2 bytes
public uint ParentPageId; // 4 bytes
public uint NextLeafPageId; // 4 bytes (leaf only)
public uint PrevLeafPageId; // 4 bytes (leaf only)
}
```
#### 5.1.3 IndexKey
Supports composite keys with multiple types:
```csharp
public struct IndexKey : IComparable<IndexKey>
{
public object[] Values { get; set; } // Multi-column support
public int CompareTo(IndexKey other) { ... }
}
```
**Serialization:**
- Each value serialized as C-BSON element
- Supports: String, Int32, Int64, Double, ObjectId, DateTime, Guid
#### 5.1.4 Operations
**Insert:** O(log n)
- Traverse to leaf
- Insert key-location pair
- Split if full (B+Tree standard split)
**Search:** O(log n)
- Binary search in nodes
- Equality: Single lookup
- Range: Scan linked leaf nodes
**Delete:** O(log n)
- Mark entry as deleted
- Lazy compaction on split
### 5.2 R-Tree Index (Geospatial)
#### 5.2.1 Node Structure
R-Tree nodes use **Spatial pages (PageType = 11)**:
```
┌─────────────────────────────────────────────────────────┐
│ [PageHeader (32)] │
├─────────────────────────────────────────────────────────┤
│ [SpatialPageHeader (16)] │
│ IsLeaf, Level, EntryCount, ParentPageId │
├─────────────────────────────────────────────────────────┤
│ [Entries...] (38 bytes each) │
│ ┌──────────────────────────────────────┐ │
│ │ MBR (GeoBox): 4 × double = 32 bytes │ │
│ │ MinLat, MinLon, MaxLat, MaxLon │ │
│ │ Pointer: DocumentLocation = 6 bytes │ │
│ └──────────────────────────────────────┘ │
│ (For internal nodes: Pointer = ChildPageId) │
└─────────────────────────────────────────────────────────┘
```
#### 5.2.2 GeoBox (Minimum Bounding Rectangle)
```csharp
public struct GeoBox
{
public double MinLat { get; set; }
public double MinLon { get; set; }
public double MaxLat { get; set; }
public double MaxLon { get; set; }
public bool Intersects(GeoBox other) { ... }
public bool Contains((double, double) point) { ... }
public GeoBox ExpandTo(GeoBox other) { ... }
}
```
#### 5.2.3 Operations
**Insert:** O(log n)
- Choose subtree with minimal MBR expansion
- Insert and update MBRs up the tree
- Split using quadratic algorithm
**Search (Proximity):**
```
Query: Find points within radius R of (lat, lon)
1. Convert to GeoBox: (lat-R, lon-R) to (lat+R, lon+R)
2. Traverse R-Tree, pruning non-intersecting branches
3. For leaf entries: Calculate exact distance
4. Return sorted by distance
```
**Search (Bounding Box):**
```
Query: Find points within box (minLat, minLon, maxLat, maxLon)
1. Create GeoBox
2. Traverse R-Tree, returning intersecting leaf entries
```
### 5.3 HNSW Index (Vector Similarity)
#### 5.3.1 Node Structure
HNSW nodes use **Vector pages (PageType = 9)**:
```
┌─────────────────────────────────────────────────────────┐
│ [PageHeader (32)] │
├─────────────────────────────────────────────────────────┤
│ [VectorPageHeader (16)] │
│ Dimensions, MaxM, NodeSize, NodeCount │
├─────────────────────────────────────────────────────────┤
│ [Nodes...] (variable size) │
│ ┌──────────────────────────────────────┐ │
│ │ DocumentLocation (6 bytes) │ │
│ │ MaxLevel (1 byte) │ │
│ │ Vector (dimensions × 4 bytes) │ │
│ │ Links Level 0 (2M × 6 bytes) │ │
│ │ Links Level 1-15 (M × 6 bytes each) │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
#### 5.3.2 HNSW Parameters
- **M:** Max bidirectional links per level (typically 16)
- **Dimensions:** Vector dimensionality (e.g., 1536 for OpenAI embeddings)
- **ef_construction:** Quality parameter during build (typically 200)
- **ef_search:** Quality parameter during search (typically 50)
#### 5.3.3 Vector Similarity Metrics
```csharp
public enum VectorMetric
{
Cosine, // cos(θ) = dot(a,b) / (||a|| × ||b||)
Euclidean, // √Σ(ai - bi)²
DotProduct // Σ(ai × bi)
}
```
#### 5.3.4 Operations
**Insert:** O(log n) expected
- Assign random level (exponential distribution)
- Starting from top level, greedily descend
- At each level, add bidirectional links to nearest M neighbors
**Search (k-NN):**
```
Query: Find k nearest neighbors to query vector
1. Start at entry point (top level)
2. Greedily search to local minimum at each level
3. At level 0, maintain priority queue of k candidates
4. Expand candidate set with ef_search parameter
5. Return top k by similarity
```
---
## 6. Transaction and WAL Specification
### 6.1 Transaction Model
CBDD implements **Snapshot Isolation** (SI):
- **Read transactions:** See consistent snapshot as of transaction start
- **Write transactions:** Accumulate changes in-memory, commit atomically
- **Conflict detection:** Last-write-wins (optimistic concurrency)
### 6.2 Write-Ahead Log (WAL)
CBDD implements a **full WAL** for durability and crash recovery:
#### WAL Entry Format
```
┌─────────────────────────────────────────────────────────┐
│ WAL Entry Format │
├─────────────────────────────────────────────────────────┤
│ [Record Type: 1 byte] │
│ 0x01 = Begin │
│ 0x02 = Write │
│ 0x03 = Commit │
│ 0x04 = Abort │
│ 0x05 = Checkpoint │
├─────────────────────────────────────────────────────────┤
│ For Begin/Commit/Abort: │
│ [Transaction ID: 8 bytes] │
│ [Timestamp: 8 bytes (Unix ms)] │
│ Total: 17 bytes │
├─────────────────────────────────────────────────────────┤
│ For Write: │
│ [Transaction ID: 8 bytes] │
│ [PageId: 4 bytes] │
│ [After Image Length: 4 bytes] │
│ [After Image: variable bytes] │
│ Total: 17 + AfterImage.Length │
└─────────────────────────────────────────────────────────┘
```
#### WAL Protocol
**Write Path:**
```
1. Begin Transaction → WriteBeginRecord(txnId)
2. Modify Data → WriteDataRecord(txnId, pageId, afterImage)
3. Commit → WriteCommitRecord(txnId) + Flush()
```
**Recovery Path:**
```csharp
var records = wal.ReadAll();
foreach (var record in records)
{
if (record.Type == WalRecordType.Write && IsCommitted(record.TransactionId))
{
pageFile.WritePage(record.PageId, record.AfterImage);
}
}
```
#### Implementation Details
**Zero-Allocation Writes:**
```csharp
// Synchronous (stack allocated)
Span<byte> buffer = stackalloc byte[17];
buffer[0] = (byte)WalRecordType.Begin;
BitConverter.TryWriteBytes(buffer[1..9], transactionId);
BitConverter.TryWriteBytes(buffer[9..17], timestamp);
_walStream.Write(buffer);
// Asynchronous (pooled)
var buffer = ArrayPool<byte>.Shared.Rent(totalSize);
try
{
// ... write to buffer
await _walStream.WriteAsync(buffer.AsMemory(0, totalSize), ct);
}
finally
{
ArrayPool<byte>.Shared.Return(buffer);
}
```
**Durability Guarantee:**
```csharp
public void Flush()
{
_walStream?.Flush(flushToDisk: true); // Force OS fsync
}
```
**Checkpoint and Truncate:**
```csharp
// After applying WAL to pages:
pageFile.Flush(); // Ensure pages on disk
wal.Truncate(); // Remove committed WAL records
wal.Flush(); // Sync truncation
```
### 6.3 ACID Guarantees
- **Atomicity:** WAL ensures all-or-nothing commits
- **Consistency:** Schema validation before commit
- **Isolation:** Snapshot isolation (MVCC-like)
- **Durability:** WAL flush on commit
---
## 7. Query Processing
### 7.1 LINQ Provider
CBDD implements `IQueryable<T>` via custom query provider:
```csharp
var results = db.Users.AsQueryable()
.Where(u => u.Age > 25 && u.City == "NYC")
.OrderBy(u => u.Name)
.Take(10)
.ToList();
```
**Translation:**
1. Expression tree → BTree query plan
2. Index selection: Choose index on `Age` or `City`
3. Index scan: Retrieve candidate DocumentLocations
4. Post-filter: Apply remaining predicates in-memory
5. Materialize: Read documents, deserialize to objects
### 7.2 Index Selection Algorithm
```
For WHERE clause:
1. Extract equality and range predicates
2. Score each index by predicate coverage
3. Select index with highest score
4. Use index scan if selective, else full scan
```
**Example:**
```sql
WHERE Age > 25 AND City = "NYC"
```
Index options:
- Index on `Age` → Range scan (potentially many results)
- Index on `City` → Equality lookup (likely fewer results)
- **Choose:** `City` index, post-filter `Age > 25`
### 7.3 Hybrid Execution Model
CBDD combines **index-based** and **in-memory** execution:
1. **Index Phase:** Use B-Tree/R-Tree to filter candidates
2. **Materialization:** Read documents from pages
3. **LINQ to Objects:** Apply complex predicates, projections, aggregations
**Benefits:**
- Leverage index selectivity
- Support full LINQ semantics
- Avoid building complex query execution engine
---
## 8. Source Generation Protocol
### 8.1 Mapper Generation
CBDD uses **Roslyn Source Generators** to produce zero-reflection mappers:
**Input:**
```csharp
public class User
{
public ObjectId Id { get; set; }
public string Name { get; set; }
public int Age { get; set; }
}
public partial class MyDbContext : DocumentDbContext
{
public DocumentCollection<ObjectId, User> Users { get; set; }
}
```
**Generated:**
```csharp
namespace MyApp.Mappers
{
public class UserMapper : ObjectIdMapperBase<User>
{
public override string CollectionName => "users";
public override int Serialize(User entity, BsonSpanWriter writer)
{
var start = writer.BeginDocument();
writer.WriteObjectId("_id", entity.Id);
writer.WriteString("name", entity.Name);
writer.WriteInt32("age", entity.Age);
writer.EndDocument(start);
return writer.Position;
}
public override User Deserialize(BsonSpanReader reader)
{
var user = new User();
reader.ReadDocumentSize();
while (reader.Remaining > 0)
{
var type = reader.ReadBsonType();
if (type == BsonType.EndOfDocument) break;
var name = reader.ReadElementHeader();
switch (name)
{
case "_id": user.Id = reader.ReadObjectId(); break;
case "name": user.Name = reader.ReadString(); break;
case "age": user.Age = reader.ReadInt32(); break;
default: reader.SkipValue(type); break;
}
}
return user;
}
}
}
```
### 8.2 Attribute Support
See **Data Annotations Support** section in README and related documentation.
Supported attributes:
- `[Table(Name, Schema)]` → Collection name mapping
- `[Column(Name, TypeName)]` → Field name and special type mapping
- `[Key]` → Primary key identification
- `[NotMapped]` → Exclusion from serialization
- `[Required]`, `[StringLength]`, `[Range]` → Validation
### 8.3 Code Generation Rules
1. **Lowercase policy:** BSON field names are lowercase by default
2. **Attribute override:** `[BsonProperty]`, `[JsonPropertyName]`, `[Column]` override default names
3. **Nested objects:** Recursively analyzed and mapped
4. **Collections:** Arrays and `IEnumerable<T>` mapped to BSON arrays
5. **Value types:** Primitives, enums, `DateTime`, `ObjectId`, `Guid` handled natively
---
## 9. Security Considerations
### 9.1 Data Integrity
- **Checksums:** CRC32 on page headers (planned: extend to full pages)
- **WAL:** Ensures consistency even on crash
- **Schema validation:** Prevents type mismatches
### 9.2 Concurrent Access
**Multi-Threading:****Fully Supported**
- **Thread-safe writes:** Multiple threads can write concurrently within the same process
- **Internal synchronization:** `SemaphoreSlim` for critical sections, `ConcurrentDictionary` for shared state
- **WAL coordination:** Commit lock ensures serializable WAL writes
- **Page cache:** Thread-safe access to memory-mapped pages
**Multi-Process:****Not Supported**
- **Exclusive file lock:** `FileShare.None` prevents multiple processes from opening the same database
- **Rationale:** Simplifies consistency guarantees, avoids complex inter-process coordination
- **Future:** May support via cooperative locking or server mode (HTTP API)
### 9.3 Injection Attacks
- **No SQL injection:** No query language, only type-safe LINQ
- **Schema-validated:** All operations type-checked at compile-time
---
## 10. Performance Considerations
### 10.1 Zero-Allocation Design
**Stack allocation:**
```csharp
Span<byte> buffer = stackalloc byte[16384]; // Page buffer on stack
var writer = new BsonSpanWriter(buffer, keyMap);
```
**No boxing:**
- Value types remain unboxed throughout serialization
- No reflection or dynamic invocation
**Pooling:**
```csharp
var buffer = ArrayPool<byte>.Shared.Rent(pageSize);
try { /* use buffer */ }
finally { ArrayPool<byte>.Shared.Return(buffer); }
```
### 10.2 Memory-Mapped Files
**Benefits:**
- OS kernel manages page cache
- Zero-copy reads (map file → process memory)
- Prefetching and read-ahead by OS
**Tradeoffs:**
- Limited to single process (file lock)
- Windows vs. Linux differences in `mmap` behavior
### 10.3 Cache Efficiency
**Compact C-BSON:**
- More documents per 16KB page
- Better CPU cache utilization
- Reduced TLB misses
---
## 11. Implementation Notes
### 11.1 .NET 10 Requirements
CBDD targets **.NET 10** to leverage:
- `Span<T>` and `ref struct` for zero-copy I/O
- Source Generators (Roslyn)
- `MemoryMarshal` for efficient struct serialization
- Improved JIT optimizations
**Future:** Evaluate `.NET Standard 2.1` for broader compatibility.
### 11.2 Platform Considerations
- **Windows:** Full support via MemoryMappedFile
- **Linux/macOS:** Supported, potential differences in `mmap` behavior
- **Endianness:** Little-endian assumed (matches x86/x64/ARM)
### 11.3 Future Compatibility
**Planned enhancements:**
- **Compression:** Page-level or document-level LZ4
- **Encryption:** AES-256 for data-at-rest
- **Multi-process:** Via cooperative locking or server mode
---
## 12. References
### 12.1 BSON and Formats
- [BSON Specification v1.1](http://bsonspec.org/)
- [MongoDB BSON Types](https://www.mongodb.com/docs/manual/reference/bson-types/)
### 12.2 Database Internals
- *Database Internals* by Alex Petrov (O'Reilly, 2019)
- [B+ Trees (Wikipedia)](https://en.wikipedia.org/wiki/B%2B_tree)
- [R-Trees (Guttman, 1984)](http://www-db.deis.unibo.it/courses/SI-LS/papers/Gut84.pdf)
### 12.3 Vector Search
- [HNSW Algorithm (Malkov & Yashunin, 2018)](https://arxiv.org/abs/1603.09320)
- [Faiss Library (Facebook AI)](https://github.com/facebookresearch/faiss)
### 12.4 Standards
- [RFC 2119: Key words for RFCs](https://www.ietf.org/rfc/rfc2119.txt)
- [IEEE 754: Floating Point Arithmetic](https://ieeexplore.ieee.org/document/8766229)
---
## Appendix A: Page Format Diagrams
### A.1 Slotted Page Visual
```
Offset: 0 16383
┌────────────────────────────────────────────────┐
24 │ [Header: 24 bytes] │
├────────────────────────────────────────────────┤
│ [Slot 0: Offset=16320, Len=64, Flags=0] │
│ [Slot 1: Offset=16256, Len=64, Flags=0] │
│ [Slot 2: Offset=16192, Len=64, Flags=0] │
56 │ ← FreeSpaceStart │
│ │
│ ~~~~~~~~~~~~~~~~ Free Space ~~~~~~~~~~~~~~~~~~~ │
│ │
16192 │ ← FreeSpaceEnd │
│ [Document 2 data: 64 bytes] │
│ [Document 1 data: 64 bytes] │
│ [Document 0 data: 64 bytes] │
16384 └────────────────────────────────────────────────┘
```
### A.2 B+Tree Node Visual
```
Internal Node:
┌──────────────────────────────────────────────────────┐
│ [PageHeader 32] [BTreeNodeHeader 20] │
├──────────────────────────────────────────────────────┤
│ Key1 | ChildPtr1 │
│ Key2 | ChildPtr2 │
│ Key3 | ChildPtr3 │
│ ... │
└──────────────────────────────────────────────────────┘
Leaf Node:
┌──────────────────────────────────────────────────────┐
│ [PageHeader 32] [BTreeNodeHeader 20] │
├──────────────────────────────────────────────────────┤
│ Key1 | DocumentLocation1 │
│ Key2 | DocumentLocation2 │
│ Key3 | DocumentLocation3 │
│ ... │
│ NextLeafPageId → [next leaf] │
└──────────────────────────────────────────────────────┘
```
---
## Appendix B: C-BSON Hex Dump
See `C-BSON.md`, Section "Hex Dump Examples" for detailed wire format examples.
---
## Appendix C: Transaction State Machine
```
┌─────────────┐
│ Idle │
└──────┬──────┘
│ BeginTransaction()
┌─────────────┐
│ Active │ ← Accumulate writes in memory
└──┬───────┬──┘
│ │ Rollback()
│ └────────────┐
│ Commit() │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Committing │ │ Aborted │
│ (WAL flush) │ └─────────────┘
└──────┬──────┘
│ Flush complete
┌─────────────┐
│ Committed │
└─────────────┘
```
---
**End of RFC-CBDD Specification**