Initialize CBDD solution and add a .NET-focused gitignore for generated artifacts.

2026-02-20 12:54:07 -05:00
commit b8ed5ec500
214 changed files with 101452 additions and 0 deletions
--- a/C-BSON.md
+++ b/C-BSON.md
@@ -0,0 +1,679 @@
+# C-BSON: Compressed BSON Format
+
+## What is C-BSON?
+
+**C-BSON** (Compressed BSON) is CBDD's optimized wire format that maintains full BSON type compatibility while achieving significant space savings through **field name compression**. This innovation reduces document size by 30-60% for typical schemas, improving both storage efficiency and I/O performance.
+
+### The Problem with Standard BSON
+
+Standard BSON stores field names as **null-terminated UTF-8 strings** in every document. Consider a typical user document:
+
+```javascript
+{
+  "_id": ObjectId("..."),
+  "email": "user@example.com",
+  "created_at": ISODate("2026-02-12"),
+  "last_login": ISODate("2026-02-12")
+}
+```
+
+**Field Name Overhead:**
+- `_id` → 4 bytes (3 chars + null terminator)
+- `email` → 6 bytes
+- `created_at` → 11 bytes
+- `last_login` → 11 bytes
+
+**Total overhead: 32 bytes** just for field names in a 4-field document.
+
+### The C-BSON Solution: Key Compression
+
+C-BSON replaces field names with **2-byte numeric IDs** via a schema-based dictionary:
+
+```
+Standard BSON:  [type][field_name\0][value]
+C-BSON:         [type][field_id: ushort][value]
+```
+
+**Space Savings:**
+
+| Field Name     | Standard BSON | C-BSON  | Savings |
+|:---------------|:--------------|:--------|:--------|
+| `_id`          | 4 bytes       | 2 bytes | 50%     |
+| `email`        | 6 bytes       | 2 bytes | 67%     |
+| `created_at`   | 11 bytes      | 2 bytes | 82%     |
+| `last_login`   | 11 bytes      | 2 bytes | 82%     |
+
+**Result:** The same 4-field document saves **24 bytes** per instance. For 1 million documents, that's **~23 MB saved**.
+
+---
+
+## Wire Format Specification
+
+### Document Structure
+
+```
+┌────────────────────────────────────────────────┐
+│  [4 bytes] Document Size (int32 little-endian)│
+├────────────────────────────────────────────────┤
+│  [Elements...]                                  │
+│    ┌──────────────────────────────────────┐   │
+│    │ [1 byte]  Type Code                  │   │
+│    │ [2 bytes] Field ID (ushort)          │   │
+│    │ [N bytes] Value (type-dependent)     │   │
+│    └──────────────────────────────────────┘   │
+│  [Repeat for each field]                       │
+├────────────────────────────────────────────────┤
+│  [1 byte] End of Document (0x00)               │
+└────────────────────────────────────────────────┘
+```
+
+### Element Header Comparison
+
+**Standard BSON Element Header:**
+```
+[1 byte: type code][N bytes: null-terminated UTF-8 string]
+                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+                    Variable length: min 2 bytes, no max
+```
+
+**C-BSON Element Header:**
+```
+[1 byte: type code][2 bytes: field ID as ushort little-endian]
+                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+                    Fixed length: exactly 2 bytes
+```
+
+### Type Codes
+
+C-BSON uses **standard BSON type codes** for full compatibility:
+
+| Code | Type        | Description                          |
+|:-----|:------------|:-------------------------------------|
+| 0x01 | Double      | 64-bit IEEE 754 floating point       |
+| 0x02 | String      | UTF-8 string (int32 length + data + null) |
+| 0x03 | Document    | Embedded document                    |
+| 0x04 | Array       | Embedded array                       |
+| 0x05 | Binary      | Binary data (subtype + length + data)|
+| 0x07 | ObjectId    | 12-byte MongoDB-compatible ObjectId  |
+| 0x08 | Boolean     | 1 byte (0x00 or 0x01)               |
+| 0x09 | DateTime    | UTC milliseconds (int64)            |
+| 0x10 | Int32       | 32-bit signed integer               |
+| 0x12 | Int64       | 64-bit signed integer               |
+| 0x13 | Decimal128  | 128-bit decimal (IEEE 754-2008)     |
+
+---
+
+## Schema-Based Key Mapping
+
+### Bidirectional Dictionary
+
+C-BSON requires a **schema-driven key mapping** maintained in memory:
+
+**Writer Side:**
+```csharp
+ConcurrentDictionary<string, ushort> _keyMap;
+// Example:
+// "\_id" → 1
+// "email" → 2
+// "created_at" → 3
+```
+
+**Reader Side:**
+```csharp
+ConcurrentDictionary<ushort, string> _keys;
+// Example:
+// 1 → "\_id"
+// 2 → "email"
+// 3 → "created_at"
+```
+
+### Schema Generation
+
+CBDD automatically generates schemas from C# types using reflection:
+
+```csharp
+public class User
+{
+    public ObjectId Id { get; set; }
+    public string Email { get; set; }
+    public DateTime CreatedAt { get; set; }
+}
+
+// Generated schema:
+// Field 1: "_id" (ObjectId)
+// Field 2: "email" (String)
+// Field 3: "created_at" (DateTime)
+```
+
+### Schema Storage
+
+Schemas are stored in the **Page 1 (Collection Metadata)** and loaded into memory on database open:
+
+```
+┌─────────────────────────────────────────┐
+│ [Schema Hash (long)]                    │
+│ [Schema Version (int)]                  │
+│ [Field Count (ushort)]                  │
+├─────────────────────────────────────────┤
+│ For each field:                         │
+│   [Field ID (ushort)]                   │
+│   [Field Name Length (byte)]            │
+│   [Field Name UTF-8 bytes]              │
+│   [BSON Type Code (byte)]               │
+└─────────────────────────────────────────┘
+```
+
+---
+
+## Implementation Details
+
+### BsonSpanWriter (Serialization)
+
+Zero-allocation writer using `Span<byte>`:
+
+```csharp
+public ref struct BsonSpanWriter
+{
+    private Span<byte> _buffer;
+    private int _position;
+    private readonly ConcurrentDictionary<string, ushort> _keyMap;
+
+    public void WriteElementHeader(BsonType type, string name)
+    {
+        // Write type code
+        _buffer[_position++] = (byte)type;
+
+        // Lookup field ID in dictionary
+        if (!_keyMap.TryGetValue(name, out var id))
+            throw new InvalidOperationException($"Field '{name}' not in schema");
+
+        // Write field ID (2 bytes, little-endian)
+        BinaryPrimitives.WriteUInt16LittleEndian(_buffer.Slice(_position, 2), id);
+        _position += 2;
+    }
+}
+```
+
+**Usage:**
+
+```csharp
+var keyMap = new ConcurrentDictionary<string, ushort>();
+keyMap["_id"] = 1;
+keyMap["name"] = 2;
+
+Span<byte> buffer = stackalloc byte[1024];
+var writer = new BsonSpanWriter(buffer, keyMap);
+
+writer.WriteObjectId("_id", user.Id);
+writer.WriteString("name", user.Name);
+```
+
+### BsonSpanReader (Deserialization)
+
+Zero-allocation reader using `ReadOnlySpan<byte>`:
+
+```csharp
+public ref struct BsonSpanReader
+{
+    private ReadOnlySpan<byte> _buffer;
+    private int _position;
+    private readonly ConcurrentDictionary<ushort, string> _keys;
+
+    public string ReadElementHeader()
+    {
+        // Read field ID (2 bytes, little-endian)
+        var id = BinaryPrimitives.ReadUInt16LittleEndian(_buffer.Slice(_position, 2));
+        _position += 2;
+
+        // Reverse lookup in dictionary
+        if (!_keys.TryGetValue(id, out var name))
+            throw new InvalidOperationException($"Field ID {id} not in schema");
+
+        return name;
+    }
+}
+```
+
+**Usage:**
+
+```csharp
+var keys = new ConcurrentDictionary<ushort, string>();
+keys[1] = "_id";
+keys[2] = "name";
+
+var reader = new BsonSpanReader(bsonData, keys);
+reader.ReadDocumentSize();
+
+while (reader.Remaining > 0)
+{
+    var type = reader.ReadBsonType();
+    if (type == BsonType.EndOfDocument) break;
+
+    var fieldName = reader.ReadElementHeader(); // Returns "name" from ID
+    // ... read value based on type
+}
+```
+
+---
+
+## Advanced Features
+
+### Nested Documents
+
+Nested documents recursively use the same C-BSON format with their own field mappings:
+
+```csharp
+public class User
+{
+    public ObjectId Id { get; set; }
+    public Address HomeAddress { get; set; } // Nested
+}
+
+public class Address
+{
+    public string Street { get; set; }
+    public string City { get; set; }
+}
+
+// Schema:
+// User fields: 1="_id", 2="home_address"
+// Address fields: 3="street", 4="city"
+```
+
+**Wire format for nested document:**
+```
+[0x03: Document]["home_address": 2]
+  [nested_doc_size: 4 bytes]
+  [0x02: String]["street": 3][value]
+  [0x02: String]["city": 4][value]
+  [0x00: End]
+```
+
+### Arrays
+
+Arrays use numeric indices as field names, still compressed to 2-byte IDs:
+
+```csharp
+public class User
+{
+    public string[] Tags { get; set; }
+}
+
+// Schema includes numeric keys:
+// "0" → 5, "1" → 6, "2" → 7, ...
+```
+
+**Wire format:**
+```
+[0x04: Array]["tags": 2]
+  [array_size: 4 bytes]
+  [0x02: String]["0": 5]["design"]
+  [0x02: String]["1": 6]["dotnet"]
+  [0x00: End]
+```
+
+### Geospatial Coordinates
+
+C-BSON supports zero-allocation coordinate tuples via `[Column(TypeName="geopoint")]`:
+
+```csharp
+[Column(TypeName = "geopoint")]
+public (double Lat, double Lon) Location { get; set; }
+```
+
+**Wire format:**
+```
+[0x04: Array]["location": field_id]
+  [array_size: 4 bytes]
+  [0x01: Double]["0": coord_0_id][8 bytes: latitude]
+  [0x01: Double]["1": coord_1_id][8 bytes: longitude]
+  [0x00: End]
+```
+
+This maps directly to R-Tree index structures without deserialization overhead.
+
+---
+
+## Performance Benefits
+
+### Storage Efficiency
+
+**Real-world example:** E-commerce product catalog
+
+```csharp
+public class Product
+{
+    public ObjectId Id { get; set; }              // "_id": 4 → 2 bytes
+    public string Name { get; set; }             // "name": 5 → 2 bytes
+    public decimal Price { get; set; }           // "price": 6 → 2 bytes
+    public string Description { get; set; }      // "description": 12 → 2 bytes
+    public string Category { get; set; }         // "category": 9 → 2 bytes
+    public string[] Tags { get; set; }           // "tags": 5 → 2 bytes
+    public DateTime CreatedAt { get; set; }      // "created_at": 11 → 2 bytes
+    public DateTime UpdatedAt { get; set; }      // "updated_at": 11 → 2 bytes
+}
+```
+
+**Field name overhead:**
+- Standard BSON: 4+5+6+12+9+5+11+11 = **63 bytes**
+- C-BSON: 2×8 = **16 bytes**
+- **Savings: 47 bytes per document**
+
+For 1 million products: **~45 MB saved** in field names alone.
+
+### CPU Cache Efficiency
+
+Smaller documents mean:
+- **More documents fit in L1/L2/L3 cache**
+- **Fewer cache misses during sequential scans**
+- **Better prefetching** for range queries
+
+### I/O Reduction
+
+**Disk I/O:**
+- 16KB page holds **more documents** → fewer page reads
+- **Faster bulk inserts** → less data to write
+- **Faster bulk reads** → less data to transfer from disk
+
+**Network (future):**
+- Smaller wire transfer for client/server scenarios
+- Better replication throughput
+
+---
+
+## Hex Dump Examples
+
+### Example 1: Simple User Document
+
+**C# Object:**
+```csharp
+var user = new User
+{
+    Id = new ObjectId("65d3c2a1f4b8e9a2c3d4e5f6"),
+    Name = "Alice",
+    Age = 30
+};
+```
+
+**C-BSON Wire Format (hex):**
+```
+20 00 00 00                 // Document size: 32 bytes
+07 01 00                    // ObjectId, field 1 (_id)
+  65 d3 c2 a1 f4 b8 e9 a2   // ObjectId bytes (12 total)
+  c3 d4 e5 f6
+02 02 00                    // String, field 2 (name)
+  06 00 00 00               // String length: 6
+  41 6c 69 63 65 00         // "Alice\0"
+10 03 00                    // Int32, field 3 (age)
+  1e 00 00 00               // Value: 30
+00                          // End of document
+```
+
+### Example 2: Standard BSON Comparison
+
+**Same document in standard BSON:**
+```
+2d 00 00 00                 // Document size: 45 bytes (+13 bytes)
+07 5f 69 64 00              // "_id\0" (4 bytes)
+  65 d3 c2 a1 f4 b8 e9 a2
+  c3 d4 e5 f6
+02 6e 61 6d 65 00           // "name\0" (5 bytes)
+  06 00 00 00
+  41 6c 69 63 65 00
+10 61 67 65 00              // "age\0" (4 bytes)
+  1e 00 00 00
+00
+```
+
+**Comparison:**
+- Standard BSON: 45 bytes
+- C-BSON: 32 bytes
+- **Reduction: 28% smaller**
+
+### Example 3: Nested Document
+
+**C# Object:**
+```csharp
+var user = new User
+{
+    Id = ObjectId.NewObjectId(),
+    Address = new Address
+    {
+        Street = "123 Main St",
+        City = "Springfield"
+    }
+};
+```
+
+**C-BSON Wire Format (partial, showing nested doc):**
+```
+... // document header
+03 02 00                    // Document, field 2 (address)
+  23 00 00 00               // Nested doc size: 35 bytes
+  02 03 00                  // String, field 3 (street)
+    0c 00 00 00             // Length: 12
+    31 32 33 20 4d 61 69 6e // "123 Main St\0"
+    20 53 74 00
+  02 04 00                  // String, field 4 (city)
+    0c 00 00 00             // Length: 12
+    53 70 72 69 6e 67 66 69 // "Springfield\0"
+    65 6c 64 00
+  00                        // End of nested doc
+...
+```
+
+---
+
+## Technical Constraints
+
+### Field ID Space
+
+- **Type:** `ushort` (16-bit unsigned integer)
+- **Range:** 0 to 65,535
+- **Theoretical max:** 65,535 unique field names per schema hierarchy
+- **Practical limit:** ~1,000 fields for optimal performance
+- **Reserved IDs:** 0 is reserved (not used)
+
+### Dictionary Overhead
+
+**Memory footprint:**
+- ~16 bytes per entry in `ConcurrentDictionary<string, ushort>`
+- ~16 bytes per entry in `ConcurrentDictionary<ushort, string>`
+- **Total:** ~32 bytes per unique field name
+
+**Example:** A schema with 50 fields → **~1.6 KB** in-memory overhead (negligible).
+
+### Schema Versioning
+
+When a schema evolves (fields added/removed/renamed):
+
+1. **New schema version** is created with incremented version number
+2. **New field IDs** are assigned to new fields
+3. **Old documents remain readable** with old schema
+4. **Migration** can be applied lazily during read-modify-write cycles
+
+**Schema hash** ensures consistency:
+```csharp
+long schemaHash = schema.GetHash(); // Hash of all field names and types
+```
+
+---
+
+## Compatibility
+
+### BSON Type Compatibility
+
+C-BSON is **type-compatible** with standard BSON:
+- ✅ Same type codes (0x01-0x13)
+- ✅ Same value encoding (little-endian, IEEE 754, UTF-8)
+- ✅ Same document structure (size prefix + elements + 0x00 terminator)
+- ❌ **Different element header format** (field ID vs. field name)
+
+### Migration from Standard BSON
+
+**Strategy:**
+1. Read standard BSON document
+2. Extract field names and build schema
+3. Assign field IDs based on schema
+4. Re-serialize as C-BSON
+
+**Future enhancement:** Hybrid reader capable of auto-detecting and reading both formats.
+
+### Export to Standard BSON
+
+For external tool compatibility (e.g., MongoDB Compass, Studio 3T):
+
+```csharp
+// Convert C-BSON → Standard BSON
+public byte[] ToStandardBson(byte[] cbson, BsonSchema schema)
+{
+    var reader = new BsonSpanReader(cbson, schema.GetReverseKeyMap());
+    var writer = new StandardBsonWriter(); // Uses string field names
+    
+    // Copy document element-by-element
+    while (...)
+    {
+        var type = reader.ReadBsonType();
+        var fieldName = reader.ReadElementHeader(); // ID → Name
+        writer.WriteElementHeader(type, fieldName);  // Write name directly
+        // ... copy value
+    }
+}
+```
+
+---
+
+## Schema Evolution Strategies
+
+### Adding Fields
+
+**Backward compatible:** New fields get new IDs, old documents remain valid.
+
+```csharp
+// Version 1: User schema
+// 1: "_id", 2: "name", 3: "email"
+
+// Version 2: Add "phone"
+// 1: "_id", 2: "name", 3: "email", 4: "phone"
+```
+
+Old documents:
+- Missing field 4 → treated as `null` or default value
+- No re-serialization required
+
+### Removing Fields
+
+**Forward compatible:** Removed field IDs are marked as deprecated.
+
+```csharp
+// Version 3: Remove "email" (field 3)
+// Mark field 3 as deprecated in schema
+```
+
+New code:
+- Ignores field 3 during deserialization
+- Old documents with field 3 remain valid (data is skipped)
+
+### Renaming Fields
+
+**Breaking change:** Requires migration.
+
+```csharp
+// Version 4: Rename "phone" → "mobile_phone"
+
+// Option 1: Lazy migration on read
+if (doc.ContainsKey("phone"))
+{
+    doc["mobile_phone"] = doc["phone"];
+    doc.Remove("phone");
+    UpdateDocument(doc);
+}
+
+// Option 2: Batch migration script
+foreach (var doc in collection.FindAll())
+{
+    if (doc.ContainsKey("phone"))
+    {
+        doc["mobile_phone"] = doc["phone"];
+        doc.Remove("phone");
+        collection.Update(doc);
+    }
+}
+```
+
+---
+
+## Future Enhancements
+
+### 1. Adaptive Key Width
+
+Use **1 byte for field IDs** when schema has <256 fields:
+
+```
+Small schema flag: [1 bit in document header]
+If set: field IDs are 1 byte (0-255)
+Else: field IDs are 2 bytes (0-65535)
+```
+
+**Potential savings:** Additional 1 byte per field for small schemas.
+
+### 2. Delta Compression
+
+Store only **changed fields** in updates:
+
+```
+┌──────────────────────────────────────┐
+│ [Base Document ID]                   │
+│ [Changed Field IDs bitmap]           │
+│ [Changed Field Values]               │
+└──────────────────────────────────────┘
+```
+
+### 3. Column-Oriented Storage
+
+Separate storage for each field:
+
+```
+Field 1 file: [all _id values]
+Field 2 file: [all name values]
+Field 3 file: [all email values]
+```
+
+Benefits:
+- **Faster analytics** (read only needed columns)
+- **Better compression** (similar data together)
+- **Efficient projections** (SELECT name, email FROM ...)
+
+### 4. Hybrid Format Support
+
+Reader auto-detects C-BSON vs. Standard BSON:
+
+```csharp
+// Magic byte detection
+if (firstElement[2] < 0x7F) // Likely field ID (< 127)
+    return ReadCBSON();
+else
+    return ReadStandardBSON();
+```
+
+---
+
+## Conclusion
+
+C-BSON achieves **significant storage and performance improvements** while maintaining BSON's type system and flexibility:
+
+- **30-60% smaller documents** via key compression
+- **Zero-allocation** I/O with `Span<byte>`
+- **Full BSON type compatibility**
+- **Schema-based** for type safety and evolution
+
+This format is the foundation of CBDD's high-performance embedded database engine, enabling millions of documents to fit in memory and cache while minimizing disk I/O.
+
+---
+
+## References
+
+- [BSON Specification v1.1](http://bsonspec.org/)
+- [MongoDB BSON Types](https://www.mongodb.com/docs/manual/reference/bson-types/)
+- [IEEE 754 Floating Point Standard](https://standards.ieee.org/standard/754-2019.html)
+- [UTF-8 Encoding (RFC 3629)](https://tools.ietf.org/html/rfc3629)