
# C-BSON: Compressed BSON Format

## What is C-BSON?

**C-BSON** (Compressed BSON) is CBDD's optimized wire format that maintains full BSON type compatibility while achieving significant space savings through **field name compression**. This innovation reduces document size by 30-60% for typical schemas, improving both storage efficiency and I/O performance.

### The Problem with Standard BSON

Standard BSON stores field names as **null-terminated UTF-8 strings** in every document. Consider a typical user document:

```javascript
{
  "_id": ObjectId("..."),
  "email": "user@example.com",
  "created_at": ISODate("2026-02-12"),
  "last_login": ISODate("2026-02-12")
}
```

**Field Name Overhead:**
- `_id` → 4 bytes (3 chars + null terminator)
- `email` → 6 bytes
- `created_at` → 11 bytes
- `last_login` → 11 bytes

**Total overhead: 32 bytes** just for field names in a 4-field document.

### The C-BSON Solution: Key Compression

C-BSON replaces field names with **2-byte numeric IDs** via a schema-based dictionary:

```
Standard BSON:  [type][field_name\0][value]
C-BSON:         [type][field_id: ushort][value]
```

**Space Savings:**

| Field Name     | Standard BSON | C-BSON  | Savings |
|:---------------|:--------------|:--------|:--------|
| `_id`          | 4 bytes       | 2 bytes | 50%     |
| `email`        | 6 bytes       | 2 bytes | 67%     |
| `created_at`   | 11 bytes      | 2 bytes | 82%     |
| `last_login`   | 11 bytes      | 2 bytes | 82%     |

**Result:** The same 4-field document saves **24 bytes** per instance. For 1 million documents, that's **24 MB (~23 MiB) saved**.
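
The arithmetic above can be reproduced with a short sketch. `FieldNameOverhead` and its methods are illustrative names, not part of CBDD. The sketch assumes each standard BSON name costs its UTF-8 byte length plus one null terminator, and each C-BSON field ID costs 2 bytes.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration; not part of the CBDD API.
public static class FieldNameOverhead
{
    // Standard BSON: UTF-8 bytes of the name plus a 1-byte null terminator.
    public static int StandardBson(params string[] names) =>
        names.Sum(n => Encoding.UTF8.GetByteCount(n) + 1);

    // C-BSON: a fixed 2-byte field ID per element.
    public static int CBson(params string[] names) => names.Length * 2;
}
```

Calling both methods with `"_id", "email", "created_at", "last_login"` yields the per-document totals from the table; multiplying the difference by the document count estimates collection-level savings.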

---

## Wire Format Specification

### Document Structure

```
┌────────────────────────────────────────────────┐
│  [4 bytes] Document Size (int32 little-endian)│
├────────────────────────────────────────────────┤
│  [Elements...]                                  │
│    ┌──────────────────────────────────────┐   │
│    │ [1 byte]  Type Code                  │   │
│    │ [2 bytes] Field ID (ushort)          │   │
│    │ [N bytes] Value (type-dependent)     │   │
│    └──────────────────────────────────────┘   │
│  [Repeat for each field]                       │
├────────────────────────────────────────────────┤
│  [1 byte] End of Document (0x00)               │
└────────────────────────────────────────────────┘
```

### Element Header Comparison

**Standard BSON Element Header:**
```
[1 byte: type code][N bytes: null-terminated UTF-8 string]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    Variable length: min 2 bytes for a non-empty name, no max
```

**C-BSON Element Header:**
```
[1 byte: type code][2 bytes: field ID as ushort little-endian]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    Fixed length: exactly 2 bytes
```
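
To make the two header layouts concrete, the following sketch builds each one as a byte array. `ElementHeaderSketch` is a hypothetical helper written for this comparison only; it is not the CBDD writer.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// Hypothetical helper for illustration; not part of the CBDD API.
public static class ElementHeaderSketch
{
    // Standard BSON header: type code, then the name as a null-terminated UTF-8 string.
    public static byte[] Standard(byte typeCode, string name)
    {
        byte[] nameBytes = Encoding.UTF8.GetBytes(name);
        var header = new byte[1 + nameBytes.Length + 1]; // last byte stays 0x00
        header[0] = typeCode;
        nameBytes.CopyTo(header, 1);
        return header;
    }

    // C-BSON header: type code, then the field ID as a little-endian ushort.
    public static byte[] Compact(byte typeCode, ushort fieldId)
    {
        var header = new byte[3];
        header[0] = typeCode;
        BinaryPrimitives.WriteUInt16LittleEndian(header.AsSpan(1, 2), fieldId);
        return header;
    }
}
```

The standard header grows with the name, while the compact header is always 3 bytes regardless of which field it identifies.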

### Type Codes

C-BSON uses **standard BSON type codes** for full compatibility:

| Code | Type        | Description                          |
|:-----|:------------|:-------------------------------------|
| 0x01 | Double      | 64-bit IEEE 754 floating point       |
| 0x02 | String      | UTF-8 string (int32 length + data + null) |
| 0x03 | Document    | Embedded document                    |
| 0x04 | Array       | Embedded array                       |
| 0x05 | Binary      | Binary data (int32 length + subtype + data)|
| 0x07 | ObjectId    | 12-byte MongoDB-compatible ObjectId  |
| 0x08 | Boolean     | 1 byte (0x00 or 0x01)               |
| 0x09 | DateTime    | UTC milliseconds (int64)            |
| 0x10 | Int32       | 32-bit signed integer               |
| 0x12 | Int64       | 64-bit signed integer               |
| 0x13 | Decimal128  | 128-bit decimal (IEEE 754-2008)     |

---

## Schema-Based Key Mapping

### Bidirectional Dictionary

C-BSON requires a **schema-driven key mapping** maintained in memory:

**Writer Side:**
```csharp
ConcurrentDictionary<string, ushort> _keyMap;
// Example:
// "_id" → 1
// "email" → 2
// "created_at" → 3
```

**Reader Side:**
```csharp
ConcurrentDictionary<ushort, string> _keys;
// Example:
// 1 → "_id"
// 2 → "email"
// 3 → "created_at"
```
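
One way to keep the two dictionaries consistent is to assign IDs in a single place. The class below is a minimal sketch under that assumption; `KeyDictionarySketch`, `GetOrAdd`, and `GetName` are hypothetical names, and CBDD's actual type may differ.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical wrapper for illustration; CBDD's actual type may differ.
public sealed class KeyDictionarySketch
{
    private readonly ConcurrentDictionary<string, ushort> _keyMap = new();
    private readonly ConcurrentDictionary<ushort, string> _keys = new();
    private readonly object _gate = new();
    private int _nextId = 1; // ID 0 is reserved

    // Returns the existing ID for a name, or assigns the next free one.
    public ushort GetOrAdd(string name)
    {
        if (_keyMap.TryGetValue(name, out var existing))
            return existing;

        lock (_gate) // serialize writers so both maps stay consistent
        {
            if (_keyMap.TryGetValue(name, out existing))
                return existing;
            if (_nextId > ushort.MaxValue)
                throw new InvalidOperationException("Field ID space exhausted");

            var id = (ushort)_nextId++;
            _keys[id] = name;   // publish the reverse mapping first
            _keyMap[name] = id; // then make the name visible to writers
            return id;
        }
    }

    // Reverse lookup used on the reader side.
    public string GetName(ushort id) =>
        _keys.TryGetValue(id, out var name)
            ? name
            : throw new InvalidOperationException($"Field ID {id} not in schema");
}
```

Lookups stay lock-free on the hot path; the lock is only taken when a new field name is registered.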

### Schema Generation

CBDD automatically generates schemas from C# types using reflection:

```csharp
public class User
{
    public ObjectId Id { get; set; }
    public string Email { get; set; }
    public DateTime CreatedAt { get; set; }
}

// Generated schema:
// Field 1: "_id" (ObjectId)
// Field 2: "email" (String)
// Field 3: "created_at" (DateTime)
```
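
The naming rules are not spelled out here, but the example suggests snake_case field names with `Id` mapped to `_id`. The sketch below implements that inferred convention; `SchemaSketch`, `ToFieldName`, and `Generate` are hypothetical names, and the real generator may handle attributes, acronyms, and ordering differently.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Text;

// Hypothetical sketch; the naming rules are inferred from the example above.
public static class SchemaSketch
{
    // "CreatedAt" -> "created_at"; a property named "Id" maps to "_id".
    // Acronyms are not special-cased ("URL" would become "u_r_l").
    public static string ToFieldName(string propertyName)
    {
        if (propertyName == "Id") return "_id";

        var sb = new StringBuilder();
        for (int i = 0; i < propertyName.Length; i++)
        {
            char c = propertyName[i];
            if (char.IsUpper(c) && i > 0) sb.Append('_');
            sb.Append(char.ToLowerInvariant(c));
        }
        return sb.ToString();
    }

    // Assigns IDs starting at 1 (0 is reserved), one per public instance property.
    // Reflection does not guarantee declaration order, so a real implementation
    // would need a stable ordering before assigning IDs.
    public static Dictionary<string, ushort> Generate(Type type)
    {
        var map = new Dictionary<string, ushort>();
        ushort nextId = 1;
        foreach (PropertyInfo p in type.GetProperties(BindingFlags.Public | BindingFlags.Instance))
            map[ToFieldName(p.Name)] = nextId++;
        return map;
    }
}
```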

### Schema Storage

Schemas are stored in **Page 1 (Collection Metadata)** and loaded into memory when the database is opened:

```
┌─────────────────────────────────────────┐
│ [Schema Hash (long)]                    │
│ [Schema Version (int)]                  │
│ [Field Count (ushort)]                  │
├─────────────────────────────────────────┤
│ For each field:                         │
│   [Field ID (ushort)]                   │
│   [Field Name Length (byte)]            │
│   [Field Name UTF-8 bytes]              │
│   [BSON Type Code (byte)]               │
└─────────────────────────────────────────┘
```
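
The layout above can be round-tripped with `BinaryWriter` and `BinaryReader`. This is a sketch, not the CBDD page code: `SchemaField` and `SchemaCodecSketch` are hypothetical, and little-endian integers are an assumption, since the layout does not state the byte order. The 1-byte name length does imply a 255-byte limit per field name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical types for illustration; field order follows the layout above.
public readonly record struct SchemaField(ushort Id, string Name, byte TypeCode);

public static class SchemaCodecSketch
{
    public static byte[] Write(long hash, int version, IReadOnlyList<SchemaField> fields)
    {
        using var ms = new MemoryStream();
        using var w = new BinaryWriter(ms); // BinaryWriter emits little-endian integers
        w.Write(hash);
        w.Write(version);
        w.Write((ushort)fields.Count);
        foreach (var f in fields)
        {
            byte[] name = Encoding.UTF8.GetBytes(f.Name);
            if (name.Length > byte.MaxValue)
                throw new ArgumentException($"Field name '{f.Name}' exceeds 255 UTF-8 bytes");
            w.Write(f.Id);
            w.Write((byte)name.Length);
            w.Write(name);
            w.Write(f.TypeCode);
        }
        w.Flush();
        return ms.ToArray();
    }

    public static (long Hash, int Version, List<SchemaField> Fields) Read(byte[] data)
    {
        using var r = new BinaryReader(new MemoryStream(data));
        long hash = r.ReadInt64();
        int version = r.ReadInt32();
        int count = r.ReadUInt16();
        var fields = new List<SchemaField>(count);
        for (int i = 0; i < count; i++)
        {
            ushort id = r.ReadUInt16();
            byte len = r.ReadByte();
            string name = Encoding.UTF8.GetString(r.ReadBytes(len));
            fields.Add(new SchemaField(id, name, r.ReadByte()));
        }
        return (hash, version, fields);
    }
}
```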

---

## Implementation Details

### BsonSpanWriter (Serialization)

Zero-allocation writer using `Span<byte>`:

```csharp
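using System;
using System.Buffers.Binary;
using System.Collections.Concurrent;

public ref struct BsonSpanWriter
{
    private Span<byte> _buffer;
    private int _position;
    private readonly ConcurrentDictionary<string, ushort> _keyMap;

    public BsonSpanWriter(Span<byte> buffer, ConcurrentDictionary<string, ushort> keyMap)
    {
        _buffer = buffer;
        _position = 0;
        _keyMap = keyMap;
    }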

    public void WriteElementHeader(BsonType type, string name)
    {
        // Write type code
        _buffer[_position++] = (byte)type;

        // Lookup field ID in dictionary
        if (!_keyMap.TryGetValue(name, out var id))
            throw new InvalidOperationException($"Field '{name}' not in schema");

        // Write field ID (2 bytes, little-endian)
        BinaryPrimitives.WriteUInt16LittleEndian(_buffer.Slice(_position, 2), id);
        _position += 2;
    }
}
```

**Usage:**

```csharp
var keyMap = new ConcurrentDictionary<string, ushort>();
keyMap["_id"] = 1;
keyMap["name"] = 2;

Span<byte> buffer = stackalloc byte[1024];
var writer = new BsonSpanWriter(buffer, keyMap);

writer.WriteObjectId("_id", user.Id);
writer.WriteString("name", user.Name);
```

### BsonSpanReader (Deserialization)

Zero-allocation reader using `ReadOnlySpan<byte>`:

```csharp
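using System;
using System.Buffers.Binary;
using System.Collections.Concurrent;

public ref struct BsonSpanReader
{
    private ReadOnlySpan<byte> _buffer;
    private int _position;
    private readonly ConcurrentDictionary<ushort, string> _keys;

    public BsonSpanReader(ReadOnlySpan<byte> buffer, ConcurrentDictionary<ushort, string> keys)
    {
        _buffer = buffer;
        _position = 0;
        _keys = keys;
    }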

    public string ReadElementHeader()
    {
        // Read field ID (2 bytes, little-endian)
        var id = BinaryPrimitives.ReadUInt16LittleEndian(_buffer.Slice(_position, 2));
        _position += 2;

        // Reverse lookup in dictionary
        if (!_keys.TryGetValue(id, out var name))
            throw new InvalidOperationException($"Field ID {id} not in schema");

        return name;
    }
}
```

**Usage:**

```csharp
var keys = new ConcurrentDictionary<ushort, string>();
keys[1] = "_id";
keys[2] = "name";

var reader = new BsonSpanReader(bsonData, keys);
reader.ReadDocumentSize();

while (reader.Remaining > 0)
{
    var type = reader.ReadBsonType();
    if (type == BsonType.EndOfDocument) break;

    var fieldName = reader.ReadElementHeader(); // Returns "name" from ID
    // ... read value based on type
}
```

---

## Advanced Features

### Nested Documents

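Nested documents recursively use the same C-BSON format. Their field names are assigned IDs from the same schema-wide dictionary as the parent type, so IDs are unique across the whole hierarchy: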

```csharp
public class User
{
    public ObjectId Id { get; set; }
    public Address HomeAddress { get; set; } // Nested
}

public class Address
{
    public string Street { get; set; }
    public string City { get; set; }
}

// Schema:
// User fields: 1="_id", 2="home_address"
// Address fields: 3="street", 4="city"
```

**Wire format for nested document:**
```
[0x03: Document]["home_address": 2]
  [nested_doc_size: 4 bytes]
  [0x02: String]["street": 3][value]
  [0x02: String]["city": 4][value]
  [0x00: End]
```

### Arrays

Arrays use numeric indices as field names, still compressed to 2-byte IDs:

```csharp
public class User
{
    public string[] Tags { get; set; }
}

// Schema includes numeric keys:
// "0" → 5, "1" → 6, "2" → 7, ...
```

**Wire format:**
```
[0x04: Array]["tags": 2]
  [array_size: 4 bytes]
  [0x02: String]["0": 5]["design"]
  [0x02: String]["1": 6]["dotnet"]
  [0x00: End]
```

### Geospatial Coordinates

C-BSON supports zero-allocation coordinate tuples via `[Column(TypeName="geopoint")]`:

```csharp
[Column(TypeName = "geopoint")]
public (double Lat, double Lon) Location { get; set; }
```

**Wire format:**
```
[0x04: Array]["location": field_id]
  [array_size: 4 bytes]
  [0x01: Double]["0": coord_0_id][8 bytes: latitude]
  [0x01: Double]["1": coord_1_id][8 bytes: longitude]
  [0x00: End]
```

This maps directly to R-Tree index structures without deserialization overhead.
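
Because every part of this layout has a fixed width, the coordinates sit at constant offsets inside the array value. The sketch below encodes the array value and reads the doubles back by offset. `GeoPointSketch` is hypothetical, the index field IDs are passed in because their values are not fixed here, and the `BinaryPrimitives` double helpers assume .NET 5 or later.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration; not part of the CBDD API.
public static class GeoPointSketch
{
    // int32 size + two (type + field ID + double) elements + 0x00 terminator.
    public const int EncodedSize = 4 + 2 * (1 + 2 + 8) + 1;

    public static byte[] Encode(double lat, double lon, ushort index0Id, ushort index1Id)
    {
        var buf = new byte[EncodedSize];
        var span = buf.AsSpan();
        BinaryPrimitives.WriteInt32LittleEndian(span, EncodedSize);
        WriteDouble(span.Slice(4), index0Id, lat);
        WriteDouble(span.Slice(15), index1Id, lon);
        // The final byte is already 0x00 (end of array).
        return buf;
    }

    private static void WriteDouble(Span<byte> dest, ushort fieldId, double value)
    {
        dest[0] = 0x01; // Double type code
        BinaryPrimitives.WriteUInt16LittleEndian(dest.Slice(1), fieldId);
        BinaryPrimitives.WriteDoubleLittleEndian(dest.Slice(3), value);
    }

    // Reads both coordinates at fixed offsets without walking the elements.
    public static (double Lat, double Lon) Decode(ReadOnlySpan<byte> array) =>
        (BinaryPrimitives.ReadDoubleLittleEndian(array.Slice(7)),
         BinaryPrimitives.ReadDoubleLittleEndian(array.Slice(18)));
}
```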

---

## Performance Benefits

### Storage Efficiency

**Real-world example:** E-commerce product catalog

```csharp
public class Product
{
    public ObjectId Id { get; set; }              // "_id": 4 → 2 bytes
    public string Name { get; set; }             // "name": 5 → 2 bytes
    public decimal Price { get; set; }           // "price": 6 → 2 bytes
    public string Description { get; set; }      // "description": 12 → 2 bytes
    public string Category { get; set; }         // "category": 9 → 2 bytes
    public string[] Tags { get; set; }           // "tags": 5 → 2 bytes
    public DateTime CreatedAt { get; set; }      // "created_at": 11 → 2 bytes
    public DateTime UpdatedAt { get; set; }      // "updated_at": 11 → 2 bytes
}
```

**Field name overhead:**
- Standard BSON: 4+5+6+12+9+5+11+11 = **63 bytes**
- C-BSON: 2×8 = **16 bytes**
- **Savings: 47 bytes per document**

For 1 million products: **47 MB (~45 MiB) saved** in field names alone.

### CPU Cache Efficiency

Smaller documents mean:
- **More documents fit in L1/L2/L3 cache**
- **Fewer cache misses during sequential scans**
- **Better prefetching** for range queries

### I/O Reduction

**Disk I/O:**
- 16KB page holds **more documents** → fewer page reads
- **Faster bulk inserts** → less data to write
- **Faster bulk reads** → less data to transfer from disk

**Network (future):**
- Smaller wire transfer for client/server scenarios
- Better replication throughput

---

## Hex Dump Examples

### Example 1: Simple User Document

**C# Object:**
```csharp
var user = new User
{
    Id = new ObjectId("65d3c2a1f4b8e9a2c3d4e5f6"),
    Name = "Alice",
    Age = 30
};
```

**C-BSON Wire Format (hex):**
```
28 00 00 00                 // Document size: 40 bytes
07 01 00                    // ObjectId, field 1 (_id)
  65 d3 c2 a1 f4 b8 e9 a2   // ObjectId bytes (12 total)
  c3 d4 e5 f6
02 02 00                    // String, field 2 (name)
  06 00 00 00               // String length: 6
  41 6c 69 63 65 00         // "Alice\0"
10 03 00                    // Int32, field 3 (age)
  1e 00 00 00               // Value: 30
00                          // End of document
```

### Example 2: Standard BSON Comparison

**Same document in standard BSON:**
```
2f 00 00 00                 // Document size: 47 bytes (+7 bytes)
07 5f 69 64 00              // "_id\0" (4 bytes)
  65 d3 c2 a1 f4 b8 e9 a2
  c3 d4 e5 f6
02 6e 61 6d 65 00           // "name\0" (5 bytes)
  06 00 00 00
  41 6c 69 63 65 00
10 61 67 65 00              // "age\0" (4 bytes)
  1e 00 00 00
00
```

**Comparison:**
- Standard BSON: 47 bytes
- C-BSON: 40 bytes
- **Reduction: 7 bytes (~15% smaller)**

The gain is modest here because the three field names are short (13 bytes in total) and the values dominate the document; longer names or smaller values raise the percentage.
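
Byte counts in a comparison like this can be checked mechanically. The sketch below computes the encoded size of a flat document in both formats from each element's name and value length. `DocumentSizeSketch` is a hypothetical helper, not part of CBDD.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the CBDD API.
public static class DocumentSizeSketch
{
    // Each element is (field name, encoded value length in bytes).
    // Both formats add a 4-byte size prefix and a 1-byte 0x00 terminator.
    public static int StandardBson(IEnumerable<(string Name, int ValueLength)> elements)
    {
        int size = 4 + 1;
        foreach (var (name, valueLength) in elements)
            size += 1 + Encoding.UTF8.GetByteCount(name) + 1 + valueLength; // type + name + null + value
        return size;
    }

    public static int CBson(IEnumerable<(string Name, int ValueLength)> elements)
    {
        int size = 4 + 1;
        foreach (var (_, valueLength) in elements)
            size += 1 + 2 + valueLength; // type code + 2-byte field ID + value
        return size;
    }
}
```

For the user document, the value lengths are 12 (ObjectId), 10 (4-byte length prefix plus `"Alice\0"`), and 4 (Int32).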

### Example 3: Nested Document

**C# Object:**
```csharp
var user = new User
{
    Id = ObjectId.NewObjectId(),
    Address = new Address
    {
        Street = "123 Main St",
        City = "Springfield"
    }
};
```

**C-BSON Wire Format (partial, showing nested doc):**
```
... // document header
03 02 00                    // Document, field 2 (address)
  2b 00 00 00               // Nested doc size: 43 bytes
  02 03 00                  // String, field 3 (street)
    0c 00 00 00             // Length: 12
    31 32 33 20 4d 61 69 6e // "123 Main St\0"
    20 53 74 00
  02 04 00                  // String, field 4 (city)
    0c 00 00 00             // Length: 12
    53 70 72 69 6e 67 66 69 // "Springfield\0"
    65 6c 64 00
  00                        // End of nested doc
...
```

---

## Technical Constraints

### Field ID Space

- **Type:** `ushort` (16-bit unsigned integer)
- **Range:** 0 to 65,535
- **Theoretical max:** 65,535 unique field names per schema hierarchy
- **Practical limit:** ~1,000 fields for optimal performance
- **Reserved IDs:** 0 is reserved (not used)

### Dictionary Overhead

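**Memory footprint (rough estimate; excludes the field-name strings, and actual per-entry cost depends on the runtime):**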
- ~16 bytes per entry in `ConcurrentDictionary<string, ushort>`
- ~16 bytes per entry in `ConcurrentDictionary<ushort, string>`
- **Total:** ~32 bytes per unique field name

**Example:** A schema with 50 fields → **~1.6 KB** in-memory overhead (negligible).

### Schema Versioning

When a schema evolves (fields added/removed/renamed):

1. **New schema version** is created with incremented version number
2. **New field IDs** are assigned to new fields
3. **Old documents remain readable** with old schema
4. **Migration** can be applied lazily during read-modify-write cycles

**Schema hash** ensures consistency:
```csharp
long schemaHash = schema.GetHash(); // Hash of all field names and types
```
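
The hash algorithm is not specified here. As one possible approach, the sketch below folds field names and type codes into a 64-bit FNV-1a hash; the choice of FNV-1a and the `SchemaHashSketch` name are assumptions made purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch: FNV-1a is an assumption, the source does not name the hash.
public static class SchemaHashSketch
{
    private const ulong OffsetBasis = 14695981039346656037UL; // FNV-1a 64-bit offset basis
    private const ulong Prime = 1099511628211UL;              // FNV-1a 64-bit prime

    // Hashes field names and type codes in the order given.
    public static long Compute(IEnumerable<(string Name, byte TypeCode)> fields)
    {
        ulong hash = OffsetBasis;
        foreach (var (name, typeCode) in fields)
        {
            foreach (byte b in Encoding.UTF8.GetBytes(name))
                hash = Mix(hash, b);
            hash = Mix(hash, 0x00);     // name terminator avoids ambiguous concatenation
            hash = Mix(hash, typeCode);
        }
        return unchecked((long)hash);
    }

    private static ulong Mix(ulong hash, byte value) => unchecked((hash ^ value) * Prime);
}
```

The result depends on field order, so callers would need to enumerate fields in a stable order (for example, by field ID) for the hash to be comparable across runs.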

---

## Compatibility

### BSON Type Compatibility

C-BSON is **type-compatible** with standard BSON:
- ✅ Same type codes (0x01-0x13)
- ✅ Same value encoding (little-endian, IEEE 754, UTF-8)
- ✅ Same document structure (size prefix + elements + 0x00 terminator)
- ❌ **Different element header format** (field ID vs. field name)

### Migration from Standard BSON

**Strategy:**
1. Read standard BSON document
2. Extract field names and build schema
3. Assign field IDs based on schema
4. Re-serialize as C-BSON
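
The four steps can be sketched as a single pass over the standard BSON bytes. This is a minimal illustration, not CBDD's migration tool: `BsonMigrationSketch` is hypothetical, it handles only the type codes listed in this document, it assumes value encodings match standard BSON, and it assumes existing IDs in `keyMap` are contiguous from 1.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical converter for illustration; handles only the type codes listed in this document.
public static class BsonMigrationSketch
{
    // Converts one standard BSON document to C-BSON, assigning new IDs to unseen names.
    public static byte[] ToCBson(ReadOnlySpan<byte> bson, Dictionary<string, ushort> keyMap)
    {
        using var body = new MemoryStream();
        int pos = 4;                                   // skip the int32 size prefix
        while (bson[pos] != 0x00)                      // 0x00 marks end of document
        {
            byte type = bson[pos++];
            int nameLen = bson.Slice(pos).IndexOf((byte)0);
            string name = Encoding.UTF8.GetString(bson.Slice(pos, nameLen));
            pos += nameLen + 1;                        // name plus null terminator

            if (!keyMap.TryGetValue(name, out ushort id))
                keyMap[name] = id = checked((ushort)(keyMap.Count + 1)); // ID 0 is reserved

            ReadOnlySpan<byte> value = bson.Slice(pos, ValueLength(type, bson.Slice(pos)));
            pos += value.Length;

            body.WriteByte(type);
            body.WriteByte((byte)(id & 0xFF));         // little-endian field ID
            body.WriteByte((byte)(id >> 8));
            if (type is 0x03 or 0x04)
                body.Write(ToCBson(value, keyMap));    // embedded names are compressed too
            else
                body.Write(value);                     // scalar encodings are unchanged
        }

        var result = new byte[4 + body.Length + 1];    // size prefix + elements + terminator
        BinaryPrimitives.WriteInt32LittleEndian(result, result.Length);
        body.ToArray().CopyTo(result, 4);
        return result;
    }

    // Number of bytes occupied by a value of the given type starting at v[0].
    private static int ValueLength(byte type, ReadOnlySpan<byte> v) => type switch
    {
        0x01 or 0x09 or 0x12 => 8,
        0x02 => 4 + BinaryPrimitives.ReadInt32LittleEndian(v),
        0x03 or 0x04 => BinaryPrimitives.ReadInt32LittleEndian(v),
        0x05 => 5 + BinaryPrimitives.ReadInt32LittleEndian(v),
        0x07 => 12,
        0x08 => 1,
        0x10 => 4,
        0x13 => 16,
        _ => throw new NotSupportedException($"Unknown type code 0x{type:X2}")
    };
}
```

Passing the same `keyMap` across all documents in a collection lets the dictionary accumulate, so each field name receives one ID for the whole migration.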

**Future enhancement:** Hybrid reader capable of auto-detecting and reading both formats.

### Export to Standard BSON

For external tool compatibility (e.g., MongoDB Compass, Studio 3T):

```csharp
// Convert C-BSON → Standard BSON
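public byte[] ToStandardBson(byte[] cbson, BsonSchema schema)
{
    var reader = new BsonSpanReader(cbson, schema.GetReverseKeyMap());
    var writer = new StandardBsonWriter(); // Uses string field names

    reader.ReadDocumentSize(); // Size must be recomputed: names are longer than IDs

    // Copy document element-by-element
    while (reader.Remaining > 0)
    {
        var type = reader.ReadBsonType();
        if (type == BsonType.EndOfDocument) break;

        var fieldName = reader.ReadElementHeader(); // ID → Name
        writer.WriteElementHeader(type, fieldName);  // Write name directly
        // ... copy value
    }

    return writer.ToArray(); // Assumed accessor for the writer's output buffer
}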
```

---

## Schema Evolution Strategies

### Adding Fields

**Backward compatible:** New fields get new IDs, old documents remain valid.

```csharp
// Version 1: User schema
// 1: "_id", 2: "name", 3: "email"

// Version 2: Add "phone"
// 1: "_id", 2: "name", 3: "email", 4: "phone"
```

Old documents:
- Missing field 4 → treated as `null` or default value
- No re-serialization required

### Removing Fields

**Forward compatible:** Removed field IDs are marked as deprecated.

```csharp
// Version 3: Remove "email" (field 3)
// Mark field 3 as deprecated in schema
```

New code:
- Ignores field 3 during deserialization
- Old documents with field 3 remain valid (data is skipped)
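
Skipping a deprecated field requires knowing how many bytes its value occupies, which follows from the type code alone. The sketch below shows one way to compute that length; `ValueSkipSketch` is hypothetical, covers only the type codes listed in this document, and assumes value encodings match standard BSON.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration; covers only the type codes listed in this document.
public static class ValueSkipSketch
{
    // Returns the number of bytes occupied by a value that starts at value[0].
    public static int GetValueLength(byte typeCode, ReadOnlySpan<byte> value) => typeCode switch
    {
        0x01 => 8,                                                       // Double
        0x02 => 4 + BinaryPrimitives.ReadInt32LittleEndian(value),       // String: prefix + data incl. null
        0x03 or 0x04 => BinaryPrimitives.ReadInt32LittleEndian(value),   // Document/Array: size includes itself
        0x05 => 4 + 1 + BinaryPrimitives.ReadInt32LittleEndian(value),   // Binary: length + subtype + data
        0x07 => 12,                                                      // ObjectId
        0x08 => 1,                                                       // Boolean
        0x09 => 8,                                                       // DateTime
        0x10 => 4,                                                       // Int32
        0x12 => 8,                                                       // Int64
        0x13 => 16,                                                      // Decimal128
        _ => throw new NotSupportedException($"Unknown type code 0x{typeCode:X2}")
    };
}
```

A reader that encounters a deprecated field ID can advance its position by this length and continue with the next element.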

### Renaming Fields

**Breaking change:** Requires migration.

```csharp
// Version 4: Rename "phone" → "mobile_phone"

// Option 1: Lazy migration on read
if (doc.ContainsKey("phone"))
{
    doc["mobile_phone"] = doc["phone"];
    doc.Remove("phone");
    UpdateDocument(doc);
}

// Option 2: Batch migration script
foreach (var doc in collection.FindAll())
{
    if (doc.ContainsKey("phone"))
    {
        doc["mobile_phone"] = doc["phone"];
        doc.Remove("phone");
        collection.Update(doc);
    }
}
```

---

## Future Enhancements

### 1. Adaptive Key Width

Use **1 byte for field IDs** when the schema has fewer than 256 fields:

```
Small schema flag: [1 bit in document header]
If set: field IDs are 1 byte (0-255)
Else: field IDs are 2 bytes (0-65535)
```

**Potential savings:** Additional 1 byte per field for small schemas.

### 2. Delta Compression

Store only **changed fields** in updates:

```
┌──────────────────────────────────────┐
│ [Base Document ID]                   │
│ [Changed Field IDs bitmap]           │
│ [Changed Field Values]               │
└──────────────────────────────────────┘
```

### 3. Column-Oriented Storage

Separate storage for each field:

```
Field 1 file: [all _id values]
Field 2 file: [all name values]
Field 3 file: [all email values]
```

Benefits:
- **Faster analytics** (read only needed columns)
- **Better compression** (similar data together)
- **Efficient projections** (SELECT name, email FROM ...)

### 4. Hybrid Format Support

Reader auto-detects C-BSON vs. Standard BSON:

```csharp
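// Heuristic detection (there is no true magic byte, so this can misfire)
// firstElement[2] is the high byte of the field ID in C-BSON (0x00 for IDs < 256),
// but the second byte of the field name in standard BSON (printable ASCII is >= 0x20).
if (firstElement[2] < 0x20) // Likely field ID (< 8,192); also true for one-character names
    return ReadCBSON();
else
    return ReadStandardBSON();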
```

---

## Conclusion

C-BSON achieves **significant storage and performance improvements** while maintaining BSON's type system and flexibility:

- **30-60% smaller documents** via key compression
- **Zero-allocation** I/O with `Span<byte>`
- **Full BSON type compatibility**
- **Schema-based** for type safety and evolution

This format is the foundation of CBDD's high-performance embedded database engine, enabling millions of documents to fit in memory and cache while minimizing disk I/O.

---

## References

- [BSON Specification v1.1](http://bsonspec.org/)
- [MongoDB BSON Types](https://www.mongodb.com/docs/manual/reference/bson-types/)
- [IEEE 754 Floating Point Standard](https://standards.ieee.org/standard/754-2019.html)
- [UTF-8 Encoding (RFC 3629)](https://tools.ietf.org/html/rfc3629)