Joseph Doherty b8ed5ec500 Initialize CBDD solution and add a .NET-focused gitignore for generated artifacts.

2026-02-20 12:54:07 -05:00

C-BSON: Compressed BSON Format

What is C-BSON?

C-BSON (Compressed BSON) is CBDD's optimized wire format. It keeps full BSON type compatibility and saves space by compressing field names. The size reduction is quoted as 30-60% for typical schemas; the actual figure depends on how much of each document is field names rather than values. Smaller documents improve both storage efficiency and I/O performance.

The Problem with Standard BSON

Standard BSON stores field names as null-terminated UTF-8 strings in every document. Consider a typical user document:

{
  "_id": ObjectId("..."),
  "email": "user@example.com",
  "created_at": ISODate("2026-02-12"),
  "last_login": ISODate("2026-02-12")
}

Field Name Overhead:

_id → 4 bytes (3 chars + null terminator)
email → 6 bytes
created_at → 11 bytes
last_login → 11 bytes

Total overhead: 32 bytes just for field names in a 4-field document.

The C-BSON Solution: Key Compression

C-BSON replaces field names with 2-byte numeric IDs via a schema-based dictionary:

Standard BSON:  [type][field_name\0][value]
C-BSON:         [type][field_id: ushort][value]

Space Savings:

Field Name	Standard BSON	C-BSON	Savings
`_id`	4 bytes	2 bytes	50%
`email`	6 bytes	2 bytes	67%
`created_at`	11 bytes	2 bytes	82%
`last_login`	11 bytes	2 bytes	82%

Result: The same 4-field document saves 24 bytes per instance (32 bytes of names become 8 bytes of IDs). For 1 million documents, that's 24 million bytes (~23 MiB) saved.
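
The arithmetic above can be reproduced with a small helper. This is a minimal sketch, not part of CBDD: the class and method names (FieldNameOverhead, StandardBytes, SavedPerDocument) are hypothetical, and it assumes field names are counted as UTF-8 bytes plus one null terminator.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper for estimating field-name overhead; not a CBDD API.
public static class FieldNameOverhead
{
    // C-BSON stores a fixed 2-byte field ID instead of the name.
    public const int CbsonBytes = 2;

    // Standard BSON stores each name as UTF-8 bytes plus a null terminator.
    public static int StandardBytes(string name) =>
        Encoding.UTF8.GetByteCount(name) + 1;

    // Bytes saved per document for a given set of field names.
    public static int SavedPerDocument(params string[] names) =>
        names.Sum(n => StandardBytes(n) - CbsonBytes);
}
```

Calling FieldNameOverhead.SavedPerDocument("_id", "email", "created_at", "last_login") returns 24, matching the table.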

Wire Format Specification

Document Structure

┌────────────────────────────────────────────────┐
│  [4 bytes] Document Size (int32 little-endian)│
├────────────────────────────────────────────────┤
│  [Elements...]                                  │
│    ┌──────────────────────────────────────┐   │
│    │ [1 byte]  Type Code                  │   │
│    │ [2 bytes] Field ID (ushort)          │   │
│    │ [N bytes] Value (type-dependent)     │   │
│    └──────────────────────────────────────┘   │
│  [Repeat for each field]                       │
├────────────────────────────────────────────────┤
│  [1 byte] End of Document (0x00)               │
└────────────────────────────────────────────────┘
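
To make the layout concrete, here is a minimal sketch that encodes a document containing a single Int32 field. The class and method names are hypothetical. It assumes the size prefix counts the whole document, including the prefix itself and the terminator, as in standard BSON.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical illustration of the document layout; not a CBDD API.
public static class CbsonLayoutSketch
{
    // Encodes { fieldId: value } as a C-BSON document with one Int32 element.
    public static byte[] EncodeSingleInt32(ushort fieldId, int value)
    {
        // 4 (size) + 1 (type) + 2 (field ID) + 4 (value) + 1 (terminator)
        const int size = 4 + 1 + 2 + 4 + 1;
        var buffer = new byte[size];
        Span<byte> span = buffer;

        BinaryPrimitives.WriteInt32LittleEndian(span.Slice(0, 4), size);     // document size
        span[4] = 0x10;                                                      // type code: Int32
        BinaryPrimitives.WriteUInt16LittleEndian(span.Slice(5, 2), fieldId); // field ID
        BinaryPrimitives.WriteInt32LittleEndian(span.Slice(7, 4), value);    // value
        span[11] = 0x00;                                                     // end of document
        return buffer;
    }
}
```

Under these assumptions, EncodeSingleInt32(3, 30) yields the 12 bytes 0C 00 00 00 10 03 00 1E 00 00 00 00.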

Element Header Comparison

Standard BSON Element Header:

[1 byte: type code][N bytes: null-terminated UTF-8 string]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    Variable length: min 2 bytes for a non-empty name, no max

C-BSON Element Header:

[1 byte: type code][2 bytes: field ID as ushort little-endian]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    Fixed length: exactly 2 bytes

Type Codes

C-BSON uses standard BSON type codes for full compatibility:

Code	Type	Description
0x01	Double	64-bit IEEE 754 floating point
0x02	String	UTF-8 string (int32 length + data + null)
0x03	Document	Embedded document
0x04	Array	Embedded array
0x05	Binary	Binary data (int32 length + subtype + data)
0x07	ObjectId	12-byte MongoDB-compatible ObjectId
0x08	Boolean	1 byte (0x00 or 0x01)
0x09	DateTime	UTC milliseconds (int64)
0x10	Int32	32-bit signed integer
0x12	Int64	64-bit signed integer
0x13	Decimal128	128-bit decimal (IEEE 754-2008)

Schema-Based Key Mapping

Bidirectional Dictionary

C-BSON requires a schema-driven key mapping maintained in memory:

Writer Side:

ConcurrentDictionary<string, ushort> _keyMap;
// Example:
// "_id" → 1
// "email" → 2
// "created_at" → 3

Reader Side:

ConcurrentDictionary<ushort, string> _keys;
// Example:
// 1 → "_id"
// 2 → "email"
// 3 → "created_at"
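
One way to keep the two dictionaries in step is to wrap them in a single type that assigns IDs. The following is a sketch under assumptions: the KeyMapSketch class is hypothetical, IDs are handed out sequentially from 1 (0 is treated as reserved), and a lock guards assignment so both maps are updated together.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical wrapper around the two maps; not a CBDD API.
public sealed class KeyMapSketch
{
    private readonly ConcurrentDictionary<string, ushort> _keyMap =
        new ConcurrentDictionary<string, ushort>();
    private readonly ConcurrentDictionary<ushort, string> _keys =
        new ConcurrentDictionary<ushort, string>();
    private readonly object _assignLock = new object();
    private ushort _nextId = 1; // 0 is treated as reserved

    // Returns the existing ID for a name, or assigns the next free one.
    public ushort GetOrAdd(string name)
    {
        if (_keyMap.TryGetValue(name, out var existing)) return existing;
        lock (_assignLock)
        {
            if (_keyMap.TryGetValue(name, out existing)) return existing;
            if (_nextId == 0)
                throw new InvalidOperationException("Field ID space exhausted");

            ushort id = _nextId;
            _nextId = unchecked((ushort)(_nextId + 1)); // wraps to 0 after 65,535
            _keys[id] = name;     // reader-side entry first,
            _keyMap[name] = id;   // then publish the writer-side entry
            return id;
        }
    }

    // Reverse lookup used on the reader side.
    public string GetName(ushort id) =>
        _keys.TryGetValue(id, out var name)
            ? name
            : throw new InvalidOperationException($"Field ID {id} not in schema");
}
```

Lookups stay lock-free on the hot path; only the first sighting of a new name takes the lock.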

Schema Generation

CBDD automatically generates schemas from C# types using reflection:

public class User
{
    public ObjectId Id { get; set; }
    public string Email { get; set; }
    public DateTime CreatedAt { get; set; }
}

// Generated schema:
// Field 1: "_id" (ObjectId)
// Field 2: "email" (String)
// Field 3: "created_at" (DateTime)
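
The generator itself is not shown here, so the following is only a sketch of how reflection-based ID assignment could work. The SchemaSketch class is hypothetical. The naming rule (PascalCase to snake_case, with Id mapped to _id) is inferred from the example above, and ordering by MetadataToken is a common approximation of declaration order rather than a runtime guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Text;

// Hypothetical schema generator; not a CBDD API.
public static class SchemaSketch
{
    // Converts "CreatedAt" to "created_at"; "Id" is special-cased to "_id".
    // Naive rule: every upper-case letter after the first starts a new word.
    public static string ToFieldName(string propertyName)
    {
        if (propertyName == "Id") return "_id";
        var sb = new StringBuilder();
        for (int i = 0; i < propertyName.Length; i++)
        {
            char c = propertyName[i];
            if (char.IsUpper(c) && i > 0) sb.Append('_');
            sb.Append(char.ToLowerInvariant(c));
        }
        return sb.ToString();
    }

    // Assigns sequential IDs (starting at 1) to the public instance properties.
    public static Dictionary<string, ushort> Generate(Type type)
    {
        var map = new Dictionary<string, ushort>();
        ushort nextId = 1;
        var properties = type
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .OrderBy(p => p.MetadataToken); // approximates declaration order
        foreach (PropertyInfo p in properties)
            map[ToFieldName(p.Name)] = nextId++;
        return map;
    }
}
```

Because reflection order is not contractually stable, a real implementation would likely persist the assigned IDs rather than regenerate them on every start.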

Schema Storage

Schemas are stored in Page 1 (Collection Metadata) and loaded into memory when the database is opened:

┌─────────────────────────────────────────┐
│ [Schema Hash (long)]                    │
│ [Schema Version (int)]                  │
│ [Field Count (ushort)]                  │
├─────────────────────────────────────────┤
│ For each field:                         │
│   [Field ID (ushort)]                   │
│   [Field Name Length (byte)]            │
│   [Field Name UTF-8 bytes]              │
│   [BSON Type Code (byte)]               │
└─────────────────────────────────────────┘
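
The layout above can be written out with BinaryWriter, which emits primitives in little-endian order. This is a sketch only: the SchemaBlockSketch class is hypothetical, and little-endian byte order for the header fields is an assumption, since the diagram does not state it. The 1-byte length field implies names of at most 255 UTF-8 bytes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical serializer for the schema block; not a CBDD API.
public static class SchemaBlockSketch
{
    public static byte[] Write(
        long hash,
        int version,
        IReadOnlyList<(ushort Id, string Name, byte TypeCode)> fields)
    {
        using var stream = new MemoryStream();
        using var writer = new BinaryWriter(stream);

        writer.Write(hash);                  // [Schema Hash (long)]
        writer.Write(version);               // [Schema Version (int)]
        writer.Write((ushort)fields.Count);  // [Field Count (ushort)]

        foreach (var (id, name, typeCode) in fields)
        {
            byte[] nameBytes = Encoding.UTF8.GetBytes(name);
            if (nameBytes.Length > byte.MaxValue)
                throw new ArgumentException($"Field name '{name}' exceeds 255 UTF-8 bytes");

            writer.Write(id);                      // [Field ID (ushort)]
            writer.Write((byte)nameBytes.Length);  // [Field Name Length (byte)]
            writer.Write(nameBytes);               // [Field Name UTF-8 bytes]
            writer.Write(typeCode);                // [BSON Type Code (byte)]
        }

        writer.Flush();
        return stream.ToArray();
    }
}
```

A schema with the single field (1, "_id", 0x07) serializes to 21 bytes: 14 bytes of header plus 7 bytes for the field entry.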

Implementation Details

BsonSpanWriter (Serialization)

Zero-allocation writer using Span<byte>:

using System;
using System.Buffers.Binary;
using System.Collections.Concurrent;

public ref struct BsonSpanWriter
{
    private Span<byte> _buffer;
    private int _position;
    private readonly ConcurrentDictionary<string, ushort> _keyMap;

    public BsonSpanWriter(Span<byte> buffer, ConcurrentDictionary<string, ushort> keyMap)
    {
        _buffer = buffer;
        _position = 0;
        _keyMap = keyMap;
    }

    public void WriteElementHeader(BsonType type, string name)
    {
        // Write type code
        _buffer[_position++] = (byte)type;

        // Lookup field ID in dictionary
        if (!_keyMap.TryGetValue(name, out var id))
            throw new InvalidOperationException($"Field '{name}' not in schema");

        // Write field ID (2 bytes, little-endian)
        BinaryPrimitives.WriteUInt16LittleEndian(_buffer.Slice(_position, 2), id);
        _position += 2;
    }
}

Usage:

var keyMap = new ConcurrentDictionary<string, ushort>();
keyMap["_id"] = 1;
keyMap["name"] = 2;

Span<byte> buffer = stackalloc byte[1024];
var writer = new BsonSpanWriter(buffer, keyMap);

writer.WriteObjectId("_id", user.Id);
writer.WriteString("name", user.Name);

BsonSpanReader (Deserialization)

Zero-allocation reader using ReadOnlySpan<byte>:

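using System;
using System.Buffers.Binary;
using System.Collections.Concurrent;

public ref struct BsonSpanReader
{
    private ReadOnlySpan<byte> _buffer;
    private int _position;
    private readonly ConcurrentDictionary<ushort, string> _keys;

    public BsonSpanReader(ReadOnlySpan<byte> buffer, ConcurrentDictionary<ushort, string> keys)
    {
        _buffer = buffer;
        _position = 0;
        _keys = keys;
    }

    // Call after ReadBsonType(): the type code precedes the field ID on the wire.
    public string ReadElementHeader()
    {
        // Read field ID (2 bytes, little-endian)
        var id = BinaryPrimitives.ReadUInt16LittleEndian(_buffer.Slice(_position, 2));
        _position += 2;

        // Reverse lookup in dictionary
        if (!_keys.TryGetValue(id, out var name))
            throw new InvalidOperationException($"Field ID {id} not in schema");

        return name;
    }
}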

Usage:

var keys = new ConcurrentDictionary<ushort, string>();
keys[1] = "_id";
keys[2] = "name";

var reader = new BsonSpanReader(bsonData, keys);
reader.ReadDocumentSize();

while (reader.Remaining > 0)
{
    var type = reader.ReadBsonType();
    if (type == BsonType.EndOfDocument) break;

    var fieldName = reader.ReadElementHeader(); // Returns "name" from ID
    // ... read value based on type
}
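
The "read value based on type" step needs the encoded length of each value. As a sketch, the walker below lists the field IDs in a C-BSON document by skipping over values. The CbsonWalker class is hypothetical, and the value lengths assume standard BSON value encoding for the type codes listed earlier, including int32 length followed by a subtype byte for Binary.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical document walker; not a CBDD API.
public static class CbsonWalker
{
    // Encoded length of a value of the given type that starts at data[0].
    public static int ValueLength(byte typeCode, ReadOnlySpan<byte> data)
    {
        switch (typeCode)
        {
            case 0x01: return 8;   // Double
            case 0x02: return 4 + BinaryPrimitives.ReadInt32LittleEndian(data); // String: prefix + bytes incl. null
            case 0x03:
            case 0x04: return BinaryPrimitives.ReadInt32LittleEndian(data);     // Document/Array: size includes itself
            case 0x05: return 4 + 1 + BinaryPrimitives.ReadInt32LittleEndian(data); // Binary: length + subtype + data
            case 0x07: return 12;  // ObjectId
            case 0x08: return 1;   // Boolean
            case 0x09: return 8;   // DateTime
            case 0x10: return 4;   // Int32
            case 0x12: return 8;   // Int64
            case 0x13: return 16;  // Decimal128
            default:
                throw new NotSupportedException($"Unknown type code 0x{typeCode:X2}");
        }
    }

    // Returns the top-level field IDs in document order.
    public static List<ushort> ListFieldIds(ReadOnlySpan<byte> doc)
    {
        var ids = new List<ushort>();
        int size = BinaryPrimitives.ReadInt32LittleEndian(doc);
        int pos = 4;
        while (pos < size)
        {
            byte type = doc[pos++];
            if (type == 0x00) break; // end of document

            ids.Add(BinaryPrimitives.ReadUInt16LittleEndian(doc.Slice(pos, 2)));
            pos += 2;
            pos += ValueLength(type, doc.Slice(pos));
        }
        return ids;
    }
}
```

Because every header is exactly 3 bytes, skipping an unwanted field never requires a dictionary lookup or a scan for a null terminator.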

Advanced Features

Nested Documents

Nested documents use the same C-BSON format recursively. In the example below, the nested type's fields draw their IDs from the same ID space as the parent's fields:

public class User
{
    public ObjectId Id { get; set; }
    public Address HomeAddress { get; set; } // Nested
}

public class Address
{
    public string Street { get; set; }
    public string City { get; set; }
}

// Schema:
// User fields: 1="_id", 2="home_address"
// Address fields: 3="street", 4="city"

Wire format for nested document:

[0x03: Document]["home_address": 2]
  [nested_doc_size: 4 bytes]
  [0x02: String]["street": 3][value]
  [0x02: String]["city": 4][value]
  [0x00: End]

Arrays

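Arrays use numeric indices as field names, and these are also mapped to 2-byte IDs. For single-digit indices this is the same size as standard BSON ("0" plus a null terminator is 2 bytes); the saving starts at index 10: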

public class User
{
    public string[] Tags { get; set; }
}

// Schema includes numeric keys:
// "0" → 5, "1" → 6, "2" → 7, ...

Wire format:

[0x04: Array]["tags": 2]
  [array_size: 4 bytes]
  [0x02: String]["0": 5]["design"]
  [0x02: String]["1": 6]["dotnet"]
  [0x00: End]

Geospatial Coordinates

C-BSON supports zero-allocation coordinate tuples via [Column(TypeName="geopoint")]:

[Column(TypeName = "geopoint")]
public (double Lat, double Lon) Location { get; set; }

Wire format:

[0x04: Array]["location": field_id]
  [array_size: 4 bytes]
  [0x01: Double]["0": coord_0_id][8 bytes: latitude]
  [0x01: Double]["1": coord_1_id][8 bytes: longitude]
  [0x00: End]

This maps directly to R-Tree index structures without deserialization overhead.
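
Since both headers and both values have fixed widths, the coordinates sit at fixed offsets inside the array value. The sketch below illustrates that point; the GeoPointSketch class is hypothetical, and the two index field IDs are passed in because the document does not fix their values.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical fixed-offset geopoint codec; not a CBDD API.
public static class GeoPointSketch
{
    // 4 (size) + 2 x (1 type + 2 field ID + 8 double) + 1 (terminator)
    public const int EncodedSize = 27;
    private const int LatOffset = 4 + 1 + 2;             // 7
    private const int LonOffset = LatOffset + 8 + 1 + 2; // 18

    public static void Write(Span<byte> dest, ushort id0, ushort id1, double lat, double lon)
    {
        BinaryPrimitives.WriteInt32LittleEndian(dest.Slice(0, 4), EncodedSize);
        dest[4] = 0x01; // Double
        BinaryPrimitives.WriteUInt16LittleEndian(dest.Slice(5, 2), id0);
        BinaryPrimitives.WriteInt64LittleEndian(dest.Slice(LatOffset, 8), BitConverter.DoubleToInt64Bits(lat));
        dest[15] = 0x01; // Double
        BinaryPrimitives.WriteUInt16LittleEndian(dest.Slice(16, 2), id1);
        BinaryPrimitives.WriteInt64LittleEndian(dest.Slice(LonOffset, 8), BitConverter.DoubleToInt64Bits(lon));
        dest[26] = 0x00; // end of array
    }

    // Reads both coordinates without walking the element headers.
    public static (double Lat, double Lon) Read(ReadOnlySpan<byte> src) =>
        (BitConverter.Int64BitsToDouble(BinaryPrimitives.ReadInt64LittleEndian(src.Slice(LatOffset, 8))),
         BitConverter.Int64BitsToDouble(BinaryPrimitives.ReadInt64LittleEndian(src.Slice(LonOffset, 8))));
}
```

An index structure could call Read on the raw page bytes and never materialize the tuple, which is one plausible reading of "without deserialization overhead".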

Performance Benefits

Storage Efficiency

Real-world example: E-commerce product catalog

public class Product
{
    public ObjectId Id { get; set; }              // "_id": 4 → 2 bytes
    public string Name { get; set; }             // "name": 5 → 2 bytes
    public decimal Price { get; set; }           // "price": 6 → 2 bytes
    public string Description { get; set; }      // "description": 12 → 2 bytes
    public string Category { get; set; }         // "category": 9 → 2 bytes
    public string[] Tags { get; set; }           // "tags": 5 → 2 bytes
    public DateTime CreatedAt { get; set; }      // "created_at": 11 → 2 bytes
    public DateTime UpdatedAt { get; set; }      // "updated_at": 11 → 2 bytes
}

Field name overhead:

Standard BSON: 4+5+6+12+9+5+11+11 = 63 bytes
C-BSON: 2×8 = 16 bytes
Savings: 47 bytes per document

For 1 million products: 47 million bytes (~45 MiB) saved in field names alone.

CPU Cache Efficiency

Smaller documents mean:

More documents fit in L1/L2/L3 cache
Fewer cache misses during sequential scans
Better prefetching for range queries

I/O Reduction

Disk I/O:

A 16 KB page holds more documents → fewer page reads
Faster bulk inserts → less data to write
Faster bulk reads → less data to transfer from disk

Network (future):

Smaller wire transfer for client/server scenarios
Better replication throughput

Hex Dump Examples

Example 1: Simple User Document

C# Object:

var user = new User
{
    Id = new ObjectId("65d3c2a1f4b8e9a2c3d4e5f6"),
    Name = "Alice",
    Age = 30
};

C-BSON Wire Format (hex):

28 00 00 00                 // Document size: 40 bytes
07 01 00                    // ObjectId, field 1 (_id)
  65 d3 c2 a1 f4 b8 e9 a2   // ObjectId bytes (12 total)
  c3 d4 e5 f6
02 02 00                    // String, field 2 (name)
  06 00 00 00               // String length: 6
  41 6c 69 63 65 00         // "Alice\0"
10 03 00                    // Int32, field 3 (age)
  1e 00 00 00               // Value: 30
00                          // End of document

Example 2: Standard BSON Comparison

Same document in standard BSON:

2f 00 00 00                 // Document size: 47 bytes (+7 bytes)
07 5f 69 64 00              // "_id\0" (4 bytes)
  65 d3 c2 a1 f4 b8 e9 a2
  c3 d4 e5 f6
02 6e 61 6d 65 00           // "name\0" (5 bytes)
  06 00 00 00
  41 6c 69 63 65 00
10 61 67 65 00              // "age\0" (4 bytes)
  1e 00 00 00
00

Comparison:

Standard BSON: 47 bytes
C-BSON: 40 bytes
Reduction: 15% smaller

The saving is modest here because the three field names are short: 13 bytes of names become 6 bytes of IDs.

Example 3: Nested Document

C# Object:

var user = new User
{
    Id = ObjectId.NewObjectId(),
    Address = new Address
    {
        Street = "123 Main St",
        City = "Springfield"
    }
};

C-BSON Wire Format (partial, showing nested doc):

... // document header
03 02 00                    // Document, field 2 (address)
  2b 00 00 00               // Nested doc size: 43 bytes
  02 03 00                  // String, field 3 (street)
    0c 00 00 00             // Length: 12
    31 32 33 20 4d 61 69 6e // "123 Main St\0"
    20 53 74 00
  02 04 00                  // String, field 4 (city)
    0c 00 00 00             // Length: 12
    53 70 72 69 6e 67 66 69 // "Springfield\0"
    65 6c 64 00
  00                        // End of nested doc
...

Technical Constraints

Field ID Space

Type: ushort (16-bit unsigned integer)
Range: 0 to 65,535
Theoretical max: 65,535 unique field names per schema hierarchy
Practical limit: ~1,000 fields for optimal performance
Reserved IDs: 0 is reserved (not used)

Dictionary Overhead

Memory footprint:

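~16 bytes per entry in ConcurrentDictionary<string, ushort> (rough estimate)
~16 bytes per entry in ConcurrentDictionary<ushort, string> (rough estimate)
Total: ~32 bytes per unique field name, not counting the string objects and per-node object headers, which add more in practice

Example: A schema with 50 fields → ~1.6 KB of in-memory overhead by this estimate. Even at several times that figure, the cost is negligible.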

Schema Versioning

When a schema evolves (fields added/removed/renamed):

New schema version is created with incremented version number
New field IDs are assigned to new fields
Old documents remain readable with old schema
Migration can be applied lazily during read-modify-write cycles

Schema hash ensures consistency:

long schemaHash = schema.GetHash(); // Hash of all field names and types
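
The algorithm behind GetHash is not specified here. As an illustration only, the sketch below computes a deterministic 64-bit hash over field names and type codes using FNV-1a, which is a stand-in choice and not necessarily what CBDD uses. The SchemaHashSketch class is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical schema hash using FNV-1a (64-bit); not a CBDD API.
public static class SchemaHashSketch
{
    private const ulong FnvOffsetBasis = 0xCBF29CE484222325UL;
    private const ulong FnvPrime = 0x100000001B3UL;

    // Hashes fields in the order given, so field order affects the result.
    public static long Compute(IEnumerable<(string Name, byte TypeCode)> fields)
    {
        ulong hash = FnvOffsetBasis;
        foreach (var (name, typeCode) in fields)
        {
            foreach (byte b in Encoding.UTF8.GetBytes(name))
                hash = unchecked((hash ^ b) * FnvPrime);

            // Zero byte separates the name from what follows, so that
            // adjacent names cannot run together.
            hash = unchecked((hash ^ 0UL) * FnvPrime);
            hash = unchecked((hash ^ typeCode) * FnvPrime);
        }
        return unchecked((long)hash);
    }
}
```

Unlike string.GetHashCode, which is randomized per process in modern .NET, a hash like this is stable across runs and can therefore be persisted alongside the schema.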

Compatibility

BSON Type Compatibility

C-BSON is type-compatible with standard BSON:

✅ Same type codes (0x01-0x13)
✅ Same value encoding (little-endian, IEEE 754, UTF-8)
✅ Same document structure (size prefix + elements + 0x00 terminator)
❌ Different element header format (field ID vs. field name)

Migration from Standard BSON

Strategy:

Read standard BSON document
Extract field names and build schema
Assign field IDs based on schema
Re-serialize as C-BSON

Future enhancement: Hybrid reader capable of auto-detecting and reading both formats.

Export to Standard BSON

For external tool compatibility (e.g., MongoDB Compass, Studio 3T):

// Convert C-BSON → Standard BSON
public byte[] ToStandardBson(byte[] cbson, BsonSchema schema)
{
    var reader = new BsonSpanReader(cbson, schema.GetReverseKeyMap());
    var writer = new StandardBsonWriter(); // Uses string field names

    reader.ReadDocumentSize();

    // Copy document element-by-element
    while (reader.Remaining > 0)
    {
        var type = reader.ReadBsonType();
        if (type == BsonType.EndOfDocument) break;  // Terminator has no header or value

        var fieldName = reader.ReadElementHeader(); // ID → Name
        writer.WriteElementHeader(type, fieldName);  // Write name directly
        // ... copy value
    }

    // ... finalize the size prefix and return the bytes produced by writer
}

Schema Evolution Strategies

Adding Fields

Backward compatible: New fields get new IDs, old documents remain valid.

// Version 1: User schema
// 1: "_id", 2: "name", 3: "email"

// Version 2: Add "phone"
// 1: "_id", 2: "name", 3: "email", 4: "phone"

Old documents:

Missing field 4 → treated as null or default value
No re-serialization required

Removing Fields

Forward compatible: Removed field IDs are marked as deprecated.

// Version 3: Remove "email" (field 3)
// Mark field 3 as deprecated in schema

New code:

Ignores field 3 during deserialization
Old documents with field 3 remain valid (data is skipped)

Renaming Fields

Breaking change: Requires migration.

// Version 4: Rename "phone" → "mobile_phone"

// Option 1: Lazy migration on read
if (doc.ContainsKey("phone"))
{
    doc["mobile_phone"] = doc["phone"];
    doc.Remove("phone");
    UpdateDocument(doc);
}

// Option 2: Batch migration script
foreach (var doc in collection.FindAll())
{
    if (doc.ContainsKey("phone"))
    {
        doc["mobile_phone"] = doc["phone"];
        doc.Remove("phone");
        collection.Update(doc);
    }
}

Future Enhancements

1. Adaptive Key Width

Use 1 byte for field IDs when schema has <256 fields:

Small schema flag: [1 bit in document header]
If set: field IDs are 1 byte (0-255)
Else: field IDs are 2 bytes (0-65535)

Potential savings: Additional 1 byte per field for small schemas.

2. Delta Compression

Store only changed fields in updates:

┌──────────────────────────────────────┐
│ [Base Document ID]                   │
│ [Changed Field IDs bitmap]           │
│ [Changed Field Values]               │
└──────────────────────────────────────┘

3. Column-Oriented Storage

Separate storage for each field:

Field 1 file: [all _id values]
Field 2 file: [all name values]
Field 3 file: [all email values]

Benefits:

Faster analytics (read only needed columns)
Better compression (similar data together)
Efficient projections (SELECT name, email FROM ...)

4. Hybrid Format Support

Reader auto-detects C-BSON vs. Standard BSON:

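// Heuristic detection (a sketch; not reliable on its own)
// firstElement[0] is the type code. In C-BSON, firstElement[2] is the high
// byte of the field ID, which is 0x00 for IDs below 256. In standard BSON it
// is the second byte of the field name, which is 0x00 only for
// one-character names.
if (firstElement[2] == 0x00)
    return ReadCBSON();
else
    return ReadStandardBSON();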

Conclusion

C-BSON achieves significant storage and performance improvements while maintaining BSON's type system and flexibility:

30-60% smaller documents for field-name-heavy schemas via key compression
Zero-allocation I/O with Span<byte>
Full BSON type compatibility
Schema-based for type safety and evolution

This format is the foundation of CBDD's high-performance embedded database engine, enabling millions of documents to fit in memory and cache while minimizing disk I/O.
