
Compression + Compaction Implementation Plan

1. Objectives

Implement two major capabilities in CBDD:

  1. Compression for stored document payloads (including overflow support, compatibility, safety, and telemetry).
  2. Compaction/Vacuum for reclaiming fragmented space and shrinking database files safely.

This plan is grounded in the current architecture:

  • Storage pages and free-list in src/CBDD.Core/Storage/PageFile.cs
  • Slot/page layout in src/CBDD.Core/Storage/SlottedPage.cs
  • Document CRUD + overflow paths in src/CBDD.Core/Collections/DocumentCollection.cs
  • WAL/transaction/recovery semantics in src/CBDD.Core/Storage/StorageEngine*.cs and src/CBDD.Core/Transactions/WriteAheadLog.cs
  • Collection metadata in src/CBDD.Core/Storage/StorageEngine.Collections.cs

2. Key Design Decisions

2.1 Compression unit: Per-document before overflow

Chosen model: compress the full serialized BSON document first, then apply existing overflow chunking to the stored bytes.

Why this choice:

  • Single compression decision per document (simple threshold logic).
  • Overflow logic remains generic over byte blobs.
  • Update path can compare stored compressed size vs existing slot size directly.
  • Better ratio than per-chunk for many document shapes.

Consequences:

  • HasOverflow continues to represent storage chaining only.
  • Compressed continues to represent payload encoding.
  • Read path always reconstructs full stored blob (from primary + overflow), then decompresses if flagged.

2.2 Codec strategy

Implement a codec abstraction with initial built-in codecs backed by .NET primitives:

  • None
  • Brotli
  • Deflate

Expose via config:

  • EnableCompression
  • Codec
  • Level
  • MinSizeBytes
  • MinSavingsPercent

Add safety knobs:

  • MaxDecompressedSizeBytes (guardrail)
  • optional MaxCompressionInputBytes (defensive cap for write path)
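The options and codec surface above could be sketched as follows. Property defaults are illustrative placeholders, not decided values; `BrotliStream` is the .NET primitive the plan refers to, and the read path enforces `MaxDecompressedSizeBytes` while streaming:

```csharp
using System;
using System.IO;
using System.IO.Compression;

public enum CompressionCodec : byte { None = 0, Brotli = 1, Deflate = 2 }

public sealed class CompressionOptions
{
    public bool EnableCompression { get; init; }                       // disabled by default
    public CompressionCodec Codec { get; init; } = CompressionCodec.Brotli;
    public CompressionLevel Level { get; init; } = CompressionLevel.Fastest;
    public int MinSizeBytes { get; init; } = 512;                      // illustrative threshold
    public int MinSavingsPercent { get; init; } = 10;                  // illustrative threshold
    public long MaxDecompressedSizeBytes { get; init; } = 64L << 20;   // guardrail
    public long MaxCompressionInputBytes { get; init; } = 256L << 20;  // write-path cap
}

public interface ICompressionCodec
{
    CompressionCodec Id { get; }
    byte[] Compress(byte[] input, CompressionLevel level);
    byte[] Decompress(byte[] input, long maxDecompressedSizeBytes);
}

public sealed class BrotliCodec : ICompressionCodec
{
    public CompressionCodec Id => CompressionCodec.Brotli;

    public byte[] Compress(byte[] input, CompressionLevel level)
    {
        using var output = new MemoryStream();
        using (var brotli = new BrotliStream(output, level, leaveOpen: true))
            brotli.Write(input, 0, input.Length);
        return output.ToArray();
    }

    public byte[] Decompress(byte[] input, long maxDecompressedSizeBytes)
    {
        using var brotli = new BrotliStream(new MemoryStream(input), CompressionMode.Decompress);
        using var output = new MemoryStream();
        var buffer = new byte[81920];
        int read;
        while ((read = brotli.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Enforce the guardrail during streaming, before the full blob materializes.
            if (output.Length + read > maxDecompressedSizeBytes)
                throw new InvalidDataException("decompressed size exceeds MaxDecompressedSizeBytes");
            output.Write(buffer, 0, read);
        }
        return output.ToArray();
    }
}
```

A `DeflateCodec` would look identical with `DeflateStream` substituted, and the codec registry in Workstream A would map `CompressionCodec` ids to instances.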

2.3 Metadata strategy

Use layered metadata for compatibility and decoding:

  1. DB-level persistent metadata (Page 0 extension region):
  • DB format version
  • feature flags (compression enabled capability)
  • default codec id
  2. Page-level format metadata:
  • page format version marker (for mixed old/new page parsing)
  • optional default codec hint (for diagnostics and future tuning)
  3. Slot payload metadata for compressed entries (fixed header prefix in stored payload):
  • codec id
  • original length
  • compressed length
  • checksum (CRC32 of compressed payload bytes)

This avoids breaking old uncompressed pages while still satisfying “readers know how to decode” and checksum requirements.
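One possible fixed-prefix layout for the slot payload metadata in item 3 (the 17-byte size and field offsets are an assumption for illustration, not a finalized format):

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical 17-byte prefix stored before the compressed bytes:
//   [0]       codec id (byte)
//   [1..9)    original length  (long,  little-endian)
//   [9..13)   compressed length (int,  little-endian)
//   [13..17)  CRC32 of compressed payload (uint, little-endian)
public readonly struct CompressedPayloadHeader
{
    public const int Size = 17;

    public byte CodecId { get; init; }
    public long OriginalLength { get; init; }
    public int CompressedLength { get; init; }
    public uint Checksum { get; init; }

    public void Write(Span<byte> dest)
    {
        dest[0] = CodecId;
        BinaryPrimitives.WriteInt64LittleEndian(dest.Slice(1), OriginalLength);
        BinaryPrimitives.WriteInt32LittleEndian(dest.Slice(9), CompressedLength);
        BinaryPrimitives.WriteUInt32LittleEndian(dest.Slice(13), Checksum);
    }

    public static CompressedPayloadHeader Read(ReadOnlySpan<byte> src)
    {
        if (src.Length < Size)
            throw new InvalidDataException("payload too short for compression header");
        return new CompressedPayloadHeader
        {
            CodecId = src[0],
            OriginalLength = BinaryPrimitives.ReadInt64LittleEndian(src.Slice(1)),
            CompressedLength = BinaryPrimitives.ReadInt32LittleEndian(src.Slice(9)),
            Checksum = BinaryPrimitives.ReadUInt32LittleEndian(src.Slice(13)),
        };
    }
}
```

The read path can validate all four fields before handing bytes to the codec, which keeps corruption failures deterministic.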

3. Workstreams

3.1 Workstream A: Compression Core + Config Surface

Deliverables

  • New compression options model and codec abstraction.
  • Persistent file/page format metadata support.
  • Telemetry primitives for compression counters.

Changes

  • Add src/CBDD.Core/Compression/:
    • CompressionOptions.cs
    • CompressionCodec.cs
    • ICompressionCodec.cs
    • CompressionService.cs
    • CompressedPayloadHeader.cs
    • CompressionTelemetry.cs
  • Extend context/engine construction:
    • src/CBDD.Core/DocumentDbContext.cs
    • src/CBDD.Core/Storage/StorageEngine.cs
  • Add DB metadata read/write helpers:
    • src/CBDD.Core/Storage/PageFile.cs
    • new src/CBDD.Core/Storage/StorageEngine.Format.cs

Implementation tasks

  1. Introduce CompressionOptions with defaults:
    • compression disabled by default
    • conservative thresholds (MinSizeBytes, MinSavingsPercent)
  2. Add codec registry/factory.
  3. Add DB format metadata block in page 0 extension with version + feature flags + default codec id.
  4. Add page format marker to slotted pages on allocation path (new pages only).
  5. Add telemetry counter container (thread-safe atomic counters).

Acceptance

  • Existing DB files open unchanged.
  • New DB files persist format metadata.
  • Compression service can roundtrip payloads with selected codec.

3.2 Workstream B: Insert + Read Paths (new writes first)

Deliverables

  • Compression on insert with threshold logic.
  • Safe decompression on reads with checksum and size validation.
  • Fallback to uncompressed write on compression failure.

Changes

  • src/CBDD.Core/Collections/DocumentCollection.cs:
    • InsertDataCore
    • InsertIntoPage
    • InsertWithOverflow
    • FindByLocation
  • src/CBDD.Core/Storage/SlottedPage.cs (if slot/page metadata helpers are added)

Implementation tasks

  1. Insert path:
    • Serialize BSON (existing behavior).
    • If compression enabled and docData.Length >= MinSizeBytes, try codec.
    • Compute savings and only set SlotFlags.Compressed if threshold met.
    • Build compressed payload as: [CompressedPayloadHeader][compressed bytes].
    • On any compression exception, increment failure counter and store uncompressed.
  2. Overflow path:
    • Apply overflow based on stored bytes length (compressed or uncompressed).
    • No separate compression of chunks/pages.
  3. Read path:
    • Existing overflow reassembly first.
    • If Compressed flag present:
      • parse payload header
      • validate compressed length + original length bounds
      • validate checksum before decompression
      • decompress using codec id
      • enforce MaxDecompressedSizeBytes
  4. Corruption handling:
    • throw deterministic InvalidDataException for header/checksum/size violations.
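The insert-path decision in task 1 could look like the sketch below. Method and parameter names are illustrative (the real logic would live in `InsertDataCore`, prepend the `CompressedPayloadHeader`, and bump the failure counter in the catch):

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class InsertPathSketch
{
    // Returns the bytes to hand to the overflow logic, plus whether
    // SlotFlags.Compressed should be set on the slot.
    public static (byte[] Stored, bool Compressed) Prepare(
        byte[] docData, bool enabled, int minSizeBytes, int minSavingsPercent)
    {
        if (!enabled || docData.Length < minSizeBytes)
            return (docData, false);                    // below threshold: store raw
        try
        {
            using var ms = new MemoryStream();
            using (var brotli = new BrotliStream(ms, CompressionLevel.Fastest, leaveOpen: true))
                brotli.Write(docData, 0, docData.Length);
            byte[] packed = ms.ToArray();

            long savings = 100L * (docData.Length - packed.Length) / docData.Length;
            if (savings < minSavingsPercent)
                return (docData, false);                // not worth the flag: store raw

            // Real code prepends CompressedPayloadHeader (codec id, lengths, CRC32)
            // here before the blob reaches InsertIntoPage / InsertWithOverflow.
            return (packed, true);
        }
        catch
        {
            // Plan: increment the compression failure counter, then fall back.
            return (docData, false);
        }
    }
}
```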

Acceptance

  • Inserts/reads succeed for uncompressed docs (regression-safe).
  • Mixed compressed/uncompressed documents in same collection read correctly.
  • Corrupted compressed payload is detected and rejected predictably.

3.3 Workstream C: Update/Delete + Overflow Consistency

Deliverables

  • Compression-aware update decisions.
  • Correct delete behavior for compressed+overflow combinations.

Changes

  • src/CBDD.Core/Collections/DocumentCollection.cs:
    • UpdateDataCore
    • DeleteCore
    • FreeOverflowChain

Implementation tasks

  1. Update path:
    • Recompute storage payload for new document using same compression decision logic as insert.
    • In-place update only when:
      • existing slot is non-overflow, and
      • new stored payload length <= old slot length, and
      • compression flag/metadata can be updated safely.
    • Otherwise relocate (existing delete+insert strategy).
  2. Delete path:
    • Keep logical semantics unchanged.
    • Ensure overflow chain extraction still works when slot has both Compressed and HasOverflow.
  3. Overflow consistency tests:
    • compressed small -> compressed overflow transitions on update
    • compressed overflow -> uncompressed small transitions
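The in-place rule from task 1 reduces to a small predicate; the parameters below are hypothetical stand-ins for slot state read from `SlottedPage`:

```csharp
static class UpdatePathSketch
{
    // In-place only for non-overflow slots where the new stored payload
    // (compressed or not) fits the existing slot; otherwise relocate via the
    // existing delete+insert strategy.
    public static bool CanUpdateInPlace(int newStoredLength, int oldSlotLength, bool oldHasOverflow)
        => !oldHasOverflow && newStoredLength <= oldSlotLength;
}
```

Note the comparison uses the *stored* length, so a document that newly crosses the compression threshold can still update in place when its compressed form fits the old slot.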

Acceptance

  • Update behavior preserves correctness for all compression/overflow combinations.
  • Delete frees overflow pages for compressed and uncompressed overflow docs.

3.4 Workstream D: Compaction / Shrink (Offline first)

Deliverables

  • Public Compact/Vacuum maintenance API.
  • Offline copy-and-swap compaction with crash-safe finalize.
  • Exact pre/post space accounting.

API surface

  • Add to DocumentDbContext:
    • Compact(CompactionOptions? options = null)
    • CompactAsync(...)
    • alias Vacuum(...)
  • Engine-level operation in new file:
    • src/CBDD.Core/Storage/StorageEngine.Maintenance.cs

Offline mode algorithm (Phase 1)

  1. Acquire exclusive maintenance gate (block writes).
  2. Checkpoint WAL before start.
  3. Build a temporary database file (*.compact.tmp) with same page config and compression config.
  4. Copy logical contents collection-by-collection:
    • preserve collection metadata/index definitions
    • reinsert documents through collection APIs so locations are rewritten correctly
    • rebuild/update index roots in metadata
  5. Checkpoint temp DB and fsync.
  6. Atomic finalize (copy-and-swap):
    • write state marker (*.compact.state) for resumability
    • rename original -> backup, temp -> original
    • reset/remove WAL appropriately
    • remove marker
  7. Produce CompactionStats with exact pre/post bytes and deltas.

Crash safety

  • Use explicit state-machine marker file with phases (Started, Copied, Swapped, CleanupDone).
  • On startup, detect marker and resume/repair idempotently.
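A minimal sketch of the finalize/resume machinery, assuming the `*.compact.*` naming from the algorithm above; the backup suffix and phase-dispatch details are illustrative, and a production version would fsync the marker and directory between phases:

```csharp
using System;
using System.IO;

enum CompactionPhase { Started, Copied, Swapped, CleanupDone }

static class CompactFinalizeSketch
{
    public static void Finalize(string dbPath)
    {
        string tmp = dbPath + ".compact.tmp";
        string state = dbPath + ".compact.state";
        string backup = dbPath + ".compact.bak";   // hypothetical backup name

        File.WriteAllText(state, CompactionPhase.Copied.ToString());

        File.Move(dbPath, backup);                 // rename original -> backup
        File.Move(tmp, dbPath);                    // temp -> original
        File.WriteAllText(state, CompactionPhase.Swapped.ToString());

        // WAL reset happens here; the marker is removed last, so a crash at
        // any point leaves a resumable phase on disk.
        File.Delete(backup);
        File.Delete(state);
    }

    public static void ResumeIfNeeded(string dbPath)
    {
        string state = dbPath + ".compact.state";
        if (!File.Exists(state)) return;
        var phase = Enum.Parse<CompactionPhase>(File.ReadAllText(state));
        // Dispatch to idempotent repair: roll forward after Swapped,
        // restart the copy (or roll back) for earlier phases.
        _ = phase;
    }
}
```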

Acceptance

  • File shrinks when free tail pages exist.
  • No data/index loss after compaction.
  • Crash during compaction is recoverable and deterministic.

3.5 Workstream E: Online Compaction + Scheduling

Deliverables

  • Online mode with throttled relocation.
  • Scheduling hooks (manual/startup/threshold-based trigger).

Online mode strategy (Phase 2)

  1. Background scanner identifies fragmented pages and relocation candidates.
  2. Move documents in bounded batches under short write exclusion windows.
  3. Update primary and secondary index locations transactionally.
  4. Periodic checkpoints to bound WAL growth.
  5. Tail truncation pass when contiguous free pages reach EOF.

Scheduling hooks

  • MaintenanceOptions:
    • RunAtStartup
    • MinFragmentationPercent
    • MinReclaimableBytes
    • MaxRunDuration
    • OnlineThrottle (ops/sec or pages/batch)
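A possible shape for `MaintenanceOptions`; every default below is illustrative rather than a decided value, and the throttle is expressed here as pages per batch:

```csharp
using System;

public sealed class MaintenanceOptions
{
    public bool RunAtStartup { get; init; }                                   // off by default
    public double MinFragmentationPercent { get; init; } = 30;                // trigger threshold
    public long MinReclaimableBytes { get; init; } = 16L << 20;               // skip tiny wins
    public TimeSpan MaxRunDuration { get; init; } = TimeSpan.FromMinutes(10); // bounded runs
    public int OnlineThrottlePagesPerBatch { get; init; } = 64;               // relocation batch size
}
```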

Acceptance

  • Writes continue during online mode except small lock windows.
  • Recovery semantics remain valid with WAL and checkpoints.

4. Compaction Internals Required by Both Modes

4.1 Page defragmentation utilities

  • Add slotted-page defrag helper:
    • rewrites active slots/data densely
    • recomputes FreeSpaceStart/End
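The defrag pass can be sketched against a simplified slot model; the real helper would operate on `SlottedPage`'s actual directory layout and recompute `FreeSpaceStart`/`End`:

```csharp
using System;
using System.Collections.Generic;

static class PageDefragSketch
{
    // Rewrites live records densely from dataStart and repoints the slot
    // directory; returns the new FreeSpaceStart.
    public static int Defragment(
        List<(int Offset, byte[] Data, bool Deleted)> slots, byte[] page, int dataStart)
    {
        int write = dataStart;
        for (int i = 0; i < slots.Count; i++)
        {
            var slot = slots[i];
            if (slot.Deleted) continue;               // dead slots leave no gap behind
            Array.Copy(slot.Data, 0, page, write, slot.Data.Length);
            slots[i] = (write, slot.Data, false);     // repoint directory entry
            write += slot.Data.Length;
        }
        return write;
    }
}
```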

4.2 Free-list consolidation + tail truncation

  • Extend PageFile with:
    • free-page enumeration
    • free-list normalization
    • safe truncation when free pages are contiguous at end-of-file
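Only the run of free pages that reaches end-of-file can be truncated, which reduces to a backward scan over the free-list; `freePages` here stands in for the `PageFile` free-list as page ids:

```csharp
using System.Collections.Generic;

static class TailTruncationSketch
{
    // Counts free pages contiguous with EOF; multiply by page size to get the
    // byte count a FileStream.SetLength-style truncation can reclaim.
    public static long ComputeTruncatablePages(ISet<long> freePages, long pageCount)
    {
        long firstKept = pageCount;
        while (firstKept > 0 && freePages.Contains(firstKept - 1))
            firstKept--;
        return pageCount - firstKept;
    }
}
```

An interior free page (one followed by a live page) blocks the run, which is exactly why relocation must precede truncation in both compaction modes.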

4.3 Metadata/index pointer correctness

  • Ensure collection metadata root IDs and index root IDs are rewritten/verified after relocation/copy.
  • Add validation pass that checks all primary index locations resolve to non-deleted slots.

4.4 WAL coordination

  • Explicit checkpoint before and after compaction.
  • Ensure compaction writes follow normal WAL durability semantics.
  • Keep Recover() behavior valid with compaction marker states.

5. Compatibility and Migration

5.1 Compatibility goals

  • Read old uncompressed files unchanged.
  • Support mixed pages/documents (compressed + uncompressed) in same DB.
  • Preserve existing APIs unless explicitly extended.

5.2 Migration tool (optional one-time rewrite)

  • Implement MigrateCompression(...) as an offline rewrite command using the same copy-and-swap machinery.
  • Options:
    • target codec/level
    • per-collection include/exclude
    • dry-run estimation mode

6. Telemetry + Admin Tooling

6.1 Compression telemetry counters

  • compressed document count
  • bytes before/after
  • compression ratio aggregate
  • compression CPU time
  • decompression CPU time
  • compression failure count
  • checksum failure count
  • safety-limit rejection count

Expose via:

  • StorageEngine.GetCompressionStats()
  • context-level forwarding method in DocumentDbContext
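The counter container from Workstream A could be a plain `Interlocked`-based class; the method names below are illustrative, with the counters mirroring the list above:

```csharp
using System.Threading;

public sealed class CompressionTelemetry
{
    private long _compressedDocs, _bytesBefore, _bytesAfter, _failures, _checksumFailures;

    public void RecordCompressed(int bytesBefore, int bytesAfter)
    {
        Interlocked.Increment(ref _compressedDocs);
        Interlocked.Add(ref _bytesBefore, bytesBefore);
        Interlocked.Add(ref _bytesAfter, bytesAfter);
    }

    public void RecordFailure() => Interlocked.Increment(ref _failures);
    public void RecordChecksumFailure() => Interlocked.Increment(ref _checksumFailures);

    // Aggregate ratio = stored bytes / original bytes; 1.0 when nothing compressed.
    public double AggregateRatio
    {
        get
        {
            long before = Volatile.Read(ref _bytesBefore);
            return before == 0 ? 1.0 : (double)Volatile.Read(ref _bytesAfter) / before;
        }
    }
}
```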

6.2 Compaction telemetry/stats

  • pre/post file size
  • live bytes
  • free bytes
  • fragmentation percentage
  • documents/pages relocated
  • runtime and throughput

6.3 Admin inspection APIs

Add diagnostics APIs (engine/context):

  • page usage by collection/page type
  • compression ratio by collection
  • fragmentation map and free-list summary

7. Tests

Add focused test suites in tests/CBDD.Tests/:

  1. CompressionInsertReadTests.cs
  • threshold on/off behavior
  • mixed compressed/uncompressed reads
  • fallback to uncompressed on forced codec error
  2. CompressionOverflowTests.cs
  • compressed docs that span overflow pages
  • transitions across size thresholds
  3. CompressionCorruptionTests.cs
  • bad checksum
  • bad original length
  • oversized decompression guardrail
  • invalid codec id
  4. CompressionCompatibilityTests.cs
  • open existing uncompressed DB files
  • mixed-format pages after partial migration
  5. CompactionOfflineTests.cs
  • logical equivalence pre/post compact
  • index correctness pre/post compact
  • tail truncation actually reduces file size
  6. CompactionCrashRecoveryTests.cs
  • simulate crashes at each copy/swap state
  • resume/finalize behavior
  7. CompactionOnlineConcurrencyTests.cs
  • concurrent writes + reads during online compact
  • correctness and no deadlock
  8. CompactionWalCoordinationTests.cs
  • checkpoint before/after behavior
  • recoverability with in-flight WAL entries

Also update/extend existing tests:

  • tests/CBDD.Tests/DocumentOverflowTests.cs
  • tests/CBDD.Tests/DbContextTests.cs
  • tests/CBDD.Tests/WalIndexTests.cs

8. Benchmark Additions

Extend tests/CBDD.Tests.Benchmark/ with:

  1. CompressionBenchmarks.cs
  • insert/update/read workloads with compression on/off
  • codec and level comparison
  2. CompactionBenchmarks.cs
  • offline compact runtime
  • reclaimable bytes vs elapsed
  3. MixedWorkloadBenchmarks.cs
  • insert/update/delete-heavy cache workload
  • periodic compact impact
  4. Update DatabaseSizeBenchmark.cs
  • pre/post compact shrink delta reporting
  • compression ratio reporting

9. Suggested Implementation Order (Execution Plan)

Phase 1 (as requested): Compression config + read/write path for new writes only

  • Workstream A
  • Workstream B (insert/read only)
  • initial tests: insert/read + compatibility

Phase 2 (as requested): Compression-aware updates/deletes + overflow handling

  • Workstream C
  • overflow-focused tests + corruption guards

Phase 3 (as requested): Offline copy-and-swap compaction/shrink

  • Workstream D + shared internals from section 4
  • crash-safe finalize + space accounting

Phase 4 (as requested): Online compaction + automation hooks

  • Workstream E
  • concurrency and scheduling tests

10. Subagent Execution Safety + Completion Verification

10.1 Subagent ownership model

Use explicit, non-overlapping ownership to avoid unsafe parallel edits:

  1. Subagent A (Compression Core)
  • Owns src/CBDD.Core/Compression/*
  • Owns format/config plumbing in src/CBDD.Core/Storage/StorageEngine.Format.cs, src/CBDD.Core/Storage/StorageEngine.cs, src/CBDD.Core/DocumentDbContext.cs
  2. Subagent B (CRUD + Overflow Compression Semantics)
  • Owns compression-aware CRUD changes in src/CBDD.Core/Collections/DocumentCollection.cs
  • Owns slot/payload metadata helpers in src/CBDD.Core/Storage/SlottedPage.cs (if needed)
  3. Subagent C (Compaction/Vacuum Engine)
  • Owns src/CBDD.Core/Storage/StorageEngine.Maintenance.cs
  • Owns related PageFile extensions in src/CBDD.Core/Storage/PageFile.cs
  4. Subagent D (Verification Assets)
  • Owns new/updated tests in tests/CBDD.Tests/*Compression*, tests/CBDD.Tests/*Compaction*
  • Owns benchmark additions in tests/CBDD.Tests.Benchmark/*

Rule: one file has exactly one active subagent owner at a time.

10.2 Safe collaboration rules for subagents

  1. Do not edit files outside assigned ownership scope.
  2. Do not revert or reformat unrelated existing changes.
  3. Do not change public contracts outside assigned workstream without explicit handoff.
  4. If overlap is discovered mid-task, stop and reassign ownership before continuing.
  5. Keep changes atomic and reviewable (small PR-sized batches per workstream phase).

10.3 Required handoff payload from each subagent

Each completion handoff MUST include:

  1. Summary of implemented requirements and non-implemented items.
  2. Exact touched file list.
  3. Risk list (behavioral, compatibility, recovery, perf).
  4. Test commands executed and pass/fail results.
  5. Mapping to plan acceptance criteria sections.

10.4 Mandatory verification on subagent completion

Every subagent completion is verified before merge/mark-done:

  1. Scope verification
  • Confirm touched files are in owned scope only.
  • Confirm required plan items for that phase are implemented.
  2. Correctness verification
  • Run targeted tests for touched behavior.
  • Run related regression suites (CRUD, overflow, WAL/recovery, index consistency where applicable).
  3. Safety verification
  • Validate corruption/safety guard behavior (checksum, size limits, crash-state handling where applicable).
  • Validate backward compatibility with old uncompressed files when relevant.
  4. Performance verification
  • Run benchmark smoke checks for modified hot paths.
  • Verify no obvious regressions against baseline thresholds.
  5. Integration verification
  • Rebuild solution and run full test pass before final phase closure.

10.5 Verification commands (minimum gate)

Run these gates after each subagent completion (adjust filters to scope):

  1. dotnet build /Users/dohertj2/Desktop/CBDD/CBDD.slnx
  2. dotnet test /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests/ZB.MOM.WW.CBDD.Tests.csproj
  3. Targeted suites for the changed area, for example:
  • dotnet test /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests/ZB.MOM.WW.CBDD.Tests.csproj --filter "FullyQualifiedName~Compression"
  • dotnet test /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests/ZB.MOM.WW.CBDD.Tests.csproj --filter "FullyQualifiedName~Compaction"
  4. Benchmark smoke run when hot paths changed:
  • dotnet run -c Release --project /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests.Benchmark/ZB.MOM.WW.CBDD.Tests.Benchmark.csproj

If any gate fails, the subagent task is not complete and must be returned for rework with failure notes.

11. Definition of Done (Release Gates)

  1. Correctness
  • all new compression/compaction test suites green
  • no regressions in existing test suite
  2. Compatibility
  • old DB files readable with no migration required
  • mixed-format operation validated
  3. Safety
  • decompression guardrails enforced
  • corruption checks and deterministic failure behavior
  • crash recovery scenarios covered
  4. Performance
  • documented benchmark deltas for write/read overhead and compaction throughput
  • no pathological GC spikes under compression-enabled workloads
  5. Operability
  • telemetry counters exposed
  • admin diagnostics available for page usage/compression/fragmentation