# Compression + Compaction Implementation Plan

## 1. Objectives

Implement two major capabilities in CBDD:

1. Compression for stored document payloads (including overflow support, compatibility, safety, and telemetry).
2. Compaction/vacuum for reclaiming fragmented space and shrinking database files safely.

This plan is grounded in the current architecture:

- Storage pages and free-list in `src/CBDD.Core/Storage/PageFile.cs`
- Slot/page layout in `src/CBDD.Core/Storage/SlottedPage.cs`
- Document CRUD + overflow paths in `src/CBDD.Core/Collections/DocumentCollection.cs`
- WAL/transaction/recovery semantics in `src/CBDD.Core/Storage/StorageEngine*.cs` and `src/CBDD.Core/Transactions/WriteAheadLog.cs`
- Collection metadata in `src/CBDD.Core/Storage/StorageEngine.Collections.cs`

## 2. Key Design Decisions

### 2.1 Compression unit: **Per-document, before overflow**

Chosen model: compress the full serialized BSON document first, then apply the existing overflow chunking to the stored bytes.

Why this choice:

- Single compression decision per document (simple threshold logic).
- Overflow logic remains generic over byte blobs.
- The update path can compare the stored compressed size against the existing slot size directly.
- Better compression ratio than per-chunk compression for many document shapes.

Consequences:

- `HasOverflow` continues to represent storage chaining only.
- `Compressed` continues to represent payload encoding.
- The read path always reconstructs the full stored blob (from primary + overflow), then decompresses if flagged.

### 2.2 Codec strategy

Implement a codec abstraction with initial built-in codecs backed by .NET primitives:

- `None`
- `Brotli`
- `Deflate`

Expose via config:

- `EnableCompression`
- `Codec`
- `Level`
- `MinSizeBytes`
- `MinSavingsPercent`

Add safety knobs:

- `MaxDecompressedSizeBytes` (guardrail)
- optional `MaxCompressionInputBytes` (defensive cap for the write path)

### 2.3 Metadata strategy

Use layered metadata for compatibility and decoding:

1. DB-level persistent metadata (page 0 extension region):
   - DB format version
   - feature flags (compression-enabled capability)
   - default codec id
2. Page-level format metadata:
   - page format version marker (for mixed old/new page parsing)
   - optional default codec hint (for diagnostics and future tuning)
3. Slot payload metadata for compressed entries (fixed header prefix in the stored payload):
   - codec id
   - original length
   - compressed length
   - checksum (CRC32 of the compressed payload bytes)

This avoids breaking old uncompressed pages while still satisfying the “readers know how to decode” and checksum requirements.

## 3. Workstreams

## 3.1 Workstream A: Compression Core + Config Surface

### Deliverables

- New compression options model and codec abstraction (a minimal shape is sketched after this workstream).
- Persistent file/page format metadata support.
- Telemetry primitives for compression counters.

### Changes

- Add `src/CBDD.Core/Compression/`:
  - `CompressionOptions.cs`
  - `CompressionCodec.cs`
  - `ICompressionCodec.cs`
  - `CompressionService.cs`
  - `CompressedPayloadHeader.cs`
  - `CompressionTelemetry.cs`
- Extend context/engine construction:
  - `src/CBDD.Core/DocumentDbContext.cs`
  - `src/CBDD.Core/Storage/StorageEngine.cs`
- Add DB metadata read/write helpers:
  - `src/CBDD.Core/Storage/PageFile.cs`
  - new `src/CBDD.Core/Storage/StorageEngine.Format.cs`

### Implementation tasks

1. Introduce `CompressionOptions` with defaults:
   - compression disabled by default
   - conservative thresholds (`MinSizeBytes`, `MinSavingsPercent`)
2. Add a codec registry/factory.
3. Add the DB format metadata block in the page 0 extension with version + feature flags + default codec id.
4. Add a page format marker to slotted pages on the allocation path (new pages only).
5. Add a telemetry counter container (thread-safe atomic counters).

### Acceptance

- Existing DB files open unchanged.
- New DB files persist format metadata.
- The compression service can round-trip payloads with the selected codec.
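To pin down the Workstream A surface, here is a minimal sketch of the codec abstraction (section 2.2) and the fixed payload header (section 2.3). The type names follow this plan, but the exact member shapes, the 13-byte little-endian layout, and the codec id value are assumptions; only the `System.IO.Compression` APIs are real, and `Stream.ReadExactly` assumes .NET 7+.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Fixed header prefixed to every compressed payload (section 2.3).
// The 13-byte little-endian layout shown here is an illustrative choice.
public readonly record struct CompressedPayloadHeader(
    byte CodecId,        // which codec wrote the payload
    int OriginalLength,  // decompressed size, validated before decompression
    int CompressedLength,
    uint Crc32);         // CRC32 of the compressed payload bytes

// Codec abstraction; names follow section 2.2, the shape is an assumption.
public interface ICompressionCodec
{
    byte CodecId { get; }
    byte[] Compress(ReadOnlySpan<byte> input);
    byte[] Decompress(ReadOnlySpan<byte> input, int originalLength);
}

// Brotli codec backed by the built-in System.IO.Compression.BrotliStream.
public sealed class BrotliCodec : ICompressionCodec
{
    public byte CodecId => 1; // illustrative id

    public byte[] Compress(ReadOnlySpan<byte> input)
    {
        using var output = new MemoryStream();
        using (var brotli = new BrotliStream(output, CompressionLevel.Optimal, leaveOpen: true))
            brotli.Write(input);
        return output.ToArray();
    }

    public byte[] Decompress(ReadOnlySpan<byte> input, int originalLength)
    {
        using var source = new MemoryStream(input.ToArray());
        using var brotli = new BrotliStream(source, CompressionMode.Decompress);
        var result = new byte[originalLength]; // exact size known from the header
        brotli.ReadExactly(result);            // .NET 7+; loop over Read() otherwise
        return result;
    }
}
```

Sizing the output buffer from the header's `OriginalLength` is what lets the read path enforce `MaxDecompressedSizeBytes` before any decompression work happens.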
## 3.2 Workstream B: Insert + Read Paths (new writes first)

### Deliverables

- Compression on insert with threshold logic.
- Safe decompression on reads with checksum and size validation.
- Fallback to uncompressed write on compression failure.

### Changes

- `src/CBDD.Core/Collections/DocumentCollection.cs`:
  - `InsertDataCore`
  - `InsertIntoPage`
  - `InsertWithOverflow`
  - `FindByLocation`
- `src/CBDD.Core/Storage/SlottedPage.cs` (if slot/page metadata helpers are added)

### Implementation tasks

1. Insert path (decision flow sketched after this workstream):
   - Serialize BSON (existing behavior).
   - If compression enabled and `docData.Length >= MinSizeBytes`, try codec.
   - Compute savings and only set `SlotFlags.Compressed` if threshold met.
   - Build compressed payload as: `[CompressedPayloadHeader][compressed bytes]`.
   - On any compression exception, increment failure counter and store uncompressed.
2. Overflow path:
   - Apply overflow based on stored bytes length (compressed or uncompressed).
   - No separate compression of chunks/pages.
3. Read path:
   - Existing overflow reassembly first.
   - If `Compressed` flag present:
     - parse payload header
     - validate compressed length + original length bounds
     - validate checksum before decompression
     - decompress using codec id
     - enforce `MaxDecompressedSizeBytes`
4. Corruption handling:
   - throw deterministic `InvalidDataException` for header/checksum/size violations.

### Acceptance

- Inserts/reads succeed for uncompressed docs (regression-safe).
- Mixed compressed/uncompressed documents in same collection read correctly.
- Corrupted compressed payload is detected and rejected predictably.
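To make task 1's decision flow concrete, here is a sketch of a hypothetical write-path helper. It reuses the `ICompressionCodec` and header layout from the Workstream A sketch; `CompressionWritePath`, `PrepareForStorage`, and the trimmed-down `CompressionOptions` record are illustrative names, not the real integration surface (which is `InsertDataCore`). `Crc32` comes from the `System.IO.Hashing` package.

```csharp
using System;
using System.IO.Hashing; // Crc32 (NuGet package System.IO.Hashing)

// Minimal options shape; the real CompressionOptions carries more knobs (section 2.2).
public sealed record CompressionOptions(
    bool EnableCompression, int MinSizeBytes, double MinSavingsPercent);

public static class CompressionWritePath
{
    private const int HeaderSize = 13; // codec id + two lengths + CRC32 (Workstream A sketch)

    // Returns the bytes to hand to the existing slot/overflow logic and
    // whether SlotFlags.Compressed should be set for the slot.
    public static (byte[] StoredBytes, bool Compressed) PrepareForStorage(
        byte[] docData, CompressionOptions options, ICompressionCodec codec)
    {
        if (!options.EnableCompression || docData.Length < options.MinSizeBytes)
            return (docData, false); // below threshold: store raw bytes

        try
        {
            byte[] compressed = codec.Compress(docData);

            // Header overhead counts against savings; only flag the slot
            // as compressed when the configured savings threshold is met.
            int stored = HeaderSize + compressed.Length;
            double savingsPercent = 100.0 * (docData.Length - stored) / docData.Length;
            if (savingsPercent < options.MinSavingsPercent)
                return (docData, false);

            uint crc = BitConverter.ToUInt32(Crc32.Hash(compressed));
            byte[] payload = new byte[stored]; // [header][compressed bytes]
            payload[0] = codec.CodecId;
            BitConverter.TryWriteBytes(payload.AsSpan(1, 4), docData.Length);
            BitConverter.TryWriteBytes(payload.AsSpan(5, 4), compressed.Length);
            BitConverter.TryWriteBytes(payload.AsSpan(9, 4), crc);
            compressed.CopyTo(payload.AsSpan(HeaderSize));
            return (payload, true);
        }
        catch
        {
            // Any codec failure: bump the failure counter (omitted here)
            // and fall back to an uncompressed write, per this workstream.
            return (docData, false);
        }
    }
}
```

Because the helper returns plain bytes, the existing overflow chunking in task 2 stays generic: it never needs to know whether the blob it splits is compressed.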
## 3.3 Workstream C: Update/Delete + Overflow Consistency

### Deliverables

- Compression-aware update decisions.
- Correct delete behavior for compressed+overflow combinations.

### Changes

- `src/CBDD.Core/Collections/DocumentCollection.cs`:
  - `UpdateDataCore`
  - `DeleteCore`
  - `FreeOverflowChain`

### Implementation tasks

1. Update path:
   - Recompute the storage payload for the new document using the same compression decision logic as insert.
   - In-place update only when:
     - the existing slot is non-overflow, and
     - the new stored payload length <= the old slot length, and
     - the compression flag/metadata can be updated safely.
   - Otherwise relocate (existing delete+insert strategy).
2. Delete path:
   - Keep logical semantics unchanged.
   - Ensure overflow chain extraction still works when a slot has both `Compressed` and `HasOverflow`.
3. Overflow consistency tests:
   - compressed small -> compressed overflow transitions on update
   - compressed overflow -> uncompressed small transitions

### Acceptance

- Update behavior preserves correctness for all compression/overflow combinations.
- Delete frees overflow pages for compressed and uncompressed overflow docs.

## 3.4 Workstream D: Compaction / Shrink (Offline first)

### Deliverables

- Public `Compact`/`Vacuum` maintenance API.
- Offline copy-and-swap compaction with crash-safe finalize.
- Exact pre/post space accounting.

### API surface

- Add to `DocumentDbContext`:
  - `Compact(CompactionOptions? options = null)`
  - `CompactAsync(...)`
  - alias `Vacuum(...)`
- Engine-level operation in a new file:
  - `src/CBDD.Core/Storage/StorageEngine.Maintenance.cs`

### Offline mode algorithm (Phase 1)

1. Acquire an exclusive maintenance gate (block writes).
2. Checkpoint the WAL before starting.
3. Build a temporary database file (`*.compact.tmp`) with the same page config and compression config.
4. Copy logical contents collection-by-collection:
   - preserve collection metadata/index definitions
   - reinsert documents through collection APIs so locations are rewritten correctly
   - rebuild/update index roots in metadata
5. Checkpoint the temp DB and fsync.
6. Atomic finalize (copy-and-swap):
   - write a state marker (`*.compact.state`) for resumability
   - rename original -> backup, temp -> original
   - reset/remove the WAL appropriately
   - remove the marker
7. Produce `CompactionStats` with exact pre/post bytes and deltas.

### Crash safety

- Use an explicit state-machine marker file with phases (`Started`, `Copied`, `Swapped`, `CleanupDone`); marker handling is sketched after this workstream.
- On startup, detect the marker and resume/repair idempotently.

### Acceptance

- The file shrinks when free tail pages exist.
- No data/index loss after compaction.
- A crash during compaction is recoverable and deterministic.
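To illustrate the marker-based recovery referenced under crash safety, here is a minimal sketch. The phases come from this plan; `CompactionPhase`, `CompactionMarker`, and the plain-text marker format are assumptions, and the real logic would live in `StorageEngine.Maintenance.cs`.

```csharp
using System;
using System.IO;

// Phases from the crash-safety scheme above, persisted in *.compact.state
// so startup can resume or repair idempotently.
public enum CompactionPhase { Started, Copied, Swapped, CleanupDone }

public static class CompactionMarker
{
    public static void Advance(string markerPath, CompactionPhase phase)
    {
        // Write-then-rename so the marker file itself changes atomically.
        string tmp = markerPath + ".tmp";
        File.WriteAllText(tmp, phase.ToString());
        File.Move(tmp, markerPath, overwrite: true);
    }

    // Startup hook: decide how to finish or undo an interrupted compaction.
    public static void RecoverIfNeeded(string dbPath)
    {
        string markerPath = dbPath + ".compact.state";
        if (!File.Exists(markerPath))
            return; // no compaction was in flight

        var phase = Enum.Parse<CompactionPhase>(File.ReadAllText(markerPath));
        switch (phase)
        {
            case CompactionPhase.Started:
            case CompactionPhase.Copied:
                // The swap never happened: the original file is still
                // authoritative, so just discard the partial copy.
                File.Delete(dbPath + ".compact.tmp");
                break;

            case CompactionPhase.Swapped:
                // The new file is already in place: rerun the cleanup steps
                // (backup removal, WAL reset), all of which must be idempotent.
                break;
        }
        File.Delete(markerPath);
    }
}
```

The key property is that every phase transition is durable before the work it describes begins, so `RecoverIfNeeded` can always tell which side of the swap is authoritative.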
## 3.5 Workstream E: Online Compaction + Scheduling

### Deliverables

- Online mode with throttled relocation.
- Scheduling hooks (manual/startup/threshold-based trigger).

### Online mode strategy (Phase 2)

1. Background scanner identifies fragmented pages and relocation candidates.
2. Move documents in bounded batches under short write-exclusion windows.
3. Update primary and secondary index locations transactionally.
4. Periodic checkpoints to bound WAL growth.
5. Tail truncation pass when contiguous free pages reach EOF.

### Scheduling hooks

- `MaintenanceOptions`:
  - `RunAtStartup`
  - `MinFragmentationPercent`
  - `MinReclaimableBytes`
  - `MaxRunDuration`
  - `OnlineThrottle` (ops/sec or pages/batch)

### Acceptance

- Writes continue during online mode except for small lock windows.
- Recovery semantics remain valid with WAL and checkpoints.

## 4. Compaction Internals Required by Both Modes

### 4.1 Page defragmentation utilities

- Add a slotted-page defrag helper:
  - rewrites active slots/data densely
  - recomputes `FreeSpaceStart/End`

### 4.2 Free-list consolidation + tail truncation

- Extend `PageFile` with:
  - free-page enumeration
  - free-list normalization
  - safe truncation when free pages are contiguous at end-of-file

### 4.3 Metadata/index pointer correctness

- Ensure collection metadata root IDs and index root IDs are rewritten/verified after relocation/copy.
- Add a validation pass that checks all primary index locations resolve to non-deleted slots.

### 4.4 WAL coordination

- Explicit checkpoint before and after compaction.
- Ensure compaction writes follow normal WAL durability semantics.
- Keep `Recover()` behavior valid with compaction marker states.

## 5. Compatibility and Migration

### 5.1 Compatibility goals

- Read old uncompressed files unchanged.
- Support mixed pages/documents (compressed + uncompressed) in the same DB.
- Preserve existing APIs unless explicitly extended.

### 5.2 Migration tool (optional one-time rewrite)

- Implement `MigrateCompression(...)` as an offline rewrite command using the same copy-and-swap machinery.
- Options:
  - target codec/level
  - per-collection include/exclude
  - dry-run estimation mode

## 6. Telemetry + Admin Tooling

### 6.1 Compression telemetry counters

- compressed document count
- bytes before/after
- compression ratio aggregate
- compression CPU time
- decompression CPU time
- compression failure count
- checksum failure count
- safety-limit rejection count

Expose via:

- `StorageEngine.GetCompressionStats()`
- a context-level forwarding method in `DocumentDbContext`

(A lock-free counter container is sketched after section 6.3.)

### 6.2 Compaction telemetry/stats

- pre/post file size
- live bytes
- free bytes
- fragmentation percentage
- documents/pages relocated
- runtime and throughput

### 6.3 Admin inspection APIs

Add diagnostics APIs (engine/context):

- page usage by collection/page type
- compression ratio by collection
- fragmentation map and free-list summary
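As referenced in 6.1, here is a sketch of the thread-safe counter container. It uses `Interlocked` so the write/read hot paths never take a lock; the member names and the tuple-shaped snapshot are illustrative, and only a subset of the 6.1 counter list is shown.

```csharp
using System.Threading;

// Lock-free counters for the compression hot paths (subset of section 6.1).
public sealed class CompressionTelemetry
{
    private long _compressedDocs, _bytesBefore, _bytesAfter;
    private long _compressionFailures, _checksumFailures, _safetyRejections;

    // Called once per successfully compressed document.
    public void RecordCompressed(long originalLength, long storedLength)
    {
        Interlocked.Increment(ref _compressedDocs);
        Interlocked.Add(ref _bytesBefore, originalLength);
        Interlocked.Add(ref _bytesAfter, storedLength);
    }

    public void RecordCompressionFailure() => Interlocked.Increment(ref _compressionFailures);
    public void RecordChecksumFailure()    => Interlocked.Increment(ref _checksumFailures);
    public void RecordSafetyRejection()    => Interlocked.Increment(ref _safetyRejections);

    // Point-in-time snapshot for StorageEngine.GetCompressionStats().
    public (long Docs, long BytesBefore, long BytesAfter, double Ratio) Snapshot()
    {
        long docs   = Interlocked.Read(ref _compressedDocs);
        long before = Interlocked.Read(ref _bytesBefore);
        long after  = Interlocked.Read(ref _bytesAfter);
        return (docs, before, after, before == 0 ? 1.0 : (double)after / before);
    }
}
```

Keeping the counters as plain `long` fields updated via `Interlocked` avoids any contention cost on the insert/read paths that Workstream B instruments.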
## 7. Tests

Add focused test suites in `tests/CBDD.Tests/`:

1. `CompressionInsertReadTests.cs`
   - threshold on/off behavior
   - mixed compressed/uncompressed reads
   - fallback to uncompressed on forced codec error
2. `CompressionOverflowTests.cs`
   - compressed docs that span overflow pages
   - transitions across size thresholds
3. `CompressionCorruptionTests.cs` (one case is sketched after this section)
   - bad checksum
   - bad original length
   - oversized decompression guardrail
   - invalid codec id
4. `CompressionCompatibilityTests.cs`
   - open existing uncompressed DB files
   - mixed-format pages after partial migration
5. `CompactionOfflineTests.cs`
   - logical equivalence pre/post compact
   - index correctness pre/post compact
   - tail truncation actually reduces file size
6. `CompactionCrashRecoveryTests.cs`
   - simulate crashes at each copy/swap state
   - resume/finalize behavior
7. `CompactionOnlineConcurrencyTests.cs`
   - concurrent writes + reads during online compact
   - correctness and no deadlock
8. `CompactionWalCoordinationTests.cs`
   - checkpoint before/after behavior
   - recoverability with in-flight WAL entries

Also update/extend existing tests:

- `tests/CBDD.Tests/DocumentOverflowTests.cs`
- `tests/CBDD.Tests/DbContextTests.cs`
- `tests/CBDD.Tests/WalIndexTests.cs`
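As an example of the corruption suite's shape, here is one hedged sketch of a `CompressionCorruptionTests` case. It assumes xUnit and reuses the write-path sketch from Workstream B; `CompressionReadPath.Decode` is a hypothetical read-path counterpart that performs header/checksum validation before decompressing.

```csharp
using System;
using System.IO;
using Xunit;

public class CompressionCorruptionTests
{
    [Fact]
    public void FlippedChecksumByte_IsRejectedBeforeDecompression()
    {
        var codec = new BrotliCodec(); // from the Workstream A sketch
        var options = new CompressionOptions(
            EnableCompression: true, MinSizeBytes: 256, MinSavingsPercent: 5.0);
        byte[] doc = new byte[4096];   // zero-filled: compresses very well

        var (stored, compressed) =
            CompressionWritePath.PrepareForStorage(doc, options, codec);
        Assert.True(compressed);       // size and savings thresholds both met

        stored[9] ^= 0xFF;             // corrupt one byte of the CRC32 field

        // The read path must fail deterministically before decompressing,
        // never returning bytes from a payload with a bad checksum.
        // CompressionReadPath.Decode is a hypothetical counterpart to the
        // write-path sketch, not an existing API.
        Assert.Throws<InvalidDataException>(
            () => CompressionReadPath.Decode(stored, codec));
    }
}
```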
## 8. Benchmark Additions

Extend `tests/CBDD.Tests.Benchmark/` with:

1. `CompressionBenchmarks.cs`
   - insert/update/read workloads with compression on/off
   - codec and level comparison
2. `CompactionBenchmarks.cs`
   - offline compact runtime
   - reclaimable bytes vs elapsed time
3. `MixedWorkloadBenchmarks.cs`
   - insert/update/delete-heavy cache workload
   - periodic compact impact
4. Update `DatabaseSizeBenchmark.cs`:
   - pre/post compact shrink delta reporting
   - compression ratio reporting

## 9. Suggested Implementation Order (Execution Plan)

### Phase 1 (as requested): Compression config + read/write path for new writes only

- Workstream A
- Workstream B (insert/read only)
- initial tests: insert/read + compatibility

### Phase 2 (as requested): Compression-aware updates/deletes + overflow handling

- Workstream C
- overflow-focused tests + corruption guards

### Phase 3 (as requested): Offline copy-and-swap compaction/shrink

- Workstream D + shared internals from section 4
- crash-safe finalize + space accounting

### Phase 4 (as requested): Online compaction + automation hooks

- Workstream E
- concurrency and scheduling tests

## 10. Subagent Execution Safety + Completion Verification

### 10.1 Subagent ownership model

Use explicit, non-overlapping ownership to avoid unsafe parallel edits:

1. Subagent A (Compression Core)
   - Owns `src/CBDD.Core/Compression/*`
   - Owns format/config plumbing in `src/CBDD.Core/Storage/StorageEngine.Format.cs`, `src/CBDD.Core/Storage/StorageEngine.cs`, `src/CBDD.Core/DocumentDbContext.cs`
2. Subagent B (CRUD + Overflow Compression Semantics)
   - Owns compression-aware CRUD changes in `src/CBDD.Core/Collections/DocumentCollection.cs`
   - Owns slot/payload metadata helpers in `src/CBDD.Core/Storage/SlottedPage.cs` (if needed)
3. Subagent C (Compaction/Vacuum Engine)
   - Owns `src/CBDD.Core/Storage/StorageEngine.Maintenance.cs`
   - Owns related `PageFile` extensions in `src/CBDD.Core/Storage/PageFile.cs`
4. Subagent D (Verification Assets)
   - Owns new/updated tests in `tests/CBDD.Tests/*Compression*` and `tests/CBDD.Tests/*Compaction*`
   - Owns benchmark additions in `tests/CBDD.Tests.Benchmark/*`

Rule: one file has exactly one active subagent owner at a time.

### 10.2 Safe collaboration rules for subagents

1. Do not edit files outside the assigned ownership scope.
2. Do not revert or reformat unrelated existing changes.
3. Do not change public contracts outside the assigned workstream without an explicit handoff.
4. If overlap is discovered mid-task, stop and reassign ownership before continuing.
5. Keep changes atomic and reviewable (small, PR-sized batches per workstream phase).

### 10.3 Required handoff payload from each subagent

Each completion handoff MUST include:

1. A summary of implemented requirements and non-implemented items.
2. The exact list of touched files.
3. A risk list (behavioral, compatibility, recovery, perf).
4. Test commands executed and pass/fail results.
5. A mapping to the plan's acceptance criteria sections.

### 10.4 Mandatory verification on subagent completion

Every subagent completion is verified before merge/mark-done:

1. Scope verification
   - Confirm touched files are in the owned scope only.
   - Confirm required plan items for that phase are implemented.
2. Correctness verification
   - Run targeted tests for the touched behavior.
   - Run related regression suites (CRUD, overflow, WAL/recovery, index consistency where applicable).
3. Safety verification
   - Validate corruption/safety guard behavior (checksums, size limits, crash-state handling where applicable).
   - Validate backward compatibility with old uncompressed files when relevant.
4. Performance verification
   - Run benchmark smoke checks for modified hot paths.
   - Verify no obvious regressions against baseline thresholds.
5. Integration verification
   - Rebuild the solution and run a full test pass before final phase closure.

### 10.5 Verification commands (minimum gate)

Run these gates after each subagent completion (adjust filters to scope):

1. `dotnet build /Users/dohertj2/Desktop/CBDD/CBDD.slnx`
2. `dotnet test /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests/ZB.MOM.WW.CBDD.Tests.csproj`
3. Targeted suites for the changed area, for example:
   - `dotnet test /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests/ZB.MOM.WW.CBDD.Tests.csproj --filter "FullyQualifiedName~Compression"`
   - `dotnet test /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests/ZB.MOM.WW.CBDD.Tests.csproj --filter "FullyQualifiedName~Compaction"`
4. Benchmark smoke run when hot paths changed:
   - `dotnet run -c Release --project /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests.Benchmark/ZB.MOM.WW.CBDD.Tests.Benchmark.csproj`

If any gate fails, the subagent task is not complete and must be returned for rework with failure notes.

## 11. Definition of Done (Release Gates)

1. Correctness
   - all new compression/compaction test suites green
   - no regressions in the existing test suite
2. Compatibility
   - old DB files readable with no migration required
   - mixed-format operation validated
3. Safety
   - decompression guardrails enforced
   - corruption checks and deterministic failure behavior
   - crash recovery scenarios covered
4. Performance
   - documented benchmark deltas for write/read overhead and compaction throughput
   - no pathological GC spikes under compression-enabled workloads
5. Operability
   - telemetry counters exposed
   - admin diagnostics available for page usage/compression/fragmentation