Compression + Compaction Implementation Plan
1. Objectives
Implement two major capabilities in CBDD:
- Compression for stored document payloads (including overflow support, compatibility, safety, and telemetry).
- Compaction/Vacuum for reclaiming fragmented space and shrinking database files safely.
This plan is grounded in the current architecture:
- Storage pages and free-list in
src/CBDD.Core/Storage/PageFile.cs - Slot/page layout in
src/CBDD.Core/Storage/SlottedPage.cs - Document CRUD + overflow paths in
src/CBDD.Core/Collections/DocumentCollection.cs - WAL/transaction/recovery semantics in
src/CBDD.Core/Storage/StorageEngine*.csandsrc/CBDD.Core/Transactions/WriteAheadLog.cs - Collection metadata in
src/CBDD.Core/Storage/StorageEngine.Collections.cs
2. Key Design Decisions
2.1 Compression unit: Per-document before overflow
Chosen model: compress the full serialized BSON document first, then apply existing overflow chunking to the stored bytes.
Why this choice:
- Single compression decision per document (simple threshold logic).
- Overflow logic remains generic over byte blobs.
- Update path can compare stored compressed size vs existing slot size directly.
- Better ratio than per-chunk for many document shapes.
Consequences:
- `HasOverflow` continues to represent storage chaining only; `Compressed` continues to represent payload encoding.
- Read path always reconstructs the full stored blob (from primary + overflow), then decompresses if flagged.
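A minimal sketch of this ordering, under illustrative names (`BuildStoredPayload` and the direct Brotli usage are stand-ins, not the actual CBDD insert path):

```csharp
using System.IO;
using System.IO.Compression;

// Sketch of the write-path ordering: compress the whole serialized document
// first, then hand the resulting bytes to the existing overflow chunking.
static class WritePathSketch
{
    // Returns the bytes to store plus whether SlotFlags.Compressed should be set.
    public static (byte[] Stored, bool Compressed) BuildStoredPayload(
        byte[] docData, int minSizeBytes, int minSavingsPercent)
    {
        if (docData.Length < minSizeBytes)
            return (docData, false);                 // below threshold: skip codec

        try
        {
            using var buffer = new MemoryStream();
            using (var brotli = new BrotliStream(buffer, CompressionLevel.Fastest, leaveOpen: true))
                brotli.Write(docData, 0, docData.Length);

            byte[] compressed = buffer.ToArray();
            long savings = 100 - compressed.LongLength * 100 / docData.LongLength;
            if (savings >= minSavingsPercent)
                return (compressed, true);           // savings threshold met
        }
        catch (IOException)
        {
            // Compression failure: fall through and store uncompressed.
        }
        return (docData, false);
    }
}
```

Overflow chunking then operates on whatever `Stored` contains, so the overflow code stays generic over byte blobs.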
2.2 Codec strategy
Implement a codec abstraction with initial built-in codecs backed by .NET primitives:
- `None`
- `Brotli`
- `Deflate`
Expose via config:
- `EnableCompression`
- `Codec`
- `Level`
- `MinSizeBytes`
- `MinSavingsPercent`
Add safety knobs:
- `MaxDecompressedSizeBytes` (guardrail)
- optional `MaxCompressionInputBytes` (defensive cap for write path)
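A minimal sketch of this options surface, assuming an enum-based codec id; property names follow the lists above, defaults follow Workstream A, and none of this is the final CBDD type:

```csharp
using System.IO.Compression;

// Illustrative codec ids; the real registry may assign these differently.
public enum CompressionCodecId : byte { None = 0, Brotli = 1, Deflate = 2 }

public sealed class CompressionOptions
{
    public bool EnableCompression { get; init; } = false;   // disabled by default
    public CompressionCodecId Codec { get; init; } = CompressionCodecId.Brotli;
    public CompressionLevel Level { get; init; } = CompressionLevel.Fastest;

    // Write-side thresholds: skip the codec for tiny documents, and keep the
    // compressed form only when it actually saves enough space.
    public int MinSizeBytes { get; init; } = 1024;
    public int MinSavingsPercent { get; init; } = 10;

    // Safety knobs: bound decompression output and compression input.
    public long MaxDecompressedSizeBytes { get; init; } = 64L * 1024 * 1024;
    public long MaxCompressionInputBytes { get; init; } = 64L * 1024 * 1024;
}
```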
2.3 Metadata strategy
Use layered metadata for compatibility and decoding:
- DB-level persistent metadata (Page 0 extension region):
  - DB format version
  - feature flags (compression enabled capability)
  - default codec id
- Page-level format metadata:
  - page format version marker (for mixed old/new page parsing)
  - optional default codec hint (for diagnostics and future tuning)
- Slot payload metadata for compressed entries (fixed header prefix in stored payload):
  - codec id
  - original length
  - compressed length
  - checksum (CRC32 of compressed payload bytes)
This avoids breaking old uncompressed pages while still satisfying “readers know how to decode” and checksum requirements.
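A minimal sketch of the fixed slot-payload header, assuming a 13-byte little-endian layout and the `Crc32` type from the `System.IO.Hashing` NuGet package; the real on-disk layout may differ:

```csharp
using System;
using System.Buffers.Binary;
using System.IO.Hashing;   // Crc32 lives in the System.IO.Hashing package

public readonly struct CompressedPayloadHeader
{
    public const int Size = 13; // 1 (codec id) + 4 + 4 + 4 bytes

    public byte CodecId { get; init; }
    public int OriginalLength { get; init; }
    public int CompressedLength { get; init; }
    public uint Checksum { get; init; }

    public void Write(Span<byte> dest)
    {
        dest[0] = CodecId;
        BinaryPrimitives.WriteInt32LittleEndian(dest.Slice(1), OriginalLength);
        BinaryPrimitives.WriteInt32LittleEndian(dest.Slice(5), CompressedLength);
        BinaryPrimitives.WriteUInt32LittleEndian(dest.Slice(9), Checksum);
    }

    public static CompressedPayloadHeader Read(ReadOnlySpan<byte> src) => new()
    {
        CodecId = src[0],
        OriginalLength = BinaryPrimitives.ReadInt32LittleEndian(src.Slice(1)),
        CompressedLength = BinaryPrimitives.ReadInt32LittleEndian(src.Slice(5)),
        Checksum = BinaryPrimitives.ReadUInt32LittleEndian(src.Slice(9)),
    };

    // Per the plan, the checksum covers the compressed payload bytes.
    public static uint ComputeChecksum(ReadOnlySpan<byte> compressed) =>
        Crc32.HashToUInt32(compressed);
}
```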
3. Workstreams
3.1 Workstream A: Compression Core + Config Surface
Deliverables
- New compression options model and codec abstraction.
- Persistent file/page format metadata support.
- Telemetry primitives for compression counters.
Changes
- Add `src/CBDD.Core/Compression/`:
  - `CompressionOptions.cs`
  - `CompressionCodec.cs`
  - `ICompressionCodec.cs`
  - `CompressionService.cs`
  - `CompressedPayloadHeader.cs`
  - `CompressionTelemetry.cs`
- Extend context/engine construction:
  - `src/CBDD.Core/DocumentDbContext.cs`
  - `src/CBDD.Core/Storage/StorageEngine.cs`
- Add DB metadata read/write helpers:
  - `src/CBDD.Core/Storage/PageFile.cs`
  - new `src/CBDD.Core/Storage/StorageEngine.Format.cs`
Implementation tasks
- Introduce `CompressionOptions` with defaults:
  - compression disabled by default
  - conservative thresholds (`MinSizeBytes`, `MinSavingsPercent`)
- Add codec registry/factory.
- Add DB format metadata block in page 0 extension with version + feature flags + default codec id.
- Add page format marker to slotted pages on allocation path (new pages only).
- Add telemetry counter container (thread-safe atomic counters).
Acceptance
- Existing DB files open unchanged.
- New DB files persist format metadata.
- Compression service can roundtrip payloads with selected codec.
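A minimal sketch of one built-in codec behind the abstraction; the `ICompressionCodec` shape shown here is an assumption, and the Brotli work uses `BrotliStream` from `System.IO.Compression`:

```csharp
using System.IO;
using System.IO.Compression;

// Assumed codec abstraction; the real interface in ICompressionCodec.cs may differ.
public interface ICompressionCodec
{
    byte Id { get; }
    byte[] Compress(byte[] data, CompressionLevel level);
    byte[] Decompress(byte[] data, int originalLength);
}

public sealed class BrotliCodec : ICompressionCodec
{
    public byte Id => 1;

    public byte[] Compress(byte[] data, CompressionLevel level)
    {
        using var output = new MemoryStream();
        using (var brotli = new BrotliStream(output, level, leaveOpen: true))
            brotli.Write(data, 0, data.Length);
        return output.ToArray();
    }

    public byte[] Decompress(byte[] data, int originalLength)
    {
        // originalLength comes from the payload header, so we can allocate once.
        var result = new byte[originalLength];
        using var brotli = new BrotliStream(new MemoryStream(data), CompressionMode.Decompress);
        int read, total = 0;
        while (total < originalLength &&
               (read = brotli.Read(result, total, originalLength - total)) > 0)
            total += read;
        if (total != originalLength)
            throw new InvalidDataException("Decompressed length mismatch.");
        return result;
    }
}
```

The acceptance roundtrip then reduces to asserting that `Decompress(Compress(x, level), x.Length)` returns the original bytes.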
3.2 Workstream B: Insert + Read Paths (new writes first)
Deliverables
- Compression on insert with threshold logic.
- Safe decompression on reads with checksum and size validation.
- Fallback to uncompressed write on compression failure.
Changes
- `src/CBDD.Core/Collections/DocumentCollection.cs`:
  - `InsertDataCore`
  - `InsertIntoPage`
  - `InsertWithOverflow`
  - `FindByLocation`
- `src/CBDD.Core/Storage/SlottedPage.cs` (if slot/page metadata helpers are added)
Implementation tasks
- Insert path:
  - Serialize BSON (existing behavior).
  - If compression is enabled and `docData.Length >= MinSizeBytes`, try the codec.
  - Compute savings and only set `SlotFlags.Compressed` if the threshold is met.
  - Build the compressed payload as `[CompressedPayloadHeader][compressed bytes]`.
  - On any compression exception, increment the failure counter and store uncompressed.
- Overflow path:
  - Apply overflow based on stored-bytes length (compressed or uncompressed).
  - No separate compression of chunks/pages.
- Read path (decoding guards sketched after this list):
  - Existing overflow reassembly first.
  - If the `Compressed` flag is present:
    - parse the payload header
    - validate compressed length + original length bounds
    - validate checksum before decompression
    - decompress using codec id
    - enforce `MaxDecompressedSizeBytes`
- Corruption handling:
  - throw a deterministic `InvalidDataException` for header/checksum/size violations.
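A minimal sketch of the read-side guards, reusing the header and codec shapes sketched earlier; the deterministic failure mode is `InvalidDataException` throughout:

```csharp
using System;
using System.IO;
using System.IO.Hashing;

public static class ReadPathSketch
{
    public static byte[] DecodeStoredBlob(
        byte[] storedBlob, ICompressionCodec codec, long maxDecompressedSizeBytes)
    {
        if (storedBlob.Length < CompressedPayloadHeader.Size)
            throw new InvalidDataException("Stored payload shorter than header.");

        var header = CompressedPayloadHeader.Read(storedBlob);

        // Length bounds: the header must agree with what is physically stored,
        // and the declared original size must respect the guardrail.
        if (header.CompressedLength != storedBlob.Length - CompressedPayloadHeader.Size)
            throw new InvalidDataException("Compressed length mismatch.");
        if (header.OriginalLength < 0 || header.OriginalLength > maxDecompressedSizeBytes)
            throw new InvalidDataException("Original length exceeds safety limit.");

        // Checksum is validated before any decompression work happens.
        var compressed = storedBlob.AsSpan(CompressedPayloadHeader.Size).ToArray();
        if (Crc32.HashToUInt32(compressed) != header.Checksum)
            throw new InvalidDataException("Checksum mismatch on compressed payload.");

        if (codec.Id != header.CodecId)
            throw new InvalidDataException("Unknown or mismatched codec id.");

        return codec.Decompress(compressed, header.OriginalLength);
    }
}
```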
Acceptance
- Inserts/reads succeed for uncompressed docs (regression-safe).
- Mixed compressed/uncompressed documents in same collection read correctly.
- Corrupted compressed payload is detected and rejected predictably.
3.3 Workstream C: Update/Delete + Overflow Consistency
Deliverables
- Compression-aware update decisions.
- Correct delete behavior for compressed+overflow combinations.
Changes
- `src/CBDD.Core/Collections/DocumentCollection.cs`:
  - `UpdateDataCore`
  - `DeleteCore`
  - `FreeOverflowChain`
Implementation tasks
- Update path:
  - Recompute the storage payload for the new document using the same compression decision logic as insert.
  - In-place update only when (decision sketched after this list):
    - existing slot is non-overflow, and
    - new stored payload length <= old slot length, and
    - compression flag/metadata can be updated safely.
  - Otherwise relocate (existing delete+insert strategy).
- Delete path:
  - Keep logical semantics unchanged.
  - Ensure overflow chain extraction still works when a slot has both `Compressed` and `HasOverflow`.
- Overflow consistency tests:
  - compressed small -> compressed overflow transitions on update
  - compressed overflow -> uncompressed small transitions
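A minimal sketch of the in-place vs relocate decision referenced above; the slot shape is an illustrative stand-in for CBDD's real structures:

```csharp
public static class UpdatePathSketch
{
    // Illustrative slot summary; the real slot directory entry may differ.
    public record struct SlotInfo(int Length, bool HasOverflow, bool Compressed);

    public enum UpdateStrategy { InPlace, Relocate }

    public static UpdateStrategy Decide(SlotInfo existing, int newStoredLength)
    {
        // In-place only when the existing slot is non-overflow and the new
        // stored payload (after the compression decision) fits in the old slot.
        // Flag/metadata rewrites are assumed safe for non-overflow slots here.
        if (!existing.HasOverflow && newStoredLength <= existing.Length)
            return UpdateStrategy.InPlace;

        // Otherwise fall back to the existing delete+insert relocation path.
        return UpdateStrategy.Relocate;
    }
}
```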
Acceptance
- Update behavior preserves correctness for all compression/overflow combinations.
- Delete frees overflow pages for compressed and uncompressed overflow docs.
3.4 Workstream D: Compaction / Shrink (Offline first)
Deliverables
- Public `Compact`/`Vacuum` maintenance API.
- Offline copy-and-swap compaction with crash-safe finalize.
- Exact pre/post space accounting.
API surface
- Add to `DocumentDbContext`:
  - `Compact(CompactionOptions? options = null)`
  - `CompactAsync(...)`
  - alias `Vacuum(...)`
- Engine-level operation in a new file: `src/CBDD.Core/Storage/StorageEngine.Maintenance.cs`
Offline mode algorithm (Phase 1)
- Acquire exclusive maintenance gate (block writes).
- Checkpoint WAL before start.
- Build a temporary database file (`*.compact.tmp`) with the same page config and compression config.
- Copy logical contents collection-by-collection:
  - preserve collection metadata/index definitions
  - reinsert documents through collection APIs so locations are rewritten correctly
  - rebuild/update index roots in metadata
- Checkpoint temp DB and fsync.
- Atomic finalize (copy-and-swap):
  - write state marker (`*.compact.state`) for resumability
  - rename original -> backup, temp -> original
  - reset/remove WAL appropriately
  - remove marker
- Produce `CompactionStats` with exact pre/post bytes and deltas.
Crash safety
- Use an explicit state-machine marker file with phases (`Started`, `Copied`, `Swapped`, `CleanupDone`).
- On startup, detect the marker and resume/repair idempotently (see the sketch below).
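A minimal sketch of the idempotent resume, assuming the phase is persisted as text in the `*.compact.state` marker; the temp/backup file names and cleanup details are illustrative, not the final scheme:

```csharp
using System;
using System.IO;

public enum CompactionPhase { Started, Copied, Swapped, CleanupDone }

public static class CompactionRecoverySketch
{
    public static void ResumeIfNeeded(string dbPath)
    {
        string marker = dbPath + ".compact.state";
        if (!File.Exists(marker)) return;            // no compaction in flight

        var phase = Enum.Parse<CompactionPhase>(File.ReadAllText(marker).Trim());
        switch (phase)
        {
            case CompactionPhase.Started:
            case CompactionPhase.Copied:
                // Swap never happened: the original file is authoritative,
                // so discard the temp copy. (File.Delete is a no-op if absent.)
                File.Delete(dbPath + ".compact.tmp");
                break;
            case CompactionPhase.Swapped:
                // Swap completed: the new file is authoritative; finish the
                // cleanup (backup removal, WAL reset) the happy path would do.
                File.Delete(dbPath + ".bak");        // assumed backup name
                break;
            case CompactionPhase.CleanupDone:
                break;                               // only the marker remains
        }
        File.Delete(marker);                         // idempotent: safe to re-run
    }
}
```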
Acceptance
- File shrinks when free tail pages exist.
- No data/index loss after compaction.
- Crash during compaction is recoverable and deterministic.
3.5 Workstream E: Online Compaction + Scheduling
Deliverables
- Online mode with throttled relocation.
- Scheduling hooks (manual/startup/threshold-based trigger).
Online mode strategy (Phase 2)
- Background scanner identifies fragmented pages and relocation candidates.
- Move documents in bounded batches under short write exclusion windows.
- Update primary and secondary index locations transactionally.
- Periodic checkpoints to bound WAL growth.
- Tail truncation pass when contiguous free pages reach EOF.
Scheduling hooks
- `MaintenanceOptions` (sketched below):
  - `RunAtStartup`
  - `MinFragmentationPercent`
  - `MinReclaimableBytes`
  - `MaxRunDuration`
  - `OnlineThrottle` (ops/sec or pages/batch)
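A minimal sketch of `MaintenanceOptions`, with the throttle expressed as pages-per-batch plus an inter-batch delay (an assumption; an ops/sec knob would work equally well):

```csharp
using System;

public sealed class MaintenanceOptions
{
    public bool RunAtStartup { get; init; } = false;

    // Threshold triggers: only run when compaction is actually worthwhile.
    public double MinFragmentationPercent { get; init; } = 25.0;
    public long MinReclaimableBytes { get; init; } = 16L * 1024 * 1024;

    // Bound the maintenance window.
    public TimeSpan MaxRunDuration { get; init; } = TimeSpan.FromMinutes(10);

    // Online throttle: bounded batches under short write-exclusion windows.
    public int PagesPerBatch { get; init; } = 64;
    public TimeSpan DelayBetweenBatches { get; init; } = TimeSpan.FromMilliseconds(50);
}
```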
Acceptance
- Writes continue during online mode except small lock windows.
- Recovery semantics remain valid with WAL and checkpoints.
4. Compaction Internals Required by Both Modes
4.1 Page defragmentation utilities
- Add slotted-page defrag helper (see the sketch below):
  - rewrites active slots/data densely
  - recomputes `FreeSpaceStart`/`End`
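A minimal, self-contained sketch of the dense-rewrite idea under an assumed slot model (offset/length entries with data packed from the page end); the real `SlottedPage` layout may differ:

```csharp
using System;
using System.Collections.Generic;

public static class PageDefragSketch
{
    // Illustrative slot entry; the real slot directory format may differ.
    public record struct Slot(int Offset, int Length, bool Deleted);

    // Rewrites live slot data densely and returns the new free-space end.
    public static int Defragment(byte[] page, List<Slot> slots, int pageSize)
    {
        // Copy each live slot's bytes into a scratch buffer back-to-back,
        // packing from the end of the page (typical slotted-page layout);
        // the scratch buffer avoids overlapping-copy hazards.
        var scratch = new byte[pageSize];
        int writePos = pageSize;
        for (int i = 0; i < slots.Count; i++)
        {
            if (slots[i].Deleted) continue;
            writePos -= slots[i].Length;
            Buffer.BlockCopy(page, slots[i].Offset, scratch, writePos, slots[i].Length);
            slots[i] = slots[i] with { Offset = writePos };   // fix slot pointer
        }
        Buffer.BlockCopy(scratch, writePos, page, writePos, pageSize - writePos);

        // Everything between the slot directory and writePos is now free space.
        return writePos;
    }
}
```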
4.2 Free-list consolidation + tail truncation
- Extend `PageFile` with:
  - free-page enumeration
  - free-list normalization
  - safe truncation when free pages are contiguous at end-of-file (see the sketch below)
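A minimal sketch of the tail-truncation pass, assuming illustrative `PageFile` internals (page size, page count, a free-page set); `FileStream.SetLength` does the physical shrink:

```csharp
using System.Collections.Generic;
using System.IO;

public static class TailTruncationSketch
{
    public static void TruncateFreeTail(
        FileStream file, int pageSize, ref long pageCount, HashSet<long> freePages)
    {
        // Walk backwards from EOF while the last page is on the free list.
        long newCount = pageCount;
        while (newCount > 0 && freePages.Contains(newCount - 1))
        {
            freePages.Remove(newCount - 1);   // page ceases to exist entirely
            newCount--;
        }
        if (newCount == pageCount) return;    // nothing contiguous to reclaim

        file.SetLength(newCount * (long)pageSize);  // physically shrink the file
        file.Flush(flushToDisk: true);              // durability before metadata update
        pageCount = newCount;
    }
}
```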
4.3 Metadata/index pointer correctness
- Ensure collection metadata root IDs and index root IDs are rewritten/verified after relocation/copy.
- Add validation pass that checks all primary index locations resolve to non-deleted slots.
4.4 WAL coordination
- Explicit checkpoint before and after compaction.
- Ensure compaction writes follow normal WAL durability semantics.
- Keep `Recover()` behavior valid with compaction marker states.
5. Compatibility and Migration
5.1 Compatibility goals
- Read old uncompressed files unchanged.
- Support mixed pages/documents (compressed + uncompressed) in same DB.
- Preserve existing APIs unless explicitly extended.
5.2 Migration tool (optional one-time rewrite)
- Implement `MigrateCompression(...)` as an offline rewrite command using the same copy-and-swap machinery.
- Options:
- target codec/level
- per-collection include/exclude
- dry-run estimation mode
6. Telemetry + Admin Tooling
6.1 Compression telemetry counters
- compressed document count
- bytes before/after
- compression ratio aggregate
- compression CPU time
- decompression CPU time
- compression failure count
- checksum failure count
- safety-limit rejection count
Expose via:
- `StorageEngine.GetCompressionStats()`
- a context-level forwarding method in `DocumentDbContext`
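A minimal sketch of the thread-safe counter container from Workstream A, covering a subset of the counters above (CPU-time counters omitted for brevity); names are illustrative:

```csharp
using System.Threading;

public sealed class CompressionTelemetry
{
    private long _compressedDocuments;
    private long _bytesBefore;
    private long _bytesAfter;
    private long _compressionFailures;
    private long _checksumFailures;
    private long _safetyLimitRejections;

    public void RecordCompressed(int originalLength, int compressedLength)
    {
        Interlocked.Increment(ref _compressedDocuments);
        Interlocked.Add(ref _bytesBefore, originalLength);
        Interlocked.Add(ref _bytesAfter, compressedLength);
    }

    public void RecordCompressionFailure() => Interlocked.Increment(ref _compressionFailures);
    public void RecordChecksumFailure() => Interlocked.Increment(ref _checksumFailures);
    public void RecordSafetyRejection() => Interlocked.Increment(ref _safetyLimitRejections);

    // Immutable snapshot for GetCompressionStats(); ratio is after/before bytes.
    public (long Docs, long Before, long After, double Ratio) Snapshot()
    {
        long docs = Interlocked.Read(ref _compressedDocuments);
        long before = Interlocked.Read(ref _bytesBefore);
        long after = Interlocked.Read(ref _bytesAfter);
        return (docs, before, after, before == 0 ? 1.0 : (double)after / before);
    }
}
```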
6.2 Compaction telemetry/stats
- pre/post file size
- live bytes
- free bytes
- fragmentation percentage
- documents/pages relocated
- runtime and throughput
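A minimal sketch of a `CompactionStats` shape carrying these fields; the record actually produced by `Compact()` may differ:

```csharp
using System;

public sealed record CompactionStats(
    long PreFileSizeBytes,
    long PostFileSizeBytes,
    long LiveBytes,
    long FreeBytes,
    double FragmentationPercent,
    long DocumentsRelocated,
    long PagesRelocated,
    TimeSpan Runtime)
{
    // Derived reporting values for the exact pre/post accounting.
    public long ReclaimedBytes => PreFileSizeBytes - PostFileSizeBytes;
    public double PagesPerSecond =>
        Runtime.TotalSeconds > 0 ? PagesRelocated / Runtime.TotalSeconds : 0;
}
```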
6.3 Admin inspection APIs
Add diagnostics APIs (engine/context):
- page usage by collection/page type
- compression ratio by collection
- fragmentation map and free-list summary
7. Tests
Add focused test suites in tests/CBDD.Tests/:
CompressionInsertReadTests.cs
- threshold on/off behavior
- mixed compressed/uncompressed reads
- fallback to uncompressed on forced codec error
CompressionOverflowTests.cs
- compressed docs that span overflow pages
- transitions across size thresholds
CompressionCorruptionTests.cs
- bad checksum
- bad original length
- oversized decompression guardrail
- invalid codec id
CompressionCompatibilityTests.cs
- open existing uncompressed DB files
- mixed-format pages after partial migration
CompactionOfflineTests.cs
- logical equivalence pre/post compact
- index correctness pre/post compact
- tail truncation actually reduces file size
CompactionCrashRecoveryTests.cs
- simulate crashes at each copy/swap state
- resume/finalize behavior
CompactionOnlineConcurrencyTests.cs
- concurrent writes + reads during online compact
- correctness and no deadlock
CompactionWalCoordinationTests.cs
- checkpoint before/after behavior
- recoverability with in-flight WAL entries
Also update/extend existing tests:
- `tests/CBDD.Tests/DocumentOverflowTests.cs`
- `tests/CBDD.Tests/DbContextTests.cs`
- `tests/CBDD.Tests/WalIndexTests.cs`
8. Benchmark Additions
Extend tests/CBDD.Tests.Benchmark/ with:
CompressionBenchmarks.cs
- insert/update/read workloads with compression on/off
- codec and level comparison
CompactionBenchmarks.cs
- offline compact runtime
- reclaimable bytes vs elapsed
MixedWorkloadBenchmarks.cs
- insert/update/delete-heavy cache workload
- periodic compact impact
- Update `DatabaseSizeBenchmark.cs`:
  - pre/post compact shrink delta reporting
  - compression ratio reporting
9. Suggested Implementation Order (Execution Plan)
Phase 1 (as requested): Compression config + read/write path for new writes only
- Workstream A
- Workstream B (insert/read only)
- initial tests: insert/read + compatibility
Phase 2 (as requested): Compression-aware updates/deletes + overflow handling
- Workstream C
- overflow-focused tests + corruption guards
Phase 3 (as requested): Offline copy-and-swap compaction/shrink
- Workstream D + shared internals from section 4
- crash-safe finalize + space accounting
Phase 4 (as requested): Online compaction + automation hooks
- Workstream E
- concurrency and scheduling tests
10. Subagent Execution Safety + Completion Verification
10.1 Subagent ownership model
Use explicit, non-overlapping ownership to avoid unsafe parallel edits:
- Subagent A (Compression Core)
  - Owns `src/CBDD.Core/Compression/*`
  - Owns format/config plumbing in `src/CBDD.Core/Storage/StorageEngine.Format.cs`, `src/CBDD.Core/Storage/StorageEngine.cs`, `src/CBDD.Core/DocumentDbContext.cs`
- Subagent B (CRUD + Overflow Compression Semantics)
  - Owns compression-aware CRUD changes in `src/CBDD.Core/Collections/DocumentCollection.cs`
  - Owns slot/payload metadata helpers in `src/CBDD.Core/Storage/SlottedPage.cs` (if needed)
- Subagent C (Compaction/Vacuum Engine)
  - Owns `src/CBDD.Core/Storage/StorageEngine.Maintenance.cs`
  - Owns related `PageFile` extensions in `src/CBDD.Core/Storage/PageFile.cs`
- Subagent D (Verification Assets)
  - Owns new/updated tests in `tests/CBDD.Tests/*Compression*`, `tests/CBDD.Tests/*Compaction*`
  - Owns benchmark additions in `tests/CBDD.Tests.Benchmark/*`
Rule: one file has exactly one active subagent owner at a time.
10.2 Safe collaboration rules for subagents
- Do not edit files outside assigned ownership scope.
- Do not revert or reformat unrelated existing changes.
- Do not change public contracts outside assigned workstream without explicit handoff.
- If overlap is discovered mid-task, stop and reassign ownership before continuing.
- Keep changes atomic and reviewable (small PR-sized batches per workstream phase).
10.3 Required handoff payload from each subagent
Each completion handoff MUST include:
- Summary of implemented requirements and non-implemented items.
- Exact touched file list.
- Risk list (behavioral, compatibility, recovery, perf).
- Test commands executed and pass/fail results.
- Mapping to plan acceptance criteria sections.
10.4 Mandatory verification on subagent completion
Every subagent completion is verified before merge/mark-done:
- Scope verification
  - Confirm touched files are in owned scope only.
  - Confirm required plan items for that phase are implemented.
- Correctness verification
  - Run targeted tests for touched behavior.
  - Run related regression suites (CRUD, overflow, WAL/recovery, index consistency where applicable).
- Safety verification
  - Validate corruption/safety guard behavior (checksum, size limits, crash-state handling where applicable).
  - Validate backward compatibility with old uncompressed files when relevant.
- Performance verification
  - Run benchmark smoke checks for modified hot paths.
  - Verify no obvious regressions against baseline thresholds.
- Integration verification
  - Rebuild the solution and run a full test pass before final phase closure.
10.5 Verification commands (minimum gate)
Run these gates after each subagent completion (adjust filters to scope):
- `dotnet build /Users/dohertj2/Desktop/CBDD/CBDD.slnx`
- `dotnet test /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests/ZB.MOM.WW.CBDD.Tests.csproj`
- Targeted suites for the changed area, for example:
  - `dotnet test /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests/ZB.MOM.WW.CBDD.Tests.csproj --filter "FullyQualifiedName~Compression"`
  - `dotnet test /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests/ZB.MOM.WW.CBDD.Tests.csproj --filter "FullyQualifiedName~Compaction"`
- Benchmark smoke run when hot paths changed:
  - `dotnet run -c Release --project /Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests.Benchmark/ZB.MOM.WW.CBDD.Tests.Benchmark.csproj`
If any gate fails, the subagent task is not complete and must be returned for rework with failure notes.
11. Definition of Done (Release Gates)
- Correctness
  - all new compression/compaction test suites green
  - no regressions in the existing test suite
- Compatibility
  - old DB files readable with no migration required
  - mixed-format operation validated
- Safety
  - decompression guardrails enforced
  - corruption checks and deterministic failure behavior
  - crash recovery scenarios covered
- Performance
  - documented benchmark deltas for write/read overhead and compaction throughput
  - no pathological GC spikes under compression-enabled workloads
- Operability
  - telemetry counters exposed
  - admin diagnostics available for page usage/compression/fragmentation