docs: add MessagePack cache conversion design plan

Design for converting CACHED_DB_FILES from zstd-compressed JSON
to zstd-compressed MessagePack for faster deserialization and
smaller file sizes.
Joseph Doherty
2026-01-06 14:03:47 -05:00
parent 7b3592df96
commit 01da261d6c
@@ -0,0 +1,142 @@
# MessagePack Cache Conversion Design
## Purpose
Convert the development cache files in `CACHED_DB_FILES/` from zstd-compressed JSON (`.json.zstd`) to zstd-compressed MessagePack (`.msgpack.zstd`) for faster deserialization and smaller file sizes.
## Goals
1. **Faster deserialization** - MessagePack is a binary format with length-prefixed values, so it parses faster than text-based JSON
2. **Smaller file sizes** - MessagePack encodes values more compactly than JSON text
## Current State
- 22 cache files in `CACHED_DB_FILES/` totaling ~3.6 GB (zstd-compressed JSON)
- `JsonZstdFileSource` reads files using ZstdSharp + Utf8JsonReader
- Each `*DevEtl.cs` defines a schema and creates a pipeline from JSON files
- Tests verify ETL loads data from cache files into SQL Server
## Design Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Conversion approach | One-time manual | Cache files are static snapshots, not actively regenerated |
| Data structure | Map format (field names as keys) | Self-describing, maintainable, keys compress well with zstd |
| Compression | Keep zstd | Largest file is 878 MB; uncompressed MessagePack would be 2-4x larger |
| Converter location | Standalone console app in `Tools/CacheConverter/` | Isolated utility, not part of main solution |
## File Format
**New extension:** `.msgpack.zstd`
**File naming:**
- `branch.json.zstd` → `branch.msgpack.zstd`
- `workordertime_curr.json.zstd` → `workordertime_curr.msgpack.zstd`
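The renaming rule above can be sketched as a small helper. This is only an illustration: the class and method names (`CacheFileNames`, `ToMessagePackName`) are hypothetical and not part of the existing code base.

```csharp
using System;

public static class CacheFileNames
{
    private const string OldSuffix = ".json.zstd";
    private const string NewSuffix = ".msgpack.zstd";

    // Maps "branch.json.zstd" to "branch.msgpack.zstd" and rejects any
    // file name that does not end with the old two-part extension.
    public static string ToMessagePackName(string fileName)
    {
        if (!fileName.EndsWith(OldSuffix, StringComparison.OrdinalIgnoreCase))
        {
            throw new ArgumentException(
                $"Expected a {OldSuffix} file: {fileName}", nameof(fileName));
        }

        return fileName.Substring(0, fileName.Length - OldSuffix.Length) + NewSuffix;
    }
}
```

The suffix is replaced explicitly because `Path.ChangeExtension` only swaps the part after the last dot, which would turn `branch.json.zstd` into `branch.json.msgpack.zstd`.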
**Data structure:** Array of maps (same logical structure as JSON)
```
[
{ "Code": "ABC", "Description": "Branch ABC", "LastUpdateDT": <DateTime> },
{ "Code": "DEF", "Description": "Branch DEF", "LastUpdateDT": <DateTime> },
...
]
```
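To make the map format concrete, the sketch below hand-encodes a tiny row using two MessagePack wire types, `fixmap` and `fixstr`. It is an illustration of the byte layout only, not how the converter would serialize data (the `MessagePack` library does that), and the `MapFormatSketch` class is hypothetical. It handles string values only and ignores other types such as the `DateTime` field shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MapFormatSketch
{
    // Encodes a small string-to-string map with MessagePack's fixmap and
    // fixstr types. Limits of these compact forms: at most 15 entries,
    // and each string at most 31 UTF-8 bytes.
    public static byte[] EncodeSmallMap(IReadOnlyList<KeyValuePair<string, string>> row)
    {
        if (row.Count > 15)
        {
            throw new ArgumentException("fixmap holds at most 15 entries", nameof(row));
        }

        // fixmap header: high nibble 1000, low nibble is the entry count.
        var bytes = new List<byte> { (byte)(0x80 | row.Count) };
        foreach (KeyValuePair<string, string> pair in row)
        {
            AppendFixStr(bytes, pair.Key);
            AppendFixStr(bytes, pair.Value);
        }

        return bytes.ToArray();
    }

    private static void AppendFixStr(List<byte> bytes, string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        if (utf8.Length > 31)
        {
            throw new ArgumentException("fixstr holds at most 31 bytes", nameof(text));
        }

        // fixstr header: high bits 101, low five bits are the byte length.
        bytes.Add((byte)(0xA0 | utf8.Length));
        bytes.AddRange(utf8);
    }
}
```

Encoding the single pair `Code` = `ABC` this way takes 10 bytes (`81 A4 'Code' A3 'ABC'`), while the equivalent JSON text `{"Code":"ABC"}` takes 14. Field names are still repeated in every row, which is why the design relies on zstd to compress them.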
**Library:** MessagePack-CSharp (`MessagePack` NuGet package)
## Components
### 1. Converter Tool
**Location:** `/JdeScopingTool/Tools/CacheConverter/`
```
Tools/
└── CacheConverter/
    ├── CacheConverter.csproj
    └── Program.cs
```
**Dependencies:**
- `ZstdSharp.Port` - read zstd JSON, write zstd MessagePack
- `MessagePack` - MessagePack serialization
**Behavior:**
1. Read each `.json.zstd` file from `CACHED_DB_FILES/`
2. Stream JSON → deserialize to `Dictionary<string, object?>[]`
3. Serialize to MessagePack (map format) → compress with zstd
4. Write to `.msgpack.zstd` alongside originals
5. Print before/after sizes for comparison
**Usage:**
```bash
cd Tools/CacheConverter
dotnet run -- ../../CACHED_DB_FILES
```
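The behavior steps for the converter could be sketched roughly as follows. This is a minimal sketch under several assumptions, not the planned `Program.cs`: it assumes `System.Text.Json` for reading, the `CacheConverterSketch` class and its members are hypothetical, and all rows are held in memory before writing, which may not be acceptable for the largest files. It also leaves date strings as strings; turning them into `DateTime` values would need schema information that this sketch does not have.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using MessagePack;
using ZstdSharp;

public static class CacheConverterSketch
{
    private const string OldSuffix = ".json.zstd";
    private const string NewSuffix = ".msgpack.zstd";

    // System.Text.Json yields JsonElement values. Convert them to plain CLR
    // values so MessagePack writes strings, numbers, booleans, and nil.
    public static object? ToClrValue(JsonElement e) => e.ValueKind switch
    {
        JsonValueKind.String => e.GetString(),
        // The cast keeps integers as long instead of widening them to double.
        JsonValueKind.Number => e.TryGetInt64(out long l) ? (object)l : e.GetDouble(),
        JsonValueKind.True => true,
        JsonValueKind.False => false,
        JsonValueKind.Null => null,
        _ => e.GetRawText(), // Assumption: nested values are kept as JSON text.
    };

    public static async Task<(long Before, long After)> ConvertFileAsync(string jsonZstdPath)
    {
        string outPath =
            jsonZstdPath.Substring(0, jsonZstdPath.Length - OldSuffix.Length) + NewSuffix;
        var rows = new List<Dictionary<string, object?>>();

        using (FileStream input = File.OpenRead(jsonZstdPath))
        using (var decompressed = new DecompressionStream(input))
        {
            // Reads the elements of the root-level JSON array one at a time.
            await foreach (Dictionary<string, JsonElement>? row in
                JsonSerializer.DeserializeAsyncEnumerable<Dictionary<string, JsonElement>>(decompressed))
            {
                if (row is null) continue;
                rows.Add(row.ToDictionary(p => p.Key, p => ToClrValue(p.Value)));
            }
        }

        using (FileStream output = File.Create(outPath))
        using (var compressed = new CompressionStream(output, 3)) // zstd level 3
        {
            // Dictionaries serialize as MessagePack maps, giving an array of maps.
            MessagePackSerializer.Serialize(compressed, rows);
        }

        return (new FileInfo(jsonZstdPath).Length, new FileInfo(outPath).Length);
    }

    public static async Task Main(string[] args)
    {
        foreach (string path in Directory.EnumerateFiles(args[0], "*" + OldSuffix))
        {
            (long before, long after) = await ConvertFileAsync(path);
            Console.WriteLine($"{Path.GetFileName(path)}: {before:N0} -> {after:N0} bytes");
        }
    }
}
```

The explicit `JsonElement` conversion matters because a `Dictionary<string, object?>` produced by `System.Text.Json` holds `JsonElement` instances rather than strings and numbers, so passing it to MessagePack unchanged would not produce the intended map of primitive values.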
### 2. MessagePackZstdFileSource
**New file:** `NEW/src/JdeScoping.DataSync.Dev/Sources/MessagePackZstdFileSource.cs`
- Implements `IImportSource` (same interface as `JsonZstdFileSource`)
- Reads `.msgpack.zstd` files using streaming decompression
- Uses `MessagePackStreamReader` for efficient streaming deserialization
- Returns an `IDataReader` that yields rows one at a time
- Schema still needed for `IDataReader` field metadata (column names, types, ordinals)
**Package addition:** Add `MessagePack` to `JdeScoping.DataSync.Dev.csproj`
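The streaming read described for this component might look roughly like the sketch below. It is not the planned `MessagePackZstdFileSource`: the `MessagePackZstdReadSketch` class and both methods are hypothetical, rows are surfaced as an `IAsyncEnumerable` rather than through `IDataReader`, and the schema is reduced to an ordered list of column names.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading;
using MessagePack;
using ZstdSharp;

public static class MessagePackZstdReadSketch
{
    // Yields one row at a time from a .msgpack.zstd file that contains a
    // single top-level array of maps.
    public static async IAsyncEnumerable<Dictionary<string, object?>> ReadRowsAsync(
        string path,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        using FileStream file = File.OpenRead(path);
        // The reader disposes the decompression stream it wraps.
        using var reader = new MessagePackStreamReader(new DecompressionStream(file));

        // ReadArrayAsync hands back the raw bytes of each array element, so
        // only one row needs to be buffered and deserialized at a time.
        await foreach (ReadOnlySequence<byte> element in reader.ReadArrayAsync(cancellationToken))
        {
            ReadOnlySequence<byte> rowBytes = element;
            yield return MessagePackSerializer.Deserialize<Dictionary<string, object?>>(
                rowBytes, cancellationToken: cancellationToken);
        }
    }

    // Orders a row's values by schema ordinal, as an IDataReader would need.
    // Columns missing from the map come back as null.
    public static object?[] ToOrdinalValues(
        IReadOnlyDictionary<string, object?> row,
        IReadOnlyList<string> columnNames)
    {
        var values = new object?[columnNames.Count];
        for (int i = 0; i < columnNames.Count; i++)
        {
            values[i] = row.TryGetValue(columnNames[i], out object? value) ? value : null;
        }

        return values;
    }
}
```

Because the map format carries names but not declared column types, values deserialized as `object` may arrive in narrower CLR types than the schema declares (small integers, for example). A real reader would likely convert each value to the schema's column type, which is one reason the schema is still needed.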
### 3. DevEtl Class Updates
**Changes to each `*DevEtl.cs` file (22 files):**
1. Update `CacheFileName` constant:
```csharp
// Before
public static readonly string CacheFileName = "branch.json.zstd";
// After
public static readonly string CacheFileName = "branch.msgpack.zstd";
```
2. Update `Create()` method:
```csharp
// Before
.WithSource(new JsonZstdFileSource(cacheFilePath, Schema))
// After
.WithSource(new MessagePackZstdFileSource(cacheFilePath, Schema))
```
**No changes to:**
- Schema definitions (same column names and types)
- Pipeline structure
- `DevEtlRegistry.cs`
### 4. Cleanup (After Verification)
Remove obsolete JSON readers:
- `JsonZstdFileSource.cs`
- `JsonStreamingDataReader.cs`
- `Utf8JsonStreamingDataReader.cs`
Remove old cache files:
- All `*.json.zstd` files in `CACHED_DB_FILES/`
## Test Strategy
1. Run converter tool, verify all 22 files convert without errors
2. Compare file sizes (expect 10-30% reduction)
3. Run existing `JdeScoping.DataSync.Dev.Tests` - all tests should pass unchanged
4. Verify that the loaded data matches previous JSON-based loads (for example, identical row counts per table)
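The file size comparison in step 2 could be automated with a helper along these lines. The `SizeReport` class is hypothetical, and the sketch assumes the converted files sit alongside the originals as described for the converter tool.

```csharp
using System;
using System.IO;

public static class SizeReport
{
    // Percentage by which the new file is smaller than the old one.
    // A negative result means the new file is larger.
    public static double ReductionPercent(long beforeBytes, long afterBytes)
    {
        if (beforeBytes <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(beforeBytes));
        }

        return (beforeBytes - afterBytes) * 100.0 / beforeBytes;
    }

    // Prints one line per original cache file in the given directory.
    public static void Print(string cacheDirectory)
    {
        const string oldSuffix = ".json.zstd";
        foreach (string oldPath in Directory.EnumerateFiles(cacheDirectory, "*" + oldSuffix))
        {
            string newPath =
                oldPath.Substring(0, oldPath.Length - oldSuffix.Length) + ".msgpack.zstd";
            string name = Path.GetFileName(oldPath);
            if (!File.Exists(newPath))
            {
                Console.WriteLine($"{name}: not converted");
                continue;
            }

            long before = new FileInfo(oldPath).Length;
            long after = new FileInfo(newPath).Length;
            Console.WriteLine($"{name}: {ReductionPercent(before, after):F1}% smaller");
        }
    }
}
```

A file that shrinks from 1,000 bytes to 800 bytes reports a 20.0% reduction, which falls inside the expected 10-30% range.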
## Files Changed
| File | Change |
|------|--------|
| `Tools/CacheConverter/` (new) | Standalone converter tool |
| `Sources/MessagePackZstdFileSource.cs` (new) | New MessagePack reader |
| `JdeScoping.DataSync.Dev.csproj` | Add MessagePack package |
| `*DevEtl.cs` (22 files) | Update file extension and source class |
| `Sources/JsonZstdFileSource.cs` | Delete after migration |
| `Sources/JsonStreamingDataReader.cs` | Delete after migration |
| `Sources/Utf8JsonStreamingDataReader.cs` | Delete after migration |
| `CACHED_DB_FILES/*.json.zstd` | Delete after verification |