jdescopingtool/PLANS/2026-01-06-protobuf-cache-conversion-design.md
Joseph Doherty 8ce9a7dae1 docs: switch cache conversion design from MessagePack to protobuf-net-data
protobuf-net-data is purpose-built for IDataReader serialization and
returns IDataReader directly from Deserialize(), eliminating the need
for custom streaming reader implementations.
2026-01-06 14:15:19 -05:00


# Protobuf Cache Conversion Design
## Purpose
Convert the development cache files in `CACHED_DB_FILES/` from zstd-compressed JSON (`.json.zstd`) to zstd-compressed Protocol Buffers (`.pb.zstd`) using protobuf-net-data for faster deserialization and smaller file sizes.
## Goals
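1. **Faster deserialization** - Protobuf's binary encoding avoids the text tokenizing and number parsing that JSON requires
2. **Smaller file sizes** - Protobuf does not repeat column names in every row, so files are more compact than JSON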
3. **Simpler code** - protobuf-net-data returns `IDataReader` directly, no custom reader needed
## Current State
- 22 cache files in `CACHED_DB_FILES/` totaling ~3.6 GB (zstd-compressed JSON)
- `JsonZstdFileSource` reads files using ZstdSharp + Utf8JsonReader
- Custom `Utf8JsonStreamingDataReader` implements `IDataReader` for streaming
- Each `*DevEtl.cs` defines a schema and creates a pipeline from JSON files
## Design Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Serialization library | protobuf-net-data | Purpose-built for IDataReader, returns IDataReader directly |
| Conversion approach | One-time manual | Cache files are static snapshots, not actively regenerated |
| Compression | zstd on whole file | Consistent with current approach, excellent compression |
| Converter location | Standalone console app in `Tools/CacheConverter/` | Isolated utility, not part of main solution |
## File Format
**New extension:** `.pb.zstd`
**File naming:**
- `branch.json.zstd` → `branch.pb.zstd`
- `workordertime_curr.json.zstd` → `workordertime_curr.pb.zstd`
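
The renaming rule above amounts to swapping one suffix for another. A minimal sketch, assuming only the two suffixes named in this document; `CacheFileNames` and `ToProtobufName` are hypothetical names, not part of the existing codebase:

```csharp
using System;

public static class CacheFileNames
{
    private const string JsonSuffix = ".json.zstd";
    private const string ProtobufSuffix = ".pb.zstd";

    // Maps "branch.json.zstd" to "branch.pb.zstd".
    // Throws for any name that does not end in the JSON cache suffix.
    public static string ToProtobufName(string jsonCacheFileName)
    {
        if (!jsonCacheFileName.EndsWith(JsonSuffix, StringComparison.OrdinalIgnoreCase))
        {
            throw new ArgumentException(
                $"Expected a file name ending in {JsonSuffix}.", nameof(jsonCacheFileName));
        }

        string baseName = jsonCacheFileName.Substring(0, jsonCacheFileName.Length - JsonSuffix.Length);
        return baseName + ProtobufSuffix;
    }
}
```

Rejecting unexpected names, rather than passing them through, keeps a converter from silently overwriting or skipping files that are not JSON caches.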
**Data structure:** protobuf-net-data binary format
- Schema embedded in stream (column names, types, nullability)
- Rows serialized sequentially
- Native ADO.NET type support (DateTime, Guid, decimal, etc.)
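
To illustrate what "schema embedded in stream" means in practice, the sketch below round-trips a small in-memory table through protobuf-net-data and reads the column names and types back from the stream alone. The table name, column names, and values are hypothetical stand-ins, and the example assumes the `DataSerializer` API in the `ProtoBuf.Data` namespace behaves as described in this document:

```csharp
using System;
using System.Data;
using System.IO;
using ProtoBuf.Data;

public static class EmbeddedSchemaDemo
{
    public static void Main()
    {
        // Hypothetical two-column table standing in for a cache file's contents.
        var table = new DataTable("branch");
        table.Columns.Add("BranchId", typeof(int));
        table.Columns.Add("UpdatedAt", typeof(DateTime));
        table.Rows.Add(1, new DateTime(2026, 1, 6));

        using var buffer = new MemoryStream();
        using (IDataReader source = table.CreateDataReader())
        {
            // Writes the column metadata followed by the rows.
            DataSerializer.Serialize(buffer, source);
        }

        // Rewind and read back without supplying any schema.
        buffer.Position = 0;
        using IDataReader restored = DataSerializer.Deserialize(buffer);
        for (int i = 0; i < restored.FieldCount; i++)
        {
            Console.WriteLine($"{restored.GetName(i)}: {restored.GetFieldType(i).Name}");
        }
    }
}
```

Because the reader recovers the columns from the stream itself, the consumer does not need a separate schema definition such as `JsonColumnSchema[]`.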
**Libraries:**
- `protobuf-net-data` - IDataReader serialization/deserialization
- `ZstdSharp.Port` - compression
## Components
### 1. Converter Tool
**Location:** `/JdeScopingTool/Tools/CacheConverter/`
```
Tools/
└── CacheConverter/
    ├── CacheConverter.csproj
    └── Program.cs
```
**Dependencies:**
- `ZstdSharp.Port` - read zstd JSON, write zstd protobuf
- `protobuf-net-data` - protobuf serialization
**Behavior:**
1. Read each `.json.zstd` file from `CACHED_DB_FILES/`
2. Decompress and parse JSON into an `IDataReader`
3. Use `DataSerializer.Serialize(stream, reader)` to write protobuf
4. Compress with zstd and write to `.pb.zstd`
5. Print before/after sizes for comparison
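
The steps above can be sketched as a single per-file conversion routine. This is an outline under stated assumptions, not the converter's actual `Program.cs`: `CacheConverterSketch`, `ConvertFile`, and the `openJsonReader` hook are hypothetical, and the hook stands in for whatever JSON-to-`IDataReader` logic the tool uses (for example, the existing streaming reader).

```csharp
using System;
using System.Data;
using System.IO;
using ProtoBuf.Data;
using ZstdSharp;

public static class CacheConverterSketch
{
    // Converts one .json.zstd file to .pb.zstd and returns the sizes in bytes.
    // openJsonReader is a hypothetical hook that turns decompressed JSON into an IDataReader.
    public static (long Before, long After) ConvertFile(
        string inputPath,
        string outputPath,
        Func<Stream, IDataReader> openJsonReader)
    {
        using (var input = File.OpenRead(inputPath))
        using (var decompressed = new DecompressionStream(input))
        using (IDataReader reader = openJsonReader(decompressed))
        using (var output = File.Create(outputPath))
        using (var compressed = new CompressionStream(output))
        {
            // Rows are pulled from the JSON reader and written as protobuf,
            // passing through the zstd compressor on the way to disk.
            DataSerializer.Serialize(compressed, reader);
        }

        // Streams are disposed at this point, so both lengths are final.
        return (new FileInfo(inputPath).Length, new FileInfo(outputPath).Length);
    }
}
```

Chaining the streams this way keeps memory use bounded by the readers' buffers rather than by file size, which matters for multi-gigabyte inputs. The returned tuple supplies the before/after sizes for step 5.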
**Usage:**
```bash
cd Tools/CacheConverter
dotnet run -- ../../CACHED_DB_FILES
```
### 2. ProtobufZstdFileSource
**New file:** `NEW/src/JdeScoping.DataSync.Dev/Sources/ProtobufZstdFileSource.cs`
**Key simplification:** No custom `IDataReader` implementation needed!
```csharp
public sealed class ProtobufZstdFileSource : IImportSource
{
    public async Task<IDataReader> ReadDataAsync(CancellationToken ct = default)
    {
        _fileStream = new FileStream(_filePath, FileMode.Open, ...);
        _decompressionStream = new DecompressionStream(_fileStream);

        // protobuf-net-data returns IDataReader directly!
        return DataSerializer.Deserialize(_decompressionStream);
    }
}
```
**Package additions to `JdeScoping.DataSync.Dev.csproj`:**
- `protobuf-net-data`
### 3. DevEtl Class Updates
**Changes to each `*DevEtl.cs` file (22 files):**
1. Update `CacheFileName` constant:
```csharp
// Before
public static readonly string CacheFileName = "branch.json.zstd";
// After
public static readonly string CacheFileName = "branch.pb.zstd";
```
2. Update `Create()` method:
```csharp
// Before
.WithSource(new JsonZstdFileSource(cacheFilePath, Schema))
// After
.WithSource(new ProtobufZstdFileSource(cacheFilePath))
```
3. **Remove schema definitions** - protobuf-net-data embeds schema in the file, so `JsonColumnSchema[]` arrays are no longer needed in DevEtl classes.
**No changes to:**
- Pipeline structure
- `DevEtlRegistry.cs`
### 4. Cleanup (After Verification)
**Remove obsolete files:**
- `Sources/JsonZstdFileSource.cs`
- `Sources/JsonStreamingDataReader.cs`
- `Sources/Utf8JsonStreamingDataReader.cs`
- `Models/JsonColumnSchema.cs`
**Remove old cache files:**
- All `*.json.zstd` files in `CACHED_DB_FILES/`
## Code Simplification Summary
| Before (JSON) | After (Protobuf) |
|---------------|------------------|
| `JsonZstdFileSource` | `ProtobufZstdFileSource` |
| `Utf8JsonStreamingDataReader` (custom) | `DataSerializer.Deserialize()` (library) |
| `JsonStreamingDataReader` (legacy) | Removed |
| `JsonColumnSchema[]` per table | Not needed (embedded in file) |
## Test Strategy
1. Run converter tool, verify all 22 files convert without errors
2. Compare file sizes (expect 10-30% reduction)
3. Run existing `JdeScoping.DataSync.Dev.Tests` - all tests should pass unchanged
4. Verify data loaded matches previous JSON-based loads
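
Step 4 can be automated by walking the JSON-based and protobuf-based readers side by side. The sketch below uses only `System.Data`; `DataReaderComparer` and `AreEqual` are hypothetical names. It compares values with `object.Equals`, so a column that comes back as a different CLR type (for example `long` versus `int`) is reported as a difference, and `byte[]` columns would need a content comparison that this sketch does not include.

```csharp
using System;
using System.Data;

public static class DataReaderComparer
{
    // Returns true when both readers expose the same column names and the same rows in the same order.
    // On failure, difference describes the first mismatch found.
    public static bool AreEqual(IDataReader expected, IDataReader actual, out string difference)
    {
        difference = string.Empty;

        if (expected.FieldCount != actual.FieldCount)
        {
            difference = $"Column count differs: {expected.FieldCount} vs {actual.FieldCount}";
            return false;
        }

        for (int i = 0; i < expected.FieldCount; i++)
        {
            if (expected.GetName(i) != actual.GetName(i))
            {
                difference = $"Column {i} name differs: {expected.GetName(i)} vs {actual.GetName(i)}";
                return false;
            }
        }

        long row = 0;
        while (true)
        {
            bool hasExpected = expected.Read();
            bool hasActual = actual.Read();
            if (hasExpected != hasActual)
            {
                difference = $"Row count differs after {row} rows";
                return false;
            }
            if (!hasExpected)
            {
                return true; // Both readers are exhausted with no mismatch.
            }

            for (int i = 0; i < expected.FieldCount; i++)
            {
                // object.Equals treats DBNull and boxed value types uniformly.
                if (!object.Equals(expected.GetValue(i), actual.GetValue(i)))
                {
                    difference = $"Row {row}, column {expected.GetName(i)} differs";
                    return false;
                }
            }
            row++;
        }
    }
}
```

Reporting only the first mismatch keeps the check streaming, so it can run over the large cache files without buffering either side.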
## Files Changed
| File | Change |
|------|--------|
| `Tools/CacheConverter/` (new) | Standalone converter tool |
| `Sources/ProtobufZstdFileSource.cs` (new) | New protobuf reader (much simpler) |
| `JdeScoping.DataSync.Dev.csproj` | Add protobuf-net-data package |
| `*DevEtl.cs` (22 files) | Update file extension, source class, remove schema |
| `Sources/JsonZstdFileSource.cs` | Delete after migration |
| `Sources/JsonStreamingDataReader.cs` | Delete after migration |
| `Sources/Utf8JsonStreamingDataReader.cs` | Delete after migration |
| `Models/JsonColumnSchema.cs` | Delete after migration |
| `CACHED_DB_FILES/*.json.zstd` | Delete after verification |