# Protobuf Cache Conversion Design

## Purpose

Convert the development cache files in `CACHED_DB_FILES/` from zstd-compressed JSON (`.json.zstd`) to zstd-compressed Protocol Buffers (`.pb.zstd`) using protobuf-net-data for faster deserialization and smaller file sizes.

## Goals

1. **Faster deserialization** - Protobuf is faster to parse than JSON
2. **Smaller file sizes** - Protobuf is more compact than JSON
3. **Simpler code** - protobuf-net-data returns `IDataReader` directly, no custom reader needed

## Current State

- 22 cache files in `CACHED_DB_FILES/` totaling ~3.6 GB (zstd-compressed JSON)
- `JsonZstdFileSource` reads files using ZstdSharp + Utf8JsonReader
- Custom `Utf8JsonStreamingDataReader` implements `IDataReader` for streaming
- Each `*DevEtl.cs` defines a schema and creates a pipeline from JSON files

## Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Serialization library | protobuf-net-data | Purpose-built for IDataReader, returns IDataReader directly |
| Conversion approach | One-time manual | Cache files are static snapshots, not actively regenerated |
| Compression | zstd on whole file | Consistent with current approach, excellent compression |
| Converter location | Standalone console app in `Tools/CacheConverter/` | Isolated utility, not part of main solution |

## File Format

**New extension:** `.pb.zstd`

**File naming:**
- `branch.json.zstd` → `branch.pb.zstd`
- `workordertime_curr.json.zstd` → `workordertime_curr.pb.zstd`

**Data structure:** protobuf-net-data binary format
- Schema embedded in stream (column names, types, nullability)
- Rows serialized sequentially
- Native ADO.NET type support (DateTime, Guid, decimal, etc.)

**Libraries:**
- `protobuf-net-data` - IDataReader serialization/deserialization
- `ZstdSharp.Port` - compression

## Components

### 1. Converter Tool

**Location:** `/JdeScopingTool/Tools/CacheConverter/`

```
Tools/
└── CacheConverter/
    ├── CacheConverter.csproj
    └── Program.cs
```

**Dependencies:**
- `ZstdSharp.Port` - read zstd JSON, write zstd protobuf
- `protobuf-net-data` - protobuf serialization

**Behavior:**
1. Read each `.json.zstd` file from `CACHED_DB_FILES/`
2. Decompress and parse JSON into an `IDataReader`
3. Use `DataSerializer.Serialize(stream, reader)` to write protobuf
4. Compress with zstd and write to `.pb.zstd`
5. Print before/after sizes for comparison

**Usage:**
```bash
cd Tools/CacheConverter
dotnet run -- ../../CACHED_DB_FILES
```

### 2. ProtobufZstdFileSource

**New file:** `NEW/src/JdeScoping.DataSync.Dev/Sources/ProtobufZstdFileSource.cs`

**Key simplification:** No custom `IDataReader` implementation needed!

```csharp
public sealed class ProtobufZstdFileSource : IImportSource
{
    public async Task<IDataReader> ReadDataAsync(CancellationToken ct = default)
    {
        // Both streams are kept in fields so they stay open while the reader is consumed.
        _fileStream = new FileStream(_filePath, FileMode.Open, ...);
        _decompressionStream = new DecompressionStream(_fileStream);

        // protobuf-net-data returns IDataReader directly!
        return DataSerializer.Deserialize(_decompressionStream);
    }
}
```

**Package additions to `JdeScoping.DataSync.Dev.csproj`:**
- `protobuf-net-data`

### 3. DevEtl Class Updates

**Changes to each `*DevEtl.cs` file (22 files):**

1. Update `CacheFileName` constant:
   ```csharp
   // Before
   public static readonly string CacheFileName = "branch.json.zstd";

   // After
   public static readonly string CacheFileName = "branch.pb.zstd";
   ```

2. Update `Create()` method:
   ```csharp
   // Before
   .WithSource(new JsonZstdFileSource(cacheFilePath, Schema))

   // After
   .WithSource(new ProtobufZstdFileSource(cacheFilePath))
   ```

3. **Remove schema definitions** - protobuf-net-data embeds schema in the file, so `JsonColumnSchema[]` arrays are no longer needed in DevEtl classes.

**No changes to:**
- Pipeline structure
- `DevEtlRegistry.cs`

### 4. Cleanup (After Verification)

**Remove obsolete files:**
- `Sources/JsonZstdFileSource.cs`
- `Sources/JsonStreamingDataReader.cs`
- `Sources/Utf8JsonStreamingDataReader.cs`
- `Models/JsonColumnSchema.cs`

**Remove old cache files:**
- All `*.json.zstd` files in `CACHED_DB_FILES/`

## Code Simplification Summary

| Before (JSON) | After (Protobuf) |
|---------------|------------------|
| `JsonZstdFileSource` | `ProtobufZstdFileSource` |
| `Utf8JsonStreamingDataReader` (custom) | `DataSerializer.Deserialize()` (library) |
| `JsonStreamingDataReader` (legacy) | Removed |
| `JsonColumnSchema[]` per table | Not needed (embedded in file) |

## Test Strategy

1. Run converter tool, verify all 22 files convert without errors
2. Compare file sizes (expect 10-30% reduction)
3. Run existing `JdeScoping.DataSync.Dev.Tests` - all tests should pass unchanged
4. Verify data loaded matches previous JSON-based loads

## Files Changed

| File | Change |
|------|--------|
| `Tools/CacheConverter/` (new) | Standalone converter tool |
| `Sources/ProtobufZstdFileSource.cs` (new) | New protobuf reader (much simpler) |
| `JdeScoping.DataSync.Dev.csproj` | Add protobuf-net-data package |
| `*DevEtl.cs` (22 files) | Update file extension, source class, remove schema |
| `Sources/JsonZstdFileSource.cs` | Delete after migration |
| `Sources/JsonStreamingDataReader.cs` | Delete after migration |
| `Sources/Utf8JsonStreamingDataReader.cs` | Delete after migration |
| `Models/JsonColumnSchema.cs` | Delete after migration |
| `CACHED_DB_FILES/*.json.zstd` | Delete after verification |
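
## Appendix: Verification Sketch

Step 4 of the test strategy (checking that protobuf-based loads match the previous JSON-based loads) could be done with a row-by-row comparison of two `IDataReader` instances. The following is a minimal sketch, not part of the existing codebase: `ReaderComparer` and `FirstDifference` are hypothetical names, and only `System.Data` interfaces are used. It assumes both readers return columns in the same order.

```csharp
using System;
using System.Data;

public static class ReaderComparer
{
    // Hypothetical helper: returns null when both readers expose the same
    // column names and identical row values, otherwise a short description
    // of the first difference found.
    public static string? FirstDifference(IDataReader expected, IDataReader actual)
    {
        if (expected.FieldCount != actual.FieldCount)
            return $"Column count: {expected.FieldCount} vs {actual.FieldCount}";

        for (int i = 0; i < expected.FieldCount; i++)
        {
            if (!string.Equals(expected.GetName(i), actual.GetName(i), StringComparison.Ordinal))
                return $"Column {i} name: {expected.GetName(i)} vs {actual.GetName(i)}";
        }

        long row = 0;
        while (true)
        {
            bool hasExpected = expected.Read();
            bool hasActual = actual.Read();
            if (hasExpected != hasActual)
                return $"Row count differs after {row} rows";
            if (!hasExpected)
                return null; // both readers exhausted with no differences

            for (int i = 0; i < expected.FieldCount; i++)
            {
                // Strict comparison: DBNull equals DBNull, but boxed values of
                // different CLR types (e.g. int vs long) and byte[] columns
                // compare as unequal and would need custom handling.
                if (!object.Equals(expected.GetValue(i), actual.GetValue(i)))
                    return $"Row {row}, column {expected.GetName(i)}: values differ";
            }
            row++;
        }
    }
}
```

In a test, `expected` would presumably come from the existing `JsonZstdFileSource` and `actual` from `ProtobufZstdFileSource` for the same table. If the JSON reader and protobuf-net-data surface different CLR types for the same column, the strict equality check above would need to be relaxed for those columns.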