jdescopingtool/PLANS/2026-01-06-protobuf-cache-conversion-design.md
Joseph Doherty 8ce9a7dae1 docs: switch cache conversion design from MessagePack to protobuf-net-data
protobuf-net-data is purpose-built for IDataReader serialization and
returns IDataReader directly from Deserialize(), eliminating the need
for custom streaming reader implementations.
2026-01-06 14:15:19 -05:00

Protobuf Cache Conversion Design

Purpose

Convert the development cache files in CACHED_DB_FILES/ from zstd-compressed JSON (.json.zstd) to zstd-compressed Protocol Buffers (.pb.zstd) using protobuf-net-data for faster deserialization and smaller file sizes.

Goals

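  1. Faster deserialization - Protobuf's binary encoding avoids the text tokenizing and number parsing that JSON requires
  2. Smaller file sizes - Protobuf writes numeric field tags instead of repeating property names and stores numbers in binary form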
  3. Simpler code - protobuf-net-data returns IDataReader directly, no custom reader needed

Current State

  • 22 cache files in CACHED_DB_FILES/ totaling ~3.6 GB (zstd-compressed JSON)
  • JsonZstdFileSource reads files using ZstdSharp + Utf8JsonReader
  • Custom Utf8JsonStreamingDataReader implements IDataReader for streaming
  • Each *DevEtl.cs defines a schema and creates a pipeline from JSON files

Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Serialization library | protobuf-net-data | Purpose-built for IDataReader, returns IDataReader directly |
| Conversion approach | One-time manual | Cache files are static snapshots, not actively regenerated |
| Compression | zstd on whole file | Consistent with current approach, excellent compression |
| Converter location | Standalone console app in Tools/CacheConverter/ | Isolated utility, not part of main solution |

File Format

New extension: .pb.zstd

File naming:

  • branch.json.zstd → branch.pb.zstd
  • workordertime_curr.json.zstd → workordertime_curr.pb.zstd

Data structure: protobuf-net-data binary format

  • Schema embedded in stream (column names, types, nullability)
  • Rows serialized sequentially
  • Native ADO.NET type support (DateTime, Guid, decimal, etc.)

Libraries:

  • protobuf-net-data - IDataReader serialization/deserialization
  • ZstdSharp.Port - compression
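
The format described above can be exercised with a small in-memory round trip. This is a minimal sketch, not the project's code: the class name PbZstdRoundTrip and its two methods are hypothetical, and it assumes the protobuf-net-data and ZstdSharp.Port packages are referenced. It only illustrates that the column names and types travel inside the stream, so the reading side needs no schema argument.

```csharp
using System.Data;
using System.IO;
using ProtoBuf.Data;   // NuGet: protobuf-net-data (DataSerializer)
using ZstdSharp;       // NuGet: ZstdSharp.Port (CompressionStream, DecompressionStream)

// Hypothetical helper, for illustration only.
public static class PbZstdRoundTrip
{
    // Serializes every row of the reader to protobuf and zstd-compresses the whole stream.
    public static byte[] Write(IDataReader reader)
    {
        using var buffer = new MemoryStream();
        using (var zstd = new CompressionStream(buffer))
        {
            DataSerializer.Serialize(zstd, reader);
        } // disposing the compression stream flushes the final zstd frame
        return buffer.ToArray();
    }

    // Reads the payload back; no schema is passed in, it comes from the stream.
    public static DataTable Read(byte[] payload)
    {
        using var zstd = new DecompressionStream(new MemoryStream(payload));
        using IDataReader reader = DataSerializer.Deserialize(zstd);
        var table = new DataTable();
        table.Load(reader);
        return table;
    }
}
```

For example, building a DataTable with an int column and a string column, passing table.CreateDataReader() to Write, and feeding the result to Read should give back a table with the same column names and row values.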

Components

1. Converter Tool

Location: /JdeScopingTool/Tools/CacheConverter/

Tools/
└── CacheConverter/
    ├── CacheConverter.csproj
    └── Program.cs

Dependencies:

  • ZstdSharp.Port - read zstd JSON, write zstd protobuf
  • protobuf-net-data - protobuf serialization

Behavior:

  1. Read each .json.zstd file from CACHED_DB_FILES/
  2. Decompress and parse JSON into an IDataReader
  3. Use DataSerializer.Serialize(stream, reader) to write protobuf
  4. Compress with zstd and write to .pb.zstd
  5. Print before/after sizes for comparison
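
The five steps above could be sketched roughly as follows. This is an illustration under assumptions rather than the actual Program.cs: CacheConverterSketch and ConvertAll are hypothetical names, and openJsonReader is a hypothetical hook standing in for whatever turns the decompressed JSON stream into an IDataReader (step 2), since that part depends on the project's existing JSON reader.

```csharp
using System;
using System.Data;
using System.IO;
using ProtoBuf.Data;   // NuGet: protobuf-net-data
using ZstdSharp;       // NuGet: ZstdSharp.Port

// Hypothetical converter core, for illustration only.
public static class CacheConverterSketch
{
    public static void ConvertAll(string cacheDir, Func<Stream, IDataReader> openJsonReader)
    {
        // Step 1: find every zstd-compressed JSON cache file.
        foreach (var jsonPath in Directory.GetFiles(cacheDir, "*.json.zstd"))
        {
            // branch.json.zstd becomes branch.pb.zstd
            var pbPath = jsonPath[..^".json.zstd".Length] + ".pb.zstd";

            using (var input = File.OpenRead(jsonPath))
            using (var json = new DecompressionStream(input))
            using (var reader = openJsonReader(json))          // step 2 (assumed hook)
            using (var output = File.Create(pbPath))
            using (var zstd = new CompressionStream(output))
            {
                // Steps 3-4: protobuf rows are written straight into the zstd stream.
                DataSerializer.Serialize(zstd, reader);
            }

            // Step 5: both files are closed at this point, so the sizes are final.
            Console.WriteLine(
                $"{Path.GetFileName(jsonPath)}: {new FileInfo(jsonPath).Length:N0} -> {new FileInfo(pbPath).Length:N0} bytes");
        }
    }
}
```

Writing through the compression stream, instead of buffering the protobuf output first, keeps memory use independent of file size. That matters here because the cache totals several gigabytes.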

Usage:

cd Tools/CacheConverter
dotnet run -- ../../CACHED_DB_FILES

2. ProtobufZstdFileSource

New file: NEW/src/JdeScoping.DataSync.Dev/Sources/ProtobufZstdFileSource.cs

Key simplification: No custom IDataReader implementation needed!

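using ProtoBuf.Data;   // DataSerializer
using ZstdSharp;       // DecompressionStream

public sealed class ProtobufZstdFileSource : IImportSource
{
    private readonly string _filePath;
    private FileStream? _fileStream;
    private DecompressionStream? _decompressionStream;

    public ProtobufZstdFileSource(string filePath) => _filePath = filePath;

    // Nothing is awaited here, so return a completed task instead of marking the method async.
    public Task<IDataReader> ReadDataAsync(CancellationToken ct = default)
    {
        _fileStream = new FileStream(_filePath, FileMode.Open, ...);
        _decompressionStream = new DecompressionStream(_fileStream);

        // protobuf-net-data returns IDataReader directly!
        return Task.FromResult(DataSerializer.Deserialize(_decompressionStream));
    }
}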

Package additions to JdeScoping.DataSync.Dev.csproj:

  • protobuf-net-data

3. DevEtl Class Updates

Changes to each *DevEtl.cs file (22 files):

  1. Update CacheFileName constant:

    // Before
    public static readonly string CacheFileName = "branch.json.zstd";
    // After
    public static readonly string CacheFileName = "branch.pb.zstd";
    
  2. Update Create() method:

    // Before
    .WithSource(new JsonZstdFileSource(cacheFilePath, Schema))
    // After
    .WithSource(new ProtobufZstdFileSource(cacheFilePath))
    
  3. Remove schema definitions - protobuf-net-data embeds schema in the file, so JsonColumnSchema[] arrays are no longer needed in DevEtl classes.

No changes to:

  • Pipeline structure
  • DevEtlRegistry.cs

4. Cleanup (After Verification)

Remove obsolete files:

  • Sources/JsonZstdFileSource.cs
  • Sources/JsonStreamingDataReader.cs
  • Sources/Utf8JsonStreamingDataReader.cs
  • Models/JsonColumnSchema.cs

Remove old cache files:

  • All *.json.zstd files in CACHED_DB_FILES/

Code Simplification Summary

| Before (JSON) | After (Protobuf) |
|---|---|
| JsonZstdFileSource | ProtobufZstdFileSource |
| Utf8JsonStreamingDataReader (custom) | DataSerializer.Deserialize() (library) |
| JsonStreamingDataReader (legacy) | Removed |
| JsonColumnSchema[] per table | Not needed (embedded in file) |

Test Strategy

  1. Run converter tool, verify all 22 files convert without errors
  2. Compare file sizes (expect 10-30% reduction)
  3. Run existing JdeScoping.DataSync.Dev.Tests - all tests should pass unchanged
  4. Verify data loaded matches previous JSON-based loads
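
For step 4, one possible way to check that a converted file yields the same data is to walk the old and new readers side by side. The helper below is a sketch using only System.Data. ReaderComparer and FirstDifference are hypothetical names, and it assumes both sources return rows in the same order.

```csharp
using System;
using System.Data;

// Hypothetical verification helper, for illustration only.
public static class ReaderComparer
{
    // Returns null when both readers agree, otherwise a description of the first difference.
    public static string? FirstDifference(IDataReader expected, IDataReader actual)
    {
        if (expected.FieldCount != actual.FieldCount)
            return $"column count {expected.FieldCount} != {actual.FieldCount}";

        for (var i = 0; i < expected.FieldCount; i++)
            if (expected.GetName(i) != actual.GetName(i))
                return $"column {i}: '{expected.GetName(i)}' != '{actual.GetName(i)}'";

        var row = 0;
        while (true)
        {
            var hasExpected = expected.Read();
            var hasActual = actual.Read();
            if (hasExpected != hasActual) return $"row count differs after {row} rows";
            if (!hasExpected) return null; // both readers are exhausted

            for (var i = 0; i < expected.FieldCount; i++)
            {
                // Boxed comparison: DBNull equals DBNull, and a CLR type change
                // (for example int vs long) is reported as a difference.
                // Note: byte[] values compare by reference and would need special handling.
                if (!object.Equals(expected.GetValue(i), actual.GetValue(i)))
                    return $"row {row}, column '{expected.GetName(i)}': {expected.GetValue(i)} != {actual.GetValue(i)}";
            }
            row++;
        }
    }
}
```

Because the comparison is type-sensitive, it may also surface columns where the JSON reader and the protobuf reader produce different CLR types for the same value. Whether that counts as a mismatch depends on what the downstream pipeline expects.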

Files Changed

| File | Change |
|---|---|
| Tools/CacheConverter/ (new) | Standalone converter tool |
| Sources/ProtobufZstdFileSource.cs (new) | New protobuf reader (much simpler) |
| JdeScoping.DataSync.Dev.csproj | Add protobuf-net-data package |
| *DevEtl.cs (22 files) | Update file extension, source class, remove schema |
| Sources/JsonZstdFileSource.cs | Delete after migration |
| Sources/JsonStreamingDataReader.cs | Delete after migration |
| Sources/Utf8JsonStreamingDataReader.cs | Delete after migration |
| Models/JsonColumnSchema.cs | Delete after migration |
| CACHED_DB_FILES/*.json.zstd | Delete after verification |