docs: switch cache conversion design from MessagePack to protobuf-net-data
protobuf-net-data is purpose-built for IDataReader serialization and returns IDataReader directly from Deserialize(), eliminating the need for custom streaming reader implementations.
@@ -1,142 +0,0 @@
# MessagePack Cache Conversion Design

## Purpose

Convert the development cache files in `CACHED_DB_FILES/` from zstd-compressed JSON (`.json.zstd`) to zstd-compressed MessagePack (`.msgpack.zstd`) for faster deserialization and smaller file sizes.

## Goals

1. **Faster deserialization** - MessagePack is faster to parse than JSON
2. **Smaller file sizes** - MessagePack is more compact than JSON

## Current State

- 22 cache files in `CACHED_DB_FILES/` totaling ~3.6 GB (zstd-compressed JSON)
- `JsonZstdFileSource` reads files using ZstdSharp + Utf8JsonReader
- Each `*DevEtl.cs` defines a schema and creates a pipeline from JSON files
- Tests verify ETL loads data from cache files into SQL Server

## Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Conversion approach | One-time manual | Cache files are static snapshots, not actively regenerated |
| Data structure | Map format (field names as keys) | Self-describing, maintainable, keys compress well with zstd |
| Compression | Keep zstd | Largest file is 878 MB; raw MessagePack would be 2-4x larger |
| Converter location | Standalone console app in `Tools/CacheConverter/` | Isolated utility, not part of main solution |

## File Format

**New extension:** `.msgpack.zstd`

**File naming:**
- `branch.json.zstd` → `branch.msgpack.zstd`
- `workordertime_curr.json.zstd` → `workordertime_curr.msgpack.zstd`

**Data structure:** Array of maps (same logical structure as JSON)
```
[
  { "Code": "ABC", "Description": "Branch ABC", "LastUpdateDT": <DateTime> },
  { "Code": "DEF", "Description": "Branch DEF", "LastUpdateDT": <DateTime> },
  ...
]
```

**Library:** MessagePack-CSharp (`MessagePack` NuGet package)

## Components

### 1. Converter Tool

**Location:** `/JdeScopingTool/Tools/CacheConverter/`

```
Tools/
└── CacheConverter/
    ├── CacheConverter.csproj
    └── Program.cs
```

**Dependencies:**
- `ZstdSharp.Port` - read zstd JSON, write zstd MessagePack
- `MessagePack` - MessagePack serialization

**Behavior:**
1. Read each `.json.zstd` file from `CACHED_DB_FILES/`
2. Stream JSON → deserialize to `Dictionary<string, object?>[]`
3. Serialize to MessagePack (map format) → compress with zstd
4. Write to `.msgpack.zstd` alongside originals
5. Print before/after sizes for comparison

**Usage:**
```bash
cd Tools/CacheConverter
dotnet run -- ../../CACHED_DB_FILES
```

### 2. MessagePackZstdFileSource

**New file:** `NEW/src/JdeScoping.DataSync.Dev/Sources/MessagePackZstdFileSource.cs`

- Implements `IImportSource` (same interface as `JsonZstdFileSource`)
- Reads `.msgpack.zstd` files using streaming decompression
- Uses `MessagePackStreamReader` for efficient streaming deserialization
- Returns an `IDataReader` that yields rows one at a time
- Schema still needed for `IDataReader` field metadata (column names, types, ordinals)

**Package addition:** Add `MessagePack` to `JdeScoping.DataSync.Dev.csproj`

### 3. DevEtl Class Updates

**Changes to each `*DevEtl.cs` file (22 files):**

1. Update `CacheFileName` constant:
   ```csharp
   // Before
   public static readonly string CacheFileName = "branch.json.zstd";
   // After
   public static readonly string CacheFileName = "branch.msgpack.zstd";
   ```

2. Update `Create()` method:
   ```csharp
   // Before
   .WithSource(new JsonZstdFileSource(cacheFilePath, Schema))
   // After
   .WithSource(new MessagePackZstdFileSource(cacheFilePath, Schema))
   ```

**No changes to:**
- Schema definitions (same column names and types)
- Pipeline structure
- `DevEtlRegistry.cs`

### 4. Cleanup (After Verification)

Remove obsolete JSON readers:
- `JsonZstdFileSource.cs`
- `JsonStreamingDataReader.cs`
- `Utf8JsonStreamingDataReader.cs`

Remove old cache files:
- All `*.json.zstd` files in `CACHED_DB_FILES/`

## Test Strategy

1. Run converter tool, verify all 22 files convert without errors
2. Compare file sizes (expect 10-30% reduction)
3. Run existing `JdeScoping.DataSync.Dev.Tests` - all tests should pass unchanged
4. Verify data loaded matches previous JSON-based loads

## Files Changed

| File | Change |
|------|--------|
| `Tools/CacheConverter/` (new) | Standalone converter tool |
| `Sources/MessagePackZstdFileSource.cs` (new) | New MessagePack reader |
| `JdeScoping.DataSync.Dev.csproj` | Add MessagePack package |
| `*DevEtl.cs` (22 files) | Update file extension and source class |
| `Sources/JsonZstdFileSource.cs` | Delete after migration |
| `Sources/JsonStreamingDataReader.cs` | Delete after migration |
| `Sources/Utf8JsonStreamingDataReader.cs` | Delete after migration |
| `CACHED_DB_FILES/*.json.zstd` | Delete after verification |
@@ -0,0 +1,164 @@
# Protobuf Cache Conversion Design

## Purpose

Convert the development cache files in `CACHED_DB_FILES/` from zstd-compressed JSON (`.json.zstd`) to zstd-compressed Protocol Buffers (`.pb.zstd`) using protobuf-net-data for faster deserialization and smaller file sizes.

## Goals

1. **Faster deserialization** - Protobuf is faster to parse than JSON
2. **Smaller file sizes** - Protobuf is more compact than JSON
3. **Simpler code** - protobuf-net-data returns `IDataReader` directly, no custom reader needed

## Current State

- 22 cache files in `CACHED_DB_FILES/` totaling ~3.6 GB (zstd-compressed JSON)
- `JsonZstdFileSource` reads files using ZstdSharp + Utf8JsonReader
- Custom `Utf8JsonStreamingDataReader` implements `IDataReader` for streaming
- Each `*DevEtl.cs` defines a schema and creates a pipeline from JSON files

## Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Serialization library | protobuf-net-data | Purpose-built for IDataReader, returns IDataReader directly |
| Conversion approach | One-time manual | Cache files are static snapshots, not actively regenerated |
| Compression | zstd on whole file | Consistent with current approach, excellent compression |
| Converter location | Standalone console app in `Tools/CacheConverter/` | Isolated utility, not part of main solution |

## File Format

**New extension:** `.pb.zstd`

**File naming:**
- `branch.json.zstd` → `branch.pb.zstd`
- `workordertime_curr.json.zstd` → `workordertime_curr.pb.zstd`

**Data structure:** protobuf-net-data binary format
- Schema embedded in stream (column names, types, nullability)
- Rows serialized sequentially
- Native ADO.NET type support (DateTime, Guid, decimal, etc.)

**Libraries:**
- `protobuf-net-data` - IDataReader serialization/deserialization
- `ZstdSharp.Port` - compression

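
The file naming rule above can be captured in a small helper so the converter and any tests derive output names the same way. This is only a sketch: `CacheFileNames` and `ToProtobufName` are hypothetical names, not part of the existing codebase.

```csharp
using System;

public static class CacheFileNames
{
    private const string OldSuffix = ".json.zstd";
    private const string NewSuffix = ".pb.zstd";

    // Maps "branch.json.zstd" to "branch.pb.zstd"; rejects any other name.
    public static string ToProtobufName(string jsonZstdName)
    {
        if (!jsonZstdName.EndsWith(OldSuffix, StringComparison.OrdinalIgnoreCase))
            throw new ArgumentException(
                $"Expected a {OldSuffix} file: {jsonZstdName}", nameof(jsonZstdName));

        // Drop the old suffix and append the new one.
        return jsonZstdName[..^OldSuffix.Length] + NewSuffix;
    }
}
```

Rejecting unexpected names, rather than passing them through, keeps the one-time conversion from silently writing over an unrelated file.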
## Components

### 1. Converter Tool

**Location:** `/JdeScopingTool/Tools/CacheConverter/`

```
Tools/
└── CacheConverter/
    ├── CacheConverter.csproj
    └── Program.cs
```

**Dependencies:**
- `ZstdSharp.Port` - read zstd JSON, write zstd protobuf
- `protobuf-net-data` - protobuf serialization

**Behavior:**
1. Read each `.json.zstd` file from `CACHED_DB_FILES/`
2. Decompress and parse JSON into an `IDataReader`
3. Use `DataSerializer.Serialize(stream, reader)` to write protobuf
4. Compress with zstd and write to `.pb.zstd`
5. Print before/after sizes for comparison

**Usage:**
```bash
cd Tools/CacheConverter
dotnet run -- ../../CACHED_DB_FILES
```
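
Steps 3 to 5 of the converter behavior might look roughly like the sketch below. It assumes an `IDataReader` over the decompressed JSON is already available from step 2. `ProtobufZstdWriter`, its method names, and the compression level of 3 are illustrative assumptions, not decisions made in this design.

```csharp
using System;
using System.Data;
using System.IO;
using ProtoBuf.Data; // NuGet: protobuf-net-data
using ZstdSharp;     // NuGet: ZstdSharp.Port

public static class ProtobufZstdWriter
{
    // Streams every row from `reader` through protobuf-net-data into a zstd-compressed file.
    public static void Write(IDataReader reader, string outputPath, int zstdLevel = 3)
    {
        using var file = File.Create(outputPath);
        using var zstd = new CompressionStream(file, zstdLevel);
        DataSerializer.Serialize(zstd, reader);
        // Disposal runs in reverse order: the zstd stream is finished first, then the file closes.
    }

    // Prints before/after sizes; call this only after Write has returned.
    public static void ReportSizes(string inputPath, string outputPath)
    {
        long before = new FileInfo(inputPath).Length;
        long after = new FileInfo(outputPath).Length;
        Console.WriteLine(
            $"{Path.GetFileName(inputPath)}: {before:N0} -> {after:N0} bytes ({(double)after / before:P1})");
    }
}
```

Writing through the compression stream, instead of buffering the protobuf output first, keeps memory use flat even for the largest cache files.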
### 2. ProtobufZstdFileSource

**New file:** `NEW/src/JdeScoping.DataSync.Dev/Sources/ProtobufZstdFileSource.cs`

**Key simplification:** No custom `IDataReader` implementation needed!

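```csharp
using System.Data;
using ProtoBuf.Data; // protobuf-net-data
using ZstdSharp;     // ZstdSharp.Port

public sealed class ProtobufZstdFileSource : IImportSource
{
    private readonly string _filePath;
    private FileStream? _fileStream;
    private DecompressionStream? _decompressionStream;

    public ProtobufZstdFileSource(string filePath) => _filePath = filePath;

    // Nothing is awaited here, so the method is not marked async.
    public Task<IDataReader> ReadDataAsync(CancellationToken ct = default)
    {
        _fileStream = new FileStream(_filePath, FileMode.Open, ...);
        _decompressionStream = new DecompressionStream(_fileStream);

        // protobuf-net-data returns IDataReader directly!
        return Task.FromResult(DataSerializer.Deserialize(_decompressionStream));
    }
}
```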

**Package additions to `JdeScoping.DataSync.Dev.csproj`:**
- `protobuf-net-data`

### 3. DevEtl Class Updates

**Changes to each `*DevEtl.cs` file (22 files):**

1. Update `CacheFileName` constant:
   ```csharp
   // Before
   public static readonly string CacheFileName = "branch.json.zstd";
   // After
   public static readonly string CacheFileName = "branch.pb.zstd";
   ```

2. Update `Create()` method:
   ```csharp
   // Before
   .WithSource(new JsonZstdFileSource(cacheFilePath, Schema))
   // After
   .WithSource(new ProtobufZstdFileSource(cacheFilePath))
   ```

3. **Remove schema definitions** - protobuf-net-data embeds schema in the file, so `JsonColumnSchema[]` arrays are no longer needed in DevEtl classes.

**No changes to:**
- Pipeline structure
- `DevEtlRegistry.cs`
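
With the `JsonColumnSchema[]` arrays gone, the column list lives only in the cache file. If a test or diagnostic still wants to see it, the standard `IDataReader` members are enough, as in the sketch below. `ReaderSchema` and `Describe` are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class ReaderSchema
{
    // Lists the name and CLR type of each column the reader exposes, in ordinal order.
    public static IReadOnlyList<(string Name, Type Type)> Describe(IDataReader reader)
    {
        var columns = new List<(string Name, Type Type)>(reader.FieldCount);
        for (int i = 0; i < reader.FieldCount; i++)
            columns.Add((reader.GetName(i), reader.GetFieldType(i)));
        return columns;
    }
}
```

This works against any `IDataReader`, so the same call can be pointed at the reader returned by `ReadDataAsync` or at an in-memory `DataTable` reader in a unit test.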

### 4. Cleanup (After Verification)

**Remove obsolete files:**
- `Sources/JsonZstdFileSource.cs`
- `Sources/JsonStreamingDataReader.cs`
- `Sources/Utf8JsonStreamingDataReader.cs`
- `Models/JsonColumnSchema.cs`

**Remove old cache files:**
- All `*.json.zstd` files in `CACHED_DB_FILES/`

## Code Simplification Summary

| Before (JSON) | After (Protobuf) |
|---------------|------------------|
| `JsonZstdFileSource` | `ProtobufZstdFileSource` |
| `Utf8JsonStreamingDataReader` (custom) | `DataSerializer.Deserialize()` (library) |
| `JsonStreamingDataReader` (legacy) | Removed |
| `JsonColumnSchema[]` per table | Not needed (embedded in file) |

## Test Strategy

1. Run converter tool, verify all 22 files convert without errors
2. Compare file sizes (expect 10-30% reduction)
3. Run existing `JdeScoping.DataSync.Dev.Tests` - all tests should pass unchanged
4. Verify data loaded matches previous JSON-based loads
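
Step 4 could be checked before anything reaches SQL Server by reading the old and new files side by side and comparing them row by row. The sketch below is one possible way to do that. `ReaderComparer` is a hypothetical helper, and it assumes both readers expose the same columns in the same order with the same CLR types. If the JSON reader and the protobuf reader surface a column as different types, the comparison would need a normalization step first.

```csharp
using System;
using System.Data;

public static class ReaderComparer
{
    // Returns the number of rows compared; throws on the first difference.
    public static long AssertSameRows(IDataReader expected, IDataReader actual)
    {
        if (expected.FieldCount != actual.FieldCount)
            throw new InvalidOperationException("Column counts differ.");

        long row = 0;
        while (true)
        {
            bool hasExpected = expected.Read();
            bool hasActual = actual.Read();
            if (hasExpected != hasActual)
                throw new InvalidOperationException($"Row counts differ after {row} rows.");
            if (!hasExpected)
                return row; // both readers are exhausted

            for (int i = 0; i < expected.FieldCount; i++)
            {
                // GetValue returns DBNull.Value for nulls, which compares equal to itself.
                if (!object.Equals(expected.GetValue(i), actual.GetValue(i)))
                    throw new InvalidOperationException(
                        $"Mismatch at row {row}, column {expected.GetName(i)}.");
            }
            row++;
        }
    }
}
```

Because both readers are advanced in lockstep, the check never holds more than one row from each side in memory.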

## Files Changed

| File | Change |
|------|--------|
| `Tools/CacheConverter/` (new) | Standalone converter tool |
| `Sources/ProtobufZstdFileSource.cs` (new) | New protobuf reader (much simpler) |
| `JdeScoping.DataSync.Dev.csproj` | Add protobuf-net-data package |
| `*DevEtl.cs` (22 files) | Update file extension, source class, remove schema |
| `Sources/JsonZstdFileSource.cs` | Delete after migration |
| `Sources/JsonStreamingDataReader.cs` | Delete after migration |
| `Sources/Utf8JsonStreamingDataReader.cs` | Delete after migration |
| `Models/JsonColumnSchema.cs` | Delete after migration |
| `CACHED_DB_FILES/*.json.zstd` | Delete after verification |