ETL Pipeline Design for DataSync
Date: 2026-01-03
Status: Reviewed (Codex MCP)
Purpose: Replace the existing strongly-typed fetcher + source-generated DataReader + BulkMergeHelper system with a flexible, configuration-driven ETL pipeline.
Problem Statement
The current DataSync system requires:
- A source-generated IDataReader implementation per entity type
- Strongly-typed IDataFetcher<T> classes for each data source
- Code changes to modify queries or add new sync operations
Primary drivers for change:
- Flexibility - Enable ad-hoc queries and schema changes without code modifications
- Observability - Better error handling and row count tracking across pipeline steps
Design Overview
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ ImportSource │───▶│ Transformer  │───▶│ Destination  │
│ (IDataReader)│    │ (decorates)  │    │ (bulk ops)   │
└──────────────┘    └──────────────┘    └──────────────┘
       ▲                                       │
       │                                       ▼
┌──────┴───────┐                        ┌──────────────┐
│ Pre-Scripts  │                        │ Post-Scripts │
└──────────────┘                        └──────────────┘
All steps report to PipelineResult with timing and row counts.
Core Interfaces
IImportSource
Reads data from a source system, returns IDataReader for streaming.
public interface IImportSource : IAsyncDisposable
{
    Task<IDataReader> ReadDataAsync(CancellationToken cancellationToken = default);
    string SourceName { get; }
}
Implementations:
- DbQuerySource - Executes SQL against Oracle/Sybase/SQL Server via IDbConnectionFactory
- FileSource - Reads CSV files with configurable delimiter/header options
IDataTransformer
Wraps an IDataReader and applies transformations (column rename, drop, type conversion).
public interface IDataTransformer
{
    IDataReader Transform(IDataReader source);
    string TransformerName { get; }
}
Implementations:
- JdeDateTransformer - Merges Julian date + time columns into DateTime
- ColumnRenameTransformer - Renames columns
- ColumnDropTransformer - Removes columns from output
- ColumnAddTransformer - Adds computed columns
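The date merge performed by JdeDateTransformer can be illustrated in isolation. The following is a minimal sketch, assuming the conventional JDE encodings: the date column holds CYYDDD (centuries after 1900, two-digit year, day of year) and the time column holds HHMMSS as an integer. The class name JdeDateMath is hypothetical and not part of this design.

```csharp
using System;

// Hypothetical helper for illustration; the design does not define this class.
public static class JdeDateMath
{
    // Assumes jdeDate is CYYDDD (e.g. 126003 = day 3 of 2026)
    // and jdeTime is HHMMSS stored as an integer (e.g. 93005 = 09:30:05).
    public static DateTime ToDateTime(int jdeDate, int jdeTime)
    {
        if (jdeDate <= 0)
            throw new ArgumentOutOfRangeException(nameof(jdeDate));

        int year = 1900 + jdeDate / 1000;   // CYY reads as "years since 1900"
        int dayOfYear = jdeDate % 1000;
        int daysInYear = DateTime.IsLeapYear(year) ? 366 : 365;
        if (dayOfYear < 1 || dayOfYear > daysInYear)
            throw new ArgumentOutOfRangeException(nameof(jdeDate));

        int hours = jdeTime / 10000;
        int minutes = jdeTime / 100 % 100;
        int seconds = jdeTime % 100;
        if (jdeTime < 0 || hours > 23 || minutes > 59 || seconds > 59)
            throw new ArgumentOutOfRangeException(nameof(jdeTime));

        // Start at January 1 with the time of day, then move to the day of year.
        return new DateTime(year, 1, 1, hours, minutes, seconds).AddDays(dayOfYear - 1);
    }
}
```

A transformer would call such a function per row and expose the result as the single output column, for example UpdatedAt built from UPMJ and TDAY.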
Transformers are registered by name in ITransformerRegistry for config-based hydration.
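One way the name-based lookup could work is a dictionary of factories keyed by the config "type" string. This is a sketch only: the members of ITransformerRegistry are not defined in this document, so the Register/Create signatures and the PassThroughTransformer class below are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Repeated from the design above so the sketch compiles on its own.
public interface IDataTransformer
{
    IDataReader Transform(IDataReader source);
    string TransformerName { get; }
}

// Assumed shape: the design names this interface but does not list its members.
public interface ITransformerRegistry
{
    void Register(string type, Func<IReadOnlyDictionary<string, string>, IDataTransformer> factory);
    IDataTransformer Create(string type, IReadOnlyDictionary<string, string> options);
}

public sealed class TransformerRegistry : ITransformerRegistry
{
    // Config type names are matched case-insensitively (an assumption).
    private readonly Dictionary<string, Func<IReadOnlyDictionary<string, string>, IDataTransformer>> _factories =
        new(StringComparer.OrdinalIgnoreCase);

    public void Register(string type, Func<IReadOnlyDictionary<string, string>, IDataTransformer> factory) =>
        _factories[type] = factory ?? throw new ArgumentNullException(nameof(factory));

    public IDataTransformer Create(string type, IReadOnlyDictionary<string, string> options)
    {
        if (!_factories.TryGetValue(type, out var factory))
            throw new InvalidOperationException($"Unknown transformer type '{type}'.");
        return factory(options);
    }
}

// Hypothetical no-op transformer, present only to exercise the registry.
public sealed class PassThroughTransformer : IDataTransformer
{
    public string TransformerName => "pass-through";
    public IDataReader Transform(IDataReader source) => source;
}
```

With this shape, a factory such as EtlPipelineFactory would call Create once per entry in the "transformers" array, passing that entry's "options" object.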
IImportDestination
Writes data to a target table using bulk operations.
public interface IImportDestination
{
    Task<DestinationResult> WriteAsync(
        IDataReader source,
        CancellationToken cancellationToken = default);
    string DestinationName { get; }
}
public record DestinationResult(
    long RowsProcessed,
    int BatchCount,
    TimeSpan Elapsed);
Implementations:
DbBulkImportDestination (Full Refresh)
- Truncate destination table
- Bulk copy all data in batches
- Per-batch commits
DbBulkMergeDestination (Incremental)
- Create temp table from destination schema
- For each batch:
  - Bulk copy to temp table
  - Execute MERGE to destination
  - Truncate temp table
- Drop temp table on completion
Configuration options:
- tableName - Target table
- matchColumns - PK columns for MERGE ON clause
- updateColumns - Columns to update (default: all non-match columns)
- batchSize - Rows per batch (default: 10000)
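To show how matchColumns and updateColumns could map onto the per-batch MERGE, here is a minimal sketch that builds a SQL Server style statement. The MergeSqlSketch class, its parameters, and the temp table name used in the note below are hypothetical; the sketch also assumes unqualified table names (no schema prefix) and that the temp table has the same columns as the destination.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the design above.
public static class MergeSqlSketch
{
    // Bracket-quotes an identifier, escaping any closing bracket.
    private static string Q(string name) => "[" + name.Replace("]", "]]") + "]";

    public static string Build(
        string tableName,
        string tempTable,
        IReadOnlyList<string> allColumns,
        IReadOnlyList<string> matchColumns,
        IReadOnlyList<string>? updateColumns = null)
    {
        if (matchColumns.Count == 0)
            throw new ArgumentException("At least one match column is required.", nameof(matchColumns));

        // Default from the options above: update every non-match column.
        IReadOnlyList<string> updates = updateColumns
            ?? allColumns.Except(matchColumns, StringComparer.OrdinalIgnoreCase).ToList();

        string on = string.Join(" AND ", matchColumns.Select(c => $"t.{Q(c)} = s.{Q(c)}"));
        string set = string.Join(", ", updates.Select(c => $"t.{Q(c)} = s.{Q(c)}"));
        string cols = string.Join(", ", allColumns.Select(c => Q(c)));
        string vals = string.Join(", ", allColumns.Select(c => "s." + Q(c)));

        // Skip the UPDATE branch when every column is a match column.
        string matched = updates.Count > 0 ? $"WHEN MATCHED THEN UPDATE SET {set} " : "";

        return $"MERGE INTO {Q(tableName)} AS t USING {Q(tempTable)} AS s ON {on} "
             + matched
             + $"WHEN NOT MATCHED BY TARGET THEN INSERT ({cols}) VALUES ({vals});";
    }
}
```

For example, Build("WorkOrder", "#WorkOrder_Stage", new[] { "OrderNumber", "Status" }, new[] { "OrderNumber" }) joins on OrderNumber and updates only Status. Oracle and Sybase would need their own dialects, which this sketch does not attempt.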
IScriptRunner
Executes SQL scripts before or after the main pipeline.
public interface IScriptRunner
{
    Task ExecuteAsync(CancellationToken cancellationToken = default);
    string ScriptName { get; }
}
Common scripts (factory methods):
- CommonScripts.DisableIndexes(factory, tableName)
- CommonScripts.RebuildIndexes(factory, tableName)
- CommonScripts.UpdateStatistics(factory, tableName)
- CommonScripts.CustomSql(factory, sql, name)
Pipeline Results
public record PipelineResult(
    bool Success,
    long TotalRows,
    TimeSpan Elapsed,
    IReadOnlyList<StepResult> Steps,
    Exception? Error = null);
public record StepResult(
    string StepName,
    string StepType, // "Source", "Transform", "Destination", "Script"
    long RowsAffected,
    TimeSpan Elapsed);
EtlPipeline Orchestration
public class EtlPipeline
{
    public string PipelineName { get; }
    public async Task<PipelineResult> ExecuteAsync(CancellationToken cancellationToken = default)
    {
        // 1. Run pre-scripts (fail-fast)
        // 2. Open source, get IDataReader
        // 3. Chain transformers (decorators)
        // 4. Write to destination
        // 5. Run post-scripts
        // 6. Return PipelineResult with all step metrics
    }
}
Error handling:
- Fail-fast on any error
- No cross-batch transactions (per-batch commits)
- Caller marks overall sync operation as failed
- Restart from beginning on failure (no resume)
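The fail-fast rule and per-step metrics can be sketched with a simplified runner. This is not the EtlPipeline implementation: each step is reduced to a delegate returning a row count, and the PipelineStep and FailFastRunner names are hypothetical. Letting cancellation propagate as an exception rather than recording it as a failed result is also an assumption.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Records repeated from the design above so the sketch compiles on its own.
public record StepResult(string StepName, string StepType, long RowsAffected, TimeSpan Elapsed);
public record PipelineResult(bool Success, long TotalRows, TimeSpan Elapsed,
    IReadOnlyList<StepResult> Steps, Exception? Error = null);

// Hypothetical stand-in for a script, source, or destination step.
public record PipelineStep(string Name, string Type, Func<CancellationToken, Task<long>> Run);

public static class FailFastRunner
{
    public static async Task<PipelineResult> RunAsync(
        IEnumerable<PipelineStep> steps, CancellationToken cancellationToken = default)
    {
        var results = new List<StepResult>();
        var total = Stopwatch.StartNew();
        long totalRows = 0;
        try
        {
            foreach (var step in steps)
            {
                cancellationToken.ThrowIfCancellationRequested();
                var timer = Stopwatch.StartNew();
                long rows = await step.Run(cancellationToken);
                results.Add(new StepResult(step.Name, step.Type, rows, timer.Elapsed));

                // Only the destination reports real row counts in this design.
                if (step.Type == "Destination") totalRows += rows;
            }
            return new PipelineResult(true, totalRows, total.Elapsed, results);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Fail-fast: the first error ends the run; later steps,
            // including post-scripts, never execute.
            return new PipelineResult(false, totalRows, total.Elapsed, results, ex);
        }
    }
}
```

Because post-scripts are skipped on failure, a run that fails after DisableIndexes leaves the indexes disabled until the next successful run, which is the accepted risk recorded under Design Decisions.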
Pipeline Construction
Fluent Builder
var pipeline = new EtlPipelineBuilder()
    .WithName("WorkOrderSync")
    .WithSource(new DbQuerySource(connFactory, "JDE", workOrderQuery))
    .WithTransformer(new JdeDateTransformer("UPMJ", "TDAY", "UpdatedAt"))
    .WithDestination(new DbBulkMergeDestination(connFactory, "WorkOrder", ["OrderNumber"]))
    .WithPreScript(CommonScripts.DisableIndexes(connFactory, "WorkOrder"))
    .WithPostScript(CommonScripts.RebuildIndexes(connFactory, "WorkOrder"))
    .WithLogger(logger)
    .Build();
var result = await pipeline.ExecuteAsync(cancellationToken);
Configuration-Based
{
  "name": "WorkOrderSync",
  "source": {
    "type": "db",
    "connectionName": "JDE",
    "query": "SELECT * FROM F4801 WHERE UPMJ >= :minDate"
  },
  "transformers": [
    { "type": "jde-date", "options": { "dateColumn": "UPMJ", "timeColumn": "TDAY", "outputColumn": "UpdatedAt" } }
  ],
  "destination": {
    "type": "bulk-merge",
    "tableName": "WorkOrder",
    "matchColumns": ["OrderNumber"],
    "batchSize": 10000
  },
  "preScripts": [
    { "type": "disable-indexes", "tableName": "WorkOrder" }
  ],
  "postScripts": [
    { "type": "rebuild-indexes", "tableName": "WorkOrder" }
  ]
}
Hydrated via EtlPipelineFactory.CreateFromConfig(config).
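The first half of hydration, turning the JSON above into typed config objects, could look like the sketch below. The record shapes are assumptions inferred from the sample JSON, since this document lists PipelineConfig.cs without defining its members, and PipelineConfigLoader is a hypothetical name. Transformer options are assumed to be string-valued, as they are in the sample.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;

// Assumed shapes, inferred from the sample JSON above.
public record SourceConfig(string Type, string? ConnectionName, string? Query);
public record TransformerConfig(string Type, Dictionary<string, string>? Options);
public record DestinationConfig(string Type, string TableName, List<string>? MatchColumns, int? BatchSize);
public record ScriptConfig(string Type, string? TableName);
public record PipelineConfig(
    string Name,
    SourceConfig Source,
    List<TransformerConfig>? Transformers,
    DestinationConfig Destination,
    List<ScriptConfig>? PreScripts,
    List<ScriptConfig>? PostScripts);

// Hypothetical loader; the design only names EtlPipelineFactory.CreateFromConfig.
public static class PipelineConfigLoader
{
    // The sample uses camelCase keys, so property matching ignores case.
    private static readonly JsonSerializerOptions SerializerOptions =
        new() { PropertyNameCaseInsensitive = true };

    public static PipelineConfig Parse(string json) =>
        JsonSerializer.Deserialize<PipelineConfig>(json, SerializerOptions)
        ?? throw new JsonException("Pipeline configuration was empty.");
}
```

A factory would then map each "type" string to a concrete source, transformer, destination, or script, and apply defaults such as a batch size of 10000 when BatchSize is null.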
Project Structure
NEW/src/JdeScoping.DataSync/
├── Contracts/
│   ├── IImportSource.cs
│   ├── IDataTransformer.cs
│   ├── IImportDestination.cs
│   ├── IScriptRunner.cs
│   └── ITransformerRegistry.cs
├── Etl/
│   ├── Pipeline/
│   │   ├── EtlPipeline.cs
│   │   ├── EtlPipelineBuilder.cs
│   │   └── EtlPipelineFactory.cs
│   ├── Sources/
│   │   ├── DbQuerySource.cs
│   │   └── FileSource.cs
│   ├── Transformers/
│   │   ├── DataTransformerBase.cs
│   │   ├── TransformingDataReader.cs
│   │   ├── JdeDateTransformer.cs
│   │   ├── ColumnRenameTransformer.cs
│   │   ├── ColumnDropTransformer.cs
│   │   └── TransformerRegistry.cs
│   ├── Destinations/
│   │   ├── DbBulkImportDestination.cs
│   │   └── DbBulkMergeDestination.cs
│   ├── Scripts/
│   │   ├── SqlScriptRunner.cs
│   │   └── CommonScripts.cs
│   └── Results/
│       ├── PipelineResult.cs
│       ├── StepResult.cs
│       └── DestinationResult.cs
└── Config/
    └── PipelineConfig.cs
Integration Strategy
- Parallel existence - New ETL pipeline lives alongside existing IDataFetcher<T> + IBulkMergeHelper
- Gradual migration - Convert one sync operation at a time to new pipeline
- Shared infrastructure - Both use IDbConnectionFactory, logging, etc.
- Deprecation path - Once all syncs migrated, remove old Fetchers/ and source generator
DI Registration
public static class EtlServiceCollectionExtensions
{
    public static IServiceCollection AddEtlPipeline(this IServiceCollection services)
    {
        services.AddSingleton<ITransformerRegistry, TransformerRegistry>();
        services.AddTransient<EtlPipelineFactory>();
        return services;
    }
}
Design Decisions
Based on Codex MCP review, the following decisions were made:
Atomicity & Consistency
| Concern | Decision | Rationale |
|---|---|---|
| Full refresh partial visibility | Accept | Caller retries on failure; consumers tolerate brief inconsistency |
| Orphaned disabled indexes | Accept risk | Next successful run rebuilds; manual intervention rare |
| Temp table connection lifetime | Destination owns connection | DbBulkMergeDestination holds single connection for all batches |
Implementation Requirements
- TransformingDataReader schema accuracy - Transformers MUST correctly implement:
  - GetSchemaTable() - Accurate after column changes
  - GetName(ordinal) - Reflects renames
  - GetOrdinal(name) - Works with new names
  - GetFieldType(ordinal) - Correct after type transforms
  - FieldCount - Accurate after drops/adds
- Transform step row counts - StepResult.RowsAffected is 0 for transform steps (inline decorators). Only destination reports actual rows processed.
- Cancellation handling - Destinations use DbDataReader.ReadAsync() internally, checking cancellation between rows. The source IDataReader interface stays synchronous for ADO.NET compatibility.
- Schema validation - Optional ValidateSchema step available to compare source columns against expected schema before pipeline runs. Not mandatory.
- MERGE duplicate keys - Source data MUST have unique keys for match columns. Pipeline does not dedupe; caller's responsibility to ensure uniqueness.
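For the schema accuracy requirement, one approach is to compute a single column mapping up front and have GetName, GetOrdinal, and FieldCount all read from it, so they cannot drift apart after renames and drops. The ColumnMap class below is a hypothetical sketch of that idea; it does not cover added columns, type changes, or GetSchemaTable().

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: one mapping shared by GetName, GetOrdinal, and FieldCount.
public sealed class ColumnMap
{
    private readonly string[] _names;          // output ordinal -> output name
    private readonly int[] _sourceOrdinals;    // output ordinal -> source ordinal
    private readonly Dictionary<string, int> _ordinalsByName =
        new(StringComparer.OrdinalIgnoreCase);

    public ColumnMap(
        IReadOnlyList<string> sourceNames,
        IReadOnlyDictionary<string, string> renames,
        IEnumerable<string> drops)
    {
        var dropSet = new HashSet<string>(drops, StringComparer.OrdinalIgnoreCase);
        var renameMap = new Dictionary<string, string>(renames, StringComparer.OrdinalIgnoreCase);
        var names = new List<string>();
        var ordinals = new List<int>();

        for (int i = 0; i < sourceNames.Count; i++)
        {
            if (dropSet.Contains(sourceNames[i])) continue;   // dropped columns get no output ordinal
            names.Add(renameMap.TryGetValue(sourceNames[i], out var renamed) ? renamed : sourceNames[i]);
            ordinals.Add(i);
        }

        _names = names.ToArray();
        _sourceOrdinals = ordinals.ToArray();
        for (int i = 0; i < _names.Length; i++)
        {
            // A rename that collides with another column would make GetOrdinal ambiguous.
            if (!_ordinalsByName.TryAdd(_names[i], i))
                throw new ArgumentException($"Duplicate output column '{_names[i]}'.");
        }
    }

    public int FieldCount => _names.Length;
    public string GetName(int ordinal) => _names[ordinal];

    // Mirrors IDataRecord.GetOrdinal, which throws IndexOutOfRangeException for unknown names.
    public int GetOrdinal(string name) =>
        _ordinalsByName.TryGetValue(name, out var ordinal) ? ordinal : throw new IndexOutOfRangeException(name);

    // Used by value accessors to read from the wrapped reader.
    public int ToSourceOrdinal(int ordinal) => _sourceOrdinals[ordinal];
}
```

A decorating reader would forward GetValue(i) to the wrapped reader's GetValue(map.ToSourceOrdinal(i)), which keeps values aligned with the reported names.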
Connection Lifetime Rules
DbQuerySource:
- Opens connection in ReadDataAsync()
- Holds connection until DisposeAsync()
- Connection closed when pipeline completes or fails
DbBulkImportDestination:
- Opens connection in WriteAsync()
- Holds for all batches
- Closes on completion or failure
DbBulkMergeDestination:
- Opens connection in WriteAsync()
- Creates #temp table
- Holds same connection for all batch cycles
- Drops temp table in finally block
- Closes connection on completion or failure
Open Questions
- Should DbQuerySource support parameterized queries with different parameter styles (:param for Oracle, @param for SQL Server)?
- Should we support multiple destinations per pipeline (e.g., write to both current and history tables)?
- How should pipeline configurations be stored and loaded (appsettings.json, separate files, database)?