Files
jdescopingtool/PLANS/2026-01-03-etl-pipeline-design.md
T
Joseph Doherty ec4c8fab87 refactor: relocate options classes to dedicated Options folders
Move configuration options from Core/DataAccess/DataSync/ExcelIO to
dedicated Options folders within each project for better organization.
Update all references and tests accordingly.
2026-01-03 08:55:08 -05:00

11 KiB

ETL Pipeline Design for DataSync

Date: 2026-01-03 Status: Reviewed (Codex MCP) Purpose: Replace the existing strongly-typed fetcher + source-generated DataReader + BulkMergeHelper system with a flexible, configuration-driven ETL pipeline.

Problem Statement

The current DataSync system requires:

  1. A source-generated IDataReader implementation per entity type
  2. Strongly-typed IDataFetcher<T> classes for each data source
  3. Code changes to modify queries or add new sync operations

Primary drivers for change:

  • Flexibility - Enable ad-hoc queries and schema changes without code modifications
  • Observability - Better error handling and row count tracking across pipeline steps

Design Overview

┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│ ImportSource│───▶│ Transformer  │───▶│ Destination │
│ (IDataReader)    │ (decorates)  │    │ (bulk ops)  │
└─────────────┘    └──────────────┘    └─────────────┘
       ▲                                      │
       │                                      ▼
┌──────┴──────┐                      ┌──────────────┐
│ Pre-Scripts │                      │ Post-Scripts │
└─────────────┘                      └──────────────┘

All steps report to PipelineResult with timing and row counts.

Core Interfaces

IImportSource

Reads data from a source system, returns IDataReader for streaming.

public interface IImportSource : IAsyncDisposable
{
    Task<IDataReader> ReadDataAsync(CancellationToken cancellationToken = default);
    string SourceName { get; }
}

Implementations:

  • DbQuerySource - Executes SQL against Oracle/Sybase/SQL Server via IDbConnectionFactory
  • FileSource - Reads CSV files with configurable delimiter/header options

IDataTransformer

Wraps an IDataReader and applies transformations (column rename, drop, type conversion).

public interface IDataTransformer
{
    IDataReader Transform(IDataReader source);
    string TransformerName { get; }
}

Implementations:

  • JdeDateTransformer - Merges Julian date + time columns into DateTime
  • ColumnRenameTransformer - Renames columns
  • ColumnDropTransformer - Removes columns from output
  • ColumnAddTransformer - Adds computed columns

Transformers are registered by name in ITransformerRegistry for config-based hydration.

IImportDestination

Writes data to a target table using bulk operations.

public interface IImportDestination
{
    Task<DestinationResult> WriteAsync(
        IDataReader source,
        CancellationToken cancellationToken = default);
    string DestinationName { get; }
}

public record DestinationResult(
    long RowsProcessed,
    int BatchCount,
    TimeSpan Elapsed);

Implementations:

DbBulkImportDestination (Full Refresh)

  1. Truncate destination table
  2. Bulk copy all data in batches
  3. Per-batch commits

DbBulkMergeDestination (Incremental)

  1. Create temp table from destination schema
  2. For each batch:
    • Bulk copy to temp table
    • Execute MERGE to destination
    • Truncate temp table
  3. Drop temp table on completion

Configuration options:

  • tableName - Target table
  • matchColumns - PK columns for MERGE ON clause
  • updateColumns - Columns to update (default: all non-match columns)
  • batchSize - Rows per batch (default: 10000)

IScriptRunner

Executes SQL scripts before or after the main pipeline.

public interface IScriptRunner
{
    Task ExecuteAsync(CancellationToken cancellationToken = default);
    string ScriptName { get; }
}

Common scripts (factory methods):

  • CommonScripts.DisableIndexes(factory, tableName)
  • CommonScripts.RebuildIndexes(factory, tableName)
  • CommonScripts.UpdateStatistics(factory, tableName)
  • CommonScripts.CustomSql(factory, sql, name)

Pipeline Results

public record PipelineResult(
    bool Success,
    long TotalRows,
    TimeSpan Elapsed,
    IReadOnlyList<StepResult> Steps,
    Exception? Error = null);

public record StepResult(
    string StepName,
    string StepType,  // "Source", "Transform", "Destination", "Script"
    long RowsAffected,
    TimeSpan Elapsed);

EtlPipeline Orchestration

public class EtlPipeline
{
    public string PipelineName { get; }

    public async Task<PipelineResult> ExecuteAsync(CancellationToken cancellationToken = default)
    {
        // 1. Run pre-scripts (fail-fast)
        // 2. Open source, get IDataReader
        // 3. Chain transformers (decorators)
        // 4. Write to destination
        // 5. Run post-scripts
        // 6. Return PipelineResult with all step metrics
    }
}

Error handling:

  • Fail-fast on any error
  • No cross-batch transactions (per-batch commits)
  • Caller marks overall sync operation as failed
  • Restart from beginning on failure (no resume)

Pipeline Construction

Fluent Builder

var pipeline = new EtlPipelineBuilder()
    .WithName("WorkOrderSync")
    .WithSource(new DbQuerySource(connFactory, "JDE", workOrderQuery))
    .WithTransformer(new JdeDateTransformer("UPMJ", "TDAY", "UpdatedAt"))
    .WithDestination(new DbBulkMergeDestination(connFactory, "WorkOrder", ["OrderNumber"]))
    .WithPreScript(CommonScripts.DisableIndexes(connFactory, "WorkOrder"))
    .WithPostScript(CommonScripts.RebuildIndexes(connFactory, "WorkOrder"))
    .WithLogger(logger)
    .Build();

var result = await pipeline.ExecuteAsync(cancellationToken);

Configuration-Based

{
  "name": "WorkOrderSync",
  "source": {
    "type": "db",
    "connectionName": "JDE",
    "query": "SELECT * FROM F4801 WHERE UPMJ >= :minDate"
  },
  "transformers": [
    { "type": "jde-date", "options": { "dateColumn": "UPMJ", "timeColumn": "TDAY", "outputColumn": "UpdatedAt" } }
  ],
  "destination": {
    "type": "bulk-merge",
    "tableName": "WorkOrder",
    "matchColumns": ["OrderNumber"],
    "batchSize": 10000
  },
  "preScripts": [
    { "type": "disable-indexes", "tableName": "WorkOrder" }
  ],
  "postScripts": [
    { "type": "rebuild-indexes", "tableName": "WorkOrder" }
  ]
}

Hydrated via EtlPipelineFactory.CreateFromConfig(config).

Project Structure

NEW/src/JdeScoping.DataSync/
├── Contracts/
│   ├── IImportSource.cs
│   ├── IDataTransformer.cs
│   ├── IImportDestination.cs
│   ├── IScriptRunner.cs
│   └── ITransformerRegistry.cs
├── Etl/
│   ├── Pipeline/
│   │   ├── EtlPipeline.cs
│   │   ├── EtlPipelineBuilder.cs
│   │   └── EtlPipelineFactory.cs
│   ├── Sources/
│   │   ├── DbQuerySource.cs
│   │   └── FileSource.cs
│   ├── Transformers/
│   │   ├── DataTransformerBase.cs
│   │   ├── TransformingDataReader.cs
│   │   ├── JdeDateTransformer.cs
│   │   ├── ColumnRenameTransformer.cs
│   │   ├── ColumnDropTransformer.cs
│   │   └── TransformerRegistry.cs
│   ├── Destinations/
│   │   ├── DbBulkImportDestination.cs
│   │   └── DbBulkMergeDestination.cs
│   ├── Scripts/
│   │   ├── SqlScriptRunner.cs
│   │   └── CommonScripts.cs
│   └── Results/
│       ├── PipelineResult.cs
│       ├── StepResult.cs
│       └── DestinationResult.cs
└── Config/
    └── PipelineConfig.cs

Integration Strategy

  1. Parallel existence - New ETL pipeline lives alongside existing IDataFetcher<T> + IBulkMergeHelper
  2. Gradual migration - Convert one sync operation at a time to new pipeline
  3. Shared infrastructure - Both use IDbConnectionFactory, logging, etc.
  4. Deprecation path - Once all syncs migrated, remove old Fetchers/ and source generator

DI Registration

public static class EtlServiceCollectionExtensions
{
    public static IServiceCollection AddEtlPipeline(this IServiceCollection services)
    {
        services.AddSingleton<ITransformerRegistry, TransformerRegistry>();
        services.AddTransient<EtlPipelineFactory>();
        return services;
    }
}

Design Decisions

Based on Codex MCP review, the following decisions were made:

Atomicity & Consistency

Concern Decision Rationale
Full refresh partial visibility Accept Caller retries on failure; consumers tolerate brief inconsistency
Orphaned disabled indexes Accept risk Next successful run rebuilds; manual intervention rare
Temp table connection lifetime Destination owns connection DbBulkMergeDestination holds single connection for all batches

Implementation Requirements

  1. TransformingDataReader schema accuracy - Transformers MUST correctly implement:

    • GetSchemaTable() - Accurate after column changes
    • GetName(ordinal) - Reflects renames
    • GetOrdinal(name) - Works with new names
    • GetFieldType(ordinal) - Correct after type transforms
    • FieldCount - Accurate after drops/adds
  2. Transform step row counts - StepResult.RowsAffected is 0 for transform steps (inline decorators). Only destination reports actual rows processed.

  3. Cancellation handling - Destinations use DbDataReader.ReadAsync() internally, checking cancellation between rows. Source IDataReader interface stays synchronous for ADO.NET compatibility.

  4. Schema validation - Optional ValidateSchema step available to compare source columns against expected schema before pipeline runs. Not mandatory.

  5. MERGE duplicate keys - Source data MUST have unique keys for match columns. Pipeline does not dedupe; caller's responsibility to ensure uniqueness.

Connection Lifetime Rules

DbQuerySource:
  - Opens connection in ReadDataAsync()
  - Holds connection until DisposeAsync()
  - Connection closed when pipeline completes or fails

DbBulkImportDestination:
  - Opens connection in WriteAsync()
  - Holds for all batches
  - Closes on completion or failure

DbBulkMergeDestination:
  - Opens connection in WriteAsync()
  - Creates #temp table
  - Holds same connection for all batch cycles
  - Drops temp table in finally block
  - Closes connection on completion or failure

Open Questions

  1. Should DbQuerySource support parameterized queries with different parameter styles (:param for Oracle, @param for SQL Server)?
  2. Should we support multiple destinations per pipeline (e.g., write to both current and history tables)?
  3. How should pipeline configurations be stored and loaded (appsettings.json, separate files, database)?