Files
jdescopingtool/PLANS/2026-01-03-etl-pipeline-design.md
T
Joseph Doherty ec4c8fab87 refactor: relocate options classes to dedicated Options folders
Move configuration options from Core/DataAccess/DataSync/ExcelIO to
dedicated Options folders within each project for better organization.
Update all references and tests accordingly.
2026-01-03 08:55:08 -05:00

335 lines
11 KiB
Markdown

# ETL Pipeline Design for DataSync
**Date:** 2026-01-03
**Status:** Reviewed (Codex MCP)
**Purpose:** Replace the existing strongly-typed fetcher + source-generated DataReader + BulkMergeHelper system with a flexible, configuration-driven ETL pipeline.
## Problem Statement
The current DataSync system requires:
1. A source-generated `IDataReader` implementation per entity type
2. Strongly-typed `IDataFetcher<T>` classes for each data source
3. Code changes to modify queries or add new sync operations
**Primary drivers for change:**
- **Flexibility** - Enable ad-hoc queries and schema changes without code modifications
- **Observability** - Better error handling and row count tracking across pipeline steps
## Design Overview
```
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ ImportSource│───▶│ Transformer │───▶│ Destination │
│ (IDataReader) │ (decorates) │ │ (bulk ops) │
└─────────────┘ └──────────────┘ └─────────────┘
▲ │
│ ▼
┌──────┴──────┐ ┌──────────────┐
│ Pre-Scripts │ │ Post-Scripts │
└─────────────┘ └──────────────┘
```
All steps report to `PipelineResult` with timing and row counts.
## Core Interfaces
### IImportSource
Reads data from a source system, returns `IDataReader` for streaming.
```csharp
public interface IImportSource : IAsyncDisposable
{
Task<IDataReader> ReadDataAsync(CancellationToken cancellationToken = default);
string SourceName { get; }
}
```
**Implementations:**
- `DbQuerySource` - Executes SQL against Oracle/Sybase/SQL Server via `IDbConnectionFactory`
- `FileSource` - Reads CSV files with configurable delimiter/header options
### IDataTransformer
Wraps an `IDataReader` and applies transformations (column rename, drop, type conversion).
```csharp
public interface IDataTransformer
{
IDataReader Transform(IDataReader source);
string TransformerName { get; }
}
```
**Implementations:**
- `JdeDateTransformer` - Merges Julian date + time columns into `DateTime`
- `ColumnRenameTransformer` - Renames columns
- `ColumnDropTransformer` - Removes columns from output
- `ColumnAddTransformer` - Adds computed columns
Transformers are registered by name in `ITransformerRegistry` for config-based hydration.
### IImportDestination
Writes data to a target table using bulk operations.
```csharp
public interface IImportDestination
{
Task<DestinationResult> WriteAsync(
IDataReader source,
CancellationToken cancellationToken = default);
string DestinationName { get; }
}
public record DestinationResult(
long RowsProcessed,
int BatchCount,
TimeSpan Elapsed);
```
**Implementations:**
#### DbBulkImportDestination (Full Refresh)
1. Truncate destination table
2. Bulk copy all data in batches
3. Per-batch commits
#### DbBulkMergeDestination (Incremental)
1. Create temp table from destination schema
2. For each batch:
- Bulk copy to temp table
- Execute MERGE to destination
- Truncate temp table
3. Drop temp table on completion
**Configuration options:**
- `tableName` - Target table
- `matchColumns` - PK columns for MERGE ON clause
- `updateColumns` - Columns to update (default: all non-match columns)
- `batchSize` - Rows per batch (default: 10000)
### IScriptRunner
Executes SQL scripts before or after the main pipeline.
```csharp
public interface IScriptRunner
{
Task ExecuteAsync(CancellationToken cancellationToken = default);
string ScriptName { get; }
}
```
**Common scripts (factory methods):**
- `CommonScripts.DisableIndexes(factory, tableName)`
- `CommonScripts.RebuildIndexes(factory, tableName)`
- `CommonScripts.UpdateStatistics(factory, tableName)`
- `CommonScripts.CustomSql(factory, sql, name)`
### Pipeline Results
```csharp
public record PipelineResult(
bool Success,
long TotalRows,
TimeSpan Elapsed,
IReadOnlyList<StepResult> Steps,
Exception? Error = null);
public record StepResult(
string StepName,
string StepType, // "Source", "Transform", "Destination", "Script"
long RowsAffected,
TimeSpan Elapsed);
```
## EtlPipeline Orchestration
```csharp
public class EtlPipeline
{
public string PipelineName { get; }
public async Task<PipelineResult> ExecuteAsync(CancellationToken cancellationToken = default)
{
// 1. Run pre-scripts (fail-fast)
// 2. Open source, get IDataReader
// 3. Chain transformers (decorators)
// 4. Write to destination
// 5. Run post-scripts
// 6. Return PipelineResult with all step metrics
}
}
```
**Error handling:**
- Fail-fast on any error
- No cross-batch transactions (per-batch commits)
- Caller marks overall sync operation as failed
- Restart from beginning on failure (no resume)
## Pipeline Construction
### Fluent Builder
```csharp
var pipeline = new EtlPipelineBuilder()
.WithName("WorkOrderSync")
.WithSource(new DbQuerySource(connFactory, "JDE", workOrderQuery))
.WithTransformer(new JdeDateTransformer("UPMJ", "TDAY", "UpdatedAt"))
.WithDestination(new DbBulkMergeDestination(connFactory, "WorkOrder", ["OrderNumber"]))
.WithPreScript(CommonScripts.DisableIndexes(connFactory, "WorkOrder"))
.WithPostScript(CommonScripts.RebuildIndexes(connFactory, "WorkOrder"))
.WithLogger(logger)
.Build();
var result = await pipeline.ExecuteAsync(cancellationToken);
```
### Configuration-Based
```json
{
"name": "WorkOrderSync",
"source": {
"type": "db",
"connectionName": "JDE",
"query": "SELECT * FROM F4801 WHERE UPMJ >= :minDate"
},
"transformers": [
{ "type": "jde-date", "options": { "dateColumn": "UPMJ", "timeColumn": "TDAY", "outputColumn": "UpdatedAt" } }
],
"destination": {
"type": "bulk-merge",
"tableName": "WorkOrder",
"matchColumns": ["OrderNumber"],
"batchSize": 10000
},
"preScripts": [
{ "type": "disable-indexes", "tableName": "WorkOrder" }
],
"postScripts": [
{ "type": "rebuild-indexes", "tableName": "WorkOrder" }
]
}
```
Hydrated via `EtlPipelineFactory.CreateFromConfig(config)`.
## Project Structure
```
NEW/src/JdeScoping.DataSync/
├── Contracts/
│ ├── IImportSource.cs
│ ├── IDataTransformer.cs
│ ├── IImportDestination.cs
│ ├── IScriptRunner.cs
│ └── ITransformerRegistry.cs
├── Etl/
│ ├── Pipeline/
│ │ ├── EtlPipeline.cs
│ │ ├── EtlPipelineBuilder.cs
│ │ └── EtlPipelineFactory.cs
│ ├── Sources/
│ │ ├── DbQuerySource.cs
│ │ └── FileSource.cs
│ ├── Transformers/
│ │ ├── DataTransformerBase.cs
│ │ ├── TransformingDataReader.cs
│ │ ├── JdeDateTransformer.cs
│ │ ├── ColumnRenameTransformer.cs
│ │ ├── ColumnDropTransformer.cs
│ │ └── TransformerRegistry.cs
│ ├── Destinations/
│ │ ├── DbBulkImportDestination.cs
│ │ └── DbBulkMergeDestination.cs
│ ├── Scripts/
│ │ ├── SqlScriptRunner.cs
│ │ └── CommonScripts.cs
│ └── Results/
│ ├── PipelineResult.cs
│ ├── StepResult.cs
│ └── DestinationResult.cs
└── Config/
└── PipelineConfig.cs
```
## Integration Strategy
1. **Parallel existence** - New ETL pipeline lives alongside existing `IDataFetcher<T>` + `IBulkMergeHelper`
2. **Gradual migration** - Convert one sync operation at a time to new pipeline
3. **Shared infrastructure** - Both use `IDbConnectionFactory`, logging, etc.
4. **Deprecation path** - Once all syncs migrated, remove old `Fetchers/` and source generator
## DI Registration
```csharp
public static class EtlServiceCollectionExtensions
{
public static IServiceCollection AddEtlPipeline(this IServiceCollection services)
{
services.AddSingleton<ITransformerRegistry, TransformerRegistry>();
services.AddTransient<EtlPipelineFactory>();
return services;
}
}
```
## Design Decisions
Based on Codex MCP review, the following decisions were made:
### Atomicity & Consistency
| Concern | Decision | Rationale |
|---------|----------|-----------|
| Full refresh partial visibility | **Accept** | Caller retries on failure; consumers tolerate brief inconsistency |
| Orphaned disabled indexes | **Accept risk** | Next successful run rebuilds; manual intervention rare |
| Temp table connection lifetime | **Destination owns connection** | `DbBulkMergeDestination` holds single connection for all batches |
### Implementation Requirements
1. **TransformingDataReader schema accuracy** - Transformers MUST correctly implement:
- `GetSchemaTable()` - Accurate after column changes
- `GetName(ordinal)` - Reflects renames
- `GetOrdinal(name)` - Works with new names
- `GetFieldType(ordinal)` - Correct after type transforms
- `FieldCount` - Accurate after drops/adds
2. **Transform step row counts** - `StepResult.RowsAffected` is `0` for transform steps (inline decorators). Only destination reports actual rows processed.
3. **Cancellation handling** - Destinations use `DbDataReader.ReadAsync()` internally, checking cancellation between rows. Source `IDataReader` interface stays synchronous for ADO.NET compatibility.
4. **Schema validation** - Optional `ValidateSchema` step available to compare source columns against expected schema before pipeline runs. Not mandatory.
5. **MERGE duplicate keys** - Source data MUST have unique keys for match columns. Pipeline does not dedupe; caller's responsibility to ensure uniqueness.
### Connection Lifetime Rules
```
DbQuerySource:
- Opens connection in ReadDataAsync()
- Holds connection until DisposeAsync()
- Connection closed when pipeline completes or fails
DbBulkImportDestination:
- Opens connection in WriteAsync()
- Holds for all batches
- Closes on completion or failure
DbBulkMergeDestination:
- Opens connection in WriteAsync()
- Creates #temp table
- Holds same connection for all batch cycles
- Drops temp table in finally block
- Closes connection on completion or failure
```
## Open Questions
1. Should `DbQuerySource` support parameterized queries with different parameter styles (`:param` for Oracle, `@param` for SQL Server)?
2. Should we support multiple destinations per pipeline (e.g., write to both current and history tables)?
3. How should pipeline configurations be stored and loaded (appsettings.json, separate files, database)?