ec4c8fab87
Move configuration options from Core/DataAccess/DataSync/ExcelIO to dedicated Options folders within each project for better organization. Update all references and tests accordingly.
335 lines
11 KiB
Markdown
335 lines
11 KiB
Markdown
# ETL Pipeline Design for DataSync
|
|
|
|
**Date:** 2026-01-03
|
|
**Status:** Reviewed (Codex MCP)
|
|
**Purpose:** Replace the existing strongly-typed fetcher + source-generated DataReader + BulkMergeHelper system with a flexible, configuration-driven ETL pipeline.
|
|
|
|
## Problem Statement
|
|
|
|
The current DataSync system requires:
|
|
1. A source-generated `IDataReader` implementation per entity type
|
|
2. Strongly-typed `IDataFetcher<T>` classes for each data source
|
|
3. Code changes to modify queries or add new sync operations
|
|
|
|
**Primary drivers for change:**
|
|
- **Flexibility** - Enable ad-hoc queries and schema changes without code modifications
|
|
- **Observability** - Better error handling and row count tracking across pipeline steps
|
|
|
|
## Design Overview
|
|
|
|
```
|
|
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
|
|
│ ImportSource│───▶│ Transformer │───▶│ Destination │
|
|
│ (IDataReader) │ (decorates) │ │ (bulk ops) │
|
|
└─────────────┘ └──────────────┘ └─────────────┘
|
|
▲ │
|
|
│ ▼
|
|
┌──────┴──────┐ ┌──────────────┐
|
|
│ Pre-Scripts │ │ Post-Scripts │
|
|
└─────────────┘ └──────────────┘
|
|
```
|
|
|
|
All steps report to `PipelineResult` with timing and row counts.
|
|
|
|
## Core Interfaces
|
|
|
|
### IImportSource
|
|
|
|
Reads data from a source system, returns `IDataReader` for streaming.
|
|
|
|
```csharp
|
|
public interface IImportSource : IAsyncDisposable
|
|
{
|
|
Task<IDataReader> ReadDataAsync(CancellationToken cancellationToken = default);
|
|
string SourceName { get; }
|
|
}
|
|
```
|
|
|
|
**Implementations:**
|
|
- `DbQuerySource` - Executes SQL against Oracle/Sybase/SQL Server via `IDbConnectionFactory`
|
|
- `FileSource` - Reads CSV files with configurable delimiter/header options
|
|
|
|
### IDataTransformer
|
|
|
|
Wraps an `IDataReader` and applies transformations (column rename, drop, type conversion).
|
|
|
|
```csharp
|
|
public interface IDataTransformer
|
|
{
|
|
IDataReader Transform(IDataReader source);
|
|
string TransformerName { get; }
|
|
}
|
|
```
|
|
|
|
**Implementations:**
|
|
- `JdeDateTransformer` - Merges Julian date + time columns into `DateTime`
|
|
- `ColumnRenameTransformer` - Renames columns
|
|
- `ColumnDropTransformer` - Removes columns from output
|
|
- `ColumnAddTransformer` - Adds computed columns
|
|
|
|
Transformers are registered by name in `ITransformerRegistry` for config-based hydration.
|
|
|
|
### IImportDestination
|
|
|
|
Writes data to a target table using bulk operations.
|
|
|
|
```csharp
|
|
public interface IImportDestination
|
|
{
|
|
Task<DestinationResult> WriteAsync(
|
|
IDataReader source,
|
|
CancellationToken cancellationToken = default);
|
|
string DestinationName { get; }
|
|
}
|
|
|
|
public record DestinationResult(
|
|
long RowsProcessed,
|
|
int BatchCount,
|
|
TimeSpan Elapsed);
|
|
```
|
|
|
|
**Implementations:**
|
|
|
|
#### DbBulkImportDestination (Full Refresh)
|
|
1. Truncate destination table
|
|
2. Bulk copy all data in batches
|
|
3. Per-batch commits
|
|
|
|
#### DbBulkMergeDestination (Incremental)
|
|
1. Create temp table from destination schema
|
|
2. For each batch:
|
|
- Bulk copy to temp table
|
|
- Execute MERGE to destination
|
|
- Truncate temp table
|
|
3. Drop temp table on completion
|
|
|
|
**Configuration options:**
|
|
- `tableName` - Target table
|
|
- `matchColumns` - PK columns for MERGE ON clause
|
|
- `updateColumns` - Columns to update (default: all non-match columns)
|
|
- `batchSize` - Rows per batch (default: 10000)
|
|
|
|
### IScriptRunner
|
|
|
|
Executes SQL scripts before or after the main pipeline.
|
|
|
|
```csharp
|
|
public interface IScriptRunner
|
|
{
|
|
Task ExecuteAsync(CancellationToken cancellationToken = default);
|
|
string ScriptName { get; }
|
|
}
|
|
```
|
|
|
|
**Common scripts (factory methods):**
|
|
- `CommonScripts.DisableIndexes(factory, tableName)`
|
|
- `CommonScripts.RebuildIndexes(factory, tableName)`
|
|
- `CommonScripts.UpdateStatistics(factory, tableName)`
|
|
- `CommonScripts.CustomSql(factory, sql, name)`
|
|
|
|
### Pipeline Results
|
|
|
|
```csharp
|
|
public record PipelineResult(
|
|
bool Success,
|
|
long TotalRows,
|
|
TimeSpan Elapsed,
|
|
IReadOnlyList<StepResult> Steps,
|
|
Exception? Error = null);
|
|
|
|
public record StepResult(
|
|
string StepName,
|
|
string StepType, // "Source", "Transform", "Destination", "Script"
|
|
long RowsAffected,
|
|
TimeSpan Elapsed);
|
|
```
|
|
|
|
## EtlPipeline Orchestration
|
|
|
|
```csharp
|
|
public class EtlPipeline
|
|
{
|
|
public string PipelineName { get; }
|
|
|
|
public async Task<PipelineResult> ExecuteAsync(CancellationToken cancellationToken = default)
|
|
{
|
|
// 1. Run pre-scripts (fail-fast)
|
|
// 2. Open source, get IDataReader
|
|
// 3. Chain transformers (decorators)
|
|
// 4. Write to destination
|
|
// 5. Run post-scripts
|
|
// 6. Return PipelineResult with all step metrics
|
|
}
|
|
}
|
|
```
|
|
|
|
**Error handling:**
|
|
- Fail-fast on any error
|
|
- No cross-batch transactions (per-batch commits)
|
|
- Caller marks overall sync operation as failed
|
|
- Restart from beginning on failure (no resume)
|
|
|
|
## Pipeline Construction
|
|
|
|
### Fluent Builder
|
|
|
|
```csharp
|
|
var pipeline = new EtlPipelineBuilder()
|
|
.WithName("WorkOrderSync")
|
|
.WithSource(new DbQuerySource(connFactory, "JDE", workOrderQuery))
|
|
.WithTransformer(new JdeDateTransformer("UPMJ", "TDAY", "UpdatedAt"))
|
|
.WithDestination(new DbBulkMergeDestination(connFactory, "WorkOrder", ["OrderNumber"]))
|
|
.WithPreScript(CommonScripts.DisableIndexes(connFactory, "WorkOrder"))
|
|
.WithPostScript(CommonScripts.RebuildIndexes(connFactory, "WorkOrder"))
|
|
.WithLogger(logger)
|
|
.Build();
|
|
|
|
var result = await pipeline.ExecuteAsync(cancellationToken);
|
|
```
|
|
|
|
### Configuration-Based
|
|
|
|
```json
|
|
{
|
|
"name": "WorkOrderSync",
|
|
"source": {
|
|
"type": "db",
|
|
"connectionName": "JDE",
|
|
"query": "SELECT * FROM F4801 WHERE UPMJ >= :minDate"
|
|
},
|
|
"transformers": [
|
|
{ "type": "jde-date", "options": { "dateColumn": "UPMJ", "timeColumn": "TDAY", "outputColumn": "UpdatedAt" } }
|
|
],
|
|
"destination": {
|
|
"type": "bulk-merge",
|
|
"tableName": "WorkOrder",
|
|
"matchColumns": ["OrderNumber"],
|
|
"batchSize": 10000
|
|
},
|
|
"preScripts": [
|
|
{ "type": "disable-indexes", "tableName": "WorkOrder" }
|
|
],
|
|
"postScripts": [
|
|
{ "type": "rebuild-indexes", "tableName": "WorkOrder" }
|
|
]
|
|
}
|
|
```
|
|
|
|
Hydrated via `EtlPipelineFactory.CreateFromConfig(config)`.
|
|
|
|
## Project Structure
|
|
|
|
```
|
|
NEW/src/JdeScoping.DataSync/
|
|
├── Contracts/
|
|
│ ├── IImportSource.cs
|
|
│ ├── IDataTransformer.cs
|
|
│ ├── IImportDestination.cs
|
|
│ ├── IScriptRunner.cs
|
|
│ └── ITransformerRegistry.cs
|
|
├── Etl/
|
|
│ ├── Pipeline/
|
|
│ │ ├── EtlPipeline.cs
|
|
│ │ ├── EtlPipelineBuilder.cs
|
|
│ │ └── EtlPipelineFactory.cs
|
|
│ ├── Sources/
|
|
│ │ ├── DbQuerySource.cs
|
|
│ │ └── FileSource.cs
|
|
│ ├── Transformers/
|
|
│ │ ├── DataTransformerBase.cs
|
|
│ │ ├── TransformingDataReader.cs
|
|
│ │ ├── JdeDateTransformer.cs
|
|
│ │ ├── ColumnRenameTransformer.cs
|
|
│ │ ├── ColumnDropTransformer.cs
|
|
│ │ └── TransformerRegistry.cs
|
|
│ ├── Destinations/
|
|
│ │ ├── DbBulkImportDestination.cs
|
|
│ │ └── DbBulkMergeDestination.cs
|
|
│ ├── Scripts/
|
|
│ │ ├── SqlScriptRunner.cs
|
|
│ │ └── CommonScripts.cs
|
|
│ └── Results/
|
|
│ ├── PipelineResult.cs
|
|
│ ├── StepResult.cs
|
|
│ └── DestinationResult.cs
|
|
└── Config/
|
|
└── PipelineConfig.cs
|
|
```
|
|
|
|
## Integration Strategy
|
|
|
|
1. **Parallel existence** - New ETL pipeline lives alongside existing `IDataFetcher<T>` + `IBulkMergeHelper`
|
|
2. **Gradual migration** - Convert one sync operation at a time to new pipeline
|
|
3. **Shared infrastructure** - Both use `IDbConnectionFactory`, logging, etc.
|
|
4. **Deprecation path** - Once all syncs migrated, remove old `Fetchers/` and source generator
|
|
|
|
## DI Registration
|
|
|
|
```csharp
|
|
public static class EtlServiceCollectionExtensions
|
|
{
|
|
public static IServiceCollection AddEtlPipeline(this IServiceCollection services)
|
|
{
|
|
services.AddSingleton<ITransformerRegistry, TransformerRegistry>();
|
|
services.AddTransient<EtlPipelineFactory>();
|
|
return services;
|
|
}
|
|
}
|
|
```
|
|
|
|
## Design Decisions
|
|
|
|
Based on Codex MCP review, the following decisions were made:
|
|
|
|
### Atomicity & Consistency
|
|
|
|
| Concern | Decision | Rationale |
|
|
|---------|----------|-----------|
|
|
| Full refresh partial visibility | **Accept** | Caller retries on failure; consumers tolerate brief inconsistency |
|
|
| Orphaned disabled indexes | **Accept risk** | Next successful run rebuilds; manual intervention rare |
|
|
| Temp table connection lifetime | **Destination owns connection** | `DbBulkMergeDestination` holds single connection for all batches |
|
|
|
|
### Implementation Requirements
|
|
|
|
1. **TransformingDataReader schema accuracy** - Transformers MUST correctly implement:
|
|
- `GetSchemaTable()` - Accurate after column changes
|
|
- `GetName(ordinal)` - Reflects renames
|
|
- `GetOrdinal(name)` - Works with new names
|
|
- `GetFieldType(ordinal)` - Correct after type transforms
|
|
- `FieldCount` - Accurate after drops/adds
|
|
|
|
2. **Transform step row counts** - `StepResult.RowsAffected` is `0` for transform steps (inline decorators). Only destination reports actual rows processed.
|
|
|
|
3. **Cancellation handling** - Destinations use `DbDataReader.ReadAsync()` internally, checking cancellation between rows. Source `IDataReader` interface stays synchronous for ADO.NET compatibility.
|
|
|
|
4. **Schema validation** - Optional `ValidateSchema` step available to compare source columns against expected schema before pipeline runs. Not mandatory.
|
|
|
|
5. **MERGE duplicate keys** - Source data MUST have unique keys for match columns. Pipeline does not dedupe; caller's responsibility to ensure uniqueness.
|
|
|
|
### Connection Lifetime Rules
|
|
|
|
```
|
|
DbQuerySource:
|
|
- Opens connection in ReadDataAsync()
|
|
- Holds connection until DisposeAsync()
|
|
- Connection closed when pipeline completes or fails
|
|
|
|
DbBulkImportDestination:
|
|
- Opens connection in WriteAsync()
|
|
- Holds for all batches
|
|
- Closes on completion or failure
|
|
|
|
DbBulkMergeDestination:
|
|
- Opens connection in WriteAsync()
|
|
- Creates #temp table
|
|
- Holds same connection for all batch cycles
|
|
- Drops temp table in finally block
|
|
- Closes connection on completion or failure
|
|
```
|
|
|
|
## Open Questions
|
|
|
|
1. Should `DbQuerySource` support parameterized queries with different parameter styles (`:param` for Oracle, `@param` for SQL Server)?
|
|
2. Should we support multiple destinations per pipeline (e.g., write to both current and history tables)?
|
|
3. How should pipeline configurations be stored and loaded (appsettings.json, separate files, database)?
|