docs: add ETL Pipeline Phase 2 design document

Design addresses 7 issues identified in Phase 1 review:
- Bulk copy column mapping (by name, intersect with destination)
- Schema-qualified table names (parse tableName, QUOTENAME)
- Configurable timeouts (single commandTimeoutSeconds)
- JDE date sentinel (1900-01-01 default for invalid dates)
- TransformingDataReader ordinal mapping (MapOrdinal on interface)
- Transformer lifecycle (document single-use, no auto-dispose)
- Column rename collision validation (fail-fast on duplicates)

Reviewed by Codex MCP with all issues addressed.
This commit is contained in:
Joseph Doherty
2026-01-03 10:21:02 -05:00
parent 27f84fa3c1
commit 7ae2cd4882
@@ -0,0 +1,467 @@
# ETL Pipeline Phase 2 Design
**Date:** 2026-01-03
**Status:** Reviewed (Codex MCP)
**Purpose:** Extend the ETL pipeline with feature improvements: column mapping, schema-qualified tables, configurable timeouts, date validation, ordinal mapping fixes, and collision detection.
## Problem Statement
Phase 1 implemented the core ETL pipeline. Phase 2 addresses limitations identified during review:
**Feature Gaps:**
1. Bulk merge relies on ordinal matching - source column order must match destination
2. Schema-qualified table names (`dbo.Table`) not fully supported in scripts
3. Bulk merge timeouts not configurable
4. JDE date parsing fails on invalid dates
**Correctness Issues:**
5. `TransformingDataReader` bypasses transformer mappings for GetBytes/GetChars/GetData
6. Column rename collisions not validated
## Design Decisions
| Issue | Approach |
|-------|----------|
| Column mapping for bulk ops | Explicit `SqlBulkCopy.ColumnMappings`, intersect with destination schema |
| Schema-qualified tables | Parse tableName for schema, default `"dbo"` |
| Configurable timeouts | Single `commandTimeoutSeconds` for all SQL operations |
| Invalid JDE dates | Return sentinel date (`1900-01-01` default) |
| TransformingDataReader bypasses | `MapOrdinal` on interface + base class overrides for computed columns |
| Column rename collisions | Throw on collision during init (including pre-existing duplicates) |
| Transformer disposal | Do not dispose transformers (single-use documented) |
## Detailed Design
### 1. Bulk Copy Column Mapping
**Problem:** `DbBulkMergeDestination` and `DbBulkImportDestination` rely on ordinal matching between source and destination. Transformers that rename or reorder columns break bulk copy. Extra columns in source cause errors.
**Solution:** Query destination schema and map only columns that exist in both source and destination.
```csharp
// Query destination columns once
private async Task<HashSet<string>> GetDestinationColumnsAsync(
SqlConnection connection, string tableName, string schemaName, CancellationToken ct)
{
var sql = @"SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = @tableName AND TABLE_SCHEMA = @schemaName";
var columns = await connection.QueryAsync<string>(sql, new { tableName, schemaName });
return columns.ToHashSet(StringComparer.OrdinalIgnoreCase);
}
// In ProcessBatchAsync
var destColumns = await GetDestinationColumnsAsync(connection, _tableName, _schemaName, ct);
using var bulkCopy = new SqlBulkCopy(connection)
{
DestinationTableName = tempTableName,
BatchSize = batch.Rows.Count,
BulkCopyTimeout = _commandTimeoutSeconds
};
// Map only columns that exist in both source and destination
for (int i = 0; i < source.FieldCount; i++)
{
var columnName = source.GetName(i);
if (destColumns.Contains(columnName))
{
bulkCopy.ColumnMappings.Add(columnName, columnName);
}
}
```
**Benefits:**
1. Decouples ordinal position from column identity
2. Extra source columns are silently ignored (e.g., computed columns for logging)
3. Missing destination columns cause clear error from SqlBulkCopy
Both `DbBulkImportDestination` and `DbBulkMergeDestination` will use this pattern.
### 2. Schema-Qualified Table Names
**Problem:** `CommonScripts` methods incorrectly handle `dbo.WorkOrder` format - both the bracketing and the `sys.tables` lookup fail.
**Solution:** Parse tableName to extract schema if present, use `QUOTENAME()` for safe dynamic SQL.
```csharp
public static class CommonScripts
{
/// <summary>
/// Parses a table name, extracting schema if present.
/// Supports: "Table", "dbo.Table", "[dbo].[Table]"
/// </summary>
private static (string Schema, string Table) ParseTableName(string tableName)
{
// Remove brackets for parsing
var cleaned = tableName.Replace("[", "").Replace("]", "");
var parts = cleaned.Split('.', 2);
return parts.Length == 2
? (parts[0], parts[1])
: ("dbo", parts[0]);
}
public static IScriptRunner DisableIndexes(
IDbConnectionFactory factory,
string tableName,
int timeoutSeconds = 300)
{
var (schema, table) = ParseTableName(tableName);
// Use QUOTENAME() for safe dynamic SQL - prevents injection
var sql = @"
DECLARE @sql NVARCHAR(MAX) = '';
DECLARE @fullTableName NVARCHAR(256) = QUOTENAME(@schemaName) + '.' + QUOTENAME(@tableName);
SELECT @sql = @sql + 'ALTER INDEX ' + QUOTENAME(i.name) + ' ON ' + @fullTableName + ' DISABLE;' + CHAR(13)
FROM sys.indexes i
INNER JOIN sys.tables t ON i.object_id = t.object_id
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE t.name = @tableName
AND s.name = @schemaName
AND i.type = 2
AND i.is_disabled = 0;
IF LEN(@sql) > 0 EXEC sp_executesql @sql;";
return new SqlScriptRunner(factory, sql, $"DisableIndexes:{schema}.{table}",
parameters: new { tableName = table, schemaName = schema },
timeoutSeconds: timeoutSeconds);
}
// Same pattern for RebuildIndexes, UpdateStatistics
}
```
**Changes:**
- Parse tableName to extract schema (supports `"Table"`, `"dbo.Table"`, `"[dbo].[Table]"`)
- Use `QUOTENAME()` in dynamic SQL (prevents injection, handles special characters)
- Join `sys.schemas` to properly filter by schema
- `SqlScriptRunner` needs to accept parameters (new capability)
### 3. Configurable Timeouts
**Problem:** `SqlBulkCopy` uses 30-second default timeout, insufficient for large tables. MERGE and other SQL commands also need timeout configuration.
**Solution:** Single `commandTimeoutSeconds` parameter applied to all SQL operations (bulk copy, MERGE, TRUNCATE, etc.).
**Destination changes:**
```csharp
public class DbBulkMergeDestination : IImportDestination
{
private const int DefaultCommandTimeoutSeconds = 600; // 10 minutes
private readonly int _commandTimeoutSeconds;
public DbBulkMergeDestination(
IDbConnectionFactory connectionFactory,
string tableName,
string[] matchColumns,
string[]? updateColumns = null,
int batchSize = 0,
int commandTimeoutSeconds = 0) // 0 = use default (600)
{
_commandTimeoutSeconds = commandTimeoutSeconds > 0
? commandTimeoutSeconds
: DefaultCommandTimeoutSeconds;
}
}
// Usage in WriteAsync:
using var bulkCopy = new SqlBulkCopy(connection)
{
BulkCopyTimeout = _commandTimeoutSeconds
};
await using var cmd = connection.CreateCommand();
cmd.CommandTimeout = _commandTimeoutSeconds; // For MERGE, TRUNCATE, etc.
```
**Pipeline builder changes:**
```csharp
public class EtlPipelineBuilder
{
private int _defaultCommandTimeoutSeconds = 600;
public EtlPipelineBuilder WithCommandTimeout(TimeSpan timeout)
{
if (timeout < TimeSpan.Zero || timeout > TimeSpan.FromHours(24))
throw new ArgumentOutOfRangeException(nameof(timeout),
"Timeout must be between 0 and 24 hours.");
_defaultCommandTimeoutSeconds = (int)timeout.TotalSeconds;
return this;
}
}
```
**Key points:**
- Single timeout applies to all SQL operations (bulk copy, MERGE, TRUNCATE, CREATE TABLE)
- Constructor parameter overrides pipeline default
- Validation: throws `ArgumentOutOfRangeException` for negative or > 24 hours
- Default: 10 minutes (600 seconds)
### 4. JDE Date Sentinel Handling
**Problem:** `JdeDateTransformer.ParseJdeDateTime` assumes valid CYYDDD format. Invalid values produce incorrect dates or exceptions.
**Solution:** Validate date components and return configurable sentinel on failure. Default sentinel is `1900-01-01` (compatible with SQL `datetime` and `datetime2`).
```csharp
public class JdeDateTransformer : DataTransformerBase
{
/// <summary>
/// Default sentinel for invalid dates. 1900-01-01 is compatible with
/// both SQL datetime (min 1753-01-01) and datetime2.
/// </summary>
public static readonly DateTime DefaultInvalidDateSentinel = new(1900, 1, 1);
private readonly DateTime _invalidDateSentinel;
public JdeDateTransformer(
string dateColumn,
string timeColumn,
string outputColumn,
DateTime? invalidDateSentinel = null) // null = 1900-01-01
{
_invalidDateSentinel = invalidDateSentinel ?? DefaultInvalidDateSentinel;
}
public static DateTime ParseJdeDateTime(decimal julianDate, decimal time, DateTime sentinel)
{
var dateInt = (int)julianDate;
if (dateInt <= 0) return sentinel;
var century = dateInt / 100000;
var year = (dateInt / 1000) % 100;
var dayOfYear = dateInt % 1000;
// Validate components
if (century < 0 || century > 1) return sentinel;
if (year < 0 || year > 99) return sentinel;
if (dayOfYear < 1 || dayOfYear > 366) return sentinel;
var fullYear = (century == 0 ? 1900 : 2000) + year;
// Validate day of year for the specific year
var daysInYear = DateTime.IsLeapYear(fullYear) ? 366 : 365;
if (dayOfYear > daysInYear) return sentinel;
var date = new DateTime(fullYear, 1, 1).AddDays(dayOfYear - 1);
// Validate time components
var timeInt = (int)time;
var hours = timeInt / 10000;
var minutes = (timeInt / 100) % 100;
var seconds = timeInt % 100;
if (hours < 0 || hours > 23) return sentinel;
if (minutes < 0 || minutes > 59) return sentinel;
if (seconds < 0 || seconds > 59) return sentinel;
return date.AddHours(hours).AddMinutes(minutes).AddSeconds(seconds);
}
}
```
**Key points:**
- Default sentinel `1900-01-01` is compatible with SQL `datetime` (min 1753-01-01)
- Validates century (0-1), year (0-99), day of year (1-366), and time components
- Returns sentinel for any invalid component rather than throwing
### 5. TransformingDataReader Ordinal Mapping
**Problem:** `GetBytes`, `GetChars`, and `GetData` bypass transformer ordinal mapping. For computed columns (like JDE date output), these methods would return source data instead of transformed data.
**Solution:** Add `MapOrdinal` to `IDataTransformer` interface. Override `GetBytes`/`GetChars`/`GetData`/`GetDataTypeName` in `DataTransformerBase` to handle computed columns.
**IDataTransformer interface changes:**
```csharp
public interface IDataTransformer
{
IDataReader Transform(IDataReader source);
string TransformerName { get; }
/// <summary>
/// Maps a transformed ordinal to the source ordinal.
/// Returns -1 for computed columns that have no source ordinal.
/// </summary>
int MapOrdinal(int transformedOrdinal, IDataReader source);
}
```
**DataTransformerBase changes:**
```csharp
public abstract class DataTransformerBase : IDataTransformer
{
/// <summary>
/// Maps a transformed ordinal to the source ordinal.
/// Returns -1 for computed columns. Default returns ordinal unchanged.
/// </summary>
public virtual int MapOrdinal(int transformedOrdinal, IDataReader source)
=> transformedOrdinal;
/// <summary>
/// Gets bytes for the specified ordinal. Throws for computed columns.
/// </summary>
public virtual long GetBytes(int ordinal, long fieldOffset, byte[]? buffer,
int bufferOffset, int length, IDataReader source)
{
var sourceOrdinal = MapOrdinal(ordinal, source);
if (sourceOrdinal < 0)
throw new NotSupportedException(
$"GetBytes not supported for computed column at ordinal {ordinal}.");
return source.GetBytes(sourceOrdinal, fieldOffset, buffer, bufferOffset, length);
}
/// <summary>
/// Gets chars for the specified ordinal. Throws for computed columns.
/// </summary>
public virtual long GetChars(int ordinal, long fieldOffset, char[]? buffer,
int bufferOffset, int length, IDataReader source)
{
var sourceOrdinal = MapOrdinal(ordinal, source);
if (sourceOrdinal < 0)
throw new NotSupportedException(
$"GetChars not supported for computed column at ordinal {ordinal}.");
return source.GetChars(sourceOrdinal, fieldOffset, buffer, bufferOffset, length);
}
// Similar for GetData, GetDataTypeName
}
```
**JdeDateTransformer override:**
```csharp
public override int MapOrdinal(int transformedOrdinal, IDataReader source)
{
var sourceOrdinal = _ordinalMap![transformedOrdinal];
// The output DateTime column maps to the date source ordinal,
// but GetBytes/GetChars make no sense for it
return sourceOrdinal == _dateOrdinal ? -1 : sourceOrdinal;
}
public override string GetDataTypeName(int ordinal, IDataReader source)
{
var sourceOrdinal = _ordinalMap![ordinal];
return sourceOrdinal == _dateOrdinal ? "datetime" : source.GetDataTypeName(sourceOrdinal);
}
```
**TransformingDataReader delegates to transformer:**
```csharp
public long GetBytes(int i, long fieldOffset, byte[]? buffer, int bufferoffset, int length)
=> _transformer.GetBytes(i, fieldOffset, buffer, bufferoffset, length, _source);
public string GetDataTypeName(int i)
=> _transformer.GetDataTypeName(i, _source);
```
### 6. Transformer Lifecycle
**Problem:** Should transformers be disposed by the reader?
**Decision:** Do not dispose transformers. Document single-use pattern.
**Rationale:**
- Current transformers are stateless - no resources to dispose
- Disposing would break if transformer is reused across readers
- Adding `IDisposable` now would require all custom transformers to implement it
- YAGNI - we can add disposal later if a stateful transformer is needed
**Documentation requirement:**
```csharp
/// <summary>
/// Base class for data transformers.
/// </summary>
/// <remarks>
/// Transformers are designed for single-use per reader. If a transformer needs
/// to hold disposable resources, it should implement IDisposable directly and
/// be disposed by the caller after pipeline completion.
/// </remarks>
public abstract class DataTransformerBase : IDataTransformer
{
// ...
}
```
**TransformingDataReader** continues to only dispose the source reader:
```csharp
public void Dispose() => _source.Dispose();
```
### 7. Column Rename Collision Validation
**Problem:** `ColumnRenameTransformer` doesn't detect when renames create duplicate column names.
**Solution:** Validate in `OnInitialize()` and throw with clear error message.
```csharp
public class ColumnRenameTransformer : DataTransformerBase
{
protected override void OnInitialize(IDataReader source)
{
_outputNames = new string[source.FieldCount];
_nameToOrdinal = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
for (int i = 0; i < source.FieldCount; i++)
{
var originalName = source.GetName(i);
var outputName = _renames.TryGetValue(originalName, out var newName)
? newName
: originalName;
// Collision detection (catches both rename collisions and pre-existing duplicates)
if (_nameToOrdinal.ContainsKey(outputName))
{
var existingOrdinal = _nameToOrdinal[outputName];
var existingOriginal = source.GetName(existingOrdinal);
throw new InvalidOperationException(
$"Column name collision: '{originalName}' → '{outputName}' conflicts with " +
$"'{existingOriginal}' (already at ordinal {existingOrdinal}). " +
$"Each output column name must be unique.");
}
_outputNames[i] = outputName;
_nameToOrdinal[outputName] = i;
}
}
}
```
**Behavior:**
- Throws on rename collisions (e.g., `A→B` when `B` exists)
- Also throws on pre-existing duplicate column names in source (schema requirement)
- Error message includes: which columns collide, at which ordinals
- Fails fast during pipeline setup, before any data is processed
## Files to Modify
| File | Changes |
|------|---------|
| `IDataTransformer.cs` | Add `MapOrdinal()` method |
| `DataTransformerBase.cs` | Add `MapOrdinal()`, `GetBytes()`, `GetChars()`, `GetData()`, `GetDataTypeName()` |
| `TransformingDataReader.cs` | Delegate binary methods to transformer |
| `JdeDateTransformer.cs` | Add sentinel parameter, validation, `MapOrdinal()` override, `GetDataTypeName()` |
| `ColumnRenameTransformer.cs` | Add collision detection |
| `ColumnDropTransformer.cs` | Add `MapOrdinal()` override |
| `DbBulkMergeDestination.cs` | Add column mappings, schema parsing, timeout parameter |
| `DbBulkImportDestination.cs` | Add column mappings, timeout parameter |
| `CommonScripts.cs` | Add `ParseTableName()`, use `QUOTENAME()`, parameterized SQL |
| `SqlScriptRunner.cs` | Add parameters support |
| `EtlPipelineBuilder.cs` | Add `WithCommandTimeout()` with validation |
## Breaking Changes
1. **IDataTransformer interface** - New `MapOrdinal()` method required (custom transformers must implement)
2. **JdeDateTransformer constructor** - New `invalidDateSentinel` parameter (has default `1900-01-01`)
3. **DbBulkMergeDestination/DbBulkImportDestination constructors** - New `commandTimeoutSeconds` parameter (has default)
**Behavioral changes:**
- `ColumnRenameTransformer` now throws on pre-existing duplicate column names (previously silent)
- Bulk copy now only maps columns that exist in destination (extra source columns ignored)