jdescopingtool/PLANS/2026-01-03-etl-documentation-implementation.md

# ETL Pipeline Documentation Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Create 6 documentation files for the ETL pipeline covering architecture, extension patterns, configuration, and troubleshooting.

**Architecture:** Each task creates one markdown file in `DOCUMENTATION/DataSync/`. Code snippets come directly from source files. Verification checks that links resolve and code matches source.

**Tech Stack:** Markdown, code from `NEW/src/JdeScoping.DataSync/Etl/`

---

### Task 1: Create DataSync folder and Overview.md

**Files:**
- Create: `DOCUMENTATION/DataSync/Overview.md`

**Step 1: Create the DataSync directory**

```bash
mkdir -p DOCUMENTATION/DataSync
```

**Step 2: Write Overview.md with architecture and core concepts**

Create `DOCUMENTATION/DataSync/Overview.md` with:

```markdown
# ETL Pipeline

The ETL pipeline streams data from enterprise sources (JDE, CMS) through transformations into SQL Server cache tables. It supports batched processing, pre/post scripts for index management, and detailed execution tracking.

## Architecture

```
┌─────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ IImportSource│───▶│ IDataTransformer │───▶│IImportDestination│
└─────────────┘    │   (chain of N)   │    └─────────────────┘
                   └──────────────────┘
       ▲                                           │
       │         ┌──────────────┐                  │
       └─────────│ Pre-Scripts  │                  ▼
                 └──────────────┘           ┌──────────────┐
                                            │ Post-Scripts │
                                            └──────────────┘
```

**Execution flow:**
1. Run pre-scripts (e.g., disable indexes)
2. Open source and get `IDataReader`
3. Apply transformer chain (each wraps the previous reader)
4. Write to destination (bulk copy or merge)
5. Run post-scripts (e.g., rebuild indexes)

## Core Contracts

### IImportSource

Provides data to the pipeline. Returns an `IDataReader` that streams rows.

```csharp
public interface IImportSource : IAsyncDisposable
{
    Task<IDataReader> ReadDataAsync(CancellationToken cancellationToken = default);
    string SourceName { get; }
}
```

### IDataTransformer

Modifies data during transfer. Wraps the source reader in a decorator.

```csharp
public interface IDataTransformer
{
    IDataReader Transform(IDataReader source);
    string TransformerName { get; }
    int MapOrdinal(int transformedOrdinal, IDataReader source);
}
```

### IImportDestination

Consumes data and writes to storage. Returns statistics about the operation.

```csharp
public interface IImportDestination
{
    Task<DestinationResult> WriteAsync(IDataReader source, CancellationToken cancellationToken = default);
    string DestinationName { get; }
}
```

### IScriptRunner

Executes SQL scripts before or after data transfer.

```csharp
public interface IScriptRunner
{
    Task ExecuteAsync(CancellationToken cancellationToken = default);
    string ScriptName { get; }
}
```

## Pipeline Execution

The `EtlPipeline` class orchestrates execution and tracks timing for each step:

```csharp
public async Task<PipelineResult> ExecuteAsync(CancellationToken cancellationToken = default)
{
    // 1. Run pre-scripts
    foreach (var script in _preScripts)
    {
        var stepResult = await RunScriptAsync(script, cancellationToken);
        steps.Add(stepResult);
    }

    // 2. Open source
    await using (_source)
    {
        var reader = await _source.ReadDataAsync(cancellationToken);

        // 3. Apply transformers
        foreach (var transformer in _transformers)
        {
            reader = transformer.Transform(reader);
        }

        // 4. Write to destination
        var destResult = await _destination.WriteAsync(reader, cancellationToken);
    }

    // 5. Run post-scripts
    foreach (var script in _postScripts)
    {
        var stepResult = await RunScriptAsync(script, cancellationToken);
    }

    return PipelineResult.Succeeded(totalRows, totalStopwatch.Elapsed, steps);
}
```

## Result Model

### PipelineResult

```csharp
public record PipelineResult(
    bool Success,
    long TotalRows,
    TimeSpan Elapsed,
    IReadOnlyList<StepResult> Steps,
    Exception? Error = null);
```

### StepResult

```csharp
public record StepResult(
    string StepName,
    string StepType,
    long RowsAffected,
    TimeSpan Elapsed);
```

### DestinationResult

```csharp
public record DestinationResult(
    long RowsProcessed,
    int BatchCount,
    TimeSpan Elapsed);
```

## Related Documentation

- [Sources](./Sources.md) - Writing custom data sources
- [Transformers](./Transformers.md) - Writing custom transformers
- [Destinations](./Destinations.md) - Writing destinations and scripts
- [Configuration](./Configuration.md) - Pipeline builder and DI setup
- [Troubleshooting](./Troubleshooting.md) - Debugging and performance
```

**Step 3: Verify the file was created**

```bash
ls -la DOCUMENTATION/DataSync/
```
Expected: `Overview.md` exists

**Step 4: Commit**

```bash
git add DOCUMENTATION/DataSync/Overview.md
git commit -m "docs: add ETL pipeline overview documentation"
```

---

### Task 2: Create Sources.md

**Files:**
- Create: `DOCUMENTATION/DataSync/Sources.md`

**Step 1: Write Sources.md with interface and DbQuerySource walkthrough**

Create `DOCUMENTATION/DataSync/Sources.md` with:

```markdown
# Data Sources

Sources provide data to the ETL pipeline by implementing `IImportSource`. They return an `IDataReader` that streams rows to transformers and destinations.

## Interface Contract

```csharp
public interface IImportSource : IAsyncDisposable
{
    Task<IDataReader> ReadDataAsync(CancellationToken cancellationToken = default);
    string SourceName { get; }
}
```

**Key requirements:**
- Implement `IAsyncDisposable` for connection cleanup
- Return a live `IDataReader` (not buffered) for memory efficiency
- `SourceName` is used in logging and `StepResult` tracking

## DbQuerySource Implementation

`DbQuerySource` executes a SQL query against the local cache database:

```csharp
public class DbQuerySource : IImportSource
{
    private readonly IDbConnectionFactory _connectionFactory;
    private readonly string _sql;
    private readonly object? _parameters;
    private readonly int _commandTimeout;
    private SqlConnection? _connection;
    private SqlCommand? _command;

    public string SourceName { get; }

    public DbQuerySource(
        IDbConnectionFactory connectionFactory,
        string sql,
        string? name = null,
        object? parameters = null,
        int commandTimeout = 3600)
    {
        _connectionFactory = connectionFactory;
        _sql = sql;
        _parameters = parameters;
        _commandTimeout = commandTimeout;
        SourceName = $"DbQuery:{name ?? "Query"}";
    }
```

### Reading data

The connection opens in `ReadDataAsync` and stays open until disposal:

```csharp
    public async Task<IDataReader> ReadDataAsync(CancellationToken cancellationToken = default)
    {
        _connection = await _connectionFactory.CreateLotFinderConnectionAsync(cancellationToken);
        _command = _connection.CreateCommand();
        _command.CommandText = _sql;
        _command.CommandTimeout = _commandTimeout;
        AddParameters(_command, _parameters);
        return await _command.ExecuteReaderAsync(cancellationToken);
    }
```

### Parameter handling

Parameters are added from an anonymous object using reflection:

```csharp
    private static void AddParameters(SqlCommand command, object? parameters)
    {
        if (parameters == null) return;

        var properties = parameters.GetType().GetProperties();
        foreach (var prop in properties)
        {
            var value = prop.GetValue(parameters) ?? DBNull.Value;
            command.Parameters.AddWithValue($"@{prop.Name}", value);
        }
    }
```

### Resource cleanup

Both the command and connection are disposed asynchronously:

```csharp
    public async ValueTask DisposeAsync()
    {
        if (_command != null)
        {
            await _command.DisposeAsync();
            _command = null;
        }
        if (_connection != null)
        {
            await _connection.DisposeAsync();
            _connection = null;
        }
    }
}
```

## Key Patterns

### Keep sources stateless until ReadDataAsync

Don't open connections or execute queries in the constructor. The source should be configurable without side effects until `ReadDataAsync` is called.

### Streaming, not buffering

Return a live `IDataReader` rather than loading all data into memory. This allows processing millions of rows without memory pressure.

### Use SourceName for diagnostics

Format: `"DbQuery:{table}"` or `"File:{filename}"`. This appears in logs and `StepResult.StepName`.

## Future source types

The interface supports additional source types not yet implemented:

- **File-based sources** - CSV, Excel files
- **API sources** - REST endpoints returning paged data
- **Oracle/Sybase sources** - Direct queries against JDE or CMS

Each would implement the same interface with different connection and reader implementations.

## Related Documentation

- [Overview](./Overview.md) - Pipeline architecture
- [Transformers](./Transformers.md) - Processing source data
- [Configuration](./Configuration.md) - Connection factory setup
```

**Step 2: Verify the file was created**

```bash
ls -la DOCUMENTATION/DataSync/Sources.md
```
Expected: File exists

**Step 3: Commit**

```bash
git add DOCUMENTATION/DataSync/Sources.md
git commit -m "docs: add ETL sources documentation"
```

---

### Task 3: Create Transformers.md

**Files:**
- Create: `DOCUMENTATION/DataSync/Transformers.md`

**Step 1: Write Transformers.md with interface, base class, and three transformer walkthroughs**

Create `DOCUMENTATION/DataSync/Transformers.md` with:

```markdown
# Data Transformers

Transformers modify data as it flows through the pipeline. They wrap the source `IDataReader` in a decorator, allowing column renaming, dropping, type conversion, and computed columns.

## Interface Contract

```csharp
public interface IDataTransformer
{
    IDataReader Transform(IDataReader source);
    string TransformerName { get; }
    int MapOrdinal(int transformedOrdinal, IDataReader source);
}
```

**Key methods:**
- `Transform()` - Wraps the source reader, returns a new reader with modifications
- `TransformerName` - Used in logging and `StepResult` tracking
- `MapOrdinal()` - Maps transformed ordinals to source ordinals. Returns `-1` for computed columns.

## DataTransformerBase

The base class provides default implementations and handles the decorator pattern:

```csharp
public abstract class DataTransformerBase : IDataTransformer
{
    public abstract string TransformerName { get; }

    public IDataReader Transform(IDataReader source)
    {
        ArgumentNullException.ThrowIfNull(source);
        OnInitialize(source);
        return new TransformingDataReader(source, this);
    }

    protected virtual void OnInitialize(IDataReader source) { }
```

### Default pass-through methods

Override only what you need to change:

```csharp
    public virtual int GetFieldCount(IDataReader source) => source.FieldCount;
    public virtual string GetName(int ordinal, IDataReader source) => source.GetName(ordinal);
    public virtual Type GetFieldType(int ordinal, IDataReader source) => source.GetFieldType(ordinal);
    public virtual object GetValue(int ordinal, IDataReader source) => source.GetValue(ordinal);
    public virtual int GetOrdinal(string name, IDataReader source) => source.GetOrdinal(name);
    public virtual bool IsDBNull(int ordinal, IDataReader source) => source.IsDBNull(ordinal);
    public virtual int MapOrdinal(int transformedOrdinal, IDataReader source) => transformedOrdinal;
```

### Binary method handling

Computed columns (where `MapOrdinal` returns `-1`) throw `NotSupportedException`:

```csharp
    public virtual long GetBytes(int ordinal, long fieldOffset, byte[]? buffer,
        int bufferOffset, int length, IDataReader source)
    {
        var sourceOrdinal = MapOrdinal(ordinal, source);
        if (sourceOrdinal < 0)
            throw new NotSupportedException(
                $"GetBytes not supported for computed column at ordinal {ordinal}.");
        return source.GetBytes(sourceOrdinal, fieldOffset, buffer, bufferOffset, length);
    }
```

## ColumnRenameTransformer

Renames columns without changing values or order:

```csharp
public class ColumnRenameTransformer : DataTransformerBase
{
    private readonly Dictionary<string, string> _renames;
    private string[]? _outputNames;
    private Dictionary<string, int>? _nameToOrdinal;

    public override string TransformerName => $"RenameColumns:{_renames.Count}";

    public ColumnRenameTransformer(params (string OldName, string NewName)[] renames)
    {
        _renames = renames.ToDictionary(
            r => r.OldName, r => r.NewName, StringComparer.OrdinalIgnoreCase);
    }
```

### Collision detection

The transformer validates that renames don't create duplicate column names:

```csharp
    protected override void OnInitialize(IDataReader source)
    {
        _outputNames = new string[source.FieldCount];
        _nameToOrdinal = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        for (int i = 0; i < source.FieldCount; i++)
        {
            var originalName = source.GetName(i);
            var outputName = _renames.TryGetValue(originalName, out var newName)
                ? newName : originalName;

            if (_nameToOrdinal.TryGetValue(outputName, out var existingOrdinal))
            {
                throw new InvalidOperationException(
                    $"Column name collision: '{originalName}' → '{outputName}' conflicts with " +
                    $"'{source.GetName(existingOrdinal)}' (already at ordinal {existingOrdinal}).");
            }

            _outputNames[i] = outputName;
            _nameToOrdinal[outputName] = i;
        }
    }
```

## ColumnDropTransformer

Removes specified columns from the output:

```csharp
public class ColumnDropTransformer : DataTransformerBase
{
    private readonly HashSet<string> _columnsToDrop;
    private int[]? _ordinalMap;
    private Dictionary<string, int>? _nameToOrdinal;

    public override string TransformerName => $"DropColumns:{string.Join(",", _columnsToDrop)}";

    public ColumnDropTransformer(params string[] columnsToDrop)
    {
        _columnsToDrop = new HashSet<string>(columnsToDrop, StringComparer.OrdinalIgnoreCase);
    }
```

### Ordinal mapping

Builds a map from output ordinals to source ordinals, excluding dropped columns:

```csharp
    protected override void OnInitialize(IDataReader source)
    {
        var ordinalList = new List<int>();
        _nameToOrdinal = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        for (int i = 0; i < source.FieldCount; i++)
        {
            var name = source.GetName(i);
            if (!_columnsToDrop.Contains(name))
            {
                _nameToOrdinal[name] = ordinalList.Count;
                ordinalList.Add(i);
            }
        }
        _ordinalMap = ordinalList.ToArray();
    }

    public override int MapOrdinal(int transformedOrdinal, IDataReader source)
        => _ordinalMap![transformedOrdinal];
```

## JdeDateTransformer

Combines JDE Julian date (CYYDDD) and time (HHMMSS) columns into a single `DateTime`:

```csharp
public class JdeDateTransformer : DataTransformerBase
{
    public static readonly DateTime DefaultInvalidDateSentinel = new(1900, 1, 1);

    private readonly string _dateColumn;
    private readonly string _timeColumn;
    private readonly string _outputColumn;
    private readonly DateTime _invalidDateSentinel;
```

### Computed column handling

The output `DateTime` column has no direct source ordinal, so `MapOrdinal` returns `-1`:

```csharp
    public override int MapOrdinal(int transformedOrdinal, IDataReader source)
    {
        var sourceOrdinal = _ordinalMap![transformedOrdinal];
        return sourceOrdinal == _dateOrdinal ? -1 : sourceOrdinal;
    }

    public override string GetDataTypeName(int ordinal, IDataReader source)
    {
        var sourceOrdinal = _ordinalMap![ordinal];
        return sourceOrdinal == _dateOrdinal ? "datetime" : source.GetDataTypeName(sourceOrdinal);
    }
```

### Date parsing with validation

Invalid dates return a configurable sentinel value (default: 1900-01-01):

```csharp
    public static DateTime ParseJdeDateTime(decimal julianDate, decimal time, DateTime sentinel)
    {
        var dateInt = (int)julianDate;
        if (dateInt <= 0) return sentinel;

        var century = dateInt / 100000;
        var year = (dateInt / 1000) % 100;
        var dayOfYear = dateInt % 1000;

        if (century < 0 || century > 1) return sentinel;
        if (year < 0 || year > 99) return sentinel;
        if (dayOfYear < 1 || dayOfYear > 366) return sentinel;

        var fullYear = (century == 0 ? 1900 : 2000) + year;
        var daysInYear = DateTime.IsLeapYear(fullYear) ? 366 : 365;
        if (dayOfYear > daysInYear) return sentinel;

        var date = new DateTime(fullYear, 1, 1).AddDays(dayOfYear - 1);

        // Parse time (HHMMSS format)
        var timeInt = (int)time;
        var hours = timeInt / 10000;
        var minutes = (timeInt / 100) % 100;
        var seconds = timeInt % 100;

        if (hours < 0 || hours > 23) return sentinel;
        if (minutes < 0 || minutes > 59) return sentinel;
        if (seconds < 0 || seconds > 59) return sentinel;

        return date.AddHours(hours).AddMinutes(minutes).AddSeconds(seconds);
    }
```

## Transformer Chaining

Transformers compose by wrapping each other. The pipeline applies them in order:

```csharp
foreach (var transformer in _transformers)
{
    reader = transformer.Transform(reader);
}
```

Each transformer sees the output of the previous one. Ordinal mappings accumulate through the chain.

## Validation in OnInitialize

Perform all validation in `OnInitialize()` to fail fast before processing data:

- Check that required columns exist
- Validate rename mappings don't create collisions
- Build ordinal maps for efficient lookup during row processing

## Related Documentation

- [Overview](./Overview.md) - Pipeline architecture
- [Sources](./Sources.md) - Data sources that feed transformers
- [Destinations](./Destinations.md) - Where transformed data goes
```

**Step 2: Verify the file was created**

```bash
ls -la DOCUMENTATION/DataSync/Transformers.md
```
Expected: File exists

**Step 3: Commit**

```bash
git add DOCUMENTATION/DataSync/Transformers.md
git commit -m "docs: add ETL transformers documentation"
```

---

### Task 4: Create Destinations.md

**Files:**
- Create: `DOCUMENTATION/DataSync/Destinations.md`

**Step 1: Write Destinations.md with interface, both destinations, and script patterns**

Create `DOCUMENTATION/DataSync/Destinations.md` with:

```markdown
# Destinations and Scripts

Destinations consume data from the pipeline and write it to storage. Scripts run SQL operations before or after data transfer, commonly for index management.

## IImportDestination Contract

```csharp
public interface IImportDestination
{
    Task<DestinationResult> WriteAsync(IDataReader source, CancellationToken cancellationToken = default);
    string DestinationName { get; }
}
```

**Key requirements:**
- Consume the entire `IDataReader` in `WriteAsync`
- Return `DestinationResult` with row count, batch count, and elapsed time
- `DestinationName` is used in logging and `StepResult` tracking

## DbBulkImportDestination

Full table refresh using TRUNCATE + bulk copy:

```csharp
public class DbBulkImportDestination : IImportDestination
{
    private const int DefaultBatchSize = 10000;
    private const int DefaultCommandTimeoutSeconds = 600;

    public DbBulkImportDestination(
        IDbConnectionFactory connectionFactory,
        string tableName,
        int batchSize = 0,
        int commandTimeoutSeconds = 0)
    {
        _batchSize = batchSize > 0 ? batchSize : DefaultBatchSize;
        _commandTimeoutSeconds = commandTimeoutSeconds > 0
            ? commandTimeoutSeconds : DefaultCommandTimeoutSeconds;
    }
```

### Column mapping

Queries destination schema and maps only matching columns. Extra source columns are ignored:

```csharp
    var destColumns = await GetDestinationColumnsAsync(connection, cancellationToken);

    using var bulkCopy = new SqlBulkCopy(connection)
    {
        DestinationTableName = qualifiedName,
        BatchSize = _batchSize,
        BulkCopyTimeout = _commandTimeoutSeconds,
        EnableStreaming = true
    };

    for (int i = 0; i < source.FieldCount; i++)
    {
        var columnName = source.GetName(i);
        if (destColumns.Contains(columnName))
        {
            bulkCopy.ColumnMappings.Add(columnName, columnName);
        }
    }

    if (bulkCopy.ColumnMappings.Count == 0)
        throw new InvalidOperationException(
            $"No columns from source exist in destination table '{_tableName}'.");
```

### Destination column discovery

Uses `INFORMATION_SCHEMA.COLUMNS` with schema support:

```csharp
    private async Task<HashSet<string>> GetDestinationColumnsAsync(
        SqlConnection connection, CancellationToken ct)
    {
        var (schema, table) = CommonScripts.ParseTableName(_tableName);
        var sql = @"SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS
                    WHERE TABLE_NAME = @tableName AND TABLE_SCHEMA = @schemaName";
        var columns = await connection.QueryAsync<string>(
            new CommandDefinition(sql, new { tableName = table, schemaName = schema },
                commandTimeout: _commandTimeoutSeconds, cancellationToken: ct));
        return columns.ToHashSet(StringComparer.OrdinalIgnoreCase);
    }
```

## DbBulkMergeDestination

Incremental updates using bulk copy to temp table + MERGE:

```csharp
public class DbBulkMergeDestination : IImportDestination
{
    public DbBulkMergeDestination(
        IDbConnectionFactory connectionFactory,
        string tableName,
        string[] matchColumns,
        string[]? updateColumns = null,
        int batchSize = 0,
        int commandTimeoutSeconds = 0)
    {
        if (matchColumns.Length == 0)
            throw new ArgumentException("At least one match column is required.");
    }
```

### Batch processing

Creates a temp table, bulk copies in batches, then merges each batch:

```csharp
    var tempTableName = $"#ETL_{_tableName.Replace(".", "_").Replace("[", "").Replace("]", "")}";
    await CreateTempTableAsync(connection, tempTableName, cancellationToken);

    while (source.Read())
    {
        // Buffer rows into DataTable
        if (batch.Rows.Count >= _batchSize)
        {
            await ProcessBatchAsync(connection, batch, tempTableName, mergeSql, destColumns, ct);
            totalRows += batch.Rows.Count;
            batch.Clear();
        }
    }
```

### MERGE SQL generation

Generates MERGE statement with configurable match and update columns:

```csharp
    private string BuildMergeSql(string tempTableName,
        IReadOnlyList<string> allColumns, IReadOnlyList<string> updateColumns)
    {
        var qualifiedName = CommonScripts.FormatQualifiedTableName(_tableName);
        var sb = new StringBuilder();
        sb.AppendLine($"MERGE INTO {qualifiedName} AS target");
        sb.AppendLine($"USING {tempTableName} AS source");
        sb.Append("ON ");
        sb.AppendLine(string.Join(" AND ",
            _matchColumns.Select(c => $"target.[{c}] = source.[{c}]")));

        if (updateColumns.Count > 0)
        {
            sb.AppendLine("WHEN MATCHED THEN UPDATE SET");
            sb.AppendLine(string.Join(", ",
                updateColumns.Select(c => $"target.[{c}] = source.[{c}]")));
        }

        sb.AppendLine("WHEN NOT MATCHED THEN INSERT");
        sb.AppendLine($"({string.Join(", ", allColumns.Select(c => $"[{c}]"))})");
        sb.AppendLine($"VALUES ({string.Join(", ", allColumns.Select(c => $"source.[{c}]"))});");

        return sb.ToString();
    }
```

## Schema-Qualified Table Names

Both destinations support schema-qualified names via `CommonScripts`:

```csharp
public static (string Schema, string Table) ParseTableName(string tableName)
{
    var cleaned = tableName.Replace("[", "").Replace("]", "");
    var parts = cleaned.Split('.', 2);
    return parts.Length == 2 ? (parts[0], parts[1]) : ("dbo", parts[0]);
}

public static string FormatQualifiedTableName(string tableName)
{
    var (schema, table) = ParseTableName(tableName);
    return $"[{schema}].[{table}]";
}
```

Supported formats: `"Table"`, `"dbo.Table"`, `"[dbo].[Table]"`

## Script Patterns

### IScriptRunner Contract

```csharp
public interface IScriptRunner
{
    Task ExecuteAsync(CancellationToken cancellationToken = default);
    string ScriptName { get; }
}
```

### SqlScriptRunner Implementation

```csharp
public class SqlScriptRunner : IScriptRunner
{
    public SqlScriptRunner(
        IDbConnectionFactory connectionFactory,
        string sql,
        string? name = null,
        object? parameters = null,
        int timeoutSeconds = 3600)
    {
        ScriptName = name ?? "SqlScript";
    }

    public async Task ExecuteAsync(CancellationToken cancellationToken = default)
    {
        await using var connection = await _connectionFactory
            .CreateLotFinderConnectionAsync(cancellationToken);
        await connection.ExecuteAsync(
            new CommandDefinition(_sql, _parameters,
                commandTimeout: _timeoutSeconds, cancellationToken: cancellationToken));
    }
}
```

### Common Scripts

`CommonScripts` provides factory methods for index management:

```csharp
public static IScriptRunner DisableIndexes(
    IDbConnectionFactory factory, string tableName, int timeoutSeconds = 300)
{
    var (schema, table) = ParseTableName(tableName);
    var sql = @"
DECLARE @sql NVARCHAR(MAX) = '';
DECLARE @fullTableName NVARCHAR(256) = QUOTENAME(@schemaName) + '.' + QUOTENAME(@tableName);

SELECT @sql = @sql + 'ALTER INDEX ' + QUOTENAME(i.name) + ' ON ' + @fullTableName + ' DISABLE;'
FROM sys.indexes i
INNER JOIN sys.tables t ON i.object_id = t.object_id
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE t.name = @tableName AND s.name = @schemaName
  AND i.type = 2 AND i.is_disabled = 0;

IF LEN(@sql) > 0 EXEC sp_executesql @sql;";

    return new SqlScriptRunner(factory, sql, $"DisableIndexes:{schema}.{table}",
        parameters: new { tableName = table, schemaName = schema },
        timeoutSeconds: timeoutSeconds);
}

public static IScriptRunner RebuildIndexes(
    IDbConnectionFactory factory, string tableName, int timeoutSeconds = 3600)
{
    // Similar pattern with ALTER INDEX ALL ... REBUILD
}

public static IScriptRunner UpdateStatistics(
    IDbConnectionFactory factory, string tableName, int timeoutSeconds = 600)
{
    // UPDATE STATISTICS with QUOTENAME protection
}
```

### SQL injection protection

All dynamic SQL uses `QUOTENAME()` for identifier escaping:

```csharp
DECLARE @fullTableName NVARCHAR(256) = QUOTENAME(@schemaName) + '.' + QUOTENAME(@tableName);
```

### When to use scripts

Use pre/post scripts for large bulk loads where index overhead matters:

```csharp
var pipeline = new EtlPipelineBuilder()
    .WithName("LargeTableSync")
    .WithSource(source)
    .WithPreScript(CommonScripts.DisableIndexes(factory, "WorkOrder"))
    .WithDestination(new DbBulkImportDestination(factory, "WorkOrder"))
    .WithPostScript(CommonScripts.RebuildIndexes(factory, "WorkOrder"))
    .WithPostScript(CommonScripts.UpdateStatistics(factory, "WorkOrder"))
    .Build();
```

## Related Documentation

- [Overview](./Overview.md) - Pipeline architecture
- [Transformers](./Transformers.md) - Data transformation
- [Configuration](./Configuration.md) - Timeout and batch size options
- [Troubleshooting](./Troubleshooting.md) - Performance tuning
```

**Step 2: Verify the file was created**

```bash
ls -la DOCUMENTATION/DataSync/Destinations.md
```
Expected: File exists

**Step 3: Commit**

```bash
git add DOCUMENTATION/DataSync/Destinations.md
git commit -m "docs: add ETL destinations and scripts documentation"
```

---

### Task 5: Create Configuration.md

**Files:**
- Create: `DOCUMENTATION/DataSync/Configuration.md`

**Step 1: Write Configuration.md with builder API, connection setup, and DI registration**

Create `DOCUMENTATION/DataSync/Configuration.md` with:

```markdown
# Configuration

This document covers pipeline builder configuration, connection factory setup, and dependency injection registration.

## Pipeline Builder API

`EtlPipelineBuilder` uses a fluent API to construct pipelines:

```csharp
var pipeline = new EtlPipelineBuilder()
    .WithName("WorkOrderSync")
    .WithSource(new DbQuerySource(factory, "SELECT * FROM Source.WorkOrders", "WorkOrders"))
    .WithTransformer(new JdeDateTransformer("STRDJ", "TRDJ", "StartDate"))
    .WithTransformer(new ColumnDropTransformer("STRDJ", "TRDJ"))
    .WithPreScript(CommonScripts.DisableIndexes(factory, "WorkOrder"))
    .WithDestination(new DbBulkMergeDestination(factory, "WorkOrder", new[] { "OrderNumber" }))
    .WithPostScript(CommonScripts.RebuildIndexes(factory, "WorkOrder"))
    .WithLogger(logger)
    .Build();
```

### Builder Methods

| Method | Required | Description |
|--------|----------|-------------|
| `WithName(string)` | No | Pipeline name for logging. Default: "Unnamed" |
| `WithSource(IImportSource)` | **Yes** | Data source. Throws if not set before `Build()` |
| `WithTransformer(IDataTransformer)` | No | Add transformer. Can be called multiple times (chained) |
| `WithDestination(IImportDestination)` | **Yes** | Data destination. Throws if not set before `Build()` |
| `WithPreScript(IScriptRunner)` | No | Script to run before data transfer. Can be called multiple times |
| `WithPostScript(IScriptRunner)` | No | Script to run after data transfer. Can be called multiple times |
| `WithCommandTimeout(TimeSpan)` | No | Default timeout. Range: 0-24 hours. Default: 600s |
| `WithLogger(ILogger<EtlPipeline>)` | No | Logger for pipeline events. Default: NullLogger |

### WithCommandTimeout Validation

```csharp
public EtlPipelineBuilder WithCommandTimeout(TimeSpan timeout)
{
    if (timeout < TimeSpan.Zero || timeout > TimeSpan.FromHours(24))
        throw new ArgumentOutOfRangeException(nameof(timeout),
            "Timeout must be between 0 and 24 hours.");
    _defaultCommandTimeoutSeconds = (int)timeout.TotalSeconds;
    return this;
}
```

### Build Validation

```csharp
public EtlPipeline Build()
{
    if (_source == null)
        throw new InvalidOperationException(
            "Source is required. Call WithSource() before Build().");
    if (_destination == null)
        throw new InvalidOperationException(
            "Destination is required. Call WithDestination() before Build().");

    return new EtlPipeline(_name, _source, _transformers, _destination,
        _preScripts, _postScripts, _logger ?? NullLogger<EtlPipeline>.Instance);
}
```

## Component Configuration

### DbQuerySource Options

| Parameter | Default | Description |
|-----------|---------|-------------|
| `connectionFactory` | Required | Factory for database connections |
| `sql` | Required | SQL query to execute |
| `name` | `"Query"` | Name for logging (appears as `DbQuery:{name}`) |
| `parameters` | `null` | Anonymous object for query parameters |
| `commandTimeout` | `3600` | Query timeout in seconds |

### DbBulkImportDestination Options

| Parameter | Default | Description |
|-----------|---------|-------------|
| `connectionFactory` | Required | Factory for database connections |
| `tableName` | Required | Destination table (supports schema: `dbo.Table`) |
| `batchSize` | `10000` | Rows per batch for progress tracking |
| `commandTimeoutSeconds` | `600` | Timeout for TRUNCATE and bulk copy |

### DbBulkMergeDestination Options

| Parameter | Default | Description |
|-----------|---------|-------------|
| `connectionFactory` | Required | Factory for database connections |
| `tableName` | Required | Destination table (supports schema: `dbo.Table`) |
| `matchColumns` | Required | Key columns for MERGE matching |
| `updateColumns` | All non-match | Columns to update on match |
| `batchSize` | `10000` | Rows per batch |
| `commandTimeoutSeconds` | `600` | Timeout for bulk copy and MERGE |

### Script Timeout Defaults

| Script | Default Timeout |
|--------|-----------------|
| `DisableIndexes` | 300s (5 min) |
| `RebuildIndexes` | 3600s (1 hour) |
| `UpdateStatistics` | 600s (10 min) |
| `SqlScriptRunner` | 3600s (1 hour) |

## Connection Factory Setup

The pipeline uses `IDbConnectionFactory` for database connections. Register it with your connection strings:

```csharp
services.AddSingleton<IDbConnectionFactory>(sp =>
{
    var configuration = sp.GetRequiredService<IConfiguration>();
    return new DbConnectionFactory(
        configuration.GetConnectionString("LotFinder"),
        configuration.GetConnectionString("JDE"),
        configuration.GetConnectionString("CMS"));
});
```

### Connection string examples

```json
{
  "ConnectionStrings": {
    "LotFinder": "Server=localhost,1434;Database=LotFinder;User Id=scopingapp;Password=...;TrustServerCertificate=true",
    "JDE": "Data Source=jde-oracle;User Id=...;Password=...",
    "CMS": "Data Source=cms-sybase;User Id=...;Password=..."
  }
}
```

## Dependency Injection Registration

### Basic registration

```csharp
services.AddEtlPipeline();
```

This registers `EtlPipelineBuilder` as transient so each request gets a fresh builder.

### Extension method implementation

```csharp
public static class EtlServiceCollectionExtensions
{
    public static IServiceCollection AddEtlPipeline(this IServiceCollection services)
    {
        services.AddTransient<EtlPipelineBuilder>();
        return services;
    }
}
```

### Full registration example

```csharp
public static IServiceCollection AddDataSync(this IServiceCollection services)
{
    // Connection factory (singleton - manages connection pooling)
    services.AddSingleton<IDbConnectionFactory, DbConnectionFactory>();

    // ETL pipeline builder (transient - fresh instance per use)
    services.AddEtlPipeline();

    // Background service for scheduled syncs
    services.AddHostedService<DataSyncService>();

    return services;
}
```

### Using the builder in a service

```csharp
public class DataSyncService : BackgroundService
{
    private readonly EtlPipelineBuilder _pipelineBuilder;
    private readonly IDbConnectionFactory _connectionFactory;
    private readonly ILogger<EtlPipeline> _pipelineLogger;

    public DataSyncService(
        EtlPipelineBuilder pipelineBuilder,
        IDbConnectionFactory connectionFactory,
        ILogger<EtlPipeline> pipelineLogger)
    {
        _pipelineBuilder = pipelineBuilder;
        _connectionFactory = connectionFactory;
        _pipelineLogger = pipelineLogger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var pipeline = _pipelineBuilder
            .WithName("WorkOrderSync")
            .WithSource(new DbQuerySource(_connectionFactory, "SELECT * FROM JDE.WorkOrders"))
            .WithDestination(new DbBulkImportDestination(_connectionFactory, "WorkOrder"))
            .WithLogger(_pipelineLogger)
            .Build();

        var result = await pipeline.ExecuteAsync(stoppingToken);
    }
}
```

## Configuration Summary

| Component | Option | Default | Valid Range |
|-----------|--------|---------|-------------|
| `EtlPipelineBuilder` | `WithCommandTimeout` | 600s | 0-24 hours |
| `DbQuerySource` | `commandTimeout` | 3600s | > 0 |
| `DbBulkImportDestination` | `batchSize` | 10000 | > 0 |
| `DbBulkImportDestination` | `commandTimeoutSeconds` | 600s | > 0 |
| `DbBulkMergeDestination` | `batchSize` | 10000 | > 0 |
| `DbBulkMergeDestination` | `commandTimeoutSeconds` | 600s | > 0 |

## Related Documentation

- [Overview](./Overview.md) - Pipeline architecture
- [Destinations](./Destinations.md) - Destination-specific options
- [Troubleshooting](./Troubleshooting.md) - Timeout and batch size tuning
```

**Step 2: Verify the file was created**

```bash
ls -la DOCUMENTATION/DataSync/Configuration.md
```
Expected: File exists

**Step 3: Commit**

```bash
git add DOCUMENTATION/DataSync/Configuration.md
git commit -m "docs: add ETL configuration documentation"
```

---

### Task 6: Create Troubleshooting.md

**Files:**
- Create: `DOCUMENTATION/DataSync/Troubleshooting.md`

**Step 1: Write Troubleshooting.md with error catalog, debugging patterns, and performance tuning**

Create `DOCUMENTATION/DataSync/Troubleshooting.md` with:

```markdown
# Troubleshooting

This document covers common errors, debugging patterns, and performance tuning for the ETL pipeline.

## Common Errors

### Column mapping errors

| Error | Cause | Resolution |
|-------|-------|------------|
| "No columns from source exist in destination table" | Source column names don't match destination | Check source query column aliases match destination table columns exactly (case-insensitive) |
| "Column name collision" | Transformer creates duplicate column names | Review rename mappings; ensure no two columns map to the same output name |
| "Column '{name}' not found or was dropped" | Accessing a column that was dropped | Check transformer chain order; don't access dropped columns in later transformers |

### Computed column errors

| Error | Cause | Resolution |
|-------|-------|------------|
| "GetBytes not supported for computed column at ordinal N" | Binary access on transformed column | Use `GetValue()` instead; computed columns (like JDE dates) don't support binary access |
| "GetChars not supported for computed column at ordinal N" | Same as above | Use `GetValue()` or `GetString()` |
| "GetData not supported for computed column at ordinal N" | Same as above | Computed columns can't return nested readers |

### Timeout errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `SqlException: Timeout expired` during bulk copy | Large dataset, slow network | Increase `commandTimeoutSeconds` on destination |
| `SqlException: Timeout expired` during MERGE | Many rows to match | Increase timeout; consider smaller batches |
| `SqlException: Timeout expired` during script | Index rebuild on large table | Increase script `timeoutSeconds` (default 3600s for rebuild) |

### Validation errors

| Error | Cause | Resolution |
|-------|-------|------------|
| "Source is required. Call WithSource() before Build()" | Missing source in pipeline | Add `.WithSource()` to builder chain |
| "Destination is required. Call WithDestination() before Build()" | Missing destination in pipeline | Add `.WithDestination()` to builder chain |
| "At least one match column is required" | Empty matchColumns array | Provide key columns for MERGE matching |
| "Timeout must be between 0 and 24 hours" | Invalid timeout value | Use `TimeSpan` between 0 and 24 hours |

## Debugging Patterns

### Inspecting pipeline results

Check `PipelineResult` after execution to understand what happened:

```csharp
var result = await pipeline.ExecuteAsync(cancellationToken);

if (!result.Success)
{
    logger.LogError(result.Error, "Pipeline failed after {Rows} rows in {Elapsed}",
        result.TotalRows, result.Elapsed);

    // Find which step failed
    var lastStep = result.Steps.LastOrDefault();
    if (lastStep != null)
    {
        logger.LogError("Failed at step: {Step} ({Type})",
            lastStep.StepName, lastStep.StepType);
    }
}
```

### Tracking step-by-step progress

Each step records timing and row counts:

```csharp
foreach (var step in result.Steps)
{
    logger.LogInformation("Step {Name} ({Type}): {Rows} rows in {Elapsed}ms",
        step.StepName,
        step.StepType,
        step.RowsAffected,
        step.Elapsed.TotalMilliseconds);
}
```

### Enabling detailed logging

Inject a logger into the pipeline for execution-level logging:

```csharp
var pipeline = new EtlPipelineBuilder()
    .WithName("DebugPipeline")
    .WithSource(source)
    .WithDestination(destination)
    .WithLogger(loggerFactory.CreateLogger<EtlPipeline>())
    .Build();
```

Pipeline logs include:
- `Information`: Pipeline start/complete with row counts
- `Debug`: Individual script execution
- `Error`: Failure with exception and last step

### Identifying the failure point

When a pipeline fails, `PipelineResult.Steps` contains all completed steps:

```csharp
if (!result.Success)
{
    // Steps completed before failure
    var completedSteps = result.Steps.Select(s => s.StepName);
    logger.LogError("Completed steps: {Steps}", string.Join(" → ", completedSteps));

    // The exception contains root cause
    logger.LogError(result.Error, "Root cause");
}
```

## Performance Tuning

### Batch size optimization

Default batch size is 10,000 rows. Adjust based on row width:

| Row Size | Recommended Batch Size |
|----------|------------------------|
| Narrow (< 20 columns) | 10,000 - 50,000 |
| Medium (20-50 columns) | 5,000 - 10,000 |
| Wide (> 50 columns) | 1,000 - 5,000 |

```csharp
// Large batch for narrow rows
new DbBulkImportDestination(factory, "LookupTable", batchSize: 50000)

// Small batch for wide rows
new DbBulkMergeDestination(factory, "DetailTable", matchColumns, batchSize: 2000)
```

### Index management for bulk loads

Disable indexes before large imports, rebuild after:

```csharp
var pipeline = new EtlPipelineBuilder()
    .WithName("FullTableRefresh")
    .WithPreScript(CommonScripts.DisableIndexes(factory, "LargeTable"))
    .WithSource(source)
    .WithDestination(new DbBulkImportDestination(factory, "LargeTable"))
    .WithPostScript(CommonScripts.RebuildIndexes(factory, "LargeTable"))
    .WithPostScript(CommonScripts.UpdateStatistics(factory, "LargeTable"))
    .Build();
```

**When to use:**
- Full table refreshes (TRUNCATE + import)
- Tables with 3+ non-clustered indexes
- Import of 100,000+ rows

**When to skip:**
- Incremental merges with few rows
- Tables with only a clustered index
- Frequent small updates

### Timeout sizing guidelines

| Operation | Rows | Suggested Timeout |
|-----------|------|-------------------|
| Bulk import | < 100K | 600s (default) |
| Bulk import | 100K - 1M | 1800s (30 min) |
| Bulk import | > 1M | 3600s (1 hour) |
| Bulk merge | < 50K | 600s (default) |
| Bulk merge | 50K - 500K | 1800s (30 min) |
| Index rebuild | Any | 3600s (default) |

```csharp
// Large table with extended timeout
new DbBulkMergeDestination(factory, "HistoricalData",
    matchColumns: new[] { "RecordId" },
    commandTimeoutSeconds: 1800)
```

### Reducing network and memory usage

Select only needed columns in the source query:

```csharp
// Good - select only needed columns
var source = new DbQuerySource(factory,
    "SELECT OrderNumber, Status, StartDate FROM JDE.WorkOrders");

// Avoid - selecting all columns wastes bandwidth
var source = new DbQuerySource(factory,
    "SELECT * FROM JDE.WorkOrders");
```

Extra columns in the source are ignored by the destination column mapping, but they still consume network bandwidth and memory.

### Monitoring baseline performance

Track `PipelineResult.Elapsed` over time to detect degradation:

```csharp
var result = await pipeline.ExecuteAsync(ct);

metrics.RecordPipeline(
    pipelineName: pipeline.PipelineName,
    success: result.Success,
    rows: result.TotalRows,
    durationMs: result.Elapsed.TotalMilliseconds);

// Alert if duration exceeds baseline by 50%
if (result.Elapsed > baselineDuration * 1.5)
{
    logger.LogWarning("Pipeline {Name} took {Elapsed} (baseline: {Baseline})",
        pipeline.PipelineName, result.Elapsed, baselineDuration);
}
```

### Step-level performance analysis

Identify slow steps using `StepResult.Elapsed`:

```csharp
var slowSteps = result.Steps
    .Where(s => s.Elapsed > TimeSpan.FromSeconds(30))
    .OrderByDescending(s => s.Elapsed);

foreach (var step in slowSteps)
{
    logger.LogWarning("Slow step: {Name} took {Elapsed}",
        step.StepName, step.Elapsed);
}
```

## Related Documentation

- [Overview](./Overview.md) - Pipeline architecture
- [Configuration](./Configuration.md) - Timeout and batch size options
- [Destinations](./Destinations.md) - Script patterns for index management
```

**Step 2: Verify the file was created**

```bash
ls -la DOCUMENTATION/DataSync/Troubleshooting.md
```
Expected: File exists

**Step 3: Commit**

```bash
git add DOCUMENTATION/DataSync/Troubleshooting.md
git commit -m "docs: add ETL troubleshooting documentation"
```

---

### Task 7: Update ComponentMap.md

**Files:**
- Modify: `DOCUMENTATION/Instructions/ComponentMap.md`

**Step 1: Read the current ComponentMap.md**

Read `DOCUMENTATION/Instructions/ComponentMap.md` to find where to add the ETL source mapping.

**Step 2: Add ETL source paths to the DataSync section**

Add the new ETL source paths to the DataSync section:

```markdown
### DataSync/

Documents data synchronization from enterprise systems.

**Source paths (Legacy):**
- `OLD/DataModel/Process/JDE*.cs` - JDE Oracle queries
- `OLD/DataModel/Process/CMS*.cs` - CMS Sybase queries
- `OLD/WorkerService/Process/UpdateProcessor.cs` - Sync orchestration
- `OLD/WorkerService/dsconfig/*.json` - Data source configs

**Source paths (New):**
- `NEW/src/JdeScoping.DataSync/Etl/` - ETL pipeline framework
- `NEW/src/JdeScoping.DataSync/Etl/Contracts/` - Core interfaces
- `NEW/src/JdeScoping.DataSync/Etl/Pipeline/` - Pipeline and builder
- `NEW/src/JdeScoping.DataSync/Etl/Sources/` - Data sources
- `NEW/src/JdeScoping.DataSync/Etl/Transformers/` - Data transformers
- `NEW/src/JdeScoping.DataSync/Etl/Destinations/` - Bulk copy/merge destinations
- `NEW/src/JdeScoping.DataSync/Etl/Scripts/` - SQL script runners
- `NEW/src/JdeScoping.DataSync/Etl/Results/` - Execution result types

**Typical files:**
- `Overview.md` - ETL pipeline architecture
- `Sources.md` - Writing custom data sources
- `Transformers.md` - Writing custom transformers
- `Destinations.md` - Bulk destinations and scripts
- `Configuration.md` - Pipeline builder and DI setup
- `Troubleshooting.md` - Debugging and performance
- `JDE.md` - JD Edwards (Oracle) integration
- `CMS.md` - CMS (Sybase) integration
- `Scheduling.md` - Mass/daily/hourly sync schedules
```

**Step 3: Verify the edit is correct**

Read the modified section to confirm the changes.

**Step 4: Commit**

```bash
git add DOCUMENTATION/Instructions/ComponentMap.md
git commit -m "docs: add ETL source paths to ComponentMap"
```

---

### Task 8: Final verification

**Step 1: List all created documentation files**

```bash
ls -la DOCUMENTATION/DataSync/
```

Expected output:
```
Configuration.md
Destinations.md
Overview.md
Sources.md
Transformers.md
Troubleshooting.md
```

**Step 2: Verify all internal links resolve**

Check that all cross-references between docs point to existing files:

```bash
grep -h "\[.*\](\./" DOCUMENTATION/DataSync/*.md | sort -u
```

All referenced files should exist.

**Step 3: Check git log for all commits**

```bash
git log --oneline -8
```

Expected: 7 commits for the documentation (6 docs + 1 ComponentMap update)

**Step 4: Final status check**

```bash
git status
```

Expected: Clean working tree