# ETL Pipeline Documentation Design

**Date:** 2026-01-03

**Status:** Approved

**Purpose:** Design for comprehensive ETL pipeline documentation targeting developers extending the pipeline and operations/support teams.

## Audience

- **Developers extending the pipeline** - Need patterns and API reference for adding new sources, transformers, and destinations
- **Operations/support** - Need configuration options, monitoring guidance, and troubleshooting

## Document Structure

```
DOCUMENTATION/DataSync/
├── Overview.md          # Architecture, data flow, key concepts
├── Sources.md           # Writing custom IImportSource implementations
├── Transformers.md      # Writing custom IDataTransformer implementations
├── Destinations.md      # Writing destinations + script patterns
├── Configuration.md     # Builder API, connections, DI registration
└── Troubleshooting.md   # Errors, debugging, performance tuning
```

## Document Specifications

### Overview.md

**Content:**

1. Purpose statement - What the ETL pipeline does: streams data from enterprise sources (JDE, CMS) through transformations into SQL Server cache tables
2. Architecture diagram (text-based) - Shows flow: Source → Transformer chain → Destination, with optional pre/post scripts
3. Core contracts - Brief description of the 4 interfaces (`IImportSource`, `IDataTransformer`, `IImportDestination`, `IScriptRunner`) with responsibilities
4. Pipeline execution flow - Walk through `EtlPipeline.ExecuteAsync`: pre-scripts → open source → apply transformers → write to destination → post-scripts
5. Result model - Explain `PipelineResult`, `StepResult`, `DestinationResult` for tracking execution

**Approach:** No detailed code examples - interface signatures only. Component docs have implementations.

---

### Sources.md

**Content:**

1. Interface contract - `IImportSource` with `ReadDataAsync()` and `SourceName`. Explain `IAsyncDisposable` requirement.
2. Annotated walkthrough of `DbQuerySource`:
   - Constructor pattern: `IDbConnectionFactory` + SQL query + parameters
   - Connection lifecycle: open in `ReadDataAsync`, dispose via `IAsyncDisposable`
   - Streaming `IDataReader` (not buffered)
3. Key patterns:
   - Keep sources stateless until `ReadDataAsync`
   - Returned `IDataReader` must remain valid until source disposed
   - `SourceName` format for logging
4. Future source types - Brief notes on file-based, API sources

**Estimated length:** ~150-200 lines including code from `DbQuerySource.cs`

---

### Transformers.md

**Content:**

1. Interface contract - `IDataTransformer` with `Transform()`, `TransformerName`, `MapOrdinal()`
2. Base class pattern - `DataTransformerBase`:
   - Default `IDataReader` method implementations
   - Lazy initialization via `OnInitialize()`
   - Ordinal mapping for binary methods
   - Computed columns return `-1` from `MapOrdinal`
3. Annotated walkthrough of three transformers:
   - `ColumnRenameTransformer` - Simple: remaps names, validates collisions
   - `ColumnDropTransformer` - Removes columns, overrides `MapOrdinal`
   - `JdeDateTransformer` - Complex: combines columns, sentinel handling, `GetDataTypeName` override
4. Chaining behavior - Transformers compose, ordinal mappings accumulate
5. Validation patterns - Configuration validation in `OnInitialize()`

**Estimated length:** ~250-300 lines with annotated code from all three transformers

---

### Destinations.md

**Content:**

1. Interface contract - `IImportDestination` with `WriteAsync()` returning `DestinationResult`
2. Annotated walkthrough of `DbBulkImportDestination`:
   - Truncates target, bulk copies all rows
   - Column mapping via `INFORMATION_SCHEMA.COLUMNS`
   - Batch processing with `DataTable` buffer
3. Annotated walkthrough of `DbBulkMergeDestination`:
   - Temp table creation from destination schema
   - Bulk copy to temp, MERGE SQL execution
   - Match columns for upsert, configurable update columns
   - Schema-qualified names via `CommonScripts.ParseTableName()`
4. Script patterns (folded in):
   - `IScriptRunner` and `SqlScriptRunner`
   - `CommonScripts.DisableIndexes()` / `RebuildIndexes()` / `UpdateStatistics()`
   - When to use pre/post scripts
   - QUOTENAME for SQL injection protection
5. Timeout and batch size configuration

**Estimated length:** ~300-350 lines including MERGE SQL logic

---

### Configuration.md

**Content:**

1. Pipeline builder API - Full `EtlPipelineBuilder` reference:
   - `WithName()`, `WithSource()`, `WithDestination()`
   - `WithTransformer()` (chainable)
   - `WithPreScript()`, `WithPostScript()`
   - `WithCommandTimeout()` - validation (0-24 hours), default 600s
   - `WithLogger()`, `Build()`
2. Connection factory setup:
   - Connection strings for SQL Server, JDE Oracle, CMS Sybase
   - Pooling considerations
3. DI registration - `EtlServiceCollectionExtensions`:
   - Registering connection factories
   - Pipeline builders as transient
   - Scoped vs singleton
4. Configuration table - Quick reference with defaults and valid ranges

**Estimated length:** ~200-250 lines

---

### Troubleshooting.md

**Content:**

1. Common errors and fixes (table format):
   - "No columns from source exist in destination" → Column name mismatch
   - "Column name collision" → Duplicate names from transformer
   - "GetBytes not supported for computed column" → Binary access on transformed column
   - Timeout exceptions → Increase timeout, reduce batch size
2. Debugging patterns:
   - Inspecting `PipelineResult.Steps`
   - Checking `StepResult.RowsAffected`
   - Debug logging via `ILogger`
   - `PipelineResult.Exception` for root cause
3. Performance tuning:
   - Batch size guidelines (start 10000, adjust for row width)
   - Index management (disable before, rebuild after)
   - Timeout sizing (1 min per 100K rows rule of thumb)
   - Column filtering to reduce network/memory
4. Monitoring - Using elapsed times for baselines

**Estimated length:** ~200-250 lines

---

## Source Files to Reference

| Document | Primary Source Files |
|----------|---------------------|
| Overview.md | `EtlPipeline.cs`, all contracts |
| Sources.md | `IImportSource.cs`, `DbQuerySource.cs` |
| Transformers.md | `IDataTransformer.cs`, `DataTransformerBase.cs`, `JdeDateTransformer.cs`, `ColumnRenameTransformer.cs`, `ColumnDropTransformer.cs` |
| Destinations.md | `IImportDestination.cs`, `DbBulkImportDestination.cs`, `DbBulkMergeDestination.cs`, `IScriptRunner.cs`, `SqlScriptRunner.cs`, `CommonScripts.cs` |
| Configuration.md | `EtlPipelineBuilder.cs`, `EtlServiceCollectionExtensions.cs` |
| Troubleshooting.md | `PipelineResult.cs`, `StepResult.cs`, `DestinationResult.cs` |

## Implementation Notes

- All code snippets must come from actual source files per StyleGuide.md
- Follow ComponentMap.md location: `DOCUMENTATION/DataSync/`
- Update ComponentMap.md to include new ETL source paths
- Cross-reference between docs using relative links
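
The timeout figures quoted for Configuration.md and Troubleshooting.md (600s default, 24-hour ceiling, roughly 1 minute per 100K rows) can be combined into a quick worked example. The sketch below is illustrative only: `TimeoutEstimator` and `Suggest` are hypothetical names that do not exist in the pipeline source, and treating the 600s default as a floor is an assumption rather than documented behavior.

```csharp
using System;

// Hypothetical helper, not part of the pipeline source: turns the
// "1 minute per 100K rows" rule of thumb into a command timeout.
public static class TimeoutEstimator
{
    // Default and upper bound taken from the WithCommandTimeout() notes.
    public static readonly TimeSpan DefaultTimeout = TimeSpan.FromSeconds(600);
    public static readonly TimeSpan MaxTimeout = TimeSpan.FromHours(24);

    public static TimeSpan Suggest(long expectedRows)
    {
        if (expectedRows < 0)
            throw new ArgumentOutOfRangeException(nameof(expectedRows));

        // One minute per started block of 100,000 rows (rounds up).
        long minutes = expectedRows / 100_000
                       + (expectedRows % 100_000 == 0 ? 0 : 1);

        // Cap at 24 hours before building a TimeSpan, so very large
        // row counts cannot overflow TimeSpan.FromMinutes.
        if (minutes >= 24 * 60)
            return MaxTimeout;

        TimeSpan estimate = TimeSpan.FromMinutes(minutes);

        // Assumption: never suggest less than the 600s default.
        return estimate < DefaultTimeout ? DefaultTimeout : estimate;
    }
}
```

Under these assumptions, `TimeoutEstimator.Suggest(2_500_000)` yields 25 minutes, while a 100,000-row load stays at the 600s default. The result would then be passed to `WithCommandTimeout()`, whose exact parameter type should be taken from `EtlPipelineBuilder.cs`.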