diff --git a/PLANS/2026-01-03-etl-documentation-design.md b/PLANS/2026-01-03-etl-documentation-design.md
new file mode 100644
index 0000000..243e090
--- /dev/null
+++ b/PLANS/2026-01-03-etl-documentation-design.md
@@ -0,0 +1,163 @@
+# ETL Pipeline Documentation Design
+
+**Date:** 2026-01-03
+**Status:** Approved
+**Purpose:** Design for comprehensive ETL pipeline documentation targeting developers extending the pipeline and operations/support teams.
+
+## Audience
+
+- **Developers extending the pipeline** - Need patterns and API reference for adding new sources, transformers, and destinations
+- **Operations/support** - Need configuration options, monitoring guidance, and troubleshooting
+
+## Document Structure
+
+```
+DOCUMENTATION/DataSync/
+├── Overview.md          # Architecture, data flow, key concepts
+├── Sources.md           # Writing custom IImportSource implementations
+├── Transformers.md      # Writing custom IDataTransformer implementations
+├── Destinations.md      # Writing destinations + script patterns
+├── Configuration.md     # Builder API, connections, DI registration
+└── Troubleshooting.md   # Errors, debugging, performance tuning
+```
+
+## Document Specifications
+
+### Overview.md
+
+**Content:**
+1. Purpose statement - What the ETL pipeline does: streams data from enterprise sources (JDE, CMS) through transformations into SQL Server cache tables
+2. Architecture diagram (text-based) - Shows flow: Source → Transformer chain → Destination, with optional pre/post scripts
+3. Core contracts - Brief description of the 4 interfaces (`IImportSource`, `IDataTransformer`, `IImportDestination`, `IScriptRunner`) with responsibilities
+4. Pipeline execution flow - Walk through `EtlPipeline.ExecuteAsync`: pre-scripts → open source → apply transformers → write to destination → post-scripts
+5. Result model - Explain `PipelineResult`, `StepResult`, `DestinationResult` for tracking execution
+
+**Approach:** No detailed code examples - interface signatures only. Component docs have implementations.
+
+---
+
+### Sources.md
+
+**Content:**
+1. Interface contract - `IImportSource` with `ReadDataAsync()` and `SourceName`. Explain `IAsyncDisposable` requirement.
+2. Annotated walkthrough of `DbQuerySource`:
+   - Constructor pattern: `IDbConnectionFactory` + SQL query + parameters
+   - Connection lifecycle: open in `ReadDataAsync`, dispose via `IAsyncDisposable`
+   - Streaming `IDataReader` (not buffered)
+3. Key patterns:
+   - Keep sources stateless until `ReadDataAsync`
+   - Returned `IDataReader` must remain valid until source disposed
+   - `SourceName` format for logging
+4. Future source types - Brief notes on file-based, API sources
+
+**Estimated length:** ~150-200 lines including code from `DbQuerySource.cs`
+
+---
+
+### Transformers.md
+
+**Content:**
+1. Interface contract - `IDataTransformer` with `Transform()`, `TransformerName`, `MapOrdinal()`
+2. Base class pattern - `DataTransformerBase`:
+   - Default `IDataReader` method implementations
+   - Lazy initialization via `OnInitialize()`
+   - Ordinal mapping for binary methods
+   - Computed columns return `-1` from `MapOrdinal`
+3. Annotated walkthrough of three transformers:
+   - `ColumnRenameTransformer` - Simple: remaps names, validates collisions
+   - `ColumnDropTransformer` - Removes columns, overrides `MapOrdinal`
+   - `JdeDateTransformer` - Complex: combines columns, sentinel handling, `GetDataTypeName` override
+4. Chaining behavior - Transformers compose, ordinal mappings accumulate
+5. Validation patterns - Configuration validation in `OnInitialize()`
+
+**Estimated length:** ~250-300 lines with annotated code from all three transformers
+
+---
+
+### Destinations.md
+
+**Content:**
+1. Interface contract - `IImportDestination` with `WriteAsync()` returning `DestinationResult`
+2. Annotated walkthrough of `DbBulkImportDestination`:
+   - Truncates target, bulk copies all rows
+   - Column mapping via `INFORMATION_SCHEMA.COLUMNS`
+   - Batch processing with `DataTable` buffer
+3. Annotated walkthrough of `DbBulkMergeDestination`:
+   - Temp table creation from destination schema
+   - Bulk copy to temp, MERGE SQL execution
+   - Match columns for upsert, configurable update columns
+   - Schema-qualified names via `CommonScripts.ParseTableName()`
+4. Script patterns (folded in):
+   - `IScriptRunner` and `SqlScriptRunner`
+   - `CommonScripts.DisableIndexes()` / `RebuildIndexes()` / `UpdateStatistics()`
+   - When to use pre/post scripts
+   - QUOTENAME for SQL injection protection
+5. Timeout and batch size configuration
+
+**Estimated length:** ~300-350 lines including MERGE SQL logic
+
+---
+
+### Configuration.md
+
+**Content:**
+1. Pipeline builder API - Full `EtlPipelineBuilder` reference:
+   - `WithName()`, `WithSource()`, `WithDestination()`
+   - `WithTransformer()` (chainable)
+   - `WithPreScript()`, `WithPostScript()`
+   - `WithCommandTimeout()` - validation (0-24 hours), default 600s
+   - `WithLogger()`, `Build()`
+2. Connection factory setup:
+   - Connection strings for SQL Server, JDE Oracle, CMS Sybase
+   - Pooling considerations
+3. DI registration - `EtlServiceCollectionExtensions`:
+   - Registering connection factories
+   - Pipeline builders as transient
+   - Scoped vs singleton
+4. Configuration table - Quick reference with defaults and valid ranges
+
+**Estimated length:** ~200-250 lines
+
+---
+
+### Troubleshooting.md
+
+**Content:**
+1. Common errors and fixes (table format):
+   - "No columns from source exist in destination" → Column name mismatch
+   - "Column name collision" → Duplicate names from transformer
+   - "GetBytes not supported for computed column" → Binary access on transformed column
+   - Timeout exceptions → Increase timeout, reduce batch size
+2. Debugging patterns:
+   - Inspecting `PipelineResult.Steps`
+   - Checking `StepResult.RowsAffected`
+   - Debug logging via `ILogger`
+   - `PipelineResult.Exception` for root cause
+3. Performance tuning:
+   - Batch size guidelines (start 10000, adjust for row width)
+   - Index management (disable before, rebuild after)
+   - Timeout sizing (1 min per 100K rows rule of thumb)
+   - Column filtering to reduce network/memory
+4. Monitoring - Using elapsed times for baselines
+
+**Estimated length:** ~200-250 lines
+
+---
+
+## Source Files to Reference
+
+| Document | Primary Source Files |
+|----------|---------------------|
+| Overview.md | `EtlPipeline.cs`, all contracts |
+| Sources.md | `IImportSource.cs`, `DbQuerySource.cs` |
+| Transformers.md | `IDataTransformer.cs`, `DataTransformerBase.cs`, `JdeDateTransformer.cs`, `ColumnRenameTransformer.cs`, `ColumnDropTransformer.cs` |
+| Destinations.md | `IImportDestination.cs`, `DbBulkImportDestination.cs`, `DbBulkMergeDestination.cs`, `IScriptRunner.cs`, `SqlScriptRunner.cs`, `CommonScripts.cs` |
+| Configuration.md | `EtlPipelineBuilder.cs`, `EtlServiceCollectionExtensions.cs` |
+| Troubleshooting.md | `PipelineResult.cs`, `StepResult.cs`, `DestinationResult.cs` |
+
+## Implementation Notes
+
+- All code snippets must come from actual source files per StyleGuide.md
+- Follow ComponentMap.md location: `DOCUMENTATION/DataSync/`
+- Update ComponentMap.md to include new ETL source paths
+- Cross-reference between docs using relative links