docs: add ETL pipeline documentation design

Design for 6 documentation files covering the DataSync ETL pipeline:
- Overview, Sources, Transformers, Destinations, Configuration, Troubleshooting

Target audience: developers extending the pipeline + operations/support.
Joseph Doherty
2026-01-03 15:22:57 -05:00
parent 7dcbacd5ca
commit 9103626ad4
@@ -0,0 +1,163 @@
# ETL Pipeline Documentation Design
**Date:** 2026-01-03
**Status:** Approved
**Purpose:** Design for comprehensive ETL pipeline documentation targeting developers extending the pipeline and operations/support teams.
## Audience
- **Developers extending the pipeline** - Need patterns and API reference for adding new sources, transformers, and destinations
- **Operations/support** - Need configuration options, monitoring guidance, and troubleshooting
## Document Structure
```
DOCUMENTATION/DataSync/
├── Overview.md # Architecture, data flow, key concepts
├── Sources.md # Writing custom IImportSource implementations
├── Transformers.md # Writing custom IDataTransformer implementations
├── Destinations.md # Writing destinations + script patterns
├── Configuration.md # Builder API, connections, DI registration
└── Troubleshooting.md # Errors, debugging, performance tuning
```
## Document Specifications
### Overview.md
**Content:**
1. Purpose statement - What the ETL pipeline does: streams data from enterprise sources (JDE, CMS) through transformations into SQL Server cache tables
2. Architecture diagram (text-based) - Shows flow: Source → Transformer chain → Destination, with optional pre/post scripts
3. Core contracts - Brief description of the 4 interfaces (`IImportSource`, `IDataTransformer`, `IImportDestination`, `IScriptRunner`) with responsibilities
4. Pipeline execution flow - Walk through `EtlPipeline.ExecuteAsync`: pre-scripts → open source → apply transformers → write to destination → post-scripts
5. Result model - Explain `PipelineResult`, `StepResult`, `DestinationResult` for tracking execution
**Approach:** No detailed code examples - interface signatures only. Component docs have implementations.
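
**Illustrative sketch (design doc only):** The execution order in item 4 can be pictured with the minimal C# sketch below. It is not the real `EtlPipeline`: `PipelineSketch` and all of its members are hypothetical, and plain delegates stand in for the actual contracts, whose signatures are defined in the contract source files. Per the Implementation Notes, the published Overview.md would use the real signatures instead.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Threading.Tasks;

// Hypothetical stand-in for the pipeline; NOT the real EtlPipeline type.
// Delegates replace the real contracts so the sketch stays self-contained.
public sealed class PipelineSketch
{
    private readonly Func<Task<IDataReader>> _openSource;
    private readonly Func<IDataReader, Task<long>> _writeDestination;

    public PipelineSketch(
        Func<Task<IDataReader>> openSource,
        Func<IDataReader, Task<long>> writeDestination)
    {
        _openSource = openSource ?? throw new ArgumentNullException(nameof(openSource));
        _writeDestination = writeDestination ?? throw new ArgumentNullException(nameof(writeDestination));
    }

    public List<Func<Task>> PreScripts { get; } = new List<Func<Task>>();
    public List<Func<IDataReader, IDataReader>> Transformers { get; } = new List<Func<IDataReader, IDataReader>>();
    public List<Func<Task>> PostScripts { get; } = new List<Func<Task>>();

    public async Task<long> ExecuteAsync()
    {
        // 1. Pre-scripts run before any data is read.
        foreach (var script in PreScripts) await script();

        // 2. Open the source; the reader is disposed when this method exits.
        using IDataReader source = await _openSource();

        // 3. Each transformer wraps the previous reader, forming a chain.
        IDataReader current = source;
        foreach (var transform in Transformers) current = transform(current);

        // 4. The destination reads rows through the whole chain.
        long rows = await _writeDestination(current);

        // 5. Post-scripts run only after the write has completed.
        foreach (var script in PostScripts) await script();
        return rows;
    }
}
```

The sketch leaves out result tracking and error handling; in the real pipeline those are reported through `PipelineResult`, `StepResult`, and `DestinationResult`.
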
---
### Sources.md
**Content:**
1. Interface contract - `IImportSource` with `ReadDataAsync()` and `SourceName`. Explain `IAsyncDisposable` requirement.
2. Annotated walkthrough of `DbQuerySource`:
- Constructor pattern: `IDbConnectionFactory` + SQL query + parameters
- Connection lifecycle: open in `ReadDataAsync`, dispose via `IAsyncDisposable`
- Streaming `IDataReader` (not buffered)
3. Key patterns:
- Keep sources stateless until `ReadDataAsync`
- The returned `IDataReader` must remain valid until the source is disposed
- `SourceName` format for logging
4. Future source types - Brief notes on file-based, API sources
**Estimated length:** ~150-200 lines including code from `DbQuerySource.cs`
---
### Transformers.md
**Content:**
1. Interface contract - `IDataTransformer` with `Transform()`, `TransformerName`, `MapOrdinal()`
2. Base class pattern - `DataTransformerBase`:
- Default `IDataReader` method implementations
- Lazy initialization via `OnInitialize()`
- Ordinal mapping for binary-access methods (e.g. `GetBytes`)
- Computed columns return `-1` from `MapOrdinal`
3. Annotated walkthrough of three transformers:
- `ColumnRenameTransformer` - Simple: remaps names, validates collisions
- `ColumnDropTransformer` - Removes columns, overrides `MapOrdinal`
- `JdeDateTransformer` - Complex: combines columns, sentinel handling, `GetDataTypeName` override
4. Chaining behavior - Transformers compose, ordinal mappings accumulate
5. Validation patterns - Configuration validation in `OnInitialize()`
**Estimated length:** ~250-300 lines with annotated code from all three transformers
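
**Illustrative sketch (design doc only):** The way ordinal mappings accumulate across a chain (item 4), together with the `-1` convention for computed columns (item 2), can be sketched as below. This assumes each mapping translates an output ordinal of one transformer to the ordinal in that transformer's input; the direction, the `OrdinalChainSketch` type, and the use of plain `Func<int, int>` delegates are all assumptions made for illustration, not the real `MapOrdinal()` signature.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; NOT part of the real pipeline.
public static class OrdinalChainSketch
{
    // chain[0] is the transformer closest to the source, chain[Count - 1] the last one.
    // Each function maps an output ordinal to its input ordinal,
    // or returns -1 when the column is computed and has no source column.
    public static int MapToSource(int outputOrdinal, IReadOnlyList<Func<int, int>> chain)
    {
        int ordinal = outputOrdinal;

        // Walk from the last transformer back toward the source.
        for (int i = chain.Count - 1; i >= 0; i--)
        {
            ordinal = chain[i](ordinal);
            if (ordinal < 0) return -1; // computed at this stage, so stop here
        }
        return ordinal;
    }
}
```

For example, if the first transformer drops the middle column of `(A, B, C)`, its mapping is `0 → 0, 1 → 2`. If a second transformer then appends a computed column at ordinal 2, its mapping is `0 → 0, 1 → 1, 2 → -1`. Resolving output ordinal 1 through both yields source ordinal 2, and output ordinal 2 resolves to `-1`.
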
---
### Destinations.md
**Content:**
1. Interface contract - `IImportDestination` with `WriteAsync()` returning `DestinationResult`
2. Annotated walkthrough of `DbBulkImportDestination`:
- Truncates target, bulk copies all rows
- Column mapping via `INFORMATION_SCHEMA.COLUMNS`
- Batch processing with `DataTable` buffer
3. Annotated walkthrough of `DbBulkMergeDestination`:
- Temp table creation from destination schema
- Bulk copy to temp, MERGE SQL execution
- Match columns for upsert, configurable update columns
- Schema-qualified names via `CommonScripts.ParseTableName()`
4. Script patterns (folded in):
- `IScriptRunner` and `SqlScriptRunner`
- `CommonScripts.DisableIndexes()` / `RebuildIndexes()` / `UpdateStatistics()`
- When to use pre/post scripts
- QUOTENAME for SQL injection protection
5. Timeout and batch size configuration
**Estimated length:** ~300-350 lines including MERGE SQL logic
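
**Illustrative sketch (design doc only):** The QUOTENAME point in item 4 comes down to bracket-quoting identifiers before they are concatenated into dynamic SQL. The C# helper below mirrors what T-SQL `QUOTENAME` does with its default delimiters: it wraps the name in `[` `]` and doubles any `]` inside it. `SqlNameSketch` and its methods are hypothetical names, not the real `CommonScripts` API, and the schema and table are taken as separate arguments so that nothing is assumed about how `CommonScripts.ParseTableName()` splits a name.

```csharp
using System;

// Hypothetical helper; NOT the real CommonScripts class.
public static class SqlNameSketch
{
    // Bracket-quotes a single identifier the way QUOTENAME(name) does by default.
    public static string QuoteIdentifier(string name)
    {
        if (string.IsNullOrEmpty(name))
            throw new ArgumentException("Identifier must not be empty.", nameof(name));

        // QUOTENAME takes a sysname (at most 128 characters); reject longer input here.
        if (name.Length > 128)
            throw new ArgumentException("Identifier exceeds 128 characters.", nameof(name));

        // Doubling ']' keeps the identifier from closing the bracket early.
        return "[" + name.Replace("]", "]]") + "]";
    }

    // Builds a schema-qualified name from two separately quoted parts.
    public static string QuoteTableName(string schema, string table)
        => QuoteIdentifier(schema) + "." + QuoteIdentifier(table);
}
```

With this quoting, a hostile table name such as `Orders]; DROP TABLE x;--` stays a single, harmless identifier instead of terminating the bracket and injecting a statement.
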
---
### Configuration.md
**Content:**
1. Pipeline builder API - Full `EtlPipelineBuilder` reference:
- `WithName()`, `WithSource()`, `WithDestination()`
- `WithTransformer()` (chainable)
- `WithPreScript()`, `WithPostScript()`
- `WithCommandTimeout()` - validates the range (0-24 hours); default 600s
- `WithLogger()`, `Build()`
2. Connection factory setup:
- Connection strings for SQL Server, JDE Oracle, CMS Sybase
- Pooling considerations
3. DI registration - `EtlServiceCollectionExtensions`:
- Registering connection factories
- Pipeline builders as transient
- Scoped vs singleton
4. Configuration table - Quick reference with defaults and valid ranges
**Estimated length:** ~200-250 lines
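
**Illustrative sketch (design doc only):** The `WithCommandTimeout()` rule in item 1 (0-24 hours, default 600s) can be expressed as a small validation function. The sketch below assumes both bounds are inclusive and that an unspecified timeout falls back to the default; `CommandTimeoutSketch` and its members are hypothetical names, not the real `EtlPipelineBuilder` code.

```csharp
using System;

// Hypothetical helper; NOT the real EtlPipelineBuilder validation.
public static class CommandTimeoutSketch
{
    public static readonly TimeSpan Default = TimeSpan.FromSeconds(600);
    public static readonly TimeSpan Max = TimeSpan.FromHours(24);

    // Returns the timeout in whole seconds, as ADO.NET command timeouts expect.
    public static int ToCommandTimeoutSeconds(TimeSpan? timeout)
    {
        // Fall back to the documented default when no value is supplied.
        TimeSpan value = timeout ?? Default;

        // Assumption: the 0-24 hour range is inclusive at both ends.
        if (value < TimeSpan.Zero || value > Max)
            throw new ArgumentOutOfRangeException(
                nameof(timeout), value, "Timeout must be between 0 and 24 hours.");

        return (int)value.TotalSeconds;
    }
}
```

If the lower bound of 0 is indeed allowed, Configuration.md should state what it means: for `SqlCommand.CommandTimeout`, a value of 0 disables the time limit entirely.
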
---
### Troubleshooting.md
**Content:**
1. Common errors and fixes (table format):
- "No columns from source exist in destination" → Column name mismatch
- "Column name collision" → Duplicate names from transformer
- "GetBytes not supported for computed column" → Binary access on transformed column
- Timeout exceptions → Increase timeout, reduce batch size
2. Debugging patterns:
- Inspecting `PipelineResult.Steps`
- Checking `StepResult.RowsAffected`
- Debug logging via `ILogger<EtlPipeline>`
- `PipelineResult.Exception` for root cause
3. Performance tuning:
- Batch size guidelines (start at 10,000 rows, then adjust for row width)
- Index management (disable before, rebuild after)
- Timeout sizing (rule of thumb: 1 minute per 100K rows)
- Column filtering to reduce network/memory
4. Monitoring - Using elapsed times for baselines
**Estimated length:** ~200-250 lines
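
**Illustrative sketch (design doc only):** The timeout sizing rule of thumb in item 3 (1 minute per 100K rows) is simple arithmetic, shown below as a hypothetical helper. Rounding up to whole minutes and the one-minute floor are assumptions added for the sketch; `TimeoutSizingSketch` is not part of the pipeline.

```csharp
using System;

// Hypothetical helper; NOT part of the real pipeline.
public static class TimeoutSizingSketch
{
    private const long RowsPerMinute = 100_000;

    // Estimates a command timeout from the expected row count.
    public static TimeSpan EstimateTimeout(long expectedRows)
    {
        if (expectedRows < 0)
            throw new ArgumentOutOfRangeException(nameof(expectedRows));

        // Ceiling division: any partial block of 100K rows counts as a full minute.
        long minutes = (expectedRows + RowsPerMinute - 1) / RowsPerMinute;

        // Assumption: never go below one minute, even for tiny loads.
        return TimeSpan.FromMinutes(Math.Max(1, minutes));
    }
}
```

Under this rule a 2.5 million row load gets a 25 minute estimate. The result is only a starting point to be checked against the elapsed times recorded for real runs.
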
---
## Source Files to Reference
| Document | Primary Source Files |
|----------|---------------------|
| Overview.md | `EtlPipeline.cs`, all contracts |
| Sources.md | `IImportSource.cs`, `DbQuerySource.cs` |
| Transformers.md | `IDataTransformer.cs`, `DataTransformerBase.cs`, `JdeDateTransformer.cs`, `ColumnRenameTransformer.cs`, `ColumnDropTransformer.cs` |
| Destinations.md | `IImportDestination.cs`, `DbBulkImportDestination.cs`, `DbBulkMergeDestination.cs`, `IScriptRunner.cs`, `SqlScriptRunner.cs`, `CommonScripts.cs` |
| Configuration.md | `EtlPipelineBuilder.cs`, `EtlServiceCollectionExtensions.cs` |
| Troubleshooting.md | `PipelineResult.cs`, `StepResult.cs`, `DestinationResult.cs` |
## Implementation Notes
- All code snippets must come from actual source files per StyleGuide.md
- Follow ComponentMap.md location: `DOCUMENTATION/DataSync/`
- Update ComponentMap.md to include new ETL source paths
- Cross-reference between docs using relative links