docs: add ETL pipeline documentation design
Design for 6 documentation files covering the DataSync ETL pipeline:
- Overview, Sources, Transformers, Destinations, Configuration, Troubleshooting

Target audience: developers extending the pipeline + operations/support.
@@ -0,0 +1,163 @@
# ETL Pipeline Documentation Design

**Date:** 2026-01-03
**Status:** Approved
**Purpose:** Design for comprehensive ETL pipeline documentation targeting developers extending the pipeline and operations/support teams.

## Audience

- **Developers extending the pipeline** - Need patterns and API reference for adding new sources, transformers, and destinations
- **Operations/support** - Need configuration options, monitoring guidance, and troubleshooting

## Document Structure

```
DOCUMENTATION/DataSync/
├── Overview.md          # Architecture, data flow, key concepts
├── Sources.md           # Writing custom IImportSource implementations
├── Transformers.md      # Writing custom IDataTransformer implementations
├── Destinations.md      # Writing destinations + script patterns
├── Configuration.md     # Builder API, connections, DI registration
└── Troubleshooting.md   # Errors, debugging, performance tuning
```

## Document Specifications

### Overview.md

**Content:**
1. Purpose statement - What the ETL pipeline does: streams data from enterprise sources (JDE, CMS) through transformations into SQL Server cache tables
2. Architecture diagram (text-based) - Shows flow: Source → Transformer chain → Destination, with optional pre/post scripts
3. Core contracts - Brief description of the 4 interfaces (`IImportSource`, `IDataTransformer`, `IImportDestination`, `IScriptRunner`) with responsibilities
4. Pipeline execution flow - Walk through `EtlPipeline.ExecuteAsync`: pre-scripts → open source → apply transformers → write to destination → post-scripts
5. Result model - Explain `PipelineResult`, `StepResult`, `DestinationResult` for tracking execution

**Approach:** No detailed code examples - interface signatures only. Implementations are covered in the component docs.
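
For reviewers of this design, the execution order in item 4 can be sketched as below. This is a Python illustration of the ordering only, not the `EtlPipeline.ExecuteAsync` implementation; `run_pipeline`, its callable parameters, and the in-memory stand-ins are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

Row = dict
Transformer = Callable[[Iterator[Row]], Iterator[Row]]


def run_pipeline(
    pre_scripts: Iterable[Callable[[], None]],
    open_source: Callable[[], Iterator[Row]],
    transformers: Iterable[Transformer],
    write_destination: Callable[[Iterator[Row]], int],
    post_scripts: Iterable[Callable[[], None]],
) -> int:
    """Run the steps in the documented order and return the rows written."""
    for script in pre_scripts:          # 1. pre-scripts
        script()
    rows = open_source()                # 2. open source (a lazy row stream)
    for transform in transformers:      # 3. wrap the stream in each transformer
        rows = transform(rows)
    written = write_destination(rows)   # 4. destination pulls rows through the chain
    for script in post_scripts:         # 5. post-scripts
        script()
    return written


# Usage with in-memory stand-ins (hypothetical data).
log: List[str] = []
target: List[Row] = []


def source() -> Iterator[Row]:
    log.append("source")
    return iter([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])


def upper_name(rows: Iterator[Row]) -> Iterator[Row]:
    return ({**r, "name": r["name"].upper()} for r in rows)


def write(rows: Iterator[Row]) -> int:
    log.append("write")
    target.extend(rows)
    return len(target)


rows_written = run_pipeline(
    [lambda: log.append("pre")], source, [upper_name], write, [lambda: log.append("post")]
)
```

The sketch leaves out error handling and the result model, which the real pipeline reports through `PipelineResult`.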

---

### Sources.md

**Content:**
1. Interface contract - `IImportSource` with `ReadDataAsync()` and `SourceName`. Explain `IAsyncDisposable` requirement.
2. Annotated walkthrough of `DbQuerySource`:
   - Constructor pattern: `IDbConnectionFactory` + SQL query + parameters
   - Connection lifecycle: open in `ReadDataAsync`, dispose via `IAsyncDisposable`
   - Streaming `IDataReader` (not buffered)
3. Key patterns:
   - Keep sources stateless until `ReadDataAsync`
   - Returned `IDataReader` must remain valid until source disposed
   - `SourceName` format for logging
4. Future source types - Brief notes on file-based, API sources

**Estimated length:** ~150-200 lines including code from `DbQuerySource.cs`
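
The lifecycle rules in items 2 and 3 (no connection until the first read, reader valid until the source is disposed) can be illustrated with a small Python sketch. It is not `DbQuerySource`: the `QuerySource` class is hypothetical, and the standard-library `sqlite3` module stands in for the connection factory.

```python
import sqlite3
from typing import Iterator, Optional, Tuple


class QuerySource:
    """Hypothetical stand-in for a query-backed source (not DbQuerySource itself)."""

    def __init__(self, database: str, sql: str, params: Tuple = ()) -> None:
        # The constructor only stores configuration; no connection is opened here.
        self._database = database
        self._sql = sql
        self._params = params
        self._conn: Optional[sqlite3.Connection] = None

    @property
    def source_name(self) -> str:
        return f"QuerySource({self._sql})"

    def read_data(self) -> Iterator[tuple]:
        # The connection is opened on read and kept until close().
        self._conn = sqlite3.connect(self._database)
        cursor = self._conn.execute(self._sql, self._params)
        return iter(cursor)  # rows are fetched as the caller iterates

    def close(self) -> None:
        # Disposal releases the connection; the reader is unusable afterwards.
        if self._conn is not None:
            self._conn.close()
            self._conn = None


source = QuerySource(":memory:", "SELECT 1 AS id UNION ALL SELECT ?", (2,))
opened_before_read = source._conn is not None
rows = list(source.read_data())
source.close()
```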

---

### Transformers.md

**Content:**
1. Interface contract - `IDataTransformer` with `Transform()`, `TransformerName`, `MapOrdinal()`
2. Base class pattern - `DataTransformerBase`:
   - Default `IDataReader` method implementations
   - Lazy initialization via `OnInitialize()`
   - Ordinal mapping for binary methods such as `GetBytes`
   - Computed columns return `-1` from `MapOrdinal`
3. Annotated walkthrough of three transformers:
   - `ColumnRenameTransformer` - Simple: remaps names, validates collisions
   - `ColumnDropTransformer` - Removes columns, overrides `MapOrdinal`
   - `JdeDateTransformer` - Complex: combines columns, sentinel handling, `GetDataTypeName` override
4. Chaining behavior - Transformers compose, ordinal mappings accumulate
5. Validation patterns - Configuration validation in `OnInitialize()`

**Estimated length:** ~250-300 lines with annotated code from all three transformers
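
How ordinal mappings accumulate across a chain (item 4), and how a computed column's `-1` propagates, can be sketched as follows. The Python below assumes each transformer maps its output ordinals to its input's ordinals; `OrdinalMap`, `resolve_to_source`, and the column layout are hypothetical and not taken from `DataTransformerBase`.

```python
from typing import List, Sequence

COMPUTED = -1  # marker for a column with no single upstream ordinal


class OrdinalMap:
    """Hypothetical stand-in: output ordinal -> input ordinal for one transformer."""

    def __init__(self, name: str, mapping: Sequence[int]) -> None:
        self.name = name
        self._mapping = list(mapping)

    def map_ordinal(self, ordinal: int) -> int:
        return self._mapping[ordinal]


def resolve_to_source(chain: List[OrdinalMap], ordinal: int) -> int:
    """Walk from the last transformer back to the source reader."""
    for transformer in reversed(chain):
        ordinal = transformer.map_ordinal(ordinal)
        if ordinal == COMPUTED:
            return COMPUTED  # nothing upstream to read raw bytes from
    return ordinal


# Source columns: A(0), B(1), C(2), D(3)
drop_b = OrdinalMap("drop B", [0, 2, 3])            # output: A, C, D
combine_cd = OrdinalMap("combine C+D", [0, COMPUTED])  # output: A, computed date

chain = [drop_b, combine_cd]
a_ordinal = resolve_to_source(chain, 0)       # A passes straight through
date_ordinal = resolve_to_source(chain, 1)    # computed column
d_ordinal = resolve_to_source([drop_b], 2)    # D shifts from 2 back to 3
```

A `-1` result corresponds to the case listed in Troubleshooting.md where binary access on a computed column is rejected.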

---

### Destinations.md

**Content:**
1. Interface contract - `IImportDestination` with `WriteAsync()` returning `DestinationResult`
2. Annotated walkthrough of `DbBulkImportDestination`:
   - Truncates target, bulk copies all rows
   - Column mapping via `INFORMATION_SCHEMA.COLUMNS`
   - Batch processing with `DataTable` buffer
3. Annotated walkthrough of `DbBulkMergeDestination`:
   - Temp table creation from destination schema
   - Bulk copy to temp, MERGE SQL execution
   - Match columns for upsert, configurable update columns
   - Schema-qualified names via `CommonScripts.ParseTableName()`
4. Script patterns (folded in):
   - `IScriptRunner` and `SqlScriptRunner`
   - `CommonScripts.DisableIndexes()` / `RebuildIndexes()` / `UpdateStatistics()`
   - When to use pre/post scripts
   - QUOTENAME for SQL injection protection
5. Timeout and batch size configuration

**Estimated length:** ~300-350 lines including MERGE SQL logic
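
The MERGE shape described in item 3, together with bracket quoting in the style of QUOTENAME, might look like the sketch below. It is a Python illustration under stated assumptions, not the SQL that `DbBulkMergeDestination` generates: `quote_name`, `build_merge_sql`, and the `dbo.Orders` table and column names are hypothetical.

```python
from typing import Sequence


def quote_name(identifier: str) -> str:
    """Bracket-quote an identifier, doubling any ']' as T-SQL QUOTENAME does."""
    return "[" + identifier.replace("]", "]]") + "]"


def build_merge_sql(
    schema: str,
    table: str,
    temp_table: str,
    columns: Sequence[str],
    match_columns: Sequence[str],
    update_columns: Sequence[str],
) -> str:
    """Build an upsert MERGE from a staging temp table into the target."""
    target = f"{quote_name(schema)}.{quote_name(table)}"
    on_clause = " AND ".join(
        f"t.{quote_name(c)} = s.{quote_name(c)}" for c in match_columns
    )
    set_clause = ", ".join(
        f"t.{quote_name(c)} = s.{quote_name(c)}" for c in update_columns
    )
    insert_cols = ", ".join(quote_name(c) for c in columns)
    insert_vals = ", ".join(f"s.{quote_name(c)}" for c in columns)
    return (
        f"MERGE {target} AS t\n"
        f"USING {quote_name(temp_table)} AS s\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED BY TARGET THEN INSERT ({insert_cols}) VALUES ({insert_vals});"
    )


merge_sql = build_merge_sql(
    schema="dbo",
    table="Orders",
    temp_table="#Orders_stage",
    columns=["OrderId", "Status"],
    match_columns=["OrderId"],
    update_columns=["Status"],
)
```

Quoting every identifier before it is concatenated into the statement is what keeps a crafted table or column name from breaking out of the identifier position.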

---

### Configuration.md

**Content:**
1. Pipeline builder API - Full `EtlPipelineBuilder` reference:
   - `WithName()`, `WithSource()`, `WithDestination()`
   - `WithTransformer()` (chainable)
   - `WithPreScript()`, `WithPostScript()`
   - `WithCommandTimeout()` - validation (0-24 hours), default 600s
   - `WithLogger()`, `Build()`
2. Connection factory setup:
   - Connection strings for SQL Server, JDE Oracle, CMS Sybase
   - Pooling considerations
3. DI registration - `EtlServiceCollectionExtensions`:
   - Registering connection factories
   - Pipeline builders as transient
   - Scoped vs singleton lifetimes
4. Configuration table - Quick reference with defaults and valid ranges

**Estimated length:** ~200-250 lines
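
The timeout rule listed for `WithCommandTimeout()` (0-24 hours, default 600s) can be expressed as a small check. The Python sketch below assumes both bounds are inclusive, which this design does not state; `validate_command_timeout` is a hypothetical helper, not the builder's code.

```python
from datetime import timedelta

DEFAULT_COMMAND_TIMEOUT = timedelta(seconds=600)
MAX_COMMAND_TIMEOUT = timedelta(hours=24)


def validate_command_timeout(timeout: timedelta = DEFAULT_COMMAND_TIMEOUT) -> int:
    """Return the timeout in whole seconds, or raise if outside 0-24 hours."""
    # Assumption: 0 and exactly 24 hours are both accepted.
    if timeout < timedelta(0) or timeout > MAX_COMMAND_TIMEOUT:
        raise ValueError(
            f"Command timeout must be between 0 and 24 hours, got {timeout}"
        )
    return int(timeout.total_seconds())


default_seconds = validate_command_timeout()
max_seconds = validate_command_timeout(timedelta(hours=24))
```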

---

### Troubleshooting.md

**Content:**
1. Common errors and fixes (table format):
   - "No columns from source exist in destination" → Column name mismatch
   - "Column name collision" → Duplicate names from transformer
   - "GetBytes not supported for computed column" → Binary access on transformed column
   - Timeout exceptions → Increase timeout, reduce batch size
2. Debugging patterns:
   - Inspecting `PipelineResult.Steps`
   - Checking `StepResult.RowsAffected`
   - Debug logging via `ILogger<EtlPipeline>`
   - `PipelineResult.Exception` for root cause
3. Performance tuning:
   - Batch size guidelines (start 10000, adjust for row width)
   - Index management (disable before, rebuild after)
   - Timeout sizing (1 min per 100K rows rule of thumb)
   - Column filtering to reduce network/memory
4. Monitoring - Using elapsed times for baselines

**Estimated length:** ~200-250 lines
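
The timeout sizing rule of thumb in item 3 (1 minute per 100K rows) works out as in the Python sketch below. Using the 600s default from Configuration.md as a floor is an assumption made for the example, and `suggested_timeout_seconds` is a hypothetical helper.

```python
import math


def suggested_timeout_seconds(expected_rows: int, minimum_seconds: int = 600) -> int:
    """Apply 1 minute per 100K rows, rounded up, never below the minimum."""
    minutes = math.ceil(expected_rows / 100_000)
    return max(minimum_seconds, minutes * 60)


small_load = suggested_timeout_seconds(250_000)      # 3 min by the rule, floor applies
large_load = suggested_timeout_seconds(2_500_000)    # 25 min by the rule
```

For the 250,000-row load the rule gives 180 seconds, so the 600-second floor wins; the 2,500,000-row load needs 1,500 seconds.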

---

## Source Files to Reference

| Document | Primary Source Files |
|----------|---------------------|
| Overview.md | `EtlPipeline.cs`, all contracts |
| Sources.md | `IImportSource.cs`, `DbQuerySource.cs` |
| Transformers.md | `IDataTransformer.cs`, `DataTransformerBase.cs`, `JdeDateTransformer.cs`, `ColumnRenameTransformer.cs`, `ColumnDropTransformer.cs` |
| Destinations.md | `IImportDestination.cs`, `DbBulkImportDestination.cs`, `DbBulkMergeDestination.cs`, `IScriptRunner.cs`, `SqlScriptRunner.cs`, `CommonScripts.cs` |
| Configuration.md | `EtlPipelineBuilder.cs`, `EtlServiceCollectionExtensions.cs` |
| Troubleshooting.md | `PipelineResult.cs`, `StepResult.cs`, `DestinationResult.cs` |

## Implementation Notes

- All code snippets must come from actual source files per StyleGuide.md
- Follow ComponentMap.md location: `DOCUMENTATION/DataSync/`
- Update ComponentMap.md to include new ETL source paths
- Cross-reference between docs using relative links