# ETL Pipeline Documentation Design

**Date:** 2026-01-03
**Status:** Approved
**Purpose:** Design for comprehensive ETL pipeline documentation targeting developers extending the pipeline and operations/support teams.

## Audience

- **Developers extending the pipeline** - Need patterns and API reference for adding new sources, transformers, and destinations
- **Operations/support** - Need configuration options, monitoring guidance, and troubleshooting

## Document Structure

```
DOCUMENTATION/DataSync/
├── Overview.md          # Architecture, data flow, key concepts
├── Sources.md           # Writing custom IImportSource implementations
├── Transformers.md      # Writing custom IDataTransformer implementations
├── Destinations.md      # Writing destinations + script patterns
├── Configuration.md     # Builder API, connections, DI registration
└── Troubleshooting.md   # Errors, debugging, performance tuning
```

## Document Specifications

### Overview.md

**Content:**
1. Purpose statement - What the ETL pipeline does: streams data from enterprise sources (JDE, CMS) through transformations into SQL Server cache tables
2. Architecture diagram (text-based) - Shows flow: Source → Transformer chain → Destination, with optional pre/post scripts
3. Core contracts - Brief description of the 4 interfaces (`IImportSource`, `IDataTransformer`, `IImportDestination`, `IScriptRunner`) with responsibilities
4. Pipeline execution flow - Walk through `EtlPipeline.ExecuteAsync`: pre-scripts → open source → apply transformers → write to destination → post-scripts
5. Result model - Explain `PipelineResult`, `StepResult`, `DestinationResult` for tracking execution

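To make the execution order in item 4 concrete, here is a minimal sketch of the five stages. It is not the actual `EtlPipeline.ExecuteAsync`: the class name `PipelineFlowSketch` is hypothetical, and plain delegates stand in for the four contracts because their real signatures are not part of this design. The destination is assumed to return a row count, where the real one returns a `DestinationResult`.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Threading.Tasks;

public static class PipelineFlowSketch
{
    // Delegates stand in for IScriptRunner, IImportSource, IDataTransformer and
    // IImportDestination (assumption: the real contracts have richer signatures).
    public static async Task<int> ExecuteAsync(
        IEnumerable<Func<Task>> preScripts,
        Func<Task<IDataReader>> openSource,
        IEnumerable<Func<IDataReader, IDataReader>> transformers,
        Func<IDataReader, Task<int>> writeToDestination,
        IEnumerable<Func<Task>> postScripts)
    {
        // 1. Pre-scripts run before any data is read.
        foreach (Func<Task> script in preScripts)
            await script();

        int rowsWritten;

        // 2. Open the source; the reader is disposed once the write has finished.
        using (IDataReader sourceReader = await openSource())
        {
            // 3. Each transformer wraps the previous reader, forming a chain.
            IDataReader reader = sourceReader;
            foreach (Func<IDataReader, IDataReader> transform in transformers)
                reader = transform(reader);

            // 4. The destination pulls rows through the whole chain.
            rowsWritten = await writeToDestination(reader);
        }

        // 5. Post-scripts run after the write completes.
        foreach (Func<Task> script in postScripts)
            await script();

        return rowsWritten;
    }
}
```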

**Approach:** No detailed code examples - interface signatures only. Implementations are covered in the component docs.

---

### Sources.md

**Content:**
1. Interface contract - `IImportSource` with `ReadDataAsync()` and `SourceName`. Explain `IAsyncDisposable` requirement.
2. Annotated walkthrough of `DbQuerySource`:
   - Constructor pattern: `IDbConnectionFactory` + SQL query + parameters
   - Connection lifecycle: open in `ReadDataAsync`, dispose via `IAsyncDisposable`
   - Streaming `IDataReader` (not buffered)
3. Key patterns:
   - Keep sources stateless until `ReadDataAsync`
   - Returned `IDataReader` must remain valid until the source is disposed
   - `SourceName` format for logging
4. Future source types - Brief notes on file-based, API sources

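The lifecycle rules above (nothing opened in the constructor, reader valid until disposal) can be sketched as follows. This is an illustration only, not the code in `DbQuerySource.cs`: `QuerySourceSketch` is a hypothetical name, a `Func<DbConnection>` stands in for `IDbConnectionFactory`, the `SourceName` format is assumed, and query parameters are omitted for brevity.

```csharp
using System;
using System.Data;
using System.Data.Common;
using System.Threading.Tasks;

// Hypothetical sketch; a Func<DbConnection> stands in for IDbConnectionFactory.
public sealed class QuerySourceSketch : IAsyncDisposable
{
    private readonly Func<DbConnection> _createConnection;
    private readonly string _sql;
    private DbConnection? _connection;
    private DbCommand? _command;
    private DbDataReader? _reader;

    public QuerySourceSketch(Func<DbConnection> createConnection, string sql)
    {
        // The constructor only stores configuration; no connection is created here.
        _createConnection = createConnection ?? throw new ArgumentNullException(nameof(createConnection));
        _sql = sql ?? throw new ArgumentNullException(nameof(sql));
    }

    // Assumed format; the real SourceName convention is defined by the pipeline.
    public string SourceName => $"Query: {_sql}";

    public async Task<IDataReader> ReadDataAsync()
    {
        _connection = _createConnection();
        await _connection.OpenAsync();

        _command = _connection.CreateCommand();
        _command.CommandText = _sql;

        // The reader streams rows from the server and stays valid until DisposeAsync.
        _reader = await _command.ExecuteReaderAsync();
        return _reader;
    }

    public async ValueTask DisposeAsync()
    {
        // Release in reverse order of acquisition.
        if (_reader is not null) await _reader.DisposeAsync();
        if (_command is not null) await _command.DisposeAsync();
        if (_connection is not null) await _connection.DisposeAsync();
    }
}
```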

**Estimated length:** ~150-200 lines including code from `DbQuerySource.cs`

---

### Transformers.md

**Content:**
1. Interface contract - `IDataTransformer` with `Transform()`, `TransformerName`, `MapOrdinal()`
2. Base class pattern - `DataTransformerBase`:
   - Default `IDataReader` method implementations
   - Lazy initialization via `OnInitialize()`
   - Ordinal mapping for binary methods
   - Computed columns return `-1` from `MapOrdinal`
3. Annotated walkthrough of three transformers:
   - `ColumnRenameTransformer` - Simple: remaps names, validates collisions
   - `ColumnDropTransformer` - Removes columns, overrides `MapOrdinal`
   - `JdeDateTransformer` - Complex: combines columns, sentinel handling, `GetDataTypeName` override
4. Chaining behavior - Transformers compose, ordinal mappings accumulate
5. Validation patterns - Configuration validation in `OnInitialize()`

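One way to picture how ordinal mappings accumulate through a chain, and why a computed column yields `-1`, is the sketch below. It assumes `MapOrdinal` maps an output ordinal back to an input ordinal; that direction, the class `OrdinalMappingSketch`, and its helpers are all hypothetical and only model the idea, not `DataTransformerBase`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrdinalMappingSketch
{
    // Marker for a column that has no single source column behind it.
    public const int Computed = -1;

    // Models a column-drop step: output ordinals skip the dropped input ordinals.
    public static Func<int, int> Drop(int inputFieldCount, params int[] dropped)
    {
        int[] kept = Enumerable.Range(0, inputFieldCount)
            .Where(i => Array.IndexOf(dropped, i) < 0)
            .ToArray();
        return outputOrdinal => kept[outputOrdinal];
    }

    // Models a step that appends one computed column after its input columns.
    public static Func<int, int> AppendComputed(int inputFieldCount)
        => outputOrdinal => outputOrdinal < inputFieldCount ? outputOrdinal : Computed;

    // Walks from the last transformer back to the source, composing the mappings.
    public static int MapToSource(IReadOnlyList<Func<int, int>> chain, int outputOrdinal)
    {
        int ordinal = outputOrdinal;
        for (int i = chain.Count - 1; i >= 0; i--)
        {
            if (ordinal == Computed) return Computed; // nothing upstream to map to
            ordinal = chain[i](ordinal);
        }
        return ordinal;
    }
}
```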

**Estimated length:** ~250-300 lines with annotated code from all three transformers

---

### Destinations.md

**Content:**
1. Interface contract - `IImportDestination` with `WriteAsync()` returning `DestinationResult`
2. Annotated walkthrough of `DbBulkImportDestination`:
   - Truncates target, bulk copies all rows
   - Column mapping via `INFORMATION_SCHEMA.COLUMNS`
   - Batch processing with `DataTable` buffer
3. Annotated walkthrough of `DbBulkMergeDestination`:
   - Temp table creation from destination schema
   - Bulk copy to temp, MERGE SQL execution
   - Match columns for upsert, configurable update columns
   - Schema-qualified names via `CommonScripts.ParseTableName()`
4. Script patterns (folded in):
   - `IScriptRunner` and `SqlScriptRunner`
   - `CommonScripts.DisableIndexes()` / `RebuildIndexes()` / `UpdateStatistics()`
   - When to use pre/post scripts
   - QUOTENAME for SQL injection protection
5. Timeout and batch size configuration

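The MERGE step and the identifier quoting can be illustrated with a small statement builder. This is a sketch under assumptions, not the logic in `DbBulkMergeDestination.cs`: `MergeSqlSketch`, `Quote`, and `BuildMerge` are hypothetical names, `Quote` reproduces bracket quoting in C# (wrap in `[ ]`, double any `]`) rather than calling T-SQL `QUOTENAME`, the inserted columns are assumed to be the match columns plus the update columns, and at least one update column is required for the output to be valid.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MergeSqlSketch
{
    // Bracket-quotes an identifier: wraps it in [ ] and doubles any closing bracket.
    public static string Quote(string identifier)
        => "[" + identifier.Replace("]", "]]") + "]";

    // Builds an upsert from a temp table into the target (hypothetical shape).
    public static string BuildMerge(
        string schema,
        string table,
        string tempTable,
        IReadOnlyList<string> matchColumns,
        IReadOnlyList<string> updateColumns)
    {
        string target = Quote(schema) + "." + Quote(table);
        string on = string.Join(" AND ", matchColumns.Select(c => $"t.{Quote(c)} = s.{Quote(c)}"));
        string set = string.Join(", ", updateColumns.Select(c => $"t.{Quote(c)} = s.{Quote(c)}"));

        // Assumption: new rows receive the match columns plus the update columns.
        List<string> all = matchColumns.Concat(updateColumns).ToList();
        string insertColumns = string.Join(", ", all.Select(c => Quote(c)));
        string insertValues = string.Join(", ", all.Select(c => "s." + Quote(c)));

        return
            $"MERGE INTO {target} AS t\n" +
            $"USING {Quote(tempTable)} AS s\n" +
            $"ON {on}\n" +
            $"WHEN MATCHED THEN UPDATE SET {set}\n" +
            $"WHEN NOT MATCHED BY TARGET THEN INSERT ({insertColumns}) VALUES ({insertValues});";
    }
}
```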

**Estimated length:** ~300-350 lines including MERGE SQL logic

---

### Configuration.md

**Content:**
1. Pipeline builder API - Full `EtlPipelineBuilder` reference:
   - `WithName()`, `WithSource()`, `WithDestination()`
   - `WithTransformer()` (chainable)
   - `WithPreScript()`, `WithPostScript()`
   - `WithCommandTimeout()` - validates the range (0-24 hours); defaults to 600s
   - `WithLogger()`, `Build()`
2. Connection factory setup:
   - Connection strings for SQL Server, JDE Oracle, CMS Sybase
   - Pooling considerations
3. DI registration - `EtlServiceCollectionExtensions`:
   - Registering connection factories
   - Pipeline builders as transient
   - Scoped vs singleton
4. Configuration table - Quick reference with defaults and valid ranges

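The fluent style and the timeout rule (0-24 hours, 600s default) might look like the sketch below. It is not `EtlPipelineBuilder`: the class `PipelineBuilderSketch` is hypothetical, only two of the listed methods are modeled, and both the `TimeSpan` parameter type and the inclusive bounds are assumptions. A call such as `new PipelineBuilderSketch().WithName("Orders").WithCommandTimeout(TimeSpan.FromMinutes(30))` would chain as expected.

```csharp
using System;

public sealed class PipelineBuilderSketch
{
    public static readonly TimeSpan DefaultCommandTimeout = TimeSpan.FromSeconds(600);
    public static readonly TimeSpan MaxCommandTimeout = TimeSpan.FromHours(24);

    public string Name { get; private set; } = "unnamed";

    // Starts at the documented default of 600 seconds.
    public TimeSpan CommandTimeout { get; private set; } = DefaultCommandTimeout;

    public PipelineBuilderSketch WithName(string name)
    {
        if (string.IsNullOrWhiteSpace(name))
            throw new ArgumentException("Pipeline name is required.", nameof(name));
        Name = name;
        return this; // returning the builder is what makes the calls chainable
    }

    public PipelineBuilderSketch WithCommandTimeout(TimeSpan timeout)
    {
        // Assumption: both ends of the 0-24 hour range are accepted.
        if (timeout < TimeSpan.Zero || timeout > MaxCommandTimeout)
            throw new ArgumentOutOfRangeException(
                nameof(timeout), timeout, "Timeout must be between 0 and 24 hours.");
        CommandTimeout = timeout;
        return this;
    }
}
```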

**Estimated length:** ~200-250 lines

---

### Troubleshooting.md

**Content:**
1. Common errors and fixes (table format):
   - "No columns from source exist in destination" → Column name mismatch
   - "Column name collision" → Duplicate names from transformer
   - "GetBytes not supported for computed column" → Binary access on transformed column
   - Timeout exceptions → Increase timeout, reduce batch size
2. Debugging patterns:
   - Inspecting `PipelineResult.Steps`
   - Checking `StepResult.RowsAffected`
   - Debug logging via `ILogger<EtlPipeline>`
   - `PipelineResult.Exception` for root cause
3. Performance tuning:
   - Batch size guidelines (start at 10,000 rows, adjust for row width)
   - Index management (disable before, rebuild after)
   - Timeout sizing (rule of thumb: 1 min per 100K rows)
   - Column filtering to reduce network/memory
4. Monitoring - Using elapsed times for baselines

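The timeout rule of thumb is simple arithmetic: at one minute per 100K rows, a load of 2.5 million rows suggests about 25 minutes. A small helper makes that explicit. The name `TimeoutSizingSketch` is hypothetical, and rounding partial blocks up to a full minute is an assumption, not part of the stated rule.

```csharp
using System;

public static class TimeoutSizingSketch
{
    // One minute per 100,000 rows; any partial block rounds up to a full minute.
    public static TimeSpan EstimateTimeout(long expectedRows)
    {
        if (expectedRows < 0)
            throw new ArgumentOutOfRangeException(nameof(expectedRows));

        long minutes = (expectedRows + 99_999) / 100_000;
        return TimeSpan.FromMinutes(minutes);
    }
}
```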

**Estimated length:** ~200-250 lines

---

## Source Files to Reference

| Document | Primary Source Files |
|----------|---------------------|
| Overview.md | `EtlPipeline.cs`, all contracts |
| Sources.md | `IImportSource.cs`, `DbQuerySource.cs` |
| Transformers.md | `IDataTransformer.cs`, `DataTransformerBase.cs`, `JdeDateTransformer.cs`, `ColumnRenameTransformer.cs`, `ColumnDropTransformer.cs` |
| Destinations.md | `IImportDestination.cs`, `DbBulkImportDestination.cs`, `DbBulkMergeDestination.cs`, `IScriptRunner.cs`, `SqlScriptRunner.cs`, `CommonScripts.cs` |
| Configuration.md | `EtlPipelineBuilder.cs`, `EtlServiceCollectionExtensions.cs` |
| Troubleshooting.md | `PipelineResult.cs`, `StepResult.cs`, `DestinationResult.cs` |

## Implementation Notes

- All code snippets must come from actual source files per StyleGuide.md
- Follow ComponentMap.md location: `DOCUMENTATION/DataSync/`
- Update ComponentMap.md to include new ETL source paths
- Cross-reference between docs using relative links