ETL Pipeline Documentation Design
Date: 2026-01-03
Status: Approved
Purpose: Design for comprehensive ETL pipeline documentation targeting developers extending the pipeline and operations/support teams.
Audience
- Developers extending the pipeline - Need patterns and API reference for adding new sources, transformers, and destinations
- Operations/support - Need configuration options, monitoring guidance, and troubleshooting
Document Structure
DOCUMENTATION/DataSync/
├── Overview.md # Architecture, data flow, key concepts
├── Sources.md # Writing custom IImportSource implementations
├── Transformers.md # Writing custom IDataTransformer implementations
├── Destinations.md # Writing destinations + script patterns
├── Configuration.md # Builder API, connections, DI registration
└── Troubleshooting.md # Errors, debugging, performance tuning
Document Specifications
Overview.md
Content:
- Purpose statement - What the ETL pipeline does: streams data from enterprise sources (JDE, CMS) through transformations into SQL Server cache tables
- Architecture diagram (text-based) - Shows flow: Source → Transformer chain → Destination, with optional pre/post scripts
- Core contracts - Brief description of the 4 interfaces (`IImportSource`, `IDataTransformer`, `IImportDestination`, `IScriptRunner`) with responsibilities
- Pipeline execution flow - Walk through `EtlPipeline.ExecuteAsync`: pre-scripts → open source → apply transformers → write to destination → post-scripts
- Result model - Explain `PipelineResult`, `StepResult`, `DestinationResult` for tracking execution
Approach: No detailed code examples - interface signatures only. Component docs have implementations.
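For orientation while drafting Overview.md, the execution order listed above can be sketched as follows. This is a minimal illustration, not the actual `EtlPipeline.ExecuteAsync`: the class name `PipelineFlowSketch` and the delegate-based parameters are assumptions standing in for the real `IImportSource`, `IDataTransformer`, `IImportDestination`, and `IScriptRunner` contracts, whose signatures may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Threading.Tasks;

// Hypothetical sketch: delegates stand in for the real pipeline contracts.
public static class PipelineFlowSketch
{
    public static async Task<long> ExecuteAsync(
        IEnumerable<Func<Task>> preScripts,
        Func<Task<IDataReader>> openSource,
        IEnumerable<Func<IDataReader, IDataReader>> transformers,
        Func<IDataReader, Task<long>> writeToDestination,
        IEnumerable<Func<Task>> postScripts)
    {
        // 1. Pre-scripts run before any data is read.
        foreach (var script in preScripts)
            await script();

        // 2. Open the source; the reader streams rows rather than buffering them.
        IDataReader reader = await openSource();

        // 3. Each transformer wraps the reader produced by the previous step.
        foreach (var transform in transformers)
            reader = transform(reader);

        // 4. The destination consumes the outermost reader and reports a row count.
        long rowsWritten = await writeToDestination(reader);

        // 5. Post-scripts run once the write has completed.
        foreach (var script in postScripts)
            await script();

        return rowsWritten;
    }
}
```

The sketch leaves out error handling, source disposal, and the `PipelineResult` bookkeeping; Overview.md should take those details from EtlPipeline.cs.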
Sources.md
Content:
- Interface contract - `IImportSource` with `ReadDataAsync()` and `SourceName`. Explain `IAsyncDisposable` requirement.
- Annotated walkthrough of `DbQuerySource`:
  - Constructor pattern: `IDbConnectionFactory` + SQL query + parameters
  - Connection lifecycle: open in `ReadDataAsync`, dispose via `IAsyncDisposable`
  - Streaming `IDataReader` (not buffered)
- Key patterns:
  - Keep sources stateless until `ReadDataAsync`
  - Returned `IDataReader` must remain valid until source disposed
  - `SourceName` format for logging
- Future source types - Brief notes on file-based, API sources
Estimated length: ~150-200 lines including code from DbQuerySource.cs
Transformers.md
Content:
- Interface contract - `IDataTransformer` with `Transform()`, `TransformerName`, `MapOrdinal()`
- Base class pattern - `DataTransformerBase`:
  - Default `IDataReader` method implementations
  - Lazy initialization via `OnInitialize()`
  - Ordinal mapping for binary methods
  - Computed columns return `-1` from `MapOrdinal`
- Annotated walkthrough of three transformers:
  - `ColumnRenameTransformer` - Simple: remaps names, validates collisions
  - `ColumnDropTransformer` - Removes columns, overrides `MapOrdinal`
  - `JdeDateTransformer` - Complex: combines columns, sentinel handling, `GetDataTypeName` override
- Chaining behavior - Transformers compose, ordinal mappings accumulate
- Validation patterns - Configuration validation in `OnInitialize()`
Estimated length: ~250-300 lines with annotated code from all three transformers
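To make "ordinal mappings accumulate" concrete for the Transformers.md author, here is one way the chaining rule could look. It is a sketch under assumptions: `OrdinalMappingSketch` and its helpers are hypothetical names, each mapping is modelled as a plain function from output ordinal to input ordinal, and only the `-1` convention for computed columns comes from the design above. The real `MapOrdinal` signature in DataTransformerBase.cs may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of how per-transformer ordinal mappings could compose.
public static class OrdinalMappingSketch
{
    // A computed column has no source column behind it.
    public const int Computed = -1;

    // Mapping for a transformer that drops columns: output ordinal -> input ordinal.
    public static Func<int, int> DropColumns(int inputFieldCount, ISet<int> dropped)
    {
        var kept = new List<int>();
        for (int i = 0; i < inputFieldCount; i++)
        {
            if (!dropped.Contains(i))
                kept.Add(i);
        }
        return output => kept[output];
    }

    // Mapping for a transformer that appends one computed column after its inputs.
    public static Func<int, int> AppendComputed(int inputFieldCount) =>
        output => output < inputFieldCount ? output : Computed;

    // The chain is listed in pipeline order; resolution walks from the last
    // transformer back towards the source and stops at the first computed column.
    public static int MapToSource(IReadOnlyList<Func<int, int>> chain, int outputOrdinal)
    {
        int ordinal = outputOrdinal;
        for (int i = chain.Count - 1; i >= 0; i--)
        {
            ordinal = chain[i](ordinal);
            if (ordinal == Computed)
                return Computed;
        }
        return ordinal;
    }
}
```

For a source with columns A, B, C, a chain that first drops B and then appends a computed column D yields the output A, C, D. Output ordinal 1 (C) resolves to source ordinal 2, and output ordinal 2 (D) resolves to `-1`, which is the case where binary access such as `GetBytes` cannot be forwarded to the source.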
Destinations.md
Content:
- Interface contract - `IImportDestination` with `WriteAsync()` returning `DestinationResult`
- Annotated walkthrough of `DbBulkImportDestination`:
  - Truncates target, bulk copies all rows
  - Column mapping via `INFORMATION_SCHEMA.COLUMNS`
  - Batch processing with `DataTable` buffer
- Annotated walkthrough of `DbBulkMergeDestination`:
  - Temp table creation from destination schema
  - Bulk copy to temp, MERGE SQL execution
  - Match columns for upsert, configurable update columns
  - Schema-qualified names via `CommonScripts.ParseTableName()`
- Script patterns (folded in):
  - `IScriptRunner` and `SqlScriptRunner`
  - `CommonScripts.DisableIndexes()` / `RebuildIndexes()` / `UpdateStatistics()`
  - When to use pre/post scripts
- QUOTENAME for SQL injection protection
- Timeout and batch size configuration
Estimated length: ~300-350 lines including MERGE SQL logic
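As a reference point for the MERGE walkthrough, the statement shape implied by "match columns for upsert, configurable update columns" might be built roughly like this. The sketch is illustrative only: `MergeSqlSketch`, `Quote`, and `BuildMerge` are hypothetical names, the identifier quoting is done client-side in the same bracket style that T-SQL `QUOTENAME` produces by default, and the real logic in DbBulkMergeDestination.cs may be structured differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of MERGE statement generation for an upsert from a temp table.
public static class MergeSqlSketch
{
    // Bracket-quote an identifier: wrap in [ ] and double any ] inside the name.
    public static string Quote(string identifier) =>
        "[" + identifier.Replace("]", "]]") + "]";

    public static string BuildMerge(
        string schema,
        string table,
        string tempTable,
        IReadOnlyList<string> matchColumns,
        IReadOnlyList<string> updateColumns)
    {
        if (matchColumns.Count == 0)
            throw new ArgumentException("At least one match column is required.", nameof(matchColumns));

        string target = Quote(schema) + "." + Quote(table);

        // Rows match when every match column is equal between target (t) and source (s).
        string on = string.Join(" AND ",
            matchColumns.Select(c => $"t.{Quote(c)} = s.{Quote(c)}"));

        // Only the configured update columns are overwritten on a match.
        string update = updateColumns.Count > 0
            ? "WHEN MATCHED THEN UPDATE SET " +
              string.Join(", ", updateColumns.Select(c => $"t.{Quote(c)} = s.{Quote(c)}")) + " "
            : "";

        // Unmatched source rows are inserted with match and update columns.
        var insertColumns = matchColumns.Concat(updateColumns).ToList();
        string columnList = string.Join(", ", insertColumns.Select(c => Quote(c)));
        string valueList = string.Join(", ", insertColumns.Select(c => "s." + Quote(c)));

        return $"MERGE {target} AS t USING {Quote(tempTable)} AS s ON {on} " +
               update +
               $"WHEN NOT MATCHED BY TARGET THEN INSERT ({columnList}) VALUES ({valueList});";
    }
}
```

Quoting every identifier before it is concatenated into the statement is what keeps table and column names from being interpreted as SQL; data values never appear in the generated text because they arrive through the bulk copy into the temp table.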
Configuration.md
Content:
- Pipeline builder API - Full `EtlPipelineBuilder` reference:
  - `WithName()`, `WithSource()`, `WithDestination()`
  - `WithTransformer()` (chainable)
  - `WithPreScript()`, `WithPostScript()`
  - `WithCommandTimeout()` - validation (0-24 hours), default 600s
  - `WithLogger()`, `Build()`
- Connection factory setup:
  - Connection strings for SQL Server, JDE Oracle, CMS Sybase
  - Pooling considerations
- DI registration - `EtlServiceCollectionExtensions`:
  - Registering connection factories
  - Pipeline builders as transient
  - Scoped vs singleton
- Configuration table - Quick reference with defaults and valid ranges
Estimated length: ~200-250 lines
Troubleshooting.md
Content:
- Common errors and fixes (table format):
  - "No columns from source exist in destination" → Column name mismatch
  - "Column name collision" → Duplicate names from transformer
  - "GetBytes not supported for computed column" → Binary access on transformed column
  - Timeout exceptions → Increase timeout, reduce batch size
- Debugging patterns:
  - Inspecting `PipelineResult.Steps`
  - Checking `StepResult.RowsAffected`
  - Debug logging via `ILogger<EtlPipeline>`
  - `PipelineResult.Exception` for root cause
- Performance tuning:
  - Batch size guidelines (start 10000, adjust for row width)
  - Index management (disable before, rebuild after)
  - Timeout sizing (1 min per 100K rows rule of thumb)
  - Column filtering to reduce network/memory
- Monitoring - Using elapsed times for baselines
Estimated length: ~200-250 lines
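The timeout rule of thumb can be written down as a small calculation for the Troubleshooting.md draft. This sketch assumes the estimate is never allowed to drop below the 600s default and is capped at the 24-hour upper bound noted for `WithCommandTimeout()`; both choices, and the name `TimeoutSizingSketch`, are assumptions rather than behavior confirmed by the source files.

```csharp
using System;

// Hypothetical helper applying the "1 min per 100K rows" rule of thumb.
public static class TimeoutSizingSketch
{
    public static readonly TimeSpan DefaultTimeout = TimeSpan.FromSeconds(600);
    public static readonly TimeSpan MaxTimeout = TimeSpan.FromHours(24);

    public static TimeSpan Estimate(long expectedRows)
    {
        if (expectedRows < 0)
            throw new ArgumentOutOfRangeException(nameof(expectedRows));

        // One minute for every started block of 100,000 rows.
        long minutes = (expectedRows + 99_999) / 100_000;
        TimeSpan estimate = TimeSpan.FromMinutes((double)minutes);

        // Assumption: treat the default as a floor and the validation limit as a cap.
        if (estimate < DefaultTimeout)
            return DefaultTimeout;
        return estimate > MaxTimeout ? MaxTimeout : estimate;
    }
}
```

Under these assumptions, 250,000 rows would suggest 3 minutes and therefore stay at the 10-minute default, while 2,500,000 rows would suggest 25 minutes.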
Source Files to Reference
| Document | Primary Source Files |
|---|---|
| Overview.md | EtlPipeline.cs, all contracts |
| Sources.md | IImportSource.cs, DbQuerySource.cs |
| Transformers.md | IDataTransformer.cs, DataTransformerBase.cs, JdeDateTransformer.cs, ColumnRenameTransformer.cs, ColumnDropTransformer.cs |
| Destinations.md | IImportDestination.cs, DbBulkImportDestination.cs, DbBulkMergeDestination.cs, IScriptRunner.cs, SqlScriptRunner.cs, CommonScripts.cs |
| Configuration.md | EtlPipelineBuilder.cs, EtlServiceCollectionExtensions.cs |
| Troubleshooting.md | PipelineResult.cs, StepResult.cs, DestinationResult.cs |
Implementation Notes
- All code snippets must come from actual source files per StyleGuide.md
- Follow ComponentMap.md location: `DOCUMENTATION/DataSync/`
- Update ComponentMap.md to include new ETL source paths
- Cross-reference between docs using relative links