jdescopingtool/PLANS/2026-01-03-etl-documentation-design.md
Joseph Doherty 9103626ad4 docs: add ETL pipeline documentation design
Design for 6 documentation files covering the DataSync ETL pipeline:
- Overview, Sources, Transformers, Destinations, Configuration, Troubleshooting

Target audience: developers extending the pipeline + operations/support.
2026-01-03 15:22:57 -05:00

ETL Pipeline Documentation Design

Date: 2026-01-03
Status: Approved
Purpose: Design for comprehensive ETL pipeline documentation targeting developers extending the pipeline and operations/support teams.

Audience

  • Developers extending the pipeline - Need patterns and API reference for adding new sources, transformers, and destinations
  • Operations/support - Need configuration options, monitoring guidance, and troubleshooting

Document Structure

DOCUMENTATION/DataSync/
├── Overview.md          # Architecture, data flow, key concepts
├── Sources.md           # Writing custom IImportSource implementations
├── Transformers.md      # Writing custom IDataTransformer implementations
├── Destinations.md      # Writing destinations + script patterns
├── Configuration.md     # Builder API, connections, DI registration
└── Troubleshooting.md   # Errors, debugging, performance tuning

Document Specifications

Overview.md

Content:

  1. Purpose statement - What the ETL pipeline does: streams data from enterprise sources (JDE, CMS) through transformations into SQL Server cache tables
  2. Architecture diagram (text-based) - Shows flow: Source → Transformer chain → Destination, with optional pre/post scripts
  3. Core contracts - Brief description of the four interfaces (IImportSource, IDataTransformer, IImportDestination, IScriptRunner) and their responsibilities
  4. Pipeline execution flow - Walk through EtlPipeline.ExecuteAsync: pre-scripts → open source → apply transformers → write to destination → post-scripts
  5. Result model - Explain PipelineResult, StepResult, DestinationResult for tracking execution

Approach: No detailed code examples - interface signatures only. Component docs have implementations.
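
As a review aid for this design, the execution flow in item 4 can be sketched as follows. This is a minimal sketch, not the actual EtlPipeline.ExecuteAsync: the interface names come from this design, but every member signature, the RunAsync method name, the DestinationResult shape, and the choice to dispose the source inside the pipeline are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Threading;
using System.Threading.Tasks;

// Contract names come from the design; all member signatures below are assumed.
public sealed record DestinationResult(long RowsWritten); // hypothetical shape

public interface IImportSource : IAsyncDisposable
{
    string SourceName { get; }
    Task<IDataReader> ReadDataAsync(CancellationToken ct);
}

public interface IDataTransformer
{
    string TransformerName { get; }
    IDataReader Transform(IDataReader input);
}

public interface IImportDestination
{
    Task<DestinationResult> WriteAsync(IDataReader reader, CancellationToken ct);
}

public interface IScriptRunner
{
    Task RunAsync(string script, CancellationToken ct); // hypothetical method name
}

public static class PipelineFlowSketch
{
    public static async Task<DestinationResult> ExecuteAsync(
        IImportSource source,
        IReadOnlyList<IDataTransformer> transformers,
        IImportDestination destination,
        IScriptRunner scripts,
        IReadOnlyList<string> preScripts,
        IReadOnlyList<string> postScripts,
        CancellationToken ct = default)
    {
        // 1. Pre-scripts run before any data is read.
        foreach (var script in preScripts)
            await scripts.RunAsync(script, ct);

        DestinationResult result;
        await using (source) // assumed: the pipeline disposes the source after the write
        {
            // 2. Open the source and obtain a streaming reader.
            IDataReader reader = await source.ReadDataAsync(ct);

            // 3. Each transformer wraps the previous reader, forming the chain.
            foreach (var transformer in transformers)
                reader = transformer.Transform(reader);

            // 4. The destination pulls rows through the chain.
            result = await destination.WriteAsync(reader, ct);
        }

        // 5. Post-scripts run after the write has completed.
        foreach (var script in postScripts)
            await scripts.RunAsync(script, ct);

        return result;
    }
}
```

The sketch omits error handling and the PipelineResult/StepResult bookkeeping described in item 5; it only illustrates the ordering of the five stages.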


Sources.md

Content:

  1. Interface contract - IImportSource with ReadDataAsync() and SourceName. Explain IAsyncDisposable requirement.
  2. Annotated walkthrough of DbQuerySource:
    • Constructor pattern: IDbConnectionFactory + SQL query + parameters
    • Connection lifecycle: open in ReadDataAsync, dispose via IAsyncDisposable
    • Streaming IDataReader (not buffered)
  3. Key patterns:
    • Keep sources stateless until ReadDataAsync
    • Returned IDataReader must remain valid until source disposed
    • SourceName format for logging
  4. Future source types - Brief notes on file-based, API sources

Estimated length: ~150-200 lines including code from DbQuerySource.cs
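
The key patterns above (stateless until ReadDataAsync, reader valid until disposal) can be illustrated with a rough sketch. This is not the contents of DbQuerySource.cs: the IDbConnectionFactory shape, the class name, the parameter dictionary, and the SourceName format are all assumptions made for illustration. Only System.Data.Common APIs from the base class library are used.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical factory shape; the real IDbConnectionFactory may differ.
public interface IDbConnectionFactory
{
    DbConnection Create();
}

public sealed class QuerySourceSketch : IAsyncDisposable
{
    private readonly IDbConnectionFactory _factory;
    private readonly string _sql;
    private readonly IReadOnlyDictionary<string, object> _parameters;
    private DbConnection? _connection;
    private DbCommand? _command;
    private DbDataReader? _reader;

    // The constructor only stores configuration; no connection is opened here.
    public QuerySourceSketch(IDbConnectionFactory factory, string sql,
        IReadOnlyDictionary<string, object> parameters)
    {
        _factory = factory ?? throw new ArgumentNullException(nameof(factory));
        _sql = sql ?? throw new ArgumentNullException(nameof(sql));
        _parameters = parameters ?? throw new ArgumentNullException(nameof(parameters));
    }

    // Assumed format; the design leaves the SourceName format to Sources.md.
    public string SourceName => $"DbQuery[{_factory.GetType().Name}]";

    public async Task<IDataReader> ReadDataAsync(CancellationToken ct = default)
    {
        if (_connection is not null)
            throw new InvalidOperationException("ReadDataAsync was already called.");

        _connection = _factory.Create();
        await _connection.OpenAsync(ct);

        _command = _connection.CreateCommand();
        _command.CommandText = _sql;
        foreach (var pair in _parameters)
        {
            DbParameter p = _command.CreateParameter();
            p.ParameterName = pair.Key;
            p.Value = pair.Value ?? DBNull.Value;
            _command.Parameters.Add(p);
        }

        // The reader is forward-only, so rows are pulled as the caller reads them.
        DbDataReader reader = await _command.ExecuteReaderAsync(ct);
        _reader = reader;
        return reader;
    }

    // Reader, command, and connection stay alive until the source is disposed.
    public async ValueTask DisposeAsync()
    {
        if (_reader is not null) await _reader.DisposeAsync();
        if (_command is not null) await _command.DisposeAsync();
        if (_connection is not null) await _connection.DisposeAsync();
    }
}
```

Holding the reader, command, and connection as fields is what lets the returned IDataReader outlive the ReadDataAsync call, at the cost of making the source single-use.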


Transformers.md

Content:

  1. Interface contract - IDataTransformer with Transform(), TransformerName, MapOrdinal()
  2. Base class pattern - DataTransformerBase:
    • Default IDataReader method implementations
    • Lazy initialization via OnInitialize()
    • Ordinal mapping for binary methods
    • Computed columns return -1 from MapOrdinal
  3. Annotated walkthrough of three transformers:
    • ColumnRenameTransformer - Simple: remaps names, validates collisions
    • ColumnDropTransformer - Removes columns, overrides MapOrdinal
    • JdeDateTransformer - Complex: combines columns, sentinel handling, GetDataTypeName override
  4. Chaining behavior - Transformers compose, ordinal mappings accumulate
  5. Validation patterns - Configuration validation in OnInitialize()

Estimated length: ~250-300 lines with annotated code from all three transformers
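
The chaining behavior in item 4 is easier to review with a worked example of how ordinal mappings accumulate. The following is a hedged sketch using plain arrays rather than the real DataTransformerBase; the class and method names are hypothetical, and only the -1 convention for computed columns is taken from this design.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrdinalMapSketch
{
    // Per the design, computed columns have no source ordinal.
    public const int Computed = -1;

    // Builds the map "output ordinal -> input ordinal" for a stage that drops columns.
    public static int[] DropColumns(string[] inputColumns, params string[] dropped)
    {
        var drop = new HashSet<string>(dropped, StringComparer.OrdinalIgnoreCase);
        return Enumerable.Range(0, inputColumns.Length)
            .Where(i => !drop.Contains(inputColumns[i]))
            .ToArray();
    }

    // Composes two stages: 'outer' reads from the output of 'inner'.
    // A computed column in either stage stays computed in the result.
    public static int[] Compose(int[] inner, int[] outer) =>
        outer.Select(o => o == Computed ? Computed : inner[o]).ToArray();
}
```

With source columns A, B, C, D, dropping B gives the map [0, 2, 3]. A second stage that drops D from A, C, D gives [0, 1]. Composing the two yields [0, 2]: the final columns A and C map back to source ordinals 0 and 2, which is what a binary method such as GetBytes would need in order to delegate to the original reader.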


Destinations.md

Content:

  1. Interface contract - IImportDestination with WriteAsync() returning DestinationResult
  2. Annotated walkthrough of DbBulkImportDestination:
    • Truncates target, bulk copies all rows
    • Column mapping via INFORMATION_SCHEMA.COLUMNS
    • Batch processing with DataTable buffer
  3. Annotated walkthrough of DbBulkMergeDestination:
    • Temp table creation from destination schema
    • Bulk copy to temp, MERGE SQL execution
    • Match columns for upsert, configurable update columns
    • Schema-qualified names via CommonScripts.ParseTableName()
  4. Script patterns (folded in):
    • IScriptRunner and SqlScriptRunner
    • CommonScripts.DisableIndexes() / RebuildIndexes() / UpdateStatistics()
    • When to use pre/post scripts
    • QUOTENAME for SQL injection protection
  5. Timeout and batch size configuration

Estimated length: ~300-350 lines including MERGE SQL logic
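
To make the MERGE step in item 3 concrete, here is a minimal sketch of building the statement from match and update columns. It is an illustration under stated assumptions, not the logic in DbBulkMergeDestination.cs: the class, method, and parameter names are hypothetical, and identifier quoting is done client-side with the same bracket-doubling rule that T-SQL QUOTENAME applies by default.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MergeSqlSketch
{
    // Wraps an identifier in brackets and doubles any closing bracket,
    // mirroring the default behavior of T-SQL QUOTENAME.
    public static string Quote(string identifier) =>
        "[" + identifier.Replace("]", "]]") + "]";

    // 'target' and 'temp' are expected to be already-quoted table names.
    public static string Build(
        string target,
        string temp,
        IReadOnlyList<string> matchColumns,
        IReadOnlyList<string> updateColumns,
        IReadOnlyList<string> allColumns)
    {
        if (matchColumns.Count == 0)
            throw new ArgumentException("At least one match column is required.", nameof(matchColumns));

        string on = string.Join(" AND ", matchColumns.Select(c => $"t.{Quote(c)} = s.{Quote(c)}"));
        string set = string.Join(", ", updateColumns.Select(c => $"t.{Quote(c)} = s.{Quote(c)}"));
        string cols = string.Join(", ", allColumns.Select(c => Quote(c)));
        string vals = string.Join(", ", allColumns.Select(c => $"s.{Quote(c)}"));

        // With no update columns the statement degrades to insert-only.
        string matched = updateColumns.Count == 0 ? "" : $"WHEN MATCHED THEN UPDATE SET {set} ";

        return $"MERGE {target} AS t USING {temp} AS s ON {on} " + matched +
               $"WHEN NOT MATCHED BY TARGET THEN INSERT ({cols}) VALUES ({vals});";
    }
}
```

Quoting every column name before it is spliced into the statement is the reason the design calls out QUOTENAME: column names come from schema metadata and configuration, so they cannot be passed as SQL parameters.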


Configuration.md

Content:

  1. Pipeline builder API - Full EtlPipelineBuilder reference:
    • WithName(), WithSource(), WithDestination()
    • WithTransformer() (chainable)
    • WithPreScript(), WithPostScript()
    • WithCommandTimeout() - validated range 0-24 hours, default 600s
    • WithLogger(), Build()
  2. Connection factory setup:
    • Connection strings for SQL Server, JDE Oracle, CMS Sybase
    • Pooling considerations
  3. DI registration - EtlServiceCollectionExtensions:
    • Registering connection factories
    • Pipeline builders as transient
    • Scoped vs singleton
  4. Configuration table - Quick reference with defaults and valid ranges

Estimated length: ~200-250 lines
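
The timeout validation noted for WithCommandTimeout() can be sketched as a fluent builder. This is an assumption-laden illustration rather than the real EtlPipelineBuilder: the class names are hypothetical, the parameter is assumed to be a TimeSpan, and "0-24 hours" is read as inclusive at both ends.

```csharp
using System;

// Hypothetical settings object; the real builder produces a pipeline instead.
public sealed record PipelineSettingsSketch(string Name, TimeSpan CommandTimeout);

public sealed class PipelineBuilderSketch
{
    public static readonly TimeSpan DefaultTimeout = TimeSpan.FromSeconds(600);
    public static readonly TimeSpan MaxTimeout = TimeSpan.FromHours(24);

    private string _name = "";
    private TimeSpan _timeout = DefaultTimeout;

    public PipelineBuilderSketch WithName(string name)
    {
        if (string.IsNullOrWhiteSpace(name))
            throw new ArgumentException("A pipeline name is required.", nameof(name));
        _name = name;
        return this; // returning 'this' is what makes the calls chainable
    }

    public PipelineBuilderSketch WithCommandTimeout(TimeSpan timeout)
    {
        // Assumed reading of the design: both bounds are inclusive.
        if (timeout < TimeSpan.Zero || timeout > MaxTimeout)
            throw new ArgumentOutOfRangeException(
                nameof(timeout), timeout, "Timeout must be between 0 and 24 hours.");
        _timeout = timeout;
        return this;
    }

    public PipelineSettingsSketch Build()
    {
        if (_name.Length == 0)
            throw new InvalidOperationException("WithName must be called before Build.");
        return new PipelineSettingsSketch(_name, _timeout);
    }
}
```

A call such as `new PipelineBuilderSketch().WithName("orders").Build()` keeps the 600-second default, while passing 25 hours to WithCommandTimeout fails immediately instead of at execution time.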


Troubleshooting.md

Content:

  1. Common errors and fixes (table format):
    • "No columns from source exist in destination" → Column name mismatch
    • "Column name collision" → Duplicate names from transformer
    • "GetBytes not supported for computed column" → Binary access on transformed column
    • Timeout exceptions → Increase timeout, reduce batch size
  2. Debugging patterns:
    • Inspecting PipelineResult.Steps
    • Checking StepResult.RowsAffected
    • Debug logging via ILogger<EtlPipeline>
    • PipelineResult.Exception for root cause
  3. Performance tuning:
    • Batch size guidelines (start 10000, adjust for row width)
    • Index management (disable before, rebuild after)
    • Timeout sizing (1 min per 100K rows rule of thumb)
    • Column filtering to reduce network/memory
  4. Monitoring - Using elapsed times for baselines

Estimated length: ~200-250 lines
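
The timeout sizing rule of thumb in item 3 reduces to simple arithmetic. A possible sketch follows; the helper name is hypothetical, and using the 600-second default as a floor and 24 hours as a ceiling is an assumption drawn from the Configuration.md notes, not a documented behavior.

```csharp
using System;

public static class TimeoutSizingSketch
{
    private const long RowsPerMinute = 100_000;   // rule of thumb from the design
    private const long CeilingMinutes = 24 * 60;  // assumed ceiling: 24 hours
    private const long FloorMinutes = 10;         // assumed floor: the 600s default

    public static TimeSpan Estimate(long expectedRows)
    {
        if (expectedRows < 0)
            throw new ArgumentOutOfRangeException(nameof(expectedRows));

        // One minute per 100K rows, rounded up without risking overflow.
        long minutes = expectedRows / RowsPerMinute
                     + (expectedRows % RowsPerMinute == 0 ? 0 : 1);

        // Clamp to the assumed floor and ceiling.
        minutes = Math.Max(FloorMinutes, Math.Min(minutes, CeilingMinutes));
        return TimeSpan.FromMinutes((double)minutes);
    }
}
```

Under these assumptions, 2,500,000 rows yields 25 minutes, and anything up to one million rows stays at the 600-second floor. Wide rows or slow links would call for a larger margin than the rule of thumb gives.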


Source Files to Reference

| Document | Primary Source Files |
|---|---|
| Overview.md | EtlPipeline.cs, all contracts |
| Sources.md | IImportSource.cs, DbQuerySource.cs |
| Transformers.md | IDataTransformer.cs, DataTransformerBase.cs, JdeDateTransformer.cs, ColumnRenameTransformer.cs, ColumnDropTransformer.cs |
| Destinations.md | IImportDestination.cs, DbBulkImportDestination.cs, DbBulkMergeDestination.cs, IScriptRunner.cs, SqlScriptRunner.cs, CommonScripts.cs |
| Configuration.md | EtlPipelineBuilder.cs, EtlServiceCollectionExtensions.cs |
| Troubleshooting.md | PipelineResult.cs, StepResult.cs, DestinationResult.cs |

Implementation Notes

  • All code snippets must come from actual source files per StyleGuide.md
  • Follow ComponentMap.md location: DOCUMENTATION/DataSync/
  • Update ComponentMap.md to include new ETL source paths
  • Cross-reference between docs using relative links