# Migration Plan / Roadmap
How we get from current-state.md to goal-state.md over the 3-year plan.
Status: in progress. The structure below is a scaffold. Cells are intentionally thin — fill in as decisions land, don't over-commit before the work is real.
## Purpose
This document answers how the 3-year plan is actually executed — what gets built, migrated, or retired, in what order, and with what dependencies. It does not re-state goal-state.md (the destination) or current-state.md (the starting point); it refers to them.
Read this together with:
- `goal-state.md` — the destination, success criteria, and the three in-scope pillars.
- `current-state.md` — today's system of record.
- `current-state/legacy-integrations.md` — the authoritative inventory for pillar 3 retirement.
## Guiding principles
Carried forward from goal-state.md → Vision; load-bearing for any sequencing decision:
- Stable, single point of integration. Every step in the roadmap should move the estate toward the single ScadaBridge-central IT↔OT bridge, never away from it. No step should create a parallel or bespoke integration path, even as a "temporary" measure.
- Three pillars are binary at end of plan. Unification at 100% of sites, analytics/AI enablement with at least one "not possible before" use case in production, and zero remaining legacy IT↔OT paths. Intermediate progress is tracked per-site / per-tag / per-integration here — not by softening the end-state criteria.
- Data locality is preserved throughout. ScadaBridge remains the local data path at each site; centralizations (Redpanda, SnowBridge, Snowflake) sit above that, not around it.
- Dual-run before retire. No legacy integration is switched off until its replacement has been running in production with the same consumers for a period defined in the integration's retirement criteria. Roadmap steps reflect this with explicit dual-run phases.
- Support staffing, licensing, orchestrator selection, VM-level DR, and physical network segmentation are out of scope for this plan. Sequencing should not gate on decisions that belong to other teams.
## Organizing axis: workstreams × years
The roadmap is laid out as a 2D grid — workstreams (rows) crossed with years (columns). Each workstream owns a component or capability, and each year's cell describes what happens in that workstream during that year.
### Workstreams
- OtOpcUa — evolve the existing in-house `LmxOpcUa` into a unified clustered OPC UA server (OtOpcUa) with two namespaces: the existing System Platform namespace plus a new equipment namespace that holds the single session to each piece of equipment. Ship it to every site and execute the tiered cutover of downstream consumers (see `goal-state.md` → OtOpcUa — the unified site-level OPC UA layer (absorbs LmxOpcUa)). Prioritized first because it is foundational for the rest of the OT plan.
- Redpanda EventHub — stand up and operate the central Kafka-compatible backbone (see `goal-state.md` → Async Event Backbone).
- SnowBridge — custom-build the dedicated service that owns all machine-data flows into Snowflake (see `goal-state.md` → SnowBridge).
- Snowflake dbt Transform Layer — build and evolve the dbt curated layers that Snowflake consumers read from (see `goal-state.md` → Aveva Historian → Snowflake → Snowflake-side transform tooling).
- ScadaBridge Extensions — add and tune the capabilities ScadaBridge needs to serve the new architecture (deadband publishing, EventHub producer configuration, auth alignment).
- Site Onboarding — bring the currently unintegrated smaller sites onto the standardized stack, and keep the already-integrated sites aligned with the evolving pattern.
- Legacy Retirement — discover, sequence, migrate, dual-run, and retire every legacy IT↔OT path tracked in `current-state/legacy-integrations.md`.
### Workstream → pillar mapping
| Workstream | Primary pillar(s) |
|---|---|
| OtOpcUa | Pillars 1, 2 — foundational (unblocks consistent equipment access for both unification and analytics paths) |
| Redpanda EventHub | Pillar 2 (analytics/AI enablement) — foundational |
| SnowBridge | Pillar 2 |
| Snowflake dbt Transform Layer | Pillar 2 |
| ScadaBridge Extensions | Pillars 1, 2, 3 — touches all three |
| Site Onboarding | Pillar 1 (unification) |
| Legacy Retirement | Pillar 3 (legacy retirement) |
### Cross-workstream dependencies
- OtOpcUa is foundational and its deployment (software installed and ready at every site) is a Year 1 prerequisite for everything else. Its cutover (consumers redirected to it) follows the tiered order and extends across all three years, but the software must be present at every site before other workstreams take hard dependencies on equipment-data consistency. LmxOpcUa is already deployed per-node; Year 1 grows it into OtOpcUa in place, which keeps the rollout a low-risk evolution rather than a parallel install.
- Redpanda must be in place before the SnowBridge can consume Redpanda-backed flows, and before ScadaBridge Extensions can test the EventHub producer path end-to-end.
- The SnowBridge must be in place before dbt curated layers can be built on real machine-data landing tables.
- The Legacy inventory (in `current-state/legacy-integrations.md`) must be populated before Legacy Retirement can be sequenced; inventory discovery is a Year 1 prerequisite.
- ScadaBridge tier-1 cutover (ScadaBridge reading from OtOpcUa instead of equipment directly) must be completed at a site before ScadaBridge Extensions at that site can rely on consistent equipment-data semantics for downstream Redpanda publishing.
- Site Onboarding for the smaller sites depends on having the standardized stack (OtOpcUa + ScadaBridge + Redpanda + SnowBridge) reasonably proven at the large sites — so heavy onboarding is Year 2+, not Year 1.
- Dual-run for any retired legacy path requires the replacement path to be live — so Legacy Retirement's execution lags the workstream that delivers the replacement (most often ScadaBridge Extensions or the SnowBridge).
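
The dependencies above form a small DAG, so a valid execution ordering can be checked mechanically rather than re-argued in prose. A minimal sketch using Python's standard-library `graphlib`; the node labels are paraphrases of the bullets above, not official workstream names:

```python
from graphlib import TopologicalSorter

# Mapping: item -> set of items that must come first (its prerequisites),
# transcribed from the dependency bullets above.
deps = {
    "OtOpcUa deployment": set(),
    "Redpanda EventHub": set(),
    "ScadaBridge tier-1 cutover": {"OtOpcUa deployment"},
    "SnowBridge Redpanda-backed flows": {"Redpanda EventHub"},
    "ScadaBridge EventHub producer path": {"Redpanda EventHub", "ScadaBridge tier-1 cutover"},
    "dbt curated layers": {"SnowBridge Redpanda-backed flows"},
    "Smaller-site onboarding": {
        "OtOpcUa deployment",
        "SnowBridge Redpanda-backed flows",
        "ScadaBridge EventHub producer path",
    },
    "Legacy Retirement execution": {
        "ScadaBridge EventHub producer path",
        "SnowBridge Redpanda-backed flows",
    },
}

# static_order() raises CycleError if the dependency list ever becomes circular,
# and otherwise yields every prerequisite before its dependents.
order = list(TopologicalSorter(deps).static_order())
```

Keeping the edges in data like this makes it cheap to re-validate the roadmap whenever a new dependency bullet is added.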
## The grid
| Workstream | Year 1 — Foundation | Year 2 — Scale | Year 3 — Completion |
|---|---|---|---|
| OtOpcUa | Evolve LmxOpcUa into OtOpcUa — extend the existing in-house OPC UA server to add (a) a new equipment namespace with single session per equipment via native protocols translated to OPC UA (committed core drivers: OPC UA Client, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, plus Galaxy carried forward), and (b) clustering (non-transparent redundancy, 2-node per site) on top of the existing per-node deployment. Driver stability tiers: Tier A in-process (Modbus, OPC UA Client), Tier B in-process with guards (S7, AB CIP, AB Legacy, TwinCAT), Tier C out-of-process (Galaxy — bitness constraint, FOCAS — uncatchable AVE). Core driver list confirmed by v2 implementation team (protocol survey no longer needed for driver scoping). UNS hierarchy snapshot walk — per-site equipment-instance discovery (site/area/line/equipment + UUID assignment) to feed the initial schemas-repo hierarchy definition and canonical model; target done Q1–Q2. ACL model designed and committed (decisions #129–132): 6-level scope hierarchy, NodePermissions bitmask, generation-versioned NodeAcl table, Admin UI + permission simulator. Phase 1 ships before any driver phase. Deploy OtOpcUa to every site as fast as practical. Begin tier 1 cutover (ScadaBridge) at large sites. Prerequisite: certificate-distribution to consumer trust stores before each cutover. Aveva System Platform IO pattern validation — Year 1 or early Year 2 research to confirm Aveva supports upstream OPC UA data sources, well ahead of Year 3 tier 3. TBD — first-cutover site selection; cutover plan owner (not OtOpcUa — a separate integration/operations team, per decision #136, not yet named); enterprise shortname for UNS hierarchy root; schemas-repo owner team and dedicated repo creation. | Complete tier 1 (ScadaBridge) across all sites. Begin tier 2 (Ignition) — Ignition consumers redirected from direct-equipment OPC UA to each site's OtOpcUa, collapsing WAN session counts from N per equipment to one per site. Build long-tail drivers on demand as sites require them. Resolve Warsaw per-building multi-cluster consumer-addressing pattern (consumer-side stitching vs site-aggregator OtOpcUa instance). TBD — per-site tier-2 rollout sequence. | Complete tier 2 (Ignition) across all sites. Execute tier 3 (Aveva System Platform IO) with compliance stakeholder validation — the hardest cutover because System Platform IO feeds validated data collection. Reach steady state: every equipment session is held by OtOpcUa, every downstream consumer reads OT data through it. TBD — per-equipment-class criteria for System Platform IO re-validation. |
| Redpanda EventHub | Stand up central Redpanda cluster in South Bend (single-cluster HA). Stand up bundled Schema Registry. Wire SASL/OAUTHBEARER to enterprise IdP. Create initial topic set (prefix-based ACLs). Hook up observability minimum signal set. Define the three retention tiers (operational/analytics/compliance). Stand up the central schemas repo with buf CI, CODEOWNERS, and the NuGet publishing pipeline. Publish the canonical equipment/production/event model v1 — including the canonical machine state vocabulary (Running / Idle / Faulted / Starved / Blocked + any agreed additions) as a Protobuf enum, the equipment.state.transitioned event schema, and initial equipment-class definitions for pilot equipment. This is the foundation for Digital Twin Use Cases 1 and 3 (see goal-state.md → Strategic Considerations → Digital twin) and is load-bearing for pillar 2. Pilot equipment class for canonical definition: FANUC CNC (pre-defined FOCAS2 hierarchy already exists in OtOpcUa v2 driver design). Land the FANUC CNC class template in the schemas repo before Tier 1 cutover begins. TBD — sizing decisions, initial topic list, canonical vocabulary ownership (domain SME group). | Expand topic coverage as additional domains onboard. Enforce tiered retention and ACLs at scale. Prove backlog replay after a WAN-outage drill (also exercises the Digital Twin Use Case 2 simulation-lite replay path). Exercise long-outage planning (ScadaBridge queue capacity vs. outage duration). Iterate the canonical model as additional equipment classes and domains onboard. TBD — concrete drill cadence. | Steady-state operation. Harden alerting and runbooks against the observed failure modes from Years 1–2. Canonical model is mature and covers every in-scope equipment class; schema changes are routine rather than foundational. |
| SnowBridge | Design and begin custom build in .NET. Filtered, governed upload to Snowflake is the Year 1 purpose — the service is the component that decides which topics/tags flow to Snowflake, applies the governed selection model, and writes into Snowflake. Ship an initial version with one working source adapter — starting with Aveva Historian (SQL interface) because it's central-only, exists today, and lets the workstream progress in parallel with Redpanda rather than waiting on it. First end-to-end filtered flow to Snowflake landing tables on a handful of priority tags. Selection model in place even if the operator UI isn't yet (config-driven is acceptable for Year 1). TBD — team, credential management, datastore for selection state. | Add the ScadaBridge/Redpanda source adapter alongside Historian. Build and ship the operator web UI + API on top of the Year 1 selection model, including the blast-radius-based approval workflow, audit trail, RBAC, and exportable state. Onboard priority tags per domain under the UI-driven governance path. TBD — UI framework. | All planned source adapters live behind the unified interface. Approval workflow tuned based on Year 2 operational experience. Feature freeze; focus on hardening. |
| Snowflake dbt Transform Layer | Scaffold a dbt project in git, wired to the self-hosted orchestrator (per goal-state.md; specific orchestrator chosen outside this plan). Build first landing → curated model for priority tags. Align curated views with the canonical model v1 published in the schemas repo — equipment, production, and event entities in the curated layer use the canonical state vocabulary and the same event-type enum values, so downstream consumers (Power BI, ad-hoc analysts, future AI/ML) see the same shape of data Redpanda publishes. This is the dbt-side delivery for Digital Twin Use Cases 1 and 3. Establish dbt test discipline from day one — including tests that catch divergence between curated views and the canonical enums. TBD — project layout (single vs per-domain); reconciliation rule if derived state in curated views disagrees with the layer-3 derivation (should not happen, but the rule needs to exist). | Build curated layers for all in-scope domains. Ship a canonical-state-based OEE model as a strong candidate for the pillar-2 "not possible before" use case — accurate cross-equipment, cross-site OEE computed once in dbt from the canonical state stream, rather than re-derived in every reporting surface. Source-freshness SLAs tied to the ≤15-minute analytics budget. Begin development of the first "not possible before" AI/analytics use case (pillar 2). | The "not possible before" use case is in production, consuming the curated layer, meeting its own SLO. Pillar 2 check passes. |
| ScadaBridge Extensions | Implement deadband / exception-based publishing with the global-default model (+ override mechanism). Add EventHub producer capability with per-call store-and-forward to Redpanda. Verify co-located footprint doesn't degrade System Platform. TBD — global deadband value, override mechanism location. | Roll deadband + EventHub producer to all currently-integrated sites. Tune deadband and overrides based on observed Snowflake cost. Support early legacy-retirement work with outbound Web API / DB write patterns as needed. | Steady state. Any remaining Extensions work is residual cleanup or support for the tail end of Site Onboarding / Legacy Retirement. |
| Site Onboarding | No new site onboardings in Year 1. Use the year to define and document the lightweight onboarding pattern for smaller sites — equipment types, network requirements, standard ScadaBridge template set, standard topic/tag set. Keep the existing integrated sites stable. | Pilot the onboarding pattern on one smaller site end-to-end (Berlin, Winterthur, or Jacksonville — choice TBD). Use learnings to refine the pattern, then begin scaling onboarding to additional smaller sites. TBD — pilot site selection criteria, per-site effort estimate. | Complete onboarding of all remaining smaller sites. Every site on the authoritative list is on the standardized stack. Pillar 1 check passes. |
| Legacy Retirement | Populate the legacy inventory (current-state/legacy-integrations.md) — this is the prerequisite for sequencing. Identify early-retirement candidates where the replacement path already exists (e.g., LEG-002 Camstar, since ScadaBridge already has a native Camstar path). Retire at least one integration end-to-end as a pattern-proving exercise (including dual-run + decommission). TBD — inventory ownership, discovery approach. | Bulk migration. Execute retirements in sequence against the inventory, prioritized by a mix of risk and ease. Each retirement follows: plan → build replacement (often in ScadaBridge Extensions) → dual-run → cutover → decommission. Inventory burn-down tracked quarterly. TBD — prioritization rubric, dual-run duration per integration class. | Drive inventory to zero. Any remaining integrations are in dual-run or decommission phase at start of year; the inventory reaches zero by end of year. Pillar 3 check passes. |
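
The OtOpcUa Year 1 cell commits to a NodePermissions bitmask resolved against a 6-level scope hierarchy. A minimal Python sketch of that shape — the permission bit names, the "deepest scope wins" resolution rule, and the example scope levels are all illustrative assumptions; the actual design lives in decisions #129–132:

```python
from enum import Flag, auto

class NodePermissions(Flag):
    """Illustrative permission bits; the real bit layout is set by the ACL decisions."""
    NONE = 0
    BROWSE = auto()
    READ = auto()
    WRITE = auto()
    SUBSCRIBE = auto()
    ADMIN = auto()

def effective_permissions(acl: dict[tuple[str, ...], NodePermissions],
                          node_path: tuple[str, ...]) -> NodePermissions:
    """Resolve a node's permissions by walking its scope path from root to leaf.

    acl maps a scope path (a tuple prefix of the hierarchy) to a permission set;
    an entry at a deeper scope replaces shallower ones (assumed rule).
    """
    perms = NodePermissions.NONE
    for depth in range(len(node_path) + 1):
        scope = node_path[:depth]
        if scope in acl:
            perms = acl[scope]
    return perms
```

A bitmask plus prefix-scoped entries keeps the per-generation NodeAcl table small: one row per scope, not one per node.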
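
The Redpanda Year 1 cell names the canonical machine-state vocabulary and the equipment.state.transitioned event. The real artifacts are Protobuf in the schemas repo; as a language-neutral sketch of the same shape in Python — the field names and the reserved-zero convention are assumptions, only the five state names come from the plan:

```python
from dataclasses import dataclass
from enum import Enum

class MachineState(Enum):
    """Canonical machine-state vocabulary from the plan (a Protobuf enum in the schemas repo)."""
    UNSPECIFIED = 0   # Protobuf convention: reserve 0 for an unspecified value (assumed carry-over)
    RUNNING = 1
    IDLE = 2
    FAULTED = 3
    STARVED = 4
    BLOCKED = 5

@dataclass(frozen=True)
class EquipmentStateTransitioned:
    """Illustrative shape of the equipment.state.transitioned event; field names are assumptions."""
    equipment_id: str            # UUID assigned during the UNS hierarchy snapshot walk
    site: str
    previous_state: MachineState
    new_state: MachineState
    occurred_at_utc: str         # ISO-8601 timestamp

evt = EquipmentStateTransitioned(
    equipment_id="2f6c1d9e-example", site="south-bend",
    previous_state=MachineState.RUNNING, new_state=MachineState.FAULTED,
    occurred_at_utc="2026-01-07T14:03:22Z",
)
```

Because the dbt curated layer reuses the same enum values, a state transition looks identical whether it is read off the Redpanda topic or out of a curated Snowflake view.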
### Cell reading guide
- Each cell describes what happens in that workstream during that year — not every task, just the shape.
- _TBD_ markers inside cells are open items that don't need to be resolved today but will before execution of that cell.
- Cells deliberately omit dates beyond "Year N" until individual workstreams firm up their delivery plans. This file should not become a Gantt chart; it's the strategic shape, not the project plan.
## End-of-plan pillar checks
At the end of Year 3, the three pillar criteria from goal-state.md → Success Criteria are each binary. The cells above are structured so that the relevant workstream ends Year 3 having either satisfied its share of the check or not.
- Pillar 1 — Unification: Site Onboarding ends Year 3 with all sites on the standardized stack.
- Pillar 2 — Analytics/AI Enablement: dbt + SnowBridge + Redpanda end Year 3 with the "not possible before" use case in production against the ≤15-minute analytics SLO.
- Pillar 3 — Legacy Retirement: Legacy Retirement ends Year 3 with the inventory at zero.
If a workstream appears to be falling behind its Year 3 cell, the response is never to soften the end-state criterion. The options are to accelerate the workstream, reallocate from a lower-risk workstream, or formally accept slippage and adjust the plan — but the success criteria are not moved.
## Open questions
- Starting source adapter for the SnowBridge. Year 1 commits to Aveva Historian (SQL interface) as the first source adapter — central-only Historian, exists today, lets the workstream progress in parallel with Redpanda. The Redpanda/ScadaBridge source adapter follows in Year 2 once that workstream has matured. Validate with the build team only if Historian SQL read proves painful at scale.
- Deadband global default value. Resolved. Starting value is approximately 1% of span for analogs, change-only for booleans/state, every-increment for counters — captured in `goal-state.md` under the deadband model. The build team may adjust during implementation; the mechanism is the load-bearing commitment, not the number.
- Pilot smaller-site selection. Year 2 Site Onboarding needs a pilot site chosen early in Year 2 (or late Year 1).
- Quarterly milestones. This grid is year-level only. Quarterly milestones that roll up into the three pillar checks are not yet defined — if leadership reporting needs them, they belong in a companion document, not in this grid.