Finalize digital-twin scope as two ACL-based patterns

Plan's digital-twin scope is now exactly (1) environment-lifecycle
promotion via ACL flip on write authority and (2) safe read-only
consumption for KPI / monitoring systems — both delivered by
already-committed architecture. Removes the meeting-prep brief and the
management-delivered use-cases source document; canonical model and
state vocabulary stand as pillar-2 work on their own.
This commit is contained in:
Joseph Doherty
2026-04-24 14:13:10 -04:00
parent c8d7bf37de
commit 22a86974f6
11 changed files with 59 additions and 303 deletions

View File

@@ -64,9 +64,9 @@ The roadmap is laid out as a 2D grid — **workstreams** (rows) crossed with **y
| Workstream | **Year 1 — Foundation** | **Year 2 — Scale** | **Year 3 — Completion** |
|---|---|---|---|
| **OtOpcUa** | **Evolve LmxOpcUa into OtOpcUa** — extend the existing in-house OPC UA server to add (a) a new equipment namespace with single session per equipment via native protocols translated to OPC UA (committed core drivers: OPC UA Client, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, plus Galaxy carried forward), and (b) clustering (non-transparent redundancy, 2-node per site) on top of the existing per-node deployment. **Driver stability tiers:** Tier A in-process (Modbus, OPC UA Client), Tier B in-process with guards (S7, AB CIP, AB Legacy, TwinCAT), Tier C out-of-process (Galaxy — bitness constraint, FOCAS — uncatchable AVE). Core driver list confirmed by v2 implementation team (protocol survey no longer needed for driver scoping). **UNS hierarchy snapshot walk** — per-site equipment-instance discovery (site/area/line/equipment + UUID assignment) to feed the initial schemas-repo hierarchy definition and canonical model; target done Q1Q2. **ACL model designed and committed** (decisions #129132): 6-level scope hierarchy, `NodePermissions` bitmask, generation-versioned `NodeAcl` table, Admin UI + permission simulator. Phase 1 ships before any driver phase. **Deploy OtOpcUa to every site** as fast as practical. **Begin tier 1 cutover (ScadaBridge)** at large sites. **Prerequisite: certificate-distribution** to consumer trust stores before each cutover. **Aveva System Platform IO pattern validation** — Year 1 or early Year 2 research to confirm Aveva supports upstream OPC UA data sources, well ahead of Year 3 tier 3. _TBD — first-cutover site selection; **cutover plan owner** (not OtOpcUa — a separate integration/operations team, per decision #136, not yet named); enterprise shortname for UNS hierarchy root; schemas-repo owner team and dedicated repo creation._ | **Complete tier 1 (ScadaBridge)** across all sites. **Begin tier 2 (Ignition)** — Ignition consumers redirected from direct-equipment OPC UA to each site's OtOpcUa, collapsing WAN session counts from *N per equipment* to *one per site*. **Build long-tail drivers** on demand as sites require them. Resolve Warsaw per-building multi-cluster consumer-addressing pattern (consumer-side stitching vs site-aggregator OtOpcUa instance). _TBD — per-site tier-2 rollout sequence._ | **Complete tier 2 (Ignition)** across all sites. **Execute tier 3 (Aveva System Platform IO)** with compliance stakeholder validation — the hardest cutover because System Platform IO feeds validated data collection. Reach steady state: every equipment session is held by OtOpcUa, every downstream consumer reads OT data through it. _TBD — per-equipment-class criteria for System Platform IO re-validation._ |
| **Redpanda EventHub** | Stand up central Redpanda cluster in South Bend (single-cluster HA). Stand up bundled Schema Registry. Wire SASL/OAUTHBEARER to enterprise IdP. Create initial topic set (prefix-based ACLs). Hook up observability minimum signal set. Define the three retention tiers (`operational`/`analytics`/`compliance`). **Stand up the central `schemas` repo** with `buf` CI, CODEOWNERS, and the NuGet publishing pipeline. **Publish the canonical equipment/production/event model v1** — including the canonical machine state vocabulary (`Running / Idle / Faulted / Starved / Blocked` + any agreed additions) as a Protobuf enum, the `equipment.state.transitioned` event schema, and initial equipment-class definitions for pilot equipment. This is the foundation for Digital Twin Use Cases 1 and 3 (see `goal-state.md` → Strategic Considerations → Digital twin) and is load-bearing for pillar 2. **Pilot equipment class for canonical definition: FANUC CNC** (pre-defined FOCAS2 hierarchy already exists in OtOpcUa v2 driver design). Land the FANUC CNC class template in the schemas repo before Tier 1 cutover begins. **Universal `_base` equipment-class template** seeded by the OtOpcUa team — every other class extends it via the `extends` field on the equipment-class JSON Schema. `_base` aligns to **OPC UA Companion Spec OPC 40010 (Machinery)** for the Identification component (Manufacturer, Model, ProductInstanceUri, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, ManufacturerUri, DeviceManual, AssetLocation) and MachineryOperationMode enum, **OPC UA Part 9** for alarm-summary fields, and **ISO 22400** for lifetime counters that feed Availability + Performance KPIs. Avoids per-class drift in identity / state / alarm field naming and ensures every machine in the estate exposes the same baseline metadata regardless of vendor. _TBD — sizing decisions, initial topic list, canonical vocabulary ownership (domain SME group)._ | Expand topic coverage as additional domains onboard. Enforce tiered retention and ACLs at scale. Prove backlog replay after a WAN-outage drill (also exercises the Digital Twin Use Case 2 simulation-lite replay path). Exercise long-outage planning (ScadaBridge queue capacity vs. outage duration). Iterate the canonical model as additional equipment classes and domains onboard. _TBD — concrete drill cadence._ | Steady-state operation. Harden alerting and runbooks against the observed failure modes from Years 12. Canonical model is mature and covers every in-scope equipment class; schema changes are routine rather than foundational. |
| **Redpanda EventHub** | Stand up central Redpanda cluster in South Bend (single-cluster HA). Stand up bundled Schema Registry. Wire SASL/OAUTHBEARER to enterprise IdP. Create initial topic set (prefix-based ACLs). Hook up observability minimum signal set. Define the three retention tiers (`operational`/`analytics`/`compliance`). **Stand up the central `schemas` repo** with `buf` CI, CODEOWNERS, and the NuGet publishing pipeline. **Publish the canonical equipment/production/event model v1** — including the canonical machine state vocabulary (`Running / Idle / Faulted / Starved / Blocked` + any agreed additions) as a Protobuf enum, the `equipment.state.transitioned` event schema, and initial equipment-class definitions for pilot equipment. This is load-bearing for pillar 2 (canonical model is what makes cross-domain "not possible before" analytics possible at all). **Pilot equipment class for canonical definition: FANUC CNC** (pre-defined FOCAS2 hierarchy already exists in OtOpcUa v2 driver design). Land the FANUC CNC class template in the schemas repo before Tier 1 cutover begins. **Universal `_base` equipment-class template** seeded by the OtOpcUa team — every other class extends it via the `extends` field on the equipment-class JSON Schema. `_base` aligns to **OPC UA Companion Spec OPC 40010 (Machinery)** for the Identification component (Manufacturer, Model, ProductInstanceUri, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, ManufacturerUri, DeviceManual, AssetLocation) and MachineryOperationMode enum, **OPC UA Part 9** for alarm-summary fields, and **ISO 22400** for lifetime counters that feed Availability + Performance KPIs. Avoids per-class drift in identity / state / alarm field naming and ensures every machine in the estate exposes the same baseline metadata regardless of vendor. _TBD — sizing decisions, initial topic list, canonical vocabulary ownership (domain SME group)._ | Expand topic coverage as additional domains onboard. Enforce tiered retention and ACLs at scale. Prove backlog replay after a WAN-outage drill (the replay surface is also the foundation for any future funded physics-simulation / FAT initiative, should one materialize). Exercise long-outage planning (ScadaBridge queue capacity vs. outage duration). Iterate the canonical model as additional equipment classes and domains onboard. _TBD — concrete drill cadence._ | Steady-state operation. Harden alerting and runbooks against the observed failure modes from Years 12. Canonical model is mature and covers every in-scope equipment class; schema changes are routine rather than foundational. |
| **SnowBridge** | Design and begin custom build in .NET. **Filtered, governed upload to Snowflake is the Year 1 purpose** — the service is the component that decides which topics/tags flow to Snowflake, applies the governed selection model, and writes into Snowflake. Ship an initial version with **one working source adapter** — starting with **Aveva Historian (SQL interface)** because it's central-only, exists today, and lets the workstream progress in parallel with Redpanda rather than waiting on it. First end-to-end **filtered** flow to Snowflake landing tables on a handful of priority tags. Selection model in place even if the operator UI isn't yet (config-driven is acceptable for Year 1). _TBD — team, credential management, datastore for selection state._ | Add the **ScadaBridge/Redpanda source adapter** alongside Historian. Build and ship the operator **web UI + API** on top of the Year 1 selection model, including the blast-radius-based approval workflow, audit trail, RBAC, and exportable state. Onboard priority tags per domain under the UI-driven governance path. _TBD — UI framework._ | All planned source adapters live behind the unified interface. Approval workflow tuned based on Year 2 operational experience. Feature freeze; focus on hardening. |
| **Snowflake dbt Transform Layer** | Scaffold a dbt project in git, wired to the self-hosted orchestrator (per `goal-state.md`; specific orchestrator chosen outside this plan). Build first **landing → curated** model for priority tags. **Align curated views with the canonical model v1** published in the `schemas` repo — equipment, production, and event entities in the curated layer use the canonical state vocabulary and the same event-type enum values, so downstream consumers (Power BI, ad-hoc analysts, future AI/ML) see the same shape of data Redpanda publishes. This is the dbt-side delivery for Digital Twin Use Cases 1 and 3. Establish `dbt test` discipline from day one — including tests that catch divergence between curated views and the canonical enums. _TBD — project layout (single vs per-domain); reconciliation rule if derived state in curated views disagrees with the layer-3 derivation (should not happen, but the rule needs to exist)._ | Build curated layers for all in-scope domains. **Ship a canonical-state-based OEE model** as a strong candidate for the pillar-2 "not possible before" use case — accurate cross-equipment, cross-site OEE computed once in dbt from the canonical state stream, rather than re-derived in every reporting surface. Source-freshness SLAs tied to the **≤15-minute analytics** budget. Begin development of the first **"not possible before" AI/analytics use case** (pillar 2). | The "not possible before" use case is **in production**, consuming the curated layer, meeting its own SLO. Pillar 2 check passes. |
| **Snowflake dbt Transform Layer** | Scaffold a dbt project in git, wired to the self-hosted orchestrator (per `goal-state.md`; specific orchestrator chosen outside this plan). Build first **landing → curated** model for priority tags. **Align curated views with the canonical model v1** published in the `schemas` repo — equipment, production, and event entities in the curated layer use the canonical state vocabulary and the same event-type enum values, so downstream consumers (Power BI, ad-hoc analysts, future AI/ML) see the same shape of data Redpanda publishes. This is the dbt-side delivery of the canonical model (load-bearing for pillar 2). Establish `dbt test` discipline from day one — including tests that catch divergence between curated views and the canonical enums. _TBD — project layout (single vs per-domain); reconciliation rule if derived state in curated views disagrees with the layer-3 derivation (should not happen, but the rule needs to exist)._ | Build curated layers for all in-scope domains. **Ship a canonical-state-based OEE model** as a strong candidate for the pillar-2 "not possible before" use case — accurate cross-equipment, cross-site OEE computed once in dbt from the canonical state stream, rather than re-derived in every reporting surface. Source-freshness SLAs tied to the **≤15-minute analytics** budget. Begin development of the first **"not possible before" AI/analytics use case** (pillar 2). | The "not possible before" use case is **in production**, consuming the curated layer, meeting its own SLO. Pillar 2 check passes. |
| **ScadaBridge Extensions** | Implement **deadband / exception-based publishing** with the global-default model (+ override mechanism). Add **EventHub producer** capability with per-call **store-and-forward** to Redpanda. Verify co-located footprint doesn't degrade System Platform. _TBD — global deadband value, override mechanism location._ | Roll deadband + EventHub producer to **all currently-integrated sites**. Tune deadband and overrides based on observed Snowflake cost. Support early legacy-retirement work with outbound Web API / DB write patterns as needed. | Steady state. Any remaining Extensions work is residual cleanup or support for the tail end of Site Onboarding / Legacy Retirement. |
| **Site Onboarding** | **No new site onboardings in Year 1.** Use the year to define and document the **lightweight onboarding pattern** for smaller sites — equipment types, network requirements, standard ScadaBridge template set, standard topic/tag set. Keep the existing integrated sites stable. | **Pilot the onboarding pattern** on one smaller site end-to-end (Berlin, Winterthur, or Jacksonville — choice TBD). Use learnings to refine the pattern, then **begin scaling** onboarding to additional smaller sites. _TBD — pilot site selection criteria, per-site effort estimate._ | **Complete onboarding of all remaining smaller sites.** Every site on the authoritative list is on the standardized stack. Pillar 1 check passes. |
| **Legacy Retirement** | **Populate the legacy inventory** (`current-state/legacy-integrations.md`) — this is the prerequisite for sequencing. Identify **early-retirement candidates** where the replacement path already exists (e.g., **LEG-002 Camstar**, since ScadaBridge already has a native Camstar path). Retire at least one integration end-to-end as a pattern-proving exercise (including dual-run + decommission). _TBD — inventory ownership, discovery approach._ | **Bulk migration.** Execute retirements in sequence against the inventory, prioritized by a mix of risk and ease. Each retirement follows: plan → build replacement (often in ScadaBridge Extensions) → dual-run → cutover → decommission. Inventory burn-down tracked quarterly. _TBD — prioritization rubric, dual-run duration per integration class._ | **Drive inventory to zero.** Any remaining integrations are in dual-run or decommission phase at start of year; the inventory reaches zero by end of year. Pillar 3 check passes. |