Integrate OtOpcUa v2 implementation corrections into plan

19 corrections from handoffs/otopcua-corrections-2026-04-17.md:

Inaccuracies fixed:
- A1: OPC UA-native equipment requires OpcUaClient gateway driver (~hours
  config), not "no driver build"
- A2: "single endpoint" is per-node (non-transparent redundancy), not
  per-cluster; no VIP planned

Missing constraints added:
- B1: ACL surface (EquipmentAcl table, Admin UI, NodeManager enforcement)
  as Year 1 deliverable before Tier 1 cutover
- B2: schemas-repo creation on OtOpcUa critical path with FANUC CNC pilot
- B3: Certificate-distribution as pre-cutover step (per-node ApplicationUri
  trust-pinning)

Architectural decisions incorporated:
- C1: 8 committed core drivers (added TwinCAT/Beckhoff, split AB Legacy)
- C2: Three-tier driver stability model (A/B/C with out-of-process for
  Galaxy and FOCAS)
- C3: Polly v8+ resilience with default-no-retry on writes
- C4: Multi-identifier equipment model (5 IDs: UUID, EquipmentId,
  MachineCode, ZTag, SAPID)
- C5: Consumer cutover plan needs an owner (flagged)
- C6: Per-building cluster implications at Warsaw clarified

TBDs resolved:
- D1: Pilot equipment class = FANUC CNC
- D2: Schemas repo format = JSON Schema (.json), Protobuf derived
- D3: ACL definitions in central config DB alongside driver/topology
- D4: Enterprise shortname still unresolved (flagged as pre-cutover blocker)

New TBDs added:
- E1: UUID generation authority (OtOpcUa vs external system)
- E2: Aveva System Platform IO pattern validation (Year 1/2 research)
- E3: Site-wide vs per-cluster consumer addressing at Warsaw
- E4: Cluster endpoint wording (resolved via A2)
This commit is contained in:
Joseph Doherty
2026-04-17 10:05:07 -04:00
parent 9b2acfe699
commit 68dbc014da
3 changed files with 90 additions and 30 deletions

View File

@@ -63,8 +63,8 @@ The roadmap is laid out as a 2D grid — **workstreams** (rows) crossed with **y
| Workstream | **Year 1 — Foundation** | **Year 2 — Scale** | **Year 3 — Completion** |
|---|---|---|---|
| **OtOpcUa** | **Evolve LmxOpcUa into OtOpcUa** — extend the existing in-house OPC UA server to add (a) a new equipment namespace that holds the single session to each piece of equipment via native device protocols translated to OPC UA, and (b) clustering on top of the existing per-node deployment. The System Platform namespace carries forward from LmxOpcUa; consumers that already use LmxOpcUa keep working. **Proactive protocol survey** across the estate — template, schema, rollup, and classification rule (core vs long-tail) live in [`current-state/equipment-protocol-survey.md`](current-state/equipment-protocol-survey.md); survey is a **Year 1 prerequisite** for core library scope, target steps 13 (System Platform IO / Ignition / ScadaBridge walks) done inside Q1 so the core driver build can start Q2. **Build the core driver library** for the protocols that meet the classification rule. **Deploy OtOpcUa to every site** (in place on existing System Platform nodes where LmxOpcUa already runs) as fast as practical — deployment ≠ cutover; an idle equipment namespace is cheap. **Begin tier 1 cutover (ScadaBridge)** at the large sites where we own both ends of the connection. _TBD — survey owner; first-cutover site selection._ | **Complete tier 1 (ScadaBridge)** across all sites. **Begin tier 2 (Ignition)** — Ignition consumers redirected from direct-equipment OPC UA to each site's OtOpcUa, collapsing WAN session counts from *N per equipment* to *one per site*. **Build long-tail drivers** on demand as sites require them. _TBD — per-site tier-2 rollout sequence._ | **Complete tier 2 (Ignition)** across all sites. **Execute tier 3 (Aveva System Platform IO)** with compliance stakeholder validation — the hardest cutover because System Platform IO feeds validated data collection. Reach steady state: every equipment session is held by OtOpcUa, every downstream consumer reads OT data through it. _TBD — per-equipment-class criteria for System Platform IO re-validation._ |
| **Redpanda EventHub** | Stand up central Redpanda cluster in South Bend (single-cluster HA). Stand up bundled Schema Registry. Wire SASL/OAUTHBEARER to enterprise IdP. Create initial topic set (prefix-based ACLs). Hook up observability minimum signal set. Define the three retention tiers (`operational`/`analytics`/`compliance`). **Stand up the central `schemas` repo** with `buf` CI, CODEOWNERS, and the NuGet publishing pipeline. **Publish the canonical equipment/production/event model v1** — including the canonical machine state vocabulary (`Running / Idle / Faulted / Starved / Blocked` + any agreed additions) as a Protobuf enum, the `equipment.state.transitioned` event schema, and initial equipment-class definitions for pilot equipment. This is the foundation for Digital Twin Use Cases 1 and 3 (see `goal-state.md` → Strategic Considerations → Digital twin) and is load-bearing for pillar 2. _TBD — sizing decisions, initial topic list, pilot equipment classes for the first canonical definition, canonical vocabulary ownership (domain SME group)._ | Expand topic coverage as additional domains onboard. Enforce tiered retention and ACLs at scale. Prove backlog replay after a WAN-outage drill (also exercises the Digital Twin Use Case 2 simulation-lite replay path). Exercise long-outage planning (ScadaBridge queue capacity vs. outage duration). Iterate the canonical model as additional equipment classes and domains onboard. _TBD — concrete drill cadence._ | Steady-state operation. Harden alerting and runbooks against the observed failure modes from Years 12. Canonical model is mature and covers every in-scope equipment class; schema changes are routine rather than foundational. |
| **OtOpcUa** | **Evolve LmxOpcUa into OtOpcUa** — extend the existing in-house OPC UA server to add (a) a new equipment namespace with single session per equipment via native protocols translated to OPC UA (committed core drivers: OPC UA Client, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, plus Galaxy carried forward), and (b) clustering (non-transparent redundancy, 2-node per site) on top of the existing per-node deployment. **Driver stability tiers:** Tier A in-process (Modbus, OPC UA Client), Tier B in-process with guards (S7, AB CIP, AB Legacy, TwinCAT), Tier C out-of-process (Galaxy — bitness constraint, FOCAS — uncatchable AVE). **Protocol survey** across the estate — template in [`current-state/equipment-protocol-survey.md`](current-state/equipment-protocol-survey.md); target steps 13 done Q1 to validate committed driver list and feed initial UNS hierarchy snapshot. **Build ACL surface** (per-cluster `EquipmentAcl` table, Admin UI, OPC UA NodeManager enforcement) — required before tier-1 cutover. **Deploy OtOpcUa to every site** as fast as practical. **Begin tier 1 cutover (ScadaBridge)** at large sites. **Prerequisite: certificate-distribution** to consumer trust stores before each cutover. **Aveva System Platform IO pattern validation** — Year 1 or early Year 2 research to confirm Aveva supports upstream OPC UA data sources, well ahead of Year 3 tier 3. _TBD — survey owner; first-cutover site selection; cutover plan owner (OtOpcUa team or integration team); enterprise shortname for UNS hierarchy root._ | **Complete tier 1 (ScadaBridge)** across all sites. **Begin tier 2 (Ignition)** — Ignition consumers redirected from direct-equipment OPC UA to each site's OtOpcUa, collapsing WAN session counts from *N per equipment* to *one per site*. **Build long-tail drivers** on demand as sites require them. Resolve Warsaw per-building multi-cluster consumer-addressing pattern (consumer-side stitching vs site-aggregator OtOpcUa instance). _TBD — per-site tier-2 rollout sequence._ | **Complete tier 2 (Ignition)** across all sites. **Execute tier 3 (Aveva System Platform IO)** with compliance stakeholder validation — the hardest cutover because System Platform IO feeds validated data collection. Reach steady state: every equipment session is held by OtOpcUa, every downstream consumer reads OT data through it. _TBD — per-equipment-class criteria for System Platform IO re-validation._ |
| **Redpanda EventHub** | Stand up central Redpanda cluster in South Bend (single-cluster HA). Stand up bundled Schema Registry. Wire SASL/OAUTHBEARER to enterprise IdP. Create initial topic set (prefix-based ACLs). Hook up observability minimum signal set. Define the three retention tiers (`operational`/`analytics`/`compliance`). **Stand up the central `schemas` repo** with `buf` CI, CODEOWNERS, and the NuGet publishing pipeline. **Publish the canonical equipment/production/event model v1** — including the canonical machine state vocabulary (`Running / Idle / Faulted / Starved / Blocked` + any agreed additions) as a Protobuf enum, the `equipment.state.transitioned` event schema, and initial equipment-class definitions for pilot equipment. This is the foundation for Digital Twin Use Cases 1 and 3 (see `goal-state.md` → Strategic Considerations → Digital twin) and is load-bearing for pillar 2. **Pilot equipment class for canonical definition: FANUC CNC** (pre-defined FOCAS2 hierarchy already exists in OtOpcUa v2 driver design). Land the FANUC CNC class template in the schemas repo before Tier 1 cutover begins. _TBD — sizing decisions, initial topic list, canonical vocabulary ownership (domain SME group)._ | Expand topic coverage as additional domains onboard. Enforce tiered retention and ACLs at scale. Prove backlog replay after a WAN-outage drill (also exercises the Digital Twin Use Case 2 simulation-lite replay path). Exercise long-outage planning (ScadaBridge queue capacity vs. outage duration). Iterate the canonical model as additional equipment classes and domains onboard. _TBD — concrete drill cadence._ | Steady-state operation. Harden alerting and runbooks against the observed failure modes from Years 12. Canonical model is mature and covers every in-scope equipment class; schema changes are routine rather than foundational. |
| **SnowBridge** | Design and begin custom build in .NET. **Filtered, governed upload to Snowflake is the Year 1 purpose** — the service is the component that decides which topics/tags flow to Snowflake, applies the governed selection model, and writes into Snowflake. Ship an initial version with **one working source adapter** — starting with **Aveva Historian (SQL interface)** because it's central-only, exists today, and lets the workstream progress in parallel with Redpanda rather than waiting on it. First end-to-end **filtered** flow to Snowflake landing tables on a handful of priority tags. Selection model in place even if the operator UI isn't yet (config-driven is acceptable for Year 1). _TBD — team, credential management, datastore for selection state._ | Add the **ScadaBridge/Redpanda source adapter** alongside Historian. Build and ship the operator **web UI + API** on top of the Year 1 selection model, including the blast-radius-based approval workflow, audit trail, RBAC, and exportable state. Onboard priority tags per domain under the UI-driven governance path. _TBD — UI framework._ | All planned source adapters live behind the unified interface. Approval workflow tuned based on Year 2 operational experience. Feature freeze; focus on hardening. |
| **Snowflake dbt Transform Layer** | Scaffold a dbt project in git, wired to the self-hosted orchestrator (per `goal-state.md`; specific orchestrator chosen outside this plan). Build first **landing → curated** model for priority tags. **Align curated views with the canonical model v1** published in the `schemas` repo — equipment, production, and event entities in the curated layer use the canonical state vocabulary and the same event-type enum values, so downstream consumers (Power BI, ad-hoc analysts, future AI/ML) see the same shape of data Redpanda publishes. This is the dbt-side delivery for Digital Twin Use Cases 1 and 3. Establish `dbt test` discipline from day one — including tests that catch divergence between curated views and the canonical enums. _TBD — project layout (single vs per-domain); reconciliation rule if derived state in curated views disagrees with the layer-3 derivation (should not happen, but the rule needs to exist)._ | Build curated layers for all in-scope domains. **Ship a canonical-state-based OEE model** as a strong candidate for the pillar-2 "not possible before" use case — accurate cross-equipment, cross-site OEE computed once in dbt from the canonical state stream, rather than re-derived in every reporting surface. Source-freshness SLAs tied to the **≤15-minute analytics** budget. Begin development of the first **"not possible before" AI/analytics use case** (pillar 2). | The "not possible before" use case is **in production**, consuming the curated layer, meeting its own SLO. Pillar 2 check passes. |
| **ScadaBridge Extensions** | Implement **deadband / exception-based publishing** with the global-default model (+ override mechanism). Add **EventHub producer** capability with per-call **store-and-forward** to Redpanda. Verify co-located footprint doesn't degrade System Platform. _TBD — global deadband value, override mechanism location._ | Roll deadband + EventHub producer to **all currently-integrated sites**. Tune deadband and overrides based on observed Snowflake cost. Support early legacy-retirement work with outbound Web API / DB write patterns as needed. | Steady state. Any remaining Extensions work is residual cleanup or support for the tail end of Site Onboarding / Legacy Retirement. |