Integrate OtOpcUa v2 implementation corrections into plan
19 corrections from handoffs/otopcua-corrections-2026-04-17.md:

Inaccuracies fixed:
- A1: OPC UA-native equipment requires OpcUaClient gateway driver (~hours config), not "no driver build"
- A2: "single endpoint" is per-node (non-transparent redundancy), not per-cluster; no VIP planned

Missing constraints added:
- B1: ACL surface (EquipmentAcl table, Admin UI, NodeManager enforcement) as Year 1 deliverable before Tier 1 cutover
- B2: schemas-repo creation on OtOpcUa critical path with FANUC CNC pilot
- B3: Certificate distribution as pre-cutover step (per-node ApplicationUri trust-pinning)

Architectural decisions incorporated:
- C1: 8 committed core drivers (added TwinCAT/Beckhoff, split AB Legacy)
- C2: Three-tier driver stability model (A/B/C with out-of-process for Galaxy and FOCAS)
- C3: Polly v8+ resilience with default-no-retry on writes
- C4: Multi-identifier equipment model (5 IDs: UUID, EquipmentId, MachineCode, ZTag, SAPID)
- C5: Consumer cutover plan needs an owner (flagged)
- C6: Per-building cluster implications at Warsaw clarified

TBDs resolved:
- D1: Pilot equipment class = FANUC CNC
- D2: Schemas repo format = JSON Schema (.json), Protobuf derived
- D3: ACL definitions in central config DB alongside driver/topology
- D4: Enterprise shortname still unresolved (flagged as pre-cutover blocker)

New TBDs added:
- E1: UUID generation authority (OtOpcUa vs external system)
- E2: Aveva System Platform IO pattern validation (Year 1/2 research)
- E3: Site-wide vs per-cluster consumer addressing at Warsaw
- E4: Cluster endpoint wording (resolved via A2)
@@ -85,10 +85,10 @@ A protocol is **long-tail** by default if none of the above apply. Long-tail dri
| **Sites** | _TBD_ |
| **Approx. Instance Count** | _TBD_ |
| **Current Access Path** | Mixed — direct OPC UA sessions from Aveva System Platform, Ignition, and/or ScadaBridge depending on the equipment. See [`../current-state.md`](../current-state.md) → Equipment OPC UA. |
| **OtOpcUa Driver Needed?** | **Uses the `OpcUaClient` gateway driver** from the core library — no new protocol-specific driver project required, but per-equipment configuration (endpoint URL, browse strategy, namespace remapping to UNS, certificate trust, security policy) is still real work. **Onboarding effort per OPC UA-native equipment is ~hours of config, not zero.** |
| **Driver Complexity (Estimate)** | Low — config work, not protocol work. But non-trivial in aggregate if the OPC UA-native fleet is large. |
| **Priority Site(s)** | N/A |
| **Notes** | Will be the **lowest per-unit effort** equipment class to bring onto OtOpcUa — the `OpcUaClient` gateway driver handles them, but each equipment still needs endpoint config, browse strategy, namespace remapping, and certificate trust. Expected to be a meaningful fraction of the estate given the "OPC UA-first" posture of most equipment vendors over the last decade. Survey should **still** capture this category because the count informs (a) how much of the tier-1 ScadaBridge cutover is "configure an existing driver" vs "build a new one" and (b) aggregate config effort. |
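The "~hours of config, not zero" claim can be made concrete with a sketch of what one OPC UA-native equipment entry has to specify before the `OpcUaClient` gateway driver can front it. This is an illustrative Python shape, assuming hypothetical field names; the real configuration schema is not defined in this document.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of per-equipment config for the OpcUaClient gateway
# driver. Field names are illustrative, not the actual OtOpcUa schema;
# the *items* (endpoint, security policy, cert trust, browse strategy,
# namespace remap) are the ones the plan lists as real per-equipment work.
@dataclass
class OpcUaEquipmentConfig:
    endpoint_url: str             # device endpoint, e.g. opc.tcp://host:4840
    security_policy: str          # e.g. "Basic256Sha256"
    trusted_cert_thumbprint: str  # trust-pinning to the device certificate
    browse_strategy: str          # "full" | "subtree" | "pinned-node-ids"
    namespace_remap: dict = field(default_factory=dict)  # device ns -> UNS path

cfg = OpcUaEquipmentConfig(
    endpoint_url="opc.tcp://192.0.2.10:4840",
    security_policy="Basic256Sha256",
    trusted_cert_thumbprint="AB12CD34...",
    browse_strategy="subtree",
    namespace_remap={"http://vendor.example/ua": "site/area/line/cell/equipment"},
)
```

Each of these values has to be discovered, entered, and verified per equipment, which is where the per-unit hours go even though no new driver code is written.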
### EQP-002 — Siemens PLC family (S7)
@@ -124,6 +124,40 @@ A protocol is **long-tail** by default if none of the above apply. Long-tail dri
| **Priority Site(s)** | _TBD_ |
| **Notes** | Paired with EQP-002 (Siemens) as the two most likely dominant PLC protocol families. Confirm scope during discovery. |

### EQP-003a — Allen-Bradley Legacy (SLC 500 / MicroLogix — PCCC protocol)

| Field | Value |
|---|---|
| **ID** | EQP-003a |
| **Equipment Class** | Allen-Bradley / Rockwell SLC 500, MicroLogix families (legacy — separate from ControlLogix/CompactLogix CIP) |
| **Vendor(s)** | Rockwell Automation |
| **Native Protocol** | PCCC (Programmable Controller Communication Commands) — different protocol stack from EtherNet/IP CIP, despite same vendor |
| **Protocol Variant / Notes** | Data table addressing (integer, float, binary, timer, counter files), not tag-based like CIP. Shared library (libplctag) with EQP-003, but separate driver project due to different protocol semantics. |
| **Sites** | _TBD_ |
| **Approx. Instance Count** | _TBD_ |
| **Current Access Path** | _TBD — likely System Platform PCCC/DH+ IO driver_ |
| **OtOpcUa Driver Needed?** | **Core candidate (pre-committed in v2 design).** The v2 OtOpcUa design commits AB Legacy as a separate driver from AB CIP. Marginal cost is low (shared libplctag library) but it has its own stability tier (B), test coverage, and Admin UI surface. |
| **Driver Complexity (Estimate)** | Low-to-Medium — shares libplctag with EQP-003; the added work is the data-table addressing model and validation against legacy hardware. |
| **Priority Site(s)** | _TBD_ |
| **Notes** | Split from EQP-003 based on v2 OtOpcUa implementation findings — CIP and PCCC are different enough to warrant separate driver instances. Confirm via survey whether SLC/MicroLogix equipment still exists in meaningful numbers or has been replaced by ControlLogix. |
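The data-table addressing model that justifies the separate driver can be illustrated with a small parser. The address grammar below (file-type letter, file number, element, optional bit) is the classic SLC/MicroLogix convention, e.g. `N7:0` or `B3:0/5`; this sketch is illustrative only and does not claim to match libplctag's actual tag-string syntax.

```python
import re

# Illustrative parser for PCCC data-table addresses: file-type letter
# (N=integer, F=float, B=binary, T=timer, C=counter, I=input), file
# number, element, optional bit. CIP-style named tags need none of this.
PCCC_ADDR = re.compile(
    r"^(?P<type>[NFBTCI])(?P<file>\d+):(?P<elem>\d+)(?:/(?P<bit>\d+))?$"
)

def parse_pccc(addr: str) -> dict:
    m = PCCC_ADDR.match(addr)
    if not m:
        raise ValueError(f"not a PCCC data-table address: {addr}")
    d = m.groupdict()
    return {
        "file_type": d["type"],
        "file_num": int(d["file"]),
        "element": int(d["elem"]),
        "bit": int(d["bit"]) if d["bit"] else None,
    }

parse_pccc("N7:0")    # integer file 7, element 0
parse_pccc("B3:0/5")  # binary file 3, element 0, bit 5
```

The point of the sketch: the driver must map this file/element/bit model onto OPC UA nodes, which is the "added work" the complexity estimate refers to, even with the protocol transport shared via libplctag.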
### EQP-003b — Beckhoff TwinCAT (ADS protocol)

| Field | Value |
|---|---|
| **ID** | EQP-003b |
| **Equipment Class** | Beckhoff TwinCAT PLCs and motion controllers |
| **Vendor(s)** | Beckhoff Automation |
| **Native Protocol** | TwinCAT ADS (Automation Device Specification) — proprietary Beckhoff protocol with native subscription support |
| **Protocol Variant / Notes** | ADS supports native subscriptions (no polling needed), which makes it more efficient than poll-based protocols for high-frequency data. ADS runs over TCP. |
| **Sites** | _TBD — known Beckhoff installations at specific sites per OtOpcUa team's internal knowledge_ |
| **Approx. Instance Count** | _TBD_ |
| **Current Access Path** | _TBD — likely direct ADS or via a Beckhoff OPC UA server_ |
| **OtOpcUa Driver Needed?** | **Core candidate (pre-committed in v2 design).** Not in the original plan's expected categories — added based on known Beckhoff installations at specific sites, confirmed by the OtOpcUa team ahead of the formal protocol survey. |
| **Driver Complexity (Estimate)** | Medium — ADS protocol is well-documented by Beckhoff; native subscriptions simplify the subscription model. Tier B stability (wrapped native, mature). |
| **Priority Site(s)** | _TBD — the sites with known Beckhoff installations should be confirmed during the protocol survey_ |
| **Notes** | **Pre-survey addition.** This category was not in the original expected list (EQP-001 through EQP-006). Added based on v2 OtOpcUa implementation design findings confirming known Beckhoff installations. The formal protocol survey should validate whether the committed core driver is justified by the install base. |

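Why native subscriptions matter for the driver model: a poll-based protocol forces the driver to read on a timer even when nothing changed, while an ADS-style device notification pushes only on change. The toy sketch below simulates that behavior; the `DeviceConnection` API is hypothetical and stands in for an ADS notification registration, not Beckhoff's real ADS library.

```python
from typing import Callable

# Hypothetical connection object simulating ADS-style change notifications:
# the "device" side pushes a callback only when a symbol's value changes,
# so an unchanged value produces no traffic (unlike a poll loop).
class DeviceConnection:
    def __init__(self):
        self._handlers: dict[str, Callable[[object], None]] = {}
        self._values: dict[str, object] = {}

    def add_device_notification(self, symbol: str,
                                on_change: Callable[[object], None]) -> None:
        self._handlers[symbol] = on_change

    def _device_writes(self, symbol: str, value) -> None:
        # simulates the device side updating a symbol
        if self._values.get(symbol) != value:
            self._values[symbol] = value
            if symbol in self._handlers:
                self._handlers[symbol](value)

received = []
conn = DeviceConnection()
conn.add_device_notification("MAIN.temperature", received.append)
conn._device_writes("MAIN.temperature", 21.5)
conn._device_writes("MAIN.temperature", 21.5)  # unchanged: no notification
conn._device_writes("MAIN.temperature", 22.0)
# received now holds [21.5, 22.0]: two notifications for three writes
```

This is the efficiency argument the table row makes for high-frequency data: notification volume tracks change rate, not poll rate.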
### EQP-004 — Generic Modbus devices

| Field | Value |
@@ -185,11 +219,13 @@ These views are **derived** from the row-level data above. Regenerate as rows ar

| Native Protocol | Row IDs | Total Approx. Instances | Sites | Core / Long-tail / Already OPC UA |
|---|---|---|---|---|
| OPC UA | EQP-001 | _TBD_ | _TBD_ | Uses `OpcUaClient` gateway driver (core library) — config work per equipment, not protocol work |
| Siemens S7 | EQP-002 | _TBD_ | _TBD_ | **Pre-committed core** (v2 design) — validate with survey |
| EtherNet/IP (CIP) | EQP-003 | _TBD_ | _TBD_ | **Pre-committed core** (v2 design) |
| PCCC (AB Legacy) | EQP-003a | _TBD_ | _TBD_ | **Pre-committed core** (v2 design) — validate SLC/MicroLogix fleet still exists |
| TwinCAT ADS (Beckhoff) | EQP-003b | _TBD_ | _TBD_ | **Pre-committed core** (v2 design) — based on known installations, validate with survey |
| Modbus TCP/RTU | EQP-004 | _TBD_ | _TBD_ | **Pre-committed core** (v2 design) |
| Fanuc FOCAS | EQP-005 | _TBD_ | _TBD_ | **Pre-committed core** (v2 design) — **pilot equipment class for canonical model** |
| Long-tail mix | EQP-006+ | _TBD_ | _TBD_ | Long-tail — on-demand |

**Decision output of this table:** the **core driver library scope** for Year 1 OtOpcUa. A protocol row tagged `Core` becomes a Year 1 build commitment; `Long-tail` becomes a Year 2+ on-demand build budget; `Already OPC UA` becomes connection configuration work only.
@@ -153,19 +153,28 @@ Identical conventions to the existing Redpanda topic naming — one vocabulary,

The path is the **navigation identifier**: it tells you where the equipment lives today and how to browse to it. But paths **can change** — a machine moves from one line to another, a building gets renumbered, a campus reorganizes. If the path were the only identifier, every rename would break historical queries, genealogy traces, and audit trails.

The plan commits to a **multi-identifier model** — the v2 implementation design surfaced that production usage requires more than UUID + path. Every equipment instance carries **five identifiers**, all exposed as OPC UA properties on the equipment node so external systems can resolve by their preferred identifier without a sidecar service:

| Identifier | Required? | Uniqueness | Mutable? | Purpose |
|---|---|---|---|---|
| **EquipmentUuid** | Yes | Fleet-wide | **Immutable** | Downstream events, canonical model joins, cross-system lineage. RFC 4122 UUIDv4 (random). Not derived from the path. |
| **EquipmentId** | Yes | Within cluster | **Immutable after publish** | Internal logical key for cross-generation config diffs. Not user-facing. |
| **MachineCode** | Yes | Within cluster | Mutable (rename tracked) | **Operator-facing colloquial name** (e.g., `machine_001`). This is what operators say on the radio and write in runbooks — UUID and path are not mnemonic enough for daily operations. Surfaced prominently in Admin UI alongside ZTag. |
| **ZTag** | Optional | Fleet-wide | Mutable (re-assigned by ERP) | **ERP equipment identifier.** Primary identifier for browsing in Admin UI per operational request. |
| **SAPID** | Optional | Fleet-wide | Mutable (re-assigned by SAP PM) | **SAP PM equipment identifier.** Required for maintenance system join. |

**Path** (the 5-level UNS hierarchy address) is a **sixth** identifier but is **not** stored on the equipment node as a flat property — it is the node's **browse path** by construction. Path can change (equipment moves, area renamed); UUID and EquipmentId cannot.

**Consumer rules of thumb:**

- Use **EquipmentUuid** for joins, lineage, genealogy, audit, and long-running historical queries.
- Use **MachineCode** for operator-facing UIs, runbooks, on-the-floor communications.
- Use **ZTag** when correlating with ERP/finance systems.
- Use **SAPID** when correlating with maintenance/PM systems.
- Use the **path** for dashboards, filters, browsing, human search, and anywhere a reader needs to know *where* the equipment is right now.

Canonical events on Redpanda carry `equipment_uuid` (stable) and `equipment_path` (current at event time) as fields. A dbt dimension table (`dim_equipment`) carries all five identifiers plus current and historical paths, and is the authoritative join point for analytical consumers. OtOpcUa's equipment namespace exposes all five as properties on each equipment node.

_TBD — **UUID generation authority** (E1 from v2 corrections): OtOpcUa Admin UI currently auto-generates UUIDv4 on equipment creation. If ERP or SAP PM systems take authoritative ownership of equipment registration in the future, the UUID-generation policy should be configurable per cluster (generate locally vs. look up from external system). For now, OtOpcUa generates by default._

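The identifier model above can be sketched as a record plus the event-field rule, to make the "stable UUID, path current at event time" contract concrete. Field names and the `canonical_event_fields` helper are illustrative, not the real OtOpcUa or schema-repo API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

# Sketch of the five-identifier equipment model from the table above.
# Names are illustrative; mutability/uniqueness rules follow the table.
@dataclass
class Equipment:
    equipment_id: str      # immutable after publish, unique within cluster
    machine_code: str      # operator-facing colloquial name, e.g. "machine_001"
    equipment_uuid: str = field(
        default_factory=lambda: str(uuid.uuid4())  # UUIDv4, immutable, fleet-wide
    )
    ztag: Optional[str] = None   # ERP identifier, may be re-assigned by ERP
    sapid: Optional[str] = None  # SAP PM identifier, may be re-assigned by SAP PM

def canonical_event_fields(eq: Equipment, current_path: str) -> dict:
    # Canonical Redpanda events carry the stable UUID plus the path that is
    # current at event time; joins across renames go through the UUID.
    return {"equipment_uuid": eq.equipment_uuid, "equipment_path": current_path}

eq = Equipment(equipment_id="press-line-07", machine_code="machine_001")
evt = canonical_event_fields(eq, "warsaw/b2/line1/cell3/press07")
```

Note that path is an argument, not a field on the record, mirroring the plan's rule that path is the browse position by construction rather than a stored property.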
##### Where the authoritative hierarchy lives
@@ -173,7 +182,7 @@ A dbt dimension table (`dim_equipment`) carries both and is the authoritative jo

| Surface | What the surface carries | Authority relationship |
|---|---|---|
| `schemas` repo | Canonical hierarchy definition — the full tree with UUIDs, current paths, equipment-class assignments, and evolution history. Stored as **JSON Schema (.json files)** — idiomatic for .NET (System.Text.Json), CI-friendly (any runner can validate with `jq`), human-authorable, and merge-friendly in git. Protobuf is derived (code-generated) from the JSON Schema for wire serialization where needed (Redpanda events). One-way derivation: JSON Schema → Protobuf, not bidirectional. | **Authoritative.** Changes go through the same CODEOWNERS + `buf`-CI governance as other schema changes. |
| OtOpcUa equipment namespace | Browse-path structure matching the hierarchy; equipment nodes carry the UUID as a property. Built per-site from the relevant subtree (each site's OtOpcUa only exposes equipment at that site). | **Consumer.** Generated from the `schemas` repo definition at deploy/config time. Drift between OtOpcUa and `schemas` repo is a defect. |
| Redpanda canonical event payloads | Every event payload carries `equipment_uuid` (stable) and `equipment_path` (current at event time) as fields. Enables filtering without topic explosion. | **Consumer.** Protobuf schemas reference the hierarchy definition in the same `schemas` repo. |
| dbt curated layer in Snowflake | `dim_equipment` dimension table with UUID, current path, historical path versions, equipment class, site, area, line. Used as the join key by every analytical consumer. | **Consumer.** Populated by a dbt model that reads from an upstream reference table synced from the `schemas` repo — not hand-maintained in Snowflake. |

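A minimal sketch of what one hierarchy entry in the `schemas` repo might look like as a `.json` file, with a hand-rolled required-field check standing in for full JSON Schema validation in CI. The field names, the example UUID, and the `history` shape are assumptions for illustration; the real schema is owned by the `schemas` repo.

```python
import json

# Illustrative (not authoritative) shape of one hierarchy entry as it
# might live in the schemas repo as a .json file.
entry_json = """
{
  "equipment_uuid": "3f8a2c1e-9b7d-4e26-8c55-0a1b2c3d4e5f",
  "current_path": "warsaw/b2/line1/cell3/press07",
  "equipment_class": "EQP-003",
  "history": [{"path": "warsaw/b1/line4/cell1/press07", "until": "2026-01-10"}]
}
"""

REQUIRED = {"equipment_uuid", "current_path", "equipment_class"}

def check_entry(raw: str) -> dict:
    # Stand-in for the CI validation step (a real pipeline would run a
    # JSON Schema validator or jq against the committed schema).
    entry = json.loads(raw)
    missing = REQUIRED - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return entry

entry = check_entry(entry_json)
```

Keeping history inline with the entry is one way the "evolution history" requirement could be met in a merge-friendly git file; the authoritative layout is for the repo owners to decide.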
@@ -388,7 +397,7 @@ _TBD — service name (working title only); hosting (South Bend, alongside Redpa

### OtOpcUa — the unified site-level OPC UA layer (absorbs LmxOpcUa)

**OtOpcUa** is a per-site **clustered OPC UA server** that is the **single sanctioned OPC UA access point for all OT data at each site**. It owns the one connection to each piece of equipment and exposes a unified OPC UA surface — containing **two logical namespaces** — to every downstream consumer (System Platform, Ignition, ScadaBridge, future consumers). In a 2-node cluster, **each node has its own endpoint** (unique `ApplicationUri` per OPC UA spec); both nodes expose identical address spaces. Consumers see both endpoints in `ServerUriArray` and select by `ServiceLevel` — this is **non-transparent redundancy** (the v1 LmxOpcUa pattern, inherited by v2). Transparent single-endpoint redundancy would require a VIP / load balancer per cluster and is not planned.

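The non-transparent redundancy pattern puts endpoint selection on the client. A minimal sketch of that selection logic, assuming the client has already read each node's `ServiceLevel` (an OPC UA byte value, 0 to 255) and using 200 as the "healthy" threshold, which is the common convention; the endpoint URLs and threshold are illustrative.

```python
# Non-transparent redundancy sketch: the consumer knows both node
# endpoints and connects to the one reporting the highest ServiceLevel.
def pick_endpoint(endpoints: dict) -> str:
    """endpoints maps endpoint URL -> last-read ServiceLevel (0-255)."""
    url, level = max(endpoints.items(), key=lambda kv: kv[1])
    if level < 200:  # assumed "healthy" floor; below it, connect but alert
        print(f"warning: best available ServiceLevel is only {level}")
    return url

best = pick_endpoint({
    "opc.tcp://otopcua-node-a:4840": 255,  # active node
    "opc.tcp://otopcua-node-b:4840": 120,  # degraded / standby
})
# best is the node-a endpoint
```

This logic lives in every consumer (or its OPC UA client library), which is exactly the cost the plan accepts by not deploying a VIP per cluster.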
**The two namespaces served by OtOpcUa:**

1. **Equipment namespace (raw data).** Live values read from equipment via native OPC UA or native device protocols (Modbus, Ethernet/IP, Siemens S7, etc.) translated to OPC UA. This is the new capability the plan introduces — what the "layer 2 — raw data" role in the layered architecture describes.
@@ -401,7 +410,7 @@ _TBD — service name (working title only); hosting (South Bend, alongside Redpa

**Responsibilities.**

- **Single connection per equipment.** OtOpcUa is the **only** OPC UA client that talks to equipment directly. Equipment holds one session — to OtOpcUa — regardless of how many downstream consumers need its data. This eliminates the multiple-direct-sessions problem documented in `current-state.md` → Equipment OPC UA.
- **Site-local aggregation.** Downstream consumers (System Platform IO, Ignition, ScadaBridge, and any future consumers such as a prospective digital twin layer) connect to OtOpcUa rather than to equipment directly. A consumer reading the same tag gets the same value regardless of who else is subscribed.
- **Unified OPC UA endpoint for OT data (per node).** Each cluster node exposes both raw equipment data and processed System Platform data through a **single OPC UA endpoint with two namespaces**, instead of consumers connecting to two separate OPC UA servers. In a 2-node cluster, consumers connect to one of the two node endpoints (selected by `ServiceLevel`); each node serves identical namespaces.
- **Access control / authorization chokepoint.** Authentication, authorization, rate limiting, and audit of OT OPC UA reads/writes are enforced at OtOpcUa, not at each consumer. This is the site-level analogue of the "single sanctioned crossing point" theme the plan applies between IT and OT.
- **Clustered for HA.** Runs as a cluster (multi-node), not a single server, so a node loss does not drop equipment or System Platform visibility.

@@ -422,24 +431,34 @@ _TBD — service name (working title only); hosting (South Bend, alongside Redpa
|
|||||||
**Build-vs-buy: custom build, in-house.** The site-level OPC UA server cluster is built in-house rather than adopting Kepware, Matrikon, Aveva Communication Drivers, or any other off-the-shelf OPC UA aggregator.
|
**Build-vs-buy: custom build, in-house.** The site-level OPC UA server cluster is built in-house rather than adopting Kepware, Matrikon, Aveva Communication Drivers, or any other off-the-shelf OPC UA aggregator.
|
||||||
- **Rationale:** matches the existing in-house .NET pattern for ScadaBridge and SnowBridge (and continues the in-house .NET pattern LmxOpcUa already established — which OtOpcUa is the evolution of); full control over clustering semantics, access model, and integration with ScadaBridge's operational surface; no per-site commercial license; no vendor roadmap risk for a component this central to the OT plan.
|
- **Rationale:** matches the existing in-house .NET pattern for ScadaBridge and SnowBridge (and continues the in-house .NET pattern LmxOpcUa already established — which OtOpcUa is the evolution of); full control over clustering semantics, access model, and integration with ScadaBridge's operational surface; no per-site commercial license; no vendor roadmap risk for a component this central to the OT plan.
|
||||||
- **Primary cost driver acknowledged upfront: equipment driver coverage.** Unlike ScadaBridge (which reads OPC UA and doesn't have to speak native device protocols) or the SnowBridge (which reads from a small set of well-defined sources), this component has to **speak the actual protocols of every piece of equipment it fronts**. Commercial aggregators like Kepware justify their license cost largely through their driver library, and picking custom build means that library has to be built or sourced internally over the lifetime of the plan. This is the real cost of option (a), and it is accepted as the trade-off for control.
|
- **Primary cost driver acknowledged upfront: equipment driver coverage.** Unlike ScadaBridge (which reads OPC UA and doesn't have to speak native device protocols) or the SnowBridge (which reads from a small set of well-defined sources), this component has to **speak the actual protocols of every piece of equipment it fronts**. Commercial aggregators like Kepware justify their license cost largely through their driver library, and picking custom build means that library has to be built or sourced internally over the lifetime of the plan. This is the real cost of option (a), and it is accepted as the trade-off for control.
|
||||||
- **Mitigation:** where equipment already speaks native OPC UA, no driver work is required — the cluster simply proxies the OPC UA session. The driver-build effort is scoped to equipment that exposes non-OPC-UA protocols (Modbus, Ethernet/IP, Siemens S7, proprietary serial, etc.).
|
- **Mitigation:** equipment that already speaks native OPC UA does not require a *new protocol-specific driver project* — the `OpcUaClient` gateway driver in the core library handles all of them. However, per-equipment configuration (endpoint URL, browse strategy, namespace remapping to UNS, certificate trust, security policy) is still required. **Onboarding effort per OPC UA-native equipment is ~hours of config, not zero** — the "no driver build" framing from earlier versions of this plan understated the work.
|
||||||
- **Driver strategy: hybrid — proactive core library plus on-demand long-tail.** A **core driver library** covering the top equipment protocols for the estate is built **proactively** (Year 1 into Year 2), so that most site onboardings can draw from existing drivers rather than blocking on driver work. Protocols beyond the core library — long-tail equipment specific to one site or one equipment class — are built **on-demand** as each site onboards.
|
- **Driver strategy: hybrid — proactive core library plus on-demand long-tail.** A **core driver library** covering the top equipment protocols for the estate is built **proactively** (Year 1 into Year 2), so that most site onboardings can draw from existing drivers rather than blocking on driver work. Protocols beyond the core library — long-tail equipment specific to one site or one equipment class — are built **on-demand** as each site onboards.
|
||||||
- **Why hybrid:** purely lazy (on-demand only) makes site onboarding unpredictable and bumpy; purely proactive risks building drivers for protocols nobody actually uses. The hybrid matches the reality of a mixed equipment estate over a 3-year horizon.
|
- **Why hybrid:** purely lazy (on-demand only) makes site onboarding unpredictable and bumpy; purely proactive risks building drivers for protocols nobody actually uses. The hybrid matches the reality of a mixed equipment estate over a 3-year horizon.
|
||||||
- **Core library scope** is driven by the equipment-protocol inventory, not by guessing — the top protocols are whichever ones are actually most common in the estate once surveyed.
|
- **Core library scope** is driven by the equipment-protocol inventory, not by guessing — the top protocols are whichever ones are actually most common in the estate once surveyed.
|
||||||
- **Implementation approach (not committed, one possible tactic):** embedded open-source protocol stacks (e.g., NModbus for Modbus, Sharp7 for Siemens S7, libplctag for Ethernet/IP) wrapped in the cluster's driver framework, rather than from-scratch protocol implementations. This reduces driver work to "write the OPC UA ↔ protocol adapter" rather than "implement the protocol." The build team may pick this or a cleaner-room approach per driver; this plan does not commit to a specific library choice.
|
- **Committed core driver list (from v2 implementation design, confirmed by team's internal knowledge of the estate ahead of the formal protocol survey):** (1) **OPC UA Client** — gateway driver for OPC UA-native equipment; (2) **Modbus TCP** (also covers DL205 via octal address translation); (3) **AB CIP** (ControlLogix / CompactLogix); (4) **AB Legacy** (SLC 500 / MicroLogix, PCCC — separate driver from CIP due to different protocol stack); (5) **Siemens S7** (S7-300/400/1200/1500); (6) **TwinCAT** (Beckhoff ADS, native subscriptions — known Beckhoff installations at specific sites); (7) **FOCAS** (FANUC CNC). Plus the existing **Galaxy** driver (System Platform namespace, carried forward from LmxOpcUa v1). Total: 8 drivers. The formal protocol survey (`current-state/equipment-protocol-survey.md`) will validate this list and may add or deprioritize entries.
- **Implementation approach:** embedded open-source protocol stacks (NModbus for Modbus, Sharp7 for Siemens S7, libplctag for Ethernet/IP and AB Legacy) wrapped in OtOpcUa's driver framework. Reduces driver work to "write the OPC UA ↔ protocol adapter" rather than "implement the protocol."
- _TBD — formal protocol survey to validate the committed list; how long-tail driver requests are prioritized vs site-onboarding deadlines._
- **Driver stability tiers (v2 implementation design decision — process isolation model).** Not all drivers have equal stability profiles. The v2 design introduces a three-tier model that determines whether a driver runs in-process or out-of-process:
- **Tier A (pure managed .NET):** Modbus TCP, OPC UA Client. Run **in-process** in the OtOpcUa server. Low risk — no native code, no COM interop.
- **Tier B (wrapped native, mature libraries):** S7, AB CIP, AB Legacy, TwinCAT. Run **in-process** with additional guards — SafeHandle wrappers, memory watchdog, bounded queues. The native libraries are mature and well-tested, so process isolation is not required, but guardrails contain resource leaks.
- **Tier C (heavy native / COM / black-box vendor DLL):** Galaxy, FOCAS. Run **out-of-process** as separate Windows services with named-pipe IPC. An `AccessViolationException` from native code (e.g., FANUC's `Fwlib64.dll`) is uncatchable in modern .NET and would tear down the entire OtOpcUa server — process isolation contains the blast radius. Galaxy additionally requires out-of-process because MXAccess COM is .NET Framework 4.8 x86 (bitness constraint forces a separate process).
- **Operational footprint impact:** a site with Tier C drivers runs **1 to 3 Windows services per cluster node** (OtOpcUa main + Galaxy host + FOCAS host). With 2-node clusters, that's up to 6 services per cluster. Deployment guides and operational runbooks must cover the multi-service install/upgrade/recycle workflow.
- **Reusable pattern:** Tier C drivers follow a generalized `Proxy/Host/Shared` three-project layout reusable for any future driver that needs process isolation.
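One concrete piece of the `Proxy` side's job is deciding when to restart a crashed Tier C host process versus declaring the driver failed. A small flap-detection sketch, in Python for brevity; the thresholds and the policy shape are illustrative assumptions, not the committed design:

```python
def restart_delay(crash_times: list[float], now: float, *,
                  base: float = 1.0, cap: float = 60.0,
                  window: float = 300.0, max_crashes: int = 5):
    """Decide how long the Tier C proxy should wait before restarting a
    crashed host process (e.g. the FOCAS host). Returns None when the host
    has crashed too often inside the window, meaning the proxy should stop
    restarting, mark the driver failed, and raise an alarm instead.
    All threshold values here are illustrative."""
    recent = [t for t in crash_times if now - t <= window]
    if len(recent) >= max_crashes:
        return None  # flapping: containment, not another blind restart
    # Exponential backoff: 1s after the first crash, doubling up to the cap.
    return min(cap, base * 2 ** max(0, len(recent) - 1))
```

The point of the `None` branch is that process isolation contains the blast radius but does not excuse endless restart loops; a flapping native host should surface in the status dashboard rather than silently cycle.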
- **Per-device resilience: Polly v8+ (`Microsoft.Extensions.Resilience`).** Every driver instance and every device within a driver uses composable **retry, circuit-breaker, and timeout pipelines** via Polly v8+. Circuit-breaker state is surfaced in the status dashboard. **Write safety: default-no-retry on writes.** Timeouts on equipment writes can fire after the device has already accepted the command; blind retry of non-idempotent operations (pulses, alarm acks, recipe steps) would cause duplicate field actions. Per-tag `WriteIdempotent` flag with explicit opt-in required for write retry. This is a substantive safety decision that affects operator training and tag onboarding runbooks.
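The write-safety rule can be made concrete with a small sketch. The real implementation is Polly v8 resilience pipelines in .NET; this Python shim only illustrates the gating logic, where reads may retry on timeout but writes get exactly one attempt unless the tag is explicitly flagged `WriteIdempotent`:

```python
import random
import time

def execute_with_retry(op, *, is_write: bool, write_idempotent: bool = False,
                       attempts: int = 3, base_delay: float = 0.1):
    """Apply the default-no-retry-on-writes rule. `op` stands in for the
    driver call; the function shape is illustrative, the real pipelines
    are Polly v8 in .NET."""
    allowed = 1 if (is_write and not write_idempotent) else attempts
    for attempt in range(1, allowed + 1):
        try:
            return op()
        except TimeoutError:
            if attempt == allowed:
                raise  # surface it; never blindly re-send a pulse or alarm ack
            # jittered backoff before the next read/idempotent-write attempt
            time.sleep(base_delay * attempt + random.uniform(0, 0.05))
```

A timed-out non-idempotent write therefore propagates as a failure for an operator or upstream logic to handle, rather than risking a duplicate field action.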
- **Not used:** Kepware, Matrikon, Aveva Communication Drivers, HiveMQ Edge, and other off-the-shelf options. Reference products may still be useful for comparison on specific capabilities (clustering patterns, security features, driver implementations) even though they are not the target implementation.
**Deployment footprint per site: co-located on existing Aveva System Platform nodes.** Same pattern as ScadaBridge today — the site-level OPC UA server cluster runs on the **same physical/virtual nodes** that host Aveva System Platform and ScadaBridge, not on dedicated hardware.
- **Rationale:** zero new hardware footprint, consistent operational model across the in-house .NET OT components (ScadaBridge and OtOpcUa — with OtOpcUa running on the same nodes LmxOpcUa already runs on today), and the driver workload at a typical site is modest compared to ScadaBridge's 225k/sec OPC UA ingestion ceiling that these nodes already handle. Co-location keeps site infrastructure simple as smaller sites onboard.
- **Cluster size:** same as ScadaBridge — **2-node clusters at most sites**, with the understanding that the **largest sites** (Warsaw West, Warsaw North) run **one cluster per production building** matching ScadaBridge's and System Platform's existing per-building cluster pattern. No special hardware or quorum model is required beyond what ScadaBridge already uses.
- **Per-building cluster implication for consumers needing site-wide visibility.** At Warsaw campuses, a consumer (e.g., a ScadaBridge instance or a reporting tool) that needs to see equipment across multiple buildings must connect to **N clusters** (one per building) and stitch the data. OtOpcUa's "site-local aggregation" responsibility is satisfied per-cluster, not site-wide. Two mitigations exist: (a) configure consumer-side templates to enumerate per-building clusters — current expected pattern, adds consumer-side complexity; (b) deploy a **site-aggregator OtOpcUa instance** that consumes from per-building clusters via the OPC UA Client gateway driver and re-exposes a unified site namespace — doable with existing toolset (the OpcUaClient driver was designed for gateway-of-gateways), but adds operational complexity. _TBD — whether the per-building cluster decision is a constraint to optimize for or whether a single Warsaw-West cluster is feasible if hardware allows; whether ZTag fleet-wide uniqueness extends across per-building clusters at the same site (assumed yes, confirm with ERP team)._
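Mitigation (a) amounts to consumer-side enumeration of per-building clusters. A sketch of what that template resolution looks like; all site, building, and endpoint names below are invented for illustration:

```python
# Illustrative topology only; real names come from the central config DB.
SITE_CLUSTERS = {
    "warsaw-west": {
        "bldg-1": "opc.tcp://ww-b1:4840",
        "bldg-2": "opc.tcp://ww-b2:4840",
        "bldg-3": "opc.tcp://ww-b3:4840",
    },
    "small-site": {"main": "opc.tcp://ss:4840"},
}

def endpoints_for(site: str, buildings=None) -> list[str]:
    """Endpoints a consumer must open OPC UA sessions to. Site-wide
    visibility at a per-building-cluster campus means one session per
    building; a single-building consumer gets exactly one endpoint."""
    clusters = SITE_CLUSTERS[site]
    return [clusters[b] for b in (buildings or sorted(clusters))]
```

The stitching cost shows up in the list length: a small site resolves to one endpoint, while a Warsaw-campus consumer needing site-wide visibility must hold and merge N sessions.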
- **Trade-off accepted:** System Platform, ScadaBridge, and OtOpcUa all share the same nodes' CPU, memory, and network. Resource contention is a risk — mitigated by (1) the modest driver workload relative to ScadaBridge's proven ingestion ceiling, (2) monitoring via the observability minimum signal set, and (3) the option to move off-node if contention is observed during rollout. Note: the OtOpcUa workload largely replaces what LmxOpcUa already runs on these nodes, so the *incremental* resource draw is just the new equipment-driver and clustering work, not a full new service. _TBD — measured impact of adding this workload to already-shared nodes; headroom numbers at the largest sites; whether any specific site needs to escalate to dedicated hardware._
**Authorization model: OPC UA-native — user tokens for authentication + namespace-level ACLs for authorization.** Every downstream consumer (System Platform IO, Ignition, ScadaBridge, future consumers) authenticates to the cluster using **standard OPC UA user tokens** (UserName tokens and/or X.509 client certs, per site/consumer policy), and authorization is enforced via **namespace-level ACLs** inside the cluster — each authenticated identity is scoped to the equipment/namespaces it is permitted to read/write.
- **Rationale:** OPC UA is the protocol we're fronting, so the auth model stays in OPC UA's own terms. No SASL/OAUTHBEARER bridging, no custom token-exchange glue — OtOpcUa is self-contained and operable with standard OPC UA client tooling. **Inherits the LmxOpcUa auth pattern** — UserName tokens with standard OPC UA security modes/profiles — so the consumer-side experience does not change for clients that used LmxOpcUa previously, and the fold-in is an evolution rather than a rewrite.
- **Explicitly not federated with the enterprise IdP.** Unlike Redpanda (which uses SASL/OAUTHBEARER against the enterprise IdP) and SnowBridge (which uses the same IdP for RBAC), OtOpcUa does **not** pull enterprise IdP identity into the OT data access path. OT data access is a pure OT concern, and the plan's IT/OT boundary stays at ScadaBridge central — not here.
- **Trade-off accepted:** identity lifecycle (user token/cert provisioning, rotation, revocation) is managed locally in the OT estate rather than inherited from the enterprise IdP. Two identity stores to operate (enterprise IdP for IT-facing components, OPC UA-native identities for OtOpcUa) is the cost of keeping the OPC UA layer clean and self-contained.
- **ACL implementation (Year 1 deliverable — required before Tier 1 cutover).** The v2 implementation design surfaced that namespace-level ACLs are not yet modeled. The plan commits to: a per-cluster `EquipmentAcl` table (or equivalent) in the **central configuration database** mapping LDAP-group → permitted Namespace + UnsArea / UnsLine / Equipment subtree + permission level (Read / Write / AlarmAck). ACLs support four granularity levels with inheritance: Namespace → UnsArea → UnsLine → Equipment (grant at UnsArea cascades to all children unless overridden). ACLs are edited through the **Admin UI**, go through the same draft → diff → publish flow as driver/topology config, and are **generation-versioned** for auditability and rollback. The OPC UA NodeManager checks the ACL on every browse / read / write / subscribe against the connected user's LDAP group claims. **This is a substantial missing surface area that must be built before Tier 1 ScadaBridge cutover**, since the "access control / authorization chokepoint" responsibility is the plan's core promise at this layer.
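The four-level inheritance rule can be sketched as a lookup that walks from the most specific grant to the least specific, with child grants overriding parent grants. Only the Namespace → UnsArea → UnsLine → Equipment walk comes from the plan; the table shape, LDAP group names, and the tie-break ordering of permission levels are illustrative assumptions:

```python
ACL = {
    # (namespace, uns_area, uns_line, equipment) -> {ldap_group: permission}
    ("Equipment", None, None, None):                {"ot-readers": "Read"},
    ("Equipment", "Packaging", None, None):         {"pkg-techs": "Write"},
    ("Equipment", "Packaging", "Line2", "Capper1"): {"pkg-techs": "Read"},  # override
}

LEVELS = ["Read", "Write", "AlarmAck"]  # assumed ordering, weakest first

def effective_permission(groups, namespace, area, line, equipment):
    """Most specific grant wins: walk Equipment -> UnsLine -> UnsArea ->
    Namespace and stop at the first level where any of the user's LDAP
    groups has a grant. No grant at any level means deny."""
    for key in [(namespace, area, line, equipment),
                (namespace, area, line, None),
                (namespace, area, None, None),
                (namespace, None, None, None)]:
        hits = [p for g, p in ACL.get(key, {}).items() if g in groups]
        if hits:
            return max(hits, key=LEVELS.index)  # strongest grant at this level
    return None
```

This is the check the NodeManager would run on every browse / read / write / subscribe; the generation-versioned table in the central config DB would replace the in-memory dictionary.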
- _TBD — specific OPC UA security mode + profile combinations required vs allowed; where UserName credentials/certs are sourced from (local site directory, a per-site credential vault, AD/LDAP); rotation cadence; audit trail of authz decisions._
**Open questions (TBD).**
- **Driver coverage.** Which equipment protocols need to be bridged to OPC UA beyond native OPC UA equipment — this is where product-driven decisions matter most.
1. **ScadaBridge first.** We own both the server and the client, so redirecting ScadaBridge to consume from the new cluster is the lowest-risk cutover. It also validates the cluster under real ingestion load before Ignition or System Platform are affected.
2. **Ignition second.** Moving Ignition off direct equipment OPC UA to the site-level cluster collapses its WAN session count per equipment from *N* to *one per site cluster*. Medium risk — Ignition is the KPI SCADA and a cutover mistake is user-visible, but Ignition has no validated-data obligations.
3. **Aveva System Platform IO last.** System Platform IO is the hardest cutover because its IO path feeds validated data collection. Moving it through the new cluster needs validation/re-qualification with compliance stakeholders, and System Platform is the most opinionated consumer about how its IO is sourced. Doing it last lets us accumulate operational confidence on the cluster from the ScadaBridge and Ignition cutovers first.
- **Certificate-distribution pre-cutover step (B3 from v2 corrections).** Before any consumer is cut over at a site, that consumer's OPC UA certificate trust store must be populated with the target OtOpcUa cluster's **per-node certificates and ApplicationUris** (2 per cluster; at Warsaw campuses with per-building clusters, multiply by building count if the consumer needs cross-building visibility). Consumers without pre-loaded trust will fail to connect. **Once a consumer has trusted a node's `ApplicationUri`, changing that `ApplicationUri` requires the consumer to re-establish trust** — this is an OPC UA spec constraint, not an implementation choice. OtOpcUa's Admin UI auto-suggests `urn:{Host}:OtOpcUa` on node creation but warns if `Host` changes later.
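A sketch of the consumer-side trust check this pre-cutover step enables. The `urn:{Host}:OtOpcUa` convention is the Admin UI auto-suggestion described above; the function shape and messages are invented for illustration:

```python
def check_pinned_uri(presented_uri: str, pinned_uris: set, host: str):
    """Accept a server only if the ApplicationUri from its certificate is
    already in the pre-loaded trust store. Returns (accepted, note)."""
    expected = "urn:%s:OtOpcUa" % host
    if presented_uri in pinned_uris:
        note = "" if presented_uri == expected else (
            "pinned URI no longer matches urn:{Host}:OtOpcUa for this host; "
            "a Host change would force consumers to re-establish trust")
        return True, note
    # OPC UA spec behavior: an unknown ApplicationUri is an unknown server.
    return False, "untrusted ApplicationUri; pre-load the trust store before cutover"
```

The warning branch mirrors the Admin UI behavior: renaming a node's `Host` after consumers have pinned its `ApplicationUri` silently breaks trust until every consumer re-pins.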
- **Acceptable double-connection windows.** During each consumer's cutover, a short window of **both old direct connection and new cluster connection** existing at the same time for the same equipment is **tolerated** — it temporarily aggravates the session-load problem the cluster is meant to solve, but keeping the window short (minutes to hours, not days) bounds the exposure. Longer parallel windows are only acceptable for the System Platform cutover where compliance validation may require extended dual-run.
- **Rollback posture.** Each consumer's cutover is reversible — if the cluster misbehaves during or immediately after a cutover, the consumer falls back to direct equipment OPC UA, and the cutover is retried after the issue is understood. The old direct-connection capability is **not removed** from consumers until all three cutover tiers are complete and stable at a site.
- **Consumer cutover plan needs an owner.** The v2 OtOpcUa implementation design covers building the server, drivers, config, and Admin UI (Phases 0–5) but does **not** address consumer cutover planning. The following are unaddressed and need ownership: per-site cutover sequencing, per-equipment validation methodology (proving consumers see equivalent data through OtOpcUa), rollback procedures, coordination with Aveva for System Platform IO cutover, operational runbooks for consumer connection failures. **Either** the OtOpcUa team adds cutover phases (6/7/8) to the v2 design, **or** an integration / operations team owns the cutover plan separately — in which case this section should name them and link the doc.
- _TBD — per-site cutover sequencing across the three tiers (all sites reach tier 1 before any reaches tier 2, or one site completes all three tiers before the next site starts), and per-equipment-class criteria for when a System Platform IO cutover requires compliance re-validation; cutover plan owner assignment._
- **Validated-data implication (E2 — Aveva pattern validation needed Year 1 or early Year 2).** System Platform's validated data collection currently uses its own IO path; moving that through OtOpcUa may require validation/re-qualification depending on the regulated context. **Year 1 or early Year 2 research deliverable:** validate with Aveva that System Platform IO drivers support upstream OPC UA-server data sources (OtOpcUa), including any restrictions on security mode, namespace structure, or session model. If Aveva's pattern requires something OtOpcUa doesn't expose, that's a long-lead-time discovery that must surface well before Year 3's Tier 3 cutover.
- **Relationship to ScadaBridge's 225k/sec ingestion ceiling** (per `current-state.md`): the cluster's aggregate throughput must be able to feed ScadaBridge at its capacity without becoming a bottleneck — sizing needs to reflect this.
### Async Event Backbone
This subsection makes that declaration. It is the plan's answer to **Digital Twin Use Cases 1 and 3** (see **Strategic Considerations → Digital twin**) and — independent of digital twin framing — is load-bearing for pillar 2 (analytics/AI enablement) because a canonical model is what makes "not possible before" cross-domain analytics possible at all.
> **Schemas-repo dependency is on the OtOpcUa critical path (B2 from v2 corrections).** The `schemas` repo does not exist yet. Until it does, OtOpcUa equipment configurations are hand-curated per-equipment with no class templates, no auto-generated tag lists, no cross-cluster consistency checks, and no signal-validation contract for Layer 3 state derivation. The plan commits to **schemas-repo creation as a Year 1 deliverable** (its own scope, distinct from the OtOpcUa workstream) with a **pilot equipment class (FANUC CNC)** landed in the repo before Tier 1 cutover begins. The Equipment Protocol Survey output feeds both: (a) OtOpcUa core driver scope, and (b) the initial schemas-repo equipment-class list.
> **Unified Namespace framing:** this canonical model is also the plan's **Unified Namespace** (UNS) — see **Target IT/OT Integration → Unified Namespace (UNS) posture**. The UNS posture is a higher-level framing of the same mechanics described here: this section specifies the canonical model mechanically; the UNS posture explains what stakeholders asking about UNS should understand about how the plan delivers the UNS value proposition without an MQTT/Sparkplug broker.
##### The three surfaces
- **Real-time state event frequency guarantees.** Latency and delivery semantics for `equipment.state.transitioned` events are governed by the general Redpanda latency profile and the `analytics`-tier retention (30 days); this subsection does not add a per-event SLO beyond what pillar 2's `≤15-minute analytics` commitment already provides.
- **Predictive state prediction.** Forecasting whether an equipment instance is "about to fault" is a pillar-2 AI/ML use case on top of the canonical stream, not a canonical-model deliverable. The canonical model just has to be clean enough to train on.
**Resolved:** equipment-class definitions in the `schemas` repo use **JSON Schema (.json files)** as the authoring format, with Protobuf code-generated for wire serialization (see UNS naming hierarchy → Where the authoritative hierarchy lives). **Pilot equipment class: FANUC CNC** — the v2 OtOpcUa FOCAS driver design already specifies a fixed pre-defined node hierarchy (Identity, Status, Axes, Spindle, Program, Tool, Alarms, PMC, Macro) populated by specific FOCAS2 API calls, which is essentially a class template already; the schemas repo formalizes it. FANUC CNC is the natural pilot because: single vendor with well-known API surface, finite equipment count per site, clean failure-mode boundary (Tier C out-of-process driver), and no greenfield template invention required. The pilot should land in the schemas repo before Tier 1 cutover begins, to validate the template-consumer contract end-to-end.
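As a feel for what the pilot class template might look like in the schemas repo, a minimal JSON Schema sketch follows. Only the nine hierarchy group names come from the v2 FOCAS driver design; the `$id` path, property types, and `required` choices are illustrative assumptions:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.invalid/schemas/equipment-classes/fanuc-cnc.json",
  "title": "FANUC CNC equipment class (pilot) - sketch",
  "type": "object",
  "properties": {
    "Identity": { "type": "object" },
    "Status":   { "type": "object" },
    "Axes":     { "type": "array", "items": { "type": "object" } },
    "Spindle":  { "type": "object" },
    "Program":  { "type": "object" },
    "Tool":     { "type": "object" },
    "Alarms":   { "type": "array", "items": { "type": "object" } },
    "PMC":      { "type": "object" },
    "Macro":    { "type": "object" }
  },
  "required": ["Identity", "Status"]
}
```

The Protobuf wire types would be code-generated from files like this, keeping the JSON Schema documents as the single authoring surface.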
_TBD — ownership of the canonical state vocabulary (likely a domain SME group rather than the ScadaBridge team); reconciliation process if System Platform and Ignition derivations of the same equipment's state diverge (they should not, but the canonical surface needs a tiebreak rule)._
## Observability
| Workstream | **Year 1 — Foundation** | **Year 2 — Scale** | **Year 3 — Completion** |
|---|---|---|---|
| **OtOpcUa** | **Evolve LmxOpcUa into OtOpcUa** — extend the existing in-house OPC UA server to add (a) a new equipment namespace with single session per equipment via native protocols translated to OPC UA (committed core drivers: OPC UA Client, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, plus Galaxy carried forward), and (b) clustering (non-transparent redundancy, 2-node per site) on top of the existing per-node deployment. **Driver stability tiers:** Tier A in-process (Modbus, OPC UA Client), Tier B in-process with guards (S7, AB CIP, AB Legacy, TwinCAT), Tier C out-of-process (Galaxy — bitness constraint, FOCAS — uncatchable AccessViolationException). **Protocol survey** across the estate — template in [`current-state/equipment-protocol-survey.md`](current-state/equipment-protocol-survey.md); target steps 1–3 done Q1 to validate committed driver list and feed initial UNS hierarchy snapshot. **Build ACL surface** (per-cluster `EquipmentAcl` table, Admin UI, OPC UA NodeManager enforcement) — required before tier-1 cutover. **Deploy OtOpcUa to every site** as fast as practical. **Begin tier 1 cutover (ScadaBridge)** at large sites. **Prerequisite: certificate-distribution** to consumer trust stores before each cutover. **Aveva System Platform IO pattern validation** — Year 1 or early Year 2 research to confirm Aveva supports upstream OPC UA data sources, well ahead of Year 3 tier 3. _TBD — survey owner; first-cutover site selection; cutover plan owner (OtOpcUa team or integration team); enterprise shortname for UNS hierarchy root._ | **Complete tier 1 (ScadaBridge)** across all sites. **Begin tier 2 (Ignition)** — Ignition consumers redirected from direct-equipment OPC UA to each site's OtOpcUa, collapsing WAN session counts from *N per equipment* to *one per site*. **Build long-tail drivers** on demand as sites require them. Resolve Warsaw per-building multi-cluster consumer-addressing pattern (consumer-side stitching vs site-aggregator OtOpcUa instance). _TBD — per-site tier-2 rollout sequence._ | **Complete tier 2 (Ignition)** across all sites. **Execute tier 3 (Aveva System Platform IO)** with compliance stakeholder validation — the hardest cutover because System Platform IO feeds validated data collection. Reach steady state: every equipment session is held by OtOpcUa, every downstream consumer reads OT data through it. _TBD — per-equipment-class criteria for System Platform IO re-validation._ |
| **Redpanda EventHub** | Stand up central Redpanda cluster in South Bend (single-cluster HA). Stand up bundled Schema Registry. Wire SASL/OAUTHBEARER to enterprise IdP. Create initial topic set (prefix-based ACLs). Hook up observability minimum signal set. Define the three retention tiers (`operational`/`analytics`/`compliance`). **Stand up the central `schemas` repo** with `buf` CI, CODEOWNERS, and the NuGet publishing pipeline. **Publish the canonical equipment/production/event model v1** — including the canonical machine state vocabulary (`Running / Idle / Faulted / Starved / Blocked` + any agreed additions) as a Protobuf enum, the `equipment.state.transitioned` event schema, and initial equipment-class definitions for pilot equipment. This is the foundation for Digital Twin Use Cases 1 and 3 (see `goal-state.md` → Strategic Considerations → Digital twin) and is load-bearing for pillar 2. **Pilot equipment class for canonical definition: FANUC CNC** (pre-defined FOCAS2 hierarchy already exists in OtOpcUa v2 driver design). Land the FANUC CNC class template in the schemas repo before Tier 1 cutover begins. _TBD — sizing decisions, initial topic list, canonical vocabulary ownership (domain SME group)._ | Expand topic coverage as additional domains onboard. Enforce tiered retention and ACLs at scale. Prove backlog replay after a WAN-outage drill (also exercises the Digital Twin Use Case 2 simulation-lite replay path). Exercise long-outage planning (ScadaBridge queue capacity vs. outage duration). Iterate the canonical model as additional equipment classes and domains onboard. _TBD — concrete drill cadence._ | Steady-state operation. Harden alerting and runbooks against the observed failure modes from Years 1–2. Canonical model is mature and covers every in-scope equipment class; schema changes are routine rather than foundational. |
| **SnowBridge** | Design and begin custom build in .NET. **Filtered, governed upload to Snowflake is the Year 1 purpose** — the service is the component that decides which topics/tags flow to Snowflake, applies the governed selection model, and writes into Snowflake. Ship an initial version with **one working source adapter** — starting with **Aveva Historian (SQL interface)** because it's central-only, exists today, and lets the workstream progress in parallel with Redpanda rather than waiting on it. First end-to-end **filtered** flow to Snowflake landing tables on a handful of priority tags. Selection model in place even if the operator UI isn't yet (config-driven is acceptable for Year 1). _TBD — team, credential management, datastore for selection state._ | Add the **ScadaBridge/Redpanda source adapter** alongside Historian. Build and ship the operator **web UI + API** on top of the Year 1 selection model, including the blast-radius-based approval workflow, audit trail, RBAC, and exportable state. Onboard priority tags per domain under the UI-driven governance path. _TBD — UI framework._ | All planned source adapters live behind the unified interface. Approval workflow tuned based on Year 2 operational experience. Feature freeze; focus on hardening. |
| **Snowflake dbt Transform Layer** | Scaffold a dbt project in git, wired to the self-hosted orchestrator (per `goal-state.md`; specific orchestrator chosen outside this plan). Build first **landing → curated** model for priority tags. **Align curated views with the canonical model v1** published in the `schemas` repo — equipment, production, and event entities in the curated layer use the canonical state vocabulary and the same event-type enum values, so downstream consumers (Power BI, ad-hoc analysts, future AI/ML) see the same shape of data Redpanda publishes. This is the dbt-side delivery for Digital Twin Use Cases 1 and 3. Establish `dbt test` discipline from day one — including tests that catch divergence between curated views and the canonical enums. _TBD — project layout (single vs per-domain); reconciliation rule if derived state in curated views disagrees with the layer-3 derivation (should not happen, but the rule needs to exist)._ | Build curated layers for all in-scope domains. **Ship a canonical-state-based OEE model** as a strong candidate for the pillar-2 "not possible before" use case — accurate cross-equipment, cross-site OEE computed once in dbt from the canonical state stream, rather than re-derived in every reporting surface. Source-freshness SLAs tied to the **≤15-minute analytics** budget. Begin development of the first **"not possible before" AI/analytics use case** (pillar 2). | The "not possible before" use case is **in production**, consuming the curated layer, meeting its own SLO. Pillar 2 check passes. |
| **ScadaBridge Extensions** | Implement **deadband / exception-based publishing** with the global-default model (+ override mechanism). Add **EventHub producer** capability with per-call **store-and-forward** to Redpanda. Verify co-located footprint doesn't degrade System Platform. _TBD — global deadband value, override mechanism location._ | Roll deadband + EventHub producer to **all currently-integrated sites**. Tune deadband and overrides based on observed Snowflake cost. Support early legacy-retirement work with outbound Web API / DB write patterns as needed. | Steady state. Any remaining Extensions work is residual cleanup or support for the tail end of Site Onboarding / Legacy Retirement. |
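The Redpanda EventHub row calls for the canonical machine state vocabulary to be published as a Protobuf enum alongside an `equipment.state.transitioned` event schema. A minimal sketch of what that could look like in the `schemas` repo — the package name, field names, and timestamp representation are illustrative assumptions, not the committed v1:

```protobuf
// Illustrative sketch only — package and field names are assumptions,
// not the published canonical model v1.
syntax = "proto3";

package canonical.equipment.v1;

import "google/protobuf/timestamp.proto";

// Canonical machine state vocabulary
// (Running / Idle / Faulted / Starved / Blocked + any agreed additions).
enum MachineState {
  MACHINE_STATE_UNSPECIFIED = 0;  // proto3 enums require a zero value
  MACHINE_STATE_RUNNING = 1;
  MACHINE_STATE_IDLE = 2;
  MACHINE_STATE_FAULTED = 3;
  MACHINE_STATE_STARVED = 4;
  MACHINE_STATE_BLOCKED = 5;
}

// Payload for the equipment.state.transitioned event.
message EquipmentStateTransitioned {
  string equipment_id = 1;                     // stable equipment identifier
  MachineState previous_state = 2;
  MachineState new_state = 3;
  google.protobuf.Timestamp occurred_at = 4;   // source timestamp at the equipment
}
```

Because the same enum values feed both Redpanda consumers and the dbt curated layer, adding a state to the vocabulary is a schema change in one place rather than a per-consumer edit.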
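The OtOpcUa row names a per-cluster `EquipmentAcl` table, an Admin UI over it, and NodeManager enforcement as a Year 1 deliverable. The real schema is part of that design work; a hypothetical shape, purely to make the deliverable concrete — every column name here is an assumption:

```sql
-- Hypothetical shape only — the actual per-cluster EquipmentAcl schema is a
-- Year 1 design deliverable; all column names here are illustrative assumptions.
CREATE TABLE EquipmentAcl (
    Id           INTEGER PRIMARY KEY,
    PrincipalId  TEXT    NOT NULL,              -- consumer identity the rule applies to
    EquipmentId  TEXT    NOT NULL,              -- equipment node the rule governs
    CanBrowse    BOOLEAN NOT NULL DEFAULT FALSE,
    CanRead      BOOLEAN NOT NULL DEFAULT FALSE,
    CanWrite     BOOLEAN NOT NULL DEFAULT FALSE, -- writes denied unless granted
    UNIQUE (PrincipalId, EquipmentId)
);
```

In this sketch the OPC UA NodeManager would consult the table at browse/read/write time and the Admin UI would be the only writer, keeping enforcement and administration on the same per-cluster source of truth.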
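The dbt row requires tests that catch divergence between curated views and the canonical enums. One common way to express that in dbt is a generic `accepted_values` test in a schema file — the model and column names below are illustrative assumptions, only the state values come from the plan:

```yaml
# Illustrative dbt schema test — model/column names are assumptions.
version: 2

models:
  - name: curated_equipment_state
    columns:
      - name: machine_state
        tests:
          - not_null
          # Fails `dbt test` if any curated row carries a state outside the
          # canonical vocabulary published in the schemas repo.
          - accepted_values:
              values: ['Running', 'Idle', 'Faulted', 'Starved', 'Blocked']
```

When the canonical vocabulary gains a state, this list must be updated in the same change — which is exactly the divergence-catching discipline the row asks for.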