diff --git a/handoffs/otopcua-handoff.md b/handoffs/otopcua-handoff.md
new file mode 100644
index 0000000..e073798
--- /dev/null
+++ b/handoffs/otopcua-handoff.md
@@ -0,0 +1,375 @@

# OtOpcUa — Implementation Handoff

**Extracted:** 2026-04-17
**Source plan:** [`../goal-state.md`](../goal-state.md), [`../current-state.md`](../current-state.md), [`../roadmap.md`](../roadmap.md), [`../current-state/equipment-protocol-survey.md`](../current-state/equipment-protocol-survey.md)
**Repo for existing codebase:** [lmxopcua](https://gitea.dohertylan.com/dohertj2/lmxopcua) (see [`../links.md`](../links.md))

> **This is a point-in-time extract, not a living document.** The authoritative plan content lives in the source files above. If anything here conflicts with the source files, the source files win.
>
> **Corrections from the implementation agent are expected and welcome.** If the implementation work surfaces inaccuracies, missing constraints, or architectural decisions that need revisiting, send corrections back for integration into the plan. Format: describe what's wrong, what you found, and what the plan should say instead. Corrections will be reviewed and folded into the authoritative plan files — they do not get applied to this handoff document (which is a snapshot, not the source of truth).

---

## What OtOpcUa Is

**OtOpcUa** is a per-site **clustered OPC UA server** that is the **single sanctioned OPC UA access point for all OT data at each site**. It owns the one connection to each piece of equipment and exposes a unified OPC UA surface to every downstream consumer (Aveva System Platform, Ignition, ScadaBridge, future consumers).

It is **not** a new component from scratch — it is the **evolution of the existing LmxOpcUa** codebase. LmxOpcUa is absorbed into OtOpcUa, not replaced by a separate component.
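The core idea — one upstream session per equipment, shared by every downstream consumer — can be sketched in a few lines. This is an illustrative Python sketch, not the .NET implementation; `Equipment`, `OtOpcUaAggregator`, and the tag names are invented for the example:

```python
import time
from typing import Dict, Tuple


class Equipment:
    """Stand-in for a device that tolerates very few concurrent sessions."""

    def __init__(self, max_sessions: int = 1) -> None:
        self.max_sessions = max_sessions
        self.open_sessions = 0

    def connect(self) -> None:
        if self.open_sessions >= self.max_sessions:
            raise RuntimeError("device session limit exceeded")
        self.open_sessions += 1

    def read(self, tag: str) -> float:
        return 42.0  # placeholder process value


class OtOpcUaAggregator:
    """One upstream session per equipment; consumers share one cached sample."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Equipment] = {}
        # (equipment, tag) -> (value, timestamp)
        self._cache: Dict[Tuple[str, str], Tuple[float, float]] = {}

    def attach(self, name: str, device: Equipment) -> None:
        device.connect()  # the ONLY session this device will ever hold
        self._sessions[name] = device

    def poll(self, name: str, tag: str) -> None:
        value = self._sessions[name].read(tag)
        self._cache[(name, tag)] = (value, time.time())

    def read(self, name: str, tag: str) -> float:
        # Downstream consumers never touch the device directly.
        return self._cache[(name, tag)][0]


# One device, one session — three consumers read identical values.
plc = Equipment(max_sessions=1)
agg = OtOpcUaAggregator()
agg.attach("cnc-mill-05", plc)
agg.poll("cnc-mill-05", "spindle-speed")
readings = [agg.read("cnc-mill-05", "spindle-speed") for _ in range(3)]
```

The point of the sketch is the invariant, not the mechanics: the device's session count stays at one no matter how many consumers read, and every consumer sees the same sample.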

### Where it sits in the architecture

```
Layer 1   Equipment (PLCs, controllers, instruments)
             ↕
Layer 2   OtOpcUa   ← THIS COMPONENT
             ↕
Layer 3   SCADA (Aveva System Platform + Ignition)
             ↕
Layer 4   ScadaBridge (sole IT↔OT crossing point)
          ─── IT/OT Boundary ───
          Enterprise IT
```

OtOpcUa lives entirely on the **OT side**. It does not change where the IT↔OT crossing sits (that's ScadaBridge central). It is OT-data-facing, site-local, and fronts OT consumers.

---

## What Exists Today (LmxOpcUa — the starting point)

**Repo:** [lmxopcua](https://gitea.dohertylan.com/dohertj2/lmxopcua)

- **What:** in-house OPC UA server with tight integration to Aveva System Platform.
- **Role:** exposes System Platform data/objects via OPC UA, enabling OPC UA clients (including ScadaBridge and third parties) to consume System Platform data natively.
- **Deployment:** built and deployed to **every Aveva System Platform node** — primary cluster in South Bend and every site-level application server cluster. Each System Platform node runs its own local instance.
- **Namespace source:** each instance interfaces with its **local Application Platform's LMX API**. The OPC UA address space reflects the System Platform objects reachable through that node's LMX API — namespace is per-node and scoped to whatever the local App Server surfaces.
- **Security model:** standard OPC UA security — `None` / `Sign` / `SignAndEncrypt` modes, `Basic256Sha256` and related profiles, **UserName token** authentication for clients. No bespoke auth scheme.
- **Technology:** .NET (in-house pattern shared with ScadaBridge).
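The security-model bullet above is mechanical enough to express as an endpoint acceptance check. A minimal sketch, assuming the policy table described — `endpoint_accepts` and the table are invented for illustration, the "related profiles" are trimmed to `Basic256Sha256` alone, and the real server's negotiation is richer than this:

```python
# Hypothetical endpoint policy table mirroring the security model described
# above; the actual LmxOpcUa/OtOpcUa configuration may differ.
ALLOWED_MODES = {"None", "Sign", "SignAndEncrypt"}
ALLOWED_POLICIES = {"Basic256Sha256"}   # "and related profiles" — trimmed for the sketch
ALLOWED_TOKEN_TYPES = {"UserName"}


def endpoint_accepts(mode: str, policy: str, token_type: str) -> bool:
    """Return True if a client's requested security tuple is serveable."""
    if mode == "None":
        # An unsecured endpoint still authenticates the user token.
        return token_type in ALLOWED_TOKEN_TYPES
    return (
        mode in ALLOWED_MODES
        and policy in ALLOWED_POLICIES
        and token_type in ALLOWED_TOKEN_TYPES
    )
```

A table-driven check like this keeps the "which mode/profile combinations are required vs allowed" TBD (see Open Questions) a configuration decision rather than a code change.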

### Current equipment access problem that OtOpcUa solves

Today, equipment is connected to by **multiple systems directly**, concurrently:
- **Aveva System Platform** (for validated data collection via IO drivers)
- **Ignition SCADA** (for KPI data, central from South Bend over WAN)
- **ScadaBridge** (for bridge/integration workloads via Akka.NET OPC UA client)

Consequences:
- Multiple OPC UA sessions per equipment — strains devices with limited concurrent-session support
- No single access-control point — authorization is per-consumer, no site-level chokepoint
- Inconsistent data — same tag read by three consumers can produce three subtly different values (different sampling intervals, deadbands, session buffers)

**OtOpcUa eliminates all three problems** by collapsing to one session per equipment.

---

## Two Namespaces

OtOpcUa serves **two logical namespaces** through a single endpoint:

### 1. Equipment namespace (raw data) — NEW

Live values read from equipment via native OPC UA or native device protocols (Modbus, EtherNet/IP, Siemens S7, etc.) translated to OPC UA. This is the new capability — what the "Layer 2 — raw data" role describes.

Raw equipment data at this layer is exactly that — **raw** — no deadbanding, no aggregation, no business meaning. Business meaning is added at Layer 3 (System Platform / Ignition).

### 2. System Platform namespace (processed data tap) — EXISTING (from LmxOpcUa)

The former LmxOpcUa functionality, folded in. Exposes Aveva System Platform objects (via the local App Server's LMX API) as OPC UA so that OPC UA-native consumers can read processed data through the same endpoint they use for raw equipment data.

### Extensible namespace model

The two-namespace design is not a hard cap.
A future **`simulated` namespace** could expose synthetic or replayed equipment data to consumers, letting tier-1/tier-2 consumers (ScadaBridge, Ignition, System Platform IO) be exercised against real-shaped-but-offline data streams without physical equipment. **Architecturally supported, not committed for build** in the 3-year scope. Design the namespace system so adding a third namespace is a configuration change, not a structural refactor.

---

## Responsibilities

- **Single connection per equipment.** OtOpcUa is the **only** OPC UA client that talks to equipment directly. Equipment holds one session — to OtOpcUa — regardless of how many downstream consumers need its data.
- **Site-local aggregation.** Downstream consumers connect to OtOpcUa rather than to equipment directly. A consumer reading the same tag gets the same value regardless of who else is subscribed.
- **Unified OPC UA endpoint for OT data.** Clients read both raw equipment data and processed System Platform data from **one OPC UA endpoint** with two namespaces.
- **Access control / authorization chokepoint.** Authentication, authorization, rate limiting, and audit of OT OPC UA reads/writes are enforced at OtOpcUa, not at each consumer.
- **Clustered for HA.** Multi-node cluster — node loss does not drop equipment or System Platform visibility.

---

## Build vs Buy

**Decision: custom build, in-house.** Not Kepware, Matrikon, Aveva Communication Drivers, HiveMQ Edge, or any off-the-shelf OPC UA aggregator.

**Rationale:**
- Matches the existing in-house .NET pattern (ScadaBridge, SnowBridge, and LmxOpcUa itself)
- Full control over clustering semantics, access model, and integration with ScadaBridge's operational surface
- No per-site commercial license
- No vendor roadmap risk for a component this central

**Primary cost acknowledged:** equipment driver coverage. Commercial aggregators like Kepware justify their license cost through their driver library.
Picking custom build means that library has to be built in-house. See Driver Strategy below.

**Reference products** (Kepware, Matrikon, etc.) may still be useful for comparison on specific capabilities even though they're not the target.

---

## Driver Strategy: Hybrid — Proactive Core Library + On-Demand Long-Tail

### Core driver library (proactive, Year 1 → Year 2)

A core library covering the **top equipment protocols** for the estate, built proactively so that most site onboardings can draw from existing drivers rather than blocking on driver work.

**Core library scope is driven by the equipment protocol survey** — see below and [`../current-state/equipment-protocol-survey.md`](../current-state/equipment-protocol-survey.md). A protocol becomes "core" if it meets any of:
1. Present at 3+ sites
2. Combined instance count above ~25
3. Needed to onboard a Year 1 or Year 2 site
4. Strategic vendor whose equipment is expected to grow (judgment call)

### Long-tail drivers (on-demand, as sites onboard)

Protocols beyond the core library are built on-demand when the first site that needs the protocol reaches onboarding.

### Implementation approach (not committed, one possible tactic)

Embedded open-source protocol stacks wrapped in OtOpcUa's driver framework:
- **NModbus** for Modbus TCP/RTU
- **Sharp7** for Siemens S7
- **libplctag** for EtherNet/IP (Allen-Bradley)
- Other libraries as needed

This reduces driver work to "write the OPC UA ↔ protocol adapter" rather than "implement the protocol from scratch." The build team may pick this or a different approach per driver.

### Equipment where no driver is needed

Equipment that already speaks **native OPC UA** requires no driver build — OtOpcUa simply proxies the OPC UA session. The driver-build effort is scoped only to equipment exposing non-OPC-UA protocols.
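The "write the OPC UA ↔ protocol adapter" framing implies a thin driver seam that normalizes each embedded protocol stack to one read surface. A hedged Python sketch of what that seam could look like — `EquipmentDriver`, the fake Modbus client, and the register/scaling scheme are all invented for illustration; a real driver would wrap an NModbus/Sharp7/libplctag-equivalent stack:

```python
from abc import ABC, abstractmethod
from typing import Dict


class EquipmentDriver(ABC):
    """Common seam every protocol driver must implement."""

    @abstractmethod
    def read(self, address: str) -> float:
        """Read one native address and return an engineering value."""


class FakeModbusClient:
    """Stand-in for an embedded Modbus stack (an NModbus-style client)."""

    def __init__(self, registers: Dict[int, int]) -> None:
        self._registers = registers

    def read_holding_register(self, address: int) -> int:
        return self._registers[address]


class ModbusDriver(EquipmentDriver):
    """Adapter: OPC UA-facing tag reads -> Modbus register reads + scaling."""

    def __init__(self, client: FakeModbusClient, scale: float = 0.1) -> None:
        self._client = client
        self._scale = scale

    def read(self, address: str) -> float:
        register = int(address)                  # "40001"-style mapping, simplified
        raw = self._client.read_holding_register(register)
        return raw * self._scale                 # raw counts -> engineering units


driver: EquipmentDriver = ModbusDriver(FakeModbusClient({40001: 235}))
value = driver.read("40001")   # 23.5 after x0.1 scaling
```

The server core then only ever talks to `EquipmentDriver`; a long-tail protocol is "just" another subclass, which is what keeps on-demand driver builds incremental.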

---

## Equipment Protocol Survey (Year 1 prerequisite — not yet run)

The protocol survey determines the core driver library scope. **It has not been run yet.**

Template, schema, classification rule, rollup views, and a 6-step discovery approach are documented in [`../current-state/equipment-protocol-survey.md`](../current-state/equipment-protocol-survey.md).

**Pre-seeded expected categories** (placeholders, not confirmed):

| ID | Equipment class | Native protocol | Core candidate? |
|---|---|---|---|
| EQP-001 | OPC UA-native equipment | OPC UA | No driver needed |
| EQP-002 | Siemens S7 PLCs (S7-300/400/1200/1500) | Siemens S7 / OPC UA on newer models | Unknown — depends on S7-1500 vs older ratio |
| EQP-003 | Allen-Bradley / Rockwell PLCs | EtherNet/IP (CIP) | Likely core |
| EQP-004 | Generic Modbus devices | Modbus TCP / RTU | Likely core |
| EQP-005 | Fanuc CNC controllers | FOCAS (proprietary library) | Depends on CNC count |
| EQP-006 | Long-tail (everything else) | Various | On-demand |

**Dual mandate:** the same discovery walk also produces the initial **UNS naming hierarchy snapshot** at equipment-instance granularity (see UNS section below). Two outputs, one walk.

---

## Deployment Footprint

**Co-located on existing Aveva System Platform nodes.** Same pattern as ScadaBridge — no dedicated hardware.

- **Cluster size:** 2-node clusters at most sites. Largest sites (Warsaw West, Warsaw North) run one cluster per production building, matching ScadaBridge's and System Platform's existing per-building cluster pattern.
- **Rationale:** zero new hardware footprint; OtOpcUa largely replaces what LmxOpcUa already runs on these nodes, so the incremental resource draw is just the new equipment-driver and clustering work.
- **Trade-off accepted:** System Platform, ScadaBridge, and OtOpcUa all share nodes.
Resource contention mitigated by (1) modest driver workload relative to ScadaBridge's proven 225k/sec OPC UA ingestion ceiling, (2) monitoring via observability signals, (3) option to move off-node if contention is observed.

_TBD — measured impact of adding this workload; headroom numbers at largest sites; whether any site needs dedicated hardware._

---

## Authorization Model

**OPC UA-native — user tokens for authentication + namespace-level ACLs for authorization.**

- Every downstream consumer authenticates with **standard OPC UA user tokens** (UserName tokens and/or X.509 client certs, per site/consumer policy)
- Authorization enforced via **namespace-level ACLs** — each identity scoped to permitted equipment/namespaces
- **Inherits the LmxOpcUa auth pattern** — consumer-side experience does not change for clients that used LmxOpcUa previously

**Explicitly not federated with the enterprise IdP.** OT data access is a pure OT concern. The plan's IT/OT boundary stays at ScadaBridge central, not at OtOpcUa. Two identity stores to operate (enterprise IdP for IT-facing components, OPC UA-native identities for OtOpcUa) is the accepted trade-off.

_TBD — specific security mode + profile combinations required; credential source (local directory, per-site vault, AD/LDAP); rotation cadence; audit trail of authz decisions._

---

## Rollout Posture

### Deploy everywhere fast

The cluster software (server + core driver library) is built and rolled out to **every site's System Platform nodes as fast as practical** — deployment to all sites is treated as a **prerequisite**, not a gradual effort.

"Deployment" = installing and configuring so the node is ready to front equipment. It does **not** mean immediately migrating consumers. A deployed but inactive cluster is cheap.

### Tiered consumer cutover (sequenced by risk)

Existing direct equipment connections are moved to OtOpcUa **one consumer at a time**, in risk order:

| Tier | Consumer | Why this order | Timeline |
|---|---|---|---|
| 1 | **ScadaBridge** | We own both ends; lowest-risk cutover; validates cluster under real load | Year 1 (begin at large sites) → Year 2 (complete all sites) |
| 2 | **Ignition** | Reduces WAN OPC UA sessions from *N per equipment* to *one per site*; medium risk | Year 2 (begin) → Year 3 (complete) |
| 3 | **Aveva System Platform IO** | Hardest cutover — System Platform IO feeds validated data collection; needs compliance validation | Year 3 |

**Steady state at end of Year 3:** every equipment session is held by OtOpcUa; every downstream consumer reads OT data through it.

---

## UNS Naming Hierarchy (must implement in equipment namespace)

OtOpcUa's equipment namespace browse paths must implement the plan's **5-level UNS naming hierarchy**:

### Five levels, always present

| Level | Name | Example |
|---|---|---|
| 1 | Enterprise | `ent` *(placeholder — real shortname TBD)* |
| 2 | Site | `warsaw-west`, `shannon`, `south-bend` |
| 3 | Area | `bldg-3`, `_default` (placeholder at single-cluster sites) |
| 4 | Line | `line-2`, `assembly-a` |
| 5 | Equipment | `cnc-mill-05`, `injection-molder-02` |

**OPC UA browse path form:** `ent/warsaw-west/bldg-3/line-2/cnc-mill-05`
**Text form (for messages, dbt keys):** `ent.warsaw-west.bldg-3.line-2.cnc-mill-05`

Signals / tags are **children of equipment nodes** (level 6), not a separate path level.

### Naming rules

- `[a-z0-9-]` only. Lowercase enforced.
- Hyphens within a segment (`warsaw-west`), slashes between segments in OPC UA browse paths.
- Max 32 chars per segment, max 200 chars total path.
- `_default` is the only reserved segment name (placeholder for levels that don't apply).
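The naming rules above are mechanical enough to enforce in code. A hedged sketch of a validator and the browse-path → text-form conversion — function names are illustrative, not from the `schemas` repo, and it takes one reading of the rules: `_default` is permitted despite its underscore, since it is the one reserved segment name:

```python
import re

SEGMENT_RE = re.compile(r"^[a-z0-9-]+$")
LEVELS = 5           # enterprise / site / area / line / equipment
MAX_SEGMENT = 32
MAX_PATH = 200
RESERVED = {"_default"}


def validate_browse_path(path: str) -> bool:
    """Check a 5-level UNS browse path like 'ent/warsaw-west/bldg-3/line-2/cnc-mill-05'."""
    if len(path) > MAX_PATH:
        return False
    segments = path.split("/")
    if len(segments) != LEVELS:
        return False
    for seg in segments:
        if len(seg) > MAX_SEGMENT:
            return False
        # `_default` is the one sanctioned exception to the [a-z0-9-] charset.
        if seg not in RESERVED and not SEGMENT_RE.match(seg):
            return False
    return True


def to_text_form(browse_path: str) -> str:
    """Browse-path form (slashes) -> text form for messages and dbt keys (dots)."""
    return browse_path.replace("/", ".")
```

Running the same validator at config load would make "drift between browse paths and the `schemas` repo" a deploy-time failure rather than a discovered defect.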

### Stable equipment UUID

Every equipment node must expose a **stable UUIDv4** as a property:
- UUID is assigned once, never changes, never reused.
- Path can change (equipment moves, area renamed); UUID cannot.
- Canonical events downstream carry both UUID (for joins/lineage) and path (for dashboards/filtering).

### Authority

The hierarchy definition lives in the **central `schemas` repo** (not yet created). OtOpcUa is a **consumer** of the authoritative definition — it builds its per-site browse tree from the relevant subtree at deploy/config time. **Drift between OtOpcUa's browse paths and the `schemas` repo is a defect.**

---

## Canonical Model Integration

OtOpcUa's equipment namespace is one of **three surfaces** that expose the plan's canonical equipment / production / event model:

| Surface | Role |
|---|---|
| **OtOpcUa equipment namespace** | Canonical per-equipment OPC UA node structure. Equipment-class templates from the `schemas` repo define the node layout. |
| **Redpanda topics + Protobuf schemas** | Canonical event shape on the wire. Source of truth for the model lives in the `schemas` repo. |
| **dbt curated layer in Snowflake** | Canonical analytics model — same vocabulary, different access pattern. |

### Canonical machine state vocabulary

The plan commits to a canonical set of machine state values.
OtOpcUa does **not derive these states** (that's a Layer 3 responsibility — System Platform / Ignition), but OtOpcUa's equipment namespace should expose the raw signals that feed the derivation, and the System Platform namespace will expose the derived state values using this vocabulary:

| State | Semantics |
|---|---|
| `Running` | Actively producing at or near theoretical cycle time |
| `Idle` | Powered and available but not producing |
| `Faulted` | Fault raised, requires intervention |
| `Starved` | Ready but blocked by missing upstream input |
| `Blocked` | Ready but blocked by downstream constraint |

**Under consideration (TBD):** `Changeover`, `Maintenance`, `Setup` / `WarmingUp`.

State derivation lives at Layer 3 and is published as `equipment.state.transitioned` events on Redpanda. OtOpcUa's role is to deliver the raw signals cleanly so derivation can be accurate.

---

## Digital Twin Touchpoints

### Use case 1 — Standardized equipment state model

OtOpcUa delivers the raw signals that feed the canonical state derivation at Layer 3. Equipment-class templates in the `schemas` repo define which raw signals each equipment class exposes, standardized across the estate.

### Use case 2 — Virtual testing / simulation

OtOpcUa's namespace architecture can accommodate a future `simulated` namespace — replaying historical equipment data to exercise tier-1/tier-2 consumers without physical equipment. **Not committed for build**, but the namespace system should be designed so adding it is a configuration change.

### Use case 3 — Cross-system canonical model

OtOpcUa's equipment namespace IS the OT-side surface of the canonical model. Every consumer reading equipment data through OtOpcUa sees the same node structure, same naming, same data types, same units — regardless of the underlying equipment's native protocol.
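The "same node structure for every class member" idea can be sketched as template-driven instantiation. This is an illustrative Python sketch, not the real template format — the `schemas` repo does not exist yet, and the template fields, signal names, and UUIDs below are all invented:

```python
# Hypothetical equipment-class template — in the plan, templates like this
# would live in the central `schemas` repo.
CNC_MILL_TEMPLATE = {
    "signals": {
        "spindle-speed": {"type": "float", "unit": "rpm"},
        "cycle-active": {"type": "bool", "unit": None},
        "fault-active": {"type": "bool", "unit": None},
    }
}


def build_equipment_node(path: str, uuid: str, template: dict) -> dict:
    """Instantiate one equipment node: identical structure for every member of
    the class, whatever the device's native protocol."""
    return {
        "path": path,    # can change (equipment moves, area renamed)
        "uuid": uuid,    # stable, never changes, never reused
        "signals": {
            name: {**meta, "value": None}   # drivers fill values at runtime
            for name, meta in template["signals"].items()
        },
    }


# Two mills at different sites, behind different native protocols,
# still expose the same node layout (fake UUIDs for the sketch).
a = build_equipment_node("ent/warsaw-west/bldg-3/line-2/cnc-mill-05",
                         "00000000-0000-4000-8000-000000000001", CNC_MILL_TEMPLATE)
b = build_equipment_node("ent/shannon/_default/line-1/cnc-mill-01",
                         "00000000-0000-4000-8000-000000000002", CNC_MILL_TEMPLATE)
```

The template, not the driver, owns the node layout — which is exactly what makes the equipment namespace the OT-side surface of the canonical model.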

---

## Downstream Consumer Impact

When OtOpcUa is deployed and consumers are cut over:

- **ScadaBridge** reads equipment data from OtOpcUa's equipment namespace and System Platform data from OtOpcUa's System Platform namespace — all from the same OPC UA endpoint. Data locality preserved.
- **Ignition** consumes from each site's OtOpcUa instead of direct WAN OPC UA sessions. WAN session collapse from *N per equipment* to *one per site*.
- **Aveva System Platform IO** consumes equipment data from OtOpcUa's equipment namespace rather than direct equipment sessions. This is a meaningful shift in System Platform's IO layer and **needs validation against Aveva's supported patterns** — System Platform is the most opinionated consumer.
- **LmxOpcUa consumers** continue working — the System Platform namespace carries forward unchanged; the previous auth pattern (credentials, security modes) carries forward.

---

## Sites

### Primary data center
- **South Bend** — primary cluster

### Largest sites (one cluster per production building)
- **Warsaw West**
- **Warsaw North**

### Other integrated sites (single cluster per site)
- **Shannon**
- **Galway**
- **TMT**
- **Ponce**

### Not yet integrated (Year 2+ onboarding)
- **Berlin**
- **Winterthur**
- **Jacksonville**
- Others — list is expected to change

---

## Roadmap Summary

| Year | What happens |
|---|---|
| **Year 1 — Foundation** | Evolve LmxOpcUa into OtOpcUa (equipment namespace + clustering). Run protocol survey (Q1). Build core driver library (Q2+). Deploy to every site. Begin tier-1 cutover (ScadaBridge) at large sites. |
| **Year 2 — Scale** | Complete tier 1 (ScadaBridge) at all sites. Begin tier 2 (Ignition). Build long-tail drivers on demand. |
| **Year 3 — Completion** | Complete tier 2 (Ignition). Execute tier 3 (System Platform IO) with compliance validation. Reach steady state. |

---

## Open Questions / TBDs

Collected from across the plan files — these are items the implementation work will need to resolve:

- Equipment-protocol inventory (drives core library scope) — survey not yet run
- First-cutover site selection for tier-1 (ScadaBridge)
- Per-site tier-2 rollout sequence (Ignition)
- Per-equipment-class criteria for System Platform IO re-validation (tier 3)
- Measured resource impact of co-location with System Platform and ScadaBridge
- Headroom numbers at largest sites (Warsaw campuses)
- Whether any site needs dedicated hardware
- Specific OPC UA security mode + profile combinations required vs allowed
- Where UserName credentials/certs are sourced from (local directory, per-site vault, AD/LDAP)
- Credential rotation cadence
- Audit trail of authz decisions
- Whether namespace ACL definitions live alongside driver/topology config or in their own governance surface
- Exact OPC UA namespace shape for the equipment namespace (how equipment-class templates map to browse tree structure)
- How ScadaBridge templates address equipment across multiple per-node OtOpcUa instances
- Enterprise shortname for UNS hierarchy root (currently `ent` placeholder)
- Storage format for the hierarchy in the `schemas` repo (YAML vs Protobuf vs both)
- Reconciliation rule if System Platform and Ignition derivations of the same equipment's state diverge
- Pilot equipment class for the first canonical definition

---

## Sending Corrections Back

If implementation work surfaces any of the following, send corrections back for integration into the 3-year plan:

- **Inaccuracies** — something stated here or in the plan doesn't match what the codebase or equipment actually does.
- **Missing constraints** — a real-world constraint (Aveva limitation, OPC UA spec requirement, equipment behavior) that the plan doesn't account for.
- **Architectural decisions that need revisiting** — a plan decision that turns out to be impractical, with evidence for why and a proposed alternative.
- **Resolved TBDs** — answers to any of the open questions above, discovered during implementation.
- **New TBDs** — questions the plan didn't think to ask but should have.

**Format for corrections:**
1. What the plan currently says (quote or cite file + section)
2. What you found (evidence — code, equipment behavior, Aveva docs, etc.)
3. What the plan should say instead (proposed change)

Corrections will be reviewed and folded into the authoritative plan files (`goal-state.md`, `current-state.md`, `roadmap.md`, etc.). This handoff document is a snapshot and will **not** be updated — the plan files are the living source of truth.