Add OtOpcUa implementation handoff document

Self-contained extract of all OtOpcUa design material from the plan:
architecture context, LmxOpcUa starting point, two namespaces, driver
strategy, deployment, auth, rollout tiers, UNS hierarchy, canonical
model integration, digital twin touchpoints, sites, roadmap, and all
open TBDs. Includes correction-submission protocol for the implementing
agent.
Author: Joseph Doherty
Date: 2026-04-17 09:21:25 -04:00
Commit: fc3e19fde1 (parent: d89c23a659)
File added: handoffs/otopcua-handoff.md (375 lines)

# OtOpcUa — Implementation Handoff
**Extracted:** 2026-04-17
**Source plan:** [`../goal-state.md`](../goal-state.md), [`../current-state.md`](../current-state.md), [`../roadmap.md`](../roadmap.md), [`../current-state/equipment-protocol-survey.md`](../current-state/equipment-protocol-survey.md)
**Repo for existing codebase:** [lmxopcua](https://gitea.dohertylan.com/dohertj2/lmxopcua) (see [`../links.md`](../links.md))
> **This is a point-in-time extract, not a living document.** The authoritative plan content lives in the source files above. If anything here conflicts with the source files, the source files win.
>
> **Corrections from the implementation agent are expected and welcome.** If the implementation work surfaces inaccuracies, missing constraints, or architectural decisions that need revisiting, send corrections back for integration into the plan. Format: describe what's wrong, what you found, and what the plan should say instead. Corrections will be reviewed and folded into the authoritative plan files — they do not get applied to this handoff document (which is a snapshot, not the source of truth).
---
## What OtOpcUa Is
**OtOpcUa** is a per-site **clustered OPC UA server** that is the **single sanctioned OPC UA access point for all OT data at each site**. It owns the one connection to each piece of equipment and exposes a unified OPC UA surface to every downstream consumer (Aveva System Platform, Ignition, ScadaBridge, future consumers).
It is **not** a new component from scratch — it is the **evolution of the existing LmxOpcUa** codebase. LmxOpcUa is absorbed into OtOpcUa, not replaced by a separate component.
### Where it sits in the architecture
```
Layer 1 Equipment (PLCs, controllers, instruments)
Layer 2 OtOpcUa ← THIS COMPONENT
Layer 3 SCADA (Aveva System Platform + Ignition)
Layer 4 ScadaBridge (sole IT↔OT crossing point)
─── IT/OT Boundary ───
Enterprise IT
```
OtOpcUa lives entirely on the **OT side**. It does not change where the IT↔OT crossing sits (that's ScadaBridge central). It is OT-data-facing, site-local, and fronts OT consumers.
---
## What Exists Today (LmxOpcUa — the starting point)
**Repo:** [lmxopcua](https://gitea.dohertylan.com/dohertj2/lmxopcua)
- **What:** in-house OPC UA server with tight integration to Aveva System Platform.
- **Role:** exposes System Platform data/objects via OPC UA, enabling OPC UA clients (including ScadaBridge and third parties) to consume System Platform data natively.
- **Deployment:** built and deployed to **every Aveva System Platform node** — primary cluster in South Bend and every site-level application server cluster. Each System Platform node runs its own local instance.
- **Namespace source:** each instance interfaces with its **local Application Platform's LMX API**. The OPC UA address space reflects the System Platform objects reachable through that node's LMX API — namespace is per-node and scoped to whatever the local App Server surfaces.
- **Security model:** standard OPC UA security — `None` / `Sign` / `SignAndEncrypt` modes, `Basic256Sha256` and related profiles, **UserName token** authentication for clients. No bespoke auth scheme.
- **Technology:** .NET (in-house pattern shared with ScadaBridge).
### Current equipment access problem that OtOpcUa solves
Today, multiple systems connect to the same equipment directly and concurrently:
- **Aveva System Platform** (for validated data collection via IO drivers)
- **Ignition SCADA** (for KPI data, central from South Bend over WAN)
- **ScadaBridge** (for bridge/integration workloads via Akka.NET OPC UA client)
Consequences:
- Multiple OPC UA sessions per equipment — strains devices with limited concurrent-session support
- No single access-control point — authorization is per-consumer, no site-level chokepoint
- Inconsistent data — same tag read by three consumers can produce three subtly different values (different sampling intervals, deadbands, session buffers)
**OtOpcUa eliminates all three problems** by collapsing to one session per equipment.
---
## Two Namespaces
OtOpcUa serves **two logical namespaces** through a single endpoint:
### 1. Equipment namespace (raw data) — NEW
Live values read from equipment via native OPC UA or native device protocols (Modbus, EtherNet/IP, Siemens S7, etc.) translated to OPC UA. This is the new capability — what the "Layer 2 — raw data" role describes.
Raw equipment data at this layer is exactly that — **raw** — no deadbanding, no aggregation, no business meaning. Business meaning is added at Layer 3 (System Platform / Ignition).
### 2. System Platform namespace (processed data tap) — EXISTING (from LmxOpcUa)
The former LmxOpcUa functionality, folded in. Exposes Aveva System Platform objects (via the local App Server's LMX API) as OPC UA so that OPC UA-native consumers can read processed data through the same endpoint they use for raw equipment data.
### Extensible namespace model
The two-namespace design is not a hard cap. A future **`simulated` namespace** could expose synthetic or replayed equipment data to consumers, letting tier-1/tier-2 consumers (ScadaBridge, Ignition, System Platform IO) be exercised against real-shaped-but-offline data streams without physical equipment. **Architecturally supported, not committed for build** in the 3-year scope. Design the namespace system so adding a third namespace is a configuration change, not a structural refactor.
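The "configuration change, not a structural refactor" goal can be made concrete as a config-driven namespace registry. A minimal Python sketch — names and shapes are invented for illustration, and the real implementation is .NET — where adding `simulated` is one more registered entry:

```python
# Hypothetical namespace registry sketch. The point: a third namespace such as
# `simulated` is a new config entry, not a change to the namespace machinery.
from dataclasses import dataclass

@dataclass(frozen=True)
class NamespaceConfig:
    name: str          # logical namespace name exposed to OPC UA clients
    source: str        # backing data source kind (illustrative)
    read_only: bool    # whether clients may write through this namespace

class NamespaceRegistry:
    def __init__(self):
        self._namespaces = {}

    def register(self, cfg: NamespaceConfig):
        if cfg.name in self._namespaces:
            raise ValueError(f"namespace {cfg.name!r} already registered")
        self._namespaces[cfg.name] = cfg

    def names(self):
        return sorted(self._namespaces)

registry = NamespaceRegistry()
registry.register(NamespaceConfig("equipment", source="drivers", read_only=False))
registry.register(NamespaceConfig("system-platform", source="lmx-api", read_only=True))
# A future simulated namespace would be exactly one more registration:
registry.register(NamespaceConfig("simulated", source="replay", read_only=True))
```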
---
## Responsibilities
- **Single connection per equipment.** OtOpcUa is the **only** OPC UA client that talks to equipment directly. Equipment holds one session — to OtOpcUa — regardless of how many downstream consumers need its data.
- **Site-local aggregation.** Downstream consumers connect to OtOpcUa rather than to equipment directly. A consumer reading the same tag gets the same value regardless of who else is subscribed.
- **Unified OPC UA endpoint for OT data.** Clients read both raw equipment data and processed System Platform data from **one OPC UA endpoint** with two namespaces.
- **Access control / authorization chokepoint.** Authentication, authorization, rate limiting, and audit of OT OPC UA reads/writes are enforced at OtOpcUa, not at each consumer.
- **Clustered for HA.** Multi-node cluster — node loss does not drop equipment or System Platform visibility.
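The single-session rule can be sketched as a fan-out hub — hypothetical names, not the real .NET implementation, but it shows why each equipment holds exactly one session and why every consumer observes the same value:

```python
# Sketch of single-upstream-session fan-out: the first subscriber opens the one
# session to the equipment; all later subscribers reuse it, and one upstream
# value change is delivered identically to every downstream consumer.
from collections import defaultdict

class EquipmentFanOut:
    def __init__(self):
        self._sessions = {}                      # equipment id -> session handle
        self._subscribers = defaultdict(list)    # (equipment, tag) -> callbacks

    def subscribe(self, equipment: str, tag: str, callback):
        if equipment not in self._sessions:
            # First subscriber opens the single upstream session.
            self._sessions[equipment] = f"session-to-{equipment}"
        self._subscribers[(equipment, tag)].append(callback)

    def on_value_change(self, equipment: str, tag: str, value):
        # One upstream notification, fanned out to every consumer.
        for cb in self._subscribers[(equipment, tag)]:
            cb(value)

    def session_count(self, equipment: str) -> int:
        return 1 if equipment in self._sessions else 0

hub = EquipmentFanOut()
seen = []
hub.subscribe("cnc-mill-05", "SpindleSpeed", seen.append)
hub.subscribe("cnc-mill-05", "SpindleSpeed", seen.append)  # second consumer, same session
hub.on_value_change("cnc-mill-05", "SpindleSpeed", 1200)
```

Both consumers receive the identical value from the same upstream read — eliminating the "three subtly different values" problem described above.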
---
## Build vs Buy
**Decision: custom build, in-house.** Not Kepware, Matrikon, Aveva Communication Drivers, HiveMQ Edge, or any off-the-shelf OPC UA aggregator.
**Rationale:**
- Matches the existing in-house .NET pattern (ScadaBridge, SnowBridge, and LmxOpcUa itself)
- Full control over clustering semantics, access model, and integration with ScadaBridge's operational surface
- No per-site commercial license
- No vendor roadmap risk for a component this central
**Primary cost acknowledged:** equipment driver coverage. Commercial aggregators like Kepware justify their license cost through their driver library. Picking custom build means that library has to be built in-house. See Driver Strategy below.
**Reference products** (Kepware, Matrikon, etc.) may still be useful for comparison on specific capabilities even though they're not the target.
---
## Driver Strategy: Hybrid — Proactive Core Library + On-Demand Long-Tail
### Core driver library (proactive, Year 1 → Year 2)
A core library covering the **top equipment protocols** for the estate, built proactively so that most site onboardings can draw from existing drivers rather than blocking on driver work.
**Core library scope is driven by the equipment protocol survey** — see below and [`../current-state/equipment-protocol-survey.md`](../current-state/equipment-protocol-survey.md). A protocol becomes "core" if it meets any of:
1. Present at 3+ sites
2. Combined instance count above ~25
3. Needed to onboard a Year 1 or Year 2 site
4. Strategic vendor whose equipment is expected to grow (judgment call)
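The four criteria reduce to a small predicate over survey rows. A hedged sketch — field names are assumptions, and criterion 4 is modeled as a pre-flagged boolean because it is explicitly a judgment call:

```python
# Illustrative core-vs-long-tail classification from the four criteria above.
def is_core(protocol: dict) -> bool:
    return (
        protocol["site_count"] >= 3              # 1. present at 3+ sites
        or protocol["instance_count"] > 25       # 2. combined instances above ~25
        or protocol["needed_year1_or_2_site"]    # 3. blocks a Year 1/2 onboarding
        or protocol["strategic_vendor"]          # 4. judgment call, pre-flagged
    )

survey = [
    {"name": "Modbus TCP", "site_count": 5, "instance_count": 40,
     "needed_year1_or_2_site": True, "strategic_vendor": False},
    {"name": "ObscureSerialProto", "site_count": 1, "instance_count": 2,
     "needed_year1_or_2_site": False, "strategic_vendor": False},
]
core = [p["name"] for p in survey if is_core(p)]
```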
### Long-tail drivers (on-demand, as sites onboard)
Protocols beyond the core library are built on-demand when the first site that needs the protocol reaches onboarding.
### Implementation approach (not committed, one possible tactic)
Embedded open-source protocol stacks wrapped in OtOpcUa's driver framework:
- **NModbus** for Modbus TCP/RTU
- **Sharp7** for Siemens S7
- **libplctag** for EtherNet/IP (Allen-Bradley)
- Other libraries as needed
This reduces driver work to "write the OPC UA ↔ protocol adapter" rather than "implement the protocol from scratch." The build team may pick this or a different approach per driver.
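One way the adapter seam could look — purely illustrative, since the libraries above are .NET/C and the driver framework design is not committed — is a narrow driver interface that each wrapped protocol stack implements:

```python
# Hypothetical driver-framework boundary: everything OtOpcUa needs from a
# protocol is behind one small interface; the wrapped library lives inside.
from abc import ABC, abstractmethod

class ProtocolDriver(ABC):
    """Contract each protocol adapter implements (connect + read, minimally)."""

    @abstractmethod
    def connect(self, address: str) -> None: ...

    @abstractmethod
    def read(self, tag: str): ...

class ModbusDriver(ProtocolDriver):
    def __init__(self):
        self._registers = {}

    def connect(self, address: str) -> None:
        # A real implementation would open a Modbus TCP session (e.g. via a
        # wrapped protocol library); stubbed here with an in-memory register map.
        self._registers = {"40001": 0}

    def read(self, tag: str):
        return self._registers[tag]

driver: ProtocolDriver = ModbusDriver()
driver.connect("10.0.0.12:502")
value = driver.read("40001")
```

"Write the OPC UA ↔ protocol adapter" then means implementing this interface per protocol, with the OPC UA node-mapping side shared across all drivers.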
### Equipment where no driver is needed
Equipment that already speaks **native OPC UA** requires no driver build — OtOpcUa simply proxies the OPC UA session. The driver-build effort is scoped only to equipment exposing non-OPC-UA protocols.
---
## Equipment Protocol Survey (Year 1 prerequisite — not yet run)
The protocol survey determines the core driver library scope. **It has not been run yet.**
Template, schema, classification rule, rollup views, and a 6-step discovery approach are documented in [`../current-state/equipment-protocol-survey.md`](../current-state/equipment-protocol-survey.md).
**Pre-seeded expected categories** (placeholders, not confirmed):
| ID | Equipment class | Native protocol | Core candidate? |
|---|---|---|---|
| EQP-001 | OPC UA-native equipment | OPC UA | No driver needed |
| EQP-002 | Siemens S7 PLCs (S7-300/400/1200/1500) | Siemens S7 / OPC UA on newer models | Unknown — depends on S7-1500 vs older ratio |
| EQP-003 | Allen-Bradley / Rockwell PLCs | EtherNet/IP (CIP) | Likely core |
| EQP-004 | Generic Modbus devices | Modbus TCP / RTU | Likely core |
| EQP-005 | Fanuc CNC controllers | FOCAS (proprietary library) | Depends on CNC count |
| EQP-006 | Long-tail (everything else) | Various | On-demand |
**Dual mandate:** the same discovery walk also produces the initial **UNS naming hierarchy snapshot** at equipment-instance granularity (see UNS section below). Two outputs, one walk.
---
## Deployment Footprint
**Co-located on existing Aveva System Platform nodes.** Same pattern as ScadaBridge — no dedicated hardware.
- **Cluster size:** 2-node clusters at most sites. Largest sites (Warsaw West, Warsaw North) run one cluster per production building, matching ScadaBridge's and System Platform's existing per-building cluster pattern.
- **Rationale:** zero new hardware footprint; OtOpcUa largely replaces what LmxOpcUa already runs on these nodes, so the incremental resource draw is just the new equipment-driver and clustering work.
- **Trade-off accepted:** System Platform, ScadaBridge, and OtOpcUa all share nodes. Resource contention mitigated by (1) modest driver workload relative to ScadaBridge's proven 225k/sec OPC UA ingestion ceiling, (2) monitoring via observability signals, (3) option to move off-node if contention is observed.
_TBD — measured impact of adding this workload; headroom numbers at largest sites; whether any site needs dedicated hardware._
---
## Authorization Model
**OPC UA-native — user tokens for authentication + namespace-level ACLs for authorization.**
- Every downstream consumer authenticates with **standard OPC UA user tokens** (UserName tokens and/or X.509 client certs, per site/consumer policy)
- Authorization enforced via **namespace-level ACLs** — each identity scoped to permitted equipment/namespaces
- **Inherits the LmxOpcUa auth pattern** — consumer-side experience does not change for clients that used LmxOpcUa previously
**Explicitly not federated with the enterprise IdP.** OT data access is a pure OT concern. The plan's IT/OT boundary stays at ScadaBridge central, not at OtOpcUa. Two identity stores to operate (enterprise IdP for IT-facing components, OPC UA-native identities for OtOpcUa) is the accepted trade-off.
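A minimal sketch of the identity-to-namespace ACL check — the shape is an assumption, since credential source and ACL storage are open TBDs below; identity names and path patterns here are invented:

```python
# Hypothetical namespace-level ACL: each authenticated identity maps to a set of
# permitted namespaces, optionally narrowed to equipment subtrees by path prefix.
ACLS = {
    # identity -> namespace -> allowed browse-path prefixes ("*" = whole namespace)
    "scadabridge-svc": {"equipment": ["*"], "system-platform": ["*"]},
    "ignition-svc":    {"equipment": ["ent/warsaw-west/*"]},
}

def authorize(identity: str, namespace: str, browse_path: str) -> bool:
    """True if this identity may read the given path in the given namespace."""
    prefixes = ACLS.get(identity, {}).get(namespace, [])
    return any(p == "*" or browse_path.startswith(p.rstrip("*")) for p in prefixes)
```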
_TBD — specific security mode + profile combinations required; credential source (local directory, per-site vault, AD/LDAP); rotation cadence; audit trail of authz decisions._
---
## Rollout Posture
### Deploy everywhere fast
The cluster software (server + core driver library) is built and rolled out to **every site's System Platform nodes as fast as practical** — deployment to all sites is treated as a **prerequisite**, not a gradual effort.
"Deployment" = installing and configuring so the node is ready to front equipment. It does **not** mean immediately migrating consumers. A deployed but inactive cluster is cheap.
### Tiered consumer cutover (sequenced by risk)
Existing direct equipment connections are moved to OtOpcUa **one consumer at a time**, in risk order:
| Tier | Consumer | Why this order | Timeline |
|---|---|---|---|
| 1 | **ScadaBridge** | We own both ends; lowest-risk cutover; validates cluster under real load | Year 1 (begin at large sites) → Year 2 (complete all sites) |
| 2 | **Ignition** | Reduces WAN OPC UA sessions from *N per equipment* to *one per site*; medium risk | Year 2 (begin) → Year 3 (complete) |
| 3 | **Aveva System Platform IO** | Hardest cutover — System Platform IO feeds validated data collection; needs compliance validation | Year 3 |
**Steady state at end of Year 3:** every equipment session is held by OtOpcUa; every downstream consumer reads OT data through it.
---
## UNS Naming Hierarchy (must implement in equipment namespace)
OtOpcUa's equipment namespace browse paths must implement the plan's **5-level UNS naming hierarchy**:
### Five levels, always present
| Level | Name | Example |
|---|---|---|
| 1 | Enterprise | `ent` *(placeholder — real shortname TBD)* |
| 2 | Site | `warsaw-west`, `shannon`, `south-bend` |
| 3 | Area | `bldg-3`, `_default` (placeholder at single-cluster sites) |
| 4 | Line | `line-2`, `assembly-a` |
| 5 | Equipment | `cnc-mill-05`, `injection-molder-02` |
**OPC UA browse path form:** `ent/warsaw-west/bldg-3/line-2/cnc-mill-05`
**Text form (for messages, dbt keys):** `ent.warsaw-west.bldg-3.line-2.cnc-mill-05`
Signals / tags are **children of equipment nodes** (level 6), not a separate path level.
### Naming rules
- `[a-z0-9-]` only. Lowercase enforced.
- Hyphens within a segment (`warsaw-west`), slashes between segments in OPC UA browse paths.
- Max 32 chars per segment, max 200 chars total path.
- `_default` is the only reserved segment name (placeholder for levels that don't apply).
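The rules above compile directly to a validator. A Python sketch for illustration (the authoritative definition will live in the `schemas` repo). Note one subtlety: `_default` contains an underscore, outside the `[a-z0-9-]` charset, so as the one reserved name it has to be special-cased:

```python
# Validator for the 5-level UNS browse path plus the dotted text-form mapping.
import re

SEGMENT = re.compile(r"^[a-z0-9-]{1,32}$")   # lowercase, hyphens, max 32 chars

def validate_uns_path(path: str) -> bool:
    """Validate e.g. ent/warsaw-west/bldg-3/line-2/cnc-mill-05."""
    if len(path) > 200:                       # max total path length
        return False
    segments = path.split("/")
    if len(segments) != 5:                    # enterprise/site/area/line/equipment
        return False
    # `_default` is the single reserved exception to the charset rule.
    return all(s == "_default" or SEGMENT.match(s) for s in segments)

def to_text_form(path: str) -> str:
    """OPC UA browse path -> dotted text form for messages and dbt keys."""
    return path.replace("/", ".")
```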
### Stable equipment UUID
Every equipment node must expose a **stable UUIDv4** as a property:
- UUID is assigned once, never changes, never reused.
- Path can change (equipment moves, area renamed); UUID cannot.
- Canonical events downstream carry both UUID (for joins/lineage) and path (for dashboards/filtering).
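The path-changes-but-UUID-doesn't invariant, as a sketch (names illustrative):

```python
# The UUID is minted once at registration; moves and renames touch only the path.
import uuid

class EquipmentIdentity:
    def __init__(self, path: str):
        self.uuid = str(uuid.uuid4())   # assigned once, never changes, never reused
        self.path = path                # may change over the equipment's life

    def move(self, new_path: str):
        self.path = new_path            # path changes; UUID deliberately untouched

eq = EquipmentIdentity("ent/warsaw-west/bldg-3/line-2/cnc-mill-05")
original_uuid = eq.uuid
eq.move("ent/warsaw-west/bldg-7/line-1/cnc-mill-05")   # equipment relocated
```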
### Authority
The hierarchy definition lives in the **central `schemas` repo** (not yet created). OtOpcUa is a **consumer** of the authoritative definition — it builds its per-site browse tree from the relevant subtree at deploy/config time. **Drift between OtOpcUa's browse paths and the `schemas` repo is a defect.**
---
## Canonical Model Integration
OtOpcUa's equipment namespace is one of **three surfaces** that expose the plan's canonical equipment / production / event model:
| Surface | Role |
|---|---|
| **OtOpcUa equipment namespace** | Canonical per-equipment OPC UA node structure. Equipment-class templates from `schemas` repo define the node layout. |
| **Redpanda topics + Protobuf schemas** | Canonical event shape on the wire. Source of truth for the model lives in the `schemas` repo. |
| **dbt curated layer in Snowflake** | Canonical analytics model — same vocabulary, different access pattern. |
### Canonical machine state vocabulary
The plan commits to a canonical set of machine state values. OtOpcUa does **not derive these states** (that's a Layer 3 responsibility — System Platform / Ignition), but OtOpcUa's equipment namespace should expose the raw signals that feed the derivation, and the System Platform namespace will expose the derived state values using this vocabulary:
| State | Semantics |
|---|---|
| `Running` | Actively producing at or near theoretical cycle time |
| `Idle` | Powered and available but not producing |
| `Faulted` | Fault raised, requires intervention |
| `Starved` | Ready but blocked by missing upstream input |
| `Blocked` | Ready but blocked by downstream constraint |
**Under consideration (TBD):** `Changeover`, `Maintenance`, `Setup` / `WarmingUp`.
State derivation lives at Layer 3 and is published as `equipment.state.transitioned` events on Redpanda. OtOpcUa's role is to deliver the raw signals cleanly so derivation can be accurate.
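The committed vocabulary and the shape of the downstream event, sketched in Python. The enum values are from the table above; the event field names are assumptions, since the wire schema will live as Protobuf in the (not yet created) `schemas` repo, and OtOpcUa itself never emits these events:

```python
# Canonical machine-state vocabulary plus an illustrative
# equipment.state.transitioned event shape (field names hypothetical).
from dataclasses import dataclass
from enum import Enum

class MachineState(Enum):
    RUNNING = "Running"    # producing at or near theoretical cycle time
    IDLE = "Idle"          # powered and available, not producing
    FAULTED = "Faulted"    # fault raised, requires intervention
    STARVED = "Starved"    # ready, blocked by missing upstream input
    BLOCKED = "Blocked"    # ready, blocked by downstream constraint
    # Under consideration (TBD): Changeover, Maintenance, Setup / WarmingUp

@dataclass(frozen=True)
class StateTransitioned:
    equipment_uuid: str    # stable UUID, for joins/lineage
    equipment_path: str    # dotted text form, for dashboards/filtering
    previous: MachineState
    current: MachineState
    timestamp_utc: str

event = StateTransitioned(
    equipment_uuid="00000000-0000-4000-8000-000000000000",
    equipment_path="ent.warsaw-west.bldg-3.line-2.cnc-mill-05",
    previous=MachineState.RUNNING,
    current=MachineState.STARVED,
    timestamp_utc="2026-04-17T13:21:25Z",
)
```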
---
## Digital Twin Touchpoints
### Use case 1 — Standardized equipment state model
OtOpcUa delivers the raw signals that feed the canonical state derivation at Layer 3. Equipment-class templates in the `schemas` repo define which raw signals each equipment class exposes, standardized across the estate.
### Use case 2 — Virtual testing / simulation
OtOpcUa's namespace architecture can accommodate a future `simulated` namespace — replaying historical equipment data to exercise tier-1/tier-2 consumers without physical equipment. **Not committed for build**, but the namespace system should be designed so adding it is a configuration change.
### Use case 3 — Cross-system canonical model
OtOpcUa's equipment namespace IS the OT-side surface of the canonical model. Every consumer reading equipment data through OtOpcUa sees the same node structure, same naming, same data types, same units — regardless of the underlying equipment's native protocol.
---
## Downstream Consumer Impact
When OtOpcUa is deployed and consumers are cut over:
- **ScadaBridge** reads equipment data from OtOpcUa's equipment namespace and System Platform data from OtOpcUa's System Platform namespace — all from the same OPC UA endpoint. Data locality preserved.
- **Ignition** consumes from each site's OtOpcUa instead of direct WAN OPC UA sessions. WAN session collapse from *N per equipment* to *one per site*.
- **Aveva System Platform IO** consumes equipment data from OtOpcUa's equipment namespace rather than direct equipment sessions. This is a meaningful shift in System Platform's IO layer and **needs validation against Aveva's supported patterns** — System Platform is the most opinionated consumer.
- **LmxOpcUa consumers** continue working — the System Platform namespace carries forward unchanged; the previous auth pattern (credentials, security modes) carries forward.
---
## Sites
### Primary data center
- **South Bend** — primary cluster
### Largest sites (one cluster per production building)
- **Warsaw West**
- **Warsaw North**
### Other integrated sites (single cluster per site)
- **Shannon**
- **Galway**
- **TMT**
- **Ponce**
### Not yet integrated (Year 2+ onboarding)
- **Berlin**
- **Winterthur**
- **Jacksonville**
- Others — list is expected to change
---
## Roadmap Summary
| Year | What happens |
|---|---|
| **Year 1 — Foundation** | Evolve LmxOpcUa into OtOpcUa (equipment namespace + clustering). Run protocol survey (Q1). Build core driver library (Q2+). Deploy to every site. Begin tier-1 cutover (ScadaBridge) at large sites. |
| **Year 2 — Scale** | Complete tier 1 (ScadaBridge) all sites. Begin tier 2 (Ignition). Build long-tail drivers on demand. |
| **Year 3 — Completion** | Complete tier 2 (Ignition). Execute tier 3 (System Platform IO) with compliance validation. Reach steady state. |
---
## Open Questions / TBDs
Collected from across the plan files — these are items the implementation work will need to resolve:
- Equipment-protocol inventory (drives core library scope) — survey not yet run
- First-cutover site selection for tier-1 (ScadaBridge)
- Per-site tier-2 rollout sequence (Ignition)
- Per-equipment-class criteria for System Platform IO re-validation (tier 3)
- Measured resource impact of co-location with System Platform and ScadaBridge
- Headroom numbers at largest sites (Warsaw campuses)
- Whether any site needs dedicated hardware
- Specific OPC UA security mode + profile combinations required vs allowed
- Where UserName credentials/certs are sourced from (local directory, per-site vault, AD/LDAP)
- Credential rotation cadence
- Audit trail of authz decisions
- Whether namespace ACL definitions live alongside driver/topology config or in their own governance surface
- Exact OPC UA namespace shape for the equipment namespace (how equipment-class templates map to browse tree structure)
- How ScadaBridge templates address equipment across multiple per-node OtOpcUa instances
- Enterprise shortname for UNS hierarchy root (currently `ent` placeholder)
- Storage format for the hierarchy in the `schemas` repo (YAML vs Protobuf vs both)
- Reconciliation rule if System Platform and Ignition derivations of the same equipment's state diverge
- Pilot equipment class for the first canonical definition
---
## Sending Corrections Back
If implementation work surfaces any of the following, send corrections back for integration into the 3-year plan:
- **Inaccuracies** — something stated here or in the plan doesn't match what the codebase or equipment actually does.
- **Missing constraints** — a real-world constraint (Aveva limitation, OPC UA spec requirement, equipment behavior) that the plan doesn't account for.
- **Architectural decisions that need revisiting** — a plan decision that turns out to be impractical, with evidence for why and a proposed alternative.
- **Resolved TBDs** — answers to any of the open questions above, discovered during implementation.
- **New TBDs** — questions the plan didn't think to ask but should have.
**Format for corrections:**
1. What the plan currently says (quote or cite file + section)
2. What you found (evidence — code, equipment behavior, Aveva docs, etc.)
3. What the plan should say instead (proposed change)
Corrections will be reviewed and folded into the authoritative plan files (`goal-state.md`, `current-state.md`, `roadmap.md`, etc.). This handoff document is a snapshot and will **not** be updated — the plan files are the living source of truth.