Add OtOpcUa implementation handoff document

Self-contained extract of all OtOpcUa design material from the plan:
architecture context, LmxOpcUa starting point, two namespaces, driver
strategy, deployment, auth, rollout tiers, UNS hierarchy, canonical
model integration, digital twin touchpoints, sites, roadmap, and all
open TBDs. Includes correction-submission protocol for the implementing
agent.
Author: Joseph Doherty
Date: 2026-04-17 09:21:25 -04:00
Commit: fc3e19fde1 (parent: d89c23a659)
File added: handoffs/otopcua-handoff.md (375 lines)

# OtOpcUa — Implementation Handoff
**Extracted:** 2026-04-17
**Source plan:** [`../goal-state.md`](../goal-state.md), [`../current-state.md`](../current-state.md), [`../roadmap.md`](../roadmap.md), [`../current-state/equipment-protocol-survey.md`](../current-state/equipment-protocol-survey.md)
**Repo for existing codebase:** [lmxopcua](https://gitea.dohertylan.com/dohertj2/lmxopcua) (see [`../links.md`](../links.md))
> **This is a point-in-time extract, not a living document.** The authoritative plan content lives in the source files above. If anything here conflicts with the source files, the source files win.
>
> **Corrections from the implementation agent are expected and welcome.** If the implementation work surfaces inaccuracies, missing constraints, or architectural decisions that need revisiting, send corrections back for integration into the plan. Format: describe what's wrong, what you found, and what the plan should say instead. Corrections will be reviewed and folded into the authoritative plan files — they do not get applied to this handoff document (which is a snapshot, not the source of truth).
---
## What OtOpcUa Is
**OtOpcUa** is a per-site **clustered OPC UA server** that is the **single sanctioned OPC UA access point for all OT data at each site**. It owns the one connection to each piece of equipment and exposes a unified OPC UA surface to every downstream consumer (Aveva System Platform, Ignition, ScadaBridge, future consumers).
It is **not** a new component from scratch — it is the **evolution of the existing LmxOpcUa** codebase. LmxOpcUa is absorbed into OtOpcUa, not replaced by a separate component.
### Where it sits in the architecture
```
Layer 1 Equipment (PLCs, controllers, instruments)
Layer 2 OtOpcUa ← THIS COMPONENT
Layer 3 SCADA (Aveva System Platform + Ignition)
Layer 4 ScadaBridge (sole IT↔OT crossing point)
─── IT/OT Boundary ───
Enterprise IT
```
OtOpcUa lives entirely on the **OT side**. It does not change where the IT↔OT crossing sits (that's ScadaBridge central). It is OT-data-facing, site-local, and fronts OT consumers.
---
## What Exists Today (LmxOpcUa — the starting point)
**Repo:** [lmxopcua](https://gitea.dohertylan.com/dohertj2/lmxopcua)
- **What:** in-house OPC UA server with tight integration to Aveva System Platform.
- **Role:** exposes System Platform data/objects via OPC UA, enabling OPC UA clients (including ScadaBridge and third parties) to consume System Platform data natively.
- **Deployment:** built and deployed to **every Aveva System Platform node** — primary cluster in South Bend and every site-level application server cluster. Each System Platform node runs its own local instance.
- **Namespace source:** each instance interfaces with its **local Application Platform's LMX API**. The OPC UA address space reflects the System Platform objects reachable through that node's LMX API — namespace is per-node and scoped to whatever the local App Server surfaces.
- **Security model:** standard OPC UA security — `None` / `Sign` / `SignAndEncrypt` modes, `Basic256Sha256` and related profiles, **UserName token** authentication for clients. No bespoke auth scheme.
- **Technology:** .NET (in-house pattern shared with ScadaBridge).
### Current equipment access problem that OtOpcUa solves
Today, multiple systems connect to the same equipment directly and concurrently:
- **Aveva System Platform** (for validated data collection via IO drivers)
- **Ignition SCADA** (for KPI data, central from South Bend over WAN)
- **ScadaBridge** (for bridge/integration workloads via Akka.NET OPC UA client)
Consequences:
- Multiple OPC UA sessions per equipment — strains devices with limited concurrent-session support
- No single access-control point — authorization is per-consumer, no site-level chokepoint
- Inconsistent data — same tag read by three consumers can produce three subtly different values (different sampling intervals, deadbands, session buffers)
**OtOpcUa eliminates all three problems** by collapsing to one session per equipment.
---
## Two Namespaces
OtOpcUa serves **two logical namespaces** through a single endpoint:
### 1. Equipment namespace (raw data) — NEW
Live values read from equipment via native OPC UA or native device protocols (Modbus, EtherNet/IP, Siemens S7, etc.) translated to OPC UA. This is the new capability — what the "Layer 2 — raw data" role describes.
Raw equipment data at this layer is exactly that — **raw** — no deadbanding, no aggregation, no business meaning. Business meaning is added at Layer 3 (System Platform / Ignition).
### 2. System Platform namespace (processed data tap) — EXISTING (from LmxOpcUa)
The former LmxOpcUa functionality, folded in. Exposes Aveva System Platform objects (via the local App Server's LMX API) as OPC UA so that OPC UA-native consumers can read processed data through the same endpoint they use for raw equipment data.
### Extensible namespace model
The two-namespace design is not a hard cap. A future **`simulated` namespace** could expose synthetic or replayed equipment data to consumers, letting tier-1/tier-2 consumers (ScadaBridge, Ignition, System Platform IO) be exercised against real-shaped-but-offline data streams without physical equipment. **Architecturally supported, not committed for build** in the 3-year scope. Design the namespace system so adding a third namespace is a configuration change, not a structural refactor.
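The "configuration change, not a structural refactor" goal can be made concrete as a config-driven namespace registry. A minimal Python sketch — names and shapes are invented for illustration, and the real implementation is .NET — where adding `simulated` is one more registered entry:

```python
# Hypothetical namespace registry sketch. The point: a third namespace such as
# `simulated` is a new config entry, not a change to the namespace machinery.
from dataclasses import dataclass

@dataclass(frozen=True)
class NamespaceConfig:
    name: str          # logical namespace name exposed to OPC UA clients
    source: str        # backing data source kind (illustrative)
    read_only: bool    # whether clients may write through this namespace

class NamespaceRegistry:
    def __init__(self):
        self._namespaces = {}

    def register(self, cfg: NamespaceConfig):
        if cfg.name in self._namespaces:
            raise ValueError(f"namespace {cfg.name!r} already registered")
        self._namespaces[cfg.name] = cfg

    def names(self):
        return sorted(self._namespaces)

registry = NamespaceRegistry()
registry.register(NamespaceConfig("equipment", source="drivers", read_only=False))
registry.register(NamespaceConfig("system-platform", source="lmx-api", read_only=True))
# A future simulated namespace would be exactly one more registration:
registry.register(NamespaceConfig("simulated", source="replay", read_only=True))
```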
---
## Responsibilities
- **Single connection per equipment.** OtOpcUa is the **only** OPC UA client that talks to equipment directly. Equipment holds one session — to OtOpcUa — regardless of how many downstream consumers need its data.
- **Site-local aggregation.** Downstream consumers connect to OtOpcUa rather than to equipment directly. A consumer reading the same tag gets the same value regardless of who else is subscribed.
- **Unified OPC UA endpoint for OT data.** Clients read both raw equipment data and processed System Platform data from **one OPC UA endpoint** with two namespaces.
- **Access control / authorization chokepoint.** Authentication, authorization, rate limiting, and audit of OT OPC UA reads/writes are enforced at OtOpcUa, not at each consumer.
- **Clustered for HA.** Multi-node cluster — node loss does not drop equipment or System Platform visibility.
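The single-session rule can be sketched as a fan-out hub — hypothetical names, not the real .NET implementation, but it shows why each equipment holds exactly one session and why every consumer observes the same value:

```python
# Sketch of single-upstream-session fan-out: the first subscriber opens the one
# session to the equipment; all later subscribers reuse it, and one upstream
# value change is delivered identically to every downstream consumer.
from collections import defaultdict

class EquipmentFanOut:
    def __init__(self):
        self._sessions = {}                      # equipment id -> session handle
        self._subscribers = defaultdict(list)    # (equipment, tag) -> callbacks

    def subscribe(self, equipment: str, tag: str, callback):
        if equipment not in self._sessions:
            # First subscriber opens the single upstream session.
            self._sessions[equipment] = f"session-to-{equipment}"
        self._subscribers[(equipment, tag)].append(callback)

    def on_value_change(self, equipment: str, tag: str, value):
        # One upstream notification, fanned out to every consumer.
        for cb in self._subscribers[(equipment, tag)]:
            cb(value)

    def session_count(self, equipment: str) -> int:
        return 1 if equipment in self._sessions else 0

hub = EquipmentFanOut()
seen = []
hub.subscribe("cnc-mill-05", "SpindleSpeed", seen.append)
hub.subscribe("cnc-mill-05", "SpindleSpeed", seen.append)  # second consumer, same session
hub.on_value_change("cnc-mill-05", "SpindleSpeed", 1200)
```

Both consumers receive the identical value from the same upstream read — eliminating the "three subtly different values" problem described above.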
---
## Build vs Buy
**Decision: custom build, in-house.** Not Kepware, Matrikon, Aveva Communication Drivers, HiveMQ Edge, or any off-the-shelf OPC UA aggregator.
**Rationale:**
- Matches the existing in-house .NET pattern (ScadaBridge, SnowBridge, and LmxOpcUa itself)
- Full control over clustering semantics, access model, and integration with ScadaBridge's operational surface
- No per-site commercial license
- No vendor roadmap risk for a component this central
**Primary cost acknowledged:** equipment driver coverage. Commercial aggregators like Kepware justify their license cost through their driver library. Picking custom build means that library has to be built in-house. See Driver Strategy below.
**Reference products** (Kepware, Matrikon, etc.) may still be useful for comparison on specific capabilities even though they're not the target.
---
## Driver Strategy: Hybrid — Proactive Core Library + On-Demand Long-Tail
### Core driver library (proactive, Year 1 → Year 2)
A core library covering the **top equipment protocols** for the estate, built proactively so that most site onboardings can draw from existing drivers rather than blocking on driver work.
**Core library scope is driven by the equipment protocol survey** — see below and [`../current-state/equipment-protocol-survey.md`](../current-state/equipment-protocol-survey.md). A protocol becomes "core" if it meets any of:
1. Present at 3+ sites
2. Combined instance count above ~25
3. Needed to onboard a Year 1 or Year 2 site
4. Strategic vendor whose equipment is expected to grow (judgment call)
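The four criteria reduce to a small predicate over survey rows. A hedged sketch — field names are assumptions, and criterion 4 is modeled as a pre-flagged boolean because it is explicitly a judgment call:

```python
# Illustrative core-vs-long-tail classification from the four criteria above.
def is_core(protocol: dict) -> bool:
    return (
        protocol["site_count"] >= 3              # 1. present at 3+ sites
        or protocol["instance_count"] > 25       # 2. combined instances above ~25
        or protocol["needed_year1_or_2_site"]    # 3. blocks a Year 1/2 onboarding
        or protocol["strategic_vendor"]          # 4. judgment call, pre-flagged
    )

survey = [
    {"name": "Modbus TCP", "site_count": 5, "instance_count": 40,
     "needed_year1_or_2_site": True, "strategic_vendor": False},
    {"name": "ObscureSerialProto", "site_count": 1, "instance_count": 2,
     "needed_year1_or_2_site": False, "strategic_vendor": False},
]
core = [p["name"] for p in survey if is_core(p)]
```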
### Long-tail drivers (on-demand, as sites onboard)
Protocols beyond the core library are built on-demand when the first site that needs the protocol reaches onboarding.
### Implementation approach (not committed, one possible tactic)
Embedded open-source protocol stacks wrapped in OtOpcUa's driver framework:
- **NModbus** for Modbus TCP/RTU
- **Sharp7** for Siemens S7
- **libplctag** for EtherNet/IP (Allen-Bradley)
- Other libraries as needed
This reduces driver work to "write the OPC UA ↔ protocol adapter" rather than "implement the protocol from scratch." The build team may pick this or a different approach per driver.
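One way the adapter seam could look — purely illustrative, since the libraries above are .NET/C and the driver framework design is not committed — is a narrow driver interface that each wrapped protocol stack implements:

```python
# Hypothetical driver-framework boundary: everything OtOpcUa needs from a
# protocol is behind one small interface; the wrapped library lives inside.
from abc import ABC, abstractmethod

class ProtocolDriver(ABC):
    """Contract each protocol adapter implements (connect + read, minimally)."""

    @abstractmethod
    def connect(self, address: str) -> None: ...

    @abstractmethod
    def read(self, tag: str): ...

class ModbusDriver(ProtocolDriver):
    def __init__(self):
        self._registers = {}

    def connect(self, address: str) -> None:
        # A real implementation would open a Modbus TCP session (e.g. via a
        # wrapped protocol library); stubbed here with an in-memory register map.
        self._registers = {"40001": 0}

    def read(self, tag: str):
        return self._registers[tag]

driver: ProtocolDriver = ModbusDriver()
driver.connect("10.0.0.12:502")
value = driver.read("40001")
```

"Write the OPC UA ↔ protocol adapter" then means implementing this interface per protocol, with the OPC UA node-mapping side shared across all drivers.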
### Equipment where no driver is needed
Equipment that already speaks **native OPC UA** requires no driver build — OtOpcUa simply proxies the OPC UA session. The driver-build effort is scoped only to equipment exposing non-OPC-UA protocols.
---
## Equipment Protocol Survey (Year 1 prerequisite — not yet run)
The protocol survey determines the core driver library scope. **It has not been run yet.**
Template, schema, classification rule, rollup views, and a 6-step discovery approach are documented in [`../current-state/equipment-protocol-survey.md`](../current-state/equipment-protocol-survey.md).
**Pre-seeded expected categories** (placeholders, not confirmed):
| ID | Equipment class | Native protocol | Core candidate? |
|---|---|---|---|
| EQP-001 | OPC UA-native equipment | OPC UA | No driver needed |
| EQP-002 | Siemens S7 PLCs (S7-300/400/1200/1500) | Siemens S7 / OPC UA on newer models | Unknown — depends on S7-1500 vs older ratio |
| EQP-003 | Allen-Bradley / Rockwell PLCs | EtherNet/IP (CIP) | Likely core |
| EQP-004 | Generic Modbus devices | Modbus TCP / RTU | Likely core |
| EQP-005 | Fanuc CNC controllers | FOCAS (proprietary library) | Depends on CNC count |
| EQP-006 | Long-tail (everything else) | Various | On-demand |
**Dual mandate:** the same discovery walk also produces the initial **UNS naming hierarchy snapshot** at equipment-instance granularity (see UNS section below). Two outputs, one walk.
---
## Deployment Footprint
**Co-located on existing Aveva System Platform nodes.** Same pattern as ScadaBridge — no dedicated hardware.
- **Cluster size:** 2-node clusters at most sites. Largest sites (Warsaw West, Warsaw North) run one cluster per production building, matching ScadaBridge's and System Platform's existing per-building cluster pattern.
- **Rationale:** zero new hardware footprint; OtOpcUa largely replaces what LmxOpcUa already runs on these nodes, so the incremental resource draw is just the new equipment-driver and clustering work.
- **Trade-off accepted:** System Platform, ScadaBridge, and OtOpcUa all share nodes. Resource contention mitigated by (1) modest driver workload relative to ScadaBridge's proven 225k/sec OPC UA ingestion ceiling, (2) monitoring via observability signals, (3) option to move off-node if contention is observed.
_TBD — measured impact of adding this workload; headroom numbers at largest sites; whether any site needs dedicated hardware._
---
## Authorization Model
**OPC UA-native — user tokens for authentication + namespace-level ACLs for authorization.**
- Every downstream consumer authenticates with **standard OPC UA user tokens** (UserName tokens and/or X.509 client certs, per site/consumer policy)
- Authorization enforced via **namespace-level ACLs** — each identity scoped to permitted equipment/namespaces
- **Inherits the LmxOpcUa auth pattern** — consumer-side experience does not change for clients that used LmxOpcUa previously
**Explicitly not federated with the enterprise IdP.** OT data access is a pure OT concern. The plan's IT/OT boundary stays at ScadaBridge central, not at OtOpcUa. Two identity stores to operate (enterprise IdP for IT-facing components, OPC UA-native identities for OtOpcUa) is the accepted trade-off.
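A minimal sketch of the identity-to-namespace ACL check — the shape is an assumption, since credential source and ACL storage are open TBDs below; identity names and path patterns here are invented:

```python
# Hypothetical namespace-level ACL: each authenticated identity maps to a set of
# permitted namespaces, optionally narrowed to equipment subtrees by path prefix.
ACLS = {
    # identity -> namespace -> allowed browse-path prefixes ("*" = whole namespace)
    "scadabridge-svc": {"equipment": ["*"], "system-platform": ["*"]},
    "ignition-svc":    {"equipment": ["ent/warsaw-west/*"]},
}

def authorize(identity: str, namespace: str, browse_path: str) -> bool:
    """True if this identity may read the given path in the given namespace."""
    prefixes = ACLS.get(identity, {}).get(namespace, [])
    return any(p == "*" or browse_path.startswith(p.rstrip("*")) for p in prefixes)
```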
_TBD — specific security mode + profile combinations required; credential source (local directory, per-site vault, AD/LDAP); rotation cadence; audit trail of authz decisions._
---
## Rollout Posture
### Deploy everywhere fast
The cluster software (server + core driver library) is built and rolled out to **every site's System Platform nodes as fast as practical** — deployment to all sites is treated as a **prerequisite**, not a gradual effort.
"Deployment" = installing and configuring so the node is ready to front equipment. It does **not** mean immediately migrating consumers. A deployed but inactive cluster is cheap.
### Tiered consumer cutover (sequenced by risk)
Existing direct equipment connections are moved to OtOpcUa **one consumer at a time**, in risk order:
| Tier | Consumer | Why this order | Timeline |
|---|---|---|---|
| 1 | **ScadaBridge** | We own both ends; lowest-risk cutover; validates cluster under real load | Year 1 (begin at large sites) → Year 2 (complete all sites) |
| 2 | **Ignition** | Reduces WAN OPC UA sessions from *N per equipment* to *one per site*; medium risk | Year 2 (begin) → Year 3 (complete) |
| 3 | **Aveva System Platform IO** | Hardest cutover — System Platform IO feeds validated data collection; needs compliance validation | Year 3 |
**Steady state at end of Year 3:** every equipment session is held by OtOpcUa; every downstream consumer reads OT data through it.
---
## UNS Naming Hierarchy (must implement in equipment namespace)
OtOpcUa's equipment namespace browse paths must implement the plan's **5-level UNS naming hierarchy**:
### Five levels, always present
| Level | Name | Example |
|---|---|---|
| 1 | Enterprise | `ent` *(placeholder — real shortname TBD)* |
| 2 | Site | `warsaw-west`, `shannon`, `south-bend` |
| 3 | Area | `bldg-3`, `_default` (placeholder at single-cluster sites) |
| 4 | Line | `line-2`, `assembly-a` |
| 5 | Equipment | `cnc-mill-05`, `injection-molder-02` |
**OPC UA browse path form:** `ent/warsaw-west/bldg-3/line-2/cnc-mill-05`
**Text form (for messages, dbt keys):** `ent.warsaw-west.bldg-3.line-2.cnc-mill-05`
Signals / tags are **children of equipment nodes** (level 6), not a separate path level.
### Naming rules
- `[a-z0-9-]` only. Lowercase enforced.
- Hyphens within a segment (`warsaw-west`), slashes between segments in OPC UA browse paths.
- Max 32 chars per segment, max 200 chars total path.
- `_default` is the only reserved segment name (placeholder for levels that don't apply).
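The rules above compile directly to a validator. A Python sketch for illustration (the authoritative definition will live in the `schemas` repo). Note one subtlety: `_default` contains an underscore, outside the `[a-z0-9-]` charset, so as the one reserved name it has to be special-cased:

```python
# Validator for the 5-level UNS browse path plus the dotted text-form mapping.
import re

SEGMENT = re.compile(r"^[a-z0-9-]{1,32}$")   # lowercase, hyphens, max 32 chars

def validate_uns_path(path: str) -> bool:
    """Validate e.g. ent/warsaw-west/bldg-3/line-2/cnc-mill-05."""
    if len(path) > 200:                       # max total path length
        return False
    segments = path.split("/")
    if len(segments) != 5:                    # enterprise/site/area/line/equipment
        return False
    # `_default` is the single reserved exception to the charset rule.
    return all(s == "_default" or SEGMENT.match(s) for s in segments)

def to_text_form(path: str) -> str:
    """OPC UA browse path -> dotted text form for messages and dbt keys."""
    return path.replace("/", ".")
```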
### Stable equipment UUID
Every equipment node must expose a **stable UUIDv4** as a property:
- UUID is assigned once, never changes, never reused.
- Path can change (equipment moves, area renamed); UUID cannot.
- Canonical events downstream carry both UUID (for joins/lineage) and path (for dashboards/filtering).
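The path-changes-but-UUID-doesn't invariant, as a sketch (names illustrative):

```python
# The UUID is minted once at registration; moves and renames touch only the path.
import uuid

class EquipmentIdentity:
    def __init__(self, path: str):
        self.uuid = str(uuid.uuid4())   # assigned once, never changes, never reused
        self.path = path                # may change over the equipment's life

    def move(self, new_path: str):
        self.path = new_path            # path changes; UUID deliberately untouched

eq = EquipmentIdentity("ent/warsaw-west/bldg-3/line-2/cnc-mill-05")
original_uuid = eq.uuid
eq.move("ent/warsaw-west/bldg-7/line-1/cnc-mill-05")   # equipment relocated
```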
### Authority
The hierarchy definition lives in the **central `schemas` repo** (not yet created). OtOpcUa is a **consumer** of the authoritative definition — it builds its per-site browse tree from the relevant subtree at deploy/config time. **Drift between OtOpcUa's browse paths and the `schemas` repo is a defect.**
---
## Canonical Model Integration
OtOpcUa's equipment namespace is one of **three surfaces** that expose the plan's canonical equipment / production / event model:
| Surface | Role |
|---|---|
| **OtOpcUa equipment namespace** | Canonical per-equipment OPC UA node structure. Equipment-class templates from `schemas` repo define the node layout. |
| **Redpanda topics + Protobuf schemas** | Canonical event shape on the wire. Source of truth for the model lives in the `schemas` repo. |
| **dbt curated layer in Snowflake** | Canonical analytics model — same vocabulary, different access pattern. |
### Canonical machine state vocabulary
The plan commits to a canonical set of machine state values. OtOpcUa does **not derive these states** (that's a Layer 3 responsibility — System Platform / Ignition), but OtOpcUa's equipment namespace should expose the raw signals that feed the derivation, and the System Platform namespace will expose the derived state values using this vocabulary:
| State | Semantics |
|---|---|
| `Running` | Actively producing at or near theoretical cycle time |
| `Idle` | Powered and available but not producing |
| `Faulted` | Fault raised, requires intervention |
| `Starved` | Ready but blocked by missing upstream input |
| `Blocked` | Ready but blocked by downstream constraint |
**Under consideration (TBD):** `Changeover`, `Maintenance`, `Setup` / `WarmingUp`.
State derivation lives at Layer 3 and is published as `equipment.state.transitioned` events on Redpanda. OtOpcUa's role is to deliver the raw signals cleanly so derivation can be accurate.
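The committed vocabulary and the shape of the downstream event, sketched in Python. The enum values are from the table above; the event field names are assumptions, since the wire schema will live as Protobuf in the (not yet created) `schemas` repo, and OtOpcUa itself never emits these events:

```python
# Canonical machine-state vocabulary plus an illustrative
# equipment.state.transitioned event shape (field names hypothetical).
from dataclasses import dataclass
from enum import Enum

class MachineState(Enum):
    RUNNING = "Running"    # producing at or near theoretical cycle time
    IDLE = "Idle"          # powered and available, not producing
    FAULTED = "Faulted"    # fault raised, requires intervention
    STARVED = "Starved"    # ready, blocked by missing upstream input
    BLOCKED = "Blocked"    # ready, blocked by downstream constraint
    # Under consideration (TBD): Changeover, Maintenance, Setup / WarmingUp

@dataclass(frozen=True)
class StateTransitioned:
    equipment_uuid: str    # stable UUID, for joins/lineage
    equipment_path: str    # dotted text form, for dashboards/filtering
    previous: MachineState
    current: MachineState
    timestamp_utc: str

event = StateTransitioned(
    equipment_uuid="00000000-0000-4000-8000-000000000000",
    equipment_path="ent.warsaw-west.bldg-3.line-2.cnc-mill-05",
    previous=MachineState.RUNNING,
    current=MachineState.STARVED,
    timestamp_utc="2026-04-17T13:21:25Z",
)
```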
---
## Digital Twin Touchpoints
### Use case 1 — Standardized equipment state model
OtOpcUa delivers the raw signals that feed the canonical state derivation at Layer 3. Equipment-class templates in the `schemas` repo define which raw signals each equipment class exposes, standardized across the estate.
### Use case 2 — Virtual testing / simulation
OtOpcUa's namespace architecture can accommodate a future `simulated` namespace — replaying historical equipment data to exercise tier-1/tier-2 consumers without physical equipment. **Not committed for build**, but the namespace system should be designed so adding it is a configuration change.
### Use case 3 — Cross-system canonical model
OtOpcUa's equipment namespace IS the OT-side surface of the canonical model. Every consumer reading equipment data through OtOpcUa sees the same node structure, same naming, same data types, same units — regardless of the underlying equipment's native protocol.
---
## Downstream Consumer Impact
When OtOpcUa is deployed and consumers are cut over:
- **ScadaBridge** reads equipment data from OtOpcUa's equipment namespace and System Platform data from OtOpcUa's System Platform namespace — all from the same OPC UA endpoint. Data locality preserved.
- **Ignition** consumes from each site's OtOpcUa instead of direct WAN OPC UA sessions. WAN session collapse from *N per equipment* to *one per site*.
- **Aveva System Platform IO** consumes equipment data from OtOpcUa's equipment namespace rather than direct equipment sessions. This is a meaningful shift in System Platform's IO layer and **needs validation against Aveva's supported patterns** — System Platform is the most opinionated consumer.
- **LmxOpcUa consumers** continue working — the System Platform namespace carries forward unchanged; the previous auth pattern (credentials, security modes) carries forward.
---
## Sites
### Primary data center
- **South Bend** — primary cluster
### Largest sites (one cluster per production building)
- **Warsaw West**
- **Warsaw North**
### Other integrated sites (single cluster per site)
- **Shannon**
- **Galway**
- **TMT**
- **Ponce**
### Not yet integrated (Year 2+ onboarding)
- **Berlin**
- **Winterthur**
- **Jacksonville**
- Others — list is expected to change
---
## Roadmap Summary
| Year | What happens |
|---|---|
| **Year 1 — Foundation** | Evolve LmxOpcUa into OtOpcUa (equipment namespace + clustering). Run protocol survey (Q1). Build core driver library (Q2+). Deploy to every site. Begin tier-1 cutover (ScadaBridge) at large sites. |
| **Year 2 — Scale** | Complete tier 1 (ScadaBridge) all sites. Begin tier 2 (Ignition). Build long-tail drivers on demand. |
| **Year 3 — Completion** | Complete tier 2 (Ignition). Execute tier 3 (System Platform IO) with compliance validation. Reach steady state. |
---
## Open Questions / TBDs
Collected from across the plan files — these are items the implementation work will need to resolve:
- Equipment-protocol inventory (drives core library scope) — survey not yet run
- First-cutover site selection for tier-1 (ScadaBridge)
- Per-site tier-2 rollout sequence (Ignition)
- Per-equipment-class criteria for System Platform IO re-validation (tier 3)
- Measured resource impact of co-location with System Platform and ScadaBridge
- Headroom numbers at largest sites (Warsaw campuses)
- Whether any site needs dedicated hardware
- Specific OPC UA security mode + profile combinations required vs allowed
- Where UserName credentials/certs are sourced from (local directory, per-site vault, AD/LDAP)
- Credential rotation cadence
- Audit trail of authz decisions
- Whether namespace ACL definitions live alongside driver/topology config or in their own governance surface
- Exact OPC UA namespace shape for the equipment namespace (how equipment-class templates map to browse tree structure)
- How ScadaBridge templates address equipment across multiple per-node OtOpcUa instances
- Enterprise shortname for UNS hierarchy root (currently `ent` placeholder)
- Storage format for the hierarchy in the `schemas` repo (YAML vs Protobuf vs both)
- Reconciliation rule if System Platform and Ignition derivations of the same equipment's state diverge
- Pilot equipment class for the first canonical definition
---
## Sending Corrections Back
If implementation work surfaces any of the following, send corrections back for integration into the 3-year plan:
- **Inaccuracies** — something stated here or in the plan doesn't match what the codebase or equipment actually does.
- **Missing constraints** — a real-world constraint (Aveva limitation, OPC UA spec requirement, equipment behavior) that the plan doesn't account for.
- **Architectural decisions that need revisiting** — a plan decision that turns out to be impractical, with evidence for why and a proposed alternative.
- **Resolved TBDs** — answers to any of the open questions above, discovered during implementation.
- **New TBDs** — questions the plan didn't think to ask but should have.
**Format for corrections:**
1. What the plan currently says (quote or cite file + section)
2. What you found (evidence — code, equipment behavior, Aveva docs, etc.)
3. What the plan should say instead (proposed change)
Corrections will be reviewed and folded into the authoritative plan files (`goal-state.md`, `current-state.md`, `roadmap.md`, etc.). This handoff document is a snapshot and will **not** be updated — the plan files are the living source of truth.