From 9b2acfe69989256eec0109bd0071dc58350bbd2f Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Fri, 17 Apr 2026 09:54:36 -0400
Subject: [PATCH] Add OtOpcUa implementation corrections (2026-04-17)

Capturing mismatches between the otopcua-handoff and the v2 design work
in lmxopcua/docs/v2/:

- 2 framing inaccuracies (native-OPC-UA-needs-no-driver,
  single-endpoint-per-cluster)
- 3 missing constraints (namespace ACLs not yet planned in the data path,
  schemas-repo dependencies blocking equipment-class templates, per-node
  ApplicationUri trust-pinning as a pre-cutover certificate-distribution
  step)
- 6 architectural decisions to revisit (driver list committed pre-survey,
  Tier A/B/C process-isolation model with Galaxy + FOCAS out-of-process,
  Polly v8+ resilience, 5-identifier equipment model with
  MachineCode/ZTag/SAPID alongside UUID, missing tier 1/2/3 consumer
  cutover plan, per-building cluster pattern interactions at Warsaw)
- 4 resolved TBDs (pilot class = FANUC CNC, schemas-repo format = JSON
  Schema, ACL location = central config DB co-located with topology,
  enterprise shortname still unresolved)
- 4 new TBDs (UUID-generation authority, System Platform IO Aveva-pattern
  validation as Year 1/2 research, multi-cluster site addressing at
  Warsaw, cluster-endpoint mental model)

Format follows the handoff's Sending-Corrections-Back protocol (what plan
says / what was found / what plan should say).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 handoffs/otopcua-corrections-2026-04-17.md | 375 +++++++++++++++++++++
 1 file changed, 375 insertions(+)
 create mode 100644 handoffs/otopcua-corrections-2026-04-17.md

diff --git a/handoffs/otopcua-corrections-2026-04-17.md b/handoffs/otopcua-corrections-2026-04-17.md
new file mode 100644
index 0000000..7fa0752
--- /dev/null
+++ b/handoffs/otopcua-corrections-2026-04-17.md
@@ -0,0 +1,375 @@

# OtOpcUa Implementation Corrections — 2026-04-17

**From:** OtOpcUa v2 design work in `lmxopcua/docs/v2/` (branch `v2`, commits `a1e79cd` and forward)
**Source handoff:** `handoffs/otopcua-handoff.md`
**Format:** per the handoff's "Sending Corrections Back" protocol — each item lists what the plan says, what was found during design, and what the plan should say instead.

This batch covers corrections surfaced while drafting the v2 implementation design (six docs: `plan.md`, `driver-specs.md`, `driver-stability.md`, `test-data-sources.md`, `config-db-schema.md`, `admin-ui.md`). No code changes yet — all design-phase findings.

---

## A. Inaccuracies

### A1. "Equipment that already speaks native OPC UA requires no driver build"

**Plan currently says** (`otopcua-handoff.md` §"Driver Strategy", "Equipment where no driver is needed" subsection):

> Equipment that already speaks **native OPC UA** requires no driver build — OtOpcUa simply proxies the OPC UA session.

**What was found:** OPC UA-native equipment still requires an `OpcUaClient` gateway driver instance. The driver is thin (it wraps the OPC Foundation .NETStandard SDK), but it is a real driver in our taxonomy:

- Has its own project (`Driver.OpcUaClient`)
- Has its own per-instance config (endpoint URL, security policy, browse strategy, certificate trust, etc.)
- Has its own stability concerns (subscription drift on remote-server reconnect, cascading quality, browse cache memory)
- Is generation-versioned in central config like every other driver
- Counts against the same operational surface (alerts, audit, watchdog)

The "no driver build" framing risks underestimating the work because every OPC UA-native equipment connection still needs configured tag mappings, namespace remapping (remote namespace is not UNS-compliant by default), and per-equipment routing rules.

**Plan should say instead:** "Equipment that already speaks native OPC UA does not require a *new protocol-specific driver project* — the `OpcUaClient` gateway driver in the core library handles all of them. Per-equipment configuration (endpoint, browse strategy, namespace remapping to UNS, certificate trust) is still required. Onboarding effort per OPC UA-native equipment is ~hours of config, not zero."

---

### A2. "Two namespaces through a single endpoint" — single endpoint is per-node, not per-cluster

**Plan currently says** (`otopcua-handoff.md` §"Two Namespaces"):

> OtOpcUa serves **two logical namespaces** through a single endpoint

And §"Responsibilities":

> Unified OPC UA endpoint for OT data. Clients read both raw equipment data and processed System Platform data from **one OPC UA endpoint** with two namespaces.

**What was found:** OPC UA non-transparent redundancy (the v1 LmxOpcUa pattern, inherited by v2 — see `lmxopcua/docs/v2/plan.md` decision #85) requires **each cluster node to have its own unique `ApplicationUri`** per OPC UA spec. Clients see *both* endpoints in `ServerUriArray` and choose by `ServiceLevel`. The "single endpoint" framing is true *per node* but not *per cluster*.

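
For illustration, the client-side behavior under non-transparent redundancy can be sketched generically — a consumer enumerates both nodes' endpoints and prefers the higher `ServiceLevel`. This is a hedged sketch, not OPC Foundation SDK code; the `Endpoint` type and `select_endpoint` helper are hypothetical names:

```python
# Generic sketch of ServiceLevel-driven endpoint selection under
# OPC UA non-transparent redundancy. Not SDK code; names are illustrative.
from dataclasses import dataclass


@dataclass
class Endpoint:
    url: str
    service_level: int  # 0-255; higher means the healthier/active server


def select_endpoint(endpoints):
    """Pick the endpoint advertising the highest ServiceLevel.

    In a 2-node cluster both nodes appear in ServerUriArray; the client
    connects to whichever currently reports the higher ServiceLevel and
    re-evaluates (fails over) when that value drops.
    """
    return max(endpoints, key=lambda e: e.service_level)


cluster = [
    Endpoint("opc.tcp://node-a:4840", service_level=255),
    Endpoint("opc.tcp://node-b:4840", service_level=10),
]
print(select_endpoint(cluster).url)  # node-a is currently the active server
```

The point of the sketch is that the *client* owns the failover decision — which is exactly why the "single endpoint" wording misleads.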
True transparent redundancy (`RedundancySupport.Transparent`) would require a virtual IP / load balancer in front of both nodes — not currently planned for v2 because v1 doesn't have it and adding a VIP would be substantial new infrastructure (per `lmxopcua/docs/v2/plan.md` decisions #79, #84, #85).

This is not a contradiction — both nodes serve identical address spaces, so any consumer connecting to either sees "the same" namespaces. But the plan's wording suggests a single endpoint URL exists per cluster, which is only true with VIP + transparent redundancy.

**Plan should say instead:** Either:

- (a) "OtOpcUa serves two logical namespaces through a single endpoint **per node**. In a 2-node cluster, both nodes expose identical namespaces; consumers see two endpoints in `ServerUriArray` and select by `ServiceLevel` (non-transparent redundancy)."
- OR (b) commit explicitly to deploying a VIP / load balancer in front of each cluster to achieve true transparent single-endpoint behavior — a substantial infrastructure addition.

The v2 implementation has chosen (a). If the plan intends (b), that needs to be flagged as a separate work item.

---

## B. Missing constraints

### B1. Namespace-level / equipment-subtree ACLs not yet modeled in the v2 design

**Plan currently says** (`otopcua-handoff.md` §"Authorization Model"):

> Authorization enforced via **namespace-level ACLs** — each identity scoped to permitted equipment/namespaces

**What was found:** The v2 central-config schema (`lmxopcua/docs/v2/config-db-schema.md`) has tables for clusters, nodes, namespaces, drivers, devices, equipment, tags, poll groups, credentials, generation tracking, and audit log — but **no `EquipmentAcl` or `NamespaceAcl` table for OPC UA client authorization**.

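
One possible shape for the missing table, sketched under the assumption that it follows the v2 schema's conventions (generation-versioned rows, per-cluster scoping). Table and column names here are illustrative, not the actual `config-db-schema.md` design:

```sql
-- Hypothetical per-cluster ACL table: LDAP group -> subtree -> permission.
-- Generation-versioned like the rest of the config, so ACL changes flow
-- through the same draft -> diff -> publish workflow.
CREATE TABLE EquipmentAcl (
    EquipmentAclId  INTEGER PRIMARY KEY,
    GenerationId    INTEGER NOT NULL,   -- config generation this row belongs to
    LdapGroup       TEXT    NOT NULL,   -- e.g. 'OT-Line2-Operators'
    NamespaceRef    TEXT    NOT NULL,   -- raw or processed namespace
    UnsArea         TEXT,               -- NULL = all areas in namespace
    UnsLine         TEXT,               -- NULL = all lines in area
    EquipmentRef    TEXT,               -- NULL = all equipment on line
    Permission      TEXT    NOT NULL    -- 'Read' | 'Write' | 'AlarmAck'
);
```

A NULL in `UnsArea`/`UnsLine`/`EquipmentRef` widens the grant to the whole subtree, which is one way to realize the inheritance model discussed under D3.
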
The v2 plan does cover Admin UI authorization (LDAP groups → admin roles `FleetAdmin` / `ConfigEditor` / `ReadOnly`, cluster-scoped grants — decisions #88, #105), but admin authorization governs *who edits configuration*, not *who can read/write equipment data through the OPC UA endpoint*.

The v1 LmxOpcUa LDAP layer (`Security.md`) maps LDAP groups to OPC UA permission roles (`ReadOnly`, `WriteOperate`, `WriteTune`, `WriteConfigure`, `AlarmAck`) but applies them globally — not per namespace, per equipment subtree, or per UNS Area/Line.

**Plan should say instead:** The plan should call out that the implementation needs:

- A per-cluster `EquipmentAcl` (or equivalent) table mapping LDAP-group → permitted Namespace + UnsArea/UnsLine/Equipment subtree + permission level (Read / Write / AlarmAck)
- The Admin UI must surface ACL editing
- The OPC UA NodeManager must check the ACL on every browse/read/write/subscribe against the connected user's group claims
- ACLs should be generation-versioned (changes go through publish/diff like any other config)

This is a substantial missing surface area in both the handoff and the v2 design. Suggest the plan add this as an explicit Year 1 deliverable alongside driver work, since Tier 1 (ScadaBridge) cutover will need authorization enforcement working from day one to satisfy the "access control / authorization chokepoint" responsibility.

---

### B2. Equipment-class templates' dependency on the not-yet-created `schemas` repo blocks more than canonical-model integration

**Plan currently says** (`otopcua-handoff.md` §"UNS Naming Hierarchy"):

> The hierarchy definition lives in the **central `schemas` repo (not yet created)**. OtOpcUa is a **consumer** of the authoritative definition

And §"Canonical Model Integration":

> Equipment-class templates from `schemas` repo define the node layout

**What was found:** The schemas-repo gap blocks not only template-driven node layout but also:

- Per-equipment-class auto-generated tag lists (without a template, every equipment's tags must be hand-configured)
- Per-equipment-class signal validation (which raw signals each class is *expected* to expose; missing signals = compliance gap)
- Cross-cluster consistency checks (two CNC mills in different clusters should expose the same signal vocabulary)
- The state-derivation contract at Layer 3 (handoff §13) — derivation rules need to know which raw signals each class provides

The v2 design ships `Equipment.EquipmentClassRef` as a nullable hook column (per `lmxopcua/docs/v2/plan.md` decision #112), but this only avoids retrofit cost when the schemas repo lands. Operationally, until the schemas repo exists, every equipment's tag set is hand-curated by the operator who configures it — a real cost in time and consistency.

**Plan should say instead:** The plan should make the schemas-repo dependency explicit on the OtOpcUa critical path:

- Schemas-repo creation should be a Year 1 deliverable (its own handoff doc, distinct from OtOpcUa's)
- Until it exists, OtOpcUa equipment configurations are hand-curated and prone to drift
- The Equipment Protocol Survey output should feed both (a) OtOpcUa core driver scope and (b) the initial schemas-repo equipment-class list
- A pilot equipment class (proposed: FANUC CNC, see D1 below) should land in the schemas repo before tier 1 cutover begins, to validate the template-consumer contract end-to-end

---

### B3. Per-node `ApplicationUri` uniqueness and trust-pinning is an OPC UA spec constraint with operational implications

**Plan currently says** (handoff is silent on this).

**What was found:** OPC UA clients pin trust to the server's `ApplicationUri` as part of the certificate validation chain. Once a client has trusted a server with a given `ApplicationUri`, changing it requires every client to re-establish trust (re-import certificates, etc.). This is the explicit reason `lmxopcua/docs/v2/plan.md` decision #86 commits to "auto-suggested but never auto-rewritten" — the Admin UI prefills `urn:{Host}:OtOpcUa` on node creation but warns operators if they change `Host` later, requiring explicit opt-in to update `ApplicationUri`.

This affects the cutover plan: when a Tier 1 / Tier 2 / Tier 3 consumer is being moved to OtOpcUa, the consumer's certificate trust store must include OtOpcUa's certificate (and node-specific `ApplicationUri`) before the cutover. For a 2-node cluster, that's two `ApplicationUri` entries per consumer per cluster.

For the Warsaw campuses with one cluster per building, a consumer that needs visibility across multiple buildings will trust 2N ApplicationUris (where N = building count).

**Plan should say instead:** The cutover plan should include certificate distribution as an explicit pre-cutover step. Suggest the plan note: "Tier 1/2/3 cutover prerequisite: each consumer's OPC UA certificate trust store must be populated with the target OtOpcUa cluster's per-node certificates and ApplicationUris (2 per cluster, plus the per-building multiplier at Warsaw campuses) before cutover. Consumers without pre-loaded trust will fail to connect."

---

## C. Architectural decisions to revisit

### C1. Driver list committed for v2 implementation before Equipment Protocol Survey results

**Plan currently says** (handoff §"Driver Strategy"):

> **Core library scope is driven by the equipment protocol survey** — see below and `../current-state/equipment-protocol-survey.md`. A protocol becomes "core" if it meets any of: Present at 3+ sites; Combined instance count above ~25; Needed to onboard a Year 1 or Year 2 site; Strategic vendor whose equipment is expected to grow.

And: "It has not been run yet."

**What was found:** The v2 design (`lmxopcua/docs/v2/driver-specs.md`) commits to seven specific drivers in addition to Galaxy:

1. Modbus TCP (also covers DL205 via octal address translation)
2. AB CIP (ControlLogix / CompactLogix)
3. AB Legacy (SLC 500 / MicroLogix, PCCC) — separate driver from CIP
4. S7 (Siemens S7-300/400/1200/1500)
5. TwinCAT (Beckhoff ADS, native subscriptions)
6. FOCAS (FANUC CNC)
7. OPC UA Client (gateway for OPC UA-native equipment)

This commitment was made before the survey ran. The handoff's pre-seeded categories (EQP-001 through EQP-006) confirm OPC UA-native, S7, AB, Modbus, and FANUC CNC as expected. **TwinCAT (Beckhoff ADS)** is not in the handoff's expected list, and **AB Legacy as separate from AB CIP** is a finer split than the handoff anticipated.

**Plan should say instead:** Either:

- (a) The plan should acknowledge that the OtOpcUa team has internal knowledge confirming TwinCAT and AB Legacy as committed core drivers ahead of the formal survey (e.g. known Beckhoff installations at specific sites, known SLC/MicroLogix legacy equipment).
- (b) The v2 design should defer the TwinCAT and AB Legacy commitments until the survey runs in Year 1 Q1 — implementation order becomes Modbus → AB CIP → S7 → FOCAS → (survey results) → potentially add TwinCAT, AB Legacy.

Note: from a purely technical standpoint, AB CIP and AB Legacy share the libplctag library, so the marginal cost of adding AB Legacy to the AB CIP driver work is low. But it's still a separate driver instance with its own stability tier (B), test coverage, and Admin UI surface.

**Recommendation:** option (a) — confirm the pre-commitment is justified by known site equipment, document the justification, and proceed.
Re-evaluate after the survey if any of these protocols turn out to be absent from the estate.

---

### C2. Process-isolation stability tiers (A/B/C) — substantial architectural addition not in handoff

**Plan currently says** (handoff is silent on the driver process model — it implies all drivers run in-process in the OtOpcUa server).

**What was found:** The v2 design introduces a three-tier driver stability model (`lmxopcua/docs/v2/driver-stability.md`, decisions #63–67):

- **Tier A (pure managed)**: Modbus, OPC UA Client. Run in-process.
- **Tier B (wrapped native, mature)**: S7, AB CIP, AB Legacy, TwinCAT. Run in-process with extra guards (SafeHandle wrappers, memory watchdog, bounded queues).
- **Tier C (heavy native / COM / black-box vendor DLL)**: Galaxy, FOCAS. Run **out-of-process** as separate Windows services with named-pipe IPC.

The Tier C decision means any deployment using Galaxy or FOCAS runs at least one *additional* Windows service per cluster node beyond the OtOpcUa main service. For sites with both Galaxy and FOCAS, that's three services per node. With 2-node clusters, six services per cluster.

**Reason for the decision:** an `AccessViolationException` from native code (e.g. FANUC's `Fwlib64.dll`) is uncatchable in modern .NET — it would tear down the entire OtOpcUa server, all sessions, all other drivers. Process isolation contains the blast radius. Also, Galaxy stays .NET 4.8 x86 due to MXAccess COM bitness constraints, which alone forces out-of-process — the Tier C model generalizes the pattern.

**Plan should say instead:** The plan should incorporate the tier model as an architectural decision the OtOpcUa work surfaced:

- Galaxy is out-of-process for two reasons: bitness AND stability isolation
- FOCAS is out-of-process for stability isolation (Fwlib uncatchable AVs)
- Both follow a generalized `Proxy/Host/Shared` three-project pattern reusable for any future Tier C driver
- Operational footprint: 1 to 3 Windows services per cluster node, depending on which drivers are configured
- Deployment guides must cover the multi-service install/upgrade/recycle workflow

**Operational implication:** the handoff §"Deployment Footprint" mentions co-location with System Platform / ScadaBridge and a modest workload. With multiple services per node and process recycling for Tier C drivers, the operational picture is more complex than implied. Worth a footprint reassessment after the first cluster is deployed.

---

### C3. Polly v8+ resilience model — runtime dependency added beyond handoff scope

**Plan currently says** (handoff is silent on resilience strategy).

**What was found:** The v2 design (`lmxopcua/docs/v2/plan.md` decisions #34–36, #44–45) standardizes on **Polly v8+** (`Microsoft.Extensions.Resilience`) for retry, circuit-breaker, and timeout pipelines per driver and per device. It adds:

- A composable resilience pipeline per driver instance and per device within a driver
- Circuit-breaker state surfacing in the status dashboard
- A per-device, per-tag write-retry policy with explicit `WriteIdempotent` opt-in (never-retry by default)

The write-retry safety policy (decision #44) is a real-world constraint not in the handoff: timeouts on writes can fire after the device has already accepted the command, so blind retry of non-idempotent operations (pulses, alarm acks, recipe steps) would cause duplicate field actions. The default-to-never-retry policy with explicit opt-in is a substantive safety decision.

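
The default-never-retry policy can be sketched as follows. This is a generic illustration of the decision's logic, not the v2 API — `TagConfig` and `write_with_policy` are hypothetical names, and the real implementation uses Polly pipelines rather than a hand-rolled loop:

```python
# Illustrative sketch of the default-no-retry write policy (decision #44).
# A timed-out write may already have been accepted by the device, so a
# retry is only safe when the tag has explicitly opted in as idempotent.
from dataclasses import dataclass


@dataclass
class TagConfig:
    name: str
    write_idempotent: bool = False  # explicit opt-in; default is never-retry


def write_with_policy(tag, do_write, max_attempts=3):
    """Attempt a write; retry on timeout only if the tag opted in."""
    attempts = max_attempts if tag.write_idempotent else 1
    last_error = None
    for _ in range(attempts):
        try:
            return do_write()
        except TimeoutError as exc:
            last_error = exc  # device may or may not have acted; record and decide
    raise last_error


# A setpoint write is idempotent (writing the same value twice is harmless),
# so opting in is reasonable; a start pulse is not, so it keeps the default.
setpoint = TagConfig("Line2/Setpoint", write_idempotent=True)
pulse = TagConfig("Line2/StartPulse")
```

The asymmetry is the point: a repeated setpoint write converges to the same state, while a repeated pulse or alarm ack is a second field action.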
**Plan should say instead:** The plan should note that the implementation has standardized on Polly v8+ for per-device resilience, with a default-no-retry policy on writes that requires explicit per-tag opt-in for idempotency. This affects operator runbooks and onboarding training — operators configuring tags need to understand the `WriteIdempotent` flag's semantics.

---

### C4. Multi-identifier equipment model (5 identifiers) added beyond handoff's UUID-only spec

**Plan currently says** (handoff §"UNS Naming Hierarchy"):

> **Stable equipment UUID** — Every equipment node must expose a **stable UUIDv4** as a property: UUID is assigned once, never changes, never reused. Path can change (equipment moves, area renamed); UUID cannot. Canonical events downstream carry both UUID (for joins/lineage) and path (for dashboards/filtering).

**What was found:** Production usage requires more than UUID + path. The v2 design (`lmxopcua/docs/v2/plan.md` decisions #116–119) adds three operationally required identifiers:

- **MachineCode** — operator-facing colloquial name (e.g. `machine_001`). Required. Within-cluster uniqueness. This is what operators actually say in conversation and write in runbooks; UUID and path are not mnemonic enough for daily operations.
- **ZTag** — ERP equipment identifier. Optional. Fleet-wide uniqueness. Primary identifier for browsing in the Admin UI, per operational request.
- **SAPID** — SAP PM equipment identifier. Optional. Fleet-wide uniqueness. Required for the maintenance-system join.

All five identifiers (UUID, EquipmentId, MachineCode, ZTag, SAPID) are exposed as OPC UA properties on the equipment node so external systems can resolve by their preferred identifier without a sidecar service.

**Plan should say instead:** The handoff's UUID-only spec satisfies cross-system join requirements but underspecifies operator and external-system identifier needs.
The plan should incorporate the multi-identifier model:

- Permanent: `EquipmentUuid` (UUIDv4, immutable, downstream events / canonical-model joins)
- Operational: `MachineCode` (operator colloquial, required)
- ERP integration: `ZTag` (optional, ERP-allocated, fleet-wide unique)
- SAP PM integration: `SAPID` (optional, SAP-allocated, fleet-wide unique)
- Internal config key: `EquipmentId` (immutable after publish, internal logical key for cross-generation diffs)

**Resolved TBD:** This also resolves the implicit question "how do operators find equipment in dashboards / on the radio" — the answer is MachineCode, surfaced prominently in the Admin UI alongside ZTag.

---

### C5. Consumer cutover plan (Tier 1 / 2 / 3) is not addressed in v2 design docs

**Plan currently says** (handoff §"Rollout Posture", §"Roadmap Summary"):

> Tier 1 ScadaBridge cutover begins Year 1. Tier 2 Ignition begins Year 2. Tier 3 System Platform IO Year 3.

**What was found:** The v2 design docs (`lmxopcua/docs/v2/plan.md` Phases 0–5) cover building the OtOpcUa server, drivers, configuration, and Admin UI — but **do not address consumer cutover at all**.
There is no plan for:

- ScadaBridge migration sequencing per equipment / per site
- Per-site cutover validation (proving consumers see equivalent data through OtOpcUa as they did via direct sessions)
- Rollback procedures if a cutover causes a consumer regression
- Coordination with Aveva on System Platform IO cutover (Tier 3 — the most opinionated consumer per the handoff)
- Operational runbooks for what to do when a consumer can't connect / fails over

**Plan should say instead:** The v2 design should add cutover phases:

- **Phase 6 — Tier 1 (ScadaBridge) cutover** — per-site sequencing, validation methodology, rollback procedure
- **Phase 7 — Tier 2 (Ignition) cutover** — WAN-session collapse validation
- **Phase 8 — Tier 3 (System Platform IO) cutover** — Aveva validation prerequisite, compliance review

Or alternatively, the cutover plan lives outside the OtOpcUa v2 design and is owned by an integration / operations team — in which case the handoff should make that ownership explicit.

**Recommendation:** the cutover plan needs an owner and a doc. Either the OtOpcUa team owns it (add Phases 6–8 to the v2 plan) or another team owns it (the handoff should name them and link the doc).

---

### C6. Per-building cluster pattern (Warsaw) — UNS path interaction needs clarification

**Plan currently says** (handoff §"Deployment Footprint"):

> Largest sites (Warsaw West, Warsaw North) run **one cluster per production building**

And §"UNS Naming Hierarchy":

> Level 3 — Area — `bldg-3`

**What was found:** The v2 design models clusters with `Enterprise` + `Site` columns (`lmxopcua/docs/v2/config-db-schema.md`, ServerCluster table). For Warsaw with per-building clusters, the cleanest mapping is:

- Cluster A: Site = `warsaw-west`, equipment in this cluster all under UnsArea = `bldg-3`
- Cluster B: Site = `warsaw-west`, equipment in this cluster all under UnsArea = `bldg-4`
- ...etc.

This works structurally, but raises questions:

1. **ZTag fleet-wide uniqueness**: if Warsaw West building 3 and Warsaw West building 4 are separate clusters but the same physical site, can the same ZTag appear in both? Per the v2 design, no — ZTag is fleet-wide unique. Operationally that should be true (ERP doesn't reuse identifiers across buildings), but worth confirming.
2. **Cross-cluster consumer consistency**: a ScadaBridge instance reading equipment data for the whole Warsaw West site would need to connect to N clusters (one per building) and stitch the data together. The handoff's "site-local aggregation" responsibility (§"Responsibilities") is satisfied per-cluster but not site-wide.
3. **UNS path uniqueness**: with one cluster per building, the equipment paths `ent/warsaw-west/bldg-3/line-2/cnc-mill-05` and `ent/warsaw-west/bldg-4/line-2/cnc-mill-05` are unique by the Area segment. But two clusters serve overlapping path prefixes (`ent/warsaw-west/...`) — clients connecting to building 3's cluster can't see building 4's equipment. That's correct by intent, but downstream consumers expecting site-wide visibility need to know they connect per-cluster.

**Plan should say instead:** The handoff should clarify:

- Whether ZTag fleet-wide uniqueness extends across per-building clusters at the same site (assumed yes; confirm with the ERP team)
- That consumers needing site-wide visibility at Warsaw campuses will connect to multiple clusters, not one
- Whether the per-building cluster decision is a constraint to optimize for or a cost to minimize (e.g. would a single Warsaw West cluster with 4 buildings' worth of equipment be feasible if hardware allowed?)

---

## D. Resolved TBDs

### D1. Pilot equipment class for first canonical definition

**Plan TBD:** "Pilot equipment class for the first canonical definition" (handoff §"Open Questions / TBDs").

**Proposal:** **FANUC CNC** is the natural pilot.
The v2 design (`lmxopcua/docs/v2/driver-specs.md` §7 "FANUC FOCAS Driver") already specifies a **fixed pre-defined node hierarchy** for FOCAS — Identity, Status, Axes, Spindle, Program, Tool, Alarms, PMC, Macro categories — populated by specific FOCAS2 API calls. This is essentially a class template already; the schemas repo just needs to formalize it.

**Why FANUC CNC over alternatives:**

- Pre-defined hierarchy already exists in the driver design — no greenfield template work
- Single vendor with a well-known API surface (FOCAS2 spec) — low ambiguity
- Equipment count is finite (CNCs per site are countable, not in the hundreds like generic Modbus instances)
- Tier C out-of-process driver design — the failure-mode boundary is clean for piloting class-template-driven config

**Why not Modbus or AB CIP first:** these have no inherent class — every device has a hand-curated tag list. Picking them for the pilot would mean inventing a class taxonomy from scratch, which is exactly the schemas-repo team's job, not the OtOpcUa team's.

---

### D2. Storage format for the hierarchy in the `schemas` repo

**Plan TBD:** "Storage format for the hierarchy in the `schemas` repo (YAML vs Protobuf vs both)" (handoff §"Open Questions / TBDs").

**Proposal:** **JSON Schema (.json files, version-controlled in the schemas repo)** as the authoritative format, with Protobuf code generation for runtime serialization where needed (e.g. Redpanda event schemas).

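
For illustration only, a fragment of what a FANUC CNC class template might look like in this format. The category names come from the FOCAS driver spec above; the schema structure, `$id`, and property shapes are assumptions, not the schemas repo's actual contract:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "equipment-classes/fanuc-cnc.schema.json",
  "title": "FANUC CNC equipment class (illustrative sketch)",
  "type": "object",
  "required": ["Identity", "Status", "Axes", "Spindle", "Alarms"],
  "properties": {
    "Identity": { "type": "object" },
    "Status":   { "type": "object" },
    "Axes":     { "type": "array", "minItems": 1 },
    "Spindle":  { "type": "object" },
    "Program":  { "type": "object" },
    "Tool":     { "type": "object" },
    "Alarms":   { "type": "array" },
    "PMC":      { "type": "object" },
    "Macro":    { "type": "object" }
  }
}
```

A template in this shape is something both the schemas-repo CI and OtOpcUa's config validation could check an equipment's tag set against.
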
**Rationale:**

- Idiomatic for .NET (System.Text.Json and System.Text.Json.Schema in .NET 10) — OtOpcUa reads templates with no extra dependencies
- Idiomatic for CI tooling (every CI runner can `jq` / validate JSON Schema without an extra toolchain)
- Supports validation at multiple layers (operator-visible Admin UI errors, schemas-repo CI gates, OtOpcUa runtime validation)
- Protobuf is better for *wire* serialization (size, speed, generated code) but worse for *authoring* (binary, requires a .proto compiler, poor merge story in git). For a repo where humans author equipment-class definitions, JSON Schema wins.
- If Protobuf is needed downstream (Redpanda events with Protobuf-defined schemas), it can be code-generated from the JSON Schema authoring source. One-way derivation is simpler than bidirectional sync.

**Where it lives in OtOpcUa:** at startup, OtOpcUa nodes fetch the relevant equipment-class templates from the schemas repo (cached locally) and use them to validate `Equipment.EquipmentClassRef` against actual tag mappings. Drift between OtOpcUa's tag set and the schema is surfaced as a config validation error.

---

### D3. Where namespace ACL definitions live

**Plan TBD:** "Whether namespace ACL definitions live alongside driver/topology config or in their own governance surface" (handoff §"Open Questions / TBDs").

**Proposal:** **Live in the central config DB alongside DriverInstance/Equipment/Tag**, in a new `EquipmentAcl` table per cluster. Operationally co-located with the rest of the cluster's config; managed through the same Admin UI; edited through the same draft → diff → publish flow; generation-versioned for the same auditability and rollback safety.

**Rationale:**

- A separate governance surface means a separate auth layer, separate UI, separate audit log, separate rollback workflow — a meaningful operational tax for unclear benefit.
- Co-locating means one publish includes both topology changes and ACL changes atomically, so a new equipment is added AND its ACL is granted in one transaction. This avoids the "equipment exists for 30 seconds with default-deny" race window that a separate-governance-surface model would create.
- Per `lmxopcua/docs/v2/plan.md` decision #105, cluster-scoped admin grants already live in the central DB — extending the same model to data-path ACLs is the consistent choice.

**Open sub-question:** ACL granularity — per Equipment? per UnsLine? per UnsArea? per Namespace? Suggest supporting all four levels with inheritance (a grant at UnsArea cascades to all UnsLines + Equipment beneath, unless overridden).

This proposal also addresses B1 (the missing ACL constraint) — they're the same gap, viewed from two directions.

---

### D4. Enterprise shortname for UNS hierarchy root

**Plan TBD:** "Enterprise shortname for UNS hierarchy root (currently `ent` placeholder)" (handoff §"Open Questions / TBDs").

**Status:** **Not resolved.** OtOpcUa work cannot determine this — it's an organizational naming choice. Flagging because the v2 design currently uses `ent` per the handoff's placeholder. Suggest the schemas-repo team or the enterprise-naming authority resolve this before any production deployment, since changing the Enterprise segment after equipment is published would require a generation-wide path rewrite (UUIDs preserved, but every consumer's path-based query needs to learn the new prefix).

**Recommendation:** resolve before tier 1 (ScadaBridge) cutover at the first site.

---

## E. New TBDs

Questions the plan didn't think to ask but should have, surfaced during v2 design.

### E1. EquipmentUuid generation authority — OtOpcUa or external system?

**Question:** Who allocates `EquipmentUuid`? The v2 design (`lmxopcua/docs/v2/admin-ui.md`) auto-generates a UUIDv4 in the Admin UI on equipment creation.
But if equipment is also tracked in ERP (ZTag) or SAP PM (SAPID), those systems may have their own equipment identifiers and may want to be the authoritative UUID source.

**Implication:** if ERP/SAP becomes the UUID authority, the OtOpcUa Admin UI shifts from "generate UUID on create" to "look up UUID from external system; reject creation if equipment not yet in ERP". That's a different operational flow.

**Suggested resolution:** OtOpcUa generates UUIDs by default (current v2 design); the plan should note that if ERP/SAP take authoritative ownership of equipment registration in the future, the UUID-generation policy is configurable per cluster.

---

### E2. Tier 3 (System Platform IO) cutover — Aveva-supported pattern verification

**Question:** Does Aveva System Platform IO support consuming equipment data from another OPC UA server (OtOpcUa) instead of from equipment directly? The handoff §"Downstream Consumer Impact" says this "needs validation against Aveva's supported patterns — System Platform is the most opinionated consumer."

**Recommendation:** the plan should commit to a research deliverable in Year 1 or Year 2 (well ahead of the Year 3 tier 3 cutover): "Validate with Aveva that System Platform IO drivers support upstream OPC UA-server data sources, including any restrictions on security mode, namespace structure, or session model." If Aveva's pattern requires something OtOpcUa doesn't expose, that's a long-lead-time discovery.

---

### E3. Site-wide vs per-cluster consumer addressing at multi-cluster sites

**Question:** At Warsaw West / Warsaw North with per-building clusters, how do consumers that need site-wide equipment visibility address the equipment? The current v2 design gives each cluster its own endpoints + namespace; a site-wide consumer must connect to N clusters.

**Possible resolutions** (out of scope for the v2 design but should be on someone's plate):

- (a) Configure consumer-side templates to enumerate all per-building clusters and stitch — the currently expected pattern, but it adds consumer-side complexity
- (b) Build a site-aggregator OtOpcUa instance that consumes from per-building clusters via the OPC UA Client gateway driver and re-exposes a unified site namespace — second-order aggregation, doable with our existing toolset but operationally complex
- (c) Reconsider per-building clustering — if hardware allows, a single site cluster is simpler

**Recommendation:** flag as a Year 1 design discussion before per-building clusters are committed at Warsaw. If (b) becomes the answer, OtOpcUa's architecture already supports it (the OpcUaClient driver was designed for exactly this gateway-of-gateways scenario).

---

### E4. Cluster-as-single-endpoint vs per-node-endpoints — clarify handoff mental model

(Cross-references A2.) Even if we agree non-transparent redundancy is the right v2 model, the handoff's wording ("single endpoint", "unified OPC UA endpoint") implies a single URL. Worth deciding whether to:

- Update the handoff wording to acknowledge two endpoints per cluster (operationally accurate per the OPC UA non-transparent spec), OR
- Add an explicit roadmap line item to introduce a VIP/load-balancer in front of each cluster for transparent redundancy (substantial new infrastructure)

**Recommendation:** update the handoff wording. Transparent redundancy is a meaningful infrastructure investment for marginal client-side simplification (ScadaBridge / Ignition / System Platform IO are all sophisticated OPC UA clients that can handle ServerUriArray + ServiceLevel-driven failover).

---

## Summary

This batch:

| Category | Count | Notes |
|----------|------:|-------|
| A. Inaccuracies | 2 | Both wording/framing issues; no architectural conflict |
| B. Missing constraints | 3 | ACLs (substantial gap), schemas-repo dependencies, certificate trust pre-cutover step |
| C. Architectural decisions to revisit | 6 | Driver list pre-survey, stability tiers, Polly resilience, multi-identifier model, missing cutover plan, per-building cluster interactions |
| D. Resolved TBDs | 4 | Pilot class (FANUC CNC), schemas-repo format (JSON Schema), ACL location (central config DB), enterprise shortname (still unresolved — flagged) |
| E. New TBDs | 4 | UUID-generation authority, System Platform IO Aveva validation, multi-cluster site addressing, cluster-endpoint mental model |

**Most urgent for plan integration:** B1 (missing ACL surface) and C5 (missing consumer cutover plan) — both are large work items the v2 implementation design discovered are needed but doesn't currently own. They should be assigned (to the OtOpcUa team or another team) before Year 1 tier 1 cutover begins.

**Reference:** all v2 design decisions are in `lmxopcua/docs/v2/plan.md` (decision log §). Specific cross-references in this doc cite decision numbers (#XX) where applicable.