3yearplan/handoffs/otopcua-corrections-2026-04-17.md

OtOpcUa Implementation Corrections — 2026-04-17

From: OtOpcUa v2 design work in lmxopcua/docs/v2/ (branch v2, commits a1e79cd and forward)
Source handoff: handoffs/otopcua-handoff.md
Format: per the handoff's "Sending Corrections Back" protocol — each item lists what the plan says, what was found during design, and what the plan should say instead.

This batch covers corrections surfaced while drafting the v2 implementation design (six docs: plan.md, driver-specs.md, driver-stability.md, test-data-sources.md, config-db-schema.md, admin-ui.md). No code changes yet — all design-phase findings.


A. Inaccuracies

A1. "Equipment that already speaks native OPC UA requires no driver build"

Plan currently says (otopcua-handoff.md §"Driver Strategy", "Equipment where no driver is needed" subsection):

Equipment that already speaks native OPC UA requires no driver build — OtOpcUa simply proxies the OPC UA session.

What was found: OPC UA-native equipment still requires an OpcUaClient gateway driver instance. The driver is thin (it wraps the OPC Foundation .NETStandard SDK), but it is a real driver in our taxonomy:

  • Has its own project (Driver.OpcUaClient)
  • Has its own per-instance config (endpoint URL, security policy, browse strategy, certificate trust, etc.)
  • Has its own stability concerns (subscription drift on remote-server reconnect, cascading quality, browse cache memory)
  • Is generation-versioned in central config like every other driver
  • Counts against the same operational surface (alerts, audit, watchdog)

The "no driver build" framing risks underestimating the work because every OPC UA-native equipment connection still needs configured tag mappings, namespace remapping (remote namespace is not UNS-compliant by default), and per-equipment routing rules.

Plan should say instead: "Equipment that already speaks native OPC UA does not require a new protocol-specific driver project — the OpcUaClient gateway driver in the core library handles all of them. Per-equipment configuration (endpoint, browse strategy, namespace remapping to UNS, certificate trust) is still required. Onboarding effort per OPC UA-native equipment is ~hours of config, not zero."
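To make the residual per-equipment work concrete, here is a minimal sketch of the namespace-remapping step every OPC UA-native connection still needs (the remote server's default namespace is not UNS-compliant). The function name, mapping shape, and example paths are illustrative assumptions, not the v2 API:

```python
# Sketch: remap a remote OPC UA-native server's browse path onto the
# UNS-compliant path OtOpcUa exposes. remap_to_uns and the prefix-map
# shape are assumptions for illustration, not the real driver config.

def remap_to_uns(remote_path: str, prefix_map: dict[str, str]) -> str:
    """Rewrite the longest matching remote prefix to its UNS prefix."""
    for remote_prefix in sorted(prefix_map, key=len, reverse=True):
        if remote_path.startswith(remote_prefix):
            return prefix_map[remote_prefix] + remote_path[len(remote_prefix):]
    raise KeyError(f"no UNS mapping for {remote_path!r}")

# Per-equipment config would supply the map, e.g. the vendor's default
# layout placed under the equipment's UNS position (paths hypothetical):
PREFIX_MAP = {
    "Objects/Machine/Axes": "ent/warsaw-west/bldg-3/line-2/cnc-mill-05/raw/axes",
}
```

Even this toy version shows why onboarding is "~hours of config, not zero": someone has to author the prefix map, trust the certificate, and pick a browse strategy per equipment.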


A2. "Two namespaces through a single endpoint" — single endpoint is per-node, not per-cluster

Plan currently says (otopcua-handoff.md §"Two Namespaces"):

OtOpcUa serves two logical namespaces through a single endpoint

And §"Responsibilities":

Unified OPC UA endpoint for OT data. Clients read both raw equipment data and processed System Platform data from one OPC UA endpoint with two namespaces.

What was found: OPC UA non-transparent redundancy (the v1 LmxOpcUa pattern, inherited by v2 — see lmxopcua/docs/v2/plan.md decision #85) requires each cluster node to have its own unique ApplicationUri per OPC UA spec. Clients see both endpoints in ServerUriArray and choose by ServiceLevel. The "single endpoint" framing is true per node but not per cluster.

True transparent redundancy (RedundancySupport.Transparent) would require a virtual IP / load balancer in front of both nodes — not currently planned for v2 because v1 doesn't have it and adding a VIP would be substantial new infrastructure (per lmxopcua/docs/v2/plan.md decisions #79, #84, #85).

This is not a contradiction — both nodes serve identical address spaces, so any consumer connecting to either sees "the same" namespaces. But the plan's wording suggests a single endpoint URL exists per cluster, which is only true with VIP + transparent redundancy.

Plan should say instead: Either:

  • (a) "OtOpcUa serves two logical namespaces through a single endpoint per node. In a 2-node cluster, both nodes expose identical namespaces; consumers see two endpoints in ServerUriArray and select by ServiceLevel (non-transparent redundancy)."
  • OR (b) commit explicitly to deploying a VIP / load balancer in front of each cluster to achieve true transparent single-endpoint behavior — substantial infrastructure addition.

The v2 implementation has chosen (a). If the plan intends (b), that needs to be flagged as a separate work item.
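A minimal sketch of what option (a) asks of consumers: both nodes appear in ServerUriArray, and the client selects the endpoint reporting the highest ServiceLevel (an OPC UA status byte, 0–255). The names and discovery mechanics here are simplified assumptions; real clients do this through their OPC UA SDK:

```python
# Sketch of client-side endpoint selection under non-transparent
# redundancy. The dict maps ApplicationUri -> last-read ServiceLevel.

def pick_endpoint(endpoints: dict[str, int]) -> str:
    """Choose the endpoint with the highest ServiceLevel (0..255)."""
    if not endpoints:
        raise ValueError("no endpoints discovered")
    return max(endpoints, key=endpoints.get)

cluster = {
    "urn:node-a:OtOpcUa": 255,  # healthy active node
    "urn:node-b:OtOpcUa": 10,   # degraded / standby node
}
```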


B. Missing constraints

B1. Namespace-level / equipment-subtree ACLs not yet modeled in the v2 design

Plan currently says (otopcua-handoff.md §"Authorization Model"):

Authorization enforced via namespace-level ACLs — each identity scoped to permitted equipment/namespaces

What was found: The v2 central-config schema (lmxopcua/docs/v2/config-db-schema.md) has tables for clusters, nodes, namespaces, drivers, devices, equipment, tags, poll groups, credentials, generation tracking, and audit log — but no EquipmentAcl or NamespaceAcl table for OPC UA client authorization.

The v2 plan does cover Admin UI authorization (LDAP groups → admin roles FleetAdmin / ConfigEditor / ReadOnly, cluster-scoped grants — decisions #88, #105) but admin authorization governs who edits configuration, not who can read/write equipment data through the OPC UA endpoint.

The v1 LmxOpcUa LDAP layer (Security.md) maps LDAP groups to OPC UA permission roles (ReadOnly, WriteOperate, WriteTune, WriteConfigure, AlarmAck) but applies them globally — not per namespace, per equipment subtree, or per UNS Area/Line.

Plan should say instead: The plan should call out that the implementation needs:

  • A per-cluster EquipmentAcl (or equivalent) table mapping LDAP-group → permitted Namespace + UnsArea/UnsLine/Equipment subtree + permission level (Read / Write / AlarmAck)
  • The Admin UI must surface ACL editing
  • The OPC UA NodeManager must check the ACL on every browse/read/write/subscribe against the connected user's group claims
  • ACLs should be generation-versioned (changes go through publish/diff like any other config)

This is a substantial missing surface area in both the handoff and the v2 design. Suggest the plan add this as an explicit Year 1 deliverable alongside driver work, since Tier 1 (ScadaBridge) cutover will need authorization enforcement working from day one to satisfy "access control / authorization chokepoint" responsibility.
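A minimal sketch of the per-request check the NodeManager would perform, assuming an EquipmentAcl row shape of LDAP group → UNS subtree → permission set; all names and shapes here are hypothetical, since the table does not exist yet:

```python
# Sketch of ACL enforcement on every browse/read/write/subscribe.
# The ACL rows, group names, and paths are illustrative assumptions.

ACL = [
    # (ldap_group, uns_subtree_prefix, permissions)
    ("ot-warsaw-operators", "ent/warsaw-west/bldg-3", {"Read"}),
    ("ot-warsaw-engineers", "ent/warsaw-west/bldg-3/line-2", {"Read", "Write", "AlarmAck"}),
]

def is_allowed(user_groups: set[str], node_path: str, needed: str) -> bool:
    """True if any ACL row grants `needed` on a subtree containing node_path."""
    return any(
        group in user_groups
        and node_path.startswith(subtree)
        and needed in perms
        for group, subtree, perms in ACL
    )
```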


B2. Equipment-class templates dependency on the not-yet-created schemas repo blocks more than canonical-model integration

Plan currently says (otopcua-handoff.md §"UNS Naming Hierarchy"):

The hierarchy definition lives in the central schemas repo (not yet created). OtOpcUa is a consumer of the authoritative definition

And §"Canonical Model Integration":

Equipment-class templates from schemas repo define the node layout

What was found: The schemas-repo gap blocks not only template-driven node layout but also:

  • Per-equipment-class auto-generated tag lists (without a template, every equipment's tags must be hand-configured)
  • Per-equipment-class signal validation (which raw signals each class is expected to expose; missing signals = compliance gap)
  • Cross-cluster consistency checks (two CNC mills in different clusters should expose the same signal vocabulary)
  • The state-derivation contract at Layer 3 (handoff §13) — derivation rules need to know which raw signals each class provides

The v2 design ships Equipment.EquipmentClassRef as a nullable hook column (per lmxopcua/docs/v2/plan.md decision #112) but this only avoids retrofit cost when the schemas repo lands. Operationally, until the schemas repo exists, every equipment's tag set is hand-curated by the operator who configures it — a real cost in time and consistency.

Plan should say instead: The plan should make the schemas-repo dependency explicit on the OtOpcUa critical path:

  • Schemas repo creation should be a Year 1 deliverable (its own handoff doc, distinct from OtOpcUa's)
  • Until it exists, OtOpcUa equipment configurations are hand-curated and prone to drift
  • The Equipment Protocol Survey output should feed both: (a) OtOpcUa core driver scope, and (b) the initial schemas repo equipment-class list
  • A pilot equipment class (proposed: FANUC CNC, see D1 below) should land in the schemas repo before tier 1 cutover begins, to validate the template-consumer contract end-to-end
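To illustrate what the missing templates would unblock, here is a sketch of the per-class compliance check described above (expected signals vs. configured tags). The template shape is a guess at what the schemas repo would eventually hold:

```python
# Sketch: detect compliance gaps between an equipment-class template
# and an equipment's hand-configured tag set. Template shape is an
# assumption; the schemas repo does not exist yet.

CNC_MILL_TEMPLATE = {
    "class": "cnc-mill",
    "required_signals": ["spindle_speed", "spindle_load",
                         "program_number", "alarm_active"],
}

def compliance_gaps(template: dict, configured_tags: set[str]) -> set[str]:
    """Signals the class expects but this equipment's config doesn't map."""
    return set(template["required_signals"]) - configured_tags
```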

B3. Per-node ApplicationUri uniqueness and trust-pinning is an OPC UA spec constraint with operational implications

Plan currently says (handoff is silent on this).

What was found: OPC UA clients pin trust to the server's ApplicationUri as part of the certificate validation chain. Once a client has trusted a server with a given ApplicationUri, changing it requires every client to re-establish trust (re-import certificates, etc.). This is the explicit reason lmxopcua/docs/v2/plan.md decision #86 commits to "auto-suggested but never auto-rewritten" — the Admin UI prefills urn:{Host}:OtOpcUa on node creation but warns operators if they change Host later, requiring explicit opt-in to update ApplicationUri.

This affects the cutover plan: when a Tier 1 / Tier 2 / Tier 3 consumer is being moved to OtOpcUa, the consumer's certificate trust store must include OtOpcUa's certificate (and node-specific ApplicationUri) before the cutover. For a 2-node cluster, that's two ApplicationUri entries per consumer per cluster.

For the Warsaw campuses with one cluster per building, a consumer that needs visibility across multiple buildings will trust 2N ApplicationUris (where N = building count).

Plan should say instead: The cutover plan should include certificate-distribution as an explicit pre-cutover step. Suggest the plan note: "Tier 1/2/3 cutover prerequisite: each consumer's OPC UA certificate trust store must be populated with the target OtOpcUa cluster's per-node certificates and ApplicationUris (2 per cluster, plus per-building multiplier at Warsaw campuses) before cutover. Consumers without pre-loaded trust will fail to connect."


C. Architectural decisions to revisit

C1. Driver list committed for v2 implementation before Equipment Protocol Survey results

Plan currently says (handoff §"Driver Strategy"):

Core library scope is driven by the equipment protocol survey — see below and ../current-state/equipment-protocol-survey.md. A protocol becomes "core" if it meets any of: Present at 3+ sites; Combined instance count above ~25; Needed to onboard a Year 1 or Year 2 site; Strategic vendor whose equipment is expected to grow.

And: "It has not been run yet."

What was found: The v2 design (lmxopcua/docs/v2/driver-specs.md) commits to seven specific drivers in addition to Galaxy:

  1. Modbus TCP (also covers DL205 via octal address translation)
  2. AB CIP (ControlLogix / CompactLogix)
  3. AB Legacy (SLC 500 / MicroLogix, PCCC) — separate driver from CIP
  4. S7 (Siemens S7-300/400/1200/1500)
  5. TwinCAT (Beckhoff ADS, native subscriptions)
  6. FOCAS (FANUC CNC)
  7. OPC UA Client (gateway for OPC UA-native equipment)

This commitment was made before the survey ran. The handoff's pre-seeded categories (EQP-001 through EQP-006) confirm OPC UA-native, S7, AB, Modbus, FANUC CNC as expected. TwinCAT (Beckhoff ADS) is not in the handoff's expected list, and AB Legacy as separate from AB CIP is a finer split than the handoff anticipated.

Plan should say instead: Either:

  • (a) The plan should acknowledge that the OtOpcUa team has internal knowledge confirming TwinCAT and AB Legacy as committed core drivers ahead of the formal survey (e.g. known Beckhoff installations at specific sites, known SLC/MicroLogix legacy equipment).
  • (b) The v2 design should defer the TwinCAT and AB Legacy commitments until the survey runs in Year 1 Q1 — implementation order becomes Modbus → AB CIP → S7 → FOCAS → (survey results) → potentially add TwinCAT, AB Legacy.

Note: from a purely technical standpoint, AB CIP and AB Legacy share the libplctag library, so the marginal cost of adding AB Legacy to the AB CIP driver work is low. But it's still a separate driver instance with its own stability tier (B), test coverage, and Admin UI surface.

Recommendation: option (a) — confirm the pre-commitment is justified by known site equipment, document the justification, and proceed. Re-evaluate after the survey if any of these protocols turn out to be absent from the estate.
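For the re-evaluation after the survey, the handoff's core-protocol criteria can be made mechanical. A sketch, with field names invented here (the survey doc defines the real columns):

```python
# The handoff's "core" rule as a predicate over one survey row:
# core if ANY of the four criteria holds. Field names are assumptions.

def is_core(row: dict) -> bool:
    return (
        row["site_count"] >= 3            # present at 3+ sites
        or row["instance_count"] > 25     # combined instances above ~25
        or row["needed_year_1_or_2"]      # needed for a Year 1/2 site
        or row["strategic_vendor"]        # vendor expected to grow
    )
```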


C2. Process-isolation stability tiers (A/B/C) — substantial architectural addition not in handoff

Plan currently says (handoff is silent on driver process model — implies all drivers run in-process in the OtOpcUa server).

What was found: The v2 design introduces a three-tier driver stability model (lmxopcua/docs/v2/driver-stability.md, decisions #63–#67):

  • Tier A (pure managed): Modbus, OPC UA Client. Run in-process.
  • Tier B (wrapped native, mature): S7, AB CIP, AB Legacy, TwinCAT. Run in-process with extra guards (SafeHandle wrappers, memory watchdog, bounded queues).
  • Tier C (heavy native / COM / black-box vendor DLL): Galaxy, FOCAS. Run out-of-process as separate Windows services with named-pipe IPC.

The Tier C decision means any deployment using Galaxy or FOCAS runs at least one additional Windows service per cluster node beyond the OtOpcUa main service. For sites with both Galaxy and FOCAS, that's three services per node. With 2-node clusters, six services per cluster.

Reason for the decision: an AccessViolationException from native code (e.g. FANUC's Fwlib64.dll) is uncatchable in modern .NET — would tear down the entire OtOpcUa server, all sessions, all other drivers. Process isolation contains the blast radius. Also, Galaxy stays .NET 4.8 x86 due to MXAccess COM bitness constraints, which alone forces out-of-process — the Tier C model generalizes the pattern.

Plan should say instead: The plan should incorporate the tier model as an architectural decision the OtOpcUa work surfaced:

  • Galaxy is out-of-process for two reasons: bitness AND stability isolation
  • FOCAS is out-of-process for stability isolation (Fwlib uncatchable AVs)
  • Both follow a generalized Proxy/Host/Shared three-project pattern reusable for any future Tier C driver
  • Operational footprint: 1 to 3 Windows services per cluster node depending on which drivers are configured
  • Deployment guides must cover the multi-service install/upgrade/recycle workflow

Operational implication: the handoff §"Deployment Footprint" mentions co-location with System Platform / ScadaBridge and modest workload. With multi-service per node and process recycling for Tier C drivers, the operational picture is more complex than implied. Worth a footprint reassessment after the first cluster is deployed.
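The footprint arithmetic above can be stated as a one-line rule, sketched here with the tier assignments summarized in this item (the function and set names are illustrative):

```python
# Sketch of the operational-footprint rule under the tier model:
# Tier A/B drivers run in-process; each Tier C driver gets its own
# out-of-process Windows service host per node.

TIER_C = {"Galaxy", "FOCAS"}  # out-of-process host per configured driver

def services_per_node(configured_drivers: set[str]) -> int:
    """Main OtOpcUa service plus one host service per Tier C driver."""
    return 1 + len(configured_drivers & TIER_C)
```

This reproduces the counts in the text: a Galaxy+FOCAS site runs three services per node, six per 2-node cluster.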


C3. Polly v8+ resilience model — runtime dependency added beyond handoff scope

Plan currently says (handoff is silent on resilience strategy).

What was found: The v2 design (lmxopcua/docs/v2/plan.md decisions #34–#36, #44–#45) standardizes on Polly v8+ (Microsoft.Extensions.Resilience) for retry, circuit-breaker, and timeout pipelines per driver and per device. It adds:

  • A composable resilience pipeline per driver instance and per device within a driver
  • Circuit-breaker state surfacing in the status dashboard
  • Per-device per-tag write-retry policy with explicit WriteIdempotent opt-in (never-retry by default)

The write-retry safety policy (decision #44) is a real-world constraint not in the handoff: timeouts on writes can fire after the device already accepted the command, so blind retry of non-idempotent operations (pulses, alarm acks, recipe steps) would cause duplicate field actions. The default-to-never-retry policy with explicit opt-in is a substantive safety decision.

Plan should say instead: The plan should note that the implementation has standardized on Polly v8+ for per-device resilience, with a default-no-retry policy on writes that requires explicit per-tag opt-in for idempotency. This affects operator runbooks and onboarding training — operators configuring tags need to understand the WriteIdempotent flag's semantics.


C4. Multi-identifier equipment model (5 identifiers) added beyond handoff's UUID-only spec

Plan currently says (handoff §"UNS Naming Hierarchy"):

Stable equipment UUID — Every equipment node must expose a stable UUIDv4 as a property: UUID is assigned once, never changes, never reused. Path can change (equipment moves, area renamed); UUID cannot. Canonical events downstream carry both UUID (for joins/lineage) and path (for dashboards/filtering).

What was found: Production usage requires more than UUID + path. The v2 design (lmxopcua/docs/v2/plan.md decisions #116–#119) adds three operationally-required identifiers:

  • MachineCode — operator-facing colloquial name (e.g. machine_001). Required. Within-cluster uniqueness. This is what operators actually say in conversation and write in runbooks; UUID and path are not mnemonic enough for daily operations.
  • ZTag — ERP equipment identifier. Optional. Fleet-wide uniqueness. Primary identifier for browsing in Admin UI per operational request.
  • SAPID — SAP PM equipment identifier. Optional. Fleet-wide uniqueness. Required for maintenance system join.

All five identifiers (EquipmentUuid, the internal EquipmentId, MachineCode, ZTag, SAPID) are exposed as OPC UA properties on the equipment node, so external systems can resolve equipment by whichever identifier they prefer without a sidecar lookup service.

Plan should say instead: The handoff's UUID-only spec satisfies cross-system join requirements but underspecifies operator and external-system identifier needs. The plan should incorporate the multi-identifier model:

  • Permanent: EquipmentUuid (UUIDv4, immutable, downstream events / canonical model joins)
  • Operational: MachineCode (operator colloquial, required)
  • ERP integration: ZTag (optional, ERP-allocated, fleet-wide unique)
  • SAP PM integration: SAPID (optional, SAP-allocated, fleet-wide unique)
  • Internal config key: EquipmentId (immutable after publish, internal logical key for cross-generation diffs)

Resolved TBD: This also resolves the implicit question "how do operators find equipment in dashboards / on the radio" — the answer is MachineCode, surfaced prominently in Admin UI alongside ZTag.
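What "resolve by their preferred identifier without a sidecar service" means in practice can be sketched as a lookup over the exposed properties; the record shape and values are hypothetical:

```python
# Sketch: any of the five identifiers resolves to the same equipment
# record. All values here are invented for illustration.

EQUIPMENT = [{
    "EquipmentUuid": "0f8e2a7c-4b1d-4f7e-9a2c-3d5e6f708192",
    "EquipmentId": "EQ-0f8e2a7c4b1d",   # internal config key
    "MachineCode": "machine_001",        # operator colloquial name
    "ZTag": "Z-44871",                   # ERP identifier (optional)
    "SAPID": "10004471",                 # SAP PM identifier (optional)
}]

def resolve(identifier: str):
    """Return the equipment record matching any exposed identifier."""
    for rec in EQUIPMENT:
        if identifier in rec.values():
            return rec
    return None
```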


C5. Consumer cutover plan (Tier 1 / 2 / 3) is not addressed in v2 design docs

Plan currently says (handoff §"Rollout Posture", §"Roadmap Summary"):

Tier 1 ScadaBridge cutover begins Year 1. Tier 2 Ignition begins Year 2. Tier 3 System Platform IO Year 3.

What was found: The v2 design docs (lmxopcua/docs/v2/plan.md Phases 0–5) cover building the OtOpcUa server, drivers, configuration, and Admin UI — but do not address consumer cutover at all. There is no plan for:

  • ScadaBridge migration sequencing per equipment / per site
  • Per-site cutover validation (proving consumers see equivalent data through OtOpcUa as they did via direct sessions)
  • Rollback procedures if a cutover causes a consumer regression
  • Coordination with Aveva on System Platform IO cutover (Tier 3 — most opinionated consumer per handoff)
  • Operational runbooks for what to do when a consumer can't connect / fails over

Plan should say instead: The v2 design should add cutover phases:

  • Phase 6 — Tier 1 (ScadaBridge) cutover — per-site sequencing, validation methodology, rollback procedure
  • Phase 7 — Tier 2 (Ignition) cutover — WAN-session collapse validation
  • Phase 8 — Tier 3 (System Platform IO) cutover — Aveva validation prerequisite, compliance review

Or alternatively, the cutover plan lives outside the OtOpcUa v2 design and is owned by an integration / operations team — in which case the handoff should make that ownership explicit.

Recommendation: the cutover plan needs an owner and a doc. Either OtOpcUa team owns it (add Phases 6–8 to the v2 plan) or another team owns it (the handoff should name them and link the doc).
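Whoever owns it, the per-site validation step ("proving consumers see equivalent data through OtOpcUa") reduces to a snapshot diff between the two paths. A sketch, with snapshot shape and tolerance as assumptions:

```python
# Sketch of pre-cutover data-equivalence validation: compare tag
# snapshots taken via the consumer's direct session and via OtOpcUa.

def cutover_diff(direct: dict[str, float], via_otopcua: dict[str, float],
                 tol: float = 0.0) -> dict[str, str]:
    """Tags that differ between the two paths, with a reason per tag."""
    issues = {}
    for tag in direct.keys() | via_otopcua.keys():
        if tag not in via_otopcua:
            issues[tag] = "missing via OtOpcUa"
        elif tag not in direct:
            issues[tag] = "unexpected extra tag"
        elif abs(direct[tag] - via_otopcua[tag]) > tol:
            issues[tag] = "value mismatch"
    return issues
```

An empty result is a necessary (not sufficient) cutover gate; timing, quality codes, and subscription behavior would need their own checks.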


C6. Per-building cluster pattern (Warsaw) — UNS path interaction needs clarification

Plan currently says (handoff §"Deployment Footprint"):

Largest sites (Warsaw West, Warsaw North) run one cluster per production building

And §"UNS Naming Hierarchy":

Level 3 — Area — bldg-3

What was found: The v2 design models clusters with Enterprise + Site columns (lmxopcua/docs/v2/config-db-schema.md ServerCluster table). For Warsaw with per-building clusters, the cleanest mapping is:

  • Cluster A: Site = warsaw-west, equipment in this cluster all under UnsArea = bldg-3
  • Cluster B: Site = warsaw-west, equipment in this cluster all under UnsArea = bldg-4
  • ...etc.

This works structurally, but raises questions:

  1. ZTag fleet-wide uniqueness: if Warsaw West building 3 and Warsaw West building 4 are separate clusters but the same physical site, can the same ZTag appear in both? Per the v2 design, no — ZTag is fleet-wide unique. Operationally that should be true (ERP doesn't reuse identifiers across buildings) but worth confirming.
  2. Cross-cluster consumer consistency: a ScadaBridge instance reading equipment data for the whole Warsaw West site would need to connect to N clusters (one per building) and stitch the data together. The handoff's "site-local aggregation" responsibility (§"Responsibilities") is satisfied per-cluster but not site-wide.
  3. UNS path uniqueness: with one cluster per building, equipment paths ent/warsaw-west/bldg-3/line-2/cnc-mill-05 and ent/warsaw-west/bldg-4/line-2/cnc-mill-05 are unique by the Area segment. But two clusters serve overlapping paths (ent/warsaw-west/...) — clients connecting to building 3's cluster can't see building 4's equipment. That's correct by intent, but downstream consumers expecting site-wide visibility need to know they connect per-cluster.

Plan should say instead: The handoff should clarify:

  • Whether ZTag fleet-wide uniqueness extends across per-building clusters at the same site (assumed yes; confirm with ERP team)
  • That consumers needing site-wide visibility at Warsaw campuses will connect to multiple clusters, not one
  • Whether the per-building cluster decision is a constraint to optimize for or a cost to minimize (e.g. would a single Warsaw-West cluster with 4 buildings' worth of equipment be feasible if hardware allowed?)

D. Resolved TBDs

D1. Pilot equipment class for first canonical definition

Plan TBD: "Pilot equipment class for the first canonical definition" (handoff §"Open Questions / TBDs").

Proposal: FANUC CNC is the natural pilot. The v2 design (lmxopcua/docs/v2/driver-specs.md §7 "FANUC FOCAS Driver") already specifies a fixed pre-defined node hierarchy for FOCAS — Identity, Status, Axes, Spindle, Program, Tool, Alarms, PMC, Macro categories — populated by specific FOCAS2 API calls. This is essentially a class template already; the schemas repo just needs to formalize it.

Why FANUC CNC over alternatives:

  • Pre-defined hierarchy already exists in driver design — no greenfield template work
  • Single vendor with well-known API surface (FOCAS2 spec) — low ambiguity
  • Equipment count is finite (CNCs per site are countable, not in the hundreds like generic Modbus instances)
  • Tier C out-of-process driver design — failure-mode boundary is clean for piloting class-template-driven config

Why not Modbus or AB CIP first: these have no inherent class — every device has a hand-curated tag list. Picking them for the pilot would mean inventing a class taxonomy from scratch, which is exactly the schemas-repo team's job, not the OtOpcUa team's.


D2. Storage format for the hierarchy in the schemas repo

Plan TBD: "Storage format for the hierarchy in the schemas repo (YAML vs Protobuf vs both)" (handoff §"Open Questions / TBDs").

Proposal: JSON Schema (.json files, version-controlled in the schemas repo) as the authoritative format, with Protobuf code generation for runtime serialization where needed (e.g. Redpanda event schemas).

Rationale:

  • Idiomatic for .NET (System.Text.Json and System.Text.Json.Schema in .NET 10) — OtOpcUa reads templates with no extra dependencies
  • Idiomatic for CI tooling (every CI runner can inspect JSON with jq and validate against JSON Schema without an extra toolchain)
  • Supports validation at multiple layers (operator-visible Admin UI errors, schemas-repo CI gates, OtOpcUa runtime validation)
  • Protobuf is better for wire serialization (size, speed, generated code) but worse for authoring (binary, requires .proto compiler, poor merge story in git). For a repo where humans author equipment-class definitions, JSON Schema wins.
  • If Protobuf is needed for downstream (Redpanda events with Protobuf-defined schemas), it can be code-generated from the JSON Schema authoring source. One-way derivation is simpler than bidirectional sync.

Where it lives in OtOpcUa: at startup, OtOpcUa nodes fetch the relevant equipment-class templates from the schemas repo (cached locally) and use them to validate Equipment.EquipmentClassRef against actual tag mappings. Drift between OtOpcUa's tag set and the schema is surfaced as a config validation error.


D3. Where namespace ACL definitions live

Plan TBD: "Whether namespace ACL definitions live alongside driver/topology config or in their own governance surface" (handoff §"Open Questions / TBDs").

Proposal: Live in the central config DB alongside DriverInstance/Equipment/Tag, in a new EquipmentAcl table per cluster. Operationally co-located with the rest of the cluster's config; managed through the same Admin UI; edited through the same draft → diff → publish flow; generation-versioned for the same auditability and rollback safety.

Rationale:

  • A separate governance surface means a separate auth layer, separate UI, separate audit log, separate rollback workflow — meaningful operational tax for unclear benefit.
  • Co-locating means one publish includes both topology changes and ACL changes atomically, so a new equipment is added AND its ACL is granted in one transaction. Avoids the "equipment exists for 30 seconds with default-deny" race window that a separate-governance-surface model would create.
  • Per lmxopcua/docs/v2/plan.md decision #105, cluster-scoped admin grants already live in the central DB — extending the same model to data-path ACLs is the consistent choice.

Open sub-question: ACL granularity — per Equipment? per UnsLine? per UnsArea? per Namespace? Suggest supporting all four levels with inheritance (grant at UnsArea cascades to all UnsLines + Equipment beneath, unless overridden).

This proposal also addresses B1 (the missing ACL constraint) — they're the same gap, viewed from two directions.
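The proposed inheritance semantics in the open sub-question can be sketched as longest-prefix resolution: a grant at a higher UNS level cascades down unless a more specific entry overrides it. Everything here is illustrative:

```python
# Sketch: effective permissions at a node = the grant of the most
# specific (longest) matching UNS subtree. ACL shape is an assumption.

def effective_perms(acl: dict[str, set[str]], node_path: str) -> set[str]:
    matches = [p for p in acl if node_path == p or node_path.startswith(p + "/")]
    if not matches:
        return set()
    return acl[max(matches, key=len)]  # most specific grant wins

ACL = {
    "ent/warsaw-west/bldg-3": {"Read"},                   # UnsArea-level grant
    "ent/warsaw-west/bldg-3/line-2/cnc-mill-05": set(),   # equipment-level deny
}
```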


D4. Enterprise shortname for UNS hierarchy root

Plan TBD: "Enterprise shortname for UNS hierarchy root (currently ent placeholder)" (handoff §"Open Questions / TBDs").

Status: Not resolved. OtOpcUa work cannot determine this — it's an organizational naming choice. Flagging because the v2 design currently uses ent per the handoff's placeholder. Suggest the schemas repo team or enterprise-naming authority resolve before any production deployment, since changing the Enterprise segment after equipment is published would require a generation-wide path rewrite (UUIDs preserved but every consumer's path-based query needs to learn the new prefix).

Recommendation: resolve before tier 1 (ScadaBridge) cutover at the first site.


E. New TBDs

Questions the plan didn't think to ask but should have, surfaced during v2 design.

E1. EquipmentUuid generation authority — OtOpcUa or external system?

Question: Who allocates EquipmentUuid? The v2 design (lmxopcua/docs/v2/admin-ui.md) auto-generates UUIDv4 in the Admin UI on equipment creation. But if equipment is also tracked in ERP (ZTag) or SAP PM (SAPID), those systems may have their own equipment identifiers and may want to be the authoritative UUID source.

Implication: if ERP/SAP becomes UUID authority, OtOpcUa Admin UI shifts from "generate UUID on create" to "look up UUID from external system; reject creation if equipment not yet in ERP". That's a different operational flow.

Suggested resolution: OtOpcUa generates UUIDs by default (current v2 design); plan should note that if ERP/SAP take authoritative ownership of equipment registration in the future, the UUID-generation policy is configurable per cluster.


E2. Tier 3 (System Platform IO) cutover — Aveva-supported pattern verification

Question: Does Aveva System Platform IO support consuming equipment data from another OPC UA server (OtOpcUa) instead of from equipment directly? The handoff §"Downstream Consumer Impact" says this "needs validation against Aveva's supported patterns — System Platform is the most opinionated consumer."

Recommendation: plan should commit to a research deliverable in Year 1 or Year 2 (well ahead of Year 3 tier 3 cutover): "Validate with Aveva that System Platform IO drivers support upstream OPC UA-server data sources, including any restrictions on security mode, namespace structure, or session model." If Aveva's pattern requires something OtOpcUa doesn't expose, that's a long-lead-time discovery.


E3. Site-wide vs per-cluster consumer addressing at multi-cluster sites

Question: At Warsaw West / Warsaw North with per-building clusters, how do consumers that need site-wide equipment visibility address the equipment? The current v2 design has each cluster with its own endpoints + namespace; a site-wide consumer must connect to N clusters.

Possible resolutions (out of scope for v2 design but should be on someone's plate):

  • (a) Configure consumer-side templates to enumerate all per-building clusters and stitch — current expected pattern but adds consumer-side complexity
  • (b) Build a site-aggregator OtOpcUa instance that consumes from per-building clusters via the OPC UA Client gateway driver and re-exposes a unified site namespace — second-order aggregation, doable with our existing toolset but operational complexity
  • (c) Reconsider per-building clustering — if hardware allows, single site cluster is simpler

Recommendation: flag as Year 1 design discussion before per-building clusters are committed at Warsaw. If (b) becomes the answer, OtOpcUa's architecture already supports it (the OpcUaClient driver was designed for exactly this gateway-of-gateways scenario).


E4. Cluster-as-single-endpoint vs per-node-endpoints — clarify handoff mental model

(Cross-references A2.) Even if we agree non-transparent redundancy is the right v2 model, the handoff's wording ("single endpoint", "unified OPC UA endpoint") implies a single URL. Worth deciding whether to:

  • Update the handoff wording to acknowledge two endpoints per cluster (operationally accurate per OPC UA non-transparent spec), OR
  • Add an explicit roadmap line item to introduce a VIP/load-balancer in front of each cluster for transparent redundancy (substantial new infrastructure)

Recommendation: update the handoff wording. Transparent redundancy is a meaningful infrastructure investment for marginal client-side simplification (ScadaBridge / Ignition / System Platform IO are all sophisticated OPC UA clients that can handle ServerUriArray + ServiceLevel-driven failover).


Summary

This batch:

| Category | Count | Notes |
| --- | --- | --- |
| A. Inaccuracies | 2 | Both wording/framing issues; no architectural conflict |
| B. Missing constraints | 3 | ACLs (substantial gap), schemas-repo dependencies, certificate trust pre-cutover step |
| C. Architectural decisions to revisit | 6 | Driver list pre-survey, stability tiers, Polly resilience, multi-identifier model, missing cutover plan, per-building cluster interactions |
| D. Resolved TBDs | 4 | Pilot class (FANUC CNC), schemas repo format (JSON Schema), ACL location (central config DB), enterprise shortname (still unresolved — flagged) |
| E. New TBDs | 4 | UUID-generation authority, System Platform IO Aveva validation, multi-cluster site addressing, cluster-endpoint mental model |

Most urgent for plan integration: B1 (missing ACL surface) and C5 (missing consumer cutover plan) — both are large work items the v2 implementation design discovered are needed but doesn't currently own. They should be assigned (to OtOpcUa team or another team) before Year 1 tier 1 cutover begins.

Reference: all v2 design decisions are in lmxopcua/docs/v2/plan.md (decision log section). Specific cross-references in this doc cite decision numbers (#XX) where applicable.


Addendum — v2 design hardening, same day (2026-04-17)

After this corrections doc was filed, an adversarial review of the v2 DB schema and Admin UI surfaced four internal design defects (one critical, three high) that the OtOpcUa team has now closed. None of these are corrections back to the handoff — they are internal design corrections within the v2 work — but two of them refine claims this corrections doc made and are worth flagging for plan-team awareness:

Affects C4 (multi-identifier equipment model) — EquipmentId is now system-generated, not operator-supplied

C4 above describes a 5-identifier equipment model (UUID, EquipmentId, MachineCode, ZTag, SAPID) and implies EquipmentId is operator-set alongside the other operator-facing identifiers. The hardened design (lmxopcua decision #125) makes EquipmentId system-generated as 'EQ-' + first 12 hex chars of EquipmentUuid, never operator-supplied, never editable, never present in CSV imports. The operator-facing identifiers are now MachineCode, ZTag, SAPID — three operator-set fields, two system-generated (EquipmentId, EquipmentUuid).

Why the change: operator-supplied EquipmentId was a corruption path — typos and bulk-import renames would mint duplicate equipment identities, each with a fresh UUID, permanently splitting downstream UUID-keyed lineage. Removing operator authoring eliminates the failure mode entirely. CSV imports now match by EquipmentUuid for updates; rows without UUID create new equipment with system-generated identifiers.
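The derivation rule above fits in one line of code. A sketch of decision #125's rule (the function name is illustrative):

```python
import uuid

def derive_equipment_id(equipment_uuid: uuid.UUID) -> str:
    # Decision #125: EquipmentId is 'EQ-' plus the first 12 hex characters
    # of EquipmentUuid. Deterministic: re-deriving from the same UUID can
    # never mint a second identity, which is the whole point of the change.
    return "EQ-" + equipment_uuid.hex[:12]
```

Because the mapping is a pure function of the UUID, a bulk import that carries an EquipmentUuid can always reconstruct the matching EquipmentId without ever accepting one from the operator.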

For the plan team: this doesn't change the audience-mapping story (operators say MachineCode in conversation, ERP queries by ZTag, etc.) — it just means there's one less operator field in the equipment-create form. If any plan-level documentation describes EquipmentId as operator-managed, update to "system-generated".

Affects D3 (ACL location) and adds a new architectural concept — ExternalIdReservation table for rollback-safe identifier uniqueness

D3 above proposes ACL definitions live alongside topology in the central config DB, generation-versioned. The same review surfaced that fleet-wide uniqueness for ZTag and SAPID cannot be expressed within generation-versioned tables because old generations and disabled equipment can still hold the same external IDs — rollback or re-enable then silently reintroduces duplicates that corrupt downstream ERP/SAP joins.

The hardened design (lmxopcua decision #124) introduces a new ExternalIdReservation table that sits outside generation versioning specifically for this rollback-safety property. sp_PublishGeneration reserves IDs atomically at publish; FleetAdmin-only sp_ReleaseExternalIdReservation (audit-logged, requires reason) is the only path to free a value for reuse by a different EquipmentUuid; rollback respects the reservation table.
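To make the invariant concrete, here is a toy in-memory model of the reservation rule. The names echo the stored procedures described above, but everything here is an illustrative sketch, not the actual schema or T-SQL:

```python
class ExternalIdReservations:
    """Toy model: an external ID (ZTag/SAPID value), once reserved for an
    EquipmentUuid, may only be re-reserved by that SAME uuid; freeing it for
    a different uuid requires an explicit, reasoned release (the analogue of
    the audited, FleetAdmin-only sp_ReleaseExternalIdReservation)."""

    def __init__(self) -> None:
        self._owner: dict[str, str] = {}  # external id -> owning EquipmentUuid

    def reserve(self, external_id: str, equipment_uuid: str) -> None:
        # Called at publish time; idempotent for the same owner, and the
        # table is NOT generation-versioned, so rollback cannot undo it.
        owner = self._owner.get(external_id)
        if owner is not None and owner != equipment_uuid:
            raise ValueError(f"{external_id!r} is reserved by {owner}")
        self._owner[external_id] = equipment_uuid

    def release(self, external_id: str, *, reason: str) -> None:
        # The only path to free a value for reuse; a reason is mandatory
        # (audit-logged in the real design).
        if not reason:
            raise ValueError("release requires a reason")
        self._owner.pop(external_id, None)
```

The key property to notice: because reservations live outside generation versioning, rolling back a generation leaves `_owner` untouched, so a rolled-back (or disabled) equipment's ZTag cannot silently reappear on a different EquipmentUuid.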

For the plan team: this is a precedent that some cross-generation invariants need their own non-versioned tables. When the missing ACL design is being scoped (per B1 / C5 above), consider whether any ACL grant has a similar rollback-reuse hazard. (Initial intuition: ACLs probably don't — granting a permission and then revoking it doesn't create downstream join corruption the way an ID does. But worth checking.)

Two purely-internal fixes (no plan-team relevance)

For completeness and audit trail:

  • Same-cluster invariant on DriverInstance.NamespaceId (lmxopcua decision #122) — closes a critical cross-cluster trust-boundary leak in the schema where a draft for cluster A could bind to cluster B's namespace and leak its URI. Three-layer enforcement (sp_ValidateDraft + API scoping + audit log).
  • Namespace table moves from cluster-level to generation-versioned (lmxopcua decision #123) — earlier draft mistakenly treated namespaces as cluster-topology like ClusterNode rows. They are consumer-visible content (define what consumers see at the OPC UA endpoint) and must travel through draft → diff → publish like every other consumer-visible config.

Neither of these affects the handoff or this corrections doc directly.

Updated summary

| Category | Count | Notes |
| --- | --- | --- |
| A. Inaccuracies | 2 | Both wording/framing issues; no architectural conflict |
| B. Missing constraints | 3 | ACLs, schemas-repo dependencies, certificate trust pre-cutover |
| C. Architectural decisions to revisit | 6 | Driver list pre-survey, stability tiers, Polly resilience, multi-identifier model (now refined per addendum), missing cutover plan, per-building cluster interactions |
| D. Resolved TBDs | 4 | Pilot class, schemas repo format, ACL location, enterprise shortname (unresolved) |
| E. New TBDs | 4 | UUID-gen authority, Aveva validation, multi-cluster site addressing, cluster-endpoint mental model |
| Addendum hardening fixes | 4 | EquipmentId system-generated; ExternalIdReservation table; same-cluster namespace binding; Namespace generation-versioned |

The hardening fixes are committed in lmxopcua branch v2 at commit a59ad2e (2026-04-17). Decisions #122–125 in lmxopcua/docs/v2/plan.md carry the rationale.