# OtOpcUa Implementation Corrections — 2026-04-17

**From:** OtOpcUa v2 design work in `lmxopcua/docs/v2/` (branch `v2`, commits `a1e79cd` and forward)
**Source handoff:** `handoffs/otopcua-handoff.md`
**Format:** per the handoff's "Sending Corrections Back" protocol — each item lists what the plan says, what was found during design, and what the plan should say instead.

This batch covers corrections surfaced while drafting the v2 implementation design (six docs: `plan.md`, `driver-specs.md`, `driver-stability.md`, `test-data-sources.md`, `config-db-schema.md`, `admin-ui.md`). No code changes yet — all design-phase findings.

---
## A. Inaccuracies

### A1. "Equipment that already speaks native OPC UA requires no driver build"

**Plan currently says** (`otopcua-handoff.md` §"Driver Strategy", "Equipment where no driver is needed" subsection):

> Equipment that already speaks **native OPC UA** requires no driver build — OtOpcUa simply proxies the OPC UA session.

**What was found:** OPC UA-native equipment still requires an `OpcUaClient` gateway driver instance. The driver is thin (it wraps the OPC Foundation .NETStandard SDK), but it is a real driver in our taxonomy:

- Has its own project (`Driver.OpcUaClient`)
- Has its own per-instance config (endpoint URL, security policy, browse strategy, certificate trust, etc.)
- Has its own stability concerns (subscription drift on remote-server reconnect, cascading quality, browse cache memory)
- Is generation-versioned in central config like every other driver
- Counts against the same operational surface (alerts, audit, watchdog)

The "no driver build" framing risks underestimating the work, because every OPC UA-native equipment connection still needs configured tag mappings, namespace remapping (the remote namespace is not UNS-compliant by default), and per-equipment routing rules.

**Plan should say instead:** "Equipment that already speaks native OPC UA does not require a *new protocol-specific driver project* — the `OpcUaClient` gateway driver in the core library handles all of them. Per-equipment configuration (endpoint, browse strategy, namespace remapping to UNS, certificate trust) is still required. Onboarding effort per OPC UA-native equipment is ~hours of config, not zero."
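
To make the per-equipment configuration concrete: a gateway-driver instance config might look roughly like the sketch below. All field names, the endpoint host, and the paths are illustrative assumptions, not the committed shapes from `config-db-schema.md`.

```json
{
  "driverType": "OpcUaClient",
  "endpointUrl": "opc.tcp://remote-press-01:4840",
  "securityPolicy": "Basic256Sha256",
  "browseStrategy": "FullBrowseOnConnect",
  "certificateTrust": { "mode": "PinnedServerCert" },
  "namespaceRemap": [
    {
      "remoteBrowsePath": "Objects/PressLine/Press01",
      "unsPath": "ent/warsaw-west/bldg-3/line-2/press-01"
    }
  ]
}
```

Even in this thin form, every one of these fields has to be authored and reviewed per equipment, which is the "~hours of config, not zero" point above.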

---

### A2. "Two namespaces through a single endpoint" — single endpoint is per-node, not per-cluster

**Plan currently says** (`otopcua-handoff.md` §"Two Namespaces"):

> OtOpcUa serves **two logical namespaces** through a single endpoint

And §"Responsibilities":

> Unified OPC UA endpoint for OT data. Clients read both raw equipment data and processed System Platform data from **one OPC UA endpoint** with two namespaces.

**What was found:** OPC UA non-transparent redundancy (the v1 LmxOpcUa pattern, inherited by v2 — see `lmxopcua/docs/v2/plan.md` decision #85) requires **each cluster node to have its own unique `ApplicationUri`** per the OPC UA spec. Clients see *both* endpoints in `ServerUriArray` and choose by `ServiceLevel`. The "single endpoint" framing is true *per node* but not *per cluster*.

True transparent redundancy (`RedundancySupport.Transparent`) would require a virtual IP / load balancer in front of both nodes — not currently planned for v2, because v1 doesn't have it and adding a VIP would be substantial new infrastructure (per `lmxopcua/docs/v2/plan.md` decisions #79, #84, #85).

This is not a contradiction — both nodes serve identical address spaces, so any consumer connecting to either sees "the same" namespaces. But the plan's wording suggests a single endpoint URL exists per cluster, which is only true with a VIP plus transparent redundancy.

**Plan should say instead:** Either:

- (a) "OtOpcUa serves two logical namespaces through a single endpoint **per node**. In a 2-node cluster, both nodes expose identical namespaces; consumers see two endpoints in `ServerUriArray` and select by `ServiceLevel` (non-transparent redundancy)."
- OR (b) commit explicitly to deploying a VIP / load balancer in front of each cluster to achieve true transparent single-endpoint behavior — a substantial infrastructure addition.

The v2 implementation has chosen (a). If the plan intends (b), that needs to be flagged as a separate work item.
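
Under the (a) wording, endpoint selection is the consumer's job. A minimal sketch of the selection rule, using plain data structures in place of real OPC UA session objects (hostnames are hypothetical; higher `ServiceLevel` means healthier, per OPC UA convention):

```python
from dataclasses import dataclass

@dataclass
class NodeEndpoint:
    application_uri: str   # unique per cluster node (decision #85)
    url: str
    service_level: int     # 0-255; OPC UA convention: higher is healthier

def pick_endpoint(endpoints: list[NodeEndpoint]) -> NodeEndpoint:
    """Select the healthiest node of a non-transparent redundant pair."""
    if not endpoints:
        raise ValueError("no endpoints discovered in ServerUriArray")
    return max(endpoints, key=lambda e: e.service_level)

# Hypothetical 2-node cluster as a consumer would see it.
cluster = [
    NodeEndpoint("urn:ot-node-a:OtOpcUa", "opc.tcp://ot-node-a:4840", 255),
    NodeEndpoint("urn:ot-node-b:OtOpcUa", "opc.tcp://ot-node-b:4840", 10),
]
assert pick_endpoint(cluster).application_uri == "urn:ot-node-a:OtOpcUa"
```

Real consumers also re-evaluate on `ServiceLevel` change notifications; the sketch shows only the selection rule itself.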

---

## B. Missing constraints

### B1. Namespace-level / equipment-subtree ACLs not yet modeled in the v2 design

**Plan currently says** (`otopcua-handoff.md` §"Authorization Model"):

> Authorization enforced via **namespace-level ACLs** — each identity scoped to permitted equipment/namespaces

**What was found:** The v2 central-config schema (`lmxopcua/docs/v2/config-db-schema.md`) has tables for clusters, nodes, namespaces, drivers, devices, equipment, tags, poll groups, credentials, generation tracking, and audit log — but **no `EquipmentAcl` or `NamespaceAcl` table for OPC UA client authorization**.

The v2 plan does cover Admin UI authorization (LDAP groups → admin roles `FleetAdmin` / `ConfigEditor` / `ReadOnly`, cluster-scoped grants — decisions #88, #105), but admin authorization governs *who edits configuration*, not *who can read/write equipment data through the OPC UA endpoint*.

The v1 LmxOpcUa LDAP layer (`Security.md`) maps LDAP groups to OPC UA permission roles (`ReadOnly`, `WriteOperate`, `WriteTune`, `WriteConfigure`, `AlarmAck`) but applies them globally — not per namespace, per equipment subtree, or per UNS Area/Line.

**Plan should say instead:** The plan should call out that the implementation needs:

- A per-cluster `EquipmentAcl` (or equivalent) table mapping LDAP group → permitted Namespace + UnsArea/UnsLine/Equipment subtree + permission level (Read / Write / AlarmAck)
- ACL editing surfaced in the Admin UI
- An ACL check in the OPC UA NodeManager on every browse/read/write/subscribe, evaluated against the connected user's group claims
- Generation-versioned ACLs (changes go through publish/diff like any other config)

This is a substantial missing surface area in both the handoff and the v2 design. Suggest the plan add it as an explicit Year 1 deliverable alongside driver work, since Tier 1 (ScadaBridge) cutover will need authorization enforcement working from day one to satisfy the "access control / authorization chokepoint" responsibility.

**Resolution** (lmxopcua decisions #129–132, 2026-04-17): the OtOpcUa team has designed and committed the data-path ACL model in `lmxopcua/docs/v2/acl-design.md`. Highlights:

- `NodePermissions` bitmask enum covering Browse / Read / Subscribe / HistoryRead / WriteOperate / WriteTune / WriteConfigure / AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall, plus common bundles (`ReadOnly` / `Operator` / `Engineer` / `Admin`)
- 6-level scope hierarchy (Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag) with default-deny + additive grants and Browse-implication on ancestors
- `NodeAcl` table is generation-versioned (decision #130) — ACL changes go through draft → diff → publish → rollback like every other content table
- Cluster-create workflow seeds a default ACL set matching the v1 LmxOpcUa LDAP-role-to-permission map (decision #131), preserving behavioral parity for v1 → v2 consumer migration
- Per-session permission-trie evaluator with O(depth × group-count) cost; cache invalidated on generation-apply or LDAP group cache expiry
- Admin UI ACL tab + bulk grant + permission simulator
- Phase 1 ships the schema + Admin UI + evaluator unit tests; per-driver enforcement lands in each driver's phase (Phase 2+) per `acl-design.md` §"Implementation Plan"

The "must be working from day one of Tier 1 cutover" timing constraint is satisfied — Phase 1 (Configuration + Admin scaffold) is completed before any driver phase, so the ACL model exists in the central config DB before any driver consumes it.
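
The default-deny, additive-grant evaluation over the scope hierarchy can be sketched as follows. The flag names and scope levels come from the highlights above; the dict-backed grant store, group names, and paths are illustrative, and the ancestor Browse-implication rule is elided:

```python
from enum import Flag, auto

class NodePermissions(Flag):
    NONE = 0
    BROWSE = auto()
    READ = auto()
    SUBSCRIBE = auto()
    WRITE_OPERATE = auto()
    ALARM_ACKNOWLEDGE = auto()
    # remaining bits elided for brevity

# Grants keyed by (ldap_group, scope_path); scope_path is a prefix tuple of
# hierarchy segments: (cluster, namespace, uns_area, uns_line, equipment, tag).
GRANTS: dict[tuple[str, tuple[str, ...]], NodePermissions] = {
    ("ot-operators", ("warsaw-west",)): NodePermissions.BROWSE | NodePermissions.READ,
    ("ot-operators", ("warsaw-west", "raw", "bldg-3")): NodePermissions.WRITE_OPERATE,
}

def effective(groups: list[str], scope: tuple[str, ...]) -> NodePermissions:
    """Default-deny: union every grant on the scope and all of its ancestors."""
    perms = NodePermissions.NONE
    for group in groups:
        for depth in range(len(scope) + 1):
            perms |= GRANTS.get((group, scope[:depth]), NodePermissions.NONE)
    return perms

p = effective(["ot-operators"],
              ("warsaw-west", "raw", "bldg-3", "line-2", "cnc-mill-05"))
assert p & NodePermissions.READ and p & NodePermissions.WRITE_OPERATE
assert effective(["unknown-group"], ("warsaw-west",)) == NodePermissions.NONE
```

The production evaluator is a trie rather than a flat dict, which is where the O(depth × group-count) cost figure comes from.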

---

### B2. Equipment-class templates dependency on the not-yet-created `schemas` repo blocks more than canonical-model integration

**Plan currently says** (`otopcua-handoff.md` §"UNS Naming Hierarchy"):

> The hierarchy definition lives in the **central `schemas` repo (not yet created)**. OtOpcUa is a **consumer** of the authoritative definition

And §"Canonical Model Integration":

> Equipment-class templates from `schemas` repo define the node layout

**What was found:** The schemas-repo gap blocks not only template-driven node layout but also:

- Per-equipment-class auto-generated tag lists (without a template, every equipment's tags must be hand-configured)
- Per-equipment-class signal validation (which raw signals each class is *expected* to expose; missing signals = compliance gap)
- Cross-cluster consistency checks (two CNC mills in different clusters should expose the same signal vocabulary)
- The state-derivation contract at Layer 3 (handoff §13) — derivation rules need to know which raw signals each class provides

The v2 design ships `Equipment.EquipmentClassRef` as a nullable hook column (per `lmxopcua/docs/v2/plan.md` decision #112), but this only avoids retrofit cost once the schemas repo lands. Operationally, until the schemas repo exists, every equipment's tag set is hand-curated by the operator who configures it — a real cost in time and consistency.

**Plan should say instead:** The plan should make the schemas-repo dependency explicit on the OtOpcUa critical path:

- Schemas-repo creation should be a Year 1 deliverable (its own handoff doc, distinct from OtOpcUa's)
- Until it exists, OtOpcUa equipment configurations are hand-curated and prone to drift
- A pilot equipment class (proposed: FANUC CNC, see D1 below) should land in the schemas repo before tier 1 cutover begins, to validate the template-consumer contract end-to-end

**Resolution — partial** (2026-04-17): the OtOpcUa team contributed an initial seed at `3yearplan/schemas/` (temporary location until the dedicated `schemas` repo is created — Gitea push-to-create is disabled and the dedicated repo needs a manual UI step). The seed includes:

- README + CONTRIBUTING explaining purpose, scope, ownership-TBD framing, format decision, and proposed workflow
- JSON Schemas defining the format (`format/equipment-class.schema.json`, `format/tag-definition.schema.json`, `format/uns-subtree.schema.json`)
- The pilot equipment class as a worked example (`classes/fanuc-cnc.json` — 16 signals + 3 alarm definitions + suggested state-derivation notes, per D1)
- A worked UNS subtree (`uns/example-warsaw-west.json`)
- Documentation: `docs/overview.md`, `docs/format-decisions.md` (8 numbered decisions), `docs/consumer-integration.md`
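
For orientation, an equipment-class file in this format might look roughly like the fragment below. The field names are assumptions extrapolated from the format-decision descriptions (per-class semver, `applicability.drivers`, informational `stateModel`), not the actual contents of `classes/fanuc-cnc.json`:

```json
{
  "class": "fanuc-cnc",
  "version": "1.0.0",
  "applicability": { "drivers": ["Focas"] },
  "signals": [
    { "name": "spindle_speed_rpm", "dataType": "Float", "required": true },
    { "name": "active_program", "dataType": "String", "required": true }
  ],
  "alarms": [
    { "name": "servo_alarm", "severity": "high" }
  ],
  "stateModel": "informational-only"
}
```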

What still needs the plan team / a cross-team owner:

- Name an owner team for the schemas content
- Decide whether to move it to a dedicated `gitea.dohertylan.com/dohertj2/schemas` repo (proposed; cleaner than living under `3yearplan/`) or keep it as a 3-year-plan sub-tree
- Ratify or revise the format decisions in `schemas/docs/format-decisions.md` (8 items, including the JSON Schema choice, per-class semver, additive-only minor bumps, the `_default` placeholder, the signal-name vs UNS-segment regex distinction, stateModel-as-informational, no per-equipment overrides at this layer, and applicability.drivers as an OtOpcUa driver enumeration)
- Confirm the FANUC CNC pilot is the right starting point (D1 recommendation)
- Establish the CI gate for JSON Schema validation
- Decide on consumer-integration plumbing for Redpanda Protobuf code-gen and dbt macro generation per `schemas/docs/consumer-integration.md`
---
|
||
|
||
### B3. Per-node `ApplicationUri` uniqueness and trust-pinning is an OPC UA spec constraint with operational implications
|
||
|
||
**Plan currently says** (handoff is silent on this).
|
||
|
||
**What was found:** OPC UA clients pin trust to the server's `ApplicationUri` as part of the certificate validation chain. Once a client has trusted a server with a given `ApplicationUri`, changing it requires every client to re-establish trust (re-import certificates, etc.). This is the explicit reason `lmxopcua/docs/v2/plan.md` decision #86 commits to "auto-suggested but never auto-rewritten" — the Admin UI prefills `urn:{Host}:OtOpcUa` on node creation but warns operators if they change `Host` later, requiring explicit opt-in to update `ApplicationUri`.
|
||
|
||
This affects the cutover plan: when a Tier 1 / Tier 2 / Tier 3 consumer is being moved to OtOpcUa, the consumer's certificate trust store must include OtOpcUa's certificate (and node-specific `ApplicationUri`) before the cutover. For a 2-node cluster, that's two `ApplicationUri` entries per consumer per cluster.
|
||
|
||
For the Warsaw campuses with one cluster per building, a consumer that needs visibility across multiple buildings will trust 2N ApplicationUris (where N = building count).
|
||
|
||
**Plan should say instead:** The cutover plan should include certificate-distribution as an explicit pre-cutover step. Suggest the plan note: "Tier 1/2/3 cutover prerequisite: each consumer's OPC UA certificate trust store must be populated with the target OtOpcUa cluster's per-node certificates and ApplicationUris (2 per cluster, plus per-building multiplier at Warsaw campuses) before cutover. Consumers without pre-loaded trust will fail to connect."
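
The trust-store arithmetic can be made concrete with a small sketch. The hostnames are hypothetical; the `urn:{Host}:OtOpcUa` pattern is the decision #86 prefill:

```python
def trust_entries(building_hosts: dict[str, list[str]]) -> list[str]:
    """ApplicationUris a consumer must pre-trust: one per node, 2 per cluster."""
    return [f"urn:{host}:OtOpcUa"
            for hosts in building_hosts.values()
            for host in hosts]

# Hypothetical consumer spanning two per-building Warsaw-West clusters.
warsaw_west = {
    "bldg-3": ["ww-b3-node-a", "ww-b3-node-b"],
    "bldg-4": ["ww-b4-node-a", "ww-b4-node-b"],
}
uris = trust_entries(warsaw_west)
assert len(uris) == 2 * len(warsaw_west)      # the 2N rule
assert "urn:ww-b3-node-a:OtOpcUa" in uris
```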

---

## C. Architectural decisions to revisit

### C1. Driver list committed for v2 implementation before Equipment Protocol Survey results — **RESOLVED**

**Plan currently says** (handoff §"Driver Strategy"):

> **Core library scope is driven by the equipment protocol survey** — see below and `../current-state/equipment-protocol-survey.md`. A protocol becomes "core" if it meets any of: Present at 3+ sites; Combined instance count above ~25; Needed to onboard a Year 1 or Year 2 site; Strategic vendor whose equipment is expected to grow.

**Resolution** (lmxopcua decision #128, 2026-04-17): the seven committed drivers (Modbus TCP including DL205, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client) plus the existing Galaxy/MXAccess driver are confirmed by direct knowledge of the equipment estate, **not pending the formal survey**. TwinCAT and AB Legacy were specifically confirmed as committed v2 drivers by the OtOpcUa team based on known Beckhoff installations and known SLC/MicroLogix legacy equipment in the estate.

The Equipment Protocol Survey may still produce useful inventory data for downstream planning (per-site capacity sizing, prioritization order between drivers, on-demand long-tail driver scoping per the handoff's "Long-tail drivers" section), but **adding or removing drivers from the v2 implementation list is out of scope for the survey**. The v2 driver list is fixed.

**Plan should say instead:** the handoff's "Core library scope is driven by the survey" wording should be updated to reflect that the v2.0 core library is **pre-committed** by direct equipment-estate knowledge, with the survey informing only long-tail driver scoping and per-site planning details.

---
### C2. Process-isolation stability tiers (A/B/C) — substantial architectural addition not in handoff

**Plan currently says** (handoff is silent on the driver process model — it implies all drivers run in-process in the OtOpcUa server).

**What was found:** The v2 design introduces a three-tier driver stability model (`lmxopcua/docs/v2/driver-stability.md`, decisions #63–67):

- **Tier A (pure managed)**: Modbus, OPC UA Client. Run in-process.
- **Tier B (wrapped native, mature)**: S7, AB CIP, AB Legacy, TwinCAT. Run in-process with extra guards (SafeHandle wrappers, memory watchdog, bounded queues).
- **Tier C (heavy native / COM / black-box vendor DLL)**: Galaxy, FOCAS. Run **out-of-process** as separate Windows services with named-pipe IPC.

The Tier C decision means any deployment using Galaxy or FOCAS runs at least one *additional* Windows service per cluster node beyond the OtOpcUa main service. For sites with both Galaxy and FOCAS, that's three services per node; with 2-node clusters, six services per cluster.

**Reason for the decision:** an `AccessViolationException` from native code (e.g. FANUC's `Fwlib64.dll`) is uncatchable in modern .NET — it would tear down the entire OtOpcUa server, all sessions, all other drivers. Process isolation contains the blast radius. Galaxy also stays .NET 4.8 x86 due to MXAccess COM bitness constraints, which alone forces out-of-process — the Tier C model generalizes the pattern.

**Plan should say instead:** The plan should incorporate the tier model as an architectural decision the OtOpcUa work surfaced:

- Galaxy is out-of-process for two reasons: bitness AND stability isolation
- FOCAS is out-of-process for stability isolation (Fwlib's uncatchable AVs)
- Both follow a generalized `Proxy/Host/Shared` three-project pattern reusable for any future Tier C driver
- Operational footprint: 1 to 3 Windows services per cluster node, depending on which drivers are configured
- Deployment guides must cover the multi-service install/upgrade/recycle workflow

**Operational implication:** the handoff §"Deployment Footprint" mentions co-location with System Platform / ScadaBridge and a modest workload. With multiple services per node and process recycling for Tier C drivers, the operational picture is more complex than implied. Worth a footprint reassessment after the first cluster is deployed.
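
The service-count arithmetic follows directly from the tier table. A small sketch (the driver-name strings are illustrative, not the configured identifiers):

```python
# Out-of-process (Tier C) drivers per driver-stability.md; names illustrative.
TIER_C = {"Galaxy", "Focas"}

def services_per_node(configured_drivers: set[str]) -> int:
    """OtOpcUa main service plus one host service per configured Tier C driver."""
    return 1 + len(configured_drivers & TIER_C)

assert services_per_node({"ModbusTcp", "S7"}) == 1          # all in-process
assert services_per_node({"Focas"}) == 2
assert services_per_node({"Galaxy", "Focas", "AbCip"}) == 3  # the worst case
```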

---

### C3. Polly v8+ resilience model — runtime dependency added beyond handoff scope

**Plan currently says** (handoff is silent on resilience strategy).

**What was found:** The v2 design (`lmxopcua/docs/v2/plan.md` decisions #34–36, #44–45) standardizes on **Polly v8+** (`Microsoft.Extensions.Resilience`) for retry, circuit-breaker, and timeout pipelines per driver and per device. It adds:

- A composable resilience pipeline per driver instance and per device within a driver
- Circuit-breaker state surfacing in the status dashboard
- A per-device, per-tag write-retry policy with explicit `WriteIdempotent` opt-in (never-retry by default)

The write-retry safety policy (decision #44) is a real-world constraint not in the handoff: a timeout on a write can fire after the device has already accepted the command, so blind retry of non-idempotent operations (pulses, alarm acks, recipe steps) would cause duplicate field actions. The default-to-never-retry policy with explicit opt-in is a substantive safety decision.

**Plan should say instead:** The plan should note that the implementation has standardized on Polly v8+ for per-device resilience, with a default-no-retry policy on writes that requires explicit per-tag opt-in for idempotency. This affects operator runbooks and onboarding training — operators configuring tags need to understand the `WriteIdempotent` flag's semantics.
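
The write-retry gate can be sketched as a simple attempt-count rule, with Python standing in for the Polly pipeline (the tag shape and names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TagConfig:
    name: str
    write_idempotent: bool = False  # decision #44: never retry unless opted in

def max_write_attempts(tag: TagConfig, retry_limit: int = 3) -> int:
    """A write is attempted once; retries are allowed only for idempotent tags."""
    return 1 + (retry_limit if tag.write_idempotent else 0)

assert max_write_attempts(TagConfig("pulse_start")) == 1              # default: no retry
assert max_write_attempts(TagConfig("setpoint", write_idempotent=True)) == 4
```

The safety property is the default: a pulse or alarm ack whose acknowledgment times out is never silently re-sent.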

---

### C4. Multi-identifier equipment model (5 identifiers) added beyond handoff's UUID-only spec

**Plan currently says** (handoff §"UNS Naming Hierarchy"):

> **Stable equipment UUID** — Every equipment node must expose a **stable UUIDv4** as a property: UUID is assigned once, never changes, never reused. Path can change (equipment moves, area renamed); UUID cannot. Canonical events downstream carry both UUID (for joins/lineage) and path (for dashboards/filtering).

**What was found:** Production usage requires more than UUID + path. The v2 design (`lmxopcua/docs/v2/plan.md` decisions #116–119) adds three operationally required identifiers:

- **MachineCode** — operator-facing colloquial name (e.g. `machine_001`). Required; unique within a cluster. This is what operators actually say in conversation and write in runbooks; UUID and path are not mnemonic enough for daily operations.
- **ZTag** — ERP equipment identifier. Optional; fleet-wide unique. Primary identifier for browsing in the Admin UI, per operational request.
- **SAPID** — SAP PM equipment identifier. Optional; fleet-wide unique. Required for the maintenance-system join.

All five identifiers (UUID, EquipmentId, MachineCode, ZTag, SAPID) are exposed as OPC UA properties on the equipment node, so external systems resolve by their preferred identifier without a sidecar service.

**Plan should say instead:** The handoff's UUID-only spec satisfies cross-system join requirements but underspecifies operator and external-system identifier needs. The plan should incorporate the multi-identifier model:

- Permanent: `EquipmentUuid` (UUIDv4, immutable, for downstream events / canonical-model joins)
- Operational: `MachineCode` (operator colloquial name, required)
- ERP integration: `ZTag` (optional, ERP-allocated, fleet-wide unique)
- SAP PM integration: `SAPID` (optional, SAP-allocated, fleet-wide unique)
- Internal config key: `EquipmentId` (immutable after publish, internal logical key for cross-generation diffs)

**Resolved TBD:** This also resolves the implicit question "how do operators find equipment in dashboards / on the radio" — the answer is MachineCode, surfaced prominently in the Admin UI alongside ZTag.

---

### C5. Consumer cutover plan (Tier 1 / 2 / 3) is not addressed in v2 design docs

**Plan currently says** (handoff §"Rollout Posture", §"Roadmap Summary"):

> Tier 1 ScadaBridge cutover begins Year 1. Tier 2 Ignition begins Year 2. Tier 3 System Platform IO Year 3.

**What was found:** The v2 design docs (`lmxopcua/docs/v2/plan.md` Phases 0–5) cover building the OtOpcUa server, drivers, configuration, and Admin UI — but **do not address consumer cutover at all**. There is no plan for:

- ScadaBridge migration sequencing per equipment / per site
- Per-site cutover validation (proving consumers see equivalent data through OtOpcUa as they did via direct sessions)
- Rollback procedures if a cutover causes a consumer regression
- Coordination with Aveva on System Platform IO cutover (Tier 3 — the most opinionated consumer, per the handoff)
- Operational runbooks for what to do when a consumer can't connect / fails over

**Plan should say instead:** The v2 design should add cutover phases:

- **Phase 6 — Tier 1 (ScadaBridge) cutover** — per-site sequencing, validation methodology, rollback procedure
- **Phase 7 — Tier 2 (Ignition) cutover** — WAN-session-collapse validation
- **Phase 8 — Tier 3 (System Platform IO) cutover** — Aveva validation prerequisite, compliance review

Or alternatively, the cutover plan lives outside the OtOpcUa v2 design and is owned by an integration / operations team — in which case the handoff should make that ownership explicit.

**Recommendation:** the cutover plan needs an owner and a doc. Either the OtOpcUa team owns it (add Phases 6–8 to the v2 plan) or another team owns it (the handoff should name them and link the doc).

**Resolution** (lmxopcua decision #136, 2026-04-17): consumer cutover is **OUT of v2 scope** for the OtOpcUa team. The OtOpcUa team's responsibility ends at Phase 5 — all drivers built, all stability protections in place, full Admin UI shipped including the data-path ACL editor (per the §B1 resolution). Cutover sequencing per site, validation methodology, rollback procedures, and Aveva-pattern validation for tier 3 (System Platform IO) are deliverables of a separate **integration / operations team** that has yet to be named.

The plan should explicitly assign ownership of the cutover plan to that team and link to their (forthcoming) doc, replacing any wording that implies the OtOpcUa team owns end-to-end through tier 3. The handoff §"Rollout Posture" tier 1/2/3 sequencing remains the authoritative high-level roadmap; the implementation-level plan for it lives outside OtOpcUa's docs.

---

### C6. Per-building cluster pattern (Warsaw) — UNS path interaction needs clarification

**Plan currently says** (handoff §"Deployment Footprint"):

> Largest sites (Warsaw West, Warsaw North) run **one cluster per production building**

And §"UNS Naming Hierarchy":

> Level 3 — Area — `bldg-3`

**What was found:** The v2 design models clusters with `Enterprise` + `Site` columns (`lmxopcua/docs/v2/config-db-schema.md` ServerCluster table). For Warsaw with per-building clusters, the cleanest mapping is:

- Cluster A: Site = `warsaw-west`, all equipment in this cluster under UnsArea = `bldg-3`
- Cluster B: Site = `warsaw-west`, all equipment in this cluster under UnsArea = `bldg-4`
- ...etc.

This works structurally, but raises questions:

1. **ZTag fleet-wide uniqueness**: if Warsaw West building 3 and Warsaw West building 4 are separate clusters but the same physical site, can the same ZTag appear in both? Per the v2 design, no — ZTag is fleet-wide unique. Operationally that should hold (ERP doesn't reuse identifiers across buildings), but it is worth confirming.
2. **Cross-cluster consumer consistency**: a ScadaBridge instance reading equipment data for the whole Warsaw West site would need to connect to N clusters (one per building) and stitch the data together. The handoff's "site-local aggregation" responsibility (§"Responsibilities") is satisfied per-cluster but not site-wide.
3. **UNS path uniqueness**: with one cluster per building, equipment paths `ent/warsaw-west/bldg-3/line-2/cnc-mill-05` and `ent/warsaw-west/bldg-4/line-2/cnc-mill-05` are unique by the Area segment. But two clusters serve overlapping path prefixes (`ent/warsaw-west/...`) — clients connecting to building 3's cluster can't see building 4's equipment. That's correct by intent, but downstream consumers expecting site-wide visibility need to know they connect per-cluster.

**Plan should say instead:** The handoff should clarify:

- Whether ZTag fleet-wide uniqueness extends across per-building clusters at the same site (assumed yes; confirm with the ERP team)
- That consumers needing site-wide visibility at Warsaw campuses will connect to multiple clusters, not one
- Whether the per-building cluster decision is a constraint to optimize for or a cost to minimize (e.g. would a single Warsaw-West cluster with 4 buildings' worth of equipment be feasible if hardware allowed?)

---

## D. Resolved TBDs

### D1. Pilot equipment class for first canonical definition

**Plan TBD:** "Pilot equipment class for the first canonical definition" (handoff §"Open Questions / TBDs").

**Proposal:** **FANUC CNC** is the natural pilot. The v2 design (`lmxopcua/docs/v2/driver-specs.md` §7 "FANUC FOCAS Driver") already specifies a **fixed pre-defined node hierarchy** for FOCAS — Identity, Status, Axes, Spindle, Program, Tool, Alarms, PMC, and Macro categories — populated by specific FOCAS2 API calls. This is essentially a class template already; the schemas repo just needs to formalize it.
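
As a sketch only, that category layout could be captured in a template skeleton like the following. The category names come from the `driver-specs.md` §7 hierarchy; the signal names under each category are invented placeholders:

```json
{
  "class": "fanuc-cnc",
  "categories": {
    "Identity": ["model", "serial_number"],
    "Status":   ["mode", "run_state"],
    "Axes":     ["position", "load"],
    "Spindle":  ["speed_rpm", "load_percent"],
    "Program":  ["active_program", "block_number"],
    "Tool":     ["active_tool", "offset_group"],
    "Alarms":   ["active_alarms"],
    "PMC":      ["pmc_registers"],
    "Macro":    ["macro_variables"]
  }
}
```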

**Why FANUC CNC over alternatives:**

- A pre-defined hierarchy already exists in the driver design — no greenfield template work
- Single vendor with a well-known API surface (the FOCAS2 spec) — low ambiguity
- Equipment count is finite (CNCs per site are countable, not in the hundreds like generic Modbus instances)
- Tier C out-of-process driver design — the failure-mode boundary is clean for piloting class-template-driven config

**Why not Modbus or AB CIP first:** these have no inherent class — every device has a hand-curated tag list. Picking them for the pilot would mean inventing a class taxonomy from scratch, which is exactly the schemas-repo team's job, not the OtOpcUa team's.

---
### D2. Storage format for the hierarchy in the `schemas` repo

**Plan TBD:** "Storage format for the hierarchy in the `schemas` repo (YAML vs Protobuf vs both)" (handoff §"Open Questions / TBDs").

**Proposal:** **JSON Schema (.json files, version-controlled in the schemas repo)** as the authoritative format, with Protobuf code generation for runtime serialization where needed (e.g. Redpanda event schemas).

**Rationale:**

- Idiomatic for .NET (System.Text.Json and System.Text.Json.Schema in .NET 10) — OtOpcUa reads templates with no extra dependencies
- Idiomatic for CI tooling (every CI runner can `jq` / validate JSON Schema without an extra toolchain)
- Supports validation at multiple layers (operator-visible Admin UI errors, schemas-repo CI gates, OtOpcUa runtime validation)
- Protobuf is better for *wire* serialization (size, speed, generated code) but worse for *authoring* (binary instance data, requires a .proto compiler, poor merge story in git). For a repo where humans author equipment-class definitions, JSON Schema wins.
- If Protobuf is needed downstream (Redpanda events with Protobuf-defined schemas), it can be code-generated from the JSON Schema authoring source. One-way derivation is simpler than bidirectional sync.

**Where it lives in OtOpcUa:** at startup, OtOpcUa nodes fetch the relevant equipment-class templates from the schemas repo (cached locally) and use them to validate `Equipment.EquipmentClassRef` against actual tag mappings. Drift between OtOpcUa's tag set and the schema is surfaced as a config validation error.
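
The drift check described above reduces to set differences. A sketch, assuming a template's required signal names and an equipment's configured tag names are both available as sets (all names are illustrative):

```python
def template_drift(required_signals: set[str],
                   configured_tags: set[str]) -> dict[str, set[str]]:
    """Compare an equipment's configured tag set against its class template."""
    return {
        "missing": required_signals - configured_tags,  # compliance gaps
        "extra": configured_tags - required_signals,    # undeclared signals
    }

drift = template_drift(
    required_signals={"spindle_speed_rpm", "active_program"},
    configured_tags={"spindle_speed_rpm", "coolant_temp"},
)
assert drift["missing"] == {"active_program"}   # would surface as a validation error
assert drift["extra"] == {"coolant_temp"}
```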
|
||
|
||
---
|
||
|
||
### D3. Where namespace ACL definitions live
|
||
|
||
**Plan TBD:** "Whether namespace ACL definitions live alongside driver/topology config or in their own governance surface" (handoff §"Open Questions / TBDs").
|
||
|
||
**Proposal:** **Live in the central config DB alongside DriverInstance/Equipment/Tag**, in a new `EquipmentAcl` table per cluster. Operationally co-located with the rest of the cluster's config; managed through the same Admin UI; edited through the same draft → diff → publish flow; generation-versioned for the same auditability and rollback safety.
|
||
|
||
**Rationale:**
|
||
- A separate governance surface means a separate auth layer, separate UI, separate audit log, separate rollback workflow — meaningful operational tax for unclear benefit.
|
||
- Co-locating means one publish includes both topology changes and ACL changes atomically, so a new equipment is added AND its ACL is granted in one transaction. Avoids the "equipment exists for 30 seconds with default-deny" race window that a separate-governance-surface model would create.
|
||
- Per `lmxopcua/docs/v2/plan.md` decision #105, cluster-scoped admin grants already live in the central DB — extending the same model to data-path ACLs is the consistent choice.
**Open sub-question:** ACL granularity — per Equipment? per UnsLine? per UnsArea? per Namespace? Suggest supporting all four levels with inheritance (grant at UnsArea cascades to all UnsLines + Equipment beneath, unless overridden).
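The cascade-with-override model suggested above can be sketched as a most-specific-wins lookup. This is an illustrative Python sketch only: the real evaluator would live in the .NET config service, and the `AclGrant` shape, scope names, and example identifiers below are hypothetical.

```python
from dataclasses import dataclass

# Most-specific to least-specific scope order for resolution.
SCOPE_ORDER = ["equipment", "uns_line", "uns_area", "namespace"]

@dataclass(frozen=True)
class AclGrant:
    scope: str      # one of SCOPE_ORDER
    target: str     # scope identifier, e.g. an equipment UUID or area name
    principal: str  # group / role the grant applies to
    allow: bool     # True = grant; False = explicit override-deny

def effective_access(grants: list[AclGrant], principal: str,
                     path: dict[str, str]) -> bool:
    """Resolve access for `principal` against an equipment path such as
    {"namespace": ..., "uns_area": ..., "uns_line": ..., "equipment": ...}.
    The most specific matching grant wins; no match means default-deny."""
    for scope in SCOPE_ORDER:  # walk from most to least specific
        for g in grants:
            if (g.scope == scope and g.principal == principal
                    and g.target == path.get(scope)):
                return g.allow
    return False  # default-deny

grants = [
    AclGrant("uns_area", "PressShop", "scada-readers", True),
    AclGrant("equipment", "eq-123", "scada-readers", False),  # override-deny
]
path = {"namespace": "ns1", "uns_area": "PressShop",
        "uns_line": "Line2", "equipment": "eq-456"}
assert effective_access(grants, "scada-readers", path) is True   # inherited from area
path = {**path, "equipment": "eq-123"}
assert effective_access(grants, "scada-readers", path) is False  # overridden
```

Note the sketch deliberately makes "no matching grant" fall through to deny, matching the default-deny posture the proposal assumes.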
This proposal also addresses B1 (the missing ACL constraint) — they're the same gap, viewed from two directions.

---
### D4. Enterprise shortname for UNS hierarchy root
**Plan TBD:** "Enterprise shortname for UNS hierarchy root (currently `ent` placeholder)" (handoff §"Open Questions / TBDs").
**Status: RESOLVED (2026-04-17) — `zb`.** Confirmed as the org's enterprise shortname matching the existing `ZB.MOM.WW.*` namespace prefix used throughout the codebase. Short by design (level-1 segment appears in every equipment path so brevity matters); operators already say "ZB" colloquially; no introduction of a new identifier vocabulary.
Propagated through:
- `goal-state.md` UNS hierarchy table (level 1 example), worked-example paths (text + OPC UA browse forms), small-site placeholder example
- `lmxopcua/docs/v2/plan.md` UNS section browse-path example + Namespace schema sketch NamespaceUri example
- `lmxopcua/docs/v2/admin-ui.md` cluster-create workflow form (Enterprise field default-prefilled `zb`)
- `lmxopcua/docs/v2/config-db-schema.md` ServerCluster.Enterprise column comment
- `3yearplan/schemas/uns/example-warsaw-west.json` `enterprise: "zb"`
- This corrections doc D4 entry
Production deployments use `zb` from cluster-create. The hardcoded `_default` reserved-segment rule is unchanged (still the placeholder for unused Area / Line levels at single-cluster sites).
**Recommendation:** none outstanding. The original recommendation (resolve before tier 1 ScadaBridge cutover at the first site) is satisfied by the resolution above.

---
## E. New TBDs
Questions the plan didn't think to ask but should have, surfaced during v2 design.
### E1. EquipmentUuid generation authority — OtOpcUa or external system?
**Question:** Who allocates `EquipmentUuid`? The v2 design (`lmxopcua/docs/v2/admin-ui.md`) auto-generates UUIDv4 in the Admin UI on equipment creation. But if equipment is also tracked in ERP (ZTag) or SAP PM (SAPID), those systems may have their own equipment identifiers and may want to be the authoritative UUID source.
**Implication:** if ERP/SAP becomes UUID authority, OtOpcUa Admin UI shifts from "generate UUID on create" to "look up UUID from external system; reject creation if equipment not yet in ERP". That's a different operational flow.
**Suggested resolution:** OtOpcUa generates UUIDs by default (current v2 design); plan should note that if ERP/SAP take authoritative ownership of equipment registration in the future, the UUID-generation policy is configurable per cluster.

---
### E2. Tier 3 (System Platform IO) cutover — Aveva-supported pattern verification
**Question:** Does Aveva System Platform IO support consuming equipment data from another OPC UA server (OtOpcUa) instead of from equipment directly? The handoff §"Downstream Consumer Impact" says this "needs validation against Aveva's supported patterns — System Platform is the most opinionated consumer."
**Recommendation:** plan should commit to a research deliverable in Year 1 or Year 2 (well ahead of Year 3 tier 3 cutover): "Validate with Aveva that System Platform IO drivers support upstream OPC UA-server data sources, including any restrictions on security mode, namespace structure, or session model." If Aveva's pattern requires something OtOpcUa doesn't expose, that's a long-lead-time discovery.

---
### E3. Site-wide vs per-cluster consumer addressing at multi-cluster sites
**Question:** At Warsaw West / Warsaw North with per-building clusters, how do consumers that need site-wide equipment visibility address the equipment? The current v2 design has each cluster with its own endpoints + namespace; a site-wide consumer must connect to N clusters.
**Possible resolutions** (out of scope for v2 design but should be on someone's plate):
- (a) Configure consumer-side templates to enumerate all per-building clusters and stitch — current expected pattern but adds consumer-side complexity
- (b) Build a site-aggregator OtOpcUa instance that consumes from per-building clusters via the OPC UA Client gateway driver and re-exposes a unified site namespace — second-order aggregation, doable with our existing toolset but operational complexity
- (c) Reconsider per-building clustering — if hardware allows, single site cluster is simpler
**Recommendation:** flag as Year 1 design discussion before per-building clusters are committed at Warsaw. If (b) becomes the answer, OtOpcUa's architecture already supports it (the OpcUaClient driver was designed for exactly this gateway-of-gateways scenario).

---
### E4. Cluster-as-single-endpoint vs per-node-endpoints — clarify handoff mental model
(Cross-references A2.) Even if we agree non-transparent redundancy is the right v2 model, the handoff's wording ("single endpoint", "unified OPC UA endpoint") implies a single URL. Worth deciding whether to:
- Update the handoff wording to acknowledge two endpoints per cluster (operationally accurate per OPC UA non-transparent spec), OR
- Add an explicit roadmap line item to introduce a VIP/load-balancer in front of each cluster for transparent redundancy (substantial new infrastructure)
**Recommendation:** update the handoff wording. Transparent redundancy is a meaningful infrastructure investment for marginal client-side simplification (ScadaBridge / Ignition / System Platform IO are all sophisticated OPC UA clients that can handle ServerUriArray + ServiceLevel-driven failover).

---
## Summary
This batch:

| Category | Count | Notes |
|----------|------:|-------|
| A. Inaccuracies | 2 | Both wording/framing issues; no architectural conflict |
| B. Missing constraints | 3 | ACLs (substantial gap), schemas-repo dependencies, certificate trust pre-cutover step |
| C. Architectural decisions to revisit | 6 | Driver list pre-survey, stability tiers, Polly resilience, multi-identifier model, missing cutover plan, per-building cluster interactions |
| D. Resolved TBDs | 4 | Pilot class (FANUC CNC), schemas repo format (JSON Schema), ACL location (central config DB), enterprise shortname (`zb`, RESOLVED 2026-04-17) |
| E. New TBDs | 4 | UUID-generation authority, System Platform IO Aveva validation, multi-cluster site addressing, cluster-endpoint mental model |
**Most urgent for plan integration:** B1 (missing ACL surface) and C5 (missing consumer cutover plan) — both are large work items the v2 implementation design discovered are needed but doesn't currently own. They should be assigned (to OtOpcUa team or another team) before Year 1 tier 1 cutover begins.
**Reference:** all v2 design decisions are in `lmxopcua/docs/v2/plan.md` (decision log §). Specific cross-references in this doc cite decision numbers (#XX) where applicable.

---
## Addendum — v2 design hardening, same day (2026-04-17)
After this corrections doc was filed, an adversarial review of the v2 db schema and Admin UI surfaced four internal design defects (one critical, three high) that the OtOpcUa team has now closed. None of these are corrections back to the handoff — they are internal design corrections within the v2 work — but two of them refine claims this corrections doc made and are worth flagging for plan-team awareness:
### Affects C4 (multi-identifier equipment model) — `EquipmentId` is now system-generated, not operator-supplied
C4 above describes a 5-identifier equipment model (UUID, EquipmentId, MachineCode, ZTag, SAPID) and implies `EquipmentId` is operator-set with the other operator-facing identifiers. The hardened design (lmxopcua decision #125) makes `EquipmentId` **system-generated** as `'EQ-' + first 12 hex chars of EquipmentUuid`, never operator-supplied, never editable, never present in CSV imports. The operator-facing identifiers are now MachineCode, ZTag, SAPID — three operator-set fields, two system-generated (EquipmentId, EquipmentUuid).
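The derivation rule from decision #125 is mechanical, and a short sketch makes the shape concrete (the authoritative implementation is in the config service; this is a stand-in):

```python
# 'EQ-' + first 12 hex chars of the EquipmentUuid, per lmxopcua decision #125.
# System-generated, never operator-supplied, never editable.
import uuid

def equipment_id(equipment_uuid: uuid.UUID) -> str:
    return "EQ-" + equipment_uuid.hex[:12]

u = uuid.UUID("9f3c2b1a-4d5e-6f70-8123-456789abcdef")
assert equipment_id(u) == "EQ-9f3c2b1a4d5e"
```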
**Why the change**: operator-supplied EquipmentId was a corruption path — typos and bulk-import renames would mint duplicate equipment identities, each with a fresh UUID, permanently splitting downstream UUID-keyed lineage. Removing operator authoring eliminates the failure mode entirely. CSV imports now match by EquipmentUuid for updates; rows without UUID create new equipment with system-generated identifiers.
**For the plan team**: this doesn't change the audience-mapping story (operators say MachineCode in conversation, ERP queries by ZTag, etc.) — it just means there's one less operator field in the equipment-create form. If any plan-level documentation describes EquipmentId as operator-managed, update to "system-generated".
### Affects D3 (ACL location) and adds a new architectural concept — `ExternalIdReservation` table for rollback-safe identifier uniqueness
D3 above proposes ACL definitions live alongside topology in the central config DB, generation-versioned. The same review surfaced that **fleet-wide uniqueness for ZTag and SAPID cannot be expressed within generation-versioned tables** because old generations and disabled equipment can still hold the same external IDs — rollback or re-enable then silently reintroduces duplicates that corrupt downstream ERP/SAP joins.
The hardened design (lmxopcua decision #124) introduces a new `ExternalIdReservation` table that sits **outside generation versioning** specifically for this rollback-safety property. `sp_PublishGeneration` reserves IDs atomically at publish; FleetAdmin-only `sp_ReleaseExternalIdReservation` (audit-logged, requires reason) is the only path to free a value for reuse by a different EquipmentUuid; rollback respects the reservation table.
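A toy model of the reservation invariant helps show why it must live outside generation versioning. This is a Python stand-in, not the T-SQL implementation; method names only loosely mirror the stored procedures named above.

```python
# Reservations bind an external ID (ZTag/SAPID) to exactly one EquipmentUuid
# outside generation versioning, so rolling back a generation cannot silently
# re-mint the ID for a different machine.
class ReservationError(Exception):
    pass

class ExternalIdReservations:
    def __init__(self):
        self._held: dict[tuple[str, str], str] = {}  # (kind, value) -> owner uuid

    def reserve(self, kind: str, value: str, equipment_uuid: str) -> None:
        """Called at publish time (cf. sp_PublishGeneration): idempotent for
        the same owner, rejected for a different EquipmentUuid."""
        owner = self._held.get((kind, value))
        if owner is not None and owner != equipment_uuid:
            raise ReservationError(f"{kind}={value} already reserved")
        self._held[(kind, value)] = equipment_uuid

    def release(self, kind: str, value: str, reason: str) -> None:
        """FleetAdmin-only escape hatch (cf. sp_ReleaseExternalIdReservation);
        audit-logged, so a reason is mandatory."""
        assert reason, "release must carry an audit reason"
        self._held.pop((kind, value), None)

r = ExternalIdReservations()
r.reserve("ZTag", "Z-1001", "uuid-A")
r.reserve("ZTag", "Z-1001", "uuid-A")      # same owner: idempotent
try:
    r.reserve("ZTag", "Z-1001", "uuid-B")  # different owner: rejected,
    raise AssertionError("should have raised")
except ReservationError:
    pass                                   # even after a generation rollback
r.release("ZTag", "Z-1001", reason="machine decommissioned")
r.reserve("ZTag", "Z-1001", "uuid-B")      # only now allowed
```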
**For the plan team**: this is a precedent that some cross-generation invariants need their own non-versioned tables. When the missing ACL design is being scoped (per B1 / C5 above), consider whether *any* ACL grant has a similar rollback-reuse hazard. (Initial intuition: ACLs probably don't — granting a permission and then revoking it doesn't create downstream join corruption the way an ID does. But worth checking.)
### Two purely-internal fixes (no plan-team relevance)
For completeness and audit trail:
- **Same-cluster invariant on `DriverInstance.NamespaceId`** (lmxopcua decision #122) — closes a critical cross-cluster trust-boundary leak in the schema where a draft for cluster A could bind to cluster B's namespace and leak its URI. Three-layer enforcement (sp_ValidateDraft + API scoping + audit log).
- **`Namespace` table moves from cluster-level to generation-versioned** (lmxopcua decision #123) — earlier draft mistakenly treated namespaces as cluster-topology like ClusterNode rows. They are consumer-visible content (define what consumers see at the OPC UA endpoint) and must travel through draft → diff → publish like every other consumer-visible config.
Neither of these affects the handoff or this corrections doc directly.
### Updated summary

| Category | Count | Notes |
|----------|------:|-------|
| A. Inaccuracies | 2 | Both wording/framing issues; no architectural conflict |
| B. Missing constraints | 3 | ACLs, schemas-repo dependencies, certificate trust pre-cutover |
| C. Architectural decisions to revisit | 6 | Driver list pre-survey, stability tiers, Polly resilience, multi-identifier model (now refined per addendum), missing cutover plan, per-building cluster interactions |
| D. Resolved TBDs | 4 | Pilot class, schemas repo format, ACL location, **D4 enterprise shortname = `zb` RESOLVED 2026-04-17** |
| E. New TBDs | 4 | UUID-gen authority, Aveva validation, multi-cluster site addressing, cluster-endpoint mental model |
| **Addendum hardening fixes** | **4** | EquipmentId system-generated; ExternalIdReservation table; same-cluster namespace binding; Namespace generation-versioned |
The hardening fixes are committed in `lmxopcua` branch `v2` at commit `a59ad2e` (2026-04-17). Decisions #122–125 in `lmxopcua/docs/v2/plan.md` carry the rationale.

---
## Round 3 additions (2026-04-17, post-integration)
After the plan team integrated the original 19 corrections + 4 hardening fixes, the OtOpcUa team made a follow-on set of additions that landed directly in the plan files (`goal-state.md`, `roadmap.md`) plus the `schemas/` seed. Captured here for audit trail; no further action required from the plan team.
**ACL design committed** (lmxopcua decisions #129–132, closes B1 fully) — `NodePermissions` bitmask covering Browse / Read / Subscribe / HistoryRead / WriteOperate / WriteTune / WriteConfigure / AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall + bundles, 6-level scope hierarchy with default-deny + additive grants, generation-versioned `NodeAcl` table, cluster-create workflow seeding the v1 LDAP-role-to-permission map for v1 → v2 consumer migration parity, Admin UI ACL tab + bulk grant + permission simulator. Phase 1 ships the schema + Admin UI + evaluator unit tests; per-driver enforcement lands in each driver's phase. Doc: `lmxopcua/docs/v2/acl-design.md`.
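The NodePermissions bitmask shape can be sketched as follows. Permission names come from the decisions #129–132 summary above; the bit assignments and the bundle name are illustrative, not the real column values.

```python
from enum import IntFlag

# Illustrative bit layout for the NodePermissions bitmask; actual values
# are defined in lmxopcua/docs/v2/acl-design.md.
class NodePermissions(IntFlag):
    Browse = 1 << 0
    Read = 1 << 1
    Subscribe = 1 << 2
    HistoryRead = 1 << 3
    WriteOperate = 1 << 4
    WriteTune = 1 << 5
    WriteConfigure = 1 << 6
    AlarmRead = 1 << 7
    AlarmAcknowledge = 1 << 8
    AlarmConfirm = 1 << 9
    AlarmShelve = 1 << 10
    MethodCall = 1 << 11

# Hypothetical read-only bundle.
OBSERVER = (NodePermissions.Browse | NodePermissions.Read
            | NodePermissions.Subscribe | NodePermissions.AlarmRead)

# Default-deny + additive grants: effective permission set is the union
# of all grants that match the principal/scope.
effective = NodePermissions(0) | OBSERVER | NodePermissions.HistoryRead
assert NodePermissions.Read in effective
assert NodePermissions.WriteOperate not in effective
```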
**Dev-environment two-tier model** (decisions #133–137) — inner-loop on developer machines (in-process simulators only) + integration on a single dedicated Windows host with Docker WSL2 backend so TwinCAT XAR VM can run in Hyper-V alongside containerized simulators. Galaxy testing stays on developer machines that have local Aveva licenses; integration host doesn't carry the license. Doc: `lmxopcua/docs/v2/dev-environment.md`.
**Cutover removed from OtOpcUa v2 scope** (decision #136, closes C5 fully) — owned by a separate integration / operations team (not yet named). OtOpcUa team's responsibility ends at Phase 5 (all drivers built, all stability protections in place, full Admin UI shipped including ACL editor). Already integrated into `roadmap.md` line 66 by the plan team in commit `68dbc01`.
**Schemas-repo seed** at `3yearplan/schemas/` (closes B2 partially — content available, owner team naming + dedicated repo creation still pending). Includes JSON Schema format definitions, FANUC CNC pilot, worked UNS subtree example, documentation. Already integrated into `goal-state.md` line 574 by the plan team in commit `dee56a6`.
**`_base` equipment-class template + OPC 40010 alignment** (lmxopcua decisions #138–139, builds on B2 resolution) — universal cross-machine baseline that every other class extends. References OPC UA Companion Spec OPC 40010 (Machinery) for the Identification component + MachineryOperationMode enum, OPC UA Part 9 for alarm-summary fields, ISO 22400 for lifetime counters that feed Availability + Performance KPIs, the canonical state vocabulary from this handoff §"Canonical Model Integration". Equipment table extended with 9 nullable OPC 40010 identity columns (Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri); drivers that can read these dynamically (FANUC `cnc_sysinfo()`, Beckhoff `TwinCAT.SystemInfo`, etc.) override the static value at runtime. `_base` declares 27 signals across Identity / Status / Alarm / Diagnostic / Counter / Process categories + 2 universal alarms (communication-loss, data-stale) + the canonical state vocabulary. Already integrated into `goal-state.md` line 574 (schemas-repo seed paragraph) and `goal-state.md` line 156 (multi-identifier section gains the OPC 40010 fields paragraph) and `roadmap.md` line 66/67 by this commit.
### Updated summary

| Category | Count | Notes |
|----------|------:|-------|
| A. Inaccuracies | 2 | Both wording/framing issues; no architectural conflict |
| B. Missing constraints | 3 | B1 ACLs **CLOSED** (#129–132); B2 schemas-repo **PARTIALLY CLOSED** (seed contributed; owner-team + dedicated-repo TBD); B3 cert-distribution remains operational concern |
| C. Architectural decisions to revisit | 6 | C1 driver list **CLOSED** (#128); C5 cutover scope **CLOSED** (#136 — out of v2 scope); others still flagged |
| D. Resolved TBDs | 4 | Pilot class, schemas repo format, ACL location, **D4 enterprise shortname = `zb` RESOLVED 2026-04-17** |
| E. New TBDs | 4 | UUID-gen authority, Aveva validation, multi-cluster site addressing, cluster-endpoint mental model |
| **Addendum hardening fixes** | **4** | EquipmentId system-generated; ExternalIdReservation table; same-cluster namespace binding; Namespace generation-versioned |
| **Round 3 additions** | **4** | ACL design (#129–132); dev-environment two-tier (#133–137); cutover scope removal (#136); `_base` template + OPC 40010 columns (#138–139) |
The Round 3 additions are committed in `lmxopcua` branch `v2` at commits `4903a19` (ACL + dev-env + cutover removal) and `d8fa3a0` (Equipment OPC 40010 columns + Identification panel), and in `3yearplan` at commits `5953685` (schemas seed) and `cd85159` (`_base` template + OPC 40010 alignment + format-decisions D9 + D10).