After shipping the four Phase 6 plan drafts (PRs 77-80), the adversarial-review
adjustments lived only as trailing "Review" sections. An implementer reading
Stream A would find the original unadjusted guidance, then have to cross-reference
the review to reconcile. This PR makes the plans genuinely executable:
1. Merges every ACCEPTed review finding into the actual Scope / Stream / Compliance
sections of each phase plan:
- phase-6-1: Scope table rewrite (per-capability retry, (instance,host) pipeline key,
MemoryTracking vs MemoryRecycle split, hybrid watchdog formula, demand-aware
wedge detector, generation-sealed LiteDB). Streams A/B/D + Compliance rewritten.
- phase-6-2: AuthorizationDecision tri-state, control/data-plane separation,
MembershipFreshnessInterval (15 min), AuthCacheMaxStaleness (5 min),
subscription stamp-and-reevaluate. Stream C widened to 11 OPC UA operations.
- phase-6-3: 8-state ServiceLevel matrix (OPC UA Part 5 §6.3.34-compliant),
two-layer peer probe (/healthz + UaHealthProbe), apply-lease via await using,
publish-generation fencing, InvalidTopology runtime state, ServerUriArray
self-first + peers. New Stream F (interop matrix + Galaxy failover).
- phase-6-4: DraftRevisionToken concurrency control, staged-import via
EquipmentImportBatch with user-scoped visibility, CSV header version marker,
decision-#117-aligned identifier columns, 1000-row diff cap,
decision-#139 OPC 40010 fields, Identification inherits Equipment ACL.
2. Appends decisions #143 through #162 to docs/v2/plan.md capturing the
architectural commitments the adjustments created. Each decision carries its
dated rationale so future readers know why the choice was made.
3. Scaffolds scripts/compliance/phase-6-{1,2,3,4}-compliance.ps1 — PowerShell
stubs with Assert-Todo / Assert-Pass / Assert-Fail helpers. Every check
maps to a Stream task ID from the corresponding phase plan. Currently all
checks are TODO and scripts exit 0; each implementation task is responsible
for replacing its TODO with a real check before closing that task. Saved
as UTF-8 with BOM so Windows PowerShell 5.1 parses em-dash characters
without breaking.
Net result: the Phase 6.1 plan is genuinely ready to execute. Stream A.3 can
start tomorrow without reconciling Streams vs. Review on every task; the
compliance script is wired to the Stream IDs; plan.md has the architectural
commitments that justify the Stream choices.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Next Phase Plan — OtOpcUa v2: Multi-Driver Architecture
Status: DRAFT — brainstorming in progress, do NOT execute until explicitly approved.
Branch: v2
Created: 2026-04-16
Vision
Rename from LmxOpcUa to OtOpcUa and evolve from a single-protocol OPC UA server (Galaxy/MXAccess only) into a multi-driver OPC UA server where:
- The common core owns the OPC UA server, address space management, session/security/subscription machinery, and client-facing concerns.
- Driver modules are pluggable backends that each know how to connect to a specific data source, discover its tags/hierarchy, and shuttle live data back through the core to OPC UA clients.
- Drivers implement composable capability interfaces — a driver only implements what it supports (e.g. subscriptions, alarms, history).
- The existing Galaxy/MXAccess integration becomes the first driver module, proving the abstraction works against real production use.
Target Drivers
| Driver | Protocol | Capability Profile | Notes |
|---|---|---|---|
| Galaxy | MXAccess COM + Galaxy DB | Read, Write, Subscribe, Alarms, HDA | Existing v1 logic, out-of-process (.NET 4.8 x86) |
| Modbus TCP | MB-TCP | Read, Write, Subscribe (polled) | Flat register model, config-driven tag map. Also covers DL205 via AddressFormat=DL205 (octal translation) |
| AB CIP | EtherNet/IP CIP | Read, Write, Subscribe (polled) | ControlLogix/CompactLogix, symbolic tag addressing |
| AB Legacy | EtherNet/IP PCCC | Read, Write, Subscribe (polled) | SLC 500/MicroLogix, file-based addressing |
| Siemens S7 | S7comm (ISO-on-TCP) | Read, Write, Subscribe (polled) | S7-300/400/1200/1500, DB/M/I/Q addressing |
| TwinCAT | ADS (Beckhoff) | Read, Write, Subscribe (native) | Symbol-based, native ADS notifications |
| FOCAS | FOCAS2 (FANUC CNC) | Read, Write, Subscribe (polled) | CNC data model (axes, spindle, PMC, macros) |
| OPC UA Client | OPC UA | Read, Write, Subscribe, Alarms, HDA | Gateway/aggregation — proxy a remote server |
Driver Characteristics That Shape the Interface
| Concern | Galaxy | Modbus TCP | AB CIP | AB Legacy | S7 | TwinCAT | FOCAS | OPC UA Client |
|---|---|---|---|---|---|---|---|---|
| Tag discovery | DB query | Config DB | Config DB | Config DB | Config DB | Symbol upload | CNC query + Config DB | Browse remote |
| Hierarchy | Rich tree | Flat (user groups) | Flat or program-scoped | Flat (file-based) | Flat (DB/area) | Symbol tree | Functional (axes/spindle/PMC) | Mirror remote |
| Data types | mx_data_type | Raw registers (user-typed) | CIP typed | File-typed (N=INT16, F=FLOAT) | S7 typed | IEC 61131-3 | Scaled integers + structs | Full OPC UA |
| Native subscriptions | Yes (MXAccess) | No (polled) | No (polled) | No (polled) | No (polled) | Yes (ADS notifications) | No (polled) | Yes (OPC UA) |
| Alarms | Yes | No | No | No | No | Possible (ADS state) | Yes (CNC alarms) | Yes (A&C) |
| History | Yes (Historian) | No | No | No | No | No | No | Yes (HistoryRead) |
Note: AutomationDirect DL205 PLCs are supported by the Modbus TCP driver via AddressFormat=DL205 (octal V/X/Y/C/T/CT address translation over H2-ECOM100 module, port 502). No separate driver needed.
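The octal translation mentioned above can be made concrete. A minimal sketch, in Python for brevity: the octal-to-decimal conversion is the certain part; the exact V-word-to-holding-register offset (the common AutomationDirect 4xxxxx convention, V-word n mapping to holding register n+1) is an assumption of this sketch, and `dl205_v_to_modbus` is a hypothetical helper name.

```python
def dl205_v_to_modbus(octal_addr: str) -> int:
    """Translate a DL205 V-memory octal address to a Modbus holding-register
    address. ASSUMPTION: the common AutomationDirect 4xxxxx convention, where
    V-memory word n (octal) maps to holding register n+1 (decimal)."""
    word = int(octal_addr.lstrip("Vv"), 8)  # octal address -> decimal word offset
    return 400001 + word

# V2000 octal is word 1024 decimal -> holding register 401025
assert dl205_v_to_modbus("V2000") == 401025
assert dl205_v_to_modbus("V0") == 400001
```

The same shape would extend to X/Y/C/T/CT address classes with their own base offsets, per the H2-ECOM100 mapping table.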
Architecture — Key Decisions & Open Questions
1. Common Core Boundary
Core owns:
- OPC UA server lifecycle (startup, shutdown, session management)
- Security (transport profiles, authentication, authorization)
- Address space tree management (add/remove/update nodes)
- Subscription engine (create, publish, transfer)
- Status dashboard / health reporting
- Redundancy
- Configuration framework
- Namespace allocation per driver
Driver owns:
- Data source connection management
- Tag/hierarchy discovery
- Data type mapping (driver types → OPC UA types)
- Read/write translation
- Alarm sourcing (if supported)
- Historical data access (if supported)
Decided:
- Each driver instance manages its own polling internally — the core does not provide a shared poll scheduler.
- Multiple instances of the same driver type are supported (e.g. two Modbus TCP drivers for different device groups).
- One namespace index per driver instance (each instance gets its own
NamespaceUri).
Decided:
- Drivers register nodes via a builder/context API (`IAddressSpaceBuilder`) provided by the core. Core owns the tree; the driver streams `AddFolder`/`AddVariable` calls as it discovers nodes. Supports incremental/large address spaces without forcing the driver to buffer the whole tree.
2. Driver Capability Interfaces
Composable — a driver implements only what it supports:
IDriver — required: lifecycle, metadata, health
├── ITagDiscovery — discover tags/hierarchy from the backend
├── IReadable — on-demand read
├── IWritable — on-demand write
├── ISubscribable — data change subscriptions (native or driver-managed polling)
├── IAlarmSource — alarm events and acknowledgment
└── IHistoryProvider — historical data reads
Note: ISubscribable covers both native subscriptions (Galaxy MXAccess advisory, OPC UA monitored items) and driver-internal polled subscriptions (Modbus, AB CIP). The driver owns its polling loop — the core just sees OnDataChange callbacks regardless of mechanism.
Capability matrix:
| Interface | Galaxy | Modbus TCP | AB CIP | AB Legacy | S7 | TwinCAT | FOCAS | OPC UA Client |
|---|---|---|---|---|---|---|---|---|
| IDriver | Y | Y | Y | Y | Y | Y | Y | Y |
| ITagDiscovery | Y | Y (config DB) | Y (config DB) | Y (config DB) | Y (config DB) | Y (symbol upload) | Y (built-in + config DB) | Y (browse) |
| IReadable | Y | Y | Y | Y | Y | Y | Y | Y |
| IWritable | Y | Y | Y | Y | Y | Y | Y (limited) | Y |
| ISubscribable | Y (native) | Y (polled) | Y (polled) | Y (polled) | Y (polled) | Y (native ADS) | Y (polled) | Y (native) |
| IAlarmSource | Y | — | — | — | — | — | Y (CNC alarms) | Y |
| IHistoryProvider | Y | — | — | — | — | — | — | Y |
Decided:
- Data change callback uses shared data models (`DataValue` with value, `StatusCode` quality, timestamp). Every driver maps to the same OPC UA `StatusCode` space — drivers define which quality codes they can produce but the model is universal.
- Driver isolation: each driver instance runs independently. A crash or disconnect in one driver sets Bad quality on its own nodes only — no impact on other driver instances. The core must catch and contain driver failures.
Resilience — Polly
Decided: Use Polly v8+ (Microsoft.Extensions.Resilience) as the resilience layer across all drivers and the configuration subsystem.
Polly provides composable resilience pipelines rather than hand-rolled retry/circuit-breaker logic. Each driver instance (and each device within a driver) gets its own pipeline so failures are isolated at the finest practical level.
Where Polly applies:
| Component | Pipeline | Strategies | Purpose |
|---|---|---|---|
| Driver device connection | Per device | Retry (exp. backoff) + CircuitBreaker + Timeout | Reconnect to offline PLC/device, stop hammering after N failures, bound connection attempts |
| Driver read ops | Per device | Timeout + Retry | Reads are idempotent — retry transient failures freely |
| Driver write ops | Per device | Timeout only by default | Writes are NOT auto-retried — a timeout may fire after the device already accepted the command; replaying non-idempotent field actions (pulses, acks, recipe steps, counter increments) can cause duplicate operations |
| Driver poll loop | Per device | CircuitBreaker | When a device is consistently unreachable, open circuit and probe periodically instead of polling at full rate |
| Galaxy IPC (Proxy → Host) | Per proxy | Retry (backoff) + CircuitBreaker | Reconnect when Galaxy Host service restarts, stop retrying if Host is down for extended period |
| Config DB polling | Singleton | Retry (backoff) + Fallback (use cache) | Central DB unreachable → fall back to LiteDB cache, keep retrying in background |
| Config DB startup | Singleton | Retry (backoff) + Fallback (use cache) | If DB is briefly unavailable at startup, retry before falling back to cache |
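To illustrate the poll-loop row above (when a device is consistently unreachable, open the circuit and probe periodically instead of polling at full rate), here is a conceptual closed/open/half-open sketch in Python. It is not a substitute for Polly's CircuitBreaker strategy, just the behavior in miniature; the class and parameter names are invented.

```python
import time

class PollCircuitBreaker:
    """Conceptual closed/open/half-open breaker for a device poll loop.
    After `threshold` consecutive failures the circuit opens and the poll
    loop only probes once per `probe_interval` seconds until a success."""

    def __init__(self, threshold=5, probe_interval=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.probe_interval = probe_interval
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None -> closed

    def allow_poll(self) -> bool:
        if self.opened_at is None:
            return True  # closed: poll at full rate
        # open: allow a single probe once per interval (half-open)
        return self.clock() - self.opened_at >= self.probe_interval

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None  # reset -> closed
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                # open (or re-open after a failed probe, restarting the timer)
                self.opened_at = self.clock()
```

On break, the real pipeline would additionally set the device's tags to Bad quality; on reset, restore quality, as the integration tree below describes.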
How it integrates:
IHostedService (per driver instance)
├── Per-device ReadPipeline
│ ├── Timeout — bound how long a read can take
│ ├── Retry — transient failure recovery with jitter (SAFE: reads are idempotent)
│ └── CircuitBreaker — stop polling dead devices, probe periodically
│ on break: set device tags to Bad quality
│ on reset: resume normal polling, restore quality
│
└── Per-device WritePipeline
├── Timeout — bound how long a write can take
└── (NO retry by default) — opt-in per tag via TagConfig.WriteIdempotent = true
OR via a CAS (compare-and-set) wrapper that verifies
the device state before each retry attempt
ConfigurationService
└── ResiliencePipeline
├── Retry — transient DB connectivity issues
└── Fallback — serve from LiteDB cache on sustained outage
Write-retry policy (per the adversarial review, finding #1):
- Default: no automatic retry on writes. A timeout bubbles up as a write failure; the OPC UA client decides whether to re-issue.
- Opt-in per tag via `TagConfig.WriteIdempotent = true` — explicit assertion by the configurer that replaying the same write has no side effect (e.g. setpoint overwrite, steady-state mode selection).
- Opt-in via CAS (compare-and-set): before retrying, read the current value; retry only if the device still holds the pre-write value. Drivers whose protocol supports atomic read-modify-write (e.g. Modbus mask-write, OPC UA writes with expected-value) can plug this in.
- Documented never-retry cases: edge-triggered acks, pulse outputs, monotonic counters, recipe-step advances, alarm acknowledgments, any "fire-and-forget" command register.
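The CAS opt-in described above can be sketched as follows (Python for brevity; `read`/`write` are hypothetical stand-ins for driver protocol calls). The key property: a write whose acknowledgment timed out but which the device already accepted is never replayed.

```python
def cas_write(read, write, expected, new_value, attempts=3):
    """Compare-and-set write wrapper: retry a timed-out write only while the
    device still holds the pre-write value, so a write the device already
    accepted is never replayed as a duplicate operation."""
    for _ in range(attempts):
        if read() != expected:
            # Device moved on: our earlier write may have landed, or another
            # client wrote. Do NOT replay; report success iff the device now
            # holds the value we wanted.
            return read() == new_value
        try:
            write(new_value)
            return True
        except TimeoutError:
            continue  # re-verify the pre-write state before retrying
    return False
```

A quick mental trace: if `write` applies the value on the device but the ack times out, the next iteration's `read` no longer matches `expected`, so the wrapper stops retrying and reports success because the device holds `new_value`.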
Polly integration points:
- `Microsoft.Extensions.Resilience` for DI-friendly pipeline registration
- `TelemetryListener` feeds circuit-breaker state changes into the status dashboard (operators see which devices are in open/half-open/closed state)
- Per-driver/per-device pipeline configuration from the central config DB (retry counts, backoff intervals, circuit breaker thresholds can be tuned per device)
Decided:
- Capability discovery uses interface checks via `is` (e.g. `if (driver is IAlarmSource a) ...`). The interface is the capability — no redundant flag enum to keep in sync.
- `ITagDiscovery` is discovery-only. Drivers with a change signal (Galaxy deploy time, OPC UA server change notifications) additionally implement an optional `IRediscoverable` sub-interface; the core subscribes and rebuilds the affected subtree. Static drivers (Modbus, S7, etc. whose tags only change via a published config generation) don't implement it.
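The `is`-based capability check is C#; the same shape in a Python sketch, with `isinstance` standing in for `is`. The driver classes and `wire_alarms` helper here are illustrative, not part of the plan.

```python
from abc import ABC, abstractmethod

class IDriver(ABC):
    @abstractmethod
    def start(self): ...

class IAlarmSource(ABC):
    @abstractmethod
    def subscribe_alarms(self, on_alarm): ...

class ModbusDriver(IDriver):               # no IAlarmSource: Modbus has no alarms
    def start(self): return "started"

class GalaxyDriver(IDriver, IAlarmSource): # alarms supported
    def start(self): return "started"
    def subscribe_alarms(self, on_alarm): on_alarm("wired")

def wire_alarms(driver, on_alarm):
    """The interface IS the capability: wire alarms only if implemented.
    C# equivalent: if (driver is IAlarmSource a) a.SubscribeAlarms(...)."""
    if isinstance(driver, IAlarmSource):
        driver.subscribe_alarms(on_alarm)
        return True
    return False
```

No flag enum exists to drift out of sync with the actual implementation; adding a capability to a driver is just implementing the interface.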
3. Runtime & Target Framework
Decided: .NET 10, C#, x64 for everything — except where explicitly required.
| Component | Target | Reason |
|---|---|---|
| Core, Core.Abstractions | .NET 10 x64 | Default |
| Server | .NET 10 x64 | Default |
| Configuration | .NET 10 x64 | Default |
| Admin | .NET 10 x64 | Blazor Server |
| Driver.ModbusTcp | .NET 10 x64 | Default |
| Driver.AbCip | .NET 10 x64 | Default |
| Driver.OpcUaClient | .NET 10 x64 | Default |
| Client.CLI | .NET 10 x64 | Default |
| Client.UI | .NET 10 x64 | Avalonia |
| Driver.Galaxy | .NET Framework 4.8 x86 | MXAccess COM interop requires 32-bit |
Critical implication: The Galaxy driver cannot load in-process with a .NET 10 x64 server. It must run as an out-of-process driver — a separate .NET 4.8 x86 process that the core communicates with over IPC.
Decided: Named pipes with MessagePack serialization for IPC.
- Galaxy Host always runs on the same machine (MXAccess needs local ArchestrA Platform)
- Named pipes are fast, no port allocation, built into both .NET 4.8 (`System.IO.Pipes`) and .NET 10
- `Galaxy.Shared` defines request/response message types serialized with MessagePack over length-prefixed frames
- MessagePack-CSharp (`MessagePack` NuGet) supports .NET Framework 4.6.1+ and .NET Standard 2.0+ — works on both sides
- Compact binary format, faster than JSON, good fit for high-frequency data change callbacks
- Simpler than gRPC on .NET 4.8 (which needs the legacy `Grpc.Core` native library)
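The length-prefixed framing can be sketched as below (Python, over an in-memory buffer standing in for the named pipe). The 4-byte little-endian length header is an assumption of this sketch; the plan specifies only "length-prefixed frames". The payload is a hand-encoded MessagePack map so the example stays dependency-free.

```python
import io
import struct

def write_frame(stream, payload: bytes) -> None:
    """Length-prefixed frame: 4-byte little-endian length, then the
    (MessagePack-encoded) payload bytes. Header layout is an assumption."""
    stream.write(struct.pack("<I", len(payload)) + payload)

def read_frame(stream) -> bytes:
    """Read one frame; raise on a stream that closes mid-frame so the
    proxy can treat it as a Host disconnect."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream closed mid-header")
    (length,) = struct.unpack("<I", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("stream closed mid-frame")
    return payload

# Round-trip two frames over an in-memory buffer standing in for the pipe.
buf = io.BytesIO()
write_frame(buf, b"\x81\xa5value\x01")  # MessagePack for {"value": 1}, hand-encoded
write_frame(buf, b"ping")
buf.seek(0)
assert read_frame(buf) == b"\x81\xa5value\x01"
assert read_frame(buf) == b"ping"
```

In the real system the frame body would be produced by MessagePack-CSharp on both sides of the pipe; only the framing convention needs to match.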
Decided: Galaxy Host is a separate Windows service.
- Independent lifecycle from the OtOpcUa Server
- Can be restarted without affecting the main server or other drivers
- Galaxy.Proxy detects connection loss, sets Bad quality on Galaxy nodes, reconnects when Host comes back
- Installed/managed via standard Windows service tooling
┌──────────────────────────────────┐ named pipe ┌───────────────────────────┐
│ OtOpcUa Server (.NET 10 x64) │◄────────────►│ Galaxy Host Service │
│ Windows Service │ │ Windows Service │
│ (Microsoft.Extensions.Hosting) │ │ (.NET 4.8 x86) │
│ │ │ │
│ Core │ │ MxAccessBridge │
│ ├── Driver.ModbusTcp (in-proc)│ │ GalaxyRepository │
│ ├── Driver.AbCip (in-proc) │ │ GalaxyDriverService │
│ └── GalaxyProxy (in-proc)──┼──────────────┼──AlarmTracking │
│ │ │ HDA Plugin │
└──────────────────────────────────┘ └───────────────────────────┘
Notes for future work:
- The Proxy/Host/Shared split is a general pattern — any future driver with process-isolation requirements (bitness mismatch, unstable native dependency, license boundary) can reuse the same three-project layout.
- Reusability of `LmxNodeManager` as a "generic driver node manager" will be assessed during Phase 2 interface extraction.
4. Galaxy/MXAccess as Out-of-Process Driver
Current tightly-coupled pieces to refactor:
- `LmxNodeManager` — mixes OPC UA node management with MXAccess-specific logic
- `MxAccessBridge` — COM thread, subscriptions, reconnect
- `GalaxyRepository` — SQL queries for hierarchy/attributes
- Alarm tracking tied to MXAccess subscription model
- HDA via Wonderware Historian plugin
All of these stay in the Galaxy Host process (.NET 4.8 x86). The GalaxyProxy in the main server implements the standard driver interfaces and forwards over IPC.
Decided:
- Refactor is incremental: extract `IDriver`/`ISubscribable`/`ITagDiscovery` etc. against the existing `LmxNodeManager` first (still in-process on the v2 branch), validate the system still runs, then move the implementation behind the IPC boundary into Galaxy.Host. Keeps the system runnable at each step and de-risks the out-of-process move.
- Parity test: run the existing v1 IntegrationTests suite against the v2 Galaxy driver (same Galaxy, same expectations) plus a scripted Client.CLI walkthrough (connect / browse / read / write / subscribe / history / alarms) on a dev Galaxy. Automated regression + human-observable behavior.
Dev environment for the LmxOpcUa breakout: the Phase 0/1 dev box (DESKTOP-6JL3KKO) hosts the full AVEVA stack required to execute Phase 2 Streams D + E — 27 ArchestrA / Wonderware / AVEVA services running, including:
- aaBootstrap, aaGR (Galaxy Repository), aaLogger, aaUserValidator, aaPim, ArchestrADataStore, AsbServiceManager
- the full Historian set (aahClientAccessPoint, aahGateway, aahInSight, aahSearchIndexer, InSQLStorage, InSQLConfiguration, InSQLEventSystem, InSQLIndexing, InSQLIOServer, HistorianSearch-x64)
- SuiteLink (slssvc)
- MXAccess COM at C:\Program Files (x86)\ArchestrA\Framework\bin\ArchestrA.MXAccess.dll
- OI-Gateway at C:\Program Files (x86)\Wonderware\OI-Server\OI-Gateway\

The Phase 1 Task E.10 AppServer-via-OI-Gateway smoke test (decision #142) is therefore also runnable on the same box; no separate AVEVA test machine is required. Inventory captured in dev-environment.md.
5. Configuration Model — Centralized MSSQL + Local Cache
Deployment topology — server clusters:
Sites deploy OtOpcUa as 2-node clusters to provide non-transparent OPC UA redundancy (per v1 — RedundancySupport.Warm / Hot, no VIP/load-balancer involvement; clients see both endpoints in ServerUriArray and pick by ServiceLevel). Single-node deployments are the same model with NodeCount = 1. The config schema treats this uniformly: every server is a member of a ServerCluster with 1 or 2 ClusterNode members.
Within a cluster, both nodes serve identical address spaces — defining tags twice would invite drift — so driver definitions, device configs, tag definitions, and poll groups attach to ClusterId, not to individual nodes. Per-node overrides exist only for physical-machine settings that legitimately differ (host, port, ApplicationUri, redundancy role, machine cert) and for the rare driver setting that must differ per node (e.g. MxAccess.ClientName so Galaxy distinguishes them). Overrides are minimal by intent.
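The override merge can be sketched as a recursive overlay (Python for brevity). The deep-merge semantics here, where nested objects merge key-by-key and override scalars win, are an assumption; the plan says only that overrides are merged onto cluster-level configs at apply time.

```python
def merge_overrides(cluster_cfg: dict, node_overrides: dict) -> dict:
    """Overlay per-node override JSON (ClusterNode.DriverConfigOverridesJson)
    onto the cluster-level DriverConfig. Nested objects merge key-by-key;
    scalars and lists in the override win outright. Deep-merge semantics are
    an assumption of this sketch."""
    merged = dict(cluster_cfg)
    for key, value in node_overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged

cluster = {"MxAccess": {"ClientName": "OtOpcUa", "TimeoutMs": 5000}}
node_a  = {"MxAccess": {"ClientName": "OtOpcUa-A"}}  # only what differs per node
assert merge_overrides(cluster, node_a) == {
    "MxAccess": {"ClientName": "OtOpcUa-A", "TimeoutMs": 5000}
}
```

The "minimal by intent" rule falls out naturally: a node override file only names the keys that differ, and everything else flows from the cluster config.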
Namespaces — two today, extensible to N:
Each cluster serves multiple OPC UA namespaces through a single endpoint, per the 3-year-plan handoff (handoffs/otopcua-handoff.md §4). At v2.0 GA there are two namespace kinds:
| Kind | Source | Purpose |
|---|---|---|
| Equipment | New drivers (Modbus, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client when gatewaying raw data) | Raw equipment data — no deadbanding, no aggregation, no business meaning. The OT-side surface of the canonical model. |
| SystemPlatform | Galaxy driver (existing v1 LmxOpcUa functionality, folded in) | Processed data tap — Aveva System Platform objects exposed as OPC UA so OPC UA-native consumers read derived state through the same endpoint as raw equipment data. |
Future kinds — Simulated is named in the plan as a next addition (replay historical equipment data to exercise tier-1/tier-2 consumers without physical equipment). Architecturally supported, not committed for v2.0 build. The schema models namespace as a first-class entity (Namespace table) so adding a third kind is a config-DB row insert + driver wiring, not a structural refactor.
A cluster always has at most one namespace per kind (UNIQUE on ClusterId, Kind). Each DriverInstance is bound to exactly one NamespaceId; a driver type is restricted to the namespace kinds it can populate (Galaxy → SystemPlatform; all native-protocol drivers → Equipment; OPC UA Client → either, by config).
UNS naming hierarchy — mandatory in the Equipment namespace:
Per the 3-year-plan handoff §12, the Equipment namespace browse paths must conform to the canonical 5-level Unified Namespace structure:
| Level | Name | Source | Example |
|---|---|---|---|
| 1 | Enterprise | `ServerCluster.Enterprise` | `zb` |
| 2 | Site | `ServerCluster.Site` | `warsaw-west` |
| 3 | Area | `UnsArea.Name` (first-class table) | `bldg-3` or `_default` |
| 4 | Line | `UnsLine.Name` (first-class table) | `line-2` or `_default` |
| 5 | Equipment | `Equipment.Name` | `cnc-mill-05` |
| 6 | Signal | `Tag.Name` | `RunState`, `ActualFeedRate` |
OPC UA browse path: zb/warsaw-west/bldg-3/line-2/cnc-mill-05/RunState.
UnsArea and UnsLine are first-class generation-versioned entities so the UNS structure is manageable on its own — operators can rename bldg-3 → bldg-3a and every equipment under it picks up the new path automatically; bulk-move 5 lines from one building to another with a single edit; etc. Equipment references UnsLineId (FK), not denormalized Area/Line strings.
Naming rules (validated at draft-publish time and in Admin UI):
- Each segment matches `^[a-z0-9-]{1,32}$`, OR equals the reserved placeholder `_default`
- Lowercase enforced; hyphens allowed within a segment, slashes only between segments
- Total path ≤ 200 characters
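A sketch of these rules as draft-publish-time validators (Python). One judgment call is labeled in the code: the plan's own example path ends in mixed-case tag names (RunState), so this sketch assumes the lowercase rule applies to the five structural levels only, not to the signal segment.

```python
import re

SEGMENT = re.compile(r"^[a-z0-9-]{1,32}$")

def valid_structural_segment(s: str) -> bool:
    """Matches ^[a-z0-9-]{1,32}$ OR the reserved placeholder "_default"."""
    return s == "_default" or bool(SEGMENT.fullmatch(s))

def valid_browse_path(path: str) -> bool:
    """Validate enterprise/site/area/line/equipment/signal.
    ASSUMPTION: lowercase rules apply to the five structural levels only;
    the plan's example ends in mixed-case tag names (RunState)."""
    if len(path) > 200:          # total path <= 200 characters
        return False
    segments = path.split("/")   # slashes only between segments
    if len(segments) != 6:
        return False
    return all(valid_structural_segment(s) for s in segments[:5])

assert valid_browse_path("zb/warsaw-west/bldg-3/line-2/cnc-mill-05/RunState")
```

In the real system this logic would live in `sp_ValidateDraft` and be mirrored in the Admin UI, per the sentence introducing the rules.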
Equipment is a first-class entity with five distinct identifiers serving different audiences:
| Identifier | Audience | Mutability | Uniqueness | Purpose |
|---|---|---|---|---|
| `EquipmentUuid` | Downstream events / dbt / Redpanda | Immutable forever | Globally unique (UUIDv4) | Permanent join key across systems and time |
| `EquipmentId` | Internal config DB | Immutable after publish | Within cluster | Stable logical key for cross-generation diffs |
| `MachineCode` | OT operators | Mutable (with publish) | Within cluster | Colloquial name in conversations and runbooks (e.g. `machine_001`) |
| `ZTag` | ERP integration | Mutable (rare) | Fleet-wide | Primary identifier for browsing in Admin UI — list/search default sort |
| `SAPID` | SAP PM integration | Mutable (rare) | Fleet-wide | Maintenance system join key |
All five are exposed as OPC UA properties on the equipment node. External systems can resolve equipment by whichever identifier they natively use — ERP queries by ZTag, SAP PM by SAPID, OT operators by MachineCode in conversation, downstream events by EquipmentUuid for permanent lineage. The OPC UA browse path uses Equipment.Name as the level-5 segment; the other identifiers do not appear in the path but are properties on the node.
SystemPlatform namespace does NOT use UNS — Galaxy's hierarchy is preserved as v1 LmxOpcUa exposes it (Area > Object). UNS rules apply only to drivers in Equipment-kind namespaces.
Authority for equipment-class templates lives in a future central schemas repo (not yet created per the 3-year-plan). v2.0 ships an Equipment.EquipmentClassRef column as a hook (nullable, FK-to-future); enforcement is added when the schemas repo lands. Cheap to add now, expensive to retrofit.
Canonical machine state vocabulary (Running, Idle, Faulted, Starved, Blocked) — derivation lives at Layer 3 (System Platform / Ignition), not in OtOpcUa. Our role is delivering the raw signals cleanly so derivation is accurate. Equipment-class templates from the schemas repo will define which raw signals each class exposes.
Architecture:
┌──────────────────────────────────────┐
│ Central Config DB (MSSQL)            │
│                                      │
│ - Server clusters (1 or 2 nodes)     │
│ - Cluster nodes (physical srvs)      │
│ - Driver assignments (per cluster)   │
│ - Tag definitions (per cluster)      │
│ - Device configs (per cluster)       │
│ - Per-node overrides (minimal)       │
│ - Schemaless driver config           │
│   (JSON; cluster-level + node        │
│   override JSON)                     │
└──────────┬───────────────────────────┘
           │ poll / change detection
           ▼
   ┌────── Cluster LINE3-OPCUA ──────┐
   │                                 │
┌──┴───────────────────────┐   ┌─────┴────────────────────┐
│ Node LINE3-OPCUA-A       │   │ Node LINE3-OPCUA-B       │
│ RedundancyRole=Primary   │   │ RedundancyRole=Secondary │
│                          │   │                          │
│ appsettings.json:        │   │ appsettings.json:        │
│  - MSSQL conn string     │   │  - MSSQL conn string     │
│  - ClusterId             │   │  - ClusterId             │
│  - NodeId                │   │  - NodeId                │
│  - Local cache path      │   │  - Local cache path      │
│                          │   │                          │
│ Local cache (LiteDB)     │   │ Local cache (LiteDB)     │
└──────────────────────────┘   └──────────────────────────┘
How it works:
- Each OtOpcUa node has a minimal `appsettings.json` with just: MSSQL connection string, its `ClusterId` and `NodeId`, a local machine-bound client certificate (or gMSA credential), and local cache file path. OPC UA port and `ApplicationUri` come from the central DB (`ClusterNode.OpcUaPort`/`ClusterNode.ApplicationUri`), not from local config — they're cluster topology, not local concerns.
- On startup, the node authenticates to the central DB using a credential bound to its `NodeId` — a client cert or SQL login per node, NOT a shared DB login. The DB-side authorization layer enforces that the authenticated principal may only read config for its `NodeId`'s `ClusterId`. A self-asserted `NodeId` with the wrong credential is rejected. A node may not read another cluster's config, even if both clusters belong to the same admin team.
- The node requests its current config generation from the central DB: "give me the latest published generation for cluster X." Generations are cluster-scoped — one generation = one cluster's full configuration snapshot.
- The node receives the cluster-level config (drivers, devices, tags, poll groups) plus its own `ClusterNode` row (physical attributes + override JSON). It merges node overrides onto cluster-level driver configs at apply time.
- Config is cached locally in a LiteDB file keyed by generation number — if the central DB is unreachable at startup, the node boots from the latest cached generation.
- The node polls the central DB for a new published generation. When a new generation is published, the node downloads it, diffs it against its current one, and applies only the affected drivers/devices/tags (surgical application against an atomic snapshot).
- Both nodes of a cluster apply the same generation, but apply timing can differ slightly (network jitter, polling phase). During the apply window, one node may be on generation N and the other on N+1; this is acceptable because OPC UA non-transparent redundancy already accommodates per-endpoint state divergence and `ServiceLevel` will dip on the node that's mid-apply.
- If generation application fails mid-flight, the node rolls back to the previous generation and surfaces the failure in the status dashboard; admins can publish a corrective generation or explicitly roll back the cluster.
- The central DB is the single source of truth for fleet management — all tag definitions, device configs, driver assignments, and cluster topology live there, versioned by generation.
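The diff step can be sketched by keying each entity on its stable logical ID (Python; `diff_generation` is a hypothetical helper, and generation snapshots are simplified to dicts of TagId to row).

```python
def diff_generation(current: dict, incoming: dict):
    """Diff two generation snapshots keyed by stable logical ID
    (e.g. TagId -> row dict). Returns (added, removed, changed) ID sets
    so apply touches only the affected drivers/devices/tags."""
    added   = incoming.keys() - current.keys()
    removed = current.keys() - incoming.keys()
    changed = {k for k in current.keys() & incoming.keys()
               if current[k] != incoming[k]}
    return added, removed, changed

# Generation N vs N+1: tag-2 re-addressed, tag-3 new, nothing removed.
gen_n  = {"tag-1": {"addr": 40001}, "tag-2": {"addr": 40002}}
gen_n1 = {"tag-1": {"addr": 40001}, "tag-2": {"addr": 40010},
          "tag-3": {"addr": 40003}}
assert diff_generation(gen_n, gen_n1) == ({"tag-3"}, set(), {"tag-2"})
```

This is exactly why the schema carries stable logical IDs (`TagId`, `DeviceId`, `DriverInstanceId`) alongside per-generation row IDs: without them, every publish would look like a full teardown and rebuild.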
Central DB schema (conceptual):
ServerCluster ← top-level deployment unit (1 or 2 nodes)
- ClusterId (PK)
- Name ← human-readable e.g. "LINE3-OPCUA"
- Enterprise ← UNS level 1, canonical org value: "zb" (validated [a-z0-9-]{1,32})
- Site ← UNS level 2, e.g. "warsaw-west" (validated [a-z0-9-]{1,32})
- NodeCount (1 | 2)
- RedundancyMode (None | Warm | Hot) ← None when NodeCount=1
- Enabled
- Notes
-- NOTE: NamespaceUri removed; namespaces are now first-class rows in Namespace table
Namespace ← generation-versioned (revised after adversarial review finding #2),
1+ per cluster per generation
- NamespaceRowId (PK)
- GenerationId (FK)
- NamespaceId ← stable logical ID across generations, e.g. "LINE3-OPCUA-equipment"
- ClusterId (FK)
- Kind (Equipment | SystemPlatform | Simulated) ← UNIQUE (GenerationId, ClusterId, Kind)
- NamespaceUri ← e.g. "urn:zb:warsaw-west:equipment".
UNIQUE per generation; cross-generation invariant: once a
(NamespaceId, ClusterId) pair publishes a NamespaceUri,
it cannot change in any future generation
- Enabled
- Notes
ClusterNode ← physical OPC UA server within a cluster
- NodeId (PK) ← stable per physical machine, e.g. "LINE3-OPCUA-A"
- ClusterId (FK)
- RedundancyRole (Primary | Secondary | Standalone)
- Host ← machine hostname / IP
- OpcUaPort ← typically 4840 on each machine
- DashboardPort ← typically 8081
- ApplicationUri ← MUST be unique per node per OPC UA spec.
Convention: urn:{Host}:OtOpcUa (hostname-embedded).
Unique index enforced fleet-wide, not just per-cluster
— two clusters sharing an ApplicationUri would confuse
any client that browses both.
Stored explicitly, NOT derived from Host at runtime —
OPC UA clients pin trust to ApplicationUri (part of
the cert validation chain), so silent rewrites would
break client trust.
- ServiceLevelBase ← Primary 200, Secondary 150 by default
- DriverConfigOverridesJson ← per-node overrides keyed by DriverInstanceId,
merged onto cluster-level DriverConfig at apply.
Minimal by intent — only settings that genuinely
differ per node (e.g. MxAccess.ClientName).
- Enabled
- LastSeenAt
ClusterNodeCredential ← 1:1 or 1:N with ClusterNode
- CredentialId (PK)
- NodeId (FK) ← bound to the physical node, NOT the cluster
- Kind (SqlLogin | ClientCertThumbprint | ADPrincipal | gMSA)
- Value ← login name, thumbprint, SID, etc.
- Enabled
- RotatedAt
ConfigGeneration ← atomic, immutable snapshot of one cluster's config
- GenerationId (PK) ← monotonically increasing
- ClusterId (FK) ← cluster-scoped — every generation belongs to one cluster
- PublishedAt
- PublishedBy
- Status (Draft | Published | Superseded | RolledBack)
- ParentGenerationId (FK) ← rollback target
- Notes
DriverInstance ← rows reference GenerationId; new generations = new rows
- DriverInstanceRowId (PK)
- GenerationId (FK)
- DriverInstanceId ← stable logical ID across generations
- ClusterId (FK) ← driver lives at the cluster level — both nodes
instantiate it identically (modulo node overrides)
- NamespaceId (FK) ← which namespace this driver populates.
Driver type restricts allowed namespace Kind:
Galaxy → SystemPlatform
Modbus/AB CIP/AB Legacy/S7/TwinCAT/FOCAS → Equipment
OpcUaClient → either, by config
- Name
- DriverType (Galaxy | ModbusTcp | AbCip | OpcUaClient | …)
- Enabled
- DriverConfig (JSON) ← schemaless, driver-type-specific settings.
Per-node overrides applied via
ClusterNode.DriverConfigOverridesJson at apply time.
Device (for multi-device drivers like Modbus, CIP)
- DeviceRowId (PK)
- GenerationId (FK)
- DeviceId ← stable logical ID
- DriverInstanceId (FK)
- Name
- DeviceConfig (JSON) ← host, port, unit ID, slot, etc.
UnsArea ← UNS level 3 (first-class for rename/move)
- UnsAreaRowId (PK)
- GenerationId (FK)
- UnsAreaId ← stable logical ID across generations
- ClusterId (FK)
- Name ← UNS level 3, [a-z0-9-]{1,32} or "_default"
- Notes
UnsLine ← UNS level 4 (first-class for rename/move)
- UnsLineRowId (PK)
- GenerationId (FK)
- UnsLineId ← stable logical ID across generations
- UnsAreaId (FK)
- Name ← UNS level 4, [a-z0-9-]{1,32} or "_default"
- Notes
Equipment ← UNS level-5 entity. Only for drivers in Equipment-kind namespace.
- EquipmentRowId (PK)
- GenerationId (FK)
- EquipmentId ← SYSTEM-GENERATED ('EQ-' + first 12 hex chars of EquipmentUuid).
Never operator-supplied, never editable, never in CSV imports.
(Revised after adversarial review finding #4 — operator-set ID
is a corruption path: typos mint duplicate identities.)
- EquipmentUuid (UUIDv4) ← IMMUTABLE across all generations of the same EquipmentId.
Validated by sp_ValidateDraft. Path/MachineCode/ZTag/SAPID
can change; UUID cannot.
- DriverInstanceId (FK) ← which driver provides data for this equipment
- DeviceId (FK, nullable) ← optional, for multi-device drivers
- UnsLineId (FK) ← UNS level-3+4 source via UnsLine→UnsArea
- Name ← UNS level 5, [a-z0-9-]{1,32} (the equipment name)
-- Operator-facing and external-system identifiers (all exposed as OPC UA properties)
- MachineCode ← Operator colloquial id (e.g. "machine_001"); REQUIRED;
unique within cluster
- ZTag ← ERP equipment id; nullable; unique fleet-wide;
PRIMARY identifier for browsing in Admin UI
- SAPID ← SAP PM equipment id; nullable; unique fleet-wide
- EquipmentClassRef ← nullable; future FK to schemas-repo template (TBD authority)
- Enabled
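The system-generated EquipmentId rule above ('EQ-' + first 12 hex chars of EquipmentUuid) is small enough to model directly. A minimal sketch, in Python purely for illustration (the real implementation lives in the Configuration layer / sp_ValidateDraft):

```python
import uuid

def derive_equipment_id(equipment_uuid: uuid.UUID) -> str:
    """EquipmentId = 'EQ-' + first 12 hex chars of EquipmentUuid.
    System-generated, never operator-supplied (adversarial review finding #4)."""
    return "EQ-" + equipment_uuid.hex[:12]

eq_uuid = uuid.UUID("3f2a9c14-7b1e-4d85-9e60-1a2b3c4d5e6f")
print(derive_equipment_id(eq_uuid))  # EQ-3f2a9c147b1e
```

Because the ID is a pure function of the immutable UUID, two operators can never mint duplicate identities via typos, and renames/moves never change it.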
Tag
- TagRowId (PK)
- GenerationId (FK)
- TagId ← stable logical ID
- EquipmentId (FK, nullable) ← REQUIRED when driver is in Equipment-kind namespace.
NULL when driver is in SystemPlatform-kind namespace
(Galaxy hierarchy is preserved as v1 expressed it).
- DriverInstanceId (FK) ← always present (Equipment.DriverInstanceId mirrors this
when EquipmentId is set; redundant but indexed for joins)
- DeviceId (FK, nullable)
- Name ← signal name. UNS level 6 when in Equipment namespace.
- FolderPath ← only used when EquipmentId is NULL (SystemPlatform ns);
Equipment provides path otherwise.
- DataType
- AccessLevel (Read | ReadWrite)
- WriteIdempotent (bool) ← opt-in for write retry eligibility (see Polly section)
- TagConfig (JSON) ← register address, poll group, scaling, etc.
PollGroup
- PollGroupRowId (PK)
- GenerationId (FK)
- PollGroupId ← stable logical ID
- DriverInstanceId (FK)
- Name
- IntervalMs
ClusterNodeGenerationState ← tracks which generation each NODE has applied
- NodeId (PK, FK) ← per-node, not per-cluster — both nodes of a
2-node cluster track independently
- CurrentGenerationId (FK)
- LastAppliedAt
- LastAppliedStatus (Applied | RolledBack | Failed)
- LastAppliedError
ExternalIdReservation ← NOT generation-versioned (revised after adversarial review finding #3).
Fleet-wide ZTag/SAPID uniqueness that survives rollback,
disable, and re-enable. Per-generation indexes can't enforce
this — old generations still hold the same external IDs.
- ReservationId (PK)
- Kind (ZTag | SAPID)
- Value ← the identifier string
- EquipmentUuid ← which equipment owns this reservation, FOREVER
- ClusterId ← first cluster to publish it
- FirstPublishedAt / LastPublishedAt
- ReleasedAt / ReleasedBy / ReleaseReason ← non-null when explicitly released by FleetAdmin
Lifecycle: sp_PublishGeneration auto-reserves on publish. Disable doesn't release.
Rollback respects the reservation table. Explicit release is the only way to free a value
for reuse by a different EquipmentUuid. UNIQUE (Kind, Value) WHERE ReleasedAt IS NULL.
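The reservation lifecycle above (reserve on publish, disable never releases, explicit release only) can be sketched as a toy in-memory model. Python here is purely illustrative; the real enforcement is the filtered unique index UNIQUE (Kind, Value) WHERE ReleasedAt IS NULL plus sp_PublishGeneration / sp_ReleaseExternalIdReservation:

```python
class ReservationLedger:
    """Toy model of ExternalIdReservation semantics: (Kind, Value) is unique
    while active; re-publish by the owning EquipmentUuid is idempotent; only
    an explicit release frees a value for a different EquipmentUuid."""
    def __init__(self):
        self._active = {}  # (kind, value) -> owning equipment_uuid

    def reserve(self, kind, value, equipment_uuid):
        owner = self._active.get((kind, value))
        if owner is not None and owner != equipment_uuid:
            raise ValueError(f"{kind} {value!r} already reserved by {owner}")
        self._active[(kind, value)] = equipment_uuid  # idempotent for the owner

    def release(self, kind, value):
        # FleetAdmin-only in the real system; disable/rollback never call this.
        self._active.pop((kind, value), None)

ledger = ReservationLedger()
ledger.reserve("ZTag", "Z-1001", "uuid-A")
ledger.reserve("ZTag", "Z-1001", "uuid-A")       # re-publish by owner: OK
try:
    ledger.reserve("ZTag", "Z-1001", "uuid-B")   # different equipment: rejected
except ValueError as e:
    print("rejected:", e)
ledger.release("ZTag", "Z-1001")
ledger.reserve("ZTag", "Z-1001", "uuid-B")       # freed by explicit release: OK
```

Note that a per-generation unique index could not express this: old generations still hold the same external IDs, which is exactly why this table is not generation-versioned.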
Authorization model (server-side, enforced in DB):
- All config reads go through stored procedures that take the authenticated principal from
  SESSION_CONTEXT / SUSER_SNAME() / CURRENT_USER and cross-check it against
  ClusterNodeCredential.Value for the requesting NodeId. A principal asking for config of a
  ClusterId that does not contain its NodeId gets rejected, not just filtered.
- Cross-cluster reads are forbidden even within the same site or admin scope — every config
  read carries the requesting NodeId and is checked.
- Admin UI connects with a separate elevated principal that has read/write on all clusters
  and generations.
- Publishing a generation is a stored procedure that validates the draft, computes the diff
  vs. the previous generation, and flips Status to Published atomically within a transaction.
  The publish is cluster-scoped — publishing a new generation for one cluster does not
  affect any other cluster.
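The reject-not-filter rule can be modeled in a few lines. This is a hedged, language-agnostic sketch (Python for illustration; the real checks are T-SQL stored procedures, and the dictionary stand-ins below for ClusterNodeCredential / ClusterNode rows are hypothetical):

```python
# Hypothetical stand-ins for ClusterNodeCredential and ClusterNode rows.
CREDENTIALS = {"svc-line3-a": "LINE3-OPCUA-A"}    # authenticated principal -> NodeId
NODE_CLUSTER = {"LINE3-OPCUA-A": "LINE3-OPCUA"}   # NodeId -> ClusterId

def authorize_config_read(principal: str, requested_cluster: str) -> str:
    """Reject (not merely filter) any config read whose authenticated principal
    is not bound, via its NodeId, to the requested ClusterId."""
    node_id = CREDENTIALS.get(principal)
    if node_id is None:
        raise PermissionError("unknown principal: no ClusterNodeCredential binding")
    if NODE_CLUSTER.get(node_id) != requested_cluster:
        raise PermissionError("cross-cluster config read rejected")
    return node_id

print(authorize_config_read("svc-line3-a", "LINE3-OPCUA"))  # LINE3-OPCUA-A
```

The key property: the NodeId comes from the credential binding, never from the client's request, so a self-asserted NodeId buys an attacker nothing.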
appsettings.json stays minimal:
{
"Cluster": {
"ClusterId": "LINE3-OPCUA",
"NodeId": "LINE3-OPCUA-A"
// OPC UA port, ApplicationUri, redundancy role all come from central DB
},
"ConfigDatabase": {
// The connection string MUST authenticate as a principal bound to this NodeId.
// Options (pick one per deployment):
// - Integrated Security + gMSA (preferred on AD-joined hosts)
// - Client certificate (Authentication=ActiveDirectoryMsi or cert-auth)
// - SQL login scoped via ClusterNodeCredential table (rotate regularly)
// A shared DB login across nodes is NOT supported — the server-side
// authorization layer will reject cross-cluster config reads.
"ConnectionString": "Server=configsrv;Database=OtOpcUaConfig;Authentication=...;...",
"GenerationPollIntervalSeconds": 30,
"LocalCachePath": "config_cache.db"
},
"Security": { /* transport/auth settings — still local */ }
}
Decided:
- Central MSSQL database is the single source of truth for all configuration.
- Top-level deployment unit is ServerCluster with 1 or 2 ClusterNode members. Single-node
  and 2-node deployments use the same schema; single-node is a cluster of one.
- Driver, device, tag, equipment, and poll-group config attaches to ClusterId, not to
  individual nodes. Both nodes of a cluster serve identical address spaces.
- Per-node overrides are minimal by intent — ClusterNode.DriverConfigOverridesJson is the
  only override mechanism, scoped to driver-config settings that genuinely must differ per
  node (e.g. MxAccess.ClientName). Tags, equipment, and devices have no per-node override path.
- ApplicationUri is auto-suggested but never auto-rewritten. When an operator creates a new
  ClusterNode in Admin, the UI prefills urn:{Host}:OtOpcUa. If the operator later changes
  Host, the UI surfaces a warning that ApplicationUri is not updated automatically — OPC UA
  clients pin trust to it, and a silent rewrite would force every client to re-pair. The
  operator must explicitly opt in to changing it.
- Each node identifies itself by NodeId and ClusterId and authenticates with a credential
  bound to its NodeId; the DB enforces the mapping server-side. A self-asserted NodeId is
  not accepted, and a node may not read another cluster's config.
- Each cluster serves multiple namespaces through one endpoint, modeled as first-class
  Namespace rows (Kind ∈ {Equipment, SystemPlatform, Simulated}). Adding a future namespace
  kind is a config-DB row insert + driver wiring, not a structural refactor.
- UNS naming hierarchy is mandatory in Equipment-kind namespaces: 5 levels
  (Enterprise/Site/Area/Line/Equipment) with signals as level-6 children. Each segment is
  validated against ^[a-z0-9-]{1,32}$ or _default; total path ≤ 200 chars. The
  SystemPlatform namespace preserves Galaxy's existing hierarchy unchanged.
- Equipment is a first-class entity in Equipment namespaces with a stable EquipmentUuid
  (UUIDv4) immutable across renames, moves, and generations. Path can change; UUID cannot.
  Equipment.EquipmentClassRef is a hook for future schemas-repo integration — nullable now,
  FK enforcement added when the central schemas repo lands per the 3-year-plan.
- Local LiteDB cache for offline startup resilience, keyed by generation.
- JSON columns for driver-type-specific config (schemaless per driver type, structured at the fleet level).
- Multiple instances of the same driver type supported within one cluster.
- Each device in a driver instance appears as a folder node in the address space.
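The UNS naming rules above are mechanical enough to sketch. A minimal validator, in Python purely for illustration (the real validators live in the Configuration project; the function name is hypothetical):

```python
import re

SEGMENT = re.compile(r"^[a-z0-9-]{1,32}$")

def validate_uns_path(segments: list[str], max_path: int = 200) -> None:
    """5 UNS levels (Enterprise/Site/Area/Line/Equipment); each segment matches
    ^[a-z0-9-]{1,32}$ or is the literal '_default'; total path <= 200 chars."""
    if len(segments) != 5:
        raise ValueError("expected exactly 5 UNS levels")
    for seg in segments:
        if seg != "_default" and not SEGMENT.match(seg):
            raise ValueError(f"bad segment: {seg!r}")
    if len("/".join(segments)) > max_path:
        raise ValueError("path exceeds 200 chars")

validate_uns_path(["acme", "plant-7", "_default", "line-3", "press-01"])  # passes
```

Uppercase, underscores (other than the `_default` placeholder), and empty segments all fail fast at draft-validation time rather than surfacing as malformed browse paths.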
Decided (rollout model):
- Config is versioned as immutable, cluster-scoped generations. Admin authors a draft for a cluster, then publishes it in a single transaction. Nodes only ever observe a fully-published generation — never a half-edited mix of rows.
- One generation = one cluster's full configuration snapshot. Publishing a generation for one cluster does not affect any other cluster.
- Each node polls for the latest generation for its cluster, diffs it against its current applied generation, and surgically applies only the affected drivers/devices/tags. Surgical application is safe because the source snapshot is atomic.
- Both nodes of a cluster apply the same generation independently — the apply timing can
  differ slightly. During the apply window, one node may be on generation N while the other
  is on N+1; this is acceptable because non-transparent redundancy already accommodates
  per-endpoint state divergence, and ServiceLevel will dip on the node that's mid-apply.
- Rollback: publishing a new generation never deletes old ones. Admins can roll back a
  cluster to any previous generation; nodes apply the target generation the same way as a
  forward publish.
- Applied-state per node is tracked in ClusterNodeGenerationState so Admin can see which
  nodes have picked up a new publish and detect stragglers or a 2-node cluster that's diverged.
- If neither the central DB nor a local cache is available, the node fails to start. This is
  acceptable — there's no meaningful "run with zero config" mode.
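The diff-then-surgically-apply step can be sketched abstractly. A hedged illustration in Python (the real diff operates over drivers/devices/tags/poll groups; here a flat `{tag_id: config}` map stands in for a generation snapshot):

```python
def generation_diff(current: dict, target: dict) -> dict:
    """Compare two atomic generation snapshots and return only the surgical
    changes a node must apply; everything else keeps running untouched."""
    return {
        "add":    [t for t in target if t not in current],
        "remove": [t for t in current if t not in target],
        "update": [t for t in target
                   if t in current and current[t] != target[t]],
    }

gen_n  = {"tag-1": {"interval": 500}, "tag-2": {"interval": 1000}}
gen_n1 = {"tag-1": {"interval": 250}, "tag-3": {"interval": 1000}}
print(generation_diff(gen_n, gen_n1))
# {'add': ['tag-3'], 'remove': ['tag-2'], 'update': ['tag-1']}
```

Because the source snapshot is immutable and published atomically, the diff can never observe a half-edited mix of rows — which is what makes incremental application safe.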
Decided:
- Transport security config (certs, LDAP settings, transport profiles) stays local in
  appsettings.json per instance. Avoids a bootstrap chicken-and-egg where DB connection
  credentials would depend on config retrieved from the DB. Matches the current v1
  deployment model.
- Generation retention: keep all generations forever. The rollback target is always
  available and the audit trail is complete. Config rows are small and publish cadence is
  low (days/weeks), so storage cost is negligible versus the utility of a complete history.
Deferred:
- Event-driven generation notification (SignalR / Service Broker) as an optimisation over poll interval — deferred until polling proves insufficient.
5. Project Structure
All projects target .NET 10 x64 unless noted.
src/
# ── Configuration layer ──
ZB.MOM.WW.OtOpcUa.Configuration/ # Central DB schema (EF), change detection,
# local LiteDB cache, config models (.NET 10)
ZB.MOM.WW.OtOpcUa.Admin/ # Blazor Server admin UI + API for managing the
# central config DB (.NET 10)
# ── Core + Server ──
ZB.MOM.WW.OtOpcUa.Core/ # OPC UA server, address space, subscriptions,
# driver hosting (.NET 10)
ZB.MOM.WW.OtOpcUa.Core.Abstractions/ # IDriver, IReadable, ISubscribable, etc.
# thin contract (.NET 10)
ZB.MOM.WW.OtOpcUa.Server/ # Host (Microsoft.Extensions.Hosting),
# Windows Service, config bootstrap (.NET 10)
# ── In-process drivers (.NET 10 x64) ──
ZB.MOM.WW.OtOpcUa.Driver.ModbusTcp/ # Modbus TCP driver (NModbus)
ZB.MOM.WW.OtOpcUa.Driver.AbCip/ # Allen-Bradley CIP driver (libplctag)
ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/ # Allen-Bradley SLC/MicroLogix driver (libplctag)
ZB.MOM.WW.OtOpcUa.Driver.S7/ # Siemens S7 driver (S7netplus)
ZB.MOM.WW.OtOpcUa.Driver.TwinCat/ # Beckhoff TwinCAT ADS driver (Beckhoff.TwinCAT.Ads)
ZB.MOM.WW.OtOpcUa.Driver.Focas/ # FANUC FOCAS CNC driver (Fwlib64.dll P/Invoke)
ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient/ # OPC UA client gateway driver
# ── Out-of-process Galaxy driver ──
ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ # In-process proxy that implements IDriver interfaces
# and forwards over IPC (.NET 10)
ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ # Separate process: MXAccess COM, Galaxy DB,
# alarms, HDA. Hosts IPC server (.NET 4.8 x86)
ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ # Shared IPC message contracts between Proxy
# and Host (.NET Standard 2.0)
# ── Client tooling (.NET 10 x64) ──
ZB.MOM.WW.OtOpcUa.Client.CLI/ # client CLI
ZB.MOM.WW.OtOpcUa.Client.UI/ # Avalonia client
tests/
ZB.MOM.WW.OtOpcUa.Configuration.Tests/
ZB.MOM.WW.OtOpcUa.Core.Tests/
ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/
ZB.MOM.WW.OtOpcUa.Driver.ModbusTcp.Tests/
ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/
ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests/
ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/
ZB.MOM.WW.OtOpcUa.Driver.TwinCat.Tests/
ZB.MOM.WW.OtOpcUa.Driver.Focas.Tests/
ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/
ZB.MOM.WW.OtOpcUa.IntegrationTests/
Deployment units:
| Unit | Description | Target | Deploys to |
|---|---|---|---|
| OtOpcUa Server | Windows Service (M.E.Hosting) — OPC UA server + in-process drivers | .NET 10 x64 | Each site node |
| Galaxy Host | Windows Service — out-of-process MXAccess driver | .NET 4.8 x86 | Same machine as Server (when Galaxy driver is used) |
| OtOpcUa Admin | Blazor Server config management UI | .NET 10 x64 | Same server or central management host |
| OtOpcUa Client CLI | Operator CLI tool | .NET 10 x64 | Any workstation |
| OtOpcUa Client UI | Avalonia desktop client | .NET 10 x64 | Any workstation |
Dependency graph:
Admin ──→ Configuration
Server ──→ Core ──→ Core.Abstractions
│ ↑
│ Driver.ModbusTcp, Driver.AbCip, Driver.AbLegacy,
│ Driver.S7, Driver.TwinCat, Driver.Focas,
│ Driver.OpcUaClient (in-process)
│ Driver.Galaxy.Proxy (in-process, forwards over IPC)
↓
Configuration
Galaxy.Proxy ──→ Galaxy.Shared ←── Galaxy.Host
(.NET 4.8 x86, separate process)
- Core.Abstractions — no dependencies, referenced by Core and all drivers (including Galaxy.Proxy)
- Configuration — owns central DB access + local cache, referenced by Server and Admin
- Admin — Blazor Server app, depends on Configuration, can deploy on the same server
- In-process drivers depend on Core.Abstractions only
- Galaxy.Shared — .NET Standard 2.0 IPC contracts, referenced by both Proxy (.NET 10) and Host (.NET 4.8)
- Galaxy.Host — standalone .NET 4.8 x86 process, does NOT reference Core or Core.Abstractions
- Galaxy.Proxy — implements IDriver etc., depends on Core.Abstractions + Galaxy.Shared
Decided:
- Mono-repo (Decision #31 above).
- Core.Abstractions is internal-only for now — no standalone NuGet. Keep the contract
  mutable while the first 8 drivers are being built; revisit publishing after Phase 5 when
  the shape has stabilized. Design the contract as if it will eventually be public (no leaky
  types, stable names) to minimize churn later.
5a. LmxNodeManager Reusability Analysis
Investigated 2026-04-17. The existing LmxNodeManager (2923 lines) is the foundation for the new generic node manager — not a rewrite candidate. Categorized inventory:
| Bucket | Lines | % | What's here |
|---|---|---|---|
| Already generic | ~1310 | 45% | OPC UA plumbing: CreateAddressSpace + topological sort + _nodeMap, Read/Write dispatch, HistoryRead + continuation points, subscription delivery + _pendingDataChanges queue, dispatch thread lifecycle, runtime-status node mechanism, status-code mapping |
| Generic pattern, Galaxy-coded today | ~1170 | 40% | Bad-quality fan-out when a host drops, alarm auto-subscribe (InAlarm+Priority+Description pattern), background-subscribe tracking with shutdown-safe WaitAll, value normalization for arrays, connection-health probe machinery — each is a pattern every driver will need, currently wired to Galaxy types |
| Truly MXAccess-specific | ~290 | 10% | IMxAccessClient calls, MxDataTypeMapper, SecurityClassificationMapper, GalaxyRuntimeProbeManager construction/lifecycle, Historian literal, alarm auto-subscribe trigger |
| Metadata / comments | ~153 | 5% | — |
Interleaving assessment: concerns are cleanly separated at method boundaries. Read/Write handlers do generic resolution → generic host-status check → isolated _mxAccessClient call. The dispatch loop is fully generic. The only meaningful interleaving is in BuildAddressSpace() where GalaxyAttributeInfo leaks into node creation — fixable by introducing a driver-agnostic DriverAttributeInfo DTO.
Refactor plan:
- Rename LmxNodeManager → GenericDriverNodeManager : CustomNodeManager2 and lift the
  generic blocks unchanged. Swap IMxAccessClient for IDriver (composing
  IReadable/IWritable/ISubscribable). Swap GalaxyAttributeInfo for a driver-agnostic
  DriverAttributeInfo { FullName, DriverDataType, IsArray, ArrayDim, SecurityClass,
  IsHistorized }. Promote GalaxyRuntimeProbeManager to an IHostConnectivityProbe
  capability interface.
- Derive GalaxyNodeManager : GenericDriverNodeManager — a driver-specific builder that maps
  GalaxyAttributeInfo → DriverAttributeInfo, registers MxDataTypeMapper /
  SecurityClassificationMapper, and injects the probe manager.
- New drivers (Modbus, S7, etc.) extend GenericDriverNodeManager and implement the
  capability interfaces. No forking of the OPC UA machinery.
Ordering within Phase 2 (fits the "incremental extraction" approach in Decision #55):
- (a) Introduce capability interfaces + DriverAttributeInfo in Core.Abstractions.
- (b) Rename to GenericDriverNodeManager with Galaxy still in-process as the only driver;
  validate parity against v1 integration tests + a CLI walkthrough.
- (c) Only then move Galaxy behind the IPC boundary into Galaxy.Host.
Each step leaves the system runnable. The generic extraction is effectively free — the class is already mostly generic, just named and typed for Galaxy.
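The DriverAttributeInfo swap at the heart of the refactor can be illustrated abstractly. A hedged sketch in Python (the real DTO is C#; the Galaxy field names in the mapping function below are hypothetical stand-ins, not the actual GalaxyAttributeInfo members):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DriverAttributeInfo:
    # Driver-agnostic DTO that replaces GalaxyAttributeInfo in node creation,
    # mirroring the shape named in the refactor plan above.
    full_name: str
    driver_data_type: str
    is_array: bool = False
    array_dim: int = 0
    security_class: str = "None"
    is_historized: bool = False

def from_galaxy(attr: dict) -> DriverAttributeInfo:
    """Sketch of the GalaxyNodeManager mapping step (source keys hypothetical)."""
    return DriverAttributeInfo(
        full_name=attr["TagName"],
        driver_data_type=attr["MxDataType"],
        is_array=attr.get("IsArray", False),
        array_dim=attr.get("Dimension", 0),
        security_class=attr.get("SecurityClassification", "None"),
        is_historized=attr.get("Historized", False),
    )
```

Each driver supplies one such mapping function while the generic node manager consumes only DriverAttributeInfo, which is what removes the GalaxyAttributeInfo leak from BuildAddressSpace().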
6. Migration Strategy
Decided approach:
Phase 0 — Rename + .NET 10 migration
- Rename to OtOpcUa — mechanical rename of namespaces, assemblies, config, and docs
- Migrate to .NET 10 x64 — retarget all projects except Galaxy Host
Phase 1 — Core extraction + Configuration layer + Admin scaffold
3. Build Configuration project — central MSSQL schema with ServerCluster, ClusterNode, ClusterNodeCredential, Namespace (generation-versioned), UnsArea, UnsLine, ConfigGeneration, ClusterNodeGenerationState, ExternalIdReservation (NOT generation-versioned, fleet-wide ZTag/SAPID uniqueness) plus the cluster-scoped DriverInstance / Device / Equipment / Tag / PollGroup tables (EF Core + migrations); UNS naming validators (segment regex, path length, _default placeholder, UUIDv4 immutability across generations, system-generated EquipmentId, same-cluster namespace binding, ZTag/SAPID reservation pre-flight, within-cluster uniqueness for MachineCode); server-side authorization stored procs that enforce per-node-bound-to-cluster access from authenticated principals; atomic cluster-scoped publish/rollback stored procs (sp_PublishGeneration reserves external IDs atomically; sp_ReleaseExternalIdReservation is FleetAdmin-only); LiteDB local cache keyed by generation; generation-diff application logic; per-node override merge at apply time.
4. Extract Core.Abstractions — define IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider. IWritable contract separates idempotent vs. non-idempotent writes at the interface level.
5. Build Core — generic driver-hosting node manager that delegates to capability interfaces, driver isolation (catch/contain), address space registration, separate Polly pipelines for reads vs. writes per the write-retry policy above.
6. Wire Server — bootstrap from Configuration using an instance-bound credential (cert/gMSA/SQL login), fail fast if the credential is rejected, register drivers, start Core.
7. Scaffold Admin — Blazor Server app with: instance + credential management, draft/publish/rollback generation workflow (diff viewer, "publish to fleet", per-instance override), and core CRUD for drivers/devices/tags. Driver-specific config screens deferred to later phases.
Phase 2 — Galaxy driver (prove the refactor)
8. Build Galaxy.Shared — .NET Standard 2.0 IPC message contracts
9. Build Galaxy.Host — .NET 4.8 x86 process hosting MxAccessBridge, GalaxyRepository, alarms, HDA with IPC server
10. Build Galaxy.Proxy — .NET 10 in-process proxy implementing IDriver interfaces, forwarding over IPC
11. Validate parity — v2 Galaxy driver must pass the same integration tests as v1
Phase 3 — Modbus TCP driver (prove the abstraction)
12. Build Driver.ModbusTcp — NModbus, config-driven tags from central DB, internal poll loop, device-as-folder hierarchy
13. Add Modbus config screens to Admin (first driver-specific config UI)
Phase 4 — PLC drivers
14. Build Driver.AbCip — libplctag, ControlLogix/CompactLogix symbolic tags + Admin config screens
15. Build Driver.AbLegacy — libplctag, SLC 500/MicroLogix file-based addressing + Admin config screens
16. Build Driver.S7 — S7netplus, Siemens S7-300/400/1200/1500 + Admin config screens
17. Build Driver.TwinCat — Beckhoff.TwinCAT.Ads v6, native ADS notifications, symbol upload + Admin config screens
Phase 5 — Specialty drivers
18. Build Driver.Focas — FANUC FOCAS2 P/Invoke, pre-defined CNC tag set, PMC/macro config + Admin config screens
19. Build Driver.OpcUaClient — OPC UA client gateway/aggregation, namespace remapping, subscription proxying + Admin config screens
Decided:
- Parity test for Galaxy: existing v1 IntegrationTests suite + scripted Client.CLI walkthrough (see Section 4 above).
- Timeline: no hard deadline. Each phase ships when it's right — tests passing, Galaxy parity bar met. Quality cadence over calendar cadence.
- FOCAS SDK: license already secured. Phase 5 can proceed as scheduled; Fwlib64.dll is
  available for P/Invoke.
Decision Log
| # | Decision | Rationale | Date |
|---|---|---|---|
| 1 | Work on v2 branch | Keep master stable for production | 2026-04-16 |
| 2 | OPC UA core + pluggable driver modules | Enable multi-protocol support without forking the server | 2026-04-16 |
| 3 | Rename to OtOpcUa | Product is no longer LMX-specific | 2026-04-16 |
| 4 | Composable capability interfaces | Drivers vary widely in what they support; a flat IDriver would force stubs | 2026-04-16 |
| 5 | Target drivers: Galaxy, Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client | Full PLC/CNC/SCADA/aggregation coverage | 2026-04-16 |
| 6 | Polling is driver-internal, not core-managed | Each driver owns its poll loop; core just sees data change callbacks | 2026-04-16 |
| 7 | Multiple instances of same driver type supported | Need e.g. separate Modbus drivers for different device groups | 2026-04-16 |
| 8 | Namespace index per driver instance | Each instance gets its own NamespaceUri for clean isolation | 2026-04-16 |
| 9 | Rename to OtOpcUa as step 1 | Clean mechanical change before any refactoring | 2026-04-16 |
| 10 | Modbus TCP as second driver | Simplest protocol, validates abstraction with flat/polled/config-driven model | 2026-04-16 |
| 11 | Library selections per driver | NModbus (Modbus), libplctag (AB CIP + AB Legacy), S7netplus (S7), Beckhoff.TwinCAT.Ads v6 (TwinCAT), Fwlib64.dll P/Invoke (FOCAS), OPC Foundation SDK (OPC UA Client) | 2026-04-16 |
| 12 | Driver isolation — failure contained per instance | One driver crash/disconnect must not affect other drivers' nodes or quality | 2026-04-16 |
| 13 | Shared OPC UA StatusCode model for quality | Drivers map to the same StatusCode space; each defines which codes it produces | 2026-04-16 |
| 14 | Central MSSQL config database | Single source of truth for fleet-wide config — instances, drivers, tags, devices | 2026-04-16 |
| 15 | LiteDB local cache per instance | Offline startup resilience — instance boots from cache if central DB is unreachable | 2026-04-16 |
| 16 | JSON columns for driver-specific config | Schemaless per driver type, avoids table-per-driver-type explosion | 2026-04-16 |
| 17 | Device-as-folder in address space | Multi-device drivers expose Device/Tag hierarchy for intuitive browsing | 2026-04-16 |
| 18 | Minimal appsettings.json (ClusterId + NodeId + DB conn) | All real config lives in the central DB, not local files. OPC UA port and ApplicationUri come from the ClusterNode row, not local config | 2026-04-16 / 2026-04-17 |
| 19 | Blazor Server admin app for config management | Separate deployable, manages central MSSQL config DB | 2026-04-16 |
| 20 | Surgical config change detection | Instance detects which drivers/devices/tags changed, applies incremental updates | 2026-04-16 |
| 21 | Fail-to-start without DB or cache | No meaningful zero-config mode — requires at least cached config | 2026-04-16 |
| 22 | Configuration project owns DB + cache layer | Clean separation: Server and Admin both depend on it | 2026-04-16 |
| 23 | .NET 10 x64 default, .NET 4.8 x86 only for Galaxy Host | Modern runtime for everything; COM constraint isolated to Galaxy | 2026-04-16 |
| 24 | Galaxy driver is out-of-process | .NET 4.8 x86 process can't load into .NET 10 x64; IPC bridge required | 2026-04-16 |
| 25 | Galaxy.Shared (.NET Standard 2.0) for IPC contracts | Must be consumable by both .NET 10 Proxy and .NET 4.8 Host | 2026-04-16 |
| 26 | Admin deploys on same server (co-hosted) | Simplifies deployment; can also run on separate management host | 2026-04-16 |
| 27 | Admin scaffold early, driver-specific screens deferred | Core CRUD for instances/drivers first; per-driver config UI added with each driver | 2026-04-16 |
| 28 | Named pipes for Galaxy IPC | Fast, no port conflicts, native to both .NET 4.8 and .NET 10 | 2026-04-16 |
| 29 | Galaxy Host is a separate Windows service | Independent lifecycle, can restart without affecting main server or other drivers | 2026-04-16 |
| 30 | Drop TopShelf, use Microsoft.Extensions.Hosting | Built-in Windows Service support in .NET 10, no third-party dependency | 2026-04-16 |
| 31 | Mono-repo for all drivers | Simpler dependency management, single CI pipeline, shared abstractions | 2026-04-16 |
| 32 | MessagePack serialization for Galaxy IPC | Binary, fast, works on .NET 4.8+ and .NET 10 via MessagePack-CSharp NuGet | 2026-04-16 |
| 33 | EF Core for Configuration DB | Migrations, LINQ queries, standard .NET 10 ORM | 2026-04-16 |
| 34 | Polly v8+ for resilience | Retry, circuit breaker, timeout per device/driver — replaces hand-rolled supervision | 2026-04-16 |
| 35 | Per-device resilience pipelines | Circuit breaker on Drive1 doesn't affect Drive2, even in same driver instance | 2026-04-16 |
| 36 | Polly for config DB access | Retry + fallback to LiteDB cache on sustained DB outage | 2026-04-16 |
| 37 | FOCAS driver uses pre-defined tag set | CNC data is functional (axes, spindle, PMC), not user-defined tags — driver exposes fixed node hierarchy populated by specific FOCAS2 API calls | 2026-04-16 |
| 38 | FOCAS PMC + macro variables are user-configured | PMC addresses (R, D, G, F, etc.) and macro variable ranges configured in central DB; not auto-discovered | 2026-04-16 |
| 39 | TwinCAT uses native ADS notifications | One of 3 drivers with native subscriptions (Galaxy, TwinCAT, OPC UA Client); no polling needed for subscribed tags | 2026-04-16 |
| 40 | TwinCAT no runtime required on server | Beckhoff.TwinCAT.Ads v6 supports in-process ADS router; only needs AMS route on target device | 2026-04-16 |
| 41 | AB Legacy (SLC/MicroLogix) as separate driver from AB CIP | Different protocol (PCCC vs CIP), different addressing (file-based vs symbolic), severe connection limits (4-8) | 2026-04-16 |
| 42 | S7 driver notes: PUT/GET must be enabled on S7-1200/1500 | Disabled by default in TIA Portal; document as prerequisite | 2026-04-16 |
| 43 | DL205 (AutomationDirect) handled by Modbus TCP driver | DL205 supports Modbus TCP via H2-ECOM100; no separate driver needed — AddressFormat=DL205 adds octal address translation | 2026-04-16 |
| 44 | No automatic retry on writes by default | Write retries are unsafe for non-idempotent field actions — a timeout can fire after the device already accepted the command, and replay duplicates pulses/acks/counters/recipe steps (adversarial review finding #1) | 2026-04-16 |
| 45 | Opt-in write retry via TagConfig.WriteIdempotent or CAS wrapper | Retries must be explicit per tag; CAS (compare-and-set) verifies device state before retry where the protocol supports it | 2026-04-16 |
| 46 | Instance identity is credential-bound, not self-asserted | Each instance authenticates to the central DB with a credential (cert/gMSA/SQL login) bound to its InstanceId; the DB rejects cross-instance config reads server-side (adversarial review finding #2) | 2026-04-16 |
| 47 | InstanceCredential table + authorization stored procs | Credentials and the InstanceId they are authorized for live in the DB; all config reads go through procs that enforce the mapping rather than trusting the client | 2026-04-16 |
| 48 | Config is versioned as immutable generations with atomic publish | Admin publishes a whole generation in one transaction; instances only ever observe fully-published generations, never partial multi-row edits (adversarial review finding #3) | 2026-04-16 |
| 49 | Surgical reload applies a generation diff, not raw row deltas | The source snapshot is atomic (generation), but applying it to a running instance is still incremental — only affected drivers/devices/tags reload | 2026-04-16 |
| 50 | Explicit rollback via re-publishing a prior generation | Generations are never deleted; rollback is just publishing an older generation as the new current, so instances apply it the same way as a forward publish | 2026-04-16 |
| 51 | InstanceGenerationState tracks applied generation per instance | Admin can see which instances have picked up a new publish and detect stragglers or failed applies | 2026-04-16 |
| 52 | Address space registration via builder/context API | Core owns the tree; the driver streams AddFolder/AddVariable on an IAddressSpaceBuilder, avoids buffering the whole tree, and supports incremental discovery | 2026-04-17 |
| 53 | Capability discovery via interface checks (is IAlarmSource) | The interface is the capability — no redundant flag enum to keep in sync with the implementation | 2026-04-17 |
| 54 | Optional IRediscoverable sub-interface for change detection | Drivers with a native change signal (Galaxy deploy time, OPC UA change notifications) opt in; static drivers skip it | 2026-04-17 |
| 55 | Galaxy refactor is incremental — extract interfaces in place first | Refactor LmxNodeManager against the new abstractions while still in-process, validate, then move behind IPC. Keeps the system runnable at each step | 2026-04-17 |
| 56 | Galaxy parity test = v1 integration suite + scripted CLI walkthrough | Automated regression plus human-observable behavior on a dev Galaxy | 2026-04-17 |
| 57 | Transport security config stays local in appsettings.json | Avoids a bootstrap chicken-and-egg (DB-connection credentials can't depend on config fetched from the DB); matches the v1 deployment | 2026-04-17 |
| 58 | Generation retention: keep all generations forever | Rollback target always available; audit trail complete; storage cost negligible at publish cadence of days/weeks | 2026-04-17 |
| 59 | Core.Abstractions internal-only for now, no NuGet | Keep the contract mutable through the first 8 drivers; design as if public, revisit after Phase 5 | 2026-04-17 |
| 60 | No hard deadline — phases deliver when they're right | Quality cadence over calendar cadence; Galaxy parity bar must be met before moving on | 2026-04-17 |
| 61 | FOCAS SDK license already secured | Phase 5 can proceed; Fwlib64.dll is available for P/Invoke with no procurement blocker | 2026-04-17 |
| 62 | LmxNodeManager is the foundation for GenericDriverNodeManager, not a rewrite | ~85% of the 2923 lines are generic or generic-in-spirit; only ~10% (~290 lines) are truly MXAccess-specific. Concerns are cleanly separated at method boundaries — the refactor is a rename + DTO swap, not restructuring | 2026-04-17 |
| 63 | Driver stability tier model (A/B/C) | Drivers vary in failure profile (pure managed vs. wrapped native vs. black-box DLL); the tier dictates hosting and protection level. See driver-stability.md | 2026-04-17 |
| 64 | FOCAS is Tier C — out-of-process Windows service from day one | Fwlib64.dll is a black-box vendor DLL; an AccessViolationException is uncatchable in modern .NET and would tear down the OPC UA server. Same Proxy/Host/Shared pattern as Galaxy | 2026-04-17 |
| 65 | Cross-cutting stability protections mandatory in all tiers | SafeHandle for every native resource, memory watchdog, bounded operation queues, scheduled recycle, crash-loop circuit breaker, post-mortem log — apply to every driver process whether in-proc or isolated | 2026-04-17 |
| 66 | Out-of-process driver pattern is reusable across Tier C drivers | Galaxy.Proxy/Host/Shared template generalizes; FOCAS is the second user; future Tier B → Tier C escalations reuse the same three-project template | 2026-04-17 |
| 67 | Tier B drivers may escalate to Tier C on production evidence | libplctag (AB CIP/Legacy), S7netplus, TwinCAT.Ads start in-process; promote to isolated host if leaks or crashes appear in field | 2026-04-17 |
| 68 | Crash-loop circuit breaker — 3 crashes/5 min stops respawn | Prevents host respawn thrashing when the underlying device or DLL is in a state respawning won't fix; surfaces operator-actionable alert; manual reset via Admin UI | 2026-04-17 |
| 69 | Post-mortem log via memory-mapped file | Ring buffer of last-N operations + driver-specific state; survives hard process death including native AV; supervisor reads MMF after corpse is gone — only viable post-mortem path for native crashes | 2026-04-17 |
| 70 | Watchdog thresholds = hybrid multiplier + absolute floor + hard ceiling | Pure multipliers misfire on tiny baselines; pure absolute MB doesn't scale across deployment sizes. max(N × baseline, baseline + floor MB) for warn/recycle, plus an absolute hard ceiling. Slope detection stays orthogonal | 2026-04-17 |
| 71 | Crash-loop reset = escalating cooldown (1 h → 4 h → 24 h manual) with sticky alerts | Manual-only is too rigid for unattended plants; pure auto-reset silently retries forever. Escalating cooldown auto-recovers transient problems but forces human attention on persistent ones; sticky alerts preserve the trail regardless of reset path | 2026-04-17 |
| 72 | Heartbeat cadence = 2 s with 3-miss tolerance (6 s detection) | 5 s × 3 = 15 s is too slow against 1 s OPC UA publish intervals; 1 s × 3 = 3 s false-positives on GC pauses and pipe jitter. 2 s × 3 = 6 s is the sweet spot | 2026-04-17 |
| 73 | Process-level protections (RSS watchdog, scheduled recycle) apply ONLY to Tier C isolated host processes | Process recycle in the shared server would kill every other in-proc driver, every session, and the OPC UA endpoint — directly contradicts the per-driver isolation invariant. Tier A/B drivers get per-instance allocation tracking + cache flush + no-process-kill instead (adversarial review finding #1) | 2026-04-17 |
| 74 | A Tier A/B driver that needs process-level recycle MUST be promoted to Tier C | The only safe way to apply process recycle to a single driver is to give it its own process. If allocation tracking + cache flush can't bound a leak, the answer is isolation, not killing the server | 2026-04-17 |
| 75 | Wedged native calls in Tier C drivers escalate to hard process exit, never handle-free-during-call | Calling release functions on a handle with an active native call is undefined behavior — exactly the AV path Tier C is designed to prevent. After grace window, leave the handle Abandoned and Environment.Exit(2). The OS reclaims fds/sockets on exit; the device's connection-timeout reclaims its end (adversarial review finding #2) | 2026-04-17 |
| 76 | Tier C IPC has mandatory pipe ACL + caller SID verification + per-process shared secret | Default named-pipe ACL allows any local user to bypass OPC UA auth and issue reads/writes/acks directly against the host. Pipe ACL restricts to server service SID, host verifies caller token on connect, supervisor-generated per-process secret as defense-in-depth (adversarial review finding #3) | 2026-04-17 |
| 77 | FOCAS stability test coverage = TCP stub (functional) + FaultShim native DLL (host-side faults) | A TCP stub cannot make Fwlib leak handles or AV — those live inside the P/Invoke boundary. Two artifacts cover the two layers honestly: TCP stub for ~80% of failures (network/protocol), FaultShim for the remaining ~20% (native crashes/leaks). Real-CNC validation remains the only path for vendor-specific Fwlib quirks (adversarial review finding #5) | 2026-04-17 |
| 78 | Per-driver stability treatment is proportional to driver risk | Galaxy and FOCAS get full Tier C deep dives in driver-stability.md (different concerns: COM/STA pump vs Fwlib handle pool); TwinCAT, AB CIP, AB Legacy get short Operational Stability Notes in driver-specs.md for their tier-promotion triggers and protocol-specific failure modes; pure-managed Tier A drivers get one paragraph each. Avoids duplicating the cross-cutting protections doc seven times | 2026-04-17 |
| 79 | Top-level deployment unit is ServerCluster with 1 or 2 ClusterNode members | Sites deploy 2-node clusters for OPC UA non-transparent redundancy (per v1 — Warm/Hot, no VIP). Single-node deployments are clusters of one. Uniform schema avoids forking the config model | 2026-04-17 |
| 80 | Driver / device / tag / poll-group config attaches to ClusterId, not to individual nodes | Both nodes of a cluster serve identical address spaces; defining tags twice would invite drift. One generation = one cluster's complete config | 2026-04-17 |
| 81 | Per-node overrides minimal — ClusterNode.DriverConfigOverridesJson only | Some driver settings legitimately differ per node (e.g. MxAccess.ClientName so Galaxy distinguishes them) but the surface is small. Single JSON column merged onto cluster-level DriverConfig at apply time. Tags and devices have no per-node override path | 2026-04-17 |
| 82 | ConfigGeneration is cluster-scoped, not fleet-scoped | Publishing a generation for one cluster does not affect any other cluster. Simpler rollout (one cluster at a time), simpler rollback, simpler auth boundary. Fleet-wide synchronized rollouts (if ever needed) become a separate concern — orchestrate per-cluster publishes from Admin | 2026-04-17 |
| 83 | Each node authenticates with its own ClusterNodeCredential bound to NodeId | Cluster-scoped auth would be too coarse — both nodes sharing a credential makes credential rotation harder and obscures which node read what. Per-node binding also enforces that Node A cannot impersonate Node B in audit logs | 2026-04-17 |
| 84 | Both nodes apply the same generation independently; brief divergence acceptable | OPC UA non-transparent redundancy already handles per-endpoint state divergence; ServiceLevel dips on the node mid-apply and clients fail over. Forcing two-phase commit across nodes would be a complex distributed-system problem with no real upside | 2026-04-17 |
| 85 | OPC UA RedundancySupport.Transparent not adopted in v2 | True transparent redundancy needs a VIP/load-balancer in front of the cluster. v1 ships non-transparent (Warm/Hot) with ServerUriArray and client-driven failover; v2 inherits the same model. Revisit only if a customer requirement demands LB-fronted transparency | 2026-04-17 |
| 86 | ApplicationUri auto-suggested as urn:{Host}:OtOpcUa but never auto-rewritten | OPC UA clients pin trust to ApplicationUri — it's part of the cert validation chain. Auto-rewriting it when an operator changes Host would silently invalidate every client trust relationship. Admin UI prefills on node creation, warns on Host change, requires explicit opt-in to change. Fleet-wide unique index enforces no two nodes share an ApplicationUri | 2026-04-17 |
| 87 | Concrete schema and stored-proc design lives in config-db-schema.md | The plan §4 sketches the conceptual model; the schema doc carries the actual DDL, indexes, stored procs, JSON conventions, and authorization model implementations. Keeps the plan readable while making the schema concrete enough to start implementing | 2026-04-17 |
| 88 | Admin UI is Blazor Server with LDAP-mapped admin roles (FleetAdmin / ConfigEditor / ReadOnly) | Blazor Server gives real-time SignalR for live cluster status without a separate SPA build pipeline. LDAP reuses the OPC UA auth provider (no parallel user table). Three roles cover the common ops split; cluster-scoped editor grants deferred to v2.1 | 2026-04-17 |
| 89 | Edit path is draft → diff → publish; no in-place edits, no auto-publish | Generations are atomic snapshots — every change goes through an explicit publish boundary so operators see what they're committing. The diff viewer is required reading before the publish dialog enables. Bulk operations always preview before commit | 2026-04-17 |
| 90 | Per-node overrides are NOT generation-versioned | Overrides are operationally bound to a specific physical machine, not to the cluster's logical config evolution. Editing a node override doesn't create a new generation — it updates ClusterNode.DriverConfigOverridesJson directly and takes effect on next apply. Replacement-node scenarios copy the override via deployment tooling, not by replaying generation history | 2026-04-17 |
| 91 | JSON content validation runs in the Admin app, not via SQL CLR | CLR is disabled by default on hardened SQL Server instances; many DBAs refuse to enable it. Admin validates against per-driver JSON schemas before invoking sp_PublishGeneration; the proc enforces structural integrity (FKs, uniqueness, ISJSON) only. Direct proc invocation is already prevented by the GRANT model | 2026-04-17 |
| 92 | Dotted-path syntax for DriverConfigOverridesJson keys (e.g. MxAccess.ClientName) | More readable than JSON Pointer in operator UI and CSV exports. Reserved-char escaping documented (\., \\); array indexing uses Items[0].Name | 2026-04-17 |
| 93 | sp_PurgeGenerationsBefore deferred to v2.1; signature pre-specified | Initial release keeps all generations forever (decision #58). Purge proc shape locked in now: requires @ConfirmToken UI-shown random hex to prevent script-based mass deletion, CASCADE-deletes via WHERE GenerationId IN (...), audit-log entry with row counts. Surface only when a customer compliance ask demands it | 2026-04-17 |
| 94 | (See #102 — switched to Bootstrap 5 for ScadaLink parity) | | 2026-04-17 |
| 95 | CSV import dialect = strict CSV (RFC 4180) UTF-8, BOM accepted | Excel "Save as CSV (UTF-8)" produces RFC 4180 output and is the documented primary input format. TSV not initially supported | 2026-04-17 |
| 96 | Push-from-DB notification deferred to v2.1; polling is the v2.0 model | Tightening apply latency from ~30 s → ~1 s would need SignalR backplane or SQL Service Broker — infrastructure not earning its keep at v2.0 scale. Publish dialog reserves a disabled "Push now" button labeled "Available in v2.1" so the future UX is anchored | 2026-04-17 |
| 97 | Draft auto-save (debounced 500 ms) with explicit Discard; Publish is the only commit | Eliminates "lost work" complaints; matches Google Docs / Notion mental model. Auto-save writes to draft rows only — never to Published. Discard requires confirmation dialog | 2026-04-17 |
| 98 | (See #103 — light-only to match ScadaLink) | | 2026-04-17 |
| 99 | CI tiering: PR-CI uses only in-process simulators; nightly/integration CI runs on dedicated Docker + Hyper-V host | Keeps PR builds fast and runnable on minimal build agents; the dedicated integration host runs the heavy simulators (oitc/modbus-server, TwinCAT XAR VM, Snap7 Server, libplctag ab_server). Operational dependency: stand up the dedicated host before Phase 3 | 2026-04-17 |
| 100 | Studio 5000 Logix Emulate: pre-release validation tier only, no phase-gate | If an org license can be earmarked, designate a golden box for quarterly UDT/Program-scope passes. If not, AB CIP ships validated against ab_server only with documented UAT-time fidelity gap. Don't block Phase 4 on procurement | 2026-04-17 |
| 101 | FOCAS Wireshark capture is a Phase 5 prerequisite identified during Phase 4 | Target capture (production CNC, CNC Guide seat, or customer site visit) identified by Phase 4 mid-point; if no target by then, escalate to procurement (CNC Guide license or dev-rig CNC) as a Phase 5 dependency | 2026-04-17 |
| 102 | Admin UI styling = Bootstrap 5 vendored (parity with ScadaLink CentralUI) | Operators using both ScadaLink and OtOpcUa Admin see the same login screen, same sidebar, same component vocabulary. ScadaLink ships Bootstrap 5 with a custom dark-sidebar + light-main aesthetic; mirroring it directly outweighs MudBlazor's Blazor-component conveniences. Supersedes #94 | 2026-04-17 |
| 103 | Admin UI ships single light theme matching ScadaLink (no dark mode in v2.0) | ScadaLink is light-only; cross-app aesthetic consistency outweighs the ergonomic argument for dark mode. Revisit only if ScadaLink adds dark mode. Supersedes #98 | 2026-04-17 |
| 104 | Admin auth pattern lifted directly from ScadaLink: LdapAuthService + RoleMapper + JwtTokenService + cookie auth + CookieAuthenticationStateProvider | Same login form, same cookie scheme (30-min sliding), same claim shape (Name, DisplayName, Username, Role[], optional ClusterId[] scope), parallel /auth/token endpoint for API clients. Code lives in ZB.MOM.WW.OtOpcUa.Admin.Security (sibling of ScadaLink.Security); consolidate to a shared NuGet only if it later makes operational sense | 2026-04-17 |
| 105 | Cluster-scoped admin grants ship in v2.0 (lifted from v2.1 deferred list) | ScadaLink already ships the equivalent site-scoped pattern (PermittedSiteIds claim, IsSystemWideDeployment flag), so we get cluster-scoped grants free by mirroring it. LdapGroupRoleMapping table maps groups → role + cluster scope; users without explicit cluster claims are system-wide | 2026-04-17 |
| 106 | Shared component set copied verbatim from ScadaLink CentralUI | DataTable, ConfirmDialog, LoadingSpinner, ToastNotification, TimestampDisplay, RedirectToLogin, NotAuthorizedView. New Admin-specific shared components added to our folder rather than diverging from ScadaLink's set, so the shared vocabulary stays aligned | 2026-04-17 |
| 107 | Each cluster serves multiple OPC UA namespaces through one endpoint, modeled as first-class Namespace rows | Per 3-year-plan handoff §4: at v2.0 GA there are two namespaces (Equipment for raw signals, SystemPlatform for Galaxy-derived data); future Simulated namespace must be addable as a config-DB row + driver wiring, not a structural refactor. UNIQUE (ClusterId, Kind) | 2026-04-17 |
| 108 | UNS 5-level naming hierarchy mandatory in Equipment-kind namespaces | Per 3-year-plan handoff §12: Enterprise/Site/Area/Line/Equipment with signals as level-6 children. Each segment ^[a-z0-9-]{1,32}$ or _default; total path ≤ 200 chars. Validated at draft-publish and in Admin UI. SystemPlatform namespace preserves Galaxy's existing hierarchy unchanged — UNS rules don't apply there | 2026-04-17 |
| 109 | Equipment is a first-class entity in Equipment namespaces with stable EquipmentUuid (UUIDv4), immutable across renames/moves/generations | Per handoff §12: path can change (rename, move) but UUID cannot. Downstream consumers (Redpanda events, dbt) carry both UUID for joins/lineage and path for dashboards/filtering. sp_ValidateDraft enforces UUID-per-EquipmentId is constant across all generations of a cluster | 2026-04-17 |
| 110 | Tag belongs to Equipment in Equipment namespaces; tag belongs to Driver+FolderPath in SystemPlatform namespaces | Single Tag table with nullable EquipmentId. When set (Equipment ns), full path is computed Enterprise/Site/Area/Line/Name/TagName. When null (SystemPlatform ns), v1-style DriverInstanceId + FolderPath provides the path. Application-level constraint enforced by sp_ValidateDraft, not DB CHECK | 2026-04-17 |
| 111 | Driver type restricts allowed namespace Kind | Galaxy → SystemPlatform only; Modbus/AB CIP/AB Legacy/S7/TwinCAT/FOCAS → Equipment only; OpcUaClient → either, by config. Encoded in Core.Abstractions driver-type registry; enforced by sp_ValidateDraft | 2026-04-17 |
| 112 | Equipment.EquipmentClassRef shipped as nullable hook in v2.0 for future schemas-repo integration | Per handoff §12: equipment-class templates will live in a central schemas repo (not yet created). Cheap to add the column now; expensive to retrofit later. Enforcement added when schemas repo lands. v2.0 ships without template validation | 2026-04-17 |
| 113 | Canonical machine state derivation lives at Layer 3, not in OtOpcUa | Per handoff §13: Running/Idle/Faulted/Starved/Blocked derivation is System Platform / Ignition's job. OtOpcUa's role is delivering raw signals cleanly so derivation is accurate. Equipment-class templates (when schemas repo lands) define which raw signals each class exposes | 2026-04-17 |
| 114 | Future Simulated namespace architecturally supported, not v2.0 committed | Per handoff §14: Simulated is named as the next namespace kind for replaying historical equipment data without physical equipment. The Namespace.Kind enum reserves the value; no driver implementation in v2.0. Adds via config-DB row + a future replay driver | 2026-04-17 |
| 115 | UNS structure (Area, Line) modeled as first-class generation-versioned tables (UnsArea, UnsLine), not denormalized strings on Equipment | Renaming an area or moving lines between buildings is a single edit that propagates to every equipment under it; bulk-restructure operations work cleanly. Generation-versioning preserves the publish/diff/rollback safety boundary for structural changes | 2026-04-17 |
| 116 | Equipment carries five identifiers: EquipmentUuid, EquipmentId, MachineCode, ZTag, SAPID — each with a different audience | Single-identifier-per-equipment can't satisfy the diverse consumer set: downstream events need a UUID for permanent lineage, OT operators say machine_001 (MachineCode), ERP queries by ZTag, SAP PM by SAPID, internal config diffs need a stable EquipmentId. All five exposed as OPC UA properties on the equipment node so external systems resolve by their preferred identifier without a sidecar | 2026-04-17 |
| 117 | ZTag is the primary browse identifier in the Admin UI | Equipment list/search defaults to ZTag column + sort. MachineCode shown alongside; SAPID searchable. The OPC UA browse path itself uses Equipment.Name (UNS-segment rules); ZTag/MachineCode/SAPID are properties on the node, not path components | 2026-04-17 |
| 118 | MachineCode required, fleet-wide uniqueness on ZTag and SAPID when set | MachineCode is the operator's colloquial name — every equipment must have one. ZTag and SAPID are external system identifiers that may not exist for newly commissioned equipment. Fleet-wide uniqueness on ERP/SAP IDs prevents the same external identifier from referencing two equipment in our config (which would silently corrupt joins) | 2026-04-17 |
| 119 | MachineCode/ZTag/SAPID free-text, not subject to UNS regex | These are external system identifiers, not OPC UA path segments. They can carry whatever conventions ERP/SAP/operator workflows use (mixed case, underscores, vendor-specific schemes). Validation is only non-empty (when present) and ≤64 chars | 2026-04-17 |
| 120 | Admin UI exposes UNS structure as a first-class management surface | Dedicated UNS Structure tab with tree of UnsArea → UnsLine → Equipment, drag-drop reorganize, rename with live impact preview ("X lines, Y equipment, Z signals will pick up new path"). Hybrid model: read-only navigation over the published generation, click-to-edit opens the draft editor scoped to that node. Bulk-rename and bulk-move propagate through UnsLineId FK (no per-equipment row rewrite) | 2026-04-17 |
| 121 | All five equipment identifiers exposed as OPC UA properties on the equipment node | MachineCode, ZTag, SAPID, EquipmentUuid, EquipmentId are properties so external systems resolve equipment by their preferred identifier without a sidecar lookup service. Browse path uses Equipment.Name as the level-5 segment (UNS-compliant); the other identifiers are properties, not path components | 2026-04-17 |
| 122 | Same-cluster invariant on DriverInstance.NamespaceId enforced in three layers (sp_ValidateDraft, API scoping, audit) | Without enforcement a draft for cluster A could bind to cluster B's namespace, leaking the URI into A's endpoint and breaking tenant isolation. UI filtering alone is insufficient — server-side scoping prevents bypass via crafted requests. Cross-cluster attempts audit-logged as CrossClusterNamespaceAttempt. (Closes adversarial review 2026-04-17 finding #1, critical) | 2026-04-17 |
| 123 | Namespace is generation-versioned (revised from earlier "cluster-level" decision) | A cluster-level namespace lets an admin disable a namespace that a published driver depends on, breaking the live config without a generation change and making rollback unreproducible. Namespaces affect what consumers see at the OPC UA endpoint — they are content, not topology — and must travel through draft → diff → publish like every other consumer-visible config. Cross-generation invariant: once a (NamespaceId, ClusterId) publishes a NamespaceUri/Kind, it cannot change. (Closes adversarial review 2026-04-17 finding #2, supersedes part of #107) | 2026-04-17 |
| 124 | ZTag/SAPID fleet-wide uniqueness backed by an ExternalIdReservation table, NOT generation-versioned per-generation indexes | Per-generation indexes fail under rollback and disable: old generations and disabled equipment can still hold the same external IDs, so rollback or re-enable can silently reintroduce duplicates that corrupt downstream ERP/SAP joins. The reservation table sits outside generation versioning, survives rollback, and reserves fresh values atomically at publish via sp_PublishGeneration. Explicit FleetAdmin release (audit-logged) is the only path that frees a value for reuse by a different EquipmentUuid. (Closes adversarial review 2026-04-17 finding #3) | 2026-04-17 |
| 125 | Equipment.EquipmentId is system-generated ('EQ-' + first 12 hex chars of EquipmentUuid), never operator-supplied or editable, never in CSV imports | Operator-supplied IDs are a real corruption path: typos and bulk-import renames mint new EquipmentIds, which then get new UUIDs even when the physical asset is the same. That permanently splits downstream joins keyed on EquipmentUuid. Removing operator authoring of EquipmentId eliminates the failure mode entirely. CSV imports match by EquipmentUuid (preferred) for updates; rows without UUID create new equipment with system-generated identifiers. Explicit Merge / Rebind operator flow handles the rare case where two UUIDs need to be reconciled. (Closes adversarial review 2026-04-17 finding #4, supersedes part of #116) | 2026-04-17 |
| 126 | Three-gate model (entry / mid / exit) for every implementation phase, with explicit compliance-check categories | Specified in implementation/overview.md. Categories: schema compliance (DB matches the doc), decision compliance (every decision number has a code/test citation), visual compliance (Admin UI parity with ScadaLink), behavioral compliance (per-phase smoke test), stability compliance (cross-cutting protections wired up for Tier C drivers), documentation compliance (any deviation reflected back in v2 docs). Exit gate requires two-reviewer signoff; silent deviation is the failure mode the gates exist to prevent | 2026-04-17 |
| 127 | Per-phase implementation docs live under docs/v2/implementation/ with structured task / acceptance / compliance / completion sections | Each phase doc enumerates: scope (in / out), entry gate checklist, task breakdown with per-task acceptance criteria, compliance checks (script-runnable), behavioral smoke test, completion checklist. Phase 0 + Phase 1 docs are committed; Phases 2–8 land as their predecessors clear exit gates | 2026-04-17 |
| 128 | Driver list is fixed for v2.0 — Equipment Protocol Survey is NOT a prerequisite | The seven committed drivers (Modbus TCP including DL205, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client) plus the existing Galaxy/MXAccess driver are confirmed by direct knowledge of the equipment estate, not pending the formal survey. Supersedes the corrections-doc concern (C1) that the v2 commitment was made pre-survey. The survey may still produce useful inventory data for downstream planning (capacity, prioritization), but adding or removing drivers from the v2 implementation list is out of scope. Closes corrections-doc C1 | 2026-04-17 |
| 129 | OPC UA client data-path authorization model = NodePermissions bitmask flags + per-LDAP-group grants on a 6-level scope hierarchy (Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag) with default-deny + additive grants; explicit Deny deferred to v2.1 | Mirrors v1 SecurityClassification model for Write tiers (WriteOperate / WriteTune / WriteConfigure); adds explicit AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall flags; bundles (ReadOnly / Operator / Engineer / Admin) for one-click grants. Per-session permission-trie evaluator with O(depth × group-count) cost; cache invalidated on generation-apply or LDAP group cache expiry. Closes corrections-doc B1. See acl-design.md | 2026-04-17 |
| 130 | NodeAcl table generation-versioned, edited via draft → diff → publish | Same pattern as Namespace (#123) and Equipment (#109). ACL changes are content, not topology — they affect what consumers see at the OPC UA endpoint. Rollback restores the prior ACL state. Cross-generation invariant: NodeAclId once published with (LdapGroup, ScopeKind, ScopeId) cannot have any of those columns change | 2026-04-17 |
| 131 | Cluster-create workflow seeds default ACL set matching v1 LmxOpcUa LDAP-role-to-permission map | Preserves behavioral parity for v1 → v2 consumer migration. Operators tighten or loosen from there. Admin UI flags any cluster whose ACL set diverges from the seed | 2026-04-17 |
| 132 | OPC UA NodeManager logs denied operations only; allowed operations rely on SDK session/operation diagnostics | Logging every allowed op would dwarf the audit log. Denied-only mirrors typical authorization audit practice. Per-deployment policy can tighten if compliance requires positive-action logging | 2026-04-17 |
| 133 | Two-tier dev environment: inner-loop (in-process simulators on developer machines) + integration (Docker / VM / native simulators on a single dedicated Windows host) | Per decision #99. Concrete inventory + setup plan in dev-environment.md | 2026-04-17 |
| 134 | Docker Desktop with WSL2 backend (not Hyper-V backend) on integration host so TwinCAT XAR VM can run in Hyper-V alongside Docker | TwinCAT runtime cannot coexist with Hyper-V-mode Docker Desktop; WSL2 backend leaves Hyper-V free for the XAR VM. Documented operational constraint | 2026-04-17 |
| 135 | TwinCAT XAR runs only in a dedicated VM on the integration host; developer machines do NOT run XAR locally | The 7-day trial reactivation needs centralized management; the VM is shared infrastructure. Galaxy is the inverse — runs only on developer machines (Aveva license scoping), not on integration host | 2026-04-17 |
| 136 | Consumer cutover (ScadaBridge / Ignition / System Platform IO) is OUT of v2 scope | Owned by a separate integration / operations team. OtOpcUa team's scope ends at Phase 5 (all drivers built, all stability protections in place, full Admin UI shipped including ACL editor). Cutover sequencing, validation methodology, rollback procedures, and Aveva-pattern validation for tier 3 are the integration team's deliverables, tracked in 3-year-plan handoff §"Rollout Posture" and corrections doc §C5 | 2026-04-17 |
| 137 | Dev env credentials documented openly in dev-environment.md; production uses Integrated Security / gMSA per decision #46 | Dev defaults are not secrets — they're convenience. Production never uses these values; documented separation prevents leakage | 2026-04-17 |
| 138 | Every equipment-class template extends a shared _base class providing universal cross-machine metadata (identity, state, alarm summary, optional production context) | References OPC UA Companion Spec OPC 40010 (Machinery) for the Identification component + MachineryOperationMode enum, OPC UA Part 9 for alarm summary fields, ISO 22400 for KPI inputs (TotalRunSeconds, TotalCycles), 3-year-plan handoff §"Canonical Model Integration" for the canonical state vocabulary. Inheritance via extends field on the equipment-class JSON Schema. Avoids per-class drift in identity / state / alarm field naming and ensures every machine in the estate exposes the same baseline metadata regardless of vendor or protocol. _base lives in 3yearplan/schemas/classes/_base.json (temporary location until the dedicated schemas repo is created) | 2026-04-17 |
| 139 | Equipment table extended with OPC 40010 identity columns (Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri) all nullable so equipment can be added before identity is fully captured | First-class columns rather than a JSON blob because these fields are universal (every machine has them) and need to be queryable / searchable in the Admin UI. Manufacturer and Model are declared isRequired: true in _base.json and the Admin UI flags equipment that lacks them; the rest are optional. Drivers that can read these dynamically (FANUC, Beckhoff, etc.) override the static value at runtime; static value is the fallback. Exposed on the OPC UA node under the OPC 40010-standard Identification sub-folder | 2026-04-17 |
| 140 | Enterprise shortname = zb (UNS level-1 segment) | Closes corrections-doc D4. Matches the existing ZB.MOM.WW.* namespace prefix used throughout the codebase; short by design since this segment appears in every equipment path (zb/warsaw-west/bldg-3/line-2/cnc-mill-05/RunState); operators already say "ZB" colloquially. Admin UI cluster-create form default-prefills zb for the Enterprise field. Production deployments use it directly from cluster-create | 2026-04-17 |
| 141 | Tier 3 (AppServer IO) cutover is feasible — AVEVA's OI Gateway supports arbitrary upstream OPC UA servers as a documented pattern | Closes corrections-doc E2 with GREEN-YELLOW verdict. Multiple AVEVA partners (Software Toolbox, InSource) have published working integrations against four different non-AVEVA upstream servers (TOP Server, OPC Router, OmniServer, Cogent DataHub). No re-architecting of OtOpcUa required. Path: OPC UA node → OI Gateway → SuiteLink → $DDESuiteLinkDIObject → AppServer attribute. Recommended AppServer floor: System Platform 2023 R2 Patch 01. Two integrator-burden risks tracked: validation/GxP paperwork (no AVEVA blueprint exists for non-AVEVA upstream servers in Part 11 deployments) and unpublished scale benchmarks (in-house benchmark required before cutover scheduling). See aveva-system-platform-io-research.md | 2026-04-17 |
| 142 | Phase 1 acceptance includes an end-to-end AppServer-via-OI-Gateway smoke test against OtOpcUa | Catches AppServer-specific quirks (cert exchange via reject-and-trust workflow, endpoint URL must NOT include /discovery suffix per Inductive Automation forum failure mode, service-account install required because OI Gateway under SYSTEM cannot connect to remote OPC servers, Basic256Sha256 + SignAndEncrypt + LDAP-username token combination must work end-to-end) early — well before the Year 3 tier-3 cutover schedule. Adds one task to phase-1-configuration-and-admin-scaffold.md Stream E (Admin smoke test) | 2026-04-17 |
| 143 | Polly per-capability policy — Read / HistoryRead / Discover / Probe / Alarm-subscribe auto-retry; Write does NOT auto-retry unless the tag metadata carries [WriteIdempotent] | Decisions #44-45 forbid auto-retry on Write because a timed-out write can succeed on the device + be replayed by the pipeline, duplicating pulses / alarm acks / counter increments / recipe-step advances. Per-capability policy in the shared Polly layer makes the retry safety story explicit; WriteIdempotentAttribute on tag definitions is the opt-in surface | 2026-04-19 |
| 144 | Polly pipeline key = (DriverInstanceId, HostName), not DriverInstanceId alone | Decision #35 requires per-device isolation. One dead PLC behind a multi-device Modbus driver must NOT open the circuit breaker for healthy sibling hosts. Per-instance pipelines would poison every device behind one bad endpoint | 2026-04-19 |
| 145 | Tier A/B/C runtime enforcement splits into MemoryTracking (all tiers — soft/hard thresholds log + surface, NEVER kill) and MemoryRecycle (Tier C only — requires out-of-process topology). Tier A/B hard-breach logs a promotion-to-Tier-C recommendation; the runtime never auto-kills an in-process driver | Decisions #73-74 reserve process-kill protections for Tier C. An in-process Tier A/B "recycle" would kill every OPC UA session + every other in-proc driver for one leaky instance, blast-radius worse than the leak | 2026-04-19 |
| 146 | Memory watchdog uses the hybrid formula soft = max(multiplier × baseline, baseline + floor), with baseline captured as the median of the first 5 min of GetMemoryFootprint() samples post-InitializeAsync. Tier-specific constants: A multiplier=3 floor=50 MB, B multiplier=3 floor=100 MB, C multiplier=2 floor=500 MB. Hard = 2 × soft | Codex adversarial review on the Phase 6.1 plan flagged that hardcoded per-tier MB bands diverge from decision #70's specified formula. Static bands false-trigger on small-footprint drivers + miss meaningful growth on large ones. Observed-baseline + hybrid formula recovers the original intent | 2026-04-19 |
| 147 | WedgeDetector uses demand-aware criteria (state==Healthy AND hasPendingWork AND noProgressIn > threshold). hasPendingWork = (Polly bulkhead depth > 0) OR (active MonitoredItem count > 0) OR (queued historian read count > 0). Idle + subscription-only + write-only-burst drivers stay Healthy without false-fault |
Previous "no successful Read in N intervals" formulation flipped legitimate idle subscribers, slow historian backfills, and write-heavy drivers to Faulted. The demand-aware check only fires when the driver claims work is outstanding | 2026-04-19 |
| 148 | LiteDB config cache is generation-sealed: sp_PublishGeneration writes <cache-root>/<cluster>/<generationId>.db as a read-only sealed file; cache reads serve the last-known-sealed generation. Mixed-generation reads are impossible | Prior "refresh on every successful query" cache could serve LDAP role mapping from one generation alongside UNS topology from another, producing impossible states. Sealed-snapshot invariant keeps cache-served reads coherent with a real published state | 2026-04-19 |
| 149 | AuthorizationDecision { Allow | NotGranted | Denied, IReadOnlyList<MatchedGrant> Provenance } — tri-state internal model. Phase 6.2 only produces Allow + NotGranted (grant-only semantics per decision #129); v2.1 Deny widens without API break | A bool return would collapse no-matching-grant and explicit-deny into the same runtime state + UI explanation; the provenance record is needed for the audit log anyway. Making the shape tri-state from Phase 6.2 avoids a breaking change in v2.1 | 2026-04-19 |
| 150 | Data-plane ACL evaluator consumes NodeAcl rows joined against the session's resolved LDAP group memberships. LdapGroupRoleMapping (decision #105) is control-plane only — routes LDAP groups to Admin UI roles. Zero runtime overlap between the two | Codex adversarial review flagged that the Phase 6.2 draft conflated the two — building the data-plane trie from LdapGroupRoleMapping would let a user inherit tag permissions from an admin-role claim path never intended as a data-path grant | 2026-04-19 |
| 151 | UserAuthorizationState cached per session but bounded by MembershipFreshnessInterval (default 15 min). Past that interval the next hot-path authz call re-resolves LDAP group memberships; failure to re-resolve (LDAP unreachable) → fail-closed (evaluator returns NotGranted until memberships refresh successfully) | Previous design cached memberships until session close, so a user removed from a privileged LDAP group could keep authorized access for hours. Bounded freshness + fail-closed covers the revoke-takes-effect story | 2026-04-19 |
| 152 | Auth cache has its own staleness budget AuthCacheMaxStaleness (default 5 min), independent of decision #36's availability-oriented config cache (24 h). Past 5 min on authorization data, evaluator fails closed regardless of whether the underlying config is still serving from cache | Availability-oriented caches trade correctness for uptime. Authorization data is correctness-sensitive — stale ACLs silently extend revoked access. An auth-specific budget keeps the two concerns from colliding | 2026-04-19 |
| 153 | MonitoredItem carries (AuthGenerationId, MembershipVersion) stamp at create time. On every Publish, items with a mismatching stamp re-evaluate; unchanged items stay fast-path. Revoked items drop to BadUserAccessDenied within one publish cycle | Create-time-only authorization leaves revoked users receiving data forever; per-publish re-authorization at 100 ms cadence across 50 groups × 6 levels is too expensive. Stamp-then-reevaluate-on-change balances correctness with cost | 2026-04-19 |
| 154 | ServiceLevel reserves 0 for operator-declared maintenance only; 1 = NoData (unreachable / Faulted); operational states occupy 2..255 in an 8-state matrix (Authoritative-Primary=255, Isolated-Primary=230, Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80, Backup-Mid-Apply=50, Recovering-Backup=30, InvalidTopology=2) | OPC UA Part 5 §6.3.34 defines 0=Maintenance + 1=NoData; using 0 for our Faulted case collides with the spec + triggers spec-compliant clients to enter maintenance-mode cutover. The expanded 8-state matrix covers operational states the 5-state original collapsed together (e.g. Isolated-Primary vs Primary-Mid-Apply were both 200) | 2026-04-19 |
| 155 | ServerUriArray includes self + peers (self first, deterministic ordering), per OPC UA Part 4 §6.6.2.2 | Previous design excluded self from the array — a spec violation, and clients lose the ability to map server identities consistently during failover | 2026-04-19 |
| 156 | Redundancy peer health uses a two-layer probe: /healthz (2 s) as fast-fail + UaHealthProbe (10 s, opens OPC UA client session to peer + reads its ServiceLevel node) as the authority signal. HTTP-healthy ≠ UA-authoritative | /healthz returns 200 whenever HTTP + config DB/cache is healthy — but a peer can be HTTP-healthy with a broken OPC UA endpoint or a stuck subscription publisher. Using HTTP alone would advertise authority for servers that can't actually publish data | 2026-04-19 |
| 157 | Publish-generation fencing — coordinator CAS on a monotonic ConfigGenerationId; every topology + role decision is generation-stamped; peers reject state propagated from a lower generation. Runtime InvalidTopology state (both self-demote to ServiceLevel 2) when >1 Primary detected post-startup | An operator race publishing two drafts with different roles can produce two locally-valid views; without fencing + runtime containment both nodes can serve as Primary until manual intervention | 2026-04-19 |
| 158 | Apply-window uses named leases keyed to (ConfigGenerationId, PublishRequestId) via await using. ApplyLeaseWatchdog auto-closes any lease older than ApplyMaxDuration (default 10 min) | A simple IDisposable-counter design leaks on cancellation / async-ownership races; a stuck positive count leaves the node permanently mid-apply. Generation-keyed leases + watchdog bound the worst case | 2026-04-19 |
| 159 | CSV import header row must start with # OtOpcUaCsv v1 (version marker). Future shape changes bump the version; parser forks per version. Canonical identifier columns follow decision #117: ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid | Without a version marker the CSV schema has no upgrade path — adding a required column breaks every old export silently. The version prefix makes parser dispatch explicit + future-compatible | 2026-04-19 |
| 160 | Equipment CSV import uses a staged-import pattern: EquipmentImportBatch + EquipmentImportRow tables receive chunked inserts; FinaliseImportBatch is one atomic transaction that applies accepted rows to Equipment + ExternalIdReservation. Rollback = drop the batch row; Equipment never partially mutates | A 10k-row single-transaction import holds locks too long; chunked direct writes lose all-or-nothing rollback. Staging + atomic finalize bounds transaction duration + preserves rollback semantics | 2026-04-19 |
| 161 | UNS drag-reorder impact preview carries a DraftRevisionToken; Confirm re-checks against the current draft + returns 409 Conflict / refresh-required if the draft advanced between preview and commit | Without concurrency control, two operators editing the same draft can overwrite each other's changes silently. Draft-revision token + 409 response makes the race visible + forces refresh | 2026-04-19 |
| 162 | OPC 40010 Identification sub-folder exposed under each equipment node inherits the Equipment scope's ACL grants — the ACL trie does NOT add a new scope level for Identification | Adding a new scope level for Identification would require every grant to add a second grant for Equipment/Identification; inheriting the Equipment scope keeps the grant model flat + prevents operator-forgot-to-grant-Identification access surprises | 2026-04-19 |
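Decision #146's hybrid formula rewards a worked check. A minimal sketch (Python for brevity; the runtime is .NET, and `watchdog_thresholds` plus the constant table are illustrative names, not real code):

```python
# Per-tier constants from decision #146: (multiplier, floor in MB)
TIER_CONSTANTS = {"A": (3, 50), "B": (3, 100), "C": (2, 500)}

def watchdog_thresholds(tier: str, baseline_mb: float) -> tuple[float, float]:
    """soft = max(multiplier * baseline, baseline + floor); hard = 2 * soft.
    baseline_mb is the median of the first 5 min of GetMemoryFootprint()
    samples taken after InitializeAsync (decision #146)."""
    multiplier, floor_mb = TIER_CONSTANTS[tier]
    soft = max(multiplier * baseline_mb, baseline_mb + floor_mb)
    return soft, 2 * soft
```

For a small Tier A driver (20 MB baseline) the floor dominates: soft 70 MB, hard 140 MB. For a large Tier C driver (2 GB baseline) the multiplier dominates: soft 4 GB, hard 8 GB. That is exactly the small-footprint/large-footprint behavior the static MB bands missed.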
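The demand-aware predicate in decision #147 composes three observable signals; a sketch under assumed names (the snapshot fields are illustrative, not the real driver surface):

```python
from dataclasses import dataclass

@dataclass
class DriverSnapshot:
    state: str                     # reported driver state, e.g. "Healthy"
    bulkhead_depth: int            # queued/executing Polly bulkhead work
    active_monitored_items: int    # OPC UA MonitoredItems on this driver
    queued_historian_reads: int
    seconds_since_progress: float  # time since last observed progress

def is_wedged(s: DriverSnapshot, threshold_s: float = 60.0) -> bool:
    """Decision #147: fault only when the driver claims work is outstanding
    (demand-aware) yet makes no progress; idle drivers stay Healthy."""
    has_pending_work = (s.bulkhead_depth > 0
                        or s.active_monitored_items > 0
                        or s.queued_historian_reads > 0)
    return (s.state == "Healthy"
            and has_pending_work
            and s.seconds_since_progress > threshold_s)
```

An idle subscriber with no pending work is never flagged, however long it goes without a successful Read, which is the false-positive the "no successful Read in N intervals" formulation produced.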
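Decision #149's tri-state shape can be sketched in a few lines (Python stand-in for the C# record; `Access` and `is_authorized` are illustrative names):

```python
from dataclasses import dataclass, field
from enum import Enum

class Access(Enum):
    Allow = "Allow"
    NotGranted = "NotGranted"   # no matching grant (grant-only semantics)
    Denied = "Denied"           # reserved for v2.1 explicit deny

@dataclass(frozen=True)
class AuthorizationDecision:
    """Decision #149 sketch: tri-state outcome plus provenance, so the audit
    log and UI can explain why. Phase 6.2 only produces Allow / NotGranted;
    Denied widens in v2.1 without an API break."""
    access: Access
    provenance: tuple = field(default_factory=tuple)  # matched grants

def is_authorized(d: AuthorizationDecision) -> bool:
    # NotGranted and Denied both refuse, but remain distinguishable upstream
    return d.access is Access.Allow
```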
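The stamp-and-reevaluate flow of decision #153 is the subtle one; a sketch under assumed names (`on_publish`, the stamp tuples, and the status strings are illustrative):

```python
def on_publish(items, current_stamp, evaluate):
    """Decision #153 sketch: items is {item_id: (AuthGenerationId,
    MembershipVersion)} captured at create time. On Publish, only items
    whose stamp mismatches the session's current stamp re-evaluate;
    matching items take the fast path. Revoked items drop to
    BadUserAccessDenied within this publish cycle."""
    results = {}
    for item_id, stamp in list(items.items()):
        if stamp != current_stamp:
            if evaluate(item_id):                # re-authorize
                items[item_id] = current_stamp   # re-stamp: fast path next cycle
                results[item_id] = "Good"
            else:
                results[item_id] = "BadUserAccessDenied"
        else:
            results[item_id] = "Good"            # unchanged stamp, no authz cost
    return results
```

Per-publish cost is proportional to the number of stale-stamped items, not to 50 groups × 6 scope levels every 100 ms.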
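The decision #154 matrix reads naturally as an ordered enumeration; a sketch (member names are illustrative, values are the ones fixed by the decision):

```python
from enum import IntEnum

class ServiceLevel(IntEnum):
    """Decision #154 8-state matrix. 0 (Maintenance) and 1 (NoData) are
    reserved per OPC UA Part 5 §6.3.34; operational states occupy 2..255."""
    AuthoritativePrimary = 255
    IsolatedPrimary = 230
    PrimaryMidApply = 200
    RecoveringPrimary = 180
    AuthoritativeBackup = 100
    IsolatedBackup = 80
    BackupMidApply = 50
    RecoveringBackup = 30
    InvalidTopology = 2
    NoData = 1            # unreachable / Faulted, never reported as 0
    Maintenance = 0       # operator-declared only
```

Because spec-compliant clients treat 0 as deliberate maintenance, a faulted server must report NoData (1); the numeric ordering also keeps every Primary state above every Backup state for failover comparison.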
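Decision #159's per-version parser fork might look like the following sketch (`parse_equipment_csv`, `parse_v1`, and the error messages are hypothetical; only the marker string and the decision-#117 column set come from the decisions above):

```python
# Decision #117 canonical identifier columns for v1
V1_COLUMNS = ["ZTag", "MachineCode", "SAPID", "EquipmentId", "EquipmentUuid"]

def parse_equipment_csv(lines):
    """The first line must be the version marker '# OtOpcUaCsv v1';
    the parser forks per version so a future v2 shape change cannot
    silently break old exports."""
    if not lines or not lines[0].startswith("# OtOpcUaCsv "):
        raise ValueError("missing OtOpcUaCsv version marker")
    version = lines[0].split()[-1]
    if version == "v1":
        return parse_v1(lines[1:])
    raise ValueError(f"unsupported OtOpcUaCsv version: {version}")

def parse_v1(lines):
    header = lines[0].split(",")
    if header[:len(V1_COLUMNS)] != V1_COLUMNS:
        raise ValueError("v1 header must start with the decision-#117 columns")
    return [dict(zip(header, row.split(","))) for row in lines[1:] if row]
```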
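The decision #161 preview/confirm race is a classic optimistic-concurrency check; a sketch (class and method names are illustrative, and `DraftConflict` stands in for the HTTP 409 response):

```python
import uuid

class DraftConflict(Exception):
    """Maps to HTTP 409 Conflict / refresh-required."""

class UnsDraft:
    """Decision #161 sketch: the impact preview captures a DraftRevisionToken;
    Confirm re-checks it, so a draft advanced by another operator between
    preview and commit is rejected instead of silently overwritten."""
    def __init__(self):
        self.revision = uuid.uuid4().hex

    def preview_reorder(self):
        return self.revision                 # DraftRevisionToken

    def confirm_reorder(self, token):
        if token != self.revision:
            raise DraftConflict("draft advanced since preview; refresh")
        self.revision = uuid.uuid4().hex     # any commit bumps the revision
        return "committed"
```

A second Confirm with the same token fails, because the first commit already advanced the revision; the client must re-preview against the new draft state.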
Reference Documents
- Driver Implementation Specifications — per-driver details: connection settings, addressing, data types, libraries, API mappings, error handling, implementation notes
- Test Data Sources — per-driver simulator/emulator/stub for development and integration testing
- Driver Stability & Isolation — stability tier model (A/B/C), per-driver hosting decisions, cross-cutting protections, FOCAS and Galaxy deep dives
- Central Config DB Schema — concrete table definitions, indexes, stored procedures, authorization model, JSON conventions, EF Core migrations approach
- Admin Web UI — Blazor Server admin app: information architecture, page-by-page workflows, per-driver config screen extensibility, real-time updates, UX rules
- OPC UA Client Authorization (ACL Design) — data-path authz model: `NodePermissions` bitmask flags (Browse / Read / Subscribe / HistoryRead / WriteOperate / WriteTune / WriteConfigure / AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall + bundles), 6-level scope hierarchy (Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag) with inheritance, default-deny + additive grants, per-session permission-trie evaluator with O(depth × group-count) cost, default cluster-seed mapping of v1 LmxOpcUa LDAP roles, Admin UI ACL tab + bulk grant + simulator. Closes corrections-doc finding B1.
- Development Environment — every external resource the v2 build needs (SQL Server, GLAuth, Galaxy, Docker simulators, TwinCAT XAR VM, OPC Foundation reference server, FOCAS stub + FaultShim) with default ports / credentials / owners; two-tier model (inner-loop on developer machines, integration on a single dedicated Windows host with WSL2-backed Docker + Hyper-V VM for TwinCAT); concrete bootstrap order for both tiers
- AVEVA System Platform IO research — closes corrections-doc E2. Validates that the planned tier-3 cutover (AppServer IO consuming from OtOpcUa instead of equipment directly) is supported via AVEVA's OI Gateway driver. Verdict: GREEN-YELLOW. Multiple non-AVEVA upstream-server integrations published. Two integrator-burden risks: validation/GxP paperwork and unpublished scale benchmarks
- Implementation Plan Overview — phase gate structure (entry / mid / exit), compliance check categories (schema / decision / visual / behavioral / stability / documentation), deliverable conventions, "what counts as following the plan"
- Phase 0 — Rename + .NET 10 cleanup — mechanical LmxOpcUa → OtOpcUa rename with full task breakdown, compliance checks, completion checklist
- Phase 1 — Configuration + Core.Abstractions + Admin scaffold — central MSSQL schema, EF Core migrations, stored procs, LDAP-authenticated Blazor Server admin app with ScadaLink visual parity, LiteDB local cache, generation-diff applier; 5 work streams (A–E), full task breakdown, compliance checks, 14-step end-to-end smoke test
- Phase 2 — Galaxy out-of-process refactor (Tier C) — split legacy in-process Galaxy into `Driver.Galaxy.Shared` (.NET Standard 2.0 IPC contracts) + `Driver.Galaxy.Host` (.NET 4.8 x86 separate Windows service with STA pump, COM SafeHandle wrappers, named-pipe IPC with mandatory ACL, memory watchdog, scheduled recycle with WM_QUIT escalation, post-mortem MMF, FaultShim) + `Driver.Galaxy.Proxy` (.NET 10 in-process IDriver implementation with heartbeat sender and crash-loop circuit-breaker supervisor); retire legacy `OtOpcUa.Host` project; parity gate is v1 IntegrationTests + scripted Client.CLI walkthrough passing byte-equivalent to v1; closes the four 2026-04-13 stability findings as named regression tests