Compare commits

...

98 Commits

Author SHA1 Message Date
Joseph Doherty
36774842cf Phase 7 Stream A.3 — ScriptLoggerFactory + ScriptLogCompanionSink. Third of 3 increments closing out Stream A. Adds the Serilog plumbing that ties script-emitted log events to the dedicated scripts-*.log rolling sink with structured-property filtering AND forwards script Error+ events to the main opcua-*.log at Warning level so operators see script failures in the primary log without drowning it in Debug/Info script chatter. Both pieces are library-level building blocks — the actual file-sink + logger composition at server startup happens in Stream F (Admin UI) / Stream G (address-space wiring). This PR ships the reusable factory + sink + tests so any consumer can wire them up without rediscovering the structured-property contract.
ScriptLoggerFactory wraps a Serilog root logger (the scripts-*.log pipeline) and .Create(scriptName) returns a per-script ILogger with the ScriptName structured property pre-bound via ForContext. The structured property name is a public const (ScriptNameProperty = "ScriptName") because the Admin UI's log-viewer filter references this exact string — changing it breaks the filter silently, so it's stable by contract. Factory constructor rejects a null root logger; Create rejects null/empty/whitespace script names. No per-evaluation allocation in the hot path — engines (Stream B virtual-tag / Stream C scripted-alarm) create one factory per engine instance then cache per-script loggers beside the ScriptContext instances they already build.
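
The factory contract above can be sketched in Python logging terms — LoggerAdapter standing in for Serilog's ForContext; the names and property casing here are illustrative, not the shipped C#:

```python
import logging

SCRIPT_NAME_PROPERTY = "script_name"  # stable by contract, like ScriptNameProperty

class ScriptLoggerFactory:
    """Sketch: per-script loggers with a pre-bound structured property."""
    def __init__(self, root: logging.Logger):
        if root is None:
            raise ValueError("root logger required")
        self._root = root

    def create(self, script_name: str) -> logging.LoggerAdapter:
        if not script_name or not script_name.strip():
            raise ValueError("script name must be non-empty")
        # LoggerAdapter pre-binds the property so every event the script
        # emits carries it — the analogue of ForContext in this sketch.
        return logging.LoggerAdapter(self._root, {SCRIPT_NAME_PROPERTY: script_name})

factory = ScriptLoggerFactory(logging.getLogger("scripts"))
log = factory.create("flow_totalizer")
print(log.extra[SCRIPT_NAME_PROPERTY])  # → flow_totalizer
```

Cache the adapter per script, as the engines cache per-script loggers beside their ScriptContext instances.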

ScriptLogCompanionSink is a Serilog ILogEventSink that forwards Error+ events from the script-logger pipeline to a separate "main" logger (the opcua-*.log pipeline in production) at Warning level. Rationale: operators usually watch the main server log, not scripts-*.log. Script authors log Info/Debug liberally during development — those stay in the scripts file. When a script actually fails (Error or Fatal), the operator needs to see it in the primary log so it can't be missed. Downgrading to Warning in the main log marks these as "needs attention but not a core server issue" since the server itself is healthy; the script author fixes the script. Forwarded event includes the ScriptName property (so operators can tell which script failed at a glance), the OriginalLevel (Error vs Fatal, preserved), the rendered message, and the original exception (preserved so the main log keeps the full stack trace — critical for diagnosis). Missing ScriptName property falls back to "unknown" without throwing; bypassing the factory is defensive but shouldn't happen in practice. Mirror threshold is configurable via constructor (defaults to LogEventLevel.Error) so deployments with stricter signal/noise requirements can raise it to Fatal.
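
A minimal Python-logging analogue of the mirror rule — a Handler standing in for the Serilog ILogEventSink; the ListHandler and logger names are test scaffolding, not shipped code:

```python
import logging

class CompanionHandler(logging.Handler):
    """Mirror rule sketch: Error+ script events re-emitted to the main
    logger at Warning, carrying the script name and the original level."""
    def __init__(self, main: logging.Logger, threshold: int = logging.ERROR):
        super().__init__()
        if main is None:
            raise ValueError("main logger required")
        self._main, self._threshold = main, threshold

    def emit(self, record: logging.LogRecord) -> None:
        if record.levelno < self._threshold:
            return  # Debug/Info script chatter stays in the scripts log only
        script = getattr(record, "script_name", "unknown")  # defensive fallback
        # Downgraded to Warning: needs attention, but the server is healthy.
        self._main.warning("[script:%s original:%s] %s",
                           script, record.levelname, record.getMessage())

class ListHandler(logging.Handler):
    """Test double standing in for the main opcua-*.log sink."""
    def __init__(self):
        super().__init__()
        self.messages = []
    def emit(self, record):
        self.messages.append(record.getMessage())

main = logging.getLogger("opcua_demo")
main.setLevel(logging.DEBUG)
main.propagate = False
captured = ListHandler()
main.addHandler(captured)

scripts = logging.getLogger("scripts_demo")
scripts.setLevel(logging.DEBUG)
scripts.propagate = False
scripts.addHandler(CompanionHandler(main))

scripts.info("tick", extra={"script_name": "totalizer"})          # not mirrored
scripts.error("div by zero", extra={"script_name": "totalizer"})  # mirrored at Warning
print(captured.messages)  # → ['[script:totalizer original:ERROR] div by zero']
```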

15 new unit tests across two files. ScriptLoggerFactoryTests (6): Create sets the ScriptName structured property, each script gets its own property value across fan-out, Error-level event preserves level and exception, null root rejected, empty/whitespace/null name rejected, ScriptNameProperty const is stable at "ScriptName" (external-contract guard). ScriptLogCompanionSinkTests (9): Info/Warning events land in scripts sink only (not mirrored), Error event mirrored to main at Warning level (level-downgrade behavior), mirrored event includes ScriptName + OriginalLevel properties, mirrored event preserves exception for main-log stack-trace diagnosis, Fatal mirrored identically to Error, missing ScriptName falls back to "unknown" without throwing (defensive), null main logger rejected, custom mirror threshold (raised to Fatal) applied correctly.

Full Core.Scripting test suite after Stream A: 63/63 green (29 A.1 + 19 A.2 + 15 A.3). Stream A is complete — the scripting engine foundation, sandbox, sandbox-defense-in-depth, AST-inferred dependency extraction, compile cache, per-evaluation timeout, per-script logger with structured-property filtering, and companion-warn forwarding are all shipped and tested. Streams B through G build on this; Stream H closes out the phase with the compliance script + test baseline + merge to v2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 16:42:48 -04:00
cb5d7b2d58 Merge pull request 'Phase 7 Stream A.2 — compile cache + per-evaluation timeout wrapper' (#178) from phase-7-stream-a2-cache-timeout into v2 2026-04-20 16:41:07 -04:00
Joseph Doherty
0ae715cca4 Phase 7 Stream A.2 — compile cache + per-evaluation timeout wrapper. Second of 3 increments within Stream A. Adds two independent resilience primitives that the virtual-tag engine (Stream B) and scripted-alarm engine (Stream C) will compose with the base ScriptEvaluator. Both are generic on (TContext, TResult) so different engines get their own instances without cross-contamination.
CompiledScriptCache<TContext, TResult> — source-hash-keyed cache of compiled evaluators. Roslyn compilation is the most expensive step in the evaluator pipeline (5-20ms per script depending on size); re-compiling on every value-change event would starve the engine. ConcurrentDictionary of Lazy<ScriptEvaluator> with ExecutionAndPublication mode ensures concurrent callers never double-compile even on a cold cache race. Failed compiles evict the cache entry so an Admin UI retry with corrected source actually recompiles (otherwise the cached exception would persist). The hash is whitespace-sensitive by design — reformatting a script deliberately misses the cache; this is simpler than AST canonicalization, and reformatting happens rarely. No capacity bound because virtual-tag + alarm scripts are config-DB bounded (thousands, not millions); if scale pushes past that in v3 an LRU eviction slots in behind the same API.
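
The cache contract sketched in Python, with a hypothetical compile_fn in place of the Roslyn step. A coarse lock stands in for the Lazy + ExecutionAndPublication guard, and failures are simply not cached — equivalent in effect to evict-on-failure:

```python
import hashlib
import threading

class CompiledScriptCache:
    """Sketch of the contract: source-hash key, one compile per distinct
    source even under races, failed compiles leave nothing behind so a
    corrected retry recompiles."""
    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._lock = threading.Lock()
        self._entries = {}

    def get_or_compile(self, source: str):
        if source is None:
            raise ValueError("source required")
        # Whitespace-sensitive on purpose: reformatting misses the cache.
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                entry = self._compile(source)  # raises on bad source; nothing cached
                self._entries[key] = entry
            return entry

    def clear(self):
        with self._lock:
            self._entries.clear()

compiles = []
def fake_compile(src):
    compiles.append(src)
    return object()  # stands in for a compiled delegate

cache = CompiledScriptCache(fake_compile)
a1 = cache.get_or_compile("return x + 1")
a2 = cache.get_or_compile("return x + 1")   # hit: same instance, no recompile
b = cache.get_or_compile("return  x + 1")   # extra space → different hash
print(len(compiles), a1 is a2, a1 is b)  # → 2 True False
```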

TimedScriptEvaluator<TContext, TResult> — wraps a ScriptEvaluator with a per-evaluation wall-clock timeout (default 250ms per Phase 7 plan Stream A.4, configurable per tag so slower backends can widen). Critical implementation detail: the underlying Roslyn ScriptRunner executes synchronously on the calling thread for CPU-bound user scripts, returning an already-completed Task before the caller can register a timeout. Naive `Task.WaitAsync(timeout)` would see the completed task and never fire. Fix: push evaluation to a thread-pool thread via Task.Run, so the caller's thread is free to wait and the timeout reliably fires after the configured budget. Known trade-off (documented in the class summary): when a script times out, the underlying evaluation task continues running on the thread-pool thread until Roslyn returns; in the CPU-bound-infinite-loop case it's effectively leaked until the runtime decides to unwind. Tighter CPU budgeting would require an out-of-process script runner (v3 concern). In practice the timeout + structured warning log surfaces the offending script so the operator fixes it, and the orphan thread is rare. Caller-supplied CancellationToken is honored and takes precedence over the timeout, so driver-shutdown paths see a clean OperationCanceledException rather than a misclassified ScriptTimeoutException.
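
The same fix sketched with Python's thread pool — a future in place of Task.Run + WaitAsync, names illustrative; the orphaned-worker trade-off carries over unchanged:

```python
import concurrent.futures
import time

class ScriptTimeoutError(Exception):
    pass

class TimedEvaluator:
    """Sketch: wrap a synchronous evaluate callable with a wall-clock budget
    by pushing it to a worker thread, so a CPU-bound script can't pin the
    caller past the timeout."""
    def __init__(self, inner, timeout: float = 0.25):
        if inner is None:
            raise ValueError("inner evaluator required")
        if timeout <= 0:
            raise ValueError("timeout must be positive")
        self._inner = inner
        self._timeout = timeout
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def evaluate(self, ctx):
        future = self._pool.submit(self._inner, ctx)  # like Task.Run: frees the caller
        try:
            return future.result(timeout=self._timeout)
        except concurrent.futures.TimeoutError:
            # Same trade-off as the real wrapper: the worker keeps running
            # until the script returns; we only stop waiting for it.
            raise ScriptTimeoutError(
                f"script exceeded its {self._timeout}s budget; check the script "
                "log, widen the timeout, or simplify the script")

fast = TimedEvaluator(lambda ctx: ctx * 2)
print(fast.evaluate(21))  # → 42

slow = TimedEvaluator(lambda ctx: time.sleep(1.0) or ctx, timeout=0.05)
try:
    slow.evaluate(1)
except ScriptTimeoutError:
    print("timed out")  # → timed out
```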

ScriptTimeoutException carries the configured Timeout and a diagnostic message pointing the operator at ctx.Logger output around the failure plus suggesting widening the timeout, simplifying the script, or moving heavy work out of the evaluation path. The virtual-tag engine (Stream B) will catch this and map the owning tag's quality to BadInternalError per Phase 7 decision #11, logging a structured warning with the offending script name.

Tests: CompiledScriptCacheTests (10) — first-call compile, identical-source dedupe to same instance, different-source produces different evaluator, whitespace-sensitivity documented, cached evaluator still runs correctly, failed compile evicted for retry, Clear drops entries, concurrent GetOrCompile of the same source deduplicates to one instance, different TContext/TResult use separate cache instances, null source rejected. TimedScriptEvaluatorTests (9) — fast script completes under timeout, CPU-bound script throws ScriptTimeoutException, caller cancellation takes precedence over timeout (shutdown path correctness), default 250ms per plan, zero/negative timeout rejected at construction, null inner rejected, null context rejected, user-thrown exceptions propagate unwrapped (not conflated with timeout), timeout exception message contains diagnostic guidance. Full suite: 48/48 green (29 from A.1 + 19 new).

Next: Stream A.3 wires the dedicated scripts-*.log Serilog rolling sink + structured-property filtering + companion-WARN enricher to the main log, closing out Stream A.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 16:38:43 -04:00
d2bfcd9f1e Merge pull request 'Phase 7 Stream A.1 — Core.Scripting project scaffold + ScriptContext + sandbox + AST dependency extractor' (#177) from phase-7-stream-a1-core-scripting into v2 2026-04-20 16:29:44 -04:00
Joseph Doherty
e4dae01bac Phase 7 Stream A.1 — Core.Scripting project scaffold + ScriptContext + sandbox + AST dependency extractor. First of 3 increments within Stream A. Ships the Roslyn-based script engine's foundation: user C# snippets compile against a constrained ScriptOptions allow-list + get a post-compile sandbox guard, the static tag-dependency set is extracted from the AST at publish time, and the script sees a stable ctx.GetTag/SetVirtualTag/Now/Logger/Deadband API that later streams plug into concrete backends.
ScriptContext abstract base defines the API user scripts see as ctx — GetTag(string) returns DataValueSnapshot so scripts branch on quality naturally, SetVirtualTag(string, object?) is the only write path virtual tags have (OPC UA client writes to virtual nodes rejected separately in DriverNodeManager per ADR-002), Now + Logger + Deadband static helper round out the surface. Concrete subclasses in Streams B + C plug in actual tag backends + per-script Serilog loggers.

ScriptSandbox.Build(contextType) produces the ScriptOptions for every compile — explicit allow-list of six assemblies (System.Private.CoreLib / System.Linq / Core.Abstractions / Core.Scripting / Serilog / the context type's own assembly), with a matching import list so scripts don't need using directives. Allow-list is plan-level — expanding it is not a casual change.

DependencyExtractor uses CSharpSyntaxWalker to find every ctx.GetTag("literal") and ctx.SetVirtualTag("literal", ...) call, rejects every non-literal path (variable, concatenation, interpolation, method-returned). Rejections carry the exact TextSpan so the Admin UI can point at the offending token. Reads + writes are returned as two separate sets so the virtual-tag engine (Stream B) knows both the subscription targets and the write targets.
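
The literal-only rule, sketched against Python's ast module instead of CSharpSyntaxWalker — method names kept from the ctx API, span-carrying simplified to line/column:

```python
import ast

class DependencyExtractor(ast.NodeVisitor):
    """Analogue of the rule: accept only string-literal tag paths in
    ctx.GetTag / ctx.SetVirtualTag; record everything else as a rejection
    with its source position."""
    def __init__(self):
        self.reads, self.writes, self.rejections = set(), set(), []

    def visit_Call(self, node):
        f = node.func
        if (isinstance(f, ast.Attribute) and isinstance(f.value, ast.Name)
                and f.value.id == "ctx" and f.attr in ("GetTag", "SetVirtualTag")):
            arg = node.args[0] if node.args else None
            if isinstance(arg, ast.Constant) and isinstance(arg.value, str) and arg.value.strip():
                (self.reads if f.attr == "GetTag" else self.writes).add(arg.value)
            else:
                # carry the offending span so a UI could point at the token
                self.rejections.append((node.lineno, node.col_offset))
        self.generic_visit(node)  # keep walking: calls nest inside calls

src = 'ctx.SetVirtualTag("Plant/Total", ctx.GetTag("Plant/A") + ctx.GetTag("Plant/" + name))'
ex = DependencyExtractor()
ex.visit(ast.parse(src))
print(sorted(ex.reads), sorted(ex.writes), len(ex.rejections))
# → ['Plant/A'] ['Plant/Total'] 1
```

Reads and writes come back as two separate sets, matching the subscription-targets vs write-targets split the engine needs.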

Sandbox enforcement turned out to need a second-pass semantic analyzer because .NET 10's type forwarding makes assembly-level restriction leaky — System.Net.Http.HttpClient resolves even with WithReferences limited to six assemblies. ForbiddenTypeAnalyzer runs after Roslyn's Compile() against the SemanticModel, walks every ObjectCreationExpression / InvocationExpression / MemberAccessExpression / IdentifierName, resolves to the containing type's namespace, and rejects any prefix-match against the deny-list (System.IO, System.Net, System.Diagnostics, System.Reflection, System.Threading.Thread, System.Runtime.InteropServices, Microsoft.Win32). Rejections throw ScriptSandboxViolationException with the aggregated list + source spans so the Admin UI surfaces every violation in one round-trip instead of whack-a-mole. System.Environment explicitly stays allowed (read-only process state, doesn't persist or leak outside) and that compromise is pinned by a dedicated test.
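
The deny-list pass reduces to a prefix match plus aggregation; a sketch over hypothetical (type-name, span) pairs like those the semantic model would yield:

```python
DENY_PREFIXES = ("System.IO", "System.Net", "System.Diagnostics",
                 "System.Reflection", "System.Threading.Thread",
                 "System.Runtime.InteropServices", "Microsoft.Win32")

class SandboxViolation(Exception):
    def __init__(self, violations):
        super().__init__(f"{len(violations)} forbidden reference(s)")
        self.violations = violations

def check_resolved_types(resolved):
    """Aggregate every prefix-match against the deny-list, then raise once,
    so all violations surface in a single round-trip instead of one at a
    time. `resolved` is a list of (full_type_name, span) pairs."""
    bad = [(name, span) for name, span in resolved
           if any(name == p or name.startswith(p + ".") for p in DENY_PREFIXES)]
    if bad:
        raise SandboxViolation(bad)

# System.Environment passes: the pinned read-only-process-state compromise.
check_resolved_types([("System.Environment", (3, 8))])
try:
    check_resolved_types([("System.IO.File", (1, 0)),
                          ("System.Net.Http.HttpClient", (2, 4))])
except SandboxViolation as e:
    print(len(e.violations))  # → 2
```

Note the dotted-prefix match keeps System.Threading.Tasks usable while still catching System.Threading.Thread itself.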

ScriptGlobals<TContext> wraps the context as a named field so scripts see ctx instead of the bare globalsType-member-access convention Roslyn defaults to — keeps script ergonomics (ctx.GetTag) consistent with the AST walker's parse shape and the Admin UI's hand-written type stub (coming in Stream F). Generic on TContext so Stream C's alarm-predicate context with an Alarm property inherits cleanly.

ScriptEvaluator<TContext, TResult>.Compile is the three-step gate: (1) Roslyn compile — throws CompilationErrorException on syntax/type errors with Location-carrying diagnostics; (2) ForbiddenTypeAnalyzer semantic pass — catches type-forwarding sandbox escapes; (3) delegate creation. Runtime exceptions from user code propagate unwrapped — the virtual-tag engine in Stream B catches + maps per-tag to BadInternalError quality per Phase 7 decision #11.

29 unit tests covering every surface: DependencyExtractorTests has 14 theories — single/multiple/deduplicated reads, separate write tracking, rejection of variable/concatenated/interpolated/method-returned/empty/whitespace paths, ignoring non-ctx methods named GetTag, empty-source no-op, source span carried in rejections, multiple bad paths reported in one pass, nested literal extraction. ScriptSandboxTests has 15 — happy-path compile + run, SetVirtualTag round-trip, rejection of File.IO + HttpClient + Process.Start + Reflection.Assembly.Load via ScriptSandboxViolationException, Environment.GetEnvironmentVariable explicitly allowed (pinned compromise), script-exception propagation, ctx.Now reachable, Deadband static reachable, LINQ Where/Sum reachable, DataValueSnapshot usable in scripts including quality branches, compile error carries source location.

Next two PRs within Stream A: A.2 adds the compile cache (source-hash keyed) + per-evaluation timeout wrapper; A.3 wires the dedicated scripts-*.log Serilog rolling sink with structured-property filtering + the companion-warning enricher to the main log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 16:27:07 -04:00
6ae638a6de Merge pull request 'ADR-002 — driver-vs-virtual dispatch for Phase 7 scripting' (#176) from adr-002-driver-vs-virtual-dispatch into v2 2026-04-20 16:10:30 -04:00
Joseph Doherty
2a74daf228 ADR-002 — driver-vs-virtual dispatch: DriverNodeManager routes reads/writes/subscriptions across driver tags and virtual (scripted) tags via a single NodeManager with a NodeSource tag on NodeScopeResolver's output. Locks the architecture decision Phase 7 Stream G was going to have to make anyway — documenting it up front so the stream implementation can reference the chosen shape instead of rediscovering it.

Option A (separate VirtualTagNodeManager sibling) rejected because shared Equipment folders owning both driver and virtual children would force two NodeManagers to fight for ownership on every Equipment node — the common case, not the exception — defeating the separation. Option C (virtual engine registers as a synthetic IDriver through DriverTypeRegistry) rejected because DriverInstance shape is wrong for scripting config (no DriverType, no HostAddress, no connectivity probe, no NSSM wrapper), IDriver.InitializeAsync semantics don't match script compilation, Polly resilience wrappers calibrated for network calls would either passthrough pointlessly or tune wrong, and Admin UI would need special-casing everywhere to hide fields that don't apply.

Option B (single DriverNodeManager, NodeScopeResolver returns NodeSource enum alongside ScopeId, dispatch branches on source) accepted because it preserves one address-space tree with one walker, ACL binding works identically for both kinds, Phase 6.1 resilience + Phase 6.2 audit apply uniformly to the driver branch without needing Roslyn analyzer exemptions, and adding future source kinds is a single-enum-case addition. NodeScopeResolver.Resolve returns NodeScope(ScopeId, NodeSource, DriverInstanceId?, VirtualTagId?); DriverNodeManager pattern-matches on scope.Source and routes to either the driver dictionary or IVirtualTagEngine.

OPC UA client writes to a virtual node return BadUserAccessDenied before the dispatch branch because Phase 7 decision #6 restricts virtual-tag writes to scripts via ctx.SetVirtualTag. Dispatch test coverage specified for Stream G.4: mixed Equipment folders browsing correctly, read routing per source kind, subscription fan-out across both kinds, the BadUserAccessDenied guard on virtual writes, and script-driven writes firing subscription notifications.

ADR-001's walker gains the VirtualTag config-DB table as an additional input channel alongside Tag; NodeScopeResolver's ScopeId return stays unchanged so Phase 6.2's ACL trie needs no modification. Consequences flagged: whether IVirtualTagEngine lives in Core.Abstractions vs Phase 7's Core.VirtualTags project, and whether future server-side methods on virtual nodes would route through this dispatch, both marked out-of-scope for ADR-002.
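
The accepted Option B dispatch reduces to a branch on one enum; a Python sketch with illustrative types (the real NodeScope and routing live in C#):

```python
from enum import Enum
from typing import NamedTuple, Optional

class NodeSource(Enum):
    DRIVER = "driver"
    VIRTUAL = "virtual"
    # adding a future source kind is one new enum case + one branch

class NodeScope(NamedTuple):
    """Shape per the ADR: scope id + source + owning ids."""
    scope_id: str
    source: NodeSource
    driver_instance_id: Optional[str] = None
    virtual_tag_id: Optional[str] = None

def route_read(scope: NodeScope, drivers: dict, virtual_engine: dict) -> str:
    """One NodeManager, one dispatch point: branch on the resolver's
    NodeSource and hand off to the right backend."""
    if scope.source is NodeSource.DRIVER:
        return drivers[scope.driver_instance_id]
    return virtual_engine[scope.virtual_tag_id]

drivers = {"d1": "driver-read"}
engine = {"v1": "virtual-read"}
print(route_read(NodeScope("s1", NodeSource.DRIVER, driver_instance_id="d1"),
                 drivers, engine))  # → driver-read
print(route_read(NodeScope("s2", NodeSource.VIRTUAL, virtual_tag_id="v1"),
                 drivers, engine))  # → virtual-read
```

ScopeId stays on the tuple untouched, which is why the ACL trie needs no modification.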
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 16:08:01 -04:00
3eb5f1d9da Merge pull request 'Phase 7 plan doc — scripting runtime + virtual tags + scripted alarms + historian alarm sink' (#175) from phase-7-plan-doc into v2 2026-04-20 16:07:34 -04:00
Joseph Doherty
f2c1cc84e9 Phase 7 plan doc — scripting runtime + virtual tags + scripted alarms + historian alarm sink. Draft output from the 2026-04-20 interactive planning session. Phase 7 is the last phase before v2 release readiness; adds two additive runtime capabilities on top of the existing driver + Equipment address-space foundation: (1) virtual (calculated) tags — OPC UA variables whose values are computed by user-authored C# scripts against other tags, evaluated on change and/or timer, living in the existing Equipment tree alongside driver tags, behaving identically to clients; (2) Part 9 scripted alarms — full state machine (EnabledState/ActiveState/AckedState/ConfirmedState/ShelvingState) with persistent operator-supplied state across restarts, complementing (not replacing) the existing Galaxy-native and AB CIP ALMD alarm sources. A third tie-in capability — Aveva Historian as alarm system of record — routes every qualifying alarm transition from any IAlarmSource (scripted + Galaxy + ALMD) through a local SQLite store-and-forward queue to Galaxy.Host, which uses its already-loaded aahClientManaged DLLs to write to the Historian alarm schema; per-alarm HistorizeToAveva toggle gates which sources flow (default off for Galaxy-native to avoid duplicating the direct Galaxy historian path, default on for scripted).
Locks in 22 design decisions from the planning conversation: C# via Roslyn scripting; virtual tags in the Equipment tree (not a separate /Virtual/ namespace); change-driven + timer-driven triggers operator-configurable per tag; Shape A one-script-per-tag-or-alarm (no predicate/action split); full OPC UA Part 9 alarm fidelity; read-only sandbox (scripts read any tag, write only to virtual tags, no File/HttpClient/Process/reflection); AST-inferred dependencies via CSharpSyntaxWalker (non-literal tag paths rejected at publish); config DB storage with generation-sealed cache; ctx.GetTag returns a full DataValue {Value, StatusCode, Timestamp}; per-tag Historize checkbox; per-tag error isolation (throwing script sets tag quality BadInternalError, engine unaffected); dedicated scripts-*.log Serilog sink bound to ctx.Logger; alarm message as template with {TagPath} substitution resolved at event emission; ActiveState recomputed from tags on startup while EnabledState/AckedState/ConfirmedState/ShelvingState + audit persist to config DB; historian sink scope = all IAlarmSource impls with per-alarm toggle; SQLite store-and-forward on the node so operators are never blocked by Historian downtime; IPC to Galaxy.Host for ingestion reusing the already-loaded aahClientManaged DLLs; Monaco editor for Admin code editing; serial cascade evaluation for v1 (parallel as follow-up); shelving UX via OPC UA method calls only with no custom Admin controls (operator drives state transitions from plant HMIs or Client.CLI); 30-day dead-letter retention with manual retry button; test harness accepts only declared-input paths so the harness enforces dependency declaration.

Eight streams totaling ~10-12 weeks, scope-comparable to Phase 6: A - Core.Scripting (Roslyn engine + sandbox + AST inference + logger); B - virtual tag engine (dependency graph + change/timer schedulers + historize); C - scripted alarm engine (Part 9 state machine + template messages + startup recovery + OPC UA method binding); D - historian alarm sink (SQLite store-and-forward + Galaxy.Host IPC contract extension); E - config DB schema (four new tables under sp_PublishGeneration); F - Admin UI scripting tab (Monaco + test harness + dependency preview + script-log viewer + historian diagnostics); G - address-space integration (extend EquipmentNodeWalker for virtual source kind + extend DriverNodeManager dispatch); H - exit gate.

Compliance-check surface covers sandbox escape (typeof/Assembly.Load/File/HttpClient attempts must fail at compile), dependency inference (literal-only paths), change cascade (topological ordering), cycle rejection at publish, startup recovery (ack/confirm/shelve survive restart but ActiveState recomputed), ack audit trail persistence, historian queue durability (Galaxy.Host offline → online drains in-order), per-alarm historian toggle gating, script timeout isolation, log sink isolation, ACL binding (virtual tags inherit Equipment scope grants).

Follow-up artifacts tracked as tasks #231-#238 (stream placeholders). Supporting doc updates (plan.md §6 Migration Strategy, config-db-schema.md §§ for the four new tables, driver-specs.md §Alarm semantics clarification, new ADR-002 for driver-vs-virtual dispatch) will land alongside the streams that touch them, not in this doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 16:05:12 -04:00
8384e58655 Merge pull request 'Modbus exception-injection profile — wire-level coverage for codes 0x01/0x03/0x04/0x05/0x06/0x0A/0x0B' (#174) from modbus-exception-injection-profile into v2 2026-04-20 15:14:00 -04:00
Joseph Doherty
96940aeb24 Modbus exception-injection profile — closes the end-to-end test gap for exception codes 0x01/0x03/0x04/0x05/0x06/0x0A/0x0B. pymodbus simulator naturally emits only 0x02 (Illegal Data Address on reads outside configured ranges) + 0x03 (Illegal Data Value on over-length); the driver's MapModbusExceptionToStatus table translates eight codes, but only 0x02 had integration-level coverage (via DL205's unmapped-register test). Unit tests lock the translation function in isolation but an integration test was missing for everything else. This PR lands wire-level coverage for the remaining seven codes without depending on device-specific quirks to naturally produce them.
New exception_injector.py — standalone pure-Python-stdlib Modbus/TCP server shipped alongside the pymodbus image. Speaks the wire protocol directly (MBAP header parse + FC 01/02/03/04/05/06/15/16 dispatch + store-backed happy-path reads/writes + spec-enforced length caps) and looks up each (fc, starting-address) against a rules list loaded from JSON; a matching rule makes the server respond [fc|0x80, exception_code] instead of the normal response. Zero runtime dependencies outside the stdlib — the Dockerfile just COPYs the script into /fixtures/ alongside the pymodbus profile JSONs, no new pip install needed. ~200 lines. New exception_injection.json profile carries rules for every exception code on FC03 (addresses 1000-1007, one per code), FC06 (2000-2001 for CPU-PROGRAM-mode and busy), and FC16 (3000 for server failure). New exception_injection compose profile binds :5020 like every other service + runs python /fixtures/exception_injector.py --config /fixtures/exception_injection.json.
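
The injected response is just two PDU bytes behind a normal MBAP header; a sketch of the wire shape (not the shipped injector code):

```python
import struct

def exception_response(transaction_id: int, unit_id: int, fc: int, exc_code: int) -> bytes:
    """Build a Modbus/TCP exception frame: MBAP header (transaction id,
    protocol 0, remaining length, unit id) then [fc | 0x80, exception code].
    The MBAP length field counts the unit id plus the PDU bytes."""
    pdu = struct.pack("BB", fc | 0x80, exc_code)
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu

# FC03 read answered with exception 0x01 (Illegal Function):
frame = exception_response(transaction_id=1, unit_id=0xFF, fc=0x03, exc_code=0x01)
print(frame.hex())  # → 000100000003ff8301
```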

New ExceptionInjectionTests.cs in Modbus.IntegrationTests — 11 tests. Eight FC03-read theories exercise every exception code 0x01/0x02/0x03/0x04/0x05/0x06/0x0A/0x0B asserting the driver's expected OPC UA StatusCode mapping (BadNotSupported/BadOutOfRange/BadOutOfRange/BadDeviceFailure/BadDeviceFailure/BadDeviceFailure/BadCommunicationError/BadCommunicationError). Two FC06-write theories cover the write path for 0x04 (Server Failure, CPU in PROGRAM mode) + 0x06 (Server Busy). One sanity-check read at address 5 confirms the injector isn't globally broken + non-injected reads round-trip cleanly with Value=5/StatusCode=Good. All tests follow the MODBUS_SIM_PROFILE=exception_injection skip guard so they no-op on a fresh clone without Docker running.
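
The eight read assertions reduce to one table — status names per the theories above, exception names per the Modbus spec; the fallback for unmapped codes is an assumption, not taken from the driver:

```python
EXCEPTION_TO_STATUS = {
    0x01: "BadNotSupported",        # Illegal Function
    0x02: "BadOutOfRange",          # Illegal Data Address
    0x03: "BadOutOfRange",          # Illegal Data Value
    0x04: "BadDeviceFailure",       # Server Device Failure
    0x05: "BadDeviceFailure",       # Acknowledge
    0x06: "BadDeviceFailure",       # Server Device Busy
    0x0A: "BadCommunicationError",  # Gateway Path Unavailable
    0x0B: "BadCommunicationError",  # Gateway Target Failed to Respond
}

def map_exception_to_status(code: int) -> str:
    # Fallback for codes outside the table is assumed, not confirmed.
    return EXCEPTION_TO_STATUS.get(code, "BadDeviceFailure")

print(map_exception_to_status(0x0A))  # → BadCommunicationError
```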

Docker/README.md gains an §Exception injection section explaining what pymodbus can and cannot emit, what the injector does, where the rules live, and how to append new ones. docs/drivers/Modbus-Test-Fixture.md follow-up item #2 (extend pymodbus profiles to inject exceptions) gets a shipped strikethrough with the new coverage inventory; the unit-level section adds ExceptionInjectionTests next to DL205ExceptionCodeTests so the split-of-responsibilities is explicit (DL205 test = natural out-of-range via dl205 profile, ExceptionInjectionTests = every other code via the injector).

Test baselines: Modbus unit 182/182 green (unchanged); Modbus integration with exception_injection profile live 11/11 new tests green. Existing DL205/S7/Mitsubishi integration tests unaffected since they skip on MODBUS_SIM_PROFILE mismatch.

Found + fixed during validation: a stale native pymodbus simulator from April 18 was still listening on port 5020 on IPv6 localhost (Windows was load-balancing between it + the Docker IPv4 forward, making injected exceptions intermittently come back as pymodbus's default 0x02). Killed the leftover. Documented the debugging path in the commit as a note for anyone who hits the same "my tests see exception 0x02 but the injector log has no request" symptom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 15:11:32 -04:00
340f580be0 Merge pull request 'FOCAS Tier-C PR E — ops glue: ProcessHostLauncher + post-mortem MMF + NSSM scripts' (#173) from focas-tier-c-pr-e-ops-glue into v2 2026-04-20 14:26:35 -04:00
Joseph Doherty
8d88ffa14d FOCAS Tier-C PR E — ops glue: ProcessHostLauncher + post-mortem MMF + NSSM install scripts + doc close-out. Final of the 5 PRs for #220. With this landing, the Tier-C architecture is fully shipped; the only remaining FOCAS work is the hardware-dependent FwlibHostedBackend (real Fwlib32.dll P/Invoke, gated on #222 lab rig).
Production IHostProcessLauncher (ProcessHostLauncher.cs): Process.Start spawns OtOpcUa.Driver.FOCAS.Host.exe with OTOPCUA_FOCAS_PIPE / OTOPCUA_ALLOWED_SID / OTOPCUA_FOCAS_SECRET / OTOPCUA_FOCAS_BACKEND in the environment (supervisor-owned, never disk), polls FocasIpcClient.ConnectAsync at 250ms cadence until the pipe is up or the Host exits or the ConnectTimeout deadline passes, then wraps the connected client in an IpcFocasClient. TerminateAsync kills the entire process tree + disposes the IPC stream. ProcessHostLauncherOptions carries HostExePath + PipeName + AllowedSid plus optional SharedSecret (auto-generated from a GUID when omitted so install scripts don't have to), Arguments, Backend (fwlib32/fake/unconfigured default-unconfigured), ConnectTimeout (15s), and Series for CNC pre-flight.

Post-mortem MMF (Host/Stability/PostMortemMmf.cs + Proxy/Supervisor/PostMortemReader.cs): ring-buffer of the last ~1000 IPC operations written by the Host into a memory-mapped file. On a Host crash the supervisor reads the MMF — which survives process death — to see what was in flight. File format: 16-byte header [magic 'OFPC' (0x4F465043) | version | capacity | writeIndex] + N × 256-byte entries [8-byte UTC unix ms | 8-byte opKind | 240-byte UTF-8 message + null terminator]. Magic distinguishes FOCAS MMFs from the Galaxy MMFs, which share the same format shape. Writer is single-producer (Host) gated by a _writeGate lock; reader is multi-consumer (Proxy + any diagnostic tool) using a separate MemoryMappedFile handle.
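
The layout packs and unpacks with plain struct rules; a Python sketch (field widths per the format above, little-endian is an assumption the commit doesn't pin):

```python
import struct
import time

MAGIC = 0x4F465043  # 'OFPC'

def pack_header(version: int, capacity: int, write_index: int) -> bytes:
    """16-byte header: magic | version | capacity | writeIndex."""
    return struct.pack("<IIII", MAGIC, version, capacity, write_index)

def pack_entry(op_kind: int, message: str) -> bytes:
    """One 256-byte slot: 8-byte UTC unix ms, 8-byte op kind, then a
    240-byte UTF-8 message field, truncated and always null-terminated."""
    body = message.encode("utf-8")[:239] + b"\x00"
    return struct.pack("<qq", int(time.time() * 1000), op_kind) + body.ljust(240, b"\x00")

def ring_order(write_index: int, capacity: int) -> list:
    """Oldest-to-newest slot order the reader walks when writeIndex != 0."""
    return [(write_index + i) % capacity for i in range(capacity)]

print(len(pack_header(1, 1000, 0)), len(pack_entry(7, "cnc_rdaxis ok")))  # → 16 256
print(ring_order(write_index=1, capacity=3))  # → [1, 2, 0]
```

The fixed-size slots are what let the reader decode a crashed writer's file with nothing but the header's writeIndex.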

NSSM install wrappers (scripts/install/Install-FocasHost.ps1 + Uninstall-FocasHost.ps1): idempotent service registration for OtOpcUaFocasHost. Resolves SID from the ServiceAccount, generates a fresh shared secret per install if not supplied, stages OTOPCUA_FOCAS_PIPE/SID/SECRET/BACKEND in AppEnvironmentExtra so they never hit disk, rotates 10MB stdout/stderr logs under %ProgramData%\OtOpcUa, DependOnService=OtOpcUa so startup order is deterministic. Backend selector defaults to unconfigured so a fresh install doesn't accidentally load a half-configured Fwlib32.dll on first start.

Tests (7 new, 2 files): PostMortemMmfTests.cs in FOCAS.Host.Tests — round-trip write+read preserves order + content, ring-buffer wraps at capacity (writes 10 entries to a 3-slot buffer, asserts only op-7/8/9 survive in FIFO order), message truncation at the 240-byte cap is null-terminated + non-overflowing, reopening an existing file preserves entries. PostMortemReaderCompatibilityTests.cs in FOCAS.Tests — hand-writes a file in the exact host format (magic/entry layout) + asserts the Proxy reader decodes with correct ring-walk ordering when writeIndex != 0, empty-return on missing file + magic mismatch. Keeps the two codebases in format-lockstep without the net10 test project referencing the net48 Host assembly.

Docs updated: docs/v2/implementation/focas-isolation-plan.md promoted from DRAFT to PRs A-E shipped status with per-PR citations + post-ship test counts (189 + 24 + 13 = 226 FOCAS-family tests green). docs/drivers/FOCAS-Test-Fixture.md §5 updated from "architecture scoped but not implemented" to listing the shipped components with the FwlibHostedBackend gap explicitly labeled as hardware-gated. Install-FocasHost.ps1 documents the OTOPCUA_FOCAS_BACKEND selector + points at docs/v2/focas-deployment.md for Fwlib32.dll licensing.

What ISN'T in this PR: (1) the real FwlibHostedBackend implementing IFocasBackend with the P/Invoke — requires either a CNC on the bench or a licensed FANUC developer kit to validate, tracked under #220 as a single follow-up task; (2) Admin /hosts surface integration for FOCAS runtime status — Galaxy Tier-C already has the shape, FOCAS can slot in when someone wires ObservedCrashes/StickyAlertActive/BackoffAttempt to the FleetStatusHub; (3) a full integration test that actually spawns a real FOCAS Host process — ProcessHostLauncher is tested via its contract + the MMF is tested via round-trip, but no test spins up the real exe (the Galaxy Tier-C tests do this, but the FOCAS equivalent adds no new coverage over what's already in place).

Total FOCAS-family tests green after this PR: 189 driver + 24 Shared + 13 Host = 226.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:24:13 -04:00
446a5c022c Merge pull request 'FOCAS Tier-C PR D — supervisor + backoff + crash-loop breaker' (#172) from focas-tier-c-pr-d-supervisor into v2 2026-04-20 14:19:32 -04:00
Joseph Doherty
5033609944 FOCAS Tier-C PR D — supervisor + backoff + crash-loop breaker + heartbeat monitor. Fourth of 5 PRs for #220. Ships the resilience harness that sits between the driver's IFocasClient requests and the Tier-C Host process, so a crashing Fwlib32.dll takes down only the Host (not the main server), gets respawned on a backoff ladder, and opens a circuit with a sticky operator alert when the crash rate is pathological. Same shape as Galaxy Tier-C so the Admin /hosts surface has a single mental model. New Supervisor/ namespace in Driver.FOCAS (.NET 10, Proxy-side): Backoff with the 5s→15s→60s default ladder + StableRunThreshold that resets the index after a 2-min clean run (so a one-off crash after hours of steady-state doesn't restart from the top); CircuitBreaker with 3-crashes-in-5-min threshold + escalating 1h→4h→manual-reset cooldown ladder + StickyAlertActive flag that persists across cooldowns until AcknowledgeAndReset is called; HeartbeatMonitor tracking ConsecutiveMisses against the 3-misses-kill threshold + LastAckUtc for telemetry; IHostProcessLauncher abstraction over "spawn Host process + produce an IFocasClient connected to it" so the supervisor stays I/O-free and fully testable with a fake launcher that can be told to throw on specific attempts (production wiring over Process.Start + FocasIpcClient.ConnectAsync is the PR E ops-glue concern); FocasHostSupervisor orchestrating them — GetOrLaunchAsync cycles through backoff until either a client is returned or the breaker opens (surfaced as InvalidOperationException so the driver maps to BadDeviceFailure), NotifyHostDeadAsync fans out the unavailable event + terminates the current launcher + records the crash without blocking (so heartbeat-loss detection can short-circuit subscriber fan-out and let the next GetOrLaunchAsync handle the respawn), AcknowledgeAndReset is the operator-clear path, OnUnavailable event for Admin /hosts wiring + ObservedCrashes + BackoffAttempt + StickyAlertActive for 
telemetry. 14 new unit tests across SupervisorTests.cs: Backoff (default sequence, clamping, RecordStableRun resets), CircuitBreaker (below threshold allowed, opens at threshold, escalates cooldown after second open, ManualReset clears state), HeartbeatMonitor (3 consecutive misses declares dead, ack resets counter), FocasHostSupervisor (first-launch success, retry-with-backoff after transient failure, repeated failures open breaker + surface InvalidOperationException, NotifyHostDeadAsync terminates + fan-outs + increments crash count, AcknowledgeAndReset clears sticky, Dispose terminates). Full FOCAS driver tests now 186/186 green (172 + 14 new). No changes to IFocasClient DI contract; existing FakeFocasClient-based tests unaffected. PR E wires the real Process-based IHostProcessLauncher + NSSM install scripts + MMF post-mortem + docs.
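The backoff-ladder behavior described above (escalate 5s→15s→60s, clamp at the top rung, reset the index only after a sustained clean run) can be sketched roughly as follows. This is a minimal Python stand-in for the concept only; class and member names are illustrative, not the actual Driver.FOCAS Supervisor API.

```python
import time

class Backoff:
    """Escalating delay ladder that resets after a sustained clean run."""

    def __init__(self, ladder=(5.0, 15.0, 60.0), stable_run_threshold=120.0):
        self.ladder = ladder
        self.stable_run_threshold = stable_run_threshold  # seconds of clean run
        self.attempt = 0
        self._run_started = None

    def next_delay(self):
        # Clamp at the top rung: repeated crashes keep waiting 60 s.
        delay = self.ladder[min(self.attempt, len(self.ladder) - 1)]
        self.attempt += 1
        return delay

    def mark_running(self, now=None):
        self._run_started = time.monotonic() if now is None else now

    def record_stable_run(self, now=None):
        # A clean run past the threshold resets the ladder, so a one-off
        # crash after hours of steady state restarts at 5 s, not 60 s.
        now = time.monotonic() if now is None else now
        if self._run_started is not None and now - self._run_started >= self.stable_run_threshold:
            self.attempt = 0
```

A shorter clean run (under the threshold) deliberately leaves the index where it is, which is what keeps a crash-looping host marching up the ladder instead of oscillating.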
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:17:23 -04:00
9034294b77 Merge pull request 'FOCAS Tier-C PR C — IPC path end-to-end' (#171) from focas-tier-c-pr-c-ipc-proxy into v2 2026-04-20 14:13:33 -04:00
Joseph Doherty
3892555631 FOCAS Tier-C PR C — IPC path end-to-end: Proxy IpcFocasClient + Host FwlibFrameHandler + IFocasBackend abstraction. Third of 5 PRs for #220. Ships the wire path from IFocasClient calls in the .NET 10 driver, over a named-pipe (or in-memory stream) to the .NET 4.8 Host's FwlibFrameHandler, dispatched to an IFocasBackend. Keeps the existing IFocasClient DI seam intact so existing unit tests are unaffected (172/172 still pass). Proxy side adds Ipc/FocasIpcClient (owns one pipe stream + call gate so concurrent callers don't interleave frames, supports both real NamedPipeClientStream and arbitrary Stream for in-memory test loopback) and Ipc/IpcFocasClient (implements IFocasClient by forwarding every call as an IPC frame — Connect sends OpenSessionRequest and caches the SessionId; Read sends ReadRequest and decodes the typed value via FocasDataTypeCode; Write sends WriteRequest for non-bit data or PmcBitWriteRequest when it's a PMC bit so the RMW critical section stays on the Host; Probe sends ProbeRequest; Dispose best-effort sends CloseSessionRequest); plus FocasIpcException surfacing Host-side ErrorResponse frames as typed exceptions. Host side adds Backend/IFocasBackend (the Host's view of one FOCAS session — Open/Close/Read/Write/PmcBitWrite/Probe) with two implementations: FakeFocasBackend (in-memory, per-address value store, honors bit-write RMW semantics against the containing byte — used by tests and as an OTOPCUA_FOCAS_BACKEND=fake operational stub) and UnconfiguredFocasBackend (structured failure pointing at docs/v2/focas-deployment.md — the safe default when OTOPCUA_FOCAS_BACKEND is unset or hardware isn't configured). Ipc/FwlibFrameHandler replaces StubFrameHandler: deserializes each request DTO, delegates to the IFocasBackend, re-serializes into the matching response kind. Catches backend exceptions and surfaces them as ErrorResponse{backend-exception} rather than tearing down the pipe. 
Program.cs now picks the backend from OTOPCUA_FOCAS_BACKEND env var (fake/unconfigured/fwlib32; fwlib32 still maps to Unconfigured because the real Fwlib32 P/Invoke integration is a hardware-dependent follow-up — #220 captures it). Tests: 7 new IPC round-trip tests on the Proxy side (IpcFocasClient vs. an IpcLoopback fake server: connect happy path, connect rejection, read decode, write round-trip, PMC bit write routes to first-class RMW frame, probe, ErrorResponse surfaces as typed exception) + 6 new Host-side tests on FwlibFrameHandler (OpenSession allocates id, read-without-session fails, full open/write/read round-trip preserves value, PmcBitWrite sets the specified bit, Probe reports healthy with open session, UnconfiguredBackend returns pointed-at-docs error with ErrorCode=NoFwlibBackend). Existing 165 FOCAS unit tests + 24 Shared tests + 3 Host handshake tests all unchanged. Total post-PR: 172+24+9 = 205 FOCAS-family tests green. What's NOT in this PR: the actual Fwlib32.dll P/Invoke integration inside the Host (FwlibHostedBackend) lands as a hardware-dependent follow-up since no CNC is available for validation; supervisor + respawn + crash-loop breaker comes in PR D; MMF + NSSM install scripts in PR E.
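The PMC bit-write semantics kept Host-side (a first-class read-modify-write against the containing byte) can be illustrated with a small sketch. Python stand-in with hypothetical names; the real logic lives behind IFocasBackend.PmcBitWrite on the Host so the critical section never crosses the pipe.

```python
def pmc_bit_write(memory: bytearray, byte_addr: int, bit: int, value: bool) -> None:
    """Read-modify-write one bit against its containing byte.

    Keeping the read and the write inside one Host-side critical section
    means two concurrent bit writes to the same byte can't clobber each
    other's bits, which is why the frame carries the whole RMW operation.
    """
    current = memory[byte_addr]          # read the containing byte
    mask = 1 << bit
    memory[byte_addr] = (current | mask) if value else (current & ~mask)
```

A client that instead read the byte, flipped the bit locally, and wrote the byte back would reintroduce the lost-update race the first-class frame exists to prevent.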
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:10:52 -04:00
3609a5c676 Merge pull request 'FOCAS Tier-C PR B — Driver.FOCAS.Host net48 x86 skeleton' (#170) from focas-tier-c-pr-b-host into v2 2026-04-20 14:02:56 -04:00
Joseph Doherty
a6f53e5b22 FOCAS Tier-C PR B — Driver.FOCAS.Host net48 x86 skeleton + pipe server. Second PR of the 5-PR #220 split. Stands up the Windows Service entry point + named-pipe scaffolding so PR C has a place to move the Fwlib32 calls into. New net48 x86 project (Fwlib32.dll is 32-bit-only, same bitness constraint as Galaxy.Host/MXAccess); references Driver.FOCAS.Shared for framing + DTOs so the wire format Proxy<->Host stays a single type system. Ships four files: PipeAcl creates a PipeSecurity where only the configured server principal SID gets ReadWrite+Synchronize + LocalSystem/BuiltinAdministrators are explicitly denied (so a compromised service account on the same host can't escalate via the pipe); IFrameHandler abstracts the dispatch surface with HandleAsync + AttachConnection for server-push event sinks; PipeServer accepts one connection at a time, verifies the peer SID via RunAsClient, reads the first Hello frame + matches the shared-secret and protocol major version, sends HelloAck, then hands off to the handler until EOF or cancel; StubFrameHandler fully handles Heartbeat/HeartbeatAck so a future supervisor's liveness detector stays happy, and returns ErrorResponse{Code=not-implemented,Message="Kind X is stubbed - Fwlib32 lift lands in PR C"} for every data-plane request. Program.cs mirrors the Galaxy.Host startup exactly: reads OTOPCUA_FOCAS_PIPE / OTOPCUA_ALLOWED_SID / OTOPCUA_FOCAS_SECRET from the env (supervisor passes these at spawn time), builds Serilog rolling-file logger into %ProgramData%\OtOpcUa\focas-host-*.log, constructs the pipe server with StubFrameHandler, loops on RunAsync until SIGINT. Log messages mark the backend as "stub" so it's visible in logs that Fwlib32 isn't actually connected yet. 
Driver.FOCAS.Host.Tests (net48 x86) ships three integration tests mirroring IpcHandshakeIntegrationTests from Galaxy.Host: correct-secret handshake + heartbeat round-trip, wrong-secret rejection with UnauthorizedAccessException, and a new test that sends a ReadRequest and asserts the StubFrameHandler returns ErrorResponse{not-implemented} mentioning PR C in the message so the wiring between frame dispatch + kind → error mapping is locked. Tests follow the same is-Administrator skip guard as Galaxy because PipeAcl denies BuiltinAdministrators. No changes to existing driver code; FOCAS unit tests still at 165/165 + Shared tests at 24/24. PR C wires the real Fwlib32 backend.
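The StubFrameHandler dispatch shape (answer Heartbeat so liveness stays green, fail every data-plane kind loudly instead of tearing down the pipe) reduces to something like the sketch below. Python stand-in; the kind byte values here are illustrative, the real ones are defined in Driver.FOCAS.Shared.

```python
# Illustrative kind bytes, NOT the wire-stable values from Driver.FOCAS.Shared.
HEARTBEAT, HEARTBEAT_ACK, ERROR_RESPONSE, READ_REQUEST = 0x02, 0x03, 0x7F, 0x10

def stub_dispatch(kind, body):
    """Stub-handler shape: keep the supervisor's liveness detector happy,
    return a pointed not-implemented error for everything else so the
    connection survives and the caller sees exactly why the call failed."""
    if kind == HEARTBEAT:
        return HEARTBEAT_ACK, body
    return ERROR_RESPONSE, b"Kind %d is stubbed - Fwlib32 lift lands in PR C" % kind
```

The value of the pattern is that a future supervisor can already drive the heartbeat loop against the skeleton before any real backend exists.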
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:00:56 -04:00
b968496471 Merge pull request 'FOCAS Tier-C PR A — Driver.FOCAS.Shared MessagePack contracts' (#169) from focas-tier-c-pr-a-shared into v2 2026-04-20 13:57:45 -04:00
Joseph Doherty
e6ff39148b FOCAS Tier-C PR A — Driver.FOCAS.Shared MessagePack IPC contracts. First PR of the 5-PR #220 split (isolation plan at docs/v2/implementation/focas-isolation-plan.md). Adds a new netstandard2.0 project consumable by both the .NET 10 Proxy and the future .NET 4.8 x86 Host, carrying every wire DTO the Proxy <-> Host pair will exchange: Hello/HelloAck + Heartbeat/HeartbeatAck + ErrorResponse for session negotiation (shared-secret + protocol major/minor mirroring Galaxy.Shared); OpenSessionRequest/Response + CloseSessionRequest carrying the declared FocasCncSeries so the Host picks up the pre-flight matrix; FocasAddressDto + FocasDataTypeCode for wire-compatible serialization of parsed addresses (0=Pmc/1=Param/2=Macro matches FocasAreaKind enum order so both sides cast (int)); ReadRequest/Response + WriteRequest/Response with MessagePack-serialized boxed values tagged by FocasDataTypeCode; PmcBitWriteRequest/Response as a first-class RMW operation so the critical section stays Host-side; Subscribe/Unsubscribe/OnDataChangeNotification for poll-loop-pushes-deltas model (FOCAS has no CNC-initiated callbacks); Probe + RuntimeStatusChange + Recycle surface for Tier-C supervision. Framing is [4-byte BE length][1-byte kind][body] with 16 MiB body cap matching Galaxy; FocasMessageKind byte values align with Galaxy ranges so an operator reading a hex dump doesn't have to context-switch. FrameReader/FrameWriter ported from Galaxy.Shared with thread-safe concurrent-write serialization. 24 new unit tests: 18 per-DTO round-trip tests covering every field + 6 framing tests (single-frame round-trip, clean-EOF returns null, oversized-length rejection, mid-frame EOF throws, 20-way concurrent-write ordering preserved, MessageKind byte values locked as wire-stable). No driver changes; existing 165 FOCAS unit tests still pass unchanged. PR B (Host skeleton) goes next.
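The [4-byte BE length][1-byte kind][body] framing with the 16 MiB cap can be sketched in a few lines. Python stand-in for the C# FrameReader/FrameWriter; one assumption is called out in the comment (the commit text doesn't pin down whether the length field counts the kind byte, this sketch assumes body only).

```python
import struct

MAX_BODY = 16 * 1024 * 1024  # 16 MiB body cap, matching Galaxy

def write_frame(kind, body):
    """[4-byte big-endian length][1-byte kind][body].

    Assumption: the length field counts the body only, not the kind byte."""
    if len(body) > MAX_BODY:
        raise ValueError("body exceeds 16 MiB cap")
    return struct.pack(">IB", len(body), kind) + body

def read_frame(buf):
    """Decode one frame; reject an oversized length before allocating."""
    length, kind = struct.unpack_from(">IB", buf, 0)
    if length > MAX_BODY:
        raise ValueError("oversized frame rejected")
    return kind, buf[5 : 5 + length]
```

Checking the length against the cap before touching the body is what turns a corrupt or hostile length prefix into a clean rejection instead of a 4 GB allocation attempt, which is exactly what the oversized-length-rejection framing test locks in.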
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:55:35 -04:00
4a6fe7fa7e Merge pull request 'FOCAS version-matrix stabilization (PR 1 of #220 split)' (#168) from focas-version-matrix-stabilization into v2 2026-04-20 13:46:53 -04:00
Joseph Doherty
a6be2f77b5 FOCAS version-matrix stabilization (PR 1 of #220 split) — ship the cheap half of the hardware-free stability gap ahead of the Tier-C out-of-process split. Without any CNC or simulator on the bench, the highest-leverage move is to catch operator config errors at init time instead of at steady-state per-read. Adds FocasCncSeries enum (Unknown/16i/0i-D/0i-F family/30i family/PowerMotion-i) + FocasCapabilityMatrix static class that encodes the per-series documented ranges for macro variables (cnc_rdmacro/wrmacro), parameters (cnc_rdparam/wrparam), and PMC letters + byte ceilings (pmc_rdpmcrng/wrpmcrng) straight from the Fanuc FOCAS Developer Kit. FocasDeviceOptions gains a Series knob (defaults Unknown = permissive so pre-matrix configs don't break on upgrade). FocasDriver.InitializeAsync now calls FocasAddress.TryParse on every tag + runs FocasCapabilityMatrix.Validate against the owning device's declared series, throwing InvalidOperationException with a reason string that names both the series and the documented limit ("Parameter #30000 is outside the documented range [0, 29999] for Thirty_i") so an operator can tell whether the mismatch is in the config or in their declared CNC model. Unknown series skips validation entirely. Ships 46 new theory cases in FocasCapabilityMatrixTests.cs — covering every boundary in the matrix (widen 16i->0i-F: macro ceiling 999->9999, param 9999->14999; widen 0i-F->30i: PMC letters +K+T; PMC-number 16i=999/0i-D=1999/0i-F=9999/30i=59999), permissive Unknown-series behavior, rejection-message content, and case-insensitive PMC-letter matching. Widening a range without updating docs/v2/focas-version-matrix.md fails a test because every InlineData cites the row it reflects. Full FOCAS test suite stays at 165/165 passing (119 existing + 46 new). 
Also authors docs/v2/focas-version-matrix.md as the authoritative range reference with per-function citations, CNC-series era context, error-surface shape, and the link back to the matrix code; docs/v2/implementation/focas-isolation-plan.md as the multi-PR plan for #220 Tier-C isolation (Shared contracts -> Host skeleton -> move Fwlib32 calls -> Supervisor+respawn -> MMF+ops glue, 2200-3200 LOC across 5 PRs mirroring the Galaxy Tier-C topology); and promotes docs/drivers/FOCAS-Test-Fixture.md from "version-matrix coverage = no" to explicit coverage via the new test file + cross-links to the matrix and isolation-plan docs. Leaves task #220 open since isolation itself (the expensive half) is still ahead.
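The init-time validation shape (per-series documented ceilings, permissive Unknown, rejection message naming both the series and the limit) looks roughly like the sketch below. Python stand-in with ceilings transcribed from this commit text only; the real FocasCapabilityMatrix also covers parameters and per-letter PMC byte ranges.

```python
# Ceilings transcribed from the commit text (macro variables and PMC numbers
# only). The real matrix is wider; treat these values as illustrative.
MACRO_CEILING = {"16i": 999, "0i-F": 9999}
PMC_NUMBER_CEILING = {"16i": 999, "0i-D": 1999, "0i-F": 9999, "30i": 59999}

def validate_macro(series, number):
    """Fail at init time with a message naming both the series and the
    documented limit; an unknown/undeclared series skips validation so
    pre-matrix configs keep working."""
    if series is None or series not in MACRO_CEILING:
        return
    ceiling = MACRO_CEILING[series]
    if not 0 <= number <= ceiling:
        raise ValueError(
            f"Macro #{number} is outside the documented range [0, {ceiling}] for {series}")
```

Catching the mismatch once at InitializeAsync, with both the declared model and the limit in the message, is what lets an operator decide whether the config or the declared CNC series is wrong without a per-read steady-state failure.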
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:44:37 -04:00
64bbc12e8e Merge pull request 'AB Legacy ab_server PCCC Docker fixture scaffold (#224)' (#167) from ablegacy-docker-fixture into v2 2026-04-20 13:28:16 -04:00
Joseph Doherty
a0cf7c5860 AB Legacy ab_server PCCC Docker fixture scaffold (#224) — Docker infrastructure + test-class shape in place; wire-level round-trip currently blocked by an ab_server-side PCCC coverage gap documented honestly in the fixture + coverage docs. Closes the Docker-infrastructure piece of #224; the remaining work is upstream (patch ab_server's PCCC server opcodes) or sideways (RSEmulate 500 golden-box tier, lab rig).
New project tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.IntegrationTests/ with four pieces. AbLegacyServerFixture — TCP probe against localhost:44818 (or AB_LEGACY_ENDPOINT override), distinct from AB_SERVER_ENDPOINT so both CIP + PCCC containers can run simultaneously. Single-public-ctor to satisfy xunit collection-fixture constraint. AbLegacyServerProfile + KnownProfiles carry the per-family (SLC500 / MicroLogix / PLC-5) ComposeProfile + Notes; drives per-theory parameterisation. AbLegacyFactAttribute / AbLegacyTheoryAttribute match the AB CIP skip-attribute pattern.
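The fixture's TCP-probe-with-env-override shape (shared by the AB CIP, S7, Modbus, and OPC PLC fixtures as well) can be sketched like this. Python stand-in; function and parameter names are illustrative, only the env var and default endpoint come from the commit text.

```python
import os
import socket

def fixture_available(env="AB_LEGACY_ENDPOINT", default="localhost:44818", timeout=0.5):
    """Cheap liveness probe: a successful TCP connect means the container
    (or real hardware behind the env-var override) is listening, so the
    test should run; anything else means skip cleanly."""
    host, _, port = os.environ.get(env, default).rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False
```

A distinct env var per fixture family is what lets the CIP and PCCC containers (both defaulting to 44818) run side by side on different mapped ports.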

Docker/docker-compose.yml reuses the AB CIP otopcua-ab-server:libplctag-release image — `build:` block points at ../../AbCip.IntegrationTests/Docker context so `docker compose build` from here produces / reuses the same multi-stage build. Three compose profiles (slc500 / micrologix / plc5) with per-family `--plc` + `--tag=<file>[<size>]` flags matching the PCCC tag syntax (different from CIP's `Name:Type[size]`).

AbLegacyReadSmokeTests — one parametric theory reading N7:0 across all three families + one SLC500 write-then-read on N7:5. Targets the shape the driver would use against real hardware. Verified 2026-04-20 against a live SLC500 container: TCP probe passes + container accepts connections + libplctag negotiates session, but read/write returns BadCommunicationError (libplctag status 0x80050000). Root-caused to ab_server's PCCC server-side opcode coverage being narrower than libplctag's PCCC client expects — not a driver-side bug, not a scaffold bug, just an ab_server upstream limitation. Documented honestly in Docker/README.md + AbLegacy-Test-Fixture.md rather than skipping the tests or weakening assertions; tests now skip cleanly when container is absent, fail with clear message when container is up but the protocol gap surfaces. Operator resolves by filing an ab_server upstream patch, pointing AB_LEGACY_ENDPOINT at real hardware, or scaffolding an RSEmulate 500 golden-box tier.

Docker/README.md — Known limitations section leads with the PCCC round-trip gap (test date, failure signature, possible root causes, three resolution paths) before the pre-existing limitations (T/C file decomposition, ST file quirks, indirect addressing, DF1 serial). Reader can't miss the "scaffolded but blocked on upstream" framing.

docs/drivers/AbLegacy-Test-Fixture.md — TL;DR flipped from "no integration fixture" to "Docker scaffold in place; wire-level round-trip currently blocked by ab_server PCCC gap". What-the-fixture-is gains an Integration section. Follow-up candidates rewritten: #1 is now "fix ab_server PCCC upstream", #2 is RSEmulate 500 golden-box (with cost callouts matching our existing Logix Emulate + TwinCAT XAR scaffolds — license + Hyper-V conflict + binary project format), #3 is lab rig. Key-files list adds the four new files. docs/drivers/README.md coverage-map row updated from "no integration fixture" to "Docker scaffold via ab_server PCCC; wire-level round-trip currently blocked, docs call out resolution paths".

Solution file picks up the new tests/.../AbLegacy.IntegrationTests entry. AbLegacyDataType.Int used throughout (not Int16 — the enum uses SLC file-type naming). Build 0 errors; 2 smoke tests skip cleanly without container + fail with clear errors when container up (proving the infrastructure works end-to-end + the gap is specifically the ab_server protocol coverage, not the scaffold).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:26:19 -04:00
2fe1a326dc Merge pull request 'TwinCAT XAR integration fixture scaffold (#221)' (#166) from twincat-xar-fixture-scaffold into v2 2026-04-20 13:11:53 -04:00
Joseph Doherty
7b49ea13c7 TwinCAT XAR integration fixture — scaffold the code + docs so the Hyper-V VM + .tsproj drop in without fixture-code changes. Mirrors the AB CIP Logix Emulate scaffold shipped in PR #165: tier-gated smoke tests that skip cleanly when the VM isn't reachable, a project README documenting exactly what the XAR needs to run, fixture-coverage doc promoting TwinCAT from "no integration fixture" to "scaffolded + needs operational setup". The actual Beckhoff-side work (provision VM, install XAR, author tsproj, rotate 7-day trial) lives in #221 + the new TwinCatProject/README.md walkthrough.
New project tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/ with four pieces. TwinCATXarFixture — TCP probe against the ADS-over-TCP port 48898 on the host from TWINCAT_TARGET_HOST env var, requires TWINCAT_TARGET_NETID for the target AmsNetId, optional TWINCAT_TARGET_PORT for runtime 2+ (default 851 = PLC runtime 1). Doesn't own a lifecycle — XAR can't run in Docker because it bypasses the Windows kernel scheduler to hit real-time cycles, so the VM stays operator-managed. Explicit skip reasons surface the setup steps (start VM, set env vars, reactivate trial license) instead of a confusing hang. TwinCATFactAttribute + TwinCATTheoryAttribute — xunit skip gate matching AbServerFactAttribute / OpcPlcCollection patterns.

TwinCAT3SmokeTests — three smoke tests through the real AdsTwinCATClient + real ADS over TCP. Driver_reads_seeded_DINT_through_real_ADS reads GVL_Fixture.nCounter, asserts >= 1234 (MAIN increments every cycle so an exact match would race). Driver_write_then_read_round_trip_on_scratch_REAL writes 42.5 to GVL_Fixture.rSetpoint + reads back, catches the ADS write path regression that unit tests can't see. Driver_subscribe_receives_native_ADS_notifications_on_counter_changes validates the #189 native-notification path end-to-end — AddDeviceNotification fires OnDataChange at the PLC cycle boundary, the test observes one firing within 3 s. All three gated on TWINCAT_TARGET_HOST + NETID; skip via TwinCATFactAttribute when unset, verified in this commit with 3 clean [SKIP] results.
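The notification smoke test's "observe one firing within 3 s" pattern reduces to the sketch below. Python stand-in for the C# event wiring; names are illustrative, and the point is the bounded wait so a dead notification path fails fast instead of hanging the suite.

```python
import threading

def wait_for_one_change(subscribe, timeout=3.0):
    """Subscribe, then block until the first data-change callback or the
    deadline. Returns True when a notification arrived in time, False on
    timeout; the test asserts True."""
    fired = threading.Event()
    subscribe(lambda value: fired.set())  # callback fires at the PLC cycle boundary
    return fired.wait(timeout)
```

Asserting "at least one firing within a deadline" rather than an exact value sidesteps the same race the DINT read test avoids with its >= 1234 assertion: the PLC increments every cycle, so exact matches would be flaky.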

TwinCatProject/README.md — the tsproj state the smoke tests depend on. GVL_Fixture with nCounter:DINT:=1234 + rSetpoint:REAL:=0.0 + bFlag:BOOL:=TRUE; MAIN program with the single-line ladder `GVL_Fixture.nCounter := GVL_Fixture.nCounter + 1;`; PlcTask cyclic @ 10 ms priority 20; PLC runtime 1 (AMS port 851). Explains why tsproj over the compiled bootproject (text-diffable, rebuildable, no per-install state). Full XAR VM setup walkthrough — Hyper-V Gen 2 VM, TC3 XAE+XAR install, noting the AmsNetId from the tray icon, bilateral route configuration (VM System Manager → Routes + dev box StaticRoutes.xml), project import, Activate Configuration + Run Mode. License-rotation section walks through two options — scheduled TcActivate.exe /reactivate via Task Scheduler (not officially Beckhoff-supported, reportedly works on current builds) or paid runtime license (~$1k one-time per runtime per CPU). Final section shows the exact env-var recipe + dotnet test command on the dev box.

docs/drivers/TwinCAT-Test-Fixture.md — flipped TL;DR from "there is no integration fixture" to "scaffolding lives at tests/..., remaining operational work is VM + tsproj + license rotation". "What the fixture is" gains an Integration section describing the XAR VM target. "What it actually covers" gains an Integration subsection listing the three named smoke tests. Follow-up candidates rewritten — the #1 item used to be "TwinCAT 3 runtime on CI" as a speculative option; now it's concrete "XAR VM live-population" with a link to #221 + the project README for the operational walkthrough. License rotation becomes #2 with both automation paths. Key fixture / config files list adds the three new files + the project README. docs/drivers/README.md coverage-map row updated from "no integration fixture" to "XAR-VM integration scaffolding".

Solution file picks up the new tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests entry alongside the existing TwinCAT.Tests. xunit CollectionDefinition added to TwinCATXarFixture after the first build revealed the [Collection("TwinCATXar")] reference on TwinCAT3SmokeTests had no matching registration. Build 0 errors; 3 skip-clean test outcomes verified. #221 stays open as in_progress until the VM + tsproj land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:11:17 -04:00
b820b9a05f Merge pull request 'AB CIP Logix Emulate golden-box tier — scaffolding' (#165) from abcip-emulate-tier-scaffold into v2 2026-04-20 12:56:38 -04:00
Joseph Doherty
58a0cccc67 AB CIP Logix Emulate golden-box tier — scaffold the code + docs so the L5X + Emulate PC drop in without fixture-code changes. Closes the initial design question the user raised; the actual Emulate-side work (author project, commit L5X, install Emulate on the dev box) is tracked as #223. Scaffolding ships everything that doesn't need the live Emulate instance: tier-gated test classes that skip cleanly when AB_SERVER_PROFILE is unset, the profile gate helper, the LogixProject/README.md documenting the exact project state the tests expect, the fixture coverage doc's new §Logix Emulate tier section with the when-to-trust table extended from 3 columns to 4, and the dev-environment.md integration-host row.
AbServerProfileGate — static helper that reads `AB_SERVER_PROFILE` env var (defaults to "abserver") + exposes `SkipUnless(params string[] requiredProfiles)` matching the MODBUS_SIM_PROFILE pattern the DL205StringQuirkTests uses one directory over. Emulate-only tests call `AbServerProfileGate.SkipUnless("emulate")` at the top of each fact body; ab_server-default runs see them skip with a clear message pointing at the Emulate setup steps.
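The profile-gate mechanics described above can be sketched as follows. Python stand-in for the static C# helper; only the env var name, the "abserver" default, and the "emulate" profile come from the commit text, everything else is illustrative.

```python
import os

def skip_reason_unless(*required_profiles, env="AB_SERVER_PROFILE", default="abserver"):
    """Return None when the active profile matches (the test runs), otherwise
    a skip reason that points the operator at the env-var setup instead of
    leaving a silent failure or a confusing hang."""
    active = os.environ.get(env, default)
    if active in required_profiles:
        return None
    return (f"Profile '{active}' is active; set {env}={'|'.join(required_profiles)} "
            f"and bring up the matching fixture to run this test.")
```

Returning the reason string (rather than a bare bool) is what makes the default ab_server run show a message naming the exact setup steps when the Emulate-only tests skip.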

AbCipEmulateUdtReadTests — one test proving the #194 whole-UDT read optimization works against the real Logix Template Object, not just the golden byte buffers the unit suite uses. Builds an `AbCipDriverOptions` with a Structure tag `Motor1 : Motor_UDT` that has three declared members (Speed : DINT, Torque : REAL, Status : DINT), reads them via the `.Speed / .Torque / .Status` dotted-tag syntax, asserts the driver gets the grouped whole-UDT path + decodes each at the right offset. Required seed values documented inline + in LogixProject/README.md: Speed=1800, Torque=42.5f, Status=0x0001.

AbCipEmulateAlmdTests — one test proving the #177 ALMD projection fires `OnAlarmEvent` when a real ALMD instruction's `In` edge rises, not just the fake `InFaulted` timer edges the unit suite drives. Needs a `SimulateAlarm : BOOL` tag routed through `MainRoutine` ladder (`XIC SimulateAlarm OTE HighTempAlarm.In`) so the test case can pulse the input via the existing `IWritable.WriteAsync` path instead of scripting Emulate via its own socket. Alarm-projection options carry `EnableAlarmProjection = true` + 200 ms poll interval; a `TaskCompletionSource` gates the raise-event assertion with a 5 s deadline. Cleanup writes SimulateAlarm=false so consecutive runs start from known state.

LogixProject/README.md — the Studio 5000 project state the Emulate-tier tests depend on. Explains why L5X over ACD (text diff, reproducible import, no per-install state), the UDT + tag + routine structure, how to bring it up on the Emulate PC. Ships as a stub pending actual author + L5X export + commit; the README itself keeps the requirements visible so the L5X author has a checklist.

docs/drivers/AbServer-Test-Fixture.md — new §Logix Emulate golden-box tier section with the coverage-promotion table (ab_server / Emulate / hardware per gap), the setup-env-var recipe, the costs to accept (license, Hyper-V conflict, manual lifecycle). "When to trust" table extended from 3 columns (ab_server / unit / rig) to 4 (ab_server / unit / Logix Emulate / rig); two new rows for EtherNet/IP embedded-switch + redundant-chassis failover that even Emulate can't help with. Follow-up candidates list gets Logix Emulate as option 1 ahead of the pre-existing "extend ab_server upstream" + "stand up a lab rig". See-also file list gains AbServerProfileGate.cs + Docker/ + Emulate/ + LogixProject/README.md entries.

docs/v2/dev-environment.md — §C Integration host gains a Rockwell Studio 5000 Logix Emulate row: purpose (AB CIP golden-box tier closing UDT/ALMD/AOI/safety/ConnectionSize gaps), type (Windows-only, Hyper-V conflict matching TwinCAT XAR's constraint), port 44818, credentials note, owner split between integration-host admin for license+install and developer for per-session runtime start.

Verified: Emulate tests skip cleanly when AB_SERVER_PROFILE is unset — both `[SKIP]` with the operator-facing message pointing at the env-var setup. Whole-solution build 0 errors. Tests will transition from skip → pass once the L5X + Emulate PC land per #223.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:54:39 -04:00
231148d7f0 Merge pull request 'Doc + code-comment sweep — finish the native-fallback removal' (#164) from docs-native-fallback-cleanup into v2 2026-04-20 12:38:11 -04:00
Joseph Doherty
fdb268cee0 Docs + code-comment sweep — remove stale Pymodbus/ + PythonSnap7/ + LocateBinary references left behind by the native-fallback removal PR. Answer to "is the dev inventory + documentation updated": it was partial; this PR finishes the job.
Files touched — docs/drivers/Modbus-Test-Fixture.md dropped the key-files pointer at deleted Pymodbus/ + flipped "primary launcher is Docker, native fallback retained" framing to "Docker is the only supported launch path" (matching the code). docs/v2/dev-environment.md dropped the "skips both Docker + native-binary paths" parenthetical from AB_SERVER_ENDPOINT + flipped the "Native fallbacks" subsection to a one-liner that says Docker is the only supported path. docs/v2/modbus-test-plan.md rewrote §Harness from "pip install pymodbus + serve.ps1" setup pattern to "docker compose --profile <…> up" + updated the §PR 43 status bullet to point at Docker/profiles/. docs/v2/test-data-sources.md §"CI fixture (task #180)" rewrote the AB CIP section from "LocateBinary() picks binary off PATH" + GitHub Actions zip-download step to "Docker is the only supported reproducible build path" + docker compose GitHub Actions step; dropped the pinned-version SHA256 table + lock-file reference because the Dockerfile's LIBPLCTAG_TAG build-arg is the new pin.

Code docstrings + error messages — these are developer-facing operational text too. ModbusSimulatorFixture SkipReason strings (both branches) now point at `docker compose -f Docker/docker-compose.yml --profile standard up -d` instead of the deleted `Pymodbus\serve.ps1`; doc-comment at the top references Docker/docker-compose.yml. Snap7ServerFixture SkipReason strings + doc-comment point at Docker/docker-compose.yml instead of PythonSnap7/serve.ps1. S7_1500Profile.cs docstring updated. Modbus Dockerfile comment pointing at deleted tests/.../Pymodbus/README.md redirected to docs/drivers/Modbus-Test-Fixture.md. DL205Profile.cs + DL205StringQuirkTests.cs + S7_1500Profile.cs (in Modbus project) docstrings flipped from Pymodbus/*.json references to Docker/profiles/*.json.

Left untouched deliberately: docs/v2/implementation/exit-gate-phase-2-closed.md — that's a historical as-of-2026-04-18 snapshot documenting what was skipped at Phase 2 closure; rewriting would lose the date-stamped context. Its "oitc/modbus-server Docker container not started" + "ab_server binary not on PATH" lines describe the fixture landscape that existed at close time, not current operational guidance.

Final sweep confirms zero remaining `Pymodbus/` / `PythonSnap7/` / `LocateBinary` / `AbServerSeedTag` / `BuildCliArgs` / `AbServerPlcArg` mentions anywhere in tracked files outside that historical exit-gate doc. Whole-solution build still 0 errors.
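A sweep like the one described (scan the tree for the removed identifiers, skipping the dated exit-gate snapshot on purpose) can be re-created roughly as below. Python sketch of the idea only; the actual sweep was presumably a grep over tracked files, and all names here are illustrative.

```python
import os
import re

# Identifiers removed by the native-fallback PR; any surviving mention is stale.
STALE = re.compile(
    r"Pymodbus/|PythonSnap7/|LocateBinary|AbServerSeedTag|BuildCliArgs|AbServerPlcArg")
# The exit-gate doc is a date-stamped historical snapshot, excluded deliberately.
EXCLUDE = {"exit-gate-phase-2-closed.md"}

def sweep(root):
    """Return the paths of files that still mention a removed identifier."""
    hits = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name in EXCLUDE:
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                if STALE.search(f.read()):
                    hits.append(path)
    return hits
```

An empty result list is the "sweep clean" outcome the commit reports; a CI hook built on the same check would keep the stale references from creeping back.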

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:36:19 -04:00
4473197cf5 Merge pull request 'Remove native-launcher fallbacks; Docker is the only path for Modbus / S7 / AB CIP / OpcUaClient' (#163) from remove-native-fallbacks into v2 2026-04-20 12:29:46 -04:00
Joseph Doherty
0e1dcc119e Remove native-launcher fallbacks for the four Dockerized fixtures — Docker is the only supported path for Modbus / S7 / AB CIP / OpcUaClient integration. Native paths stay in place only where Docker isn't compatible (Galaxy: MXAccess COM + Windows-only; TwinCAT: Beckhoff runtime vs Hyper-V; FOCAS: closed-source Fanuc Fwlib32.dll; AB Legacy: PCCC has no OSS simulator). Simplifies the fixture landscape + removes the "which path do I run" ambiguity; removes two full native-launcher directories + the AB CIP native-spawn path; removes the parallel profile-as-CLI-arg-builder code from AbServerFixture.
Modbus — deletes tests/.../Modbus.IntegrationTests/Pymodbus/ (serve.ps1, standard.json, dl205.json, mitsubishi.json, s7_1500.json, README.md). Profile JSONs live only under Docker/profiles/ now. Docker/README.md loses its "Native-Python fallback" section; docs/drivers/Modbus-Test-Fixture.md "What the fixture is" bullet flipped from "primary launcher is Docker, native fallback under Pymodbus/" to "Docker is the only supported launch path".

S7 — deletes tests/.../S7.IntegrationTests/PythonSnap7/ (server.py, s7_1500.json, serve.ps1, README.md). Docker/README.md loses "Native-Python fallback"; docs/drivers/S7-Test-Fixture.md updated to match.

AB CIP — the biggest simplification because the native-binary spawn had the most code. AbServerFixture.cs rewrites: drops Process management (no more Process _proc + Kill/WaitForExit), drops LocateBinary() PATH lookup, drops the IAsyncLifetime initialize-spawns-server behavior. Fixture is now a thin TCP probe against localhost:44818 (or AB_SERVER_ENDPOINT override) — same shape as Snap7ServerFixture / ModbusSimulatorFixture / OpcPlcFixture. IsServerAvailable() simplifies to a single 500 ms probe. AbServerProfile.cs drops AbServerPlcArg + SeedTags + BuildCliArgs + ToCliSpec + the entire AbServerSeedTag record — the compose file is the canonical source of truth for which tags + which --plc mode each family gets; the profile record now carries just Family + ComposeProfile (matches the docker-compose service key) + Notes. KnownProfiles.ForFamily + .All stay for tests that iterate families. AbServerProfileTests.cs rewrites to match: drops BuildCliArgs_* + ToCliSpec_* + SeedTags_* tests; keeps the family-coverage contract tests + verifies the ComposeProfile strings match compose-file service names (a typo in either surfaces as a unit-test failure, not a silent "wrong family booted" at runtime). Docker/README.md loses "Native-binary fallback" section; docs/drivers/AbServer-Test-Fixture.md "What the fixture is" flipped to Docker-only with clearer skip rules.

dev-environment.md §Docker fixtures — the "Native fallbacks" subsection goes away; replaced with a one-line note that Docker is the only supported path for these four fixtures + a fresh clone needs Docker Desktop and nothing else.

Verified: whole-solution build 0 errors, AB CIP profile unit tests 6/6, AB CIP Docker smoke 4/4 (all family theory rows), S7 Docker smoke 3/3. Container lifecycle clean. The deleted native code surface was already redundant — every fixture the native paths served is now covered by Docker; keeping them invited drift between the two paths (the original AB CIP native profile had three undetected bugs per the #162 commit message: case-sensitive --plc, bracket tag notation, --path=1,0 requirement — noise the Docker path now avoids by never running the buggy code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:27:44 -04:00
27d135bd59 Merge pull request 'Dockerize Modbus + AB CIP + S7 test fixtures for reproducibility' (#162) from fixtures-all-docker into v2 2026-04-20 12:11:45 -04:00
Joseph Doherty
6609141493 Dockerize Modbus + AB CIP + S7 test fixtures for reproducibility. Every driver integration simulator now has a pinned Docker image alongside the existing native launcher — Docker is the primary path, native fallbacks kept for contributors who prefer them. Matches the already-Dockerized OpcUaClient/opc-plc pattern from #215 so every fixture in the fleet presents the same compose-up/test/compose-down loop. Reproducibility gain: what used to require a local pip/Python install (Modbus pymodbus, S7 python-snap7) or a per-OS C build from source (AB CIP ab_server from libplctag) now collapses to a Dockerfile + docker compose up. Modbus — new tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/Docker/ with Dockerfile (python:3.12-slim-bookworm + pymodbus[simulator]==3.13.0) + docker-compose.yml with four compose profiles (standard / dl205 / mitsubishi / s7_1500) backed by the existing profile JSONs copied under Docker/profiles/ as canonical; native fallback in Pymodbus/ retained with the same JSON set (symlink-equivalent — manual re-sync when profiles change, noted in both READMEs). Port 5020 unchanged so MODBUS_SIM_ENDPOINT + ModbusSimulatorFixture work without code change. Dropped the --no_http CLI arg the old serve.ps1 + compose draft passed — pymodbus 3.13 doesn't recognize it; the simulator's HTTP UI simply binds inside the container, where nothing maps it out, so it costs nothing. S7 — new tests/ZB.MOM.WW.OtOpcUa.Driver.S7.IntegrationTests/Docker/ with Dockerfile (python:3.12-slim-bookworm + python-snap7>=2.0) + docker-compose.yml with one s7_1500 compose profile; copies the existing server.py shim + s7_1500.json seed profile; runs python -u server.py ... --port 1102. Native fallback in PythonSnap7/ retained. Port 1102 unchanged. AB CIP — hardest because ab_server is a source-only C tool in libplctag's src/tools/ab_server/.
New tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/Docker/ Dockerfile is multi-stage: build stage (debian:bookworm-slim + build-essential + cmake) clones libplctag at a pinned tag + cmake --build build --target ab_server; runtime stage (debian:bookworm-slim) copies just the binary from /src/build/bin_dist/ab_server. docker-compose.yml ships four compose profiles (controllogix / compactlogix / micro800 / guardlogix) with per-family ab_server CLI args matching AbServerProfile.cs. AbServerFixture updated: tries TCP probe on 127.0.0.1:44818 first (Docker path) + spawns the native binary only as fallback when no listener is there. AB_SERVER_ENDPOINT env var supported for pointing at a real PLC. AbServerFact/Theory attributes updated to IsServerAvailable() which accepts any of: live listener on 44818, AB_SERVER_ENDPOINT set, or binary on PATH. Required three CLI-compat fixes to ab_server's argument expectations that the existing native profile never caught because it was never actually run in CI: --plc is case-sensitive (ControlLogix not controllogix), CIP tags need [size] bracket notation (DINT[1] not bare DINT), ControlLogix also requires --path=1,0. Compose files carry the corrected flags; the existing native-path AbServerProfile.cs was never invoked in practice so we don't rewrite it here. Micro800 now uses the --plc=Micro800 mode rather than falling back to ControlLogix emulation — ab_server does have the dedicated mode, the old Notes saying otherwise were wrong.
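The multi-stage build described above has roughly this shape — a sketch only: `<pinned-tag>` is a placeholder for the tag the repo actually pins, and the package list and paths are illustrative:

```dockerfile
# Build stage: toolchain + source checkout at a pinned tag (placeholder).
FROM debian:bookworm-slim AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential cmake git ca-certificates
RUN git clone --depth 1 --branch <pinned-tag> \
        https://github.com/libplctag/libplctag.git /src
WORKDIR /src
RUN cmake -S . -B build && cmake --build build --target ab_server

# Runtime stage: copy just the simulator binary, nothing else.
FROM debian:bookworm-slim
COPY --from=build /src/build/bin_dist/ab_server /usr/local/bin/ab_server
EXPOSE 44818
ENTRYPOINT ["/usr/local/bin/ab_server"]
```

Per-family `--plc`/tag/`--path` flags would then come from the compose file's `command:` for each profile, keeping the image itself family-agnostic.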
Updated docs: three fixture coverage docs (Modbus-Test-Fixture.md, S7-Test-Fixture.md, AbServer-Test-Fixture.md) flip their "What the fixture is" section from native-only to Docker-primary-with-native-fallback; dev-environment.md §Resource Inventory replaces the old ambiguous "Docker Desktop + ab_server native" mix with four per-driver rows (each listing the image, compose file, compose profiles, port, credentials) + a new Docker fixtures — quick reference subsection giving the one-line docker compose -f <…> --profile <…> up for each driver + the env-var override names + the native fallback install recipes. drivers/README.md coverage map table updated — Modbus/AB CIP/S7 entries now read "Dockerized …" consistent with OpcUaClient's line. Verified end-to-end against live containers: Modbus DL205 smoke 1/1, S7 3/3, AB CIP ControlLogix 4/4 (all family theory rows). Container lifecycle clean (up/test/down, no leaked state). Every fixture keeps its skip-when-absent probe + env-var endpoint override so dotnet test on a fresh clone without Docker running still gets a green run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:09:44 -04:00
3b3e814855 Merge pull request 'OpcUaClient integration fixture — opc-plc in Docker (#215)' (#161) from opcuaclient-opc-plc-fixture into v2 2026-04-20 11:45:24 -04:00
Joseph Doherty
c985c50a96 OpcUaClient integration fixture — opc-plc in Docker closes the wire-level gap (#215). Closes task #215. The OpcUaClient driver had the richest capability matrix in the fleet (reads/writes/subscribe/alarms/history across 11 unit-test classes) + zero wire-level coverage; every test mocked the Session surface. opc-plc is Microsoft Industrial IoT's OPC UA PLC simulator — already containerized, already on MCR, pinned to 2.14.10 here. Wins vs the loopback-against-our-own-server option we'd originally scoped: (a) independent cert chain + user-token handling catches interop bugs loopback can't because both endpoints would share our own cert store; (b) pinned image tag fixes the test surface in a way our evolving server wouldn't; (c) the --alm flag opens the door to real IAlarmSource coverage later without building a custom FakeAlarmDriver. Loss vs loopback: both use the OPCFoundation.NetStandard stack internally so bugs common to that stack don't surface — addressed by a follow-up to add open62541/open62541 as a second independent-stack image (tracked). Docker is the fixture launcher — no PowerShell/Python wrapper like Modbus/pymodbus or S7/python-snap7 because opc-plc ships containerized. Docker/docker-compose.yml pins 2.14.10 + maps port 50000 + command flags --pn=50000 --ut --aa --alm; the healthcheck TCP-probes 50000 so docker ps surfaces ready state. Fixture OpcPlcFixture follows the same shape as Snap7ServerFixture + ModbusSimulatorFixture: collection-scoped, parses OPCUA_SIM_ENDPOINT (default opc.tcp://localhost:50000) into host + port, 2-second TCP probe at init, SkipReason records the failure for Assert.Skip. Forced IPv4 on the probe socket for the same reason those two fixtures do — .NET's dual-stack "localhost" resolves IPv6 ::1 first + hangs the full connect timeout when the target binds 0.0.0.0 (IPv4). 
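The IPv4-forced probe shared by these fixtures looks roughly like this sketch — `SimulatorProbe`/`TryProbe` are illustrative names, not the repo's actual API:

```csharp
using System;
using System.Net;
using System.Net.Sockets;

static class SimulatorProbe
{
    // Probe an IPv4 address explicitly: dual-stack "localhost" resolves to
    // ::1 first, and when the target binds 0.0.0.0 (IPv4 only) the IPv6
    // attempt hangs for the full connect timeout before falling back.
    public static bool TryProbe(string host, int port, TimeSpan timeout)
    {
        var address = host == "localhost"
            ? IPAddress.Loopback   // force 127.0.0.1
            : Dns.GetHostAddresses(host, AddressFamily.InterNetwork)[0];

        using var socket = new Socket(AddressFamily.InterNetwork,
                                      SocketType.Stream, ProtocolType.Tcp);
        try
        {
            var connect = socket.ConnectAsync(address, port);
            return connect.Wait(timeout) && socket.Connected;
        }
        catch
        {
            return false;   // refused / unresolvable => simulator absent, tests skip
        }
    }
}
```

A `false` here feeds the fixture's `SkipReason`, so `dotnet test` on a box without Docker running still gets a green (skipped, not failed) run.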
OpcPlcProfile holds well-known node identifiers opc-plc exposes (ns=3;s=StepUp, FastUInt1, RandomSignedInt32, AlternatingBoolean) + builds OpcUaClientDriverOptions with SecurityPolicy.None + AutoAcceptCertificates=true since opc-plc regenerates its server cert on every container spin-up + there's no meaningful chain to validate against in CI. Three smoke tests covering what the unit suite couldn't reach: (1) Client_connects_and_reads_StepUp_node_through_real_OPC_UA_stack — full Secure Channel + Session + Read on ns=3;s=StepUp (counter that ticks every 1 s); (2) Client_reads_batch_of_varied_types_from_live_simulator — batch Read of UInt32 / Int32 / Boolean to prove typed Variant decoding, with an explicit ShouldBeOfType<bool> assertion on AlternatingBoolean to catch the common "variant gets stringified" regression; (3) Client_subscribe_receives_StepUp_data_changes_from_live_server — real MonitoredItem subscription on FastUInt1 (100 ms cadence) with a SemaphoreSlim gate + 3 s deadline on the first OnDataChange fire, tolerating container warm-up. Driver ran end-to-end against a live 2.14.10 container: all 3 pass; unit suite 78/78 unchanged. Container lifecycle verified (compose up → tests → compose down) clean, no leaked state. Docker/README.md documents install (Docker Desktop already on the dev box per Phase 1 decision #134), run (compose up / compose up -d / compose down), endpoint override (OPCUA_SIM_ENDPOINT), what opc-plc advertises with the current command flags, what's tunable via compose-file tweaks (--daa for username auth tests; --fn/--fr/--ft for subscription-stress nodes), known limitation that opc-plc shares the OPCFoundation stack with our driver. OpcUaClient-Test-Fixture.md updated — TL;DR flipped from "there is no integration fixture" to the new reality; "What it actually covers" gains an Integration section listing the three smoke tests. 
Follow-ups flagged in the doc: add open62541/open62541 as a second image for fully-independent-stack interop coverage; once #219 (server-side IAlarmSource/IHistoryProvider integration tests) lands, re-run the client-side suite against opc-plc's --alm nodes to close the alarm gap from the client side too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 11:43:20 -04:00
820567bc2a Merge pull request 'S7 integration fixture via python-snap7 (#216) + per-driver test-fixture coverage docs' (#160) from s7-integration-fixture into v2 2026-04-20 11:31:14 -04:00
Joseph Doherty
1d3544f18e S7 integration fixture — python-snap7 server closes the wire-level coverage gap (#216) + per-driver fixture coverage docs for every driver in the fleet. Closes #216. Two shipments in one PR because the docs landed as I surveyed each driver's fixture + the S7 work is the first wire-level-gap closer pulled from that survey.
S7 integration — AbCip/Modbus already have real-simulator integration suites; S7 had zero wire-level coverage despite being a Tier-A driver (all unit tests mocked IS7Client). Picked python-snap7's `snap7.server.Server` over raw Snap7 C library because `pip install` beats per-OS binary-pin maintenance, the package ships a Python __main__ shim that mirrors our existing pymodbus serve.ps1 + *.json pattern structurally, and the python-snap7 project is actively maintained. New project `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.IntegrationTests/` with four moving parts: (a) `Snap7ServerFixture` — collection-scoped TCP probe on `localhost:1102` that sets `SkipReason` when the simulator's not running, matching the `ModbusSimulatorFixture` shape one directory over (same S7_SIM_ENDPOINT env var override convention for pointing at a real S7 CPU on port 102); (b) `PythonSnap7/` — `serve.ps1` wrapper + `server.py` shim + `s7_1500.json` seed profile + `README.md` documenting install / run / known limitations; (c) `S7_1500/S7_1500Profile.cs` — driver-side `S7DriverOptions` whose tag addresses map 1:1 to the JSON profile's seed offsets (DB1.DBW0 u16, DB1.DBW10 i16, DB1.DBD20 i32, DB1.DBD30 f32, DB1.DBX50.3 bool, DB1.DBW100 scratch); (d) `S7_1500SmokeTests` — three tests proving typed reads + write-then-read round-trip work through real S7netplus + real ISO-on-TCP + real snap7 server. Picked port 1102 default instead of S7-standard 102 because 102 is privileged on Linux + triggers Windows Firewall prompt; S7netplus 0.20 has a 5-arg `Plc(CpuType, host, port, rack, slot)` ctor that lets the driver honour `S7DriverOptions.Port`, but the existing driver code called the 4-arg overload + silently hardcoded 102. One-line driver fix (S7Driver.cs:87) threads `_options.Port` through — the S7 unit suite (58/58) still passes unchanged because every unit test uses a fake IS7Client that never sees the real ctor. 
Server seed-type matrix in `server.py` covers u8 / i8 / u16 / i16 / u32 / i32 / f32 / bool-with-bit / ascii (S7 STRING with max_len header). register_area takes the SrvArea enum value, not the string name — a 15-minute debug after the first test run caught that; documented inline.
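The one-line port fix described above can be sketched as follows — the 5-arg `Plc` ctor is per the commit's description of S7netplus 0.20, while the options record and factory wrapper here are illustrative stand-ins for the driver's real code:

```csharp
using S7.Net;

// Illustrative subset of S7DriverOptions; property names follow the commit text.
public sealed class S7DriverOptions
{
    public string Host { get; init; } = "localhost";
    public int Port { get; init; } = 102;   // fixture overrides to 1102
    public short Rack { get; init; } = 0;
    public short Slot { get; init; } = 1;
}

public sealed class S7ConnectionFactory
{
    private readonly S7DriverOptions _options;
    public S7ConnectionFactory(S7DriverOptions options) => _options = options;

    public Plc Create() =>
        // Before the fix: the 4-arg overload Plc(CpuType, host, rack, slot)
        // silently hardcoded port 102. The 5-arg overload threads Port through,
        // letting the fixture run on unprivileged 1102.
        new Plc(CpuType.S71500, _options.Host, _options.Port,
                _options.Rack, _options.Slot);
}
```

Unit tests never noticed the hardcoded port because every one of them drives a fake `IS7Client` that never reaches the real ctor — exactly the gap the wire-level fixture exists to close.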

Per-driver test-fixture coverage docs — eight new files in `docs/drivers/` laying out what each driver's harness actually benchmarks vs. what's trusted from field deployments. Pattern mirrors the AbServer-Test-Fixture.md doc that shipped earlier in this arc: TL;DR → What the fixture is → What it actually covers → What it does NOT cover → When-to-trust table → Follow-up candidates → Key files. Ugly truth the survey made visible: Galaxy + Modbus + (now) S7 + AB CIP have real wire-level coverage; AB Legacy / TwinCAT / FOCAS / OpcUaClient are still contract-only because their libraries ship no fake + no open-source simulator exists (AB Legacy PCCC), no public simulator exists (FOCAS), the vendor SDK has no in-process fake (TwinCAT/ADS.NET), or the test wiring just hasn't happened yet (OpcUaClient could trivially loopback against this repo's own server — flagged as #215). Each doc names the specific follow-up route: Snap7 server for S7 (done), TwinCAT 3 developer-runtime auto-restart for TwinCAT, Tier-C out-of-process Host for FOCAS, lab rigs for AB Legacy + hardware-gated bits of the others. `docs/drivers/README.md` gains a coverage-map section linking all eight. Tracking tasks #215-#222 filed for each PR-able follow-up.

Build clean (driver + integration project + docs); S7.Tests 58/58 (unchanged); S7.IntegrationTests 3/3 (new, verified end-to-end against a live python-snap7 server: `driver_reads_seeded_u16_through_real_S7comm`, `driver_reads_seeded_typed_batch`, `driver_write_then_read_round_trip_on_scratch_word`). Next fixture follow-up is #215 (OpcUaClient loopback against own server) — highest ROI of the remaining set, zero external deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 11:29:15 -04:00
4fe96fca9b Merge pull request 'AbCip IAlarmSource via ALMD projection (#177, feature-flagged)' (#159) from abcip-alarm-source into v2 2026-04-20 04:26:39 -04:00
Joseph Doherty
4e80db4844 AbCip IAlarmSource via ALMD projection (#177) — feature-flagged OFF by default; when enabled, polls declared ALMD UDT member fields + raises OnAlarmEvent on 0→1 + 1→0 transitions. Closes task #177. The AB CIP driver now implements IAlarmSource so the generic-driver alarm dispatch path (PR 14's sinks + the Server.Security.AuthorizationGate AlarmSubscribe/AlarmAck invoker wrapping) can treat AB-backed alarms uniformly with Galaxy + OpcUaClient + FOCAS. Projection is ALMD-only in this pass: the Logix ALMD (digital alarm) instruction's UDT shape is well-understood (InFaulted + Acked + Severity + In + Cfg_ProgTime at stable member names) so the polled-read + state-diff pattern fits without concessions. ALMA (analog alarm) deferred to a follow-up because its HHLimit/HLimit/LLimit/LLLimit threshold + In value semantics deserve their own design pass — raising on threshold-crossing is not the same shape as raising on InFaulted-edge. AbCipDriverOptions gains two knobs: EnableAlarmProjection (default false) + AlarmPollInterval (default 1s). Explicit opt-in because projection semantics don't exactly mirror Rockwell FT Alarm & Events; shops running FT Live should leave this off + take alarms through the native A&E route. AbCipAlarmProjection is the state machine: per-subscription background loop polls the source-node set via the driver's public ReadAsync — which gains the #194 whole-UDT optimization for free when ALMDs are declared with their standard member set, so one poll tick reads (N alarms × 2 members) = N libplctag round-trips rather than 2N. Per-tick state diff: compare InFaulted + Severity against last-seen, fire raise (0→1) / clear (1→0) with AlarmSeverity bucketed via the 1-1000 Logix severity scale (≤250 Low, ≤500 Medium, ≤750 High, rest Critical — matches OpcUaClient's MapSeverity shape). 
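The severity bucketing above is simple enough to pin down in a sketch — the `AlarmSeverity` member names are assumed from the commit text:

```csharp
public enum AlarmSeverity { Low, Medium, High, Critical }

public static class LogixSeverityMap
{
    // Logix ALMD severity runs 1-1000; bucket it onto the driver's
    // AlarmSeverity scale using the boundaries the commit states
    // (matching OpcUaClient's MapSeverity shape).
    public static AlarmSeverity Map(int logixSeverity) => logixSeverity switch
    {
        <= 250 => AlarmSeverity.Low,
        <= 500 => AlarmSeverity.Medium,
        <= 750 => AlarmSeverity.High,
        _      => AlarmSeverity.Critical,
    };
}
```

Keeping the boundaries identical across drivers means a severity-filtered alarm subscription behaves the same whether the alarm came from Logix, Galaxy, or an upstream OPC UA server.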
ConditionId is {sourceNode}#active — matches a single active-branch per alarm which is all ALMD supports; when Cfg_ProgTime-based branch identity becomes interesting (re-raise after ack with new timestamp), a richer ConditionId pass can land. Subscribe-while-disabled returns a handle wrapping id=0 — capability negotiation (the server queries IAlarmSource presence at driver-load time) still succeeds, the alarm surface just never fires. Unsubscribe cancels the sub's CTS + awaits its loop; ShutdownAsync cancels every sub on its way out so a driver reload doesn't leak poll tasks. AcknowledgeAsync routes through the driver's existing WriteAsync path — per-ack writes {SourceNodeId}.Acked = true (the simpler semantic; operators whose ladder watches AckCmd + rising-edge can wire a client-side pulse until a driver-level edge-mode knob lands). Best-effort — per-ack faults are swallowed so one bad ack doesn't poison the whole batch. Six new AbCipAlarmProjectionTests: detector flags ALMD signature + skips non-signature UDTs + atomics; severity mapping matches OPC UA A&C bucket boundaries; feature-flag OFF returns a handle but never touches the fake runtime (proving no background polling happens); feature-flag ON fires a raise event on 0→1; clear event fires on 1→0 after a prior raise; unsubscribe stops the poll loop (ReadCount doesn't grow past cancel + at most one straggler read). Driver builds 0 errors; AbCip.Tests 233/233 (was 227, +6 new). Task #177 closed — the last pending AB CIP follow-up is now #194 (already shipped). Remaining pending fleet-wide: #150 (Galaxy MXAccess failover hardware) + #199 (UnsTab Playwright smoke).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:24:40 -04:00
f6d5763448 Merge pull request 'AbCip whole-UDT read optimization (#194)' (#158) from abcip-whole-udt-read into v2 2026-04-20 04:19:51 -04:00
Joseph Doherty
780358c790 AbCip whole-UDT read optimization (#194) — declaration-driven member grouping collapses N per-member reads into one parent-UDT read + client-side decode. Closes task #194. On a batch that includes multiple members of the same hand-declared UDT tag, ReadAsync now issues one libplctag read on the parent + decodes each member from the runtime's buffer at its computed byte offset. A 6-member Motor UDT read goes from 6 libplctag round-trips to 1 — the Rockwell-suggested pattern for minimizing CIP request overhead on batch reads of UDT state (decision #11's follow-through on what the template decoder from task #179 was meant to enable). AbCipUdtMemberLayout is a pure-function helper that computes declared-member byte offsets under Logix natural-alignment rules (SInt 1-byte / Int 2-byte / DInt + Real + Dt 4-byte / LInt + ULInt + LReal 8-byte; alignment pad inserted before each member as needed). Opts out for BOOL / String / Structure members — BOOL storage in Logix UDTs packs into a hidden host byte whose position can't be computed from declaration-only info, and String members need length-prefix + STRING[82] fan-out which libplctag already handles via a per-tag DecodeValue path. The CIP Template Object shape from task #179 (when populated via FetchUdtShapeAsync) carries real offsets for those members — layering that richer path on top of the planner is a separate follow-up and does not change this PR's conservative behaviour. AbCipUdtReadPlanner is the scheduling function ReadAsync consults each batch — pure over (requests, tagsByName), emits Groups + Fallbacks. 
A group is formed when (a) the reference resolves to "parent.member"; (b) parent is a Structure tag with declared Members; (c) the layout helper succeeds on those members; (d) the specific member appears in the computed offset map; (e) at least two members of the same parent appear in the batch — single-member groups demote to the fallback path because one whole-UDT read vs one per-member read is equivalent cost but more client-side work. Original batch indices are preserved through the plan so out-of-order batches write decoded values back at the right output slot; the caller's result array order is invariant. IAbCipTagRuntime.DecodeValueAt(AbCipDataType, int offset, int? bitIndex) is the new hot-path method — LibplctagTagRuntime delegates to libplctag's offset-aware Get*(offset) calls (GetInt32, GetFloat32, etc.) that were always there; previously every call passed offset 0. DecodeValue(type, bitIndex) stays as the shorthand + forwards to DecodeValueAt with offset 0, preserving the existing single-tag read path + every test that exercises it. FakeAbCipTag gains a ValuesByOffset dictionary so tests can drive multi-member decoding by setting offset→value before the read fires; unmapped offsets fall back to the existing Value field so the 200+ existing tests that never set ValuesByOffset keep working unchanged. AbCipDriver.ReadAsync refactored: planner splits the batch, ReadGroupAsync handles each UDT group (one EnsureTagRuntimeAsync on the parent + one ReadAsync + N DecodeValueAt calls), ReadSingleAsync handles each fallback (the pre-#194 per-tag path, now extracted + threaded through). A per-group failure stamps the mapped libplctag status across every grouped member only — sibling groups + fallback refs are unaffected. Health-surface updates happen once per successful group rather than once per member to avoid ping-ponging the DriverState bookkeeping. 
Five AbCipUdtMemberLayoutTests: packed atomics get natural-alignment offsets including 8-byte pad before LInt; SInts pack without padding; BOOL/String/Structure opt out + return null; empty member list returns null. Six AbCipUdtReadPlannerTests: two members group; single-member demotes to fallback; unknown references fall back without poisoning groups; atomic top-level tags fall back untouched; UDTs containing BOOL don't group; original indices survive out-of-order batches. Five AbCipDriverWholeUdtReadTests (real driver + fake runtime): two grouped members trigger exactly one parent read + one fake runtime (proving the optimization engages); each member decodes at its own offset via ValuesByOffset; parent-read non-zero status stamps Bad across the group; mixed UDT-member + atomic top-level batch produces 2 runtimes + 2 reads (not 3); single-member-of-UDT still uses the member-level runtime (proving demotion works). Driver builds 0 errors; AbCip.Tests 227/227 (was 211, +16 new).
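The natural-alignment offset computation at the heart of AbCipUdtMemberLayout can be sketched like this — type and member names are simplified stand-ins, and BOOL/String/Structure opt-outs are omitted for brevity:

```csharp
using System;
using System.Collections.Generic;

public enum MemberType { SInt, Int, DInt, Real, LInt, LReal }

public static class UdtMemberLayout
{
    // Logix natural-alignment rules per the commit: SInt 1-byte, Int 2-byte,
    // DInt/Real 4-byte, LInt/LReal 8-byte. For these atomics, size == alignment.
    private static int AlignmentOf(MemberType t) => t switch
    {
        MemberType.SInt => 1,
        MemberType.Int  => 2,
        MemberType.DInt or MemberType.Real  => 4,
        MemberType.LInt or MemberType.LReal => 8,
        _ => throw new ArgumentOutOfRangeException(nameof(t)),
    };

    // Walk declared members in order, inserting alignment padding before each
    // member as needed, and record each member's byte offset in the parent buffer.
    public static IReadOnlyDictionary<string, int> ComputeOffsets(
        IEnumerable<(string Name, MemberType Type)> members)
    {
        var offsets = new Dictionary<string, int>();
        int cursor = 0;
        foreach (var (name, type) in members)
        {
            int align = AlignmentOf(type);
            cursor = (cursor + align - 1) / align * align;   // pad up to alignment
            offsets[name] = cursor;
            cursor += align;   // advance by the atomic's size
        }
        return offsets;
    }
}
```

With offsets in hand, one parent-UDT read plus N `DecodeValueAt(type, offset)` calls replaces N round-trips — e.g. a SInt followed by a LInt lands at offsets 0 and 8 (7 pad bytes inserted), which is exactly the "8-byte pad before LInt" case the layout tests pin down.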
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:17:57 -04:00
1ac87f1fac Merge pull request 'ADR-001 last-mile — Program.cs composes walker into production boot (#214)' (#157) from equipment-content-wiring into v2 2026-04-20 03:52:30 -04:00
Joseph Doherty
432173c5c4 ADR-001 last-mile — Program.cs composes EquipmentNodeWalker into the production boot path. Closes task #214 + fully lands ADR-001 Option A as a live code path, not just a connected set of unit-tested primitives. After this PR a server booted against a real Config DB with Published Equipment rows materializes the UNS tree into the OPC UA address space on startup — the whole walker → wire-in → loader chain (PRs #153, #154, #155, #156) finally fires end-to-end in the production process. DriverEquipmentContentRegistry is the handoff between OpcUaServerService's bootstrap-time populate pass + OpcUaApplicationHost's StartAsync walker invocation. It's a singleton mutable holder with Get/Set/Count + Lock-guarded internal dictionary keyed OrdinalIgnoreCase to match the DriverInstanceId convention used by Equipment / Tag rows + walker grouping. Set-once-per-bootstrap semantics in practice though nothing enforces that at the type level — OpcUaServerService.PopulateEquipmentContentAsync is the only expected writer. Shared-mutable rather than immutable-passed-by-value because the DI graph builds OpcUaApplicationHost before NodeBootstrap has resolved the generation, so the registry must exist at compose time + fill at boot time. Program.cs now registers OpcUaApplicationHost via a factory lambda that threads registry.Get as the equipmentContentLookup delegate PR #155 added to the ctor seam — the one-line composition the earlier PR promised. EquipmentNamespaceContentLoader (from PR #156) is AddScoped since it takes the scoped OtOpcUaConfigDbContext; the populate pass in OpcUaServerService opens one IServiceScopeFactory scope + reuses the same loader + DbContext across every driver query rather than scoping-per-driver. 
OpcUaServerService.ExecuteAsync gets a new PopulateEquipmentContentAsync step between bootstrap + StartAsync: iterates DriverHost.RegisteredDriverIds, calls loader.LoadAsync per driver at the bootstrapped generationId, stashes non-null results in the registry. Null results are skipped — the wire-in's null-check treats absent registry entries as "this driver isn't Equipment-kind; let DiscoverAsync own the address space" which is the correct backward-compat path for Modbus / AB CIP / TwinCAT / FOCAS. Guarded on result.GenerationId being non-null — a fleet with no Published generation yet boots cleanly into a UNS-less address space and fills the registry on the next restart after first publish. Ctor on OpcUaServerService gained two new dependencies (DriverEquipmentContentRegistry + IServiceScopeFactory). No test file constructs OpcUaServerService directly so no downstream test breakage — the BackgroundService is only wired via DI in Program.cs. Four new DriverEquipmentContentRegistryTests: Get-null-for-unknown, Set-then-Get, case-insensitive driver-id lookup, Set-overwrites-existing. Server.Tests 190/190 (was 186, +4 new registry tests). Full ADR-001 Option A now lives at every layer: Core.OpcUa walker (#153) → ScopePathIndexBuilder (#154) → OpcUaApplicationHost wire-in (#155) → EquipmentNamespaceContentLoader (#156) → this PR's registry + Program.cs composition. The last pending loose end (full-integration smoke test that boots Program.cs against a seeded Config DB + verifies UNS tree via live OPC UA client) isn't strictly necessary because PR #155's OpcUaEquipmentWalkerIntegrationTests already proves the wire-in at the OPC UA client-browse level — the Program.cs composition added here is purely mechanical + well-covered by the four-file audit trail plus registry unit tests.
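The registry's contract described above fits in a short sketch — `EquipmentNamespaceContent` is stubbed here for self-containment, and the member names follow the commit text:

```csharp
using System;
using System.Collections.Generic;

public sealed class EquipmentNamespaceContent { /* UNS areas/lines/equipment/tags */ }

public sealed class DriverEquipmentContentRegistry
{
    private readonly object _lock = new();
    // OrdinalIgnoreCase to match the DriverInstanceId convention used by
    // Equipment/Tag rows and the walker's grouping.
    private readonly Dictionary<string, EquipmentNamespaceContent> _byDriver =
        new(StringComparer.OrdinalIgnoreCase);

    public void Set(string driverInstanceId, EquipmentNamespaceContent content)
    {
        lock (_lock) { _byDriver[driverInstanceId] = content; }
    }

    // Null means "this driver isn't Equipment-kind; let DiscoverAsync own
    // the address space" — the wire-in's backward-compat path.
    public EquipmentNamespaceContent? Get(string driverInstanceId)
    {
        lock (_lock)
        {
            return _byDriver.TryGetValue(driverInstanceId, out var c) ? c : null;
        }
    }

    public int Count { get { lock (_lock) return _byDriver.Count; } }
}
```

Shared-mutable by design: the DI graph constructs OpcUaApplicationHost before the generation is resolved, so the holder must exist at compose time and fill at boot time.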
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 03:50:37 -04:00
f6d98cfa6b Merge pull request 'EquipmentNamespaceContentLoader — Config-DB loader for walker wire-in' (#156) from equipment-content-loader into v2 2026-04-20 03:21:48 -04:00
Joseph Doherty
a29828e41e EquipmentNamespaceContentLoader — Config-DB loader that fills the (driverInstanceId, generationId) shape the walker wire-in from PR #155 consumes. Narrow follow-up to PR #155: the ctor plumbing on OpcUaApplicationHost already takes a Func<string, EquipmentNamespaceContent?>? lookup; this PR lands the loader that will back that lookup against the central Config DB at SealedBootstrap time. DI composition in Program.cs is a separate structural PR because it needs the generation-resolve chain restructured to run before OpcUaApplicationHost construction — this one just lands the loader + unit tests so the wiring PR reduces to one factory lambda. Loader scope is one driver instance at one generation: joins Equipment filtered by (DriverInstanceId == driver, GenerationId == gen, Enabled) first, then UnsLines reachable from those Equipment rows, then UnsAreas reachable from those lines, then Tags filtered by (DriverInstanceId == driver, GenerationId == gen). Returns null when the driver has no Equipment at the supplied generation — the wire-in's null-check treats that as "skip the walker; let DiscoverAsync own the whole address space" which is the correct backward-compat behavior for non-Equipment-kind drivers (Modbus / AB CIP / TwinCAT / FOCAS whose namespace-kind is native per decisions #116-#121). Only loads the UNS branches that actually host this driver's Equipment — skips pulling unrelated UNS folders from other drivers' regions of the cluster by deriving lineIds/areaIds from the filtered Equipment set rather than reloading the full UNS tree. Enabled=false Equipment are skipped at the query level so a decommissioned machine doesn't produce a phantom browse folder — Admin still sees it in the diff view via the regular Config-DB queries but the walker's browse output reflects the operational fleet. 
AsNoTracking on every query because the bootstrap flow is read-only + the result is handed off to a pure-function walker immediately; change tracking would pin rows in the DbContext for the full server lifetime with no corresponding write path. Five new EquipmentNamespaceContentLoaderTests using InMemoryDatabase: (a) null result when driver has no Equipment; (b) baseline happy-path loads the full shape correctly; (c) other driver's rows at the same generation don't leak into this driver's result (per-driver scope contract); (d) same-driver rows at a different generation are skipped (per-generation scope contract per decision #148); (e) Enabled=false Equipment are skipped. Server project builds 0 errors; Server.Tests 186/186 (was 181, +5 new loader tests). Once the wiring PR lands the factory lambda in Program.cs the loader closes over the SealedBootstrap-resolved generationId + the lookup delegate delegates to LoadAsync via IServiceScopeFactory — a one-line composition, no ctor-signature churn on OpcUaApplicationHost because PR #155 already established the seam.
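The loader's scoping rules reduce to a pure function over the row sets, which this in-memory sketch illustrates (the real loader runs the same filters as AsNoTracking EF queries; the record shapes here are simplified stand-ins):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record Equipment(int Id, string DriverInstanceId, long GenerationId,
                               bool Enabled, int UnsLineId);
public sealed record UnsLine(int Id, int UnsAreaId);
public sealed record UnsArea(int Id);

public static class EquipmentContentScoper
{
    public static (List<UnsArea> Areas, List<UnsLine> Lines, List<Equipment> Equipment)?
        Scope(IEnumerable<Equipment> allEquipment,
              IEnumerable<UnsLine> allLines,
              IEnumerable<UnsArea> allAreas,
              string driverId, long generationId)
    {
        // Equipment first: (driver, generation, Enabled). Enabled=false rows are
        // dropped at this level so a decommissioned machine never produces a
        // phantom browse folder.
        var equipment = allEquipment
            .Where(e => string.Equals(e.DriverInstanceId, driverId,
                                      StringComparison.OrdinalIgnoreCase)
                     && e.GenerationId == generationId
                     && e.Enabled)
            .ToList();

        // Null => "skip the walker; DiscoverAsync owns the address space".
        if (equipment.Count == 0) return null;

        // Only the UNS branches that actually host this driver's Equipment:
        // derive line/area ids from the filtered set, never the full tree.
        var lineIds = equipment.Select(e => e.UnsLineId).ToHashSet();
        var lines = allLines.Where(l => lineIds.Contains(l.Id)).ToList();
        var areaIds = lines.Select(l => l.UnsAreaId).ToHashSet();
        var areas = allAreas.Where(a => areaIds.Contains(a.Id)).ToList();

        return (areas, lines, equipment);
    }
}
```

The per-driver and per-generation scope contracts in the loader tests fall out directly: rows belonging to another driver, another generation, or a disabled machine simply never survive the first filter.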
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 03:19:45 -04:00
f5076b4cdd Merge pull request 'ADR-001 wire-in — EquipmentNodeWalker in OpcUaApplicationHost (#212 + #213)' (#155) from equipment-walker-wire-in into v2 2026-04-20 03:11:35 -04:00
Joseph Doherty
2d97f241c0 ADR-001 wire-in — EquipmentNodeWalker runs inside OpcUaApplicationHost before driver DiscoverAsync, closing tasks #212 + #213. Completes the in-server half of the ADR-001 Option A story: Task A (PR #153) shipped the pure-function walker in Core.OpcUa; Task B (PR #154) shipped the NodeScopeResolver + ScopePathIndexBuilder + evaluator-level authz proof. This PR lands the BuildAddressSpaceAsync wire-in the walker was always meant to plug into + a full-stack OPC UA client-browse integration test that proves the UNS folder skeleton is actually visible to real UA clients end-to-end, not just to the RecordingBuilder test double. OpcUaApplicationHost gains an optional ctor parameter equipmentContentLookup of type Func<string, EquipmentNamespaceContent?>? — when supplied + non-null for a driver instance, EquipmentNodeWalker.Walk is invoked against that driver's node manager BEFORE GenericDriverNodeManager.BuildAddressSpaceAsync streams the driver's native DiscoverAsync output on top. Walker-first ordering matters: the UNS Area/Line/Equipment folder skeleton + Identification sub-folders + the five identifier properties (decision #121) are in place so driver-native references (driver-specific tag paths) land ALONGSIDE the UNS tree rather than racing it. Callers that don't supply a lookup (every existing pre-ADR-001 test + the v1 upgrade path) get identical behavior — the null-check is the backward-compat seam per the opt-in design sketched in ADR-001. The lookup delegate is driver-instance-scoped, not server-scoped, so a single server with multiple drivers can serve e.g. one Equipment-kind namespace (Galaxy proxy with a full UNS) alongside several native-kind namespaces (Modbus / AB CIP / TwinCAT / FOCAS that do not have their own UNS because decisions #116-#121 scope UNS to Equipment-kind only). 
SealedBootstrap.Start will wire this lookup against the Config-DB snapshot loader in a follow-up — the lookup plumbing lands first so that wiring reduces to one-line composition rather than a ctor-signature churn. New OpcUaEquipmentWalkerIntegrationTests spins up a real OtOpcUaServer on a non-default port with an EmptyDriver that registers with zero native content + a lookup that returns a seeded EquipmentNamespaceContent (one area warsaw / one line line-a / one equipment oven-3 / one tag Temperature). An OPC UA client session connects anonymously against the un-secured endpoint, browses the standard hierarchy, + asserts: (a) area folder warsaw contains line-a folder as a child; (b) line folder line-a contains oven-3 folder as a child; (c) equipment folder oven-3 contains EquipmentId + EquipmentUuid + MachineCode identifier properties — ZTag + SAPID correctly absent because the fixture leaves them null per decision #121 skip-when-null behavior; (d) the bound Tag emits a Variable node under the equipment folder with NodeId == Tag.TagConfig (the wire-level driver address) + the client can ReadValue against it end-to-end through the DriverNodeManager dispatch path. Because the EmptyDriver's DiscoverAsync is a no-op the test proves UNS content came from the walker, not the driver — the original ADR-001 question "what actually owns the browse tree" now has a mechanical answer visible at the OPC UA wire level. Test class uses its own port (48500+rand) + per-test PKI root so it runs in parallel with the existing OpcUaServerIntegrationTests fixture (48400+rand) without binding or cert collisions. Server project builds 0 errors; Server.Tests 181/181 (was 179, +2 new full-stack walker tests). Task #212 + #213 closed; the follow-up SealedBootstrap wiring is the natural next pickup because the ctor plumbing lands here + that becomes a narrow downstream PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 03:09:37 -04:00
5811ede744 Merge pull request (#154) - ADR-001 Task B + #195 close-out 2026-04-20 02:52:26 -04:00
Joseph Doherty
1bf3938cdf ADR-001 Task B — NodeScopeResolver full-path + ScopePathIndexBuilder + evaluator-level ACL test closing #195. Two production additions + one end-to-end authz regression test proving the Identification ACL contract the IdentificationFolderBuilder docstring promises. Task A (PR #153) shipped the walker as a pure function that materializes the UNS → Equipment → Tag browse tree + IdentificationFolderBuilder.Build per Equipment. This PR lands the authz half of the walker's story — the resolver side that turns a driver-side full reference into a full NodeScope path (NamespaceId + UnsAreaId + UnsLineId + EquipmentId + TagId) so the permission trie can walk the UNS hierarchy + apply Equipment-scope grants correctly at dispatch time. The actual in-server wiring (load snapshot → call walker during BuildAddressSpaceAsync → swap in the full-path resolver) is split into follow-up task #212 because it's a bigger surface (Server bootstrap + DriverNodeManager override + real OPC UA client-browse integration test). NodeScopeResolver extended with a second constructor taking IReadOnlyDictionary<string, NodeScope> pathIndex — when supplied, Resolve looks up the full reference in the index + returns the indexed scope with every UNS level populated; when absent or on miss, falls back to the pre-ADR-001 cluster-only scope so driver-discovered tags that haven't been indexed yet (between a DiscoverAsync result + the next generation publish) stay addressable without crashing the resolver. Index is frozen into a FrozenDictionary<string, NodeScope> under Ordinal comparer for O(1) hot-path lookups. Thread-safety by immutability — callers swap atomically on generation change via the server's publish pipeline. New ScopePathIndexBuilder.Build in Server.Security takes (clusterId, namespaceId, EquipmentNamespaceContent) + produces the fullReference → NodeScope dictionary by joining Tag → Equipment → UnsLine → UnsArea through up-front dictionaries keyed Ordinal-ignoring-case. 
Tag rows with null EquipmentId (SystemPlatform-namespace Galaxy tags per decision #120) are excluded from the index; cluster-only fallback path covers them. Broken FKs (Tag references missing Equipment row, or Equipment references missing UnsLine) are skipped rather than crashing — sp_ValidateDraft should have caught these at publish, any drift here is unexpected but non-fatal. Duplicate keys throw InvalidOperationException at bootstrap so corrupt-data drift surfaces up-front instead of producing silently-last-wins scopes at dispatch. End-to-end authz regression test in EquipmentIdentificationAuthzTests walks the full dispatch flow against a Config-DB-style fixture: ScopePathIndexBuilder.Build from the same EquipmentNamespaceContent the EquipmentNodeWalker consumes → NodeScopeResolver with that index → AuthorizationGate + TriePermissionEvaluator → PermissionTrieBuilder with one Equipment-scope NodeAcl grant + a NodeAclPath resolving Equipment ScopeId to (namespace, area, line, equipment). Four tests prove the contract: (a) authorized group Read granted on Identification property; (b) unauthorized group Read denied on Identification property — the #195 contract the IdentificationFolderBuilder docstring promises (the BadUserAccessDenied surfacing happens at the DriverNodeManager dispatch layer which is already wired to AuthorizationGate.IsAllowed → StatusCodes.BadUserAccessDenied in PR #94); (c) Equipment-scope grant cascades to both the Equipment's tag + its Identification properties because they share the Equipment ScopeId — no new scope level for Identification per the builder's Remarks section; (d) grant on oven-3 does NOT leak to press-7 (different equipment under the same UnsLine) proving per-Equipment isolation at dispatch when the resolver populates the full path. 
NodeScopeResolverTests extended with two new tests covering the indexed-lookup path + fallback-on-miss path; renamed the existing "_For_Phase1" test to "_When_NoIndexSupplied" to match the current framing. Server project builds 0 errors; Server.Tests 179/179 (was 173, +6 new across the two test files). Task #212 captures the remaining in-server wiring work — Server.SealedBootstrap load of EquipmentNamespaceContent, DriverNodeManager override that calls EquipmentNodeWalker during BuildAddressSpaceAsync for Equipment-kind namespaces, and a real OPC UA client-browse integration test. With that wiring + this PR's authz-layer proof, #195's "ACL integration test" line is satisfied at two layers (evaluator + live endpoint) which is stronger than the task originally asked for.
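The index-build + resolver-fallback contract from the two additions above can be sketched as follows (Python, illustrative row shapes; the real code joins EF Core rows and freezes into a FrozenDictionary):

```python
# Sketch of the ScopePathIndexBuilder/NodeScopeResolver contract: join tags to
# their UNS path case-insensitively, skip null-Equipment tags and broken FKs,
# raise on duplicate keys, and fall back to a cluster-only scope on miss.
def build_scope_path_index(cluster_id, namespace_id, tags, equipment, lines):
    eq_by_id = {e["id"].lower(): e for e in equipment}
    line_by_id = {l["id"].lower(): l for l in lines}
    index = {}
    for tag in tags:
        if tag["equipment_id"] is None:
            continue  # SystemPlatform tags: cluster-only fallback covers them
        eq = eq_by_id.get(tag["equipment_id"].lower())
        if eq is None:
            continue  # broken FK: skip rather than crash
        line = line_by_id.get(eq["line_id"].lower())
        if line is None:
            continue
        key = tag["full_reference"]
        if key in index:
            raise ValueError(f"duplicate full reference: {key}")
        index[key] = (cluster_id, namespace_id, line["area_id"],
                      line["id"], eq["id"], tag["id"])
    return index

def resolve(full_reference, cluster_id, index=None):
    # Indexed hit returns the fully populated path; a miss, or no index at
    # all, falls back to the pre-ADR-001 cluster-only scope.
    if index is not None and full_reference in index:
        return index[full_reference]
    return (cluster_id, None, None, None, None, None)
```

The duplicate-key raise mirrors the bootstrap-time InvalidOperationException: corrupt data surfaces up front instead of silently-last-wins at dispatch.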
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 02:50:27 -04:00
7a42f6d84c Merge pull request (#153) - EquipmentNodeWalker (ADR-001 Task A) 2026-04-20 02:40:58 -04:00
Joseph Doherty
2b2991c593 EquipmentNodeWalker — pure-function UNS tree materialization (ADR-001 Task A, task #210). The walker traverses the Config-DB snapshot for a single Equipment-kind namespace (Areas / Lines / Equipment / Tags) and streams IAddressSpaceBuilder.Folder + Variable + AddProperty calls to materialize the canonical 5-level Unified Namespace browse tree that decisions #116-#121 promise external consumers. Pure function: no OPC UA SDK dependency, no DB access, no state — consumes pre-loaded EF Core row collections + streams into the supplied builder. Server-side wiring (load snapshot → call walker → per-tag capability probe) is Task B's scope, alongside NodeScopeResolver's Config-DB join + the ACL integration test that closes task #195. This PR is the Core.OpcUa primitive the server will consume. Walk algorithm — content is grouped up-front (lines by area, equipment by line, tags by equipment) into OrdinalIgnoreCase dictionaries so the per-level nested foreach stays O(N+M) rather than O(N·M) at each UNS level; orderings are deterministic on Name with StringComparer.Ordinal so diffs across runs (e.g. integration-test assertions) are stable. Areas → Lines → Equipment emitted as Folder nodes with browse-name = Name per decision #120. Under each Equipment folder: five identifier properties per decision #121 (EquipmentId + EquipmentUuid always; MachineCode always — it's a required column on the entity; ZTag + SAPID skipped when null to avoid empty-string property noise); IdentificationFolderBuilder.Build materializes the OPC 40010 sub-folder when HasAnyFields(equipment) returns true, skipped otherwise to avoid a pointless empty folder; then one Variable node per Tag row bound to this Equipment (Tag.EquipmentId non-null matches Equipment.EquipmentId) emitted in Name order. Tags with null EquipmentId are walker-skipped — those are SystemPlatform-kind (Galaxy) tags that take the driver-native DiscoverAsync path per decision #120. 
DriverAttributeInfo construction: FullName = Tag.TagConfig (driver-specific wire-level address); DriverDataType parsed from Tag.DataType which stores the enum name string per decision #138; unparseable values fall back to DriverDataType.String so a one-off driver-specific type doesn't abort the whole walk (driver still sees the original address at runtime + can surface its own typed value via the variant). Address validation is deliberately NOT done at build time per ADR-001 Option A: unreachable addresses surface as OPC UA Bad status via the natural driver-read failure path at runtime, legible to operators through their Admin UI + OPC UA client inspection. Eight new EquipmentNodeWalkerTests: empty content emits nothing; Area/Line/Equipment folder emission order matches Name-sorted deterministic traversal; five identifier properties appear on Equipment nodes with correct values, ZTag + SAPID skipped when null + emitted when non-null; Identification sub-folder materialized when at least one OPC 40010 field is non-null + omitted when all are null; tags with matching EquipmentId emit as Variable nodes under the Equipment folder in Name order, tags with null EquipmentId walker-skipped; unparseable DataType falls back to String. RecordingBuilder test double captures Folder/Variable/Property calls into a tree structure tests can navigate. Core project builds 0 errors; Core.Tests 190/190 (was 182, +8 new walker tests). No Server/Admin changes — Task B lands the server-side wiring + consumes this walker from DriverNodeManager.
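The grouping, ordering, and type-fallback strategy the walker uses can be sketched as (Python, illustrative field names; the real walker streams IAddressSpaceBuilder calls and parses DriverDataType enum names):

```python
# Sketch of the walk strategy: pre-group children by parent key so the nested
# loops stay O(N+M), sort by Name with an ordinal comparer for deterministic
# output, and fall back to "String" when a DataType name doesn't parse.
from collections import defaultdict

KNOWN_DATA_TYPES = {"Boolean", "Int32", "Float", "Double", "String"}

def group_by(rows, key):
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key].lower()].append(row)  # OrdinalIgnoreCase analogue
    return grouped

def parse_data_type(name: str) -> str:
    # Unparseable values fall back to String so one odd row
    # doesn't abort the whole walk.
    return name if name in KNOWN_DATA_TYPES else "String"

def walk(areas, lines, equipment, tags, emit):
    lines_by_area = group_by(lines, "area_id")
    eq_by_line = group_by(equipment, "line_id")
    bound = [t for t in tags if t["equipment_id"] is not None]  # null: skip
    tags_by_eq = group_by(bound, "equipment_id")
    for area in sorted(areas, key=lambda a: a["name"]):  # ordinal sort
        emit("folder", area["name"])
        for line in sorted(lines_by_area[area["id"].lower()],
                           key=lambda l: l["name"]):
            emit("folder", line["name"])
            for eq in sorted(eq_by_line[line["id"].lower()],
                             key=lambda e: e["name"]):
                emit("folder", eq["name"])
                for tag in sorted(tags_by_eq[eq["id"].lower()],
                                  key=lambda t: t["name"]):
                    emit("variable",
                         (tag["name"], parse_data_type(tag["data_type"])))
```

Deterministic Name ordering is what makes integration-test assertions stable across runs.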
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 02:39:00 -04:00
9711d0c097 Merge pull request (#152) - ADR-001 Accepted (Option A) 2026-04-20 02:32:41 -04:00
Joseph Doherty
1ddc13b7fc ADR-001 accepted — Option A (Config-primary walker); Option D (discovery-assist) deferred to v2.1. Spawning Task A + Task B. 2026-04-20 02:30:57 -04:00
Joseph Doherty
97e1f55bbb Draft ADR-001 — Equipment node walker: how driver tags bind to the UNS address space. Frames the decision blocking task #195 (IdentificationFolderBuilder wire-in): the Equipment-namespace browse tree requires a Config-DB-driven walker that traverses UNS → Equipment → Tag + hangs Identification sub-folders + identifier properties, and the open question is how driver-discovered tags bind to the UNS Equipment nodes the walker materializes. Context section documents what already exists (IdentificationFolderBuilder unused; NodeScopeResolver at Phase-1 cluster-only stub; Equipment + UnsArea + UnsLine + Tag tables with decisions #110 #116 #117 #120 #121 already landed as the data-model contract) vs what's missing (the walker itself + the ITagDiscovery/Config-DB composition strategy). Four options laid out with trade-offs: Option A Config-primary (Tag rows are the sole source of truth; ITagDiscovery becomes enrichment; BadNotFound placeholder when driver can't address a declared tag); Option B Discovery-primary (driver output is authoritative; Config-DB Equipment rows select subsets); Option C Parallel namespaces (driver-native ns + UNS overlay ns cross-referencing via OPC UA Organizes); Option D Config-primary-with-discovery-assist (same as A at runtime, plus an Admin UI offline discovery panel that lets operators one-click-import discovered tags into the draft). Recommendation: Option A now, defer Option D to v2.1. Reasons: matches decision #110's framing straight-through, identical composition across every Equipment-kind driver, Phase 6.4 Admin UI already authors Tag rows, BadNotFound is a legible failure mode, and nothing in A blocks adding D later without changing the walker contract. 
If the ADR is accepted, spawns two tasks: Task A builds EquipmentNodeWalker in Core.OpcUa (cluster → namespace → area → line → equipment → tag traversal, IdentificationFolderBuilder per Equipment, 5 identifier properties, BadNotFound placeholders, integration tests); Task B extends NodeScopeResolver to join against Config DB + populate full NodeScope path (unblocks per-Equipment/per-UnsLine ACL granularity + closes task #195 with the ACL integration test from the builder's docstring cross-reference). Consequences-if-we-don't-decide section captures the status quo: Identification metadata ships in DB + Admin UI but never reaches the OPC UA endpoint, external consumers can't resolve equipment via OPC UA properties as decision #121 promises, and NodeScopeResolver stays cluster-level so finer ACL grants are effectively cluster-wide at dispatch (Phase 6.2 rollout limitation, not correctness bug). Draft status — seeking decision before spawning the two implementation tasks. If accepted I'll add the tasks + start on Task A.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 02:28:10 -04:00
cb2a375548 Merge pull request (#151) - Phase 2 close-out 2026-04-20 02:02:33 -04:00
Joseph Doherty
33b87a3aa4 Phase 2 official close-out. Closes task #209. The 2026-04-18 exit-gate-phase-2-final.md captured Phase 2 state at PR 2 merge — four High/Medium adversarial findings still OPEN, Historian port + alarm subsystem + v1 archive deletion all deferred. Since then: PR 4 closed all four findings end-to-end (High 1 Read subscription-leak, High 2 no reconnect loop, Medium 3 SubscribeAsync doesn't push frames, Medium 4 WriteValuesAsync doesn't await OnWriteComplete — mapped + resolved inline in the new doc), PR 12 landed the richer historian quality mapper, PR 13 shipped GalaxyRuntimeProbeManager with per-Platform/AppEngine ScanState subscriptions + StateChanged events forwarded through the existing OnHostStatusChanged IPC frame, PR 14 wired the alarm subsystem (GalaxyAlarmTracker advising the four alarm-state attributes per IsAlarm=true attribute, raising AlarmTransition events forwarded through OnAlarmEvent IPC frames), Phase 3 PR 18 deleted the v1 source trees, and PR 61 closed V1_ARCHIVE_STATUS.md. Phase 2 is functionally done; this commit is the bookkeeping pass. New exit-gate-phase-2-closed.md at docs/v2/implementation/ — five-stream status table (A/B/C/D/E all complete with the specific close commits named), full resolution table for every 2026-04-18 adversarial finding mapped to the PR 4 resolution, cross-cutting deferrals table marking every one resolved (Historian SDK plugin port → done, subscription push frames → done under Medium 3, Historian-backed HistoryRead → done, alarm subsystem wire-up → done, reconnect-without-recycle → done under High 2, v1 archive deletion → done). Fresh 2026-04-20 test baseline captured from the current v2 tip: 1844 passing + 29 infra-gated skips across 21 test projects, including the net48 x86 Galaxy.Host.Tests suite (107 pass) that exercises the MXAccess COM path on the dev box. 
Flake observed — Configuration.Tests 70/71 on first full-solution run, 71/71 on retry; logged as a known non-stable flake rather than chased because it did not reproduce. The prior exit-gate-phase-2-final.md is kept in place (historical record of the 2026-04-18 snapshot) but gets a superseded-by banner at the top pointing at the new close-out doc so future readers land on current status first. docs/v2/plan.md Phase 2 section header gains the CLOSED 2026-04-20 marker + a link to the close-out doc so the top-level plan index reflects reality. "What Phase 2 closed means for Phase 3 and later" section in the new doc captures the downstream contract: Galaxy now runs as a first-class v2 driver with the same capability-interface shape as Modbus / S7 / AbCip / AbLegacy / TwinCAT / FOCAS / OpcUaClient; no v1 code path remains; the 2026-04-13 stability findings persist as named regression tests under tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/StabilityFindingsRegressionTests.cs so any future refactor reintroducing them trips the test. "Outstanding — not Phase 2 blockers" section lists the four pending non-Phase-2 tasks (#177, #194, #195, #199) so nobody mistakes them for Phase 2 tail work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 02:00:35 -04:00
2391de7f79 Merge pull request (#150) - Client rename residuals (#207 + #208) 2026-04-20 01:52:40 -04:00
Joseph Doherty
f9bc301c33 Client rename residuals: lmxopcua-cli → otopcua-cli + LmxOpcUaClient → OtOpcUaClient with migration shim. Closes task #208 (the executable-name + LocalAppData-folder slice that was called out in Client.CLI.md / Client.UI.md as a deliberately-deferred residual of the Phase 0 rename). Six source references flipped to the canonical OtOpcUaClient spelling: Program.cs CliFx executable name + description (lmxopcua-cli → otopcua-cli), DefaultApplicationConfigurationFactory.cs ApplicationName + ApplicationUri (LmxOpcUaClient + urn:localhost:LmxOpcUaClient → OtOpcUaClient + urn:localhost:OtOpcUaClient), OpcUaClientService.CreateSessionAsync session-name arg, ConnectionSettings.CertificateStorePath default, MainWindowViewModel.CertificateStorePath default, JsonSettingsService.SettingsDir. Two consuming tests (ConnectionSettingsTests + MainWindowViewModelTests) updated to assert the new canonical name. New ClientStoragePaths static helper at src/ZB.MOM.WW.OtOpcUa.Client.Shared/ClientStoragePaths.cs is the migration shim — single entry point for the PKI root + pki subpath, runs a one-shot legacy-folder probe on first resolution: if {LocalAppData}/LmxOpcUaClient/ exists + {LocalAppData}/OtOpcUaClient/ does not, Directory.Move renames it in place (atomic on NTFS within the same volume) so trusted server certs + saved connection settings persist across the rename without operator action. Idempotent per-process via a Lock-guarded _migrationChecked flag so repeated CertificateStorePath getter calls on the hot path pay no IO cost beyond the first. Fresh-install path (neither folder exists) + already-migrated path (only canonical exists) + manual-override path (both exist — developer has set up something explicit) are all no-ops that leave state alone. IOException on the Directory.Move is swallowed + logged as a false return so a concurrent peer process losing the race doesn't crash the consumer; the losing process falls back to whatever state exists. 
Five new ClientStoragePathsTests assert: GetRoot ends with canonical name under LocalAppData, GetPkiPath nests pki under root, CanonicalFolderName is OtOpcUaClient, LegacyFolderName is LmxOpcUaClient (the migration contract — a typo here would leak the legacy folder past the shim), repeat invocation returns false after first-touch arms the in-process guard. Doc-side residual-explanation notes in docs/Client.CLI.md + docs/Client.UI.md are dropped now that the rename is real; replaced with a short "pre-#208 dev boxes migrate automatically on first launch" note that points at ClientStoragePaths. Sample CLI invocations in Client.CLI.md updated via sed from lmxopcua-cli to otopcua-cli across every command block (14 replacements). Pre-existing staleness in SubscribeCommandTests.Execute_PrintsSubscriptionMessage surfaced during the test run — the CLI's subscribe command has long since switched to an aggregate "Subscribed to {count}/{total} nodes (interval: ...)" output format but the test still asserted the original single-node form. Updated the assertion to match current output + added a comment explaining the change; this is unrelated to the rename but was blocking a green Client.CLI.Tests run. Full solution build 0 errors; Client.Shared.Tests 136/136 + 5 new shim tests passing; Client.UI.Tests 98/98; Client.CLI.Tests 52/52 (was 51/52 before the subscribe-test fix). No Admin/Core/Server changes — this touches only the client layer.
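The shim's probe logic can be sketched as (Python, mirroring ClientStoragePaths only in shape; the real code uses Directory.Move and a C# Lock):

```python
# Sketch of the one-shot legacy-folder migration: rename only when the legacy
# folder exists and the canonical one doesn't, swallow the race loser's
# OSError, and arm an in-process guard so repeat calls skip the IO.
import os, tempfile, threading

LEGACY, CANONICAL = "LmxOpcUaClient", "OtOpcUaClient"
_lock = threading.Lock()
_migration_checked = False

def try_migrate(local_app_data: str) -> bool:
    global _migration_checked
    with _lock:
        if _migration_checked:
            return False  # idempotent per-process: first touch arms the guard
        _migration_checked = True
        legacy = os.path.join(local_app_data, LEGACY)
        canonical = os.path.join(local_app_data, CANONICAL)
        # Fresh install, already migrated, or both-exist override: no-op.
        if not os.path.isdir(legacy) or os.path.isdir(canonical):
            return False
        try:
            os.rename(legacy, canonical)  # atomic within one volume
            return True
        except OSError:
            return False  # a concurrent peer won the race; use what exists
```

The three no-op branches match the fresh-install, already-migrated, and manual-override paths called out above.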
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:50:40 -04:00
Joseph Doherty
12d748c4f3 CLAUDE.md — TopShelf + LdapAuthenticationProvider stale references. Closes task #207. The docs-refresh agent sweep (PR #149) flagged two stale library/class references in the root CLAUDE.md that the v2 refactors obsoleted but the project-level instructions missed. Service hosting line replaced with the three-process reality: Server + Admin use .NET generic-host AddWindowsService (decision #30 explicitly replaced TopShelf in v2 — OpcUaServerService.cs carries the decision-#30 comment inline); Galaxy.Host is a plain console app wrapped by NSSM because its .NET-Framework-4.8-x86 target can't use the generic-host Windows-service integration + the MXAccess COM bitness requirement pins it there anyway. The LDAP-auth mention gains the actual class name LdapUserAuthenticator (src/ZB.MOM.WW.OtOpcUa.Server/Security/LdapUserAuthenticator.cs) implementing IUserAuthenticator — previously claimed LdapAuthenticationProvider + IUserAuthenticationProvider + IRoleProvider, none of which exist in the source tree (the docs-refresh agent grepped for them; they're truly gone). No functional impact — CLAUDE.md is operator-facing + informs future agent runs about the stack, not compile-time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:41:16 -04:00
e9b1d107ab Merge pull request (#149) - Doc refresh for multi-driver OtOpcUa 2026-04-20 01:37:00 -04:00
Joseph Doherty
5506b43ddc Doc refresh (task #204) — operational docs for multi-process multi-driver OtOpcUa
Five operational docs rewritten for v2 (multi-process, multi-driver, Config-DB authoritative):

- docs/Configuration.md — replaced appsettings-only story with the two-layer model.
  appsettings.json is bootstrap only (Node identity, Config DB connection string,
  transport security, LDAP bind, logging). Authoritative config (clusters, namespaces,
  UNS, equipment, tags, driver instances, ACLs, role grants, poll groups) lives in
  the Config DB accessed via OtOpcUaConfigDbContext and edited through the Admin UI
  draft/publish workflow. Added v1-to-v2 migration index so operators can locate where
  each old section moved. Cross-links to docs/v2/config-db-schema.md + docs/v2/admin-ui.md.

- docs/Redundancy.md — Phase 6.3 rewrite. Named every class under
  src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/: RedundancyCoordinator, RedundancyTopology,
  ApplyLeaseRegistry (publish fencing), PeerReachabilityTracker, RecoveryStateManager,
  ServiceLevelCalculator (pure function), RedundancyStatePublisher. Documented the
  full 11-band ServiceLevel matrix (Maintenance=0 through AuthoritativePrimary=255)
  from ServiceLevelCalculator.cs and the per-ClusterNode fields (RedundancyRole,
  ServiceLevelBase, ApplicationUri). Covered metrics
  (otopcua.redundancy.role_transition counter + primary/secondary/stale_count gauges
  on meter ZB.MOM.WW.OtOpcUa.Redundancy) and SignalR RoleChanged push from
  FleetStatusPoller to RedundancyTab.razor.

- docs/security.md — preserved the transport-security section (still accurate) and
  added Phase 6.2 authorization. Four concerns now documented in one place:
  (1) transport security profiles, (2) OPC UA auth via LdapUserAuthenticator
  (note: task spec called this LdapAuthenticationProvider — actual class name is
  LdapUserAuthenticator in Server/Security/), (3) data-plane authorization via
  NodeAcl + PermissionTrie + AuthorizationGate — additive-only model per decision
  #129, ClusterId → Namespace → UnsArea → UnsLine → Equipment → Tag hierarchy,
  NodePermissions bundle, PermissionProbeService in Admin for "probe this permission",
  (4) control-plane authorization via LdapGroupRoleMapping + AdminRole
  (ConfigViewer / ConfigEditor / FleetAdmin, CanEdit / CanPublish policies) —
  deliberately independent of data-plane ACLs per decision #150. Documented the
  OTOPCUA0001 Roslyn analyzer (UnwrappedCapabilityCallAnalyzer) as the compile-time
  guard ensuring every driver-capability async call is wrapped by CapabilityInvoker.

- docs/ServiceHosting.md — three-process rewrite: OtOpcUa Server (net10 x64,
  BackgroundService + AddWindowsService, hosts OPC UA endpoint + all non-Galaxy
  drivers), OtOpcUa Admin (net10 x64, Blazor Server + SignalR + /metrics via
  OpenTelemetry Prometheus exporter), OtOpcUa Galaxy.Host (.NET Framework 4.8 x86,
  NSSM-wrapped, env-variable driven, STA thread + MXAccess COM). Pipe ACL
  denies-Admins detail + non-elevated shell requirement captured from feedback memory.
  Divergence from CLAUDE.md: task spec said "TopShelf is still the service-installer
  wrapper per CLAUDE.md note" but no csproj in the repo references TopShelf — decision
  #30 replaced it with the generic host's AddWindowsService wrapper (per the doc
  comment on OpcUaServerService). Reflected the actual state + flagged this divergence
  here so someone can update CLAUDE.md separately.

- docs/StatusDashboard.md — replaced the full v1 reference (dashboard endpoints,
  health check rules, StatusData DTO, etc.) with a short "superseded by Admin UI"
  pointer that preserves git-blame continuity + avoids broken links from other docs
  that reference it.
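The additive-only NodeAcl/PermissionTrie model from the security.md entry above can be sketched as (Python, illustrative; grants union as the evaluator walks the scope path and nothing revokes at a deeper level):

```python
# Sketch of additive-only trie evaluation over the
# ClusterId -> Namespace -> UnsArea -> UnsLine -> Equipment -> Tag hierarchy:
# a grant attached anywhere on the path cascades to everything beneath it.
class PermissionTrie:
    def __init__(self):
        self.root = {"grants": {}, "children": {}}

    def add_grant(self, path, group, perms):
        node = self.root
        for segment in path:
            node = node["children"].setdefault(
                segment, {"grants": {}, "children": {}})
        node["grants"][group] = node["grants"].get(group, set()) | set(perms)

    def evaluate(self, path, group):
        # Walk the full scope path, unioning every grant seen en route;
        # additive-only means permissions accumulate and are never removed.
        allowed, node = set(), self.root
        allowed |= node["grants"].get(group, set())
        for segment in path:
            node = node["children"].get(segment)
            if node is None:
                break
            allowed |= node["grants"].get(group, set())
        return allowed
```

An Equipment-level grant therefore reaches that Equipment's tags but not a sibling Equipment under the same line, which is the per-Equipment isolation property.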

Class references verified by reading:
  src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/{RedundancyCoordinator, ServiceLevelCalculator,
      ApplyLeaseRegistry, RedundancyStatePublisher}.cs
  src/ZB.MOM.WW.OtOpcUa.Core/Authorization/{PermissionTrie, PermissionTrieBuilder,
      PermissionTrieCache, TriePermissionEvaluator, AuthorizationGate}.cs
  src/ZB.MOM.WW.OtOpcUa.Server/Security/{AuthorizationGate, LdapUserAuthenticator}.cs
  src/ZB.MOM.WW.OtOpcUa.Admin/{Program.cs, Services/AdminRoles.cs,
      Services/RedundancyMetrics.cs, Hubs/FleetStatusPoller.cs}
  src/ZB.MOM.WW.OtOpcUa.Server/Program.cs + appsettings.json
  src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/{Program.cs, Ipc/PipeServer.cs}
  src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/{ClusterNode, NodeAcl,
      LdapGroupRoleMapping}.cs
  src/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:34:25 -04:00
Joseph Doherty
71339307fa Doc refresh (task #203) — driver docs split + drivers index + IHistoryProvider-aware HistoricalDataAccess
Restructure the driver-facing docs to match the OtOpcUa v2 multi-driver
reality (Galaxy, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, OPC UA Client
— 8 drivers total; Galaxy ships as three projects) and the capability-interface
architecture where every driver opts into IDriver + whichever of IReadable /
IWritable / ITagDiscovery / ISubscribable / IHostConnectivityProbe /
IPerCallHostResolver / IAlarmSource / IHistoryProvider / IRediscoverable it
supports. Doc scope follows the code: docs specific to one driver are scoped to
that driver, cross-driver concerns live once at the top level, and per-driver specs
cross-link to docs/v2/driver-specs.md rather than duplicate.
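The opt-in capability model can be sketched with Python Protocols standing in for the C# interfaces (illustrative names; the real dispatch goes through CapabilityInvoker):

```python
# Sketch of capability-interface dispatch: the server probes which optional
# interfaces a driver implements and refuses the call cleanly when a
# capability is absent, instead of assuming every driver supports everything.
from typing import Protocol, runtime_checkable

@runtime_checkable
class Readable(Protocol):
    def read(self, address: str) -> object: ...

@runtime_checkable
class HistoryProvider(Protocol):
    def read_raw(self, address: str, start, end) -> list: ...

class ModbusLikeDriver:
    """Implements only the read capability."""
    def read(self, address: str) -> object:
        return 42  # wire-level read elided

def dispatch_history(driver, address, start, end):
    # Drivers that don't opt into history return an unsupported status,
    # mirroring the BadHistoryOperationUnsupported behavior described below
    # for the six protocol drivers.
    if not isinstance(driver, HistoryProvider):
        return "BadHistoryOperationUnsupported"
    return driver.read_raw(address, start, end)
```

The isinstance probe is the Python analogue of the server checking which capability interfaces a driver class declares.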

What changed per file:

- docs/MxAccessBridge.md -> docs/drivers/Galaxy.md (git mv + rewrite): retitled
  "Galaxy Driver", reframed as one of eight drivers. Added Project Split table
  (Shared .NET Standard 2.0 / Host .NET 4.8 x86 / Proxy .NET 10) and Why
  Out-of-Process section citing both the MXAccess bitness constraint and Tier C
  stability isolation per docs/v2/plan.md section 4. Added IPC Transport
  section covering pipe naming, MessagePack framing, DACL that denies Admins,
  shared-secret handshake, heartbeat, and CallAsync<TReq,TResp> dispatch.
  Moved file paths from src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/* to
  src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/MxAccess/* and added the
  Shared + Proxy key-file tables. Added CapabilityInvoker + OTOPCUA0001
  analyzer callout. Cross-linked to drivers/README.md, Galaxy-Repository.md,
  HistoricalDataAccess.md.

- docs/GalaxyRepository.md -> docs/drivers/Galaxy-Repository.md (git mv +
  rewrite): retitled "Galaxy Repository — Tag Discovery for the Galaxy
  Driver", opened with a comparison table showing how every driver's
  ITagDiscovery source is different (AB CIP @tags walker, TwinCAT
  SymbolLoaderFactory, FOCAS CNC queries, OPC UA Client Session.Browse, etc).
  Repositioned GalaxyRepositoryService as the Galaxy driver's
  ITagDiscovery.DiscoverAsync implementation. Updated paths to
  Driver.Galaxy.Host/Backend/GalaxyRepository/*. Added IRediscoverable section
  covering the on-change-redeploy IPC path.

- docs/drivers/README.md (new): index with ground-truth driver table —
  project path, stability tier, wire library, capability-interface list, and
  one notable quirk per driver. Verified against the driver csproj files and
  class declarations on focas-pr3-remaining-capabilities (the most recent
  branch containing every driver). Galaxy gets its own dedicated docs; the
  other seven drivers cross-link to docs/v2/driver-specs.md. Lists the full
  Core.Abstractions capability surface, DriverTypeRegistry, CapabilityInvoker,
  and OTOPCUA0001 analyzer.

- docs/HistoricalDataAccess.md (rewrite): reframed around IHistoryProvider as
  a per-driver optional capability interface. Replaced v1 HistorianPluginLoader
  / AvevaHistorianPluginEntry plugin architecture with the v2 story —
  Historian.Aveva was merged into Driver.Galaxy.Host/Backend/Historian/ and
  IPC-forwarded through GalaxyProxyDriver. Documented all four IHistoryProvider
  methods (ReadRawAsync / ReadProcessedAsync / ReadAtTimeAsync /
  ReadEventsAsync), CapabilityInvoker wrapping with DriverCapability.HistoryRead,
  and the per-driver coverage matrix (Galaxy + OPC UA Client implement; the
  six protocol drivers don't and return BadHistoryOperationUnsupported). Kept
  the cluster-failover + health-counter + quality-mapping detail for the
  Galaxy Historian implementation. Flagged one gap: Proxy forwards all four
  history message kinds but the Host-side HistoryAggregateType -> AnalogSummary
  column mapping may surface GalaxyIpcException{Code="not-implemented"} on a
  given branch until the Phase 2 Galaxy out-of-process gate lands.

Driver list built against ground truth (src on focas-pr3-remaining-capabilities):
  Driver.Galaxy.{Shared,Host,Proxy}, Driver.Modbus, Driver.S7, Driver.AbCip,
  Driver.AbLegacy, Driver.TwinCAT, Driver.FOCAS, Driver.OpcUaClient.
Capability interface lists verified against each *Driver.cs class declaration.
Aveva Historian ported to Driver.Galaxy.Host/Backend/Historian/; no separate
Historian.Aveva assembly on v2 branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:33:53 -04:00
Joseph Doherty
985b7aba26 Doc refresh (task #202) — core architecture docs for multi-driver OtOpcUa
Rewrite seven core-architecture docs to match the shipped multi-driver platform.
The v1 single-driver LmxNodeManager framing is replaced with the Core +
capability-interface model — Galaxy is now one driver of eight, and each doc
points at the current class names + source paths.

What changed per file:
- OpcUaServer.md — OtOpcUaServer as StandardServer host; per-driver
  DriverNodeManager + CapabilityInvoker wiring; Config-DB-driven configuration
  (sp_PublishGeneration, DraftRevisionToken, Admin UI); Phase 6.2
  AuthorizationGate integration.
- AddressSpace.md — GenericDriverNodeManager.BuildAddressSpaceAsync walks
  ITagDiscovery.DiscoverAsync and streams DriverAttributeInfo through
  IAddressSpaceBuilder; CapturingBuilder registers alarm-condition sinks;
  per-driver NodeId schemes replace the fixed ns=1;s=ZB root.
- ReadWriteOperations.md — OnReadValue / OnWriteValue dispatch to
  IReadable.ReadAsync / IWritable.WriteAsync through CapabilityInvoker,
  honoring WriteIdempotentAttribute (#143); two-layer authorization
  (WriteAuthzPolicy + Phase 6.2 AuthorizationGate).
- Subscriptions.md — ISubscribable.SubscribeAsync/UnsubscribeAsync is the
  capability surface; STA-thread story is now Galaxy-specific (StaPump inside
  Driver.Galaxy.Host), other drivers are free-threaded.
- AlarmTracking.md — IAlarmSource is optional; AlarmSurfaceInvoker wraps
  Subscribe/Ack/Unsubscribe with fan-out by IPerCallHostResolver and the
  no-retry AlarmAcknowledge pipeline (#143); CapturingBuilder registers sinks
  at build time.
- DataTypeMapping.md — DriverDataType + SecurityClassification are the
  driver-agnostic enums; per-driver mappers (GalaxyProxyDriver inline,
  AbCipDataType, ModbusDriver, etc.); SecurityClassification is metadata only,
  ACL enforcement is at the server layer.
- IncrementalSync.md — IRediscoverable covers backend-change signals;
  sp_ComputeGenerationDiff + DiffViewer drive generation-level change
  detection; IDriver.ReinitializeAsync is the in-process recovery path.
2026-04-20 01:33:28 -04:00
Joseph Doherty
48970af416 Doc refresh (task #205) — requirements updated for multi-driver OtOpcUa three-process deploy
Per-file summary:

- docs/reqs/OpcUaServerReqs.md — rewritten driver-agnostic. OPC-001..OPC-013 re-scoped to multi-driver address-space composition + capability dispatch; OPC-014 AuthorizationGate + permission trie; OPC-015 dynamic ServiceLevel via RedundancyCoordinator; OPC-017 surgical generation-apply rebuild; OPC-012 capability dispatch via CapabilityInvoker (decision #143 idempotence-aware retry); OPC-013 per-host Polly isolation (decision #144); OPC-019 OpenTelemetry metrics. Transport-security profile matrix (OPC-010) + UserName/LDAP (OPC-011) preserved.

- docs/reqs/GalaxyRepositoryReqs.md — scope clarified as Galaxy-driver-only (not platform). GR-001..GR-004 tied to ITagDiscovery.DiscoverAsync + IRediscoverable; all SQL runs inside OtOpcUa.Galaxy.Host and streams to Proxy via named pipe. GR-008 capability wrapping via CapabilityInvoker added. Cross-links to docs/v2/driver-specs.md + docs/GalaxyRepository.md.

- docs/reqs/MxAccessClientReqs.md — scope clarified as Galaxy-Host-only. MXA-001..MXA-009 preserved (STA pump, register/unregister, subscription refcount, auto-reconnect, probe, COM cleanup, operation metrics, error translation). MXA-010 Proxy-side capability wrapping + MXA-011 pipe ACL + per-process shared secret (OTOPCUA_ALLOWED_SID / OTOPCUA_GALAXY_SECRET) added.

- docs/reqs/ServiceHostReqs.md — rewritten for three-process deployment. Shared section (SVC-SHARED-001/002) for Serilog + bootstrap-only appsettings. SRV-* for OtOpcUa.Server (net10 x64, Microsoft.Extensions.Hosting + AddWindowsService, in-process driver hosting, redundancy-node bootstrap). ADM-* for OtOpcUa.Admin (Blazor Server, cookie+LDAP auth, CanEdit/CanPublish policies, sole DB writer, Prometheus /metrics, audit logging). GHX-* for OtOpcUa.Galaxy.Host (TopShelf, net48 x86, named-pipe IPC bootstrap, STA backend lifecycle, crash handling tied to supervisor).

- docs/reqs/ClientRequirements.md — restructured as numbered, verifiable requirements. SHR-* for Client.Shared (single IOpcUaClientService, ConnectionSettings, failover, cross-platform certs, type-coercing write, UI-thread neutrality). CLI-001..CLI-011 cover connect/read/write/browse/subscribe/historyread/alarms/redundancy. UI-001..UI-008 cover connection panel, tree browser, each tab, connection-state reflection, cross-platform build. Reference design content (IOpcUaClientService shape, models, view-model map, mock layout) preserved.

- docs/reqs/StatusDashboardReqs.md — retired cleanly. Replaced with a pointer to docs/v2/admin-ui.md + HLR-015 / HLR-016 / HLR-017 / ADM-*. Mapping table shows each retired DASH-001..DASH-009 requirement's replacement (live cluster-node view via SignalR, Prometheus metrics, driver-instance detail views, etc.). Note that a formal AdminUiReqs.md can be written later if needed for cert compliance.

HighLevelReqs.md was already at the target shape (HLR-001..HLR-018 with Revision header noting retired HLR-009) as of commit f217636; verified identical and no additional edit required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:31:58 -04:00
Joseph Doherty
f217636467 Doc refresh (task #206) — Client.CLI + Client.UI brand flip + new top-level docs/README.md index. Client.CLI.md: replaced stale LmxOpcUa OPC UA server references with OtOpcUa throughout the overview + sample output + applicationUri examples (opc.tcp://localhost:4840/OtOpcUa, urn:localhost:OtOpcUa:instanceN); confirmed against src/ZB.MOM.WW.OtOpcUa.Server/Program.cs:69-71 which sets the live endpoint url + application uri to those exact values. Added a driver-agnostic note in the overview — the CLI is reachable against every shipped driver surface because the OPC UA endpoint abstracts them all. Kept the lmxopcua-cli executable name + the {LocalAppData}/LmxOpcUaClient/pki/ PKI folder name AS-IS because those are real filesystem-level residuals the code still uses (Program.cs SetExecutableName + OpcUaClientService.cs:428) — flipping them requires migration shims so existing dev boxes don't lose their trusted-cert store; added explicit doc text explaining the residual + why it persists so future readers aren't confused. Fixed the sample connect-output "Server: LmxOpcUa" to "Server: OtOpcUa Server" matching the live ApplicationName in OpcUaServerOptions.cs:39. Client.UI.md: replaced the 4 LmxOpcUa references — overview one-liner, status-bar mock (now reads "OtOpcUa Server" matching the server's reported ApplicationName), endpoint-url example, settings persistence path. Same residual-explanation note added under the LmxOpcUaClient/settings.json path pointing at the Client.Shared session-factory literal at OpcUaClientService.cs:428. docs/README.md is new — a top-level index distinguishing the two documentation tiers (current reference at docs/*.md vs implementation history + design notes at docs/v2/*.md). Every current-reference doc gets a one-line role description in a section table (Architecture + data-path / Drivers / Operational / Client tooling / Requirements) so a new reader picking up the repo finds their way in without having to grep file names.
Cross-link calls out that load-bearing references from top-level docs (plan.md decisions, admin-ui.md, acl-design.md, config-db-schema.md, driver-specs.md, dev-environment.md, test-data-sources.md) live under v2/. Notes up front that the project was renamed LmxOpcUa → OtOpcUa and that any remaining LmxOpcUa string in paths is a deliberate residual with a migration follow-up, so readers don't chase phantom bugs. Four parallel doc-refresh agents are currently working on the rest of docs/*.md (task #202 core architecture, #203 driver docs split, #204 operational, #205 requirements) — those commits will land on separate worktree branches + get folded in together once complete; this index already lists the docs they'll produce (drivers/README.md, drivers/Galaxy.md, drivers/Galaxy-Repository.md) so the final merge just has the content showing up where the index already points.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:25:18 -04:00
5f26fff4f1 Merge pull request (#148) - Roslyn analyzer OTOPCUA0001 2026-04-20 00:54:39 -04:00
Joseph Doherty
5c0d3154c1 Roslyn analyzer — detect unwrapped driver-capability calls (OTOPCUA0001). Closes task #200. New netstandard2.0 analyzer project src/ZB.MOM.WW.OtOpcUa.Analyzers registered as an <Analyzer>-item ProjectReference from the Server csproj so the warning fires at every Server compile. First (and only so far) rule OTOPCUA0001 — "Driver capability call must be wrapped in CapabilityInvoker" — walks every InvocationOperation in the AST + trips when (a) the target method implements one of the seven guarded capability interfaces (IReadable / IWritable / ITagDiscovery / ISubscribable / IHostConnectivityProbe / IAlarmSource / IHistoryProvider) AND (b) the method's return type is Task, Task<T>, ValueTask, or ValueTask<T> — the async-wire-call constraint narrows the rule to the surfaces the Phase 6.1 pipeline actually wraps + sidesteps pure in-memory accessors like IHostConnectivityProbe.GetHostStatuses() which would trigger false positives AND (c) the call does NOT sit inside a lambda argument passed to CapabilityInvoker.ExecuteAsync / ExecuteWriteAsync / AlarmSurfaceInvoker.*. The wrapper detection walks up the syntax tree from the call site, finds any enclosing InvocationExpressionSyntax whose method's containing type is one of the wrapper classes, + verifies the call lives transitively inside that invocation's AnonymousFunctionExpressionSyntax argument — a sibling "result = await driver.ReadAsync(...)" followed by a separate invoker.ExecuteAsync(...) call does NOT satisfy the wrapping rule + the analyzer flags it (regression guard in the 5th test). Five xunit-v3 + Shouldly tests at tests/ZB.MOM.WW.OtOpcUa.Analyzers.Tests: direct ReadAsync in server namespace trips; wrapped ReadAsync inside CapabilityInvoker.ExecuteAsync lambda passes; direct WriteAsync trips; direct DiscoverAsync trips; sneaky pattern — read outside the lambda + ExecuteAsync with unrelated lambda nearby — still trips. 
Hand-rolled test harness compiles a stub-plus-user snippet via CSharpCompilation.WithAnalyzers + runs GetAnalyzerDiagnosticsAsync directly, deliberately avoiding Microsoft.CodeAnalysis.CSharp.Analyzer.Testing.XUnit because that package pins to xunit v2 + this repo is on xunit.v3 everywhere else. RS2008 release-tracking noise suppressed by adding AnalyzerReleases.Shipped.md + AnalyzerReleases.Unshipped.md as AdditionalFiles, which is the canonical Roslyn-analyzer hygiene path. Analyzer DLL referenced from Server.csproj via ProjectReference with OutputItemType=Analyzer + ReferenceOutputAssembly=false — the DLL ships as a compiler plugin, not a runtime dependency. Server build validates clean: the analyzer activates on every Server file but finds zero violations, which confirms the Phase 6.1 wrapping work done in prior PRs is complete + the analyzer is now the regression guard preventing the next new capability surface from being added raw. slnx updated with both the src + tests project entries. Full solution build clean, analyzer suite 5/5 passing.
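Illustrative sketch (not the shipped Roslyn C# — a Python model of the three-part OTOPCUA0001 predicate described above; the interface/wrapper names come from this commit, the `CallSite` shape is hypothetical):

```python
from dataclasses import dataclass

# The seven guarded capability interfaces named by the rule.
GUARDED_INTERFACES = {
    "IReadable", "IWritable", "ITagDiscovery", "ISubscribable",
    "IHostConnectivityProbe", "IAlarmSource", "IHistoryProvider",
}
TASK_LIKE = {"Task", "Task<T>", "ValueTask", "ValueTask<T>"}

def is_wrapper(method: str) -> bool:
    return method in {"CapabilityInvoker.ExecuteAsync",
                      "CapabilityInvoker.ExecuteWriteAsync"} \
        or method.startswith("AlarmSurfaceInvoker.")

@dataclass
class CallSite:
    declaring_interface: str       # interface the invoked method implements
    return_type: str               # simplified return-type name
    # wrapper invocations whose lambda argument transitively encloses this call
    enclosing_wrapper_lambdas: frozenset = frozenset()

def should_flag(call: CallSite) -> bool:
    """OTOPCUA0001 fires only when all three conditions hold."""
    if call.declaring_interface not in GUARDED_INTERFACES:
        return False
    if call.return_type not in TASK_LIKE:
        return False   # skips in-memory accessors like GetHostStatuses()
    # A sibling invoker call does not count; only a lambda enclosing the call.
    return not any(is_wrapper(w) for w in call.enclosing_wrapper_lambdas)
```

The "sneaky pattern" regression guard in the fifth test corresponds to the last line: an ExecuteAsync invocation that merely sits nearby contributes nothing to `enclosing_wrapper_lambdas`, so the raw call still flags.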
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:52:40 -04:00
74067e7d7e Merge pull request (#147) - OTel Prometheus exporter 2026-04-20 00:43:15 -04:00
Joseph Doherty
ef53553e9d OTel Prometheus exporter wiring — RedundancyMetrics meter now scraped at /metrics. Closes task #201. Picked Prometheus over OTLP per the earlier recommendation (pull-based means no OTel Collector deployment required for the common K8s/containers case; the endpoint ASP.NET-hosts inside the Admin app already, so one less moving part). Adds two NuGet refs to the Admin csproj: OpenTelemetry.Extensions.Hosting 1.15.2 (stable) + OpenTelemetry.Exporter.Prometheus.AspNetCore 1.15.2-beta.1 (the exporter has historically been beta-only; rest of the OTel ecosystem treats it as production-acceptable + it's what the upstream OTel docs themselves recommend for AspNetCore hosts). Program.cs gains a Metrics:Prometheus:Enabled toggle (defaults true; setting to false disables both the MeterProvider registration + the scrape endpoint entirely for locked-down deployments). When enabled, AddOpenTelemetry().WithMetrics() registers a MeterProvider that subscribes to the "ZB.MOM.WW.OtOpcUa.Redundancy" meter (the exact MeterName constant on RedundancyMetrics) + wires AddPrometheusExporter. MapPrometheusScrapingEndpoint() appends a /metrics handler producing the Prometheus text-format output; deliberately NOT authenticated because scrape jobs typically run on a trusted network + operators who need auth wrap the endpoint behind a reverse-proxy basic-auth gate per fleet-ops convention. appsettings.json declares the toggle with Enabled: true so the default deploy gets metrics automatically — turning off is the explicit action. Future meters (resilience tracker + host status + auth probe) just AddMeter("Name") alongside the existing call to start flowing through the same endpoint without more infrastructure. Admin project builds 0 errors; Admin.Tests 92/92 passing (unchanged — the OTel pipeline runs at request time, not test time). 
Still-pending work that was NOT part of #201's scope: an equivalent setup for the Server project (different MeterNames — the Polly pipeline builder's tracker + host-status publisher) + a metrics cheat-sheet in docs/observability.md documenting each meter's tag set + expected alerting thresholds. Those are natural follow-ups when fleet-ops starts building dashboards.
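For reference, the toggle described above would take roughly this shape in appsettings.json (nesting inferred from the Metrics:Prometheus:Enabled key path; surrounding sections omitted):

```json
{
  "Metrics": {
    "Prometheus": {
      "Enabled": true
    }
  }
}
```

Setting `Enabled` to `false` is the single switch that removes both the MeterProvider registration and the /metrics scrape endpoint.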
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:41:16 -04:00
d1e50db304 Merge pull request (#146) - DiffViewer ACL section 2026-04-20 00:39:11 -04:00
Joseph Doherty
df0d7c2d84 DiffViewer ACL section — extend sp_ComputeGenerationDiff with NodeAcl rows. Closes the final slice of task #196 (draft-diff ACL section). The DiffViewer already rendered a placeholder "NodeAcl" card from the task #156 refactor; it stayed empty because the stored proc didn't emit NodeAcl rows. This PR lights the card up by adding a fifth UNION to the proc. Logical id for NodeAcl is the composite LdapGroup + ScopeKind + ScopeId triple — format "cn=group|Cluster|scope-id" or "cn=group|Cluster|(cluster)" when ScopeId is null (Cluster-wide rows). That shape means a permission-only change (same group + same scope, PermissionFlags shifted) appears as a single Modified row with the full triple as its identifier, whereas a scope move (same group, new ScopeId) correctly surfaces as Added + Removed of two different logical ids. CHECKSUM signature covers ClusterId + PermissionFlags + Notes so both operator-visible changes (permission bitmask) and audit-tier changes (notes) round-trip through the diff. New migration 20260420000001_ExtendComputeGenerationDiffWithNodeAcl.cs ships both Up (install V2 proc) + Down (restore the exact V1 proc text shipped in 20260417215224_StoredProcedures so the migration is reversible). Row-id column widens from nvarchar(64) to nvarchar(128) in V2 since the composite key (group DN + scope + scope-id) exceeds 64 chars comfortably — narrow column would silently truncate in prod. Designer .cs cloned from the prior migration since the EF model is unchanged; DiffViewer.razor section description updated to drop the "(proc-extension pending)" note it carried since task #156 — the card will now populate live. Admin + Core full-solution build clean. No unit-test changes needed — the existing StoredProceduresTests cover the proc-exec path + would immediately catch any SQL syntax regression on next SQL Server integration run. 
Task #196 fully closed now — Probe-this-permission (slice 1, PR 144), SignalR invalidation (slice 2, PR 145), draft-diff ACL section (this PR).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:37:05 -04:00
16f4b4acad Merge pull request (#145) - ACL + role-grant SignalR invalidation 2026-04-20 00:34:24 -04:00
Joseph Doherty
ac63c2cfb2 ACL + role-grant SignalR invalidation — #196 slice 2. Adds the live-push layer so an operator editing permissions in one Admin session sees the change in peer sessions without a manual reload. Covers both axes of task #196's invalidation requirement: cluster-scoped NodeAcl mutations push NodeAclChanged to that cluster's subscribers; fleet-wide LdapGroupRoleMapping CRUD pushes RoleGrantsChanged to every Admin session on the fleet group. New AclChangeNotifier service wraps IHubContext<FleetStatusHub> with two methods: NotifyNodeAclChangedAsync(clusterId, generationId) + NotifyRoleGrantsChangedAsync(). Both are fire-and-forget — a failed hub send logs a warning + returns; the authoritative DB write already committed, so worst-case peers see stale data until their next poll (AclsTab has no polling today; on-parameter-set reload + this signal covers the practical refresh cases). Catching OperationCanceledException separately so request-teardown doesn't log a false-positive hub-failure. NodeAclService constructor gains an optional AclChangeNotifier param (defaults to null so the existing unit tests that pass only a DbContext keep compiling). GrantAsync + RevokeAsync both emit NodeAclChanged after the SaveChanges completes — the Revoke path uses the loaded row's ClusterId + GenerationId for accurate routing since the caller passes only the surrogate rowId. RoleGrants.razor consumes the notifier after every Create + Delete + opens a fleet-scoped HubConnection on first render that reloads the grant list on RoleGrantsChanged. AclsTab.razor opens a cluster-scoped connection on first render and reloads only when the incoming NodeAclChanged message matches both the current ClusterId + GenerationId (so a peer editing a different draft doesn't trigger spurious reloads). Both pages IAsyncDisposable the connection on navigation away. AclChangeNotifier is DI-registered alongside PermissionProbeService. 
Two new message records in AclChangeNotifier.cs: NodeAclChangedMessage(ClusterId, GenerationId, ObservedAtUtc) + RoleGrantsChangedMessage(ObservedAtUtc). Admin.Tests 92/92 passing (unchanged — the notifier is fire-and-forget + tested at hub level in existing FleetStatusPoller suite). Admin builds 0 errors. One slice of #196 remains: the draft-diff ACL section (extend sp_ComputeGenerationDiff to emit NodeAcl rows + wire the DiffViewer NodeAcl card from the empty placeholder it currently shows). Next PR.
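The fire-and-forget contract above can be sketched language-neutrally (Python asyncio stands in for the IHubContext send; names hypothetical — the real notifier catches OperationCanceledException separately for the same reason the sketch re-raises cancellation):

```python
import asyncio
import logging

log = logging.getLogger("AclChangeNotifier")

async def notify(send, message) -> None:
    """Fire-and-forget hub push: the authoritative DB write already
    committed, so a failed send is only a logged warning — peer sessions
    fall back to their next reload. Cancellation during request teardown
    is propagated, not logged as a false-positive hub failure."""
    try:
        await send(message)
    except asyncio.CancelledError:
        raise  # request teardown, not a hub failure
    except Exception:
        log.warning("hub send failed for %r; peers may see stale data", message)
```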
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:32:28 -04:00
d93dc73978 Merge pull request (#144) - AclsTab Probe-this-permission 2026-04-20 00:30:15 -04:00
Joseph Doherty
ecc2389ca8 AclsTab Probe-this-permission — first of three #196 slices. New /clusters/{ClusterId}/draft/{GenerationId} ACLs-tab gains a probe card above the grant table so operators can ask the trie "if cn=X asks for permission Y on node Z, would it be granted, and which rows contributed?" without shell-ing into the DB. Service thinly wraps the same PermissionTrieBuilder + PermissionTrie.CollectMatches call path the Server's dispatch layer uses at request time, so a probe answer is by construction identical to what the live server would decide. New PermissionProbeService.ProbeAsync(generationId, ldapGroup, NodeScope, requiredFlags) — loads the target generation's NodeAcl rows filtered to the cluster (critical: without the cluster filter, cross-cluster grants leak into the probe which tested false-positive in the unit suite), builds a trie, CollectMatches against the supplied scope + [ldapGroup], ORs the matched-grant flags into Effective, compares to Required. Returns PermissionProbeResult(Granted, Required, Effective, Matches) — Matches carries LdapGroup + Scope + PermissionFlags per matched row so the UI can render the contribution chain. Zero side effects + no audit rows — a failing probe is a question, not a denial. AclsTab.razor gains the probe card at the top (before the New-grant form + grant table): six inputs for ldap group + every NodeScope level (NamespaceId → UnsAreaId → UnsLineId → EquipmentId → TagId — blank fields become null so the trie walks only as deep as the operator specified), a NodePermissions dropdown filtered to skip None, Probe button, green Granted / red Denied badge + Required/Effective bitmask display, and (when matches exist) a small table showing which LdapGroup matched at which level with which flags. Admin csproj adds ProjectReference to Core — the trie + NodeScope live there + were previously Server-only. 
Five new PermissionProbeServiceTests covering: cluster-level row grants a namespace-level read; no-group-match denies with empty Effective; matching group but insufficient flags (Browse+Read vs WriteOperate required) denies with correct Effective bitmask; cross-cluster grants stay isolated (c2's WriteOperate does NOT leak into c1's probe); generation isolation (gen1's Read-only does NOT let gen2's WriteOperate-requiring probe pass). Admin.Tests 92/92 passing (was 87, +5). Admin builds 0 errors. Remaining #196 slices — SignalR invalidation + draft-diff ACL section — ship in follow-up PRs so the review surface per PR stays tight.
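The OR-then-compare step at the heart of ProbeAsync can be modeled as below (Python sketch; the trie's scope walk is elided and the flag values are illustrative, not the real NodePermissions bitmask):

```python
from dataclasses import dataclass

BROWSE, READ, WRITE_OPERATE = 0x1, 0x2, 0x4  # illustrative flag values

@dataclass(frozen=True)
class AclRow:
    ldap_group: str
    flags: int  # permission bitmask

def probe(rows, ldap_group: str, required: int):
    """OR the flags of every matched grant into Effective, then require the
    whole Required bitmask. Matches are returned so a UI can render the
    contribution chain; a failing probe has no side effects."""
    matches = [r for r in rows if r.ldap_group == ldap_group]
    effective = 0
    for r in matches:
        effective |= r.flags
    granted = (effective & required) == required
    return granted, effective, matches

rows = [AclRow("cn=ops", BROWSE | READ)]
```

This mirrors the "matching group but insufficient flags" test: Browse+Read rows yield Effective = 0x3, which denies a WriteOperate-requiring probe while still reporting the correct Effective bitmask.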
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:28:17 -04:00
852c710013 Merge pull request (#143) - Pin ab_server to libplctag v2.6.16 2026-04-20 00:06:29 -04:00
Joseph Doherty
8ce5791f49 Pin libplctag ab_server to v2.6.16 — real release tag + SHA256 hashes for all three Windows arches. Closes the "pick a current version + pin" deferral left by the #180 PR docs stub. Verified the release lands ab_server.exe inside libplctag_2.6.16_windows_<arch>_tools.zip alongside plctag.dll + list_tags_* helpers by downloading each tools zip + unzip -l'ing to confirm ab_server.exe is present at 331264 bytes. New ci/ab-server.lock.json is the single source of truth — one file the CI YAML reads via ConvertFrom-Json instead of duplicating the hash across the workflow + the docs. Structure: repo (libplctag/libplctag) + tag (v2.6.16) + published date (2026-03-29) + assets keyed by platform (windows-x64 / windows-x86 / windows-arm64) each carrying filename + sha256. docs/v2/test-data-sources.md §2.CI updated — replaces the prior placeholder (ver = '<pinned libplctag release tag>', expected = '<pinned sha256>') with the real v2.6.16 + 9b78a3de... hashes pinned table, and replaces the hardcoded URL with a lockfile-driven pwsh step that picks windows-x64 by default but swaps to x86/arm64 by changing one line for non-x64 CI runners. Hash-mismatch path throws with both the expected + actual values so on the first drift the CI log tells the maintainer exactly what to update in the lockfile. Two verification notes from the release fetch: (1) libplctag v2.6.16 tools zips ship ab_server.exe + plctag.dll together — tests don't need a separate libplctag NuGet download for the integration path, the extracted tools dir covers both the simulator + the driver's native dependency; (2) the three Windows arches all carry ab_server.exe, so ARM64 Windows GitHub runners (when they arrive) can run the integration suite without changes beyond swapping the asset key. No code changes in this PR — purely docs + the new lockfile. Admin tests + Core tests unchanged + passing per the prior commit.
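A sketch of the fail-closed verification step, assuming the lockfile structure described above (Python stand-in for the pwsh CI step; function name hypothetical):

```python
import hashlib
import json

def verify_asset(lockfile_path: str, platform: str, downloaded_path: str) -> None:
    """Read the expected SHA256 for one platform key from the lockfile and
    fail closed on drift, reporting both hashes so the CI log tells the
    maintainer exactly what to update."""
    with open(lockfile_path) as f:
        lock = json.load(f)
    expected = lock["assets"][platform]["sha256"].lower()
    h = hashlib.sha256()
    with open(downloaded_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    actual = h.hexdigest()
    if actual != expected:
        raise RuntimeError(
            f"SHA256 mismatch for {platform}: expected {expected}, got {actual}")
```

Swapping to the x86 or arm64 runner is just a different `platform` key, matching the one-line change the docs describe.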
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:04:35 -04:00
05ddea307b Merge pull request (#142) - ab_server per-family profiles 2026-04-19 23:59:20 -04:00
Joseph Doherty
32dff7f1d6 ab_server integration fixture — per-family profiles + documented CI-fetch contract. Closes task #180 (AB CIP follow-up — ab_server CI fixture). Replaces the prior hardcoded single-family fixture with a parametric AbServerProfile abstraction covering ControlLogix / CompactLogix / Micro800 / GuardLogix. Prebuilt-Windows-binary fetch is documented as a CI YAML step rather than fabricated C#-side, because SHA-pinned binary distribution is a CI workflow concern (libplctag owns releases, we pin a version + verify hash) not a test-framework concern. New AbServerProfile record + KnownProfiles static class at tests/.../AbServerProfile.cs. Four profiles: ControlLogix (widest coverage — DINT/REAL/BOOL/SINT/STRING atomic + DINT[16] array so the driver's @tags Symbol-Object decoder + array-bound path both get end-to-end coverage), CompactLogix (atomic subset — driver-side ConnectionSize quirk from PR 10 still applies since ab_server doesn't enforce the narrower limit), Micro800 (ab_server has no dedicated --plc micro800 mode — falls back to controllogix while driver-side path enforces empty routing + unconnected-only per PR 11; real Micro800 coverage requires a 2080 lab rig), GuardLogix (ab_server has no safety subsystem — profile emulates the _S-suffixed naming contract the driver's safety-ViewOnly classification reads in PR 12; real safety-lock behavior requires a 1756-L8xS physical rig). Each profile composes --plc + --tag args via BuildCliArgs(port) — pure string formatter so the composition logic is unit-testable without launching the simulator. AbServerFixture gains a ctor overload taking AbServerProfile + port (defaults back to ControlLogix on parameterless ctor so existing test suites keep compiling). Fixture's InitializeAsync hands the profile's CLI args to ProcessStartInfo.Arguments. New AbServerTheoryAttribute mirrors AbServerFactAttribute but extends TheoryAttribute so a single test can MemberData over KnownProfiles.All + cover all four families. 
AbCipReadSmokeTests converted from single-fact to theory parametrized over KnownProfiles.All — one row per family reads TestDINT + asserts Good status + Healthy driver state. Fixture lifecycle is explicit try/finally rather than await using because IAsyncLifetime.DisposeAsync returns ValueTask + xUnit's concrete IAsyncDisposable shim depends on xunit version; explicit beats implicit here. Eight new unit tests in AbServerProfileTests.cs (runs without the simulator so CI green even when the binary is absent): BuildCliArgs composes port + plc + tag flags in the documented order; empty seed-tag list still emits port + plc; SeedTag.ToCliSpec handles both 2-segment scalar + 3-segment array; KnownProfiles.ForFamily returns expected --plc arg for every family (verifies Micro800 + GuardLogix both fall back to controllogix); KnownProfiles.All covers every AbCipPlcFamily enum value (regression guard — adding a new family without a profile fails this test); ControlLogix seeds every atomic type the driver supports; GuardLogix seeds at least one _S-suffixed safety tag. Integration tests still skip cleanly when ab_server isn't on PATH. 11/11 unit tests passing in this project (8 new + 3 prior). Full Admin solution builds 0 errors. docs/v2/test-data-sources.md gets a new "CI fixture" subsection under §2.Gotchas with the exact GitHub Actions YAML step — fetch the pinned libplctag release, SHA256-verify against a pinned hash recorded in the repo's CI lockfile (drift = fail closed), extract, append to PATH. The C# harness stays PATH-driven so dev-box installs (cmake + make from source) work identically to CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:57:24 -04:00
42649ca7b0 Merge pull request (#141) - Redundancy OTel + SignalR 2026-04-19 23:18:04 -04:00
Joseph Doherty
1f3343e61f OpenTelemetry redundancy metrics + RoleChanged SignalR push. Closes instrumentation + live-push slices of task #198; the exporter wiring (OTLP vs Prometheus package decision) is split to new task #201 because the collector/scrape-endpoint choice is a fleet-ops decision that deserves its own PR rather than hardcoded here. New RedundancyMetrics class (Singleton-registered in DI) owning a System.Diagnostics.Metrics.Meter("ZB.MOM.WW.OtOpcUa.Redundancy", "1.0.0"). Three ObservableGauge instruments — otopcua.redundancy.primary_count / secondary_count / stale_count — all tagged by cluster.id, populated by SetClusterCounts(clusterId, primary, secondary, stale) which the poller calls at the tail of every tick; ObservableGauge callbacks snapshot the last value set under a lock so the reader (OTel collector, dotnet-counters) sees consistent tuples. One Counter — otopcua.redundancy.role_transition — tagged cluster.id, node.id, from_role, to_role; ideal for tracking "how often does Cluster-X failover" + "which node transitions most" aggregate queries. In-box Metrics API means zero NuGet dep here — the exporter PR adds OpenTelemetry.Extensions.Hosting + OpenTelemetry.Exporter.OpenTelemetryProtocol or OpenTelemetry.Exporter.Prometheus.AspNetCore to actually ship the data somewhere. FleetStatusPoller extended with role-change detection. Its PollOnceAsync now pulls ClusterNode rows alongside the existing ClusterNodeGenerationState scan, and a new PollRolesAsync walks every node comparing RedundancyRole to the _lastRole cache. On change: records the transition to RedundancyMetrics + emits a RoleChanged SignalR message to both FleetStatusHub.GroupName(cluster) + FleetStatusHub.FleetGroup so cluster-scoped + fleet-wide subscribers both see it. First observation per node is a bootstrap (cache fill) + NOT a transition — avoids spurious churn on service startup or pod restart. 
UpdateClusterGauges groups nodes by cluster + sets the three gauge values, using ClusterNodeService.StaleThreshold (shared 30s convention) for staleness so the /hosts page + the gauge agree. RoleChangedMessage record lives alongside NodeStateChangedMessage in FleetStatusPoller.cs. RedundancyTab.razor subscribes to the fleet-status hub on first parameters-set, filters RoleChanged events to the current cluster, reloads the node list + paints a blue info banner ("Role changed on node-a: Primary → Secondary at HH:mm:ss UTC") so operators see the transition without needing to poll-refresh the page. IAsyncDisposable closes the connection on tab swap-away. Two new RedundancyMetricsTests covering RecordRoleTransition tag emission (cluster.id + node.id + from_role + to_role all flow through the MeterListener callback) + ObservableGauge snapshot for two clusters (assert primary_count=1 for c1, stale_count=1 for c2). Existing FleetStatusPollerTests ctor-line updated to pass a RedundancyMetrics instance; all tests still pass. Full Admin.Tests suite 87/87 passing (was 85, +2). Admin project builds 0 errors. Task #201 captures the exporter-wiring follow-up — OpenTelemetry.Extensions.Hosting + OTLP vs Prometheus + /metrics endpoint decision, driven by fleet-ops infra direction.
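The two mechanisms above — consistent gauge snapshots under a lock, and bootstrap-vs-transition role detection — can be modeled as a sketch (Python; class and function names hypothetical):

```python
import threading

class ClusterGauges:
    """Last-value-wins gauge cache: the poller sets (primary, secondary,
    stale) per cluster at the tail of each tick; readers take the whole
    tuple under the same lock, so they never observe a torn update."""
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}  # cluster_id -> (primary, secondary, stale)

    def set_cluster_counts(self, cluster_id, primary, secondary, stale):
        with self._lock:
            self._counts[cluster_id] = (primary, secondary, stale)

    def snapshot(self):
        with self._lock:
            return dict(self._counts)

def detect_transitions(last_roles: dict, observed: dict) -> list:
    """First observation per node is a cache fill, NOT a transition —
    avoids spurious churn on service startup or pod restart."""
    transitions = []
    for node, role in observed.items():
        prev = last_roles.get(node)
        if prev is not None and prev != role:
            transitions.append((node, prev, role))
        last_roles[node] = role
    return transitions
```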
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:16:09 -04:00
251f567b98 Merge pull request (#140) - AlarmSurfaceInvoker 2026-04-19 23:09:35 -04:00
Joseph Doherty
404bfbe7e4 AlarmSurfaceInvoker — wraps IAlarmSource.Subscribe/Unsubscribe/Acknowledge through CapabilityInvoker with multi-host fan-out. Closes alarm-surface slice of task #161 (Phase 6.1 Stream A); the Roslyn invoker-coverage analyzer is split into new task #200 because a DiagnosticAnalyzer project is genuinely its own scaffolding PR (Microsoft.CodeAnalysis.CSharp.Workspaces dep, netstandard2.0 target, Microsoft.CodeAnalysis.Testing harness, ProjectReference OutputItemType=Analyzer wiring, and four corner-case rules I want tests for before shipping). Ship this PR as the runtime guardrail + callable API; the analyzer lands next as the compile-time guardrail. New AlarmSurfaceInvoker class in Core.Resilience. Three methods mirror IAlarmSource's three mutating surfaces: SubscribeAsync (fan-out: group sourceNodeIds by IPerCallHostResolver.ResolveHost, one CapabilityInvoker.ExecuteAsync per host with DriverCapability.AlarmSubscribe so AlarmSubscribe's retry policy kicks in + returns one IAlarmSubscriptionHandle per host); UnsubscribeAsync (single-host, defaultHost); AcknowledgeAsync (fan-out: group AlarmAcknowledgeRequests by resolver-mapped host, run each host's batch through DriverCapability.AlarmAcknowledge which does NOT retry per decision #143 — alarm-ack is a write-shaped op that's not idempotent at the plant-floor level). Drivers without IPerCallHostResolver (Galaxy single MXAccess endpoint, OpcUaClient against one remote, etc.) fall back to defaultHost = DriverInstanceId so breaker + bulkhead keying still happens; drivers with it get one-dead-PLC-doesn't-poison-siblings isolation per decision #144. Single-host single-subscribe returns [handle] with length 1; empty sourceNodeIds fast-paths to [] without a driver call. 
Five new AlarmSurfaceInvokerTests covering: (a) empty list short-circuits — driver method never called; (b) single-host sub routes via default host — one driver call with full id list; (c) multi-host sub fans out to 2 distinct hosts for 3 src ids mapping to 2 plcs — one driver call per host; (d) Acknowledge does not retry on failure — call count stays at 1 even with exception; (e) Subscribe retries transient failures — call count reaches 3 with a 2-failures-then-success fake. Core.Tests resilience-builder suite 19/19 passing (was 14, +5); Core.Tests whole suite still green. Core project builds 0 errors. Task #200 captures the compile-time guardrail: Roslyn DiagnosticAnalyzer at src/ZB.MOM.WW.OtOpcUa.Analyzers that flags direct invocations of the eleven capability-interface methods inside the Server namespace when the call is NOT inside a CapabilityInvoker.ExecuteAsync/ExecuteWriteAsync/AlarmSurfaceInvoker.*Async lambda. That analyzer is the reason we keep paying the wrapping-class overhead for every new capability.
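The fan-out grouping that both SubscribeAsync and AcknowledgeAsync share can be sketched as one small function (Python model of the resolver/default-host split; names hypothetical):

```python
from collections import defaultdict

def fan_out_by_host(source_node_ids, resolve_host, default_host):
    """Group items by their resolved host — the caller then issues one
    wrapped driver call per host group. Empty input fast-paths to no
    driver call; drivers without a per-call resolver (or ids the resolver
    can't place) collapse onto the default host so breaker + bulkhead
    keying still happens."""
    if not source_node_ids:
        return {}  # fast path: no driver call at all
    groups = defaultdict(list)
    for node_id in source_node_ids:
        host = (resolve_host(node_id) if resolve_host else None) or default_host
        groups[host].append(node_id)
    return dict(groups)
```

One dead PLC then only poisons its own group's breaker, which is the decision #144 isolation property the real invoker preserves.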
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:07:37 -04:00
006af636a0 Merge pull request (#139) - ExternalIdReservation merge in FinaliseBatch 2026-04-19 23:04:25 -04:00
Joseph Doherty
c0751fdda5 ExternalIdReservation merge inside FinaliseBatchAsync. Closes task #197. The FinaliseBatch docstring called this out as a narrower follow-up pending a concurrent-insert test matrix, and the CSV import UI PR (#163) noted that operators would see raw DbUpdate UNIQUE-constraint messages on ZTag/SAPID collision until this landed. Now every finalised-batch row reserves ZTag + SAPID in the same EF transaction as the Equipment inserts, so either both commit atomically or neither does. New MergeReservation helper handles the four outcomes per (Kind, Value) pair: (1) value empty/whitespace → skip the reservation entirely (operator left the optional identifier blank); (2) active reservation exists for same EquipmentUuid → bump LastPublishedAt + reuse (re-finalising a batch against the same equipment must be idempotent, e.g. a retry after a transient DB blip); (3) active reservation exists for a DIFFERENT EquipmentUuid → throw ExternalIdReservationConflictException with the conflicting UUID + originating cluster + first-published timestamp so operator sees exactly who owns the value + where to resolve it (release via sp_ReleaseExternalIdReservation or pick a new ZTag); (4) no active reservation → create a fresh row with FirstPublishedBy = batch.CreatedBy + FirstPublishedAt = transaction time. Pre-commit overlap scan uses one round-trip (WHERE Kind+Value IN the batch's distinct sets, filtered to ReleasedAt IS NULL so explicitly-released values can be re-issued per decision #124) + caches the results in a Dictionary keyed on (Kind, value.ToLowerInvariant()) for O(1) lookup during the row loop. 
Race-safety catch: if another finalise commits between our cache-load + our SaveChanges, SQL Server surfaces a 2601/2627 unique-index violation against UX_ExternalIdReservation_KindValue_Active — IsReservationUniquenessViolation walks the inner-exception chain for that specific signature + rethrows as ExternalIdReservationConflictException so the UI shows a clean message instead of a raw DbUpdateException. The index-name match means unrelated filtered-unique violations (future indices) don't get mis-classified. Test-fixture Row() helper updated to generate unique SAPID per row (sap-{ZTag}) — the prior shared SAPID="sap" worked only because reservations didn't exist; two rows sharing a SAPID under different EquipmentUuids now collide as intended by decision #124's fleet-wide uniqueness rule. Four new tests: (a) finalise creates both ZTag + SAPID reservations with expected Kind + Value; (b) re-finalising same EquipmentUuid's ZTag from a different batch does not create a duplicate (LastPublishedAt refresh only); (c) different EquipmentUuid claiming the same ZTag throws ExternalIdReservationConflictException with the ZTag value in the message + Equipment row for the second batch is NOT inserted (transaction rolled back cleanly); (d) row with empty ZTag + empty SAPID skips reservation entirely. Full Admin.Tests suite 85/85 passing (was 81 before this PR, +4). Admin project builds 0 errors. Note: the InMemory EF provider doesn't enforce filtered-unique indices, so the IsReservationUniquenessViolation catch is exercised only in the SQL Server integration path — the in-memory tests cover the cache-level conflict detection in MergeReservation instead, which is the first line of defence + catches the same-batch + published-vs-staged cases. The DbUpdate catch protects only the last-second race where two concurrent transactions both passed the cache check.
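The inner-exception walk can be sketched as follows; in the real catch path the 2601/2627 check inspects `SqlException.Number`, while here a plain `Exception` message stands in so the index-name scoping (the part that prevents mis-classifying unrelated unique violations) is the focus:

```csharp
using System;

static class ReservationGuard
{
    const string IndexName = "UX_ExternalIdReservation_KindValue_Active";

    public static bool IsReservationUniquenessViolation(Exception ex)
    {
        // Walk the chain: EF wraps the provider error in DbUpdateException layers.
        for (var e = ex; e != null; e = e.InnerException)
        {
            // Sketch only: real code checks SqlException.Number for 2601/2627.
            // The index-name match is the signature that scopes classification
            // to this table's filtered unique index.
            if (e.Message.Contains(IndexName) &&
                (e.Message.Contains("2601") || e.Message.Contains("2627") ||
                 e.Message.Contains("duplicate key")))
                return true;
        }
        return false;
    }
}
```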
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:02:31 -04:00
80e080ecec Merge pull request (#138) - UnsTab drag/drop + 409 conflict modal 2026-04-19 22:32:45 -04:00
Joseph Doherty
5ee510dc1a UnsTab native HTML5 drag/drop + 409 concurrent-edit modal + optimistic-concurrency commit path. Closes UI slice of task #153 (Phase 6.4 Stream A UI follow-up). Playwright E2E smoke is split into new task #199 — Playwright install + WebApplicationFactory + seeded-DB harness is genuinely its own infra-setup PR. Native HTML5 attributes (draggable, @ondragstart, @ondragover, @ondragleave, @ondrop) chosen deliberately over MudBlazor per the task title — no MudBlazor ever joins this project. Two new service methods on UnsService land the data layer the existing UnsImpactAnalyzer assumed but which didn't actually exist: (1) LoadSnapshotAsync(generationId) — walks UnsAreas + UnsLines + per-line equipment counts + builds a UnsTreeSnapshot including a 16-char SHA-256 revision token computed deterministically over the sorted (kind, id, parent, name, notes) tuple-set so it's stable across processes + changes whenever any row is added / modified / deleted; (2) MoveLineAsync(generationId, expectedToken, lineId, targetAreaId) — re-parents one line inside the same draft under an EF transaction, recomputes the current revision token from freshly-loaded rows, and throws DraftRevisionConflictException when the caller-supplied token no longer matches. Token mismatch means another operator mutated the draft between preview + commit, so the move rolls back rather than clobbering their work. No-op same-area drop is a silent return. Cross-generation move is prevented by the generationId filter on the transaction reads. UnsTab.razor gains draggable="true" on every line row with @ondragstart capturing the LineId into _dragLineId, and every area row is a drop target (@ondragover with :preventDefault so the browser accepts drops, @ondrop kicking off OnLineDroppedAsync).
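The revision-token construction described for LoadSnapshotAsync can be sketched directly; the exact separator and field order here are assumptions, and the properties that matter are determinism across processes, order-independence of the input rows, and sensitivity to any row change:

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class RevisionToken
{
    public static string Compute((string Kind, int Id, int? Parent, string Name, string Notes)[] rows)
    {
        // Canonicalize: one line per row, sorted ordinally so the token is
        // independent of the order the DB returned the rows in.
        var canonical = string.Join("\n",
            rows.Select(r => $"{r.Kind}|{r.Id}|{r.Parent}|{r.Name}|{r.Notes}")
                .OrderBy(s => s, StringComparer.Ordinal));

        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(hash)[..16];   // 16-char stable token
    }
}
```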
Drop path loads a fresh snapshot, builds a UnsMoveOperation(Kind=LineMove, source/target cluster matching because cross-cluster is decision-#82 rejected), runs UnsImpactAnalyzer.Analyze + shows a Bootstrap modal rendered inline in the component — modal shows HumanReadableSummary + equipment/tag counts + any CascadeWarnings list. Confirm button calls MoveLineAsync with the snapshot's RevisionToken; DraftRevisionConflictException surfaces a separate red-header "Draft changed — refresh required" modal with a Reload button that re-fetches areas + lines from the DB. New DraftRevisionConflictException in UnsService.cs, co-located with the service that throws it. Five new UnsServiceMoveTests covering LoadSnapshotAsync (areas + lines + equipment counts), RevisionToken stability between two reads, RevisionToken changes on AddLineAsync, MoveLineAsync happy path reparents the line in the DB, MoveLineAsync with stale token throws DraftRevisionConflictException + leaves the DB unchanged. Admin suite 81/81 passing (was 76, +5). Admin project builds 0 errors. Task #199 captures the deferred Playwright E2E smoke — drag a line onto a different area in a real browser, assert preview modal contents, click Confirm, assert the line row shows the new area. That PR stands up a new tests/ZB.MOM.WW.OtOpcUa.Admin.E2ETests project with Playwright + WebApplicationFactory + seeded InMemory DbContext.
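The stale-token commit path reduces to: recompute the token from fresh rows, compare, and refuse the move on mismatch. A sketch with an in-memory store standing in for the EF transaction (`DraftStore` and its toy token are illustrative; the exception name matches the one the UI catches):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DraftRevisionConflictException : Exception
{
    public DraftRevisionConflictException() : base("Draft changed — refresh required") { }
}

class Line { public int Id; public int AreaId; }

class DraftStore
{
    public List<Line> Lines = new();

    // Toy stand-in for the SHA-256 revision token: any structural change changes it.
    public string CurrentToken() =>
        string.Join(";", Lines.OrderBy(l => l.Id).Select(l => $"{l.Id}:{l.AreaId}"));

    public void MoveLine(string expectedToken, int lineId, int targetAreaId)
    {
        if (CurrentToken() != expectedToken)
            throw new DraftRevisionConflictException(); // another operator mutated the draft

        var line = Lines.Single(l => l.Id == lineId);
        if (line.AreaId == targetAreaId) return;        // no-op same-area drop: silent return
        line.AreaId = targetAreaId;
    }
}
```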
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:30:48 -04:00
543665dedd Merge pull request (#137) - DiffViewer refactor 2026-04-19 22:25:24 -04:00
Joseph Doherty
c8a38bc57b DiffViewer refactor — 6-section plugin pattern + 1000-row cap. Closes task #156 (Phase 6.4 Stream C). Replaces the flat single-table rendering that mixed Namespace/DriverInstance/Equipment/Tag rows into one untyped list with a per-section-card layout that makes draft review actually scannable on non-trivial diffs. New DiffSection.razor reusable component encapsulates the per-section rendering — card header shows Title + Description + a three-badge summary (+added / −removed / ~modified plus a "no changes" grey badge when the section is empty) so operators can glance at a six-card page and see what areas of the draft actually shifted before drilling into any one table. Hard row-cap at DefaultRowCap=1000 per section lives inside the component so a pathological draft (e.g. 20k tags churned by a block rebuild) can't freeze the browser on render — excess rows are dropped from the render, with a yellow warning banner that surfaces "Showing the first 1000 of N rows" + a pointer to run sp_ComputeGenerationDiff directly for the full set. Body max-height: 400px + overflow-y: auto gives each section its own scroll region so one big section doesn't push the others off screen. DiffViewer.razor refactored to a static Sections table driving a single foreach that instantiates one DiffSection per known TableName. Sections listed in author-order (Namespace → DriverInstance → Equipment → Tag → UnsLine → NodeAcl) — six entries matching the task acceptance criterion. The first four correspond to what sp_ComputeGenerationDiff currently emits; the last two (UnsLine + NodeAcl) render as empty "no changes" cards today + will light up when the proc is extended (tracked in task #196 for NodeAcl; UnsLine proc extension is a natural follow-up since UnsImpactAnalyzer already tracks UNS moves). RowsFor(tableName) replaces the prior flat table — each section filters the overall DiffRow list by its TableName so the proc output format stays stable.
Header-bar summary at the top of the page now reads "N rows across M of 6 sections" so operators see overall change weight at a glance before scanning. Two Razor-specific fixes landed along the way: loop variable renamed from section to sec because @section collides with the Razor section directive + trips RZ2005; helper method renamed from Group to RowsFor because the Razor generator gets confused by a parameter-flowing method whose name clashes with LINQ's Group extension (the source-gen output referenced TypeCheck<T> with no argument). Admin project builds 0 errors; Admin.Tests suite 76/76 (unchanged — the refactor is structural + no service-layer logic changed, so the existing DraftValidator + EquipmentService + AdminServicesIntegrationTests cover the consuming paths). No bUnit in this project so the cap behavior isn't unit-tested at the component level; DiffSection.OnParametersSet is small + deterministic (int counts + Take(RowCap)) + reviewed before ship.
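The cap logic in DiffSection.OnParametersSet is small enough to sketch directly; `DiffSectionModel` is an illustrative stand-in for the component's state, with badge counts computed over the full row set and rendering limited to the first 1000:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DiffSectionModel
{
    public const int DefaultRowCap = 1000;

    public IReadOnlyList<string> VisibleRows { get; }
    public int TotalRows { get; }
    public bool Truncated => TotalRows > VisibleRows.Count;

    public DiffSectionModel(IEnumerable<string> rows, int rowCap = DefaultRowCap)
    {
        var all = rows.ToList();
        TotalRows = all.Count;                    // counts/badges use the full set
        VisibleRows = all.Take(rowCap).ToList();  // only the first N rows render
    }

    // Text for the yellow overflow banner; null when the section fits the cap.
    public string? Banner =>
        Truncated ? $"Showing the first {VisibleRows.Count} of {TotalRows} rows" : null;
}
```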
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:23:22 -04:00
cecb84fa5d Merge pull request (#136) - Admin RedundancyTab 2026-04-19 22:16:20 -04:00
Joseph Doherty
13d5a7968b Admin RedundancyTab — per-cluster read-only topology view. Closes the UI slice of task #149 (Phase 6.3 Stream E — Admin UI RedundancyTab + OpenTelemetry metrics + SignalR); the OpenTelemetry metrics + RoleChanged SignalR push are split into new follow-up task #198 because each is a structural add that deserves its own test matrix + NuGet-dep decision rather than riding this UI PR. New /clusters/{ClusterId} Redundancy tab slotted between ACLs and Audit in the existing ClusterDetail tab bar. Shows each ClusterNode row in the cluster with columns Node / Role (Primary green, Secondary blue, Standalone primary-blue badge) / Host / OPC UA port / ServiceLevel base / ApplicationUri (text-break so the long urn: doesn't blow out the table) / Enabled badge / Last seen (relative age via the same FormatAge helper as Hosts.razor, with a yellow "Stale" chip once LastSeenAt crosses the 30s threshold shared with HostStatusService.StaleThreshold — a missed heartbeat plus clock-skew buffer). Four summary cards above the table — total Nodes, Primary count, Secondary count, Stale count. Two guard-rail alerts: (a) red "No Primary or Standalone" when the cluster has no authoritative write target (all rows are Secondaries — read-only until one is promoted by the server-side RedundancyCoordinator apply-lease flow); (b) red "Split-brain" when >1 Primary exists — apply-lease enforcement at the coordinator level should have made this impossible, so the alert implies a hand-edited DB row + an investigation. New ClusterNodeService with ListByClusterAsync (ordered by ServiceLevelBase descending so Primary rows with higher base float to the top) + a static IsStale predicate matching HostStatusService's 30s convention. DI-registered alongside the existing scoped services in Program.cs. 
Writes (role swap, enable/disable) are deliberately absent from the service — they go through the RedundancyCoordinator apply-lease flow on the server side + direct DB mutation from Admin would race with it. New ClusterNodeServiceTests covering IsStale across null/recent/old LastSeenAt + ListByClusterAsync ordering + cluster filter. 4/4 new tests passing; full Admin.Tests suite 76/76 (was 72 before this PR, +4). Admin project builds 0 errors. Task #198 captures the deferred work: (1) OpenTelemetry Meter for primary/secondary/stale counts + role_transition counter with from/to/node tags + OTLP exporter config; (2) RoleChanged SignalR push — extend FleetStatusPoller to detect RedundancyRole changes on ClusterNode rows + emit a RoleChanged hub message so the RedundancyTab refreshes instantly instead of on-page-load polling.
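The 30-second staleness convention shared with HostStatusService can be sketched as a pure predicate; the clock is passed in so tests cover the null/recent/old cases without waiting (`NodeStatus` is an illustrative name):

```csharp
using System;

static class NodeStatus
{
    // Missed heartbeat plus clock-skew buffer, per the shared 30s convention.
    public static readonly TimeSpan StaleThreshold = TimeSpan.FromSeconds(30);

    // A node that has never reported, or reported too long ago, is stale.
    public static bool IsStale(DateTime? lastSeenUtc, DateTime nowUtc) =>
        lastSeenUtc is null || nowUtc - lastSeenUtc.Value > StaleThreshold;
}
```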
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:14:25 -04:00
d1686ed82d Merge pull request (#135) - Equipment CSV import UI 2026-04-19 22:02:36 -04:00
Joseph Doherty
ac69a1c39d Equipment CSV import UI — Stream B.3/B.5 operator page + EquipmentTab "Import CSV" button. Closes the UI slice of task #163 (Phase 6.4 Stream B.3/B.5); the ExternalIdReservation merge follow-up inside FinaliseBatchAsync is split into new task #197 so it gets a proper concurrent-insert test matrix rather than riding this UI PR. New /clusters/{ClusterId}/draft/{GenerationId}/import-equipment page driving the full staged-import flow end-to-end. Operator selects a driver instance + UNS line (both scoped to the draft generation via DriverInstanceService.ListAsync + UnsService.ListLinesAsync dropdowns), pastes or uploads a CSV (InputFile with 5 MiB cap so pathological files can't OOM the server), clicks Parse — EquipmentCsvImporter.Parse runs + shows two side-by-side cards (accepted rows in green with ZTag/Machine/Name/Line columns, rejected rows in red with line-number + reason). Click Stage + Finalise and the page calls CreateBatchAsync → StageRowsAsync → FinaliseBatchAsync in sequence using the authenticated user's identity as CreatedBy; on success, 600ms banner then NavigateTo back to the draft editor so operator sees the newly-imported rows in EquipmentTab without a manual refresh. Parse errors (missing version marker, bad header, malformed CSV) surface InvalidCsvFormatException.Message inline alongside the Parse button — no page reload needed to retry. Finalise errors surface the service-layer exception message (ImportBatchNotFoundException / ImportBatchAlreadyFinalisedException / any DbUpdate* exception from the atomic transaction) so operator sees exactly why the finalise rejected before the tx rolled back. EquipmentTab gains an "Import CSV…" button next to "Add equipment" that NavigateTo's the new page; it needs a ClusterId parameter to build the URL so the @code block adds [Parameter] string ClusterId, and DraftEditor now passes ClusterId="@ClusterId" alongside the existing GenerationId. 
EquipmentImportBatchService was already implemented in Phase 6.4 Stream B.4 but missing from the Admin DI container — this PR adds AddScoped so the @inject resolves. The FinaliseBatch docstring explicitly defers ExternalIdReservation merge as a narrower follow-up with a concurrent-insert test matrix — task #197 captures that work. For now the finalise may surface a DB-level UNIQUE-constraint violation if a ZTag conflict exists at commit time; the UI shows the raw message + the batch + staged rows are still in the DB for re-use once the conflict is resolved. Admin project builds 0 errors; Admin.Tests 72/72 passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:00:40 -04:00
30714831fa Merge pull request (#134) - Admin RoleGrants page 2026-04-19 21:48:14 -04:00
Joseph Doherty
44d4448b37 Admin RoleGrants page — LDAP-group → Admin-role mapping CRUD. Closes the RoleGrantsTab slice of task #144 (Phase 6.2 Stream D follow-up); the remaining three sub-items (Probe-this-permission on AclsTab, SignalR invalidation on role/ACL changes, draft-diff ACL section) are split into new follow-up task #196 so each can ship independently. The permission-trie evaluator + ILdapGroupRoleMappingService already exist from Phase 6.2 Streams A + B — this PR adds the consuming UI + the DI registration that was missing. New /role-grants page at Components/Pages/RoleGrants.razor registered in MainLayout's sidebar next to Certificates. Lists every LdapGroupRoleMapping row with columns LDAP group / Role / Scope (Fleet-wide or Cluster:X) / Created / Notes / Revoke. Add-grant form takes LDAP group DN + AdminRole dropdown (ConfigViewer, ConfigEditor, FleetAdmin) + Fleet-wide checkbox + Cluster dropdown (disabled when Fleet-wide checked) + optional Notes. Service-layer invariants — IsSystemWide=true + ClusterId=null, or IsSystemWide=false + ClusterId populated — enforced in ValidateInvariants; UI catches InvalidLdapGroupRoleMappingException and displays the message in a red alert. ILdapGroupRoleMappingService was present in the Configuration project from Stream A but never registered in the Admin DI container — this PR adds the AddScoped registration so the injection can resolve. Control-plane/data-plane separation note rendered in an info banner at the top of the page per decision #150 (these grants do NOT govern OPC UA data-path authorization; NodeAcl rows are read directly by the permission-trie evaluator without consulting role mappings). Admin project builds 0 errors; Admin.Tests 72/72 passing. 
Task #196 created to track: (1) AclsTab Probe-this-permission form that takes (ldap group, node path, permission flag) and runs it through the permission trie, showing which row granted it + the actual resolved grant; (2) SignalR invalidation — push a RoleGrantsChanged event when rows are created/deleted so connected Admin sessions reload without polling, ditto NodeAclChanged on ACL writes; (3) DiffViewer ACL section — show NodeAcl + LdapGroupRoleMapping deltas between draft + published alongside equipment/uns diffs.
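The scope invariant is an exclusive-or over (IsSystemWide, ClusterId). A sketch with the entity reduced to two parameters; the exception type matches the one the UI catches, the rest is illustrative:

```csharp
using System;

class InvalidLdapGroupRoleMappingException : Exception
{
    public InvalidLdapGroupRoleMappingException(string message) : base(message) { }
}

static class GrantRules
{
    // Exactly one of the two scopes: fleet-wide (no cluster) XOR cluster-scoped.
    public static void ValidateInvariants(bool isSystemWide, string? clusterId)
    {
        if (isSystemWide && clusterId != null)
            throw new InvalidLdapGroupRoleMappingException(
                "Fleet-wide grants must not carry a ClusterId.");
        if (!isSystemWide && string.IsNullOrWhiteSpace(clusterId))
            throw new InvalidLdapGroupRoleMappingException(
                "Cluster-scoped grants require a ClusterId.");
    }
}
```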
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:46:21 -04:00
572f8887e4 Merge pull request (#133) - IdentificationFields editor + edit mode 2026-04-19 21:43:18 -04:00
247 changed files with 21049 additions and 3499 deletions


@@ -87,13 +87,14 @@ The server supports non-transparent warm/hot redundancy via the `Redundancy` sec
## LDAP Authentication
-The server uses LDAP-based user authentication via the `Authentication.Ldap` section in `appsettings.json`. When enabled, credentials are validated by LDAP bind against a GLAuth server (installed at `C:\publish\glauth\`), and LDAP group membership maps to OPC UA permissions: `ReadOnly` (browse/read), `WriteOperate` (write FreeAccess/Operate attributes), `WriteTune` (write Tune attributes), `WriteConfigure` (write Configure attributes), `AlarmAck` (alarm acknowledgment). `LdapAuthenticationProvider` implements both `IUserAuthenticationProvider` and `IRoleProvider`. See `docs/Security.md` for the full guide and `C:\publish\glauth\auth.md` for LDAP user/group reference.
+The server uses LDAP-based user authentication via the `Authentication.Ldap` section in `appsettings.json`. When enabled, credentials are validated by LDAP bind against a GLAuth server (installed at `C:\publish\glauth\`), and LDAP group membership maps to OPC UA permissions: `ReadOnly` (browse/read), `WriteOperate` (write FreeAccess/Operate attributes), `WriteTune` (write Tune attributes), `WriteConfigure` (write Configure attributes), `AlarmAck` (alarm acknowledgment). `LdapUserAuthenticator` (`src/ZB.MOM.WW.OtOpcUa.Server/Security/LdapUserAuthenticator.cs`) implements `IUserAuthenticator`. See `docs/Security.md` for the full guide and `C:\publish\glauth\auth.md` for LDAP user/group reference.
## Library Preferences
- **Logging**: Serilog with rolling daily file sink
- **Unit tests**: xUnit + Shouldly for assertions
-- **Service hosting**: TopShelf (Windows service install/uninstall/run as console)
+- **Service hosting (Server, Admin)**: .NET generic host with `AddWindowsService` (decision #30 — replaced TopShelf in v2; see `src/ZB.MOM.WW.OtOpcUa.Server/OpcUaServerService.cs`)
+- **Service hosting (Galaxy.Host)**: plain console app wrapped by NSSM (`.NET Framework 4.8 x86` — required by MXAccess COM bitness)
- **OPC UA**: OPC Foundation UA .NET Standard stack (https://github.com/opcfoundation/ua-.netstandard) — NuGet: `OPCFoundation.NetStandard.Opc.Ua.Server`
## OPC UA .NET Standard Documentation


@@ -3,6 +3,7 @@
<Project Path="src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ZB.MOM.WW.OtOpcUa.Core.Abstractions.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Configuration/ZB.MOM.WW.OtOpcUa.Configuration.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Core/ZB.MOM.WW.OtOpcUa.Core.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Core.Scripting/ZB.MOM.WW.OtOpcUa.Core.Scripting.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Server/ZB.MOM.WW.OtOpcUa.Server.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Admin/ZB.MOM.WW.OtOpcUa.Admin.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared.csproj"/>
@@ -14,15 +15,19 @@
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Shared/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Shared.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Host/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Host.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Client.Shared/ZB.MOM.WW.OtOpcUa.Client.Shared.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Client.CLI/ZB.MOM.WW.OtOpcUa.Client.CLI.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Client.UI/ZB.MOM.WW.OtOpcUa.Client.UI.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Analyzers/ZB.MOM.WW.OtOpcUa.Analyzers.csproj"/>
</Folder>
<Folder Name="/tests/">
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Core.Abstractions.Tests/ZB.MOM.WW.OtOpcUa.Core.Abstractions.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Configuration.Tests/ZB.MOM.WW.OtOpcUa.Configuration.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Core.Tests/ZB.MOM.WW.OtOpcUa.Core.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Server.Tests/ZB.MOM.WW.OtOpcUa.Server.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Admin.Tests/ZB.MOM.WW.OtOpcUa.Admin.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared.Tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared.Tests.csproj"/>
@@ -33,14 +38,21 @@
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.S7.IntegrationTests/ZB.MOM.WW.OtOpcUa.Driver.S7.IntegrationTests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.IntegrationTests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.IntegrationTests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Shared.Tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Shared.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Host.Tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Host.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Client.Shared.Tests/ZB.MOM.WW.OtOpcUa.Client.Shared.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Client.UI.Tests/ZB.MOM.WW.OtOpcUa.Client.UI.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Analyzers.Tests/ZB.MOM.WW.OtOpcUa.Analyzers.Tests.csproj"/>
</Folder>
</Solution>

ci/ab-server.lock.json Normal file

@@ -0,0 +1,20 @@
{
"_comment": "Pinned libplctag release used by tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbServerFixture. ab_server.exe ships inside the *_tools.zip asset on every GitHub release. See docs/v2/test-data-sources.md §2.CI for the GitHub Actions step that consumes this file.",
"repo": "libplctag/libplctag",
"tag": "v2.6.16",
"published": "2026-03-29",
"assets": {
"windows-x64": {
"file": "libplctag_2.6.16_windows_x64_tools.zip",
"sha256": "9b78a3dee73d9cd28ca348c090f453dbe3ad9d07ad6bf42865a9dc3a79bc2232"
},
"windows-x86": {
"file": "libplctag_2.6.16_windows_x86_tools.zip",
"sha256": "fdfefd58b266c5da9a1ded1a430985e609289c9e67be2544da7513b668761edf"
},
"windows-arm64": {
"file": "libplctag_2.6.16_windows_arm64_tools.zip",
"sha256": "d747728e4c4958bb63b4ac23e1c820c4452e4778dfd7d58f8a0aecd5402d4944"
}
}
}


@@ -1,82 +1,72 @@
# Address Space
The address space maps the Galaxy object hierarchy and attribute definitions into an OPC UA browse tree. `LmxNodeManager` builds the tree from data queried by `GalaxyRepositoryService`, while `AddressSpaceBuilder` provides a testable in-memory model of the same structure.
Each driver's browsable subtree is built by streaming nodes from the driver's `ITagDiscovery.DiscoverAsync` implementation into an `IAddressSpaceBuilder`. `GenericDriverNodeManager` (`src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs`) owns the shared orchestration; `DriverNodeManager` (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`) implements `IAddressSpaceBuilder` against the OPC Foundation stack's `CustomNodeManager2`. The same code path serves Galaxy object hierarchies, Modbus PLC registers, AB CIP tags, TwinCAT symbols, FOCAS CNC parameters, and OPC UA Client aggregations — Galaxy is one driver of seven, not the driver.
## Root ZB Folder
## Driver root folder
Every address space starts with a single root folder node named `ZB` (NodeId `ns=1;s=ZB`). This folder is added under the standard OPC UA `Objects` folder via an `Organizes` reference. The reverse reference is registered through `MasterNodeManager.AddReferences` because `BuildAddressSpace` runs after `CreateAddressSpace` has already consumed the external references dictionary.
Every driver's subtree starts with a root `FolderState` under the standard OPC UA `Objects` folder, wired with an `Organizes` reference. `DriverNodeManager.CreateAddressSpace` creates this folder with `NodeId = ns;s={DriverInstanceId}`, `BrowseName = {DriverInstanceId}`, and `EventNotifier = SubscribeToEvents | HistoryRead` so alarm and history-event subscriptions can target the root. The namespace URI is `urn:OtOpcUa:{DriverInstanceId}`.
The root folder has `EventNotifier = SubscribeToEvents` enabled so alarm events propagate up to clients subscribed at the root level.
## IAddressSpaceBuilder surface
## Area Folders vs Object Nodes
`IAddressSpaceBuilder` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAddressSpaceBuilder.cs`) offers three calls:
Galaxy objects fall into two categories based on `template_definition.category_id`:
- `Folder(browseName, displayName)` — creates a child `FolderState` and returns a child builder scoped to it.
- `Variable(browseName, displayName, DriverAttributeInfo attributeInfo)` — creates a `BaseDataVariableState` and returns an `IVariableHandle` the driver keeps for alarm wiring.
- `AddProperty(browseName, DriverDataType, value)` — attaches a `PropertyState` for static metadata (e.g. equipment identification fields).
- **Areas** (`category_id = 13`) become `FolderState` nodes with `FolderType` type definition and `Organizes` references. They represent logical groupings in the Galaxy hierarchy (e.g., production lines, cells).
- **Non-area objects** (AppEngine, Platform, UserDefined, etc.) become `BaseObjectState` nodes with `BaseObjectType` type definition and `HasComponent` references. These represent runtime automation objects that carry attributes.
Drivers drive ordering. Typical pattern: root → folder per equipment → variables per tag. `GenericDriverNodeManager` calls `DiscoverAsync` once on startup and once per rediscovery cycle.
Both node types use `contained_name` as the browse name. When `contained_name` is null or empty, `tag_name` is used as a fallback.
## DriverAttributeInfo → OPC UA variable
## Variable Nodes for Attributes
Each variable carries a `DriverAttributeInfo` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverAttributeInfo.cs`):
Each Galaxy attribute becomes a `BaseDataVariableState` node under its parent object. The variable is configured with:
| Field | OPC UA target |
|---|---|
| `FullName` | `NodeId.Identifier` — used as the driver-side lookup key for Read/Write/Subscribe |
| `DriverDataType` | mapped to a built-in `DataTypeIds.*` NodeId via `DriverNodeManager.MapDataType` |
| `IsArray` | `ValueRank = OneDimension` when true, `Scalar` otherwise |
| `ArrayDim` | declared array length, carried through as metadata |
| `SecurityClass` | stored in `_securityByFullRef` for `WriteAuthzPolicy` gating on write |
| `IsHistorized` | flips `AccessLevel.HistoryRead` + `Historizing = true` |
| `IsAlarm` | drives the `MarkAsAlarmCondition` pass (see below) |
| `WriteIdempotent` | stored in `_writeIdempotentByFullRef`; fed to `CapabilityInvoker.ExecuteWriteAsync` |
- **DataType** -- Mapped from `mx_data_type` via `MxDataTypeMapper` (see [DataTypeMapping.md](DataTypeMapping.md))
- **ValueRank** -- `OneDimension` (1) for arrays, `Scalar` (-1) for scalars
- **ArrayDimensions** -- Set to `[array_dimension]` when the attribute is an array
- **AccessLevel** -- `CurrentReadOrWrite` or `CurrentRead` based on security classification, with `HistoryRead` added for historized attributes
- **Historizing** -- Set to `true` for attributes with a `HistoryExtension` primitive
- **Initial value** -- `null` with `StatusCode = BadWaitingForInitialData` until the first MXAccess callback delivers a live value
The initial value stays `null` with `StatusCode = BadWaitingForInitialData` until the first Read or `ISubscribable.OnDataChange` push lands.
## Primitive Grouping
## CapturingBuilder + alarm sink registration
Galaxy objects can have primitive components (e.g., alarm extensions, history extensions) that attach sub-attributes to a parent attribute. The address space handles this with a two-pass approach:
`GenericDriverNodeManager.BuildAddressSpaceAsync` wraps the supplied builder in a `CapturingBuilder` before calling `DiscoverAsync`. The wrapper observes every `Variable()` call: when a returned `IVariableHandle.MarkAsAlarmCondition(AlarmConditionInfo)` fires, the sink is registered in the manager's `_alarmSinks` dictionary keyed by the variable's `FullReference`. Subsequent `IAlarmSource.OnAlarmEvent` pushes are routed to the matching sink by `SourceNodeId`. This keeps the alarm-wiring protocol declarative — drivers just flag `DriverAttributeInfo.IsAlarm = true` and the materialization of the OPC UA `AlarmConditionState` node is handled by the server layer. See `docs/AlarmTracking.md`.
### First pass: direct attributes
## NodeId scheme
Attributes with an empty `PrimitiveName` are created as direct variable children of the object node. If a direct attribute shares its name with a primitive group, the variable node reference is saved for the second pass.
### Second pass: primitive child attributes
Attributes with a non-empty `PrimitiveName` are grouped by that name. For each group:
1. If a direct attribute variable with the same name already exists, the primitive's child attributes are added as `HasComponent` children of that variable node. This merges alarm/history sub-attributes (e.g., `InAlarm`, `Priority`) under the parent variable they describe.
2. If no matching direct attribute exists, a new `BaseObjectState` node is created with NodeId `ns=1;s={TagName}.{PrimitiveName}`, and the primitive's attributes are added under it.
This structure means that browsing `TestMachine_001/SomeAlarmAttr` reveals both the process value and its alarm sub-attributes (`InAlarm`, `Priority`, `DescAttrName`) as children.
## NodeId Scheme
All node identifiers use string-based NodeIds in namespace index 1 (`ns=1`):
All nodes live in the driver's namespace (not a shared `ns=1`). Browse paths are driver-defined:
| Node type | NodeId format | Example |
|-----------|---------------|---------|
| Root folder | `ns=1;s=ZB` | `ns=1;s=ZB` |
| Area folder | `ns=1;s={tag_name}` | `ns=1;s=Area_001` |
| Object node | `ns=1;s={tag_name}` | `ns=1;s=TestMachine_001` |
| Scalar variable | `ns=1;s={tag_name}.{attr}` | `ns=1;s=TestMachine_001.MachineID` |
| Array variable | `ns=1;s={tag_name}.{attr}` | `ns=1;s=MESReceiver_001.MoveInPartNumbers` |
| Primitive sub-object | `ns=1;s={tag_name}.{prim}` | `ns=1;s=TestMachine_001.AlarmPrim` |
| Node type | NodeId format | Example |
|---|---|---|
| Driver root | `ns;s={DriverInstanceId}` | `urn:OtOpcUa:galaxy-01;s=galaxy-01` |
| Folder | `ns;s={parent}/{browseName}` | `ns;s=galaxy-01/Area_001` |
| Variable | `ns;s={DriverAttributeInfo.FullName}` | `ns;s=DelmiaReceiver_001.DownloadPath` |
| Alarm condition | `ns;s={FullReference}.Condition` | `ns;s=DelmiaReceiver_001.Temperature.Condition` |
For array attributes, the `[]` suffix present in `full_tag_reference` is stripped from the NodeId. The `full_tag_reference` (with `[]`) is kept internally for MXAccess subscription addressing. This means `MESReceiver_001.MoveInPartNumbers[]` in the Galaxy maps to NodeId `ns=1;s=MESReceiver_001.MoveInPartNumbers`.
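As a small sketch of the suffix handling (variable names are illustrative):
```csharp
// The `[]` marker stays on the MXAccess address but not on the NodeId string.
string fullTagReference = "MESReceiver_001.MoveInPartNumbers[]";
string nodeIdString = fullTagReference.EndsWith("[]", StringComparison.Ordinal)
    ? fullTagReference[..^2]          // -> "MESReceiver_001.MoveInPartNumbers"
    : fullTagReference;
```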
For Galaxy the `FullName` stays in the legacy `tag_name.AttributeName` format; Modbus uses `unit:register:type`; AB CIP uses the native `program:tag.member` path; etc. — the shape is the driver's choice.
## Topological Sort
## Per-driver hierarchy examples
The hierarchy query returns objects ordered by `parent_gobject_id, tag_name`, but this does not guarantee that a parent appears before all of its children in all cases. `LmxNodeManager.TopologicalSort` performs a depth-first traversal to produce a list where every parent is guaranteed to precede its children. This allows the build loop to look up parent nodes from `_nodeMap` without forward references.
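A minimal sketch of such a parent-before-children ordering, assuming a simple record shape for the hierarchy rows:
```csharp
// Hypothetical row shape for the hierarchy query result.
public sealed record GalaxyObject(int GobjectId, int ParentGobjectId, string TagName);

static List<GalaxyObject> TopologicalSort(IReadOnlyList<GalaxyObject> objects)
{
    var byParent = objects.ToLookup(o => o.ParentGobjectId);
    var knownIds = objects.Select(o => o.GobjectId).ToHashSet();
    var result = new List<GalaxyObject>(objects.Count);

    void Visit(GalaxyObject node)
    {
        result.Add(node);                          // parent first,
        foreach (var child in byParent[node.GobjectId])
            Visit(child);                          // then its whole subtree
    }

    // Roots are objects whose parent id is not part of the snapshot.
    foreach (var root in objects.Where(o => !knownIds.Contains(o.ParentGobjectId)))
        Visit(root);
    return result;
}
```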
- **Galaxy Proxy**: walks the DB-snapshot hierarchy (`GalaxyProxyDriver.DiscoverAsync`), streams Area objects as folders and non-area objects as variable-bearing folders, marks `IsAlarm = true` on attributes that have an `AlarmExtension` primitive. The v1 two-pass primitive-grouping logic is retained inside the Galaxy driver.
- **Modbus**: streams one folder per device, one variable per register range from `ModbusDriverOptions`. No alarm surface.
- **AB CIP**: uses `AbCipTemplateCache` to enumerate user-defined types, streams a folder per program with variables keyed on the native tag path.
- **OPC UA Client**: re-exposes a remote server's address space — browses the upstream and relays nodes through the builder.
See `docs/v2/driver-specs.md` for the per-driver discovery contracts.
## Platform Scope Filtering
When `GalaxyRepository.Scope` is set to `LocalPlatform`, the hierarchy and attributes passed to `BuildAddressSpace` are pre-filtered by `PlatformScopeFilter` inside `GalaxyRepositoryService`. The node manager receives only the local platform's objects and their ancestor areas, so the resulting browse tree is a subset of the full Galaxy. The filtering is transparent to `LmxNodeManager` — it builds nodes from whatever data it receives.
Clients browsing a `LocalPlatform`-scoped server will see only the areas and objects hosted by that platform. Areas that exist in the Galaxy but contain no local descendants are excluded. See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter) for the filtering algorithm and configuration.
## Rediscovery
Drivers that implement `IRediscoverable` fire `OnRediscoveryNeeded` when their backend signals a change (Galaxy: `time_of_last_deploy` advance; TwinCAT: symbol-version-changed; OPC UA Client: server namespace change). Core re-runs `DiscoverAsync` and diffs — see `docs/IncrementalSync.md`. Static drivers (Modbus, S7) don't implement `IRediscoverable`; their address space only changes when a new generation is published from the Config DB.
## Incremental Sync
On address space rebuild (triggered by a Galaxy deploy change), `SyncAddressSpace` uses `AddressSpaceDiff` to identify which `gobject_id` values have changed between the old and new snapshots. Only the affected subtrees are torn down and rebuilt, preserving unchanged nodes and their active subscriptions. Affected subscriptions are snapshot before teardown and replayed after rebuild.
If no previous state is cached (first build), the full `BuildAddressSpace` path runs instead.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/LmxNodeManager.cs` -- Node manager with `BuildAddressSpace`, `SyncAddressSpace`, and `TopologicalSort`
- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/AddressSpaceBuilder.cs` -- Testable in-memory model builder
- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs` — orchestration + `CapturingBuilder`
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — OPC UA materialization (`IAddressSpaceBuilder` impl + `NestedBuilder`)
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAddressSpaceBuilder.cs` — builder contract
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ITagDiscovery.cs` — driver discovery capability
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverAttributeInfo.cs` — per-attribute descriptor

View File

@@ -1,234 +1,76 @@
# Alarm Tracking
`LmxNodeManager` generates OPC UA alarm conditions from Galaxy attributes marked as alarms. The system detects alarm-capable attributes during address space construction, creates `AlarmConditionState` nodes, auto-subscribes to the runtime alarm tags via MXAccess, and reports state transitions as OPC UA events.
Alarm surfacing is an optional driver capability exposed via `IAlarmSource` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs`). Drivers whose backends have an alarm concept implement it — today: Galaxy (MXAccess alarms), FOCAS (CNC alarms), OPC UA Client (A&C events from the upstream server). Modbus / S7 / AB CIP / AB Legacy / TwinCAT do not implement the interface and the feature is simply absent from their subtrees.
## AlarmInfo Structure
Each tracked alarm is represented by an `AlarmInfo` instance stored in the `_alarmInAlarmTags` dictionary, keyed by the `InAlarm` tag reference:
```csharp
private sealed class AlarmInfo
{
    public string SourceTagReference { get; set; }        // e.g., "Tag_001.Temperature"
    public NodeId SourceNodeId { get; set; }
    public string SourceName { get; set; }                // attribute name for event messages
    public bool LastInAlarm { get; set; }                 // tracks previous state for edge detection
    public AlarmConditionState? ConditionNode { get; set; }
    public string PriorityTagReference { get; set; }      // e.g., "Tag_001.Temperature.Priority"
    public string DescAttrNameTagReference { get; set; }  // e.g., "Tag_001.Temperature.DescAttrName"
    public ushort CachedSeverity { get; set; }
    public string CachedMessage { get; set; }
}
```
`LastInAlarm` enables edge detection so only actual transitions (inactive-to-active or active-to-inactive) generate events, not repeated identical values.
## IAlarmSource surface
```csharp
Task<IAlarmSubscriptionHandle> SubscribeAlarmsAsync(
    IReadOnlyList<string> sourceNodeIds, CancellationToken cancellationToken);
Task UnsubscribeAlarmsAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken);
Task AcknowledgeAsync(IReadOnlyList<AlarmAcknowledgeRequest> acknowledgements,
    CancellationToken cancellationToken);
event EventHandler<AlarmEventArgs>? OnAlarmEvent;
```
The driver fires `OnAlarmEvent` for every transition (`Active`, `Acknowledged`, `Inactive`) with an `AlarmEventArgs` carrying the source node id, condition id, alarm type, message, severity (`AlarmSeverity` enum), and source timestamp.
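A hedged usage sketch of this surface; members not quoted above are assumptions:
```csharp
// Subscribe, observe transitions, and release the handle on shutdown.
IAlarmSource alarmSource = /* driver instance implementing the capability */ null!;

alarmSource.OnAlarmEvent += (_, e) =>
    Console.WriteLine($"{e.AlarmType} from {e.SourceNodeId}: {e.Message} (sev {e.Severity})");

var handle = await alarmSource.SubscribeAlarmsAsync(
    new[] { "DelmiaReceiver_001.Temperature" }, cancellationToken);
// ... later, during teardown:
await alarmSource.UnsubscribeAlarmsAsync(handle, cancellationToken);
```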
## Alarm Detection via is_alarm Flag
During `BuildAddressSpace` (and `BuildSubtree` for incremental sync), the node manager scans each non-area Galaxy object for attributes where `IsAlarm == true` and `PrimitiveName` is empty (direct attributes only, not primitive children):
```csharp
var alarmAttrs = objAttrs.Where(a => a.IsAlarm && string.IsNullOrEmpty(a.PrimitiveName)).ToList();
```
The `IsAlarm` flag originates from the `AlarmExtension` primitive in the Galaxy repository database. When a Galaxy attribute has an associated `AlarmExtension` primitive, the SQL query sets `is_alarm = 1` on the corresponding `GalaxyAttributeInfo`.
For each alarm attribute, the code verifies that a corresponding `InAlarm` sub-attribute variable node exists in `_tagToVariableNode` (constructed from `FullTagReference + ".InAlarm"`). If the variable node is missing, the alarm is skipped -- this prevents creating orphaned alarm conditions for attributes whose extension primitives were not published.
## AlarmSurfaceInvoker
`AlarmSurfaceInvoker` (`src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs`) wraps the three mutating surfaces through `CapabilityInvoker`:
- `SubscribeAlarmsAsync` / `UnsubscribeAlarmsAsync` run through the `DriverCapability.AlarmSubscribe` pipeline — retries apply under the tier configuration.
- `AcknowledgeAsync` runs through `DriverCapability.AlarmAcknowledge` which does NOT retry per decision #143. A timed-out ack may have already registered at the plant floor; replay would silently double-acknowledge.
Multi-host fan-out: when the driver implements `IPerCallHostResolver`, each source node id is resolved individually and batches are grouped by host so a dead PLC inside a multi-device driver doesn't poison sibling breakers. Single-host drivers fall back to `IDriver.DriverInstanceId` as the pipeline-key host.
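The per-host batching described for `AlarmSurfaceInvoker` might be sketched like this; `ResolveHost` and `InvokeThroughPipelineAsync` are hypothetical names:
```csharp
// Hedged sketch: group source node ids by resolved host so each host gets
// its own pipeline key and breaker.
var batches = sourceNodeIds.GroupBy(
    id => (driver as IPerCallHostResolver)?.ResolveHost(id) ?? driver.DriverInstanceId);

foreach (var batch in batches)
{
    await InvokeThroughPipelineAsync(
        DriverCapability.AlarmSubscribe, host: batch.Key,
        nodeIds: batch.ToList(), cancellationToken);
}
```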
## Condition-node creation via CapturingBuilder
Alarm-condition nodes are materialized at address-space build time. During `GenericDriverNodeManager.BuildAddressSpaceAsync` the builder is wrapped in a `CapturingBuilder` that observes every `Variable()` call. When a driver calls `IVariableHandle.MarkAsAlarmCondition(AlarmConditionInfo)` on a returned handle, the server-side `DriverNodeManager.VariableHandle` creates a sibling `AlarmConditionState` node and returns an `IAlarmConditionSink`. The wrapper stores the sink in `_alarmSinks` keyed by the variable's full reference, then `GenericDriverNodeManager` registers a forwarder on `IAlarmSource.OnAlarmEvent` that routes each push to the matching sink by `SourceNodeId`. Unknown source ids are dropped silently — they may belong to another driver.
The `AlarmConditionState` layout matches OPC UA Part 9:
- `SourceNode` → the originating variable
- `SourceName` / `ConditionName` → from `AlarmConditionInfo.SourceName`
- Initial state: enabled, inactive, acknowledged, severity per `InitialSeverity`, retain false
- `HasCondition` references wire the source variable ↔ the condition node bidirectionally
Drivers flag alarm-bearing variables at discovery time via `DriverAttributeInfo.IsAlarm = true`. The Galaxy driver, for example, sets this on attributes that have an `AlarmExtension` primitive in the Galaxy repository DB; FOCAS sets it on the CNC alarm register.
## State transitions
`ConditionSink.OnTransition` runs under the node manager's `Lock` and maps the `AlarmEventArgs.AlarmType` string to Part 9 state:
| AlarmType | Action |
|---|---|
| `Active` | `SetActiveState(true)`, `SetAcknowledgedState(false)`, `Retain = true` |
| `Acknowledged` | `SetAcknowledgedState(true)` |
| `Inactive` | `SetActiveState(false)`; `Retain = false` once both inactive and acknowledged |
Severity is remapped: `AlarmSeverity.Low/Medium/High/Critical` → OPC UA numeric 250 / 500 / 700 / 900. `Message.Value` is set from `AlarmEventArgs.Message` on every transition. `ClearChangeMasks(true)` and `ReportEvent(condition)` fire the OPC UA event notification for clients subscribed to any ancestor notifier.
## Template-Based Alarm Object Filter
When large galaxies contain more alarm-bearing objects than clients need, `OpcUa.AlarmFilter.ObjectFilters` restricts alarm condition creation to a subset of objects selected by **template name pattern**. The filter is applied at both alarm creation sites -- the full build in `BuildAddressSpace` and the subtree rebuild path triggered by Galaxy redeployment -- so the included set is recomputed on every rebuild against the fresh hierarchy.
### Matching rules
- `*` is the only wildcard (glob-style, zero or more characters). All other regex metacharacters are escaped and matched literally.
- Matching is case-insensitive.
- The leading `$` used by Galaxy template `tag_name` values is normalized away on both the stored chain entry and the operator pattern, so `TestMachine*` matches the stored `$TestMachine`.
- Each configured entry may itself be comma-separated for operator convenience (`"TestMachine*, Pump_*"`).
- An empty list disables the filter and restores the prior behavior: every alarm-bearing object is tracked when `AlarmTrackingEnabled=true`.
### What gets included
Every Galaxy object whose **template derivation chain** contains any template matching any pattern is included. The chain walks `gobject.derived_from_gobject_id` from the instance through its immediate template and each ancestor template, up to `$Object`. An instance of `TestCoolMachine` whose chain is `$TestCoolMachine -> $TestMachine -> $UserDefined` matches the pattern `TestMachine` via the ancestor hit.
Inclusion propagates down the **containment hierarchy**: if an object matches, all of its descendants are included as well, regardless of their own template chains. This lets operators target a parent and pick up all its alarm-bearing children with one pattern.
Each object is evaluated exactly once. Overlapping matches (multiple patterns hit, or both an ancestor and descendant match independently) never produce duplicate alarm condition subscriptions -- the filter operates on object identity via a `HashSet<int>` of included `GobjectId` values.
### Resolution algorithm
`AlarmObjectFilter.ResolveIncludedObjects(hierarchy)` runs once per build:
1. Compile each pattern into a regex with `IgnoreCase | CultureInvariant | Compiled`.
2. Build a `parent -> children` map from the hierarchy. Orphans (parent id not in the hierarchy) are treated as roots.
3. BFS from each root with a `(nodeId, parentIncluded)` queue and a `visited` set for cycle defense.
4. At each node: if the parent was included OR any chain entry matches any pattern, add the node and mark its subtree as included.
5. Return the `HashSet<int>` of included object IDs. When no patterns are configured the filter is disabled and the method returns `null`, which the alarm loop treats as "no filtering".
After each resolution, `UnmatchedPatterns` exposes any raw pattern that matched zero objects so the startup log can warn about operator typos without failing startup.
## Acknowledge dispatch
Alarm acknowledgement initiated by an OPC UA client flows:
1. The SDK invokes the `AlarmConditionState.OnAcknowledge` method delegate.
2. The handler checks the session's roles for `AlarmAck` — drivers never see a request the session wasn't entitled to make.
3. `AlarmSurfaceInvoker.AcknowledgeAsync` is called with the source / condition / comment tuple. The invoker groups by host and runs each batch through the no-retry `AlarmAcknowledge` pipeline.
Drivers return normally for success or throw to signal the ack failed at the backend.
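Step 1 of the resolution algorithm (glob pattern to compiled regex) could look roughly like this sketch:
```csharp
// Sketch of step 1: '*' is the only wildcard; everything else is literal.
static Regex CompilePattern(string pattern)
{
    // Normalize away the leading '$' used by Galaxy template tag_names.
    var normalized = pattern.TrimStart('$');
    var regex = "^" + Regex.Escape(normalized).Replace(@"\*", ".*") + "$";
    return new Regex(regex,
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);
}
```
With this shape, `CompilePattern("TestMachine*")` matches the stored `$TestMachine` after its own `$` is normalized away.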
### How the alarm loop applies the filter
```csharp
// LmxNodeManager.BuildAddressSpace (and the subtree rebuild path)
if (_alarmTrackingEnabled)
{
    var includedIds = ResolveAlarmFilterIncludedIds(sorted); // null if no filter
    foreach (var obj in sorted)
    {
        if (obj.IsArea) continue;
        if (includedIds != null && !includedIds.Contains(obj.GobjectId)) continue;
        // ... existing alarm-attribute collection + AlarmConditionState creation
    }
}
```
`ResolveAlarmFilterIncludedIds` also emits a one-line summary (`Alarm filter: X of Y objects included (Z pattern(s))`) and per-pattern warnings for patterns that matched nothing. The included count is published to the dashboard via `AlarmFilterIncludedObjectCount`.
## EventNotifier propagation
Drivers that want hierarchical alarm subscriptions propagate `EventNotifier.SubscribeToEvents` up the containment chain during discovery — the Galaxy driver flips the flag on every ancestor of an alarm-bearing object up to the driver root, mirroring v1 behavior. Clients subscribed at the driver root, a mid-level folder, or the `Objects/` root see alarm events from every descendant with an `AlarmConditionState` sibling. The driver-root `FolderState` is created in `DriverNodeManager.CreateAddressSpace` with `EventNotifier = SubscribeToEvents | HistoryRead` so alarm event subscriptions and alarm history both have a single natural target.
## ConditionRefresh
The OPC UA `ConditionRefresh` service queues the current state of every retained condition back to the requesting monitored items. `DriverNodeManager` iterates the node manager's `AlarmConditionState` collection and queues each condition whose `Retain.Value == true` — matching the Part 9 requirement.
### Runtime telemetry
`LmxNodeManager` exposes three read-only properties populated by the filter:
- `AlarmFilterEnabled` -- true when patterns are configured.
- `AlarmFilterPatternCount` -- number of compiled patterns.
- `AlarmFilterIncludedObjectCount` -- number of objects in the most recent included set.
`StatusReportService` reads these into `AlarmStatusInfo.FilterEnabled`, `FilterPatternCount`, and `FilterIncludedObjectCount`. The Alarms panel on the dashboard renders `Filter: N pattern(s), M object(s) included` only when the filter is enabled. See [Status Dashboard](StatusDashboard.md#alarms).
### Validator warning
`ConfigurationValidator.ValidateAndLog()` logs the effective filter at startup and emits a `Warning` if `AlarmFilter.ObjectFilters` is non-empty while `AlarmTrackingEnabled` is `false`, because the filter would have no effect.
## Key source files
## AlarmConditionState Creation
Each detected alarm attribute produces an `AlarmConditionState` node:
```csharp
var condition = new AlarmConditionState(sourceVariable);
condition.Create(SystemContext, conditionNodeId,
    new QualifiedName(alarmAttr.AttributeName + "Alarm", NamespaceIndex),
    new LocalizedText("en", alarmAttr.AttributeName + " Alarm"),
    true);
```
Key configuration on the condition node:
- **SourceNode** -- Set to the OPC UA NodeId of the source variable, linking the condition to the attribute that triggered it.
- **SourceName / ConditionName** -- Set to the Galaxy attribute name for identification in event notifications.
- **AutoReportStateChanges** -- Set to `true` so the OPC UA framework automatically generates event notifications when condition properties change.
- **Initial state** -- Enabled, inactive, acknowledged, severity Medium, retain false.
- **HasCondition references** -- Bidirectional references are added between the source variable and the condition node.
The condition's `OnReportEvent` callback forwards events to `Server.ReportEvent` so they reach clients subscribed at the server level.
### Condition Methods
Each alarm condition supports the following OPC UA Part 9 methods:
- **Acknowledge** (`OnAcknowledge`) -- Writes the acknowledgment message to the Galaxy `AckMsg` tag. Requires the `AlarmAck` role.
- **Confirm** (`OnConfirm`) -- Confirms a previously acknowledged alarm. The SDK manages the `ConfirmedState` transition.
- **AddComment** (`OnAddComment`) -- Attaches an operator comment to the condition for audit trail purposes.
- **Enable / Disable** (`OnEnableDisable`) -- Activates or deactivates alarm monitoring for the specific condition. The SDK manages the `EnabledState` transition.
- **Shelve** (`OnShelve`) -- Supports `TimedShelve`, `OneShotShelve`, and `Unshelve` operations. The SDK manages the `ShelvedStateMachineType` state transitions including automatic timed unshelve.
- **TimedUnshelve** (`OnTimedUnshelve`) -- Automatically called by the SDK when a timed shelve period expires.
### Event Fields
Alarm events include the following fields:
- `EventId` -- Unique GUID for each event, used as reference for Acknowledge/Confirm
- `ActiveState`, `AckedState`, `ConfirmedState` -- State transitions
- `Message` -- Alarm message from Galaxy `DescAttrName` or default text
- `Severity` -- Galaxy Priority clamped to OPC UA range 1-1000
- `Retain` -- True while alarm is active or unacknowledged
- `LocalTime` -- Server timezone offset with daylight saving flag
- `Quality` -- Set to Good for alarm events
## Auto-subscription to Alarm Tags
After alarm condition nodes are created, `SubscribeAlarmTags` opens MXAccess subscriptions for three tags per alarm:
1. **InAlarm** (`Tag_001.Temperature.InAlarm`) -- The boolean trigger for alarm activation/deactivation.
2. **Priority** (`Tag_001.Temperature.Priority`) -- Numeric priority that maps to OPC UA severity.
3. **DescAttrName** (`Tag_001.Temperature.DescAttrName`) -- String description used as the alarm event message.
These subscriptions are opened unconditionally (not ref-counted) because they serve the server's own alarm tracking, not client-initiated monitoring. Tags that do not have corresponding variable nodes in `_tagToVariableNode` are skipped.
## EventNotifier Propagation
When a Galaxy object contains at least one alarm attribute, `EventNotifiers.SubscribeToEvents` is set on the object node **and all its ancestors** up to the root. This allows OPC UA clients to subscribe to events at any level in the hierarchy and receive alarm notifications from all descendants:
```csharp
if (hasAlarms && _nodeMap.TryGetValue(obj.GobjectId, out var objNode))
EnableEventNotifierUpChain(objNode);
```
For example, an alarm on `TestMachine_001.SubObject.Temperature` will be visible to clients subscribed on `SubObject`, `TestMachine_001`, or the root `ZB` folder. The root `ZB` folder also has `EventNotifiers.SubscribeToEvents` enabled during initial construction.
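The up-chain helper might be sketched as follows, assuming the OPC Foundation SDK node types:
```csharp
// Sketch of the up-chain walk; every ancestor object node gets the
// SubscribeToEvents bit so clients can subscribe at any level.
private static void EnableEventNotifierUpChain(NodeState node)
{
    for (NodeState? current = node; current != null; current = current.Parent)
    {
        if (current is BaseObjectState obj)
            obj.EventNotifier |= EventNotifiers.SubscribeToEvents;
    }
}
```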
## InAlarm Transition Detection in DispatchLoop
Alarm state changes are detected in the dispatch loop's Phase 1 (outside `Lock`), which runs on the background dispatch thread rather than the STA thread. This placement is intentional because the detection logic reads Priority and DescAttrName values from MXAccess, which would block the STA thread if done inside the `OnMxAccessDataChange` callback.
For each pending data change, the loop checks whether the address matches a key in `_alarmInAlarmTags`:
```csharp
if (_alarmInAlarmTags.TryGetValue(address, out var alarmInfo))
{
    var newInAlarm = vtq.Value is true || vtq.Value is 1
        || (vtq.Value is int intVal && intVal != 0);
    if (newInAlarm != alarmInfo.LastInAlarm)
    {
        alarmInfo.LastInAlarm = newInAlarm;
        // Read Priority and DescAttrName via MXAccess (outside Lock)
        ...
        pendingAlarmEvents.Add((alarmInfo, newInAlarm));
    }
}
```
The boolean coercion handles multiple value representations: `true`, integer `1`, or any non-zero integer. When the value changes state, Priority and DescAttrName are read synchronously from MXAccess to populate `CachedSeverity` and `CachedMessage`. These reads happen outside `Lock` because they call into the STA thread.
Priority values are clamped to the OPC UA severity range (1-1000). Both `int` and `short` types are handled.
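A sketch of that coercion and clamp; the fallback value is an assumption (chosen to match the Medium severity used elsewhere):
```csharp
// priorityValue is the boxed value read from the Priority tag.
ushort severity = priorityValue switch
{
    int i   => (ushort)Math.Clamp(i, 1, 1000),
    short s => (ushort)Math.Clamp((int)s, 1, 1000),
    _       => (ushort)500   // assumed default for an unexpected type
};
```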
## ReportAlarmEvent
`ReportAlarmEvent` runs inside `Lock` during Phase 2 of the dispatch loop. It updates the `AlarmConditionState` and generates an OPC UA event:
```csharp
condition.SetActiveState(SystemContext, active);
condition.Message.Value = new LocalizedText("en", message);
condition.SetSeverity(SystemContext, (EventSeverity)severity);
condition.Retain.Value = active || (condition.AckedState?.Id?.Value == false);
```
Key behaviors:
- **Active state** -- Set to `true` on activation, `false` on clearing.
- **Message** -- Uses `CachedMessage` (from DescAttrName) when available on activation. Falls back to a generated `"Alarm active: {SourceName}"` string. Cleared alarms always use `"Alarm cleared: {SourceName}"`.
- **Severity** -- Set from `CachedSeverity`, which was read from the Priority tag.
- **Retain** -- `true` while the alarm is active or unacknowledged. This keeps the condition visible in condition refresh responses.
- **Acknowledged state** -- Reset to `false` when the alarm activates, requiring explicit client acknowledgment. When role-based auth is active, alarm acknowledgment requires the `AlarmAck` role on the session (checked via `GrantedRoleIds`). Users without this role receive `BadUserAccessDenied`.
The event is reported by walking up the notifier chain from the source variable's parent through all ancestor nodes. Each ancestor with `EventNotifier` set receives the event via `ReportEvent`, so clients subscribed at any level in the Galaxy hierarchy see alarm transitions from descendant objects.
## Condition Refresh Override
The `ConditionRefresh` override iterates all tracked alarms and queues retained conditions to the requesting monitored items:
```csharp
public override ServiceResult ConditionRefresh(OperationContext context,
    IList<IEventMonitoredItem> monitoredItems)
{
    foreach (var kvp in _alarmInAlarmTags)
    {
        var info = kvp.Value;
        if (info.ConditionNode == null || info.ConditionNode.Retain?.Value != true)
            continue;
        foreach (var item in monitoredItems)
            item.QueueEvent(info.ConditionNode);
    }
    return ServiceResult.Good;
}
```
Only conditions where `Retain.Value == true` are included. This means only active or unacknowledged alarms appear in condition refresh responses, matching the OPC UA specification requirement that condition refresh returns the current state of all retained conditions.
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs` — capability contract + `AlarmEventArgs`
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs` — per-host fan-out + no-retry ack
- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs` — `CapturingBuilder` + alarm forwarder
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — `VariableHandle.MarkAsAlarmCondition` + `ConditionSink`
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Alarms/GalaxyAlarmTracker.cs` — Galaxy-specific alarm-event production

View File

@@ -2,9 +2,9 @@
## Overview
`ZB.MOM.WW.OtOpcUa.Client.CLI` is a cross-platform command-line client for the LmxOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations. Commands are routed and parsed by [CliFx](https://github.com/Tyrrrz/CliFx).
`ZB.MOM.WW.OtOpcUa.Client.CLI` is a cross-platform command-line client for the OtOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations. Commands are routed and parsed by [CliFx](https://github.com/Tyrrrz/CliFx).
The CLI is the primary tool for operators and developers to test and interact with the server from a terminal. It supports all core operations: connectivity testing, browsing, reading, writing, subscriptions, alarm monitoring, history reads, and redundancy queries.
The CLI is the primary tool for operators and developers to test and interact with the server from a terminal. It supports all core operations: connectivity testing, browsing, reading, writing, subscriptions, alarm monitoring, history reads, and redundancy queries. Any driver surface exposed by the server (Galaxy, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, OPC UA Client) is reachable through these commands — the CLI is driver-agnostic because everything below the OPC UA endpoint is.
## Build and Run
@@ -14,7 +14,7 @@ dotnet build
dotnet run -- <command> [options]
```
The executable name is `lmxopcua-cli`.
The executable name is `otopcua-cli`. Dev boxes carrying a pre-task-#208 install may still have the legacy `{LocalAppData}/LmxOpcUaClient/` folder on disk; on first launch of any post-#208 CLI or UI build, `ClientStoragePaths` (`src/ZB.MOM.WW.OtOpcUa.Client.Shared/ClientStoragePaths.cs`) migrates it to `{LocalAppData}/OtOpcUaClient/` automatically so trusted certificates + saved settings survive the rename.
## Architecture
@@ -46,7 +46,7 @@ All commands accept these options:
When `-U` and `-P` are provided, the shared service passes a `UserIdentity(username, password)` to the OPC UA session. Without credentials, anonymous identity is used.
```bash
lmxopcua-cli write -u opc.tcp://localhost:4840 -n "ns=2;s=MyNode" -v 42 -U operator -P op123
otopcua-cli write -u opc.tcp://localhost:4840 -n "ns=2;s=MyNode" -v 42 -U operator -P op123
```
### Failover
@@ -54,20 +54,20 @@ lmxopcua-cli write -u opc.tcp://localhost:4840 -n "ns=2;s=MyNode" -v 42 -U opera
When `-F` is provided, the shared service tries the primary URL first, then each failover URL in order. For long-running commands (`subscribe`, `alarms`), the service monitors the session via keep-alive and automatically reconnects to the next available server on failure.
```bash
lmxopcua-cli connect -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa
otopcua-cli connect -u opc.tcp://localhost:4840/OtOpcUa -F opc.tcp://localhost:4841/OtOpcUa
```
### Transport Security
When `sign` or `encrypt` is specified, the shared service:
1. Ensures a client application certificate exists under `{LocalAppData}/LmxOpcUaClient/pki/` (auto-created if missing)
1. Ensures a client application certificate exists under `{LocalAppData}/OtOpcUaClient/pki/` (auto-created if missing; pre-rename `LmxOpcUaClient/` is migrated in place on first launch)
2. Discovers server endpoints and selects one matching the requested security mode
3. Prefers `Basic256Sha256` when multiple matching endpoints exist
4. Fails with a clear error if no matching endpoint is found
```bash
lmxopcua-cli browse -u opc.tcp://localhost:4840/LmxOpcUa -S encrypt -U admin -P secret -r -d 2
otopcua-cli browse -u opc.tcp://localhost:4840/OtOpcUa -S encrypt -U admin -P secret -r -d 2
```
### Verbose Logging
@@ -81,14 +81,14 @@ The `--verbose` flag switches Serilog output from `Warning` to `Debug` level, sh
Tests connectivity to an OPC UA server. Creates a session, prints connection metadata, and disconnects.
```bash
lmxopcua-cli connect -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123
otopcua-cli connect -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123
```
Output:
```text
Connected to: opc.tcp://localhost:4840/LmxOpcUa
Server: LmxOpcUa
Connected to: opc.tcp://localhost:4840/OtOpcUa
Server: OtOpcUa Server
Security Mode: None
Security Policy: http://opcfoundation.org/UA/SecurityPolicy#None
Connection successful.
@@ -99,7 +99,7 @@ Connection successful.
Reads the current value of a single node and prints the value, status code, and timestamps.
```bash
lmxopcua-cli read -u opc.tcp://localhost:4840/LmxOpcUa -n "ns=3;s=DEV.ScanState" -U admin -P admin123
otopcua-cli read -u opc.tcp://localhost:4840/OtOpcUa -n "ns=3;s=DEV.ScanState" -U admin -P admin123
```
| Flag | Description |
@@ -121,7 +121,7 @@ Server Time: 2026-03-30T19:58:38.0971257Z
Writes a value to a node. The shared service reads the current value first to determine the target data type, then converts the supplied string value using `ValueConverter.ConvertValue()`.
```bash
lmxopcua-cli write -u opc.tcp://localhost:4840 -n "ns=2;s=MyNode" -v 42
otopcua-cli write -u opc.tcp://localhost:4840 -n "ns=2;s=MyNode" -v 42
```
| Flag | Description |
@@ -135,10 +135,10 @@ Browses the OPC UA address space starting from the Objects folder or a specified
```bash
# Browse top-level Objects folder
lmxopcua-cli browse -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123
otopcua-cli browse -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123
# Browse a specific node recursively to depth 3
lmxopcua-cli browse -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123 -r -d 3 -n "ns=3;s=ZB"
otopcua-cli browse -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123 -r -d 3 -n "ns=3;s=ZB"
```
| Flag | Description |
@@ -152,7 +152,7 @@ lmxopcua-cli browse -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123 -r
Monitors a node for value changes using OPC UA subscriptions. Prints each data change notification with timestamp, value, and status code. Runs until Ctrl+C, then unsubscribes and disconnects cleanly.
```bash
lmxopcua-cli subscribe -u opc.tcp://localhost:4840 -n "ns=2;s=MyNode" -i 500
otopcua-cli subscribe -u opc.tcp://localhost:4840 -n "ns=2;s=MyNode" -i 500
```
| Flag | Description |
@@ -166,12 +166,12 @@ Reads historical data from a node. Supports raw history reads and aggregate (pro
```bash
# Raw history
lmxopcua-cli historyread -u opc.tcp://localhost:4840/LmxOpcUa \
otopcua-cli historyread -u opc.tcp://localhost:4840/OtOpcUa \
-n "ns=1;s=TestMachine_001.TestHistoryValue" \
--start "2026-03-25" --end "2026-03-30"
# Aggregate: 1-hour average
lmxopcua-cli historyread -u opc.tcp://localhost:4840/LmxOpcUa \
otopcua-cli historyread -u opc.tcp://localhost:4840/OtOpcUa \
-n "ns=1;s=TestMachine_001.TestHistoryValue" \
--start "2026-03-25" --end "2026-03-30" \
--aggregate Average --interval 3600000
@@ -203,10 +203,10 @@ Subscribes to alarm events on a node. Prints structured alarm output including s
```bash
# Subscribe to alarm events on the Server node
lmxopcua-cli alarms -u opc.tcp://localhost:4840/LmxOpcUa
otopcua-cli alarms -u opc.tcp://localhost:4840/OtOpcUa
# Subscribe to a specific source node with condition refresh
lmxopcua-cli alarms -u opc.tcp://localhost:4840/LmxOpcUa \
otopcua-cli alarms -u opc.tcp://localhost:4840/OtOpcUa \
-n "ns=1;s=TestMachine_001" --refresh
```
@@ -221,7 +221,7 @@ lmxopcua-cli alarms -u opc.tcp://localhost:4840/LmxOpcUa \
Reads the OPC UA redundancy state from a server: redundancy mode, service level, server URIs, and application URI.
```bash
lmxopcua-cli redundancy -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123
otopcua-cli redundancy -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123
```
Example output:
@@ -230,9 +230,9 @@ Example output:
Redundancy Mode: Warm
Service Level: 200
Server URIs:
- urn:localhost:LmxOpcUa:instance1
- urn:localhost:LmxOpcUa:instance2
Application URI: urn:localhost:LmxOpcUa:instance1
- urn:localhost:OtOpcUa:instance1
- urn:localhost:OtOpcUa:instance2
Application URI: urn:localhost:OtOpcUa:instance1
```
## Testing


@@ -2,7 +2,7 @@
## Overview
`ZB.MOM.WW.OtOpcUa.Client.UI` is a cross-platform Avalonia desktop application for connecting to and interacting with the LmxOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations.
`ZB.MOM.WW.OtOpcUa.Client.UI` is a cross-platform Avalonia desktop application for connecting to and interacting with the OtOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations.
The UI provides a single-window interface for browsing the address space, reading and writing values, monitoring live subscriptions, managing alarms, and querying historical data.
@@ -43,7 +43,7 @@ The application uses a single-window layout with five main areas:
│ │ │ ││
│ (lazy-load) │ └──────────────────────────────────────────────┘│
├──────────────┴──────────────────────────────────────────────┤
│ Connected to opc.tcp://... | LmxOpcUa | Session: ... | 3 subs│
│ Connected to opc.tcp://... | OtOpcUa Server | Session: ... | 3 subs│
└─────────────────────────────────────────────────────────────┘
```
@@ -55,7 +55,7 @@ The top bar provides the endpoint URL, Connect, and Disconnect buttons. The **Co
| Setting | Description |
|---------|-------------|
| Endpoint URL | OPC UA server endpoint (e.g., `opc.tcp://localhost:4840/LmxOpcUa`) |
| Endpoint URL | OPC UA server endpoint (e.g., `opc.tcp://localhost:4840/OtOpcUa`) |
| Username / Password | Credentials for `UserName` token authentication |
| Security Mode | Transport security: None, Sign, SignAndEncrypt |
| Failover URLs | Comma-separated backup endpoints for redundancy failover |
@@ -65,7 +65,7 @@ The top bar provides the endpoint URL, Connect, and Disconnect buttons. The **Co
### Settings Persistence
Connection settings are saved to `{LocalAppData}/LmxOpcUaClient/settings.json` after each successful connection and on window close. The settings are reloaded on next launch, including:
Connection settings are saved to `{LocalAppData}/OtOpcUaClient/settings.json` after each successful connection and on window close. Dev boxes upgrading from a pre-task-#208 build still have the legacy `LmxOpcUaClient/` folder on disk; `ClientStoragePaths` in `Client.Shared` moves it to the canonical path on first launch so existing trusted certs + saved settings persist without operator action. The settings are reloaded on next launch, including:
- All connection parameters
- Active subscription node IDs (restored after reconnection)
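The migration described above amounts to a rename-if-absent. A minimal shell sketch of the same behavior, using illustrative demo paths (the real implementation is `ClientStoragePaths` in `Client.Shared` and operates on `{LocalAppData}`):

```shell
# Sketch of the first-launch settings migration (illustrative paths only;
# the real logic is ClientStoragePaths in Client.Shared).
BASE="${BASE:-/tmp/otopcua-migration-demo}"
LEGACY="$BASE/LmxOpcUaClient"
CANON="$BASE/OtOpcUaClient"
rm -rf "$BASE" && mkdir -p "$LEGACY"
echo '{}' > "$LEGACY/settings.json"   # stands in for the pre-task-#208 settings file
# Move the legacy folder to the canonical path only when the canonical path
# does not already exist, so repeated launches are a no-op.
if [ -d "$LEGACY" ] && [ ! -d "$CANON" ]; then
  mv "$LEGACY" "$CANON"
fi
ls "$CANON"
```

Because the move is guarded by the existence check, running it again after migration leaves everything in place.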


@@ -1,370 +1,141 @@
# Configuration
## Overview
## Two-layer model
The service loads configuration from `appsettings.json` at startup using the Microsoft.Extensions.Configuration stack. `AppConfiguration` is the root holder class that aggregates typed sections: `OpcUa`, `MxAccess`, `GalaxyRepository`, `Dashboard`, `Historian`, `Authentication`, and `Security`. Each section binds to a dedicated POCO class with sensible defaults, so the service runs with zero configuration on a standard deployment.
OtOpcUa configuration is split into two layers:
## Config Binding Pattern
| Layer | Where | Scope | Edited by |
|---|---|---|---|
| **Bootstrap** | `appsettings.json` per process | Enough to start the process and reach the Config DB | Local file edit + process restart |
| **Authoritative config** | Config DB (SQL Server) via `OtOpcUaConfigDbContext` | Clusters, namespaces, UNS hierarchy, equipment, tags, driver instances, ACLs, role grants, poll groups | Admin UI draft/publish workflow |
The production constructor in `OpcUaService` builds the configuration pipeline and binds each JSON section to its typed class:
The rule: if the setting describes *how the process connects to the rest of the world* (Config DB connection string, LDAP bind, transport security profile, node identity, logging), it lives in `appsettings.json`. If it describes *what the fleet does* (clusters, drivers, tags, UNS, ACLs), it lives in the Config DB and is edited through the Admin UI.
```csharp
var configuration = new ConfigurationBuilder()
.AddJsonFile("appsettings.json", optional: false)
.AddJsonFile($"appsettings.{Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT") ?? "Production"}.json", optional: true)
.AddEnvironmentVariables()
.Build();
---
_config = new AppConfiguration();
configuration.GetSection("OpcUa").Bind(_config.OpcUa);
configuration.GetSection("MxAccess").Bind(_config.MxAccess);
configuration.GetSection("GalaxyRepository").Bind(_config.GalaxyRepository);
configuration.GetSection("Dashboard").Bind(_config.Dashboard);
configuration.GetSection("Historian").Bind(_config.Historian);
configuration.GetSection("Authentication").Bind(_config.Authentication);
configuration.GetSection("Security").Bind(_config.Security);
```
## Bootstrap configuration (`appsettings.json`)
This pattern uses `IConfiguration.GetSection().Bind()` rather than `IOptions<T>` because the service targets .NET Framework 4.8, where the full dependency injection container is not used.
Each of the three processes (Server, Admin, Galaxy.Host) reads its own bootstrap configuration: Server and Admin from `appsettings.json` plus environment overrides, Galaxy.Host from environment variables alone.
## Environment-Specific Overrides
### OtOpcUa Server — `src/ZB.MOM.WW.OtOpcUa.Server/appsettings.json`
The configuration pipeline supports three layers of override, applied in order:
Bootstrap-only. `Program.cs` reads the following top-level sections:
1. `appsettings.json` -- base configuration (required)
2. `appsettings.{DOTNET_ENVIRONMENT}.json` -- environment-specific overlay (optional)
3. Environment variables -- highest priority, useful for deployment automation
| Section | Keys | Purpose |
|---|---|---|
| `Node` | `NodeId`, `ClusterId`, `ConfigDbConnectionString`, `LocalCachePath` | Identity + path to the Config DB + LiteDB offline cache path. |
| `OpcUaServer` | `EndpointUrl`, `ApplicationName`, `ApplicationUri`, `PkiStoreRoot`, `AutoAcceptUntrustedClientCertificates`, `SecurityProfile` | OPC UA endpoint + transport security. See [`security.md`](security.md). |
| `OpcUaServer:Ldap` | `Enabled`, `Server`, `Port`, `UseTls`, `AllowInsecureLdap`, `SearchBase`, `ServiceAccountDn`, `ServiceAccountPassword`, `GroupToRole`, `UserNameAttribute`, `GroupAttribute` | LDAP auth for OPC UA UserName tokens. See [`security.md`](security.md). |
| `Serilog` | Standard Serilog keys + `WriteJson` bool | Logging verbosity + optional JSON file sink for SIEM ingest. |
| `Authorization` | `StrictMode` (bool) | Flip `true` to fail-closed on sessions lacking LDAP group metadata. Default false during ACL rollouts. |
| `Metrics:Prometheus:Enabled` | bool | Toggles the `/metrics` endpoint. |
Set the `DOTNET_ENVIRONMENT` variable to load a named overlay file. For example, setting `DOTNET_ENVIRONMENT=Staging` loads `appsettings.Staging.json` if it exists.
Environment variables follow the standard `Section__Property` naming convention. For example, `OpcUa__Port=5840` overrides the OPC UA port.
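As a hedged sketch of that convention (key names come from the configuration tables in this document; the values are illustrative):

```shell
# Environment-variable overrides using the Section__Property convention.
export OpcUa__Port=5840               # overrides OpcUa.Port
export Historian__ServerName=hist01   # overrides Historian.ServerName
export DOTNET_ENVIRONMENT=Staging     # also loads appsettings.Staging.json if present
env | grep -E '^(OpcUa|Historian)__' | sort
```

Because environment variables are the last provider added in the pipeline, these values win over both `appsettings.json` and any environment overlay file.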
## Configuration Sections
### OpcUa
Controls the OPC UA server endpoint and session limits. Defined in `OpcUaConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `BindAddress` | `string` | `"0.0.0.0"` | IP address or hostname the server binds to. Use `0.0.0.0` for all interfaces, `localhost` for local-only, or a specific IP |
| `Port` | `int` | `4840` | TCP port the OPC UA server listens on |
| `EndpointPath` | `string` | `"/LmxOpcUa"` | Path appended to the host URI |
| `ServerName` | `string` | `"LmxOpcUa"` | Server name presented to OPC UA clients |
| `GalaxyName` | `string` | `"ZB"` | Galaxy name used as the OPC UA namespace |
| `MaxSessions` | `int` | `100` | Maximum simultaneous OPC UA sessions |
| `SessionTimeoutMinutes` | `int` | `30` | Idle session timeout in minutes |
| `AlarmTrackingEnabled` | `bool` | `false` | Enables `AlarmConditionState` nodes for alarm attributes |
| `AlarmFilter.ObjectFilters` | `List<string>` | `[]` | Wildcard template-name patterns (with `*`) that scope alarm tracking to matching objects and their descendants. Empty list disables filtering. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) |
| `ApplicationUri` | `string?` | `null` | Explicit application URI for this server instance. Required when redundancy is enabled. Defaults to `urn:{GalaxyName}:LmxOpcUa` when null |
### MxAccess
Controls the MXAccess runtime connection used for live tag reads and writes. Defined in `MxAccessConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `ClientName` | `string` | `"LmxOpcUa"` | Client name registered with MXAccess |
| `NodeName` | `string?` | `null` | Optional Galaxy node name to target |
| `GalaxyName` | `string?` | `null` | Optional Galaxy name for MXAccess reference resolution |
| `ReadTimeoutSeconds` | `int` | `5` | Maximum wait for a live tag read |
| `WriteTimeoutSeconds` | `int` | `5` | Maximum wait for a write acknowledgment |
| `MaxConcurrentOperations` | `int` | `10` | Cap on concurrent MXAccess operations |
| `MonitorIntervalSeconds` | `int` | `5` | Connectivity monitor probe interval |
| `AutoReconnect` | `bool` | `true` | Automatically re-establish dropped MXAccess sessions |
| `ProbeTag` | `string?` | `null` | Optional tag used to verify the runtime returns fresh data |
| `ProbeStaleThresholdSeconds` | `int` | `60` | Seconds a probe value may remain unchanged before the connection is considered stale |
| `RuntimeStatusProbesEnabled` | `bool` | `true` | Advises `<Host>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` to track per-host runtime state. Drives the Galaxy Runtime dashboard panel, HealthCheck Rule 2e, and the Read-path short-circuit that invalidates OPC UA variable quality when a host is Stopped. Set `false` to return to legacy behavior where host state is invisible and the bridge serves whatever quality MxAccess reports for individual tags. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) |
| `RuntimeStatusUnknownTimeoutSeconds` | `int` | `15` | Maximum seconds to wait for the initial probe callback before marking a host as Stopped. Only applies to the Unknown → Stopped transition; Running hosts never time out because `ScanState` is delivered on-change only. A value below 5s triggers a validator warning |
| `RequestTimeoutSeconds` | `int` | `30` | Outer safety timeout applied to sync-over-async MxAccess operations invoked from the OPC UA stack thread (Read, Write, address-space rebuild probe sync). Backstop for the inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds`. A timed-out operation returns `BadTimeout`. Validator rejects values < 1 and warns if set below the inner Read/Write timeouts. See [MXAccess Bridge](MxAccessBridge.md#request-timeout-safety-backstop). Stability review 2026-04-13 Finding 3 |
### GalaxyRepository
Controls the Galaxy repository database connection used to build the OPC UA address space. Defined in `GalaxyRepositoryConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `ConnectionString` | `string` | `"Server=localhost;Database=ZB;Integrated Security=true;"` | SQL Server connection string for the Galaxy database |
| `ChangeDetectionIntervalSeconds` | `int` | `30` | How often the service polls for Galaxy deploy changes |
| `CommandTimeoutSeconds` | `int` | `30` | SQL command timeout for repository queries |
| `ExtendedAttributes` | `bool` | `false` | Load extended Galaxy attribute metadata into the OPC UA model |
| `Scope` | `GalaxyScope` | `"Galaxy"` | Controls how much of the Galaxy hierarchy is loaded. `Galaxy` loads all deployed objects (default). `LocalPlatform` loads only objects hosted by the platform deployed on this machine. See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter) |
| `PlatformName` | `string?` | `null` | Explicit platform hostname for `LocalPlatform` filtering. When null, uses `Environment.MachineName`. Only used when `Scope` is `LocalPlatform` |
### Dashboard
Controls the embedded HTTP status dashboard. Defined in `DashboardConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Enabled` | `bool` | `true` | Whether the status dashboard is hosted |
| `Port` | `int` | `8081` | HTTP port for the dashboard endpoint |
| `RefreshIntervalSeconds` | `int` | `10` | HTML auto-refresh interval in seconds |
### Historian
Controls the Wonderware Historian SDK connection for OPC UA historical data access. Defined in `HistorianConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Enabled` | `bool` | `false` | Enables OPC UA historical data access |
| `ServerName` | `string` | `"localhost"` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments |
| `ServerNames` | `List<string>` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover. See [Historical Data Access](HistoricalDataAccess.md#read-only-cluster-failover) |
| `FailureCooldownSeconds` | `int` | `60` | How long a failed cluster node is skipped before being re-tried. Zero disables the cooldown |
| `IntegratedSecurity` | `bool` | `true` | Use Windows authentication |
| `UserName` | `string?` | `null` | Username when `IntegratedSecurity` is false |
| `Password` | `string?` | `null` | Password when `IntegratedSecurity` is false |
| `Port` | `int` | `32568` | Historian TCP port |
| `CommandTimeoutSeconds` | `int` | `30` | SDK packet timeout in seconds (inner async bound) |
| `RequestTimeoutSeconds` | `int` | `60` | Outer safety timeout applied to sync-over-async Historian operations invoked from the OPC UA stack thread (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`). Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Validator rejects values < 1 and warns if set below `CommandTimeoutSeconds`. Stability review 2026-04-13 Finding 3 |
| `MaxValuesPerRead` | `int` | `10000` | Maximum values returned per `HistoryRead` request |
### Authentication
Controls user authentication and write authorization for the OPC UA server. Defined in `AuthenticationConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `AllowAnonymous` | `bool` | `true` | Accepts anonymous client connections when `true` |
| `AnonymousCanWrite` | `bool` | `true` | Permits anonymous users to write when `true` |
#### LDAP Authentication
When `Ldap.Enabled` is `true`, credentials are validated against the configured LDAP server and group membership determines OPC UA permissions.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Ldap.Enabled` | `bool` | `false` | Enables LDAP authentication |
| `Ldap.Host` | `string` | `localhost` | LDAP server hostname |
| `Ldap.Port` | `int` | `3893` | LDAP server port |
| `Ldap.BaseDN` | `string` | `dc=lmxopcua,dc=local` | Base DN for LDAP operations |
| `Ldap.BindDnTemplate` | `string` | `cn={username},dc=lmxopcua,dc=local` | Bind DN template (`{username}` is replaced) |
| `Ldap.ServiceAccountDn` | `string` | `""` | Service account DN for group lookups |
| `Ldap.ServiceAccountPassword` | `string` | `""` | Service account password |
| `Ldap.TimeoutSeconds` | `int` | `5` | Connection timeout |
| `Ldap.ReadOnlyGroup` | `string` | `ReadOnly` | LDAP group granting read-only access |
| `Ldap.WriteOperateGroup` | `string` | `WriteOperate` | LDAP group granting write access for FreeAccess/Operate attributes |
| `Ldap.WriteTuneGroup` | `string` | `WriteTune` | LDAP group granting write access for Tune attributes |
| `Ldap.WriteConfigureGroup` | `string` | `WriteConfigure` | LDAP group granting write access for Configure attributes |
| `Ldap.AlarmAckGroup` | `string` | `AlarmAck` | LDAP group granting alarm acknowledgment |
#### Permission Model
When LDAP is enabled, LDAP group membership is mapped to OPC UA session role NodeIds during authentication. All authenticated LDAP users can browse and read nodes regardless of group membership. Groups grant additional permissions:
| LDAP Group | Permission |
|---|---|
| ReadOnly | No additional permissions (read-only access) |
| WriteOperate | Write FreeAccess and Operate attributes |
| WriteTune | Write Tune attributes |
| WriteConfigure | Write Configure attributes |
| AlarmAck | Acknowledge alarms |
Users can belong to multiple groups. The `admin` user in the default GLAuth configuration belongs to all three write groups (WriteOperate, WriteTune, WriteConfigure).
Write access depends on both the user's role and the Galaxy attribute's security classification. See the [Effective Permission Matrix](Security.md#effective-permission-matrix) in the Security Guide for the full breakdown.
Example configuration:
```json
"Authentication": {
"AllowAnonymous": true,
"AnonymousCanWrite": false,
"Ldap": {
"Enabled": true,
"Host": "localhost",
"Port": 3893,
"BaseDN": "dc=lmxopcua,dc=local",
"BindDnTemplate": "cn={username},dc=lmxopcua,dc=local",
"ServiceAccountDn": "cn=serviceaccount,dc=lmxopcua,dc=local",
"ServiceAccountPassword": "serviceaccount123",
"TimeoutSeconds": 5,
"ReadOnlyGroup": "ReadOnly",
"WriteOperateGroup": "WriteOperate",
"WriteTuneGroup": "WriteTune",
"WriteConfigureGroup": "WriteConfigure",
"AlarmAckGroup": "AlarmAck"
}
}
```
### Security
Controls OPC UA transport security profiles and certificate handling. Defined in `SecurityProfileConfiguration`. See [Security Guide](security.md) for detailed usage.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Profiles` | `List<string>` | `["None"]` | Security profiles to expose. Valid: `None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`, `Aes128_Sha256_RsaOaep-Sign`, `Aes128_Sha256_RsaOaep-SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign`, `Aes256_Sha256_RsaPss-SignAndEncrypt` |
| `AutoAcceptClientCertificates` | `bool` | `true` | Auto-accept untrusted client certificates. Set to `false` in production |
| `RejectSHA1Certificates` | `bool` | `true` | Reject client certificates signed with SHA-1 |
| `MinimumCertificateKeySize` | `int` | `2048` | Minimum RSA key size for client certificates |
| `PkiRootPath` | `string?` | `null` | Override for PKI root directory. Defaults to `%LOCALAPPDATA%\OPC Foundation\pki` |
| `CertificateSubject` | `string?` | `null` | Override for server certificate subject. Defaults to `CN={ServerName}, O=ZB MOM, DC=localhost` |
Example — production deployment with encrypted transport:
```json
"Security": {
"Profiles": ["Basic256Sha256-SignAndEncrypt"],
"AutoAcceptClientCertificates": false,
"RejectSHA1Certificates": true,
"MinimumCertificateKeySize": 2048
}
```
### Redundancy
Controls non-transparent OPC UA redundancy. Defined in `RedundancyConfiguration`. See [Redundancy Guide](Redundancy.md) for detailed usage.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Enabled` | `bool` | `false` | Enables redundancy mode and ServiceLevel computation |
| `Mode` | `string` | `"Warm"` | Redundancy mode: `Warm` or `Hot` |
| `Role` | `string` | `"Primary"` | Instance role: `Primary` (higher ServiceLevel) or `Secondary` |
| `ServerUris` | `List<string>` | `[]` | ApplicationUri values for all servers in the redundant set |
| `ServiceLevelBase` | `int` | `200` | Base ServiceLevel when healthy (1-255). Secondary receives base - 50 |
Example — two-instance redundant pair (Primary):
```json
"Redundancy": {
"Enabled": true,
"Mode": "Warm",
"Role": "Primary",
"ServerUris": ["urn:localhost:LmxOpcUa:instance1", "urn:localhost:LmxOpcUa:instance2"],
"ServiceLevelBase": 200
}
```
## Feature Flags
Three boolean properties act as feature flags that control optional subsystems:
- **`OpcUa.AlarmTrackingEnabled`** -- When `true`, the node manager creates `AlarmConditionState` nodes for alarm attributes and monitors `InAlarm` transitions. Disabled by default because alarm tracking adds per-attribute overhead.
- **`OpcUa.AlarmFilter.ObjectFilters`** -- List of wildcard template-name patterns that scope alarm tracking to matching objects and their descendants. An empty list preserves the current unfiltered behavior; a non-empty list includes an object only when any name in its template derivation chain matches any pattern, then propagates the inclusion to every descendant in the containment hierarchy. `*` is the only wildcard, matching is case-insensitive, and the Galaxy `$` prefix on template names is normalized so operators can write `TestMachine*` instead of `$TestMachine*`. Each list entry may itself contain comma-separated patterns (`"TestMachine*, Pump_*"`) for convenience. When the list is non-empty but `AlarmTrackingEnabled` is `false`, the validator emits a warning because the filter has no effect. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) for the full matching algorithm and telemetry.
- **`Historian.Enabled`** -- When `true`, the service calls `HistorianPluginLoader.TryLoad(config)` to load the `ZB.MOM.WW.OtOpcUa.Historian.Aveva` plugin from the `Historian/` subfolder next to the host exe and registers the resulting `IHistorianDataSource` with the OPC UA server host. Disabled by default because not all deployments have a Historian instance -- when disabled the plugin is not probed and the Wonderware SDK DLLs are not required on the host. If the flag is `true` but the plugin or its SDK dependencies cannot be loaded, the server still starts and every history read returns `BadHistoryOperationUnsupported` with a warning in the log.
- **`GalaxyRepository.ExtendedAttributes`** -- When `true`, the repository loads additional Galaxy attribute metadata beyond the core set needed for the address space. Disabled by default to minimize startup query time.
- **`GalaxyRepository.Scope`** -- When set to `LocalPlatform`, the repository filters the hierarchy and attributes to only include objects hosted by the platform whose `node_name` matches this machine (or the explicit `PlatformName` override). Ancestor areas are retained to keep the browse tree connected. Default is `Galaxy` (load everything). See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter).
## Configuration Validation
`ConfigurationValidator.ValidateAndLog()` runs at the start of `OpcUaService.Start()`. It logs every resolved configuration value at `Information` level and validates required constraints:
- `OpcUa.Port` must be between 1 and 65535
- `OpcUa.GalaxyName` must not be empty
- `MxAccess.ClientName` must not be empty
- `GalaxyRepository.ConnectionString` must not be empty
- `Security.MinimumCertificateKeySize` must be at least 2048
- Unknown security profile names are logged as warnings
- `AutoAcceptClientCertificates = true` emits a warning
- A `Profiles` list containing only `None` emits a warning
- A non-empty `OpcUa.AlarmFilter.ObjectFilters` combined with `OpcUa.AlarmTrackingEnabled = false` emits a warning (the filter has no effect)
- `Historian.ServerName` (or `Historian.ServerNames`) must not be empty when `Historian.Enabled = true`
- `Historian.FailureCooldownSeconds` must be zero or positive
- Setting `Historian.ServerName` alongside a non-empty `Historian.ServerNames` emits a warning (the single ServerName is ignored)
- `MxAccess.RuntimeStatusUnknownTimeoutSeconds` below 5s emits a warning (below the reasonable floor for MxAccess initial-resolution latency)
- `OpcUa.ApplicationUri` must be set when `Redundancy.Enabled = true`
- `Redundancy.ServiceLevelBase` must be between 1 and 255
- `Redundancy.ServerUris` should contain at least 2 entries when enabled
- Local `ApplicationUri` should appear in `Redundancy.ServerUris`
If validation fails, the service throws `InvalidOperationException` and does not start.
## Test Constructor Pattern
`OpcUaService` provides an `internal` constructor that accepts pre-built dependencies instead of loading `appsettings.json`:
```csharp
internal OpcUaService(
AppConfiguration config,
IMxProxy? mxProxy,
IGalaxyRepository? galaxyRepository,
IMxAccessClient? mxAccessClientOverride = null,
bool hasMxAccessClientOverride = false)
```
Integration tests use this constructor to inject substitute implementations of `IMxProxy`, `IGalaxyRepository`, and `IMxAccessClient`, bypassing the STA thread, COM interop, and SQL Server dependencies. The `hasMxAccessClientOverride` flag tells the service to use the injected `IMxAccessClient` directly instead of creating one from the `IMxProxy` on the STA thread.
## Example appsettings.json
Minimal example:
```json
{
"OpcUa": {
"BindAddress": "0.0.0.0",
"Port": 4840,
"EndpointPath": "/LmxOpcUa",
"ServerName": "LmxOpcUa",
"GalaxyName": "ZB",
"MaxSessions": 100,
"SessionTimeoutMinutes": 30,
"AlarmTrackingEnabled": false,
"AlarmFilter": {
"ObjectFilters": []
},
"ApplicationUri": null
"Serilog": { "MinimumLevel": "Information" },
"Node": {
"NodeId": "node-dev-a",
"ClusterId": "cluster-dev",
"ConfigDbConnectionString": "Server=localhost,14330;Database=OtOpcUaConfig;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;",
"LocalCachePath": "config_cache.db"
},
"MxAccess": {
"ClientName": "LmxOpcUa",
"NodeName": null,
"GalaxyName": null,
"ReadTimeoutSeconds": 5,
"WriteTimeoutSeconds": 5,
"MaxConcurrentOperations": 10,
"MonitorIntervalSeconds": 5,
"AutoReconnect": true,
"ProbeTag": null,
"ProbeStaleThresholdSeconds": 60,
"RuntimeStatusProbesEnabled": true,
"RuntimeStatusUnknownTimeoutSeconds": 15,
"RequestTimeoutSeconds": 30
},
"GalaxyRepository": {
"ConnectionString": "Server=localhost;Database=ZB;Integrated Security=true;",
"ChangeDetectionIntervalSeconds": 30,
"CommandTimeoutSeconds": 30,
"ExtendedAttributes": false,
"Scope": "Galaxy",
"PlatformName": null
},
"Dashboard": {
"Enabled": true,
"Port": 8081,
"RefreshIntervalSeconds": 10
},
"Historian": {
"Enabled": false,
"ServerName": "localhost",
"ServerNames": [],
"FailureCooldownSeconds": 60,
"IntegratedSecurity": true,
"UserName": null,
"Password": null,
"Port": 32568,
"CommandTimeoutSeconds": 30,
"RequestTimeoutSeconds": 60,
"MaxValuesPerRead": 10000
},
"Authentication": {
"AllowAnonymous": true,
"AnonymousCanWrite": true,
"Ldap": {
"Enabled": false
}
},
"Security": {
"Profiles": ["None"],
"AutoAcceptClientCertificates": true,
"RejectSHA1Certificates": true,
"MinimumCertificateKeySize": 2048,
"PkiRootPath": null,
"CertificateSubject": null
},
"Redundancy": {
"Enabled": false,
"Mode": "Warm",
"Role": "Primary",
"ServerUris": [],
"ServiceLevelBase": 200
"OpcUaServer": {
"EndpointUrl": "opc.tcp://0.0.0.0:4840/OtOpcUa",
"ApplicationUri": "urn:node-dev-a:OtOpcUa",
"SecurityProfile": "None",
"AutoAcceptUntrustedClientCertificates": true,
"Ldap": { "Enabled": false }
}
}
```
### OtOpcUa Admin — `src/ZB.MOM.WW.OtOpcUa.Admin/appsettings.json`
| Section | Purpose |
|---|---|
| `ConnectionStrings:ConfigDb` | SQL connection string — must point at the same Config DB every Server reaches. |
| `Authentication:Ldap` | LDAP bind for the Admin login form (same options shape as the Server's `OpcUaServer:Ldap`). |
| `CertTrust` | `CertTrustOptions` — file-system path under the Server's `PkiStoreRoot` so the Admin Certificates page can promote rejected client certs. |
| `Metrics:Prometheus:Enabled` | Toggles the `/metrics` scrape endpoint (default true). |
| `Serilog` | Logging. |
### Galaxy.Host
Environment-variable driven (`OTOPCUA_GALAXY_PIPE`, `OTOPCUA_ALLOWED_SID`, `OTOPCUA_GALAXY_SECRET`, `OTOPCUA_GALAXY_BACKEND`, `OTOPCUA_GALAXY_ZB_CONN`, `OTOPCUA_HISTORIAN_*`). No `appsettings.json` — the supervisor owns the launch environment. See [`ServiceHosting.md`](ServiceHosting.md#galaxyhost-process).
### Environment overrides
Standard .NET config layering applies: `appsettings.{Environment}.json`, then environment variables with `Section__Property` naming. `DOTNET_ENVIRONMENT` (or `ASPNETCORE_ENVIRONMENT` for Admin) selects the overlay.
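A hedged sketch of the same layering for the v2 processes (keys taken from the bootstrap tables above, values illustrative); note that nested sections use one double underscore per level:

```shell
# v2 bootstrap overrides; nested sections use one __ per level.
export Node__ClusterId=cluster-prod        # overrides Node.ClusterId
export OpcUaServer__Ldap__Enabled=true     # overrides OpcUaServer:Ldap:Enabled
export ASPNETCORE_ENVIRONMENT=Production   # overlay selector for the Admin process
env | grep -E '^(Node|OpcUaServer)__' | sort
```

This is how deployment automation typically injects per-node identity without touching the checked-in `appsettings.json`.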
---
## Authoritative configuration (Config DB)
The Config DB is the single source of truth for every setting that a v1 deployment used to carry in `appsettings.json` as driver-specific state. `OtOpcUaConfigDbContext` (`src/ZB.MOM.WW.OtOpcUa.Configuration/OtOpcUaConfigDbContext.cs`) is the EF Core context used by both the Admin writer and every Server reader.
### Top-level sections operators touch
| Concept | Entity | Admin UI surface | Purpose |
|---|---|---|---|
| Cluster | `ServerCluster` | Clusters pages | Fleet unit; owns nodes, generations, UNS, ACLs. |
| Cluster node | `ClusterNode` + `ClusterNodeCredential` | RedundancyTab, Hosts page | Per-node identity, `RedundancyRole`, `ServiceLevelBase`, ApplicationUri, service-account credentials. |
| Generation | `ConfigGeneration` + `ClusterNodeGenerationState` | Generations / DiffViewer | Append-only; draft → publish workflow (`sp_PublishGeneration`). |
| Namespace | `Namespace` | Namespaces tab | Per-cluster OPC UA namespace; `Kind` = Equipment / SystemPlatform / Simulated. |
| Driver instance | `DriverInstance` | Drivers tab | Configured driver (Modbus, S7, OpcUaClient, Galaxy, …) + `DriverConfig` JSON + resilience profile. |
| Device | `Device` | Under each driver instance | Per-host settings inside a driver instance (IP, port, unit-id…). |
| UNS hierarchy | `UnsArea` + `UnsLine` | UnsTab (drag/drop) | L3 / L4 of the unified namespace. |
| Equipment | `Equipment` | Equipment pages, CSV import | L5; carries `MachineCode`, `ZTag`, `SAPID`, `EquipmentUuid`, reservation-backed external ids. |
| Tag | `Tag` | Under each equipment | Driver-specific tag address + `SecurityClassification` + poll-group assignment. |
| Poll group | `PollGroup` | Driver-scoped | Poll cadence buckets; `PollGroupEngine` in Core.Abstractions uses this at runtime. |
| ACL | `NodeAcl` | AclsTab + Probe dialog | Per-level permission grants, additive only. See [`security.md`](security.md#data-plane-authorization). |
| Role grant | `LdapGroupRoleMapping` | RoleGrants page | Maps LDAP groups → Admin roles (`ConfigViewer` / `ConfigEditor` / `FleetAdmin`). |
| External id reservation | `ExternalIdReservation` | Reservations page | Reservation-backed `ZTag` and `SAPID` uniqueness. |
| Equipment import batch | `EquipmentImportBatch` | CSV import flow | Staged bulk-add with validation preview. |
| Audit log | `ConfigAuditLog` | Audit page | Append-only record of every publish, rollback, credential rotation, role-grant change. |
### Draft → publish generation model
All edits go into a **draft** generation scoped to one cluster. `DraftValidationService` checks invariants (same-cluster FKs, reservation collisions, UNS path consistency, ACL scope validity). When the operator clicks Publish, `sp_PublishGeneration` atomically promotes the draft, records the audit event, and causes every `RedundancyCoordinator.RefreshAsync` in the affected cluster to pick up the new topology + ACL set. The Admin UI `DiffViewer` shows exactly what's changing before publish.
Old generations are retained; rollback is "publish older generation as new". `ConfigAuditLog` makes every change auditable by principal + timestamp.
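The publish call itself can be sketched as follows. This is a hedged illustration: `GenerationPublisher`, `ValidateAsync`, the validation-result shape, and the stored-procedure parameter names are assumptions; `sp_PublishGeneration`, `DraftValidationService`, and `OtOpcUaConfigDbContext` are the pieces named above.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Illustrative sketch only — the real Admin writer wires this differently.
public sealed class GenerationPublisher
{
    private readonly OtOpcUaConfigDbContext _db;
    private readonly DraftValidationService _validator; // names from the text; API shape assumed

    public GenerationPublisher(OtOpcUaConfigDbContext db, DraftValidationService validator)
        => (_db, _validator) = (db, validator);

    public async Task PublishAsync(Guid clusterId, Guid draftGenerationId, string principal)
    {
        // Client-side validation first; sp_ValidateDraft re-checks inside the
        // publish transaction, so an invalid draft cannot slip through a race.
        var result = await _validator.ValidateAsync(draftGenerationId);
        if (!result.IsValid)
            throw new InvalidOperationException($"Draft invalid: {string.Join("; ", result.Errors)}");

        // The stored procedure promotes the draft and writes the audit row atomically.
        await _db.Database.ExecuteSqlInterpolatedAsync(
            $"EXEC sp_PublishGeneration @ClusterId={clusterId}, @GenerationId={draftGenerationId}, @Principal={principal}");
    }
}
```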
### Offline cache
Each Server process caches the last-seen published generation in `Node:LocalCachePath` via LiteDB (`LiteDbConfigCache` in `src/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/`). The cache lets a node start without the central DB reachable; once the DB comes back, `NodeBootstrap` syncs to the current generation.
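A minimal sketch of the cache round-trip, assuming LiteDB's standard `LiteDatabase` / `GetCollection` / `Upsert` API. `CachedGeneration` and the collection name are illustrative, not the real `LiteDbConfigCache` schema.

```csharp
using System;
using System.Linq;
using LiteDB;

// Illustrative cache record — not the actual LiteDbConfigCache document shape.
public sealed class CachedGeneration
{
    [BsonId] public int GenerationNumber { get; set; }
    public string PayloadJson { get; set; } = "";
    public DateTime CachedAtUtc { get; set; }
}

public sealed class ConfigCacheSketch : IDisposable
{
    private readonly LiteDatabase _db;

    // localCachePath would come from the Node:LocalCachePath bootstrap key.
    public ConfigCacheSketch(string localCachePath) => _db = new LiteDatabase(localCachePath);

    public void Save(CachedGeneration gen) =>
        _db.GetCollection<CachedGeneration>("generations").Upsert(gen);

    // Returns null when the node has never seen a published generation.
    public CachedGeneration? LoadLatest() =>
        _db.GetCollection<CachedGeneration>("generations")
           .FindAll()
           .OrderByDescending(g => g.GenerationNumber)
           .FirstOrDefault();

    public void Dispose() => _db.Dispose();
}
```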
### Full schema reference
For table columns, indexes, stored procedures, the publish-transaction semantics, and the SQL authorization model (per-node SQL principals + `SESSION_CONTEXT` cluster binding), see [`docs/v2/config-db-schema.md`](v2/config-db-schema.md).
### Admin UI flow
For the draft editor, DiffViewer, CSV import, IdentificationFields, RedundancyTab, AclsTab + Probe-this-permission, RoleGrants, and the SignalR real-time surface, see [`docs/v2/admin-ui.md`](v2/admin-ui.md).
---
## Where did v1 appsettings sections go?
Quick index for operators coming from v1 LmxOpcUa:
| v1 appsettings section | v2 home |
|---|---|
| `OpcUa.Port` / `BindAddress` / `EndpointPath` / `ServerName` | Bootstrap `OpcUaServer:EndpointUrl` + `ApplicationName`. |
| `OpcUa.ApplicationUri` | Config DB `ClusterNode.ApplicationUri`. |
| `OpcUa.MaxSessions` / `SessionTimeoutMinutes` | Bootstrap `OpcUaServer:*` (if exposed) or stack defaults. |
| `OpcUa.AlarmTrackingEnabled` / `AlarmFilter` | Per driver instance in Config DB (alarm surface is capability-driven per `IAlarmSource`). |
| `MxAccess.*` | Galaxy driver instance `DriverConfig` JSON + Galaxy.Host env vars (see [`ServiceHosting.md`](ServiceHosting.md#galaxyhost-process)). |
| `GalaxyRepository.*` | Galaxy driver instance `DriverConfig` JSON + `OTOPCUA_GALAXY_ZB_CONN` env var. |
| `Dashboard.*` | Retired — Admin UI replaces the dashboard. See [`StatusDashboard.md`](StatusDashboard.md). |
| `Historian.*` | Galaxy driver instance `DriverConfig` JSON + `OTOPCUA_HISTORIAN_*` env vars. |
| `Authentication.Ldap.*` | Bootstrap `OpcUaServer:Ldap` (same shape) + Admin `Authentication:Ldap` for the UI login. |
| `Security.*` | Bootstrap `OpcUaServer:SecurityProfile` + `PkiStoreRoot` + `AutoAcceptUntrustedClientCertificates`. |
| `Redundancy.*` | Config DB `ClusterNode.RedundancyRole` + `ServiceLevelBase`. |
---
## Validation
- **Bootstrap**: the process fails fast on missing required keys in `Program.cs` (e.g. `Node:NodeId`, `Node:ClusterId`, `Node:ConfigDbConnectionString` all throw `InvalidOperationException` if unset).
- **Authoritative**: `DraftValidationService` runs on every save; `sp_ValidateDraft` runs as part of `sp_PublishGeneration` so an invalid draft cannot reach any node.
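The fail-fast behavior amounts to a few guard lines at the top of `Program.cs`. A sketch, assuming `Microsoft.Extensions.Configuration`; the `Require` helper is illustrative:

```csharp
using System;
using Microsoft.Extensions.Configuration;

// Illustrative helper — throws the documented InvalidOperationException
// when a required bootstrap key is unset.
static string Require(IConfiguration config, string key) =>
    config[key] ?? throw new InvalidOperationException(
        $"Required bootstrap key '{key}' is not set.");

var config = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: true)
    .AddEnvironmentVariables()
    .Build();

// The keys named in the bullet above.
var nodeId     = Require(config, "Node:NodeId");
var clusterId  = Require(config, "Node:ClusterId");
var connString = Require(config, "Node:ConfigDbConnectionString");
```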

---
# Data Type Mapping
Data-type mapping is driver-defined. Each driver translates its native attribute metadata into two driver-agnostic enums from `Core.Abstractions` — `DriverDataType` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverDataType.cs`) and `SecurityClassification` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/SecurityClassification.cs`) — and populates the `DriverAttributeInfo` record it hands to `IAddressSpaceBuilder.Variable(...)`. Core doesn't interpret the native types; it trusts the driver's translation.
## DriverDataType → OPC UA built-in type
`DriverNodeManager.MapDataType` (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`) is the single translation table for every driver:
| DriverDataType | OPC UA NodeId |
|---|---|
| `Boolean` | `DataTypeIds.Boolean` (i=1) |
| `Int32` | `DataTypeIds.Int32` (i=6) |
| `Float32` | `DataTypeIds.Float` (i=10) |
| `Float64` | `DataTypeIds.Double` (i=11) |
| `String` | `DataTypeIds.String` (i=12) |
| `DateTime` | `DataTypeIds.DateTime` (i=13) |
| anything else | `DataTypeIds.BaseDataType` |
The enum also carries `Int16 / Int64 / UInt16 / UInt32 / UInt64 / Reference` members for drivers that need them; the mapping table is extended as those types surface in actual drivers. `Reference` is the Galaxy-style attribute reference — it's encoded as an OPC UA `String` on the wire.
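The table above collapses to a single switch expression. A sketch: `NodeId` and `DataTypeIds` come from the OPC UA .NET stack, the method signature is assumed, and the abridged local enum stands in for the real `DriverDataType`.

```csharp
using Opc.Ua;

// Abridged local stand-in for Core.Abstractions' DriverDataType.
public enum DriverDataType
{ Boolean, Int16, Int32, Int64, UInt16, UInt32, UInt64, Float32, Float64, String, DateTime, Reference }

internal static class DataTypeMapSketch
{
    // Plausible shape of DriverNodeManager.MapDataType — one arm per table row.
    public static NodeId Map(DriverDataType type) => type switch
    {
        DriverDataType.Boolean  => DataTypeIds.Boolean,   // i=1
        DriverDataType.Int32    => DataTypeIds.Int32,     // i=6
        DriverDataType.Float32  => DataTypeIds.Float,     // i=10
        DriverDataType.Float64  => DataTypeIds.Double,    // i=11
        DriverDataType.String   => DataTypeIds.String,    // i=12
        DriverDataType.DateTime => DataTypeIds.DateTime,  // i=13
        _                       => DataTypeIds.BaseDataType,
    };
}
```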
## Per-driver mappers
Each driver owns its native → `DriverDataType` translation:
- **Galaxy Proxy** — `GalaxyProxyDriver.MapDataType(int mxDataType)` and `MapSecurity(int mxSec)` (inline in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/GalaxyProxyDriver.cs`). The Galaxy `mx_data_type` integer is sent across the Host↔Proxy pipe and mapped on the Proxy side. Galaxy's full classic 16-entry table (Boolean / Integer / Float / Double / String / Time / ElapsedTime / Reference / Enumeration / Custom / InternationalizedString) is preserved but compressed into the seven-entry `DriverDataType` enum — `ElapsedTime` → `Float64`, `InternationalizedString` → `String`, `Reference` → `Reference`, enumerations → `Int32`.
- **AB CIP** — `src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipDataType.cs` maps CIP tag type codes.
- **Modbus** — `src/ZB.MOM.WW.OtOpcUa.Driver.Modbus/ModbusDriver.cs` maps register shapes (16-bit signed, 16-bit unsigned, 32-bit float, etc.) including the DirectLogic quirk table in `DirectLogicAddress.cs`.
- **S7 / AB Legacy / TwinCAT / FOCAS / OPC UA Client** — each has its own inline mapper or `*DataType.cs` file per the same pattern.
The driver's mapping is authoritative — when a field type is ambiguous (a `LREAL` that could be bit-reinterpreted, a BCD counter, a string of a particular encoding), the driver decides the exposed OPC UA shape.
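A hedged sketch of the Galaxy-side compression. The `mx_data_type` integers are the classic Galaxy assignments (1 Boolean, 2 Integer, 3 Float, 4 Double, 5 String, 6 Time, 7 ElapsedTime, 8 Reference, 13 Enumeration, 15 InternationalizedString); the enum below is an abridged local copy of `DriverDataType` for illustration.

```csharp
// Abridged local stand-in for Core.Abstractions' DriverDataType.
public enum DriverDataType
{ Boolean, Int16, Int32, Int64, UInt16, UInt32, UInt64, Float32, Float64, String, DateTime, Reference }

public static class GalaxyTypeMapSketch
{
    // Plausible shape of GalaxyProxyDriver.MapDataType.
    public static DriverDataType Map(int mxDataType) => mxDataType switch
    {
        1  => DriverDataType.Boolean,
        2  => DriverDataType.Int32,     // Integer
        3  => DriverDataType.Float32,   // Float
        4  => DriverDataType.Float64,   // Double
        5  => DriverDataType.String,
        6  => DriverDataType.DateTime,  // Time
        7  => DriverDataType.Float64,   // ElapsedTime compresses to Float64
        8  => DriverDataType.Reference, // Galaxy attribute reference
        13 => DriverDataType.Int32,     // Enumeration
        15 => DriverDataType.String,    // InternationalizedString
        _  => DriverDataType.String,    // Custom / unknown
    };
}
```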
## Array handling
`DriverAttributeInfo.IsArray = true` flips `ValueRank = OneDimension` on the generated `BaseDataVariableState`; scalars stay at `ValueRanks.Scalar`. `DriverAttributeInfo.ArrayDim` carries the declared length. Writing element-by-element (OPC UA `IndexRange`) is a driver-level decision — see `docs/ReadWriteOperations.md`.
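The wiring could look like this — a sketch in which `BaseDataVariableState`, `ValueRanks`, and `ReadOnlyList<uint>` are OPC UA .NET stack types, and the nullable `ArrayDim` shape on `DriverAttributeInfo` is an assumption:

```csharp
using System.Collections.Generic;
using Opc.Ua;

internal static class ArrayShapeSketch
{
    // Applies the documented IsArray → ValueRank / ArrayDimensions rule.
    public static void Apply(BaseDataVariableState variable, DriverAttributeInfo attr)
    {
        if (attr.IsArray)
        {
            variable.ValueRank = ValueRanks.OneDimension;
            if (attr.ArrayDim is int dim) // declared length, when the driver knows it
                variable.ArrayDimensions = new ReadOnlyList<uint>(new List<uint> { (uint)dim });
        }
        else
        {
            variable.ValueRank = ValueRanks.Scalar;
        }
    }
}
```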
## SecurityClassification — metadata, not ACL
`SecurityClassification` is driver-reported metadata only. Drivers never enforce write permissions themselves — the classification flows into the Server project where `WriteAuthzPolicy.IsAllowed(classification, userRoles)` (`src/ZB.MOM.WW.OtOpcUa.Server/Security/WriteAuthzPolicy.cs`) gates the write against the session's LDAP-derived roles, and (Phase 6.2) the `AuthorizationGate` + permission trie apply on top. This is the "ACL at server layer" invariant recorded in `feedback_acl_at_server_layer.md`.
The classification values mirror the v1 Galaxy model so existing Galaxy galaxies keep their published semantics:
| SecurityClassification | Required role | Write-from-OPC-UA |
|---|---|---|
| `FreeAccess` | — | yes (even anonymous) |
| `Operate` | `WriteOperate` | yes |
| `Tune` | `WriteTune` | yes |
| `Configure` | `WriteConfigure` | yes |
| `SecuredWrite` | `WriteOperate` | yes |
| `VerifiedWrite` | `WriteConfigure` | yes |
| `ViewOnly` | — | no |
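A sketch of the policy following the table above. The role strings match the Required-role column; the method signature and the local enum copy are illustrative, not the real `WriteAuthzPolicy` surface.

```csharp
using System.Collections.Generic;

// Local stand-in for Core.Abstractions' SecurityClassification.
public enum SecurityClassification
{ FreeAccess, Operate, Tune, Configure, SecuredWrite, VerifiedWrite, ViewOnly }

public static class WriteAuthzPolicySketch
{
    // One arm per table row: FreeAccess needs no role, ViewOnly is never writable.
    public static bool IsAllowed(SecurityClassification c, IReadOnlySet<string> userRoles) => c switch
    {
        SecurityClassification.FreeAccess => true,
        SecurityClassification.ViewOnly   => false,
        SecurityClassification.Operate or
        SecurityClassification.SecuredWrite  => userRoles.Contains("WriteOperate"),
        SecurityClassification.Tune          => userRoles.Contains("WriteTune"),
        SecurityClassification.Configure or
        SecurityClassification.VerifiedWrite => userRoles.Contains("WriteConfigure"),
        _ => false,
    };
}
```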
Drivers whose backend has no notion of classification (Modbus, most PLCs) default every tag to `FreeAccess` or `Operate`; drivers whose backend does carry the notion (Galaxy, OPC UA Client relaying `UserAccessLevel`) translate it directly.
## Historization
`DriverAttributeInfo.IsHistorized = true` flips `AccessLevel.HistoryRead` and `Historizing = true` on the variable. The driver must then implement `IHistoryProvider` for HistoryRead service calls to succeed; otherwise the node manager surfaces `BadHistoryOperationUnsupported` per request.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverDataType.cs` — driver-agnostic type enum
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/SecurityClassification.cs` — write-authz tier metadata
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverAttributeInfo.cs` — per-attribute descriptor
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — `MapDataType` translation
- `src/ZB.MOM.WW.OtOpcUa.Server/Security/WriteAuthzPolicy.cs` — classification-to-role policy
- Per-driver mappers in each `Driver.*` project

---
# Historical Data Access
OPC UA HistoryRead is a **per-driver optional capability** in OtOpcUa. The Core dispatches HistoryRead service calls to the owning driver through the `IHistoryProvider` capability interface (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs`). Drivers that don't implement the interface return `BadHistoryOperationUnsupported` for every history call on their nodes; that is the expected behavior for protocol drivers (Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS) whose wire protocols carry no time-series data.
Historian integration is no longer a separate bolt-on assembly, as it was in v1 (`ZB.MOM.WW.LmxOpcUa.Historian.Aveva` plugin). It is now one optional capability any driver can implement. The first implementation is the Galaxy driver's Wonderware Historian integration; OPC UA Client forwards HistoryRead to the upstream server. Every other driver leaves the capability unimplemented and the Core short-circuits history calls on nodes that belong to those drivers.
## `IHistoryProvider`
Four methods, mapping onto the four OPC UA HistoryRead service variants:
| Method | OPC UA service | Notes |
|--------|----------------|-------|
| `ReadRawAsync` | HistoryReadRawModified (raw subset) | Returns `HistoryReadResult { Samples, ContinuationPoint? }`. The Core handles `ContinuationPoint` pagination. |
| `ReadProcessedAsync` | HistoryReadProcessed | Takes a `HistoryAggregateType` (Average / Minimum / Maximum / Total / Count) and a bucket `interval`. Drivers that can't express an aggregate throw `NotSupportedException`; the Core translates that into `BadAggregateNotSupported`. |
| `ReadAtTimeAsync` | HistoryReadAtTime | Default implementation throws `NotSupportedException` — drivers without interpolation / prior-boundary support leave the default. |
| `ReadEventsAsync` | HistoryReadEvents | Historical alarm/event rows, distinct from the live `IAlarmSource` stream. Default throws; only drivers with an event historian (Galaxy's A&E log) override. |
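Under those descriptions, the capability interface plausibly looks like the sketch below. Only the four method names, the default-throw behavior, and the DTO names are from this document; parameter lists and the fields of `DataValueSnapshot` are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// DTO sketches — DataValueSnapshot's fields and HistoricalEvent's field types are assumed.
public sealed record DataValueSnapshot(DateTime TimestampUtc, object? Value, uint Quality);
public sealed record HistoryReadResult(IReadOnlyList<DataValueSnapshot> Samples, byte[]? ContinuationPoint);
public enum HistoryAggregateType { Average, Minimum, Maximum, Total, Count }
public sealed record HistoricalEvent(string EventId, string? SourceName, DateTime EventTimeUtc,
    DateTime ReceivedTimeUtc, string? Message, int Severity);
public sealed record HistoricalEventsResult(IReadOnlyList<HistoricalEvent> Events, byte[]? ContinuationPoint);

public interface IHistoryProvider
{
    Task<HistoryReadResult> ReadRawAsync(string fullReference, DateTime startUtc, DateTime endUtc,
        int maxValues, byte[]? continuationPoint, CancellationToken ct);

    Task<HistoryReadResult> ReadProcessedAsync(string fullReference, DateTime startUtc, DateTime endUtc,
        HistoryAggregateType aggregate, TimeSpan interval, CancellationToken ct);

    // Defaults throw; the Core translates NotSupportedException into the
    // matching Bad* OPC UA status code.
    Task<HistoryReadResult> ReadAtTimeAsync(string fullReference, IReadOnlyList<DateTime> timesUtc,
        CancellationToken ct) => throw new NotSupportedException();

    Task<HistoricalEventsResult> ReadEventsAsync(string fullReference, DateTime startUtc, DateTime endUtc,
        byte[]? continuationPoint, CancellationToken ct) => throw new NotSupportedException();
}
```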
Supporting DTOs live alongside the interface in `Core.Abstractions`:
- `HistoryReadResult(IReadOnlyList<DataValueSnapshot> Samples, byte[]? ContinuationPoint)`
- `HistoryAggregateType` — enum `{ Average, Minimum, Maximum, Total, Count }`
- `HistoricalEvent(EventId, SourceName?, EventTimeUtc, ReceivedTimeUtc, Message?, Severity)`
- `HistoricalEventsResult(IReadOnlyList<HistoricalEvent> Events, byte[]? ContinuationPoint)`
## Dispatch through `CapabilityInvoker`
All four HistoryRead surfaces are wrapped by `CapabilityInvoker` (`Core/Resilience/CapabilityInvoker.cs`) with `DriverCapability.HistoryRead`. The Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability.HistoryRead)` provides timeout, circuit-breaker, and bulkhead defaults per the driver's stability tier (see [docs/v2/driver-stability.md](v2/driver-stability.md)).
The dispatch point is `DriverNodeManager` in `ZB.MOM.WW.OtOpcUa.Server`. When the OPC UA stack calls `HistoryRead`, the node manager:
1. Resolves the target `NodeHandle` to a `(DriverInstanceId, fullReference)` pair.
2. Checks the owning driver's `DriverTypeMetadata` to see if the type may advertise history at all (fast reject for types that never implement `IHistoryProvider`).
3. If the driver instance implements `IHistoryProvider`, wraps the `ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` call in `CapabilityInvoker.InvokeAsync(... DriverCapability.HistoryRead ...)`.
4. Translates the `HistoryReadResult` into an OPC UA `HistoryData` + `ExtensionObject`.
5. Manages the continuation point via `HistoryContinuationPointManager` so clients can page through large result sets.
Driver-level history code never sees the continuation-point protocol or the OPC UA stack types — those stay in the Core.
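The guard in step 3 can be sketched as the fragment below. This is illustrative only: the `InvokeAsync` signature, the surrounding variable names, and the result field are assumptions; the short-circuit status code is the documented behavior.

```csharp
// Drivers that never implemented the capability short-circuit before any
// backend call is attempted (hypothetical surrounding variables).
if (driverInstance is not IHistoryProvider history)
{
    nodeResult.StatusCode = StatusCodes.BadHistoryOperationUnsupported;
    return;
}

// Otherwise the read runs inside the resilience pipeline keyed on
// (DriverInstanceId, HostName, DriverCapability.HistoryRead).
var rawResult = await capabilityInvoker.InvokeAsync(
    driverInstanceId, hostName, DriverCapability.HistoryRead,
    ct => history.ReadRawAsync(fullReference, startUtc, endUtc, maxValues, continuationPoint, ct));
```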
## Driver coverage
| Driver | Implements `IHistoryProvider`? | Source |
|--------|:------------------------------:|--------|
| Galaxy | Yes — raw, processed, at-time, events | `aahClientManaged` SDK (Wonderware Historian) on the Host side, forwarded through the Proxy's IPC |
| OPC UA Client | Yes — raw, processed, at-time, events (forwarded to upstream) | `Opc.Ua.Client.Session.HistoryRead` against the remote server |
| Modbus | No | Wire protocol has no time-series concept |
| Siemens S7 | No | S7comm has no time-series concept |
| AB CIP | No | CIP has no time-series concept |
| AB Legacy | No | PCCC has no time-series concept |
| TwinCAT | No | ADS symbol reads are point-in-time; archiving is an external concern |
| FOCAS | No | Default — FOCAS has no general-purpose historian API |
## Galaxy — Wonderware Historian (`aahClientManaged`)
The Galaxy driver's `IHistoryProvider` implementation lives on the Host side (`.NET 4.8 x86`) in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Historian/`. The Proxy's `GalaxyProxyDriver.ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` each serializes a `HistoryRead*Request` and awaits the matching `HistoryRead*Response` over the named pipe (see [drivers/Galaxy.md](drivers/Galaxy.md#ipc-transport)).
Host-side, `HistorianDataSource` uses the AVEVA Historian managed SDK (`aahClientManaged.dll`) to query historical data via a cursor-based API through `ArchestrA.HistorianAccess`:
- **`HistoryQuery`** — raw historical samples (timestamp, value, OPC quality)
- **`AnalogSummaryQuery`** — pre-computed aggregates (Average, Minimum, Maximum, ValueCount, First, Last, StdDev)
The SDK DLLs are pulled into the Galaxy.Host project at build time; the Server and every other driver project remain SDK-free.
> **Gap / status note.** The raw SDK wrapper (`HistorianDataSource`, `HistorianClusterEndpointPicker`, `HistorianHealthSnapshot`, etc.) has been ported from the v1 `ZB.MOM.WW.LmxOpcUa.Historian.Aveva` plugin into `Driver.Galaxy.Host/Backend/Historian/`. The **IPC wire-up** — `HistoryReadRequest` / `HistoryReadResponse` message kinds, Proxy-side `ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` forwarding — is in place on `GalaxyProxyDriver`. What remains to close on a given branch is Host-side **mapping of `HistoryAggregateType` onto the `AnalogSummaryQuery` column names** (done in `GalaxyProxyDriver.MapAggregateToColumn`; the Host side must mirror it) and the **end-to-end integration test** that was held by the v1 plugin suite. Until those land on a given driver branch, history calls against Galaxy may surface `GalaxyIpcException { Code = "not-implemented" }` or backend-specific errors rather than populated `HistoryReadResult`s. Track the remaining work against the Phase 2 Galaxy out-of-process gate in `docs/v2/plan.md`.
### Aggregate function mapping
`GalaxyProxyDriver.MapAggregateToColumn` (Proxy-side) translates the OPC UA Part 13 standard aggregate enum onto `AnalogSummaryQuery` column names consumed by `HistorianDataSource.ReadAggregateAsync`:
| `HistoryAggregateType` | Result Property |
|------------------------|-----------------|
| `Average` | `Average` |
| `Minimum` | `Minimum` |
| `Maximum` | `Maximum` |
| `Count` | `ValueCount` |
`HistoryAggregateType.Total` is **not supported** by Wonderware `AnalogSummary` and raises `NotSupportedException`, which the Core translates to `BadAggregateNotSupported`. Additional OPC UA aggregates (`Start`, `End`, `StandardDeviationPopulation`) sit on the Historian columns `First`, `Last`, `StdDev` and can be exposed by extending the enum + mapping together.
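A sketch of the mapping as a switch. The column strings follow the table above and the `Total` behavior follows the text; the method shape and the local enum copy are illustrative.

```csharp
using System;

// Local stand-in for Core.Abstractions' HistoryAggregateType.
public enum HistoryAggregateType { Average, Minimum, Maximum, Total, Count }

public static class AggregateColumnMapSketch
{
    // Plausible shape of GalaxyProxyDriver.MapAggregateToColumn.
    public static string Map(HistoryAggregateType aggregate) => aggregate switch
    {
        HistoryAggregateType.Average => "Average",
        HistoryAggregateType.Minimum => "Minimum",
        HistoryAggregateType.Maximum => "Maximum",
        HistoryAggregateType.Count   => "ValueCount",
        // AnalogSummary has no Total column; the Core turns this into BadAggregateNotSupported.
        _ => throw new NotSupportedException(aggregate.ToString()),
    };
}
```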
### Read-only cluster failover
`HistorianConfiguration.ServerNames` accepts an ordered list of cluster nodes. `HistorianClusterEndpointPicker` iterates the list in configuration order, marks failed nodes with a `FailureCooldownSeconds` window, and re-admits them when the cooldown elapses. One picker instance is shared by the process-values connection and the event-history connection (two SDK silos), so a node failure on one silo immediately benches it for the other. `FailureCooldownSeconds = 0` disables the cooldown — the SDK's own retry semantics are the sole gate.
Host-side cluster health is surfaced via `HistorianHealthSnapshot { NodeCount, HealthyNodeCount, ActiveProcessNode, ActiveEventNode, Nodes }` and forwarded to the Proxy so the Admin UI Historian panel can render a per-node table. `HealthCheckService` flips overall service health to `Degraded` when `HealthyNodeCount < NodeCount`.
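The cooldown behavior described above is pure logic and unit-testable with a fake clock. A hedged sketch — the member names mirror `HistorianClusterEndpointPicker`, but the implementation is illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class CooldownPickerSketch
{
    private readonly object _lock = new();
    private readonly List<string> _nodes;                     // configuration order
    private readonly Dictionary<string, DateTime> _cooldownUntil = new();
    private readonly TimeSpan _cooldown;
    private readonly Func<DateTime> _utcNow;                  // injectable fake clock for tests

    public CooldownPickerSketch(IEnumerable<string> nodes, TimeSpan cooldown, Func<DateTime>? utcNow = null)
    {
        _nodes = nodes.ToList();
        _cooldown = cooldown;
        _utcNow = utcNow ?? (() => DateTime.UtcNow);
    }

    // Ordered iteration: callers try candidates in configuration order.
    // Re-admission is automatic — an elapsed cooldown simply stops filtering the node.
    public IReadOnlyList<string> GetHealthyNodes()
    {
        lock (_lock)
            return _nodes
                .Where(n => !_cooldownUntil.TryGetValue(n, out var until) || _utcNow() >= until)
                .ToList();
    }

    // Zero cooldown disables benching entirely.
    public void MarkFailed(string node)
    {
        lock (_lock)
            if (_cooldown > TimeSpan.Zero)
                _cooldownUntil[node] = _utcNow() + _cooldown;
    }

    public void MarkHealthy(string node)
    {
        lock (_lock) _cooldownUntil.Remove(node);
    }
}
```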
### Runtime health counters
`HistorianDataSource` maintains per-read counters — `TotalQueries`, `TotalSuccesses`, `TotalFailures`, `ConsecutiveFailures`, `LastSuccessTime`, `LastFailureTime`, `LastError`, `ProcessConnectionOpen`, `EventConnectionOpen` — so the dashboard can distinguish "backend loaded but never queried" from "backend loaded and queries are failing". `LastError` is prefixed with the read path (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which silo is broken. `HealthCheckService` degrades at `ConsecutiveFailures >= 3`.
- `ProcessConnectionOpen` / `EventConnectionOpen` — whether the plugin currently holds an open SDK connection on each silo. Read from the data source's `_connection` / `_eventConnection` fields via a `Volatile.Read`.
These fields are read once per dashboard refresh via `IHistorianDataSource.GetHealthSnapshot()` and serialized into `HistorianStatusInfo`. See [Status Dashboard](StatusDashboard.md#historian) for the HTML/JSON surface.
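The counter updates can be sketched as two helpers serialized on a lock. The counter and lock names follow the list above; the helper signatures are illustrative.

```csharp
using System;

var healthLock = new object();
long totalQueries = 0, totalSuccesses = 0, totalFailures = 0, consecutiveFailures = 0;
DateTime? lastSuccessTime = null, lastFailureTime = null;
string? lastError = null;

void RecordSuccess()
{
    lock (healthLock)
    {
        totalQueries++; totalSuccesses++;
        consecutiveFailures = 0;                   // first success resets the latch
        lastError = null;                          // cleared on the next success
        lastSuccessTime = DateTime.UtcNow;
    }
}

void RecordFailure(string readPath, Exception ex)
{
    lock (healthLock)
    {
        totalQueries++; totalFailures++;
        consecutiveFailures++;                     // drives degradation at threshold 3
        lastError = $"{readPath}: {ex.Message}";   // e.g. "raw: timeout"
        lastFailureTime = DateTime.UtcNow;
    }
}

RecordFailure("raw", new Exception("timeout"));
RecordFailure("raw", new Exception("timeout"));
RecordSuccess();
Console.WriteLine($"{totalQueries} queries, {consecutiveFailures} consecutive failures");
// prints "3 queries, 0 consecutive failures"
```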
### Two SDK connection silos
The plugin maintains two independent `ArchestrA.HistorianAccess` connections, one per `HistorianConnectionType`:
- **Process connection** (`ConnectionType = Process`) — serves historical *value* queries: `ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`. This is the SDK's query channel for tags stored in the Historian runtime.
- **Event connection** (`ConnectionType = Event`) — serves historical *event/alarm* queries: `ReadEventsAsync`. The SDK requires a separately opened connection for its event store because the query API and wire schema are distinct from value queries.
Both connections are lazy: they open on the first query that needs them. Either can be open, closed, or open against a different cluster node than the other. The dashboard renders both independently in the Historian panel (`Process Conn: open (host-a) | Event Conn: closed`) so operators can tell which silos are active and which node is serving each. When cluster support is configured, both silos share the same `HistorianClusterEndpointPicker`, so a failure on one silo marks the node unhealthy for the other as well.
## Raw Reads
`IHistorianDataSource.ReadRawAsync` (plugin implementation) uses a `HistoryQuery` to retrieve individual samples within a time range:
1. Create a `HistoryQuery` via `_connection.CreateHistoryQuery()`
2. Configure `HistoryQueryArgs` with `TagNames`, `StartDateTime`, `EndDateTime`, and `RetrievalMode = Full`
3. Iterate: `StartQuery` -> `MoveNext` loop -> `EndQuery`
Each result row is converted to an OPC UA `DataValue`:
- `QueryResult.Value` (double) takes priority; `QueryResult.StringValue` is used as fallback for string-typed tags.
- `SourceTimestamp` and `ServerTimestamp` are both set to `QueryResult.StartDateTime`.
- `StatusCode` is mapped from the `QueryResult.OpcQuality` (UInt16) via `QualityMapper` (the same OPC DA quality byte mapping used for live MXAccess data).
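The row conversion can be sketched like this; a tuple stands in for the OPC UA `DataValue`, and parameter names mirror the `QueryResult` fields described above. Quality-to-status mapping is covered in the Quality Mapping section and is omitted here.

```csharp
using System;

// Tuple stands in for DataValue (illustrative shape only).
(object? Value, DateTime SourceTimestamp, DateTime ServerTimestamp)
    ToDataValue(double? value, string? stringValue, DateTime startDateTime)
{
    // QueryResult.Value (double) takes priority; StringValue is the fallback
    // for string-typed tags.
    object? v = value.HasValue ? (object)value.Value : stringValue;
    // Both timestamps are set to QueryResult.StartDateTime.
    return (v, startDateTime, startDateTime);
}

var t = new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc);
var numeric = ToDataValue(42.0, null, t);
var text = ToDataValue(null, "RUNNING", t);
Console.WriteLine($"{numeric.Value} / {text.Value}"); // 42 / RUNNING
```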
## Aggregate Reads
`IHistorianDataSource.ReadAggregateAsync` (plugin implementation) uses an `AnalogSummaryQuery` to retrieve pre-computed aggregates:
1. Create an `AnalogSummaryQuery` via `_connection.CreateAnalogSummaryQuery()`
2. Configure `AnalogSummaryQueryArgs` with `TagNames`, `StartDateTime`, `EndDateTime`, and `Resolution` (milliseconds)
3. Iterate the same `StartQuery` -> `MoveNext` -> `EndQuery` pattern
4. Extract the requested aggregate from named properties on `AnalogSummaryQueryResult`
Null aggregate values return `BadNoData` status rather than `Good` with a null variant.
## Quality Mapping
The Historian SDK returns standard OPC DA quality values in `QueryResult.OpcQuality` (UInt16). The low byte flows through the shared `QualityMapper` pipeline (`MapFromMxAccessQuality` → `MapToOpcUaStatusCode`), which maps the OPC DA quality families to OPC UA status codes:
| OPC Quality Byte | OPC DA Family | OPC UA StatusCode |
|------------------|---------------|-------------------|
| 0-63 | Bad | `Bad` (with sub-code when an exact enum match exists) |
| 64-191 | Uncertain | `Uncertain` (with sub-code when an exact enum match exists) |
| 192+ | Good | `Good` (with sub-code when an exact enum match exists) |
See `Domain/QualityMapper.cs` and `Domain/Quality.cs` in `Driver.Galaxy.Host` for the full mapping table and sub-code definitions.
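The family split is a pure range check on the low byte. A minimal sketch, with status codes abbreviated to the family name and sub-code selection omitted:

```csharp
using System;

// Family split from the table above; sub-code lookup inside each family omitted.
string MapFamily(ushort opcQuality)
{
    byte q = (byte)(opcQuality & 0xFF);   // only the low byte is mapped
    if (q <= 63) return "Bad";
    if (q <= 191) return "Uncertain";
    return "Good";
}

Console.WriteLine(MapFamily(0));      // Bad
Console.WriteLine(MapFamily(64));     // Uncertain
Console.WriteLine(MapFamily(192));    // Good
Console.WriteLine(MapFamily(0x01C0)); // Good: high byte ignored
```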
## OPC UA Client — upstream forwarding
The OPC UA Client driver (`Driver.OpcUaClient`) implements `IHistoryProvider` by forwarding each call to the upstream server via `Session.HistoryRead`. Raw / processed / at-time / events map onto the stack's native HistoryRead details types. Continuation points are passed through — the Core's `HistoryContinuationPointManager` treats the driver as an opaque pager.
## Aggregate Function Mapping
`HistorianAggregateMap.MapAggregateToColumn` (in the core Host assembly, so the node manager can validate aggregate support without requiring the plugin to be loaded) translates OPC UA aggregate NodeIds to `AnalogSummaryQueryResult` property names:
| OPC UA Aggregate | Result Property |
|---|---|
| `AggregateFunction_Average` | `Average` |
| `AggregateFunction_Minimum` | `Minimum` |
| `AggregateFunction_Maximum` | `Maximum` |
| `AggregateFunction_Count` | `ValueCount` |
| `AggregateFunction_Start` | `First` |
| `AggregateFunction_End` | `Last` |
| `AggregateFunction_StandardDeviationPopulation` | `StdDev` |
Unsupported aggregates return `null`, which causes the node manager to return `BadAggregateNotSupported`.
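The mapping reduces to a dictionary lookup that returns `null` for anything unmapped. A sketch; the string keys are illustrative stand-ins for the OPC UA aggregate NodeId constants.

```csharp
using System;
using System.Collections.Generic;

// MapAggregateToColumn sketched as a NodeId → property-name lookup.
var map = new Dictionary<string, string>
{
    ["AggregateFunction_Average"] = "Average",
    ["AggregateFunction_Minimum"] = "Minimum",
    ["AggregateFunction_Maximum"] = "Maximum",
    ["AggregateFunction_Count"]   = "ValueCount",
    ["AggregateFunction_Start"]   = "First",
    ["AggregateFunction_End"]     = "Last",
    ["AggregateFunction_StandardDeviationPopulation"] = "StdDev",
};

string? MapAggregateToColumn(string aggregate) =>
    map.TryGetValue(aggregate, out var column) ? column : null; // null → BadAggregateNotSupported

Console.WriteLine(MapAggregateToColumn("AggregateFunction_Count"));                    // ValueCount
Console.WriteLine(MapAggregateToColumn("AggregateFunction_Interpolative") ?? "null"); // null
```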
## HistoryReadRawModified Override
`LmxNodeManager` overrides `HistoryReadRawModified` to handle raw history read requests:
1. Resolve the `NodeHandle` to a tag reference via `_nodeIdToTagReference`. Return `BadNodeIdUnknown` if not found.
2. Check that `_historianDataSource` is not null. Return `BadHistoryOperationUnsupported` if historian is disabled.
3. Call `ReadRawAsync` with the time range and `NumValuesPerNode` from the `ReadRawModifiedDetails`.
4. Pack the resulting `DataValue` list into a `HistoryData` object and wrap it in an `ExtensionObject` for the `HistoryReadResult`.
## HistoryReadProcessed Override
`HistoryReadProcessed` handles aggregate history requests with additional validation:
1. Resolve the node and check historian availability (same as raw).
2. Validate that `AggregateType` is present in the `ReadProcessedDetails`. Return `BadAggregateListMismatch` if empty.
3. Map the requested aggregate to a result property via `MapAggregateToColumn`. Return `BadAggregateNotSupported` if unmapped.
4. Call `ReadAggregateAsync` with the time range, `ProcessingInterval`, and property name.
5. Return results in the same `HistoryData` / `ExtensionObject` format.
## Historizing Flag and AccessLevel
During variable node creation, drivers that advertise history set:
```csharp
if (attr.IsHistorized)
{
    variable.AccessLevel |= AccessLevels.HistoryRead;
    variable.Historizing = attr.IsHistorized;
}
```
- **`Historizing = true`** — tells OPC UA clients that the node has historical data available.
- **`AccessLevels.HistoryRead`** — enables the `HistoryRead` access bit. The OPC UA stack checks this bit before routing history requests to the Core dispatcher; nodes without it are rejected before reaching `IHistoryProvider`.
The `IsHistorized` flag originates in the driver's discovery output. For Galaxy it comes from the repository query detecting a `HistoryExtension` primitive (see [drivers/Galaxy-Repository.md](drivers/Galaxy-Repository.md)). For OPC UA Client it is copied from the upstream server's `Historizing` property.
## Configuration
Driver-specific historian config lives in each driver's `DriverConfig` JSON blob, validated against the driver type's `DriverConfigJsonSchema` in `DriverTypeRegistry`. The Galaxy driver's historian section carries the fields exercised by `HistorianConfiguration` — `ServerName` / `ServerNames`, `FailureCooldownSeconds`, `IntegratedSecurity` / `UserName` / `Password`, `Port` (default `32568`), `CommandTimeoutSeconds`, `RequestTimeoutSeconds`, `MaxValuesPerRead`. The OPC UA Client driver inherits its timeouts from the upstream session.
See [Configuration.md](Configuration.md) for the schema shape and validation path.
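An illustrative fragment of the historian section, assuming a top-level `Historian` property name; the field names are the ones listed above, the values (and the section name itself) are placeholders, and only `Port`'s default of `32568` is taken from the schema.

```json
{
  "Historian": {
    "ServerNames": ["hist-a.example.local", "hist-b.example.local"],
    "FailureCooldownSeconds": 60,
    "IntegratedSecurity": false,
    "UserName": "opcua_reader",
    "Password": "<secret>",
    "Port": 32568,
    "CommandTimeoutSeconds": 30,
    "RequestTimeoutSeconds": 30,
    "MaxValuesPerRead": 10000
  }
}
```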

# Incremental Sync
When a Galaxy redeployment is detected, the OPC UA address space must be updated to reflect the new hierarchy and attributes. Rather than tearing down the entire address space and rebuilding from scratch (which disconnects all clients and drops all subscriptions), `LmxNodeManager` performs an incremental sync that identifies changed objects and rebuilds only the affected subtrees.
Two distinct change-detection paths feed the running server: driver-backend rediscovery (Galaxy's `time_of_last_deploy`, TwinCAT's symbol-version-changed, OPC UA Client's upstream namespace change) and generation-level config publishes from the Admin UI. Both flow into re-runs of `ITagDiscovery.DiscoverAsync`, but they originate differently.
## Driver-backend rediscovery — IRediscoverable
Drivers whose backend has a native change signal implement `IRediscoverable` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IRediscoverable.cs`):
```csharp
public interface IRediscoverable
{
    event EventHandler<RediscoveryEventArgs>? OnRediscoveryNeeded;
}
public sealed record RediscoveryEventArgs(string Reason, string? ScopeHint);
```
The driver fires the event with a reason string (for the diagnostic log) and an optional scope hint — a non-null hint lets Core scope the rebuild surgically to that subtree; null means "the whole address space may have changed".
Drivers that implement the capability today:
- **Galaxy** — polls `galaxy.time_of_last_deploy` in the Galaxy repository DB and fires on change. This is Galaxy-internal change detection, not the platform-wide mechanism.
- **TwinCAT** — observes ADS symbol-version-changed notifications (`0x0702`).
- **OPC UA Client** — subscribes to the upstream server's `Server/NamespaceArray` change notifications.
Static drivers (Modbus, S7, AB CIP, AB Legacy, FOCAS) do not implement `IRediscoverable` — their tags only change when a new generation is published from the Config DB. Core sees absence of the interface and skips change-detection wiring for those drivers (decision #54).
## Config-DB generation publishes
Tag-set changes authored in the Admin UI (UNS edits, CSV imports, driver-config edits) accumulate in a draft generation and commit via `sp_PublishGeneration`. The delta between the currently-published generation and the proposed next one is computed by `sp_ComputeGenerationDiff`, which drives:
- The **DiffViewer** in Admin (`src/ZB.MOM.WW.OtOpcUa.Admin/Components/Pages/Clusters/DiffViewer.razor`) so operators can preview what will change before clicking Publish.
- The 409-on-stale-draft flow (decision #161) — a UNS drag-reorder preview carries a `DraftRevisionToken` so Confirm returns `409 Conflict / refresh-required` if the draft advanced between preview and commit.
After publish, the server's generation applier invokes `IDriver.ReinitializeAsync(driverConfigJson, ct)` on every driver whose `DriverInstance.DriverConfig` row changed in the new generation. Reinitialize is the in-process recovery path for Tier A/B drivers; if it fails the driver is marked `DriverState.Faulted` and its nodes go Bad quality — but the server process stays running. See `docs/v2/driver-stability.md`.
Drivers whose discovery depends on Config DB state (Modbus register maps, S7 DBs, AB CIP tag lists) re-run their discovery inside `ReinitializeAsync`; Core then diffs the new node set against the current address space.
## Rebuild flow
When a rediscovery is triggered (by either source), `GenericDriverNodeManager` re-runs `ITagDiscovery.DiscoverAsync` into the same `CapturingBuilder` it used at first build. The new node set is diffed against the current:
1. **Diff** — full-name comparison of the new `DriverAttributeInfo` set against the existing `_variablesByFullRef` map. Added / removed / modified references are partitioned.
2. **Snapshot subscriptions** — before teardown, Core captures the current monitored-item ref-counts for every affected reference so subscriptions can be replayed after rebuild.
3. **Teardown** — removed / modified variable nodes are deleted via `CustomNodeManager2.DeleteNode`. Driver-side subscriptions for those references are unwound via `ISubscribable.UnsubscribeAsync`.
4. **Rebuild** — added / modified references get fresh `BaseDataVariableState` nodes via the standard `IAddressSpaceBuilder.Variable(...)` path. Alarm-flagged references re-register their `IAlarmConditionSink` through `CapturingBuilder`.
5. **Restore subscriptions** — for every captured reference that still exists after rebuild, Core re-opens the driver subscription and restores the original ref-count.
Exceptions during teardown are swallowed per decision #12 — a driver throw must not leave the node tree half-deleted.
## Scope hint
When `RediscoveryEventArgs.ScopeHint` is non-null (e.g. a folder path), Core restricts the diff to that subtree. This matters for Galaxy Platform-scoped deployments where a `time_of_last_deploy` advance may only affect one platform's subtree, and for OPC UA Client where an upstream change may be localized. Null scope falls back to a full-tree diff.
## Active subscriptions survive rebuild
Subscriptions for unchanged references stay live across rebuilds — their ref-count map is not disturbed. Clients monitoring a stable tag never see a data-change gap during a deploy; only clients monitoring a tag that was genuinely removed see the subscription drop.
The sections below describe the Galaxy driver's `LmxNodeManager` implementation of this sync in detail.
## Cached State
`LmxNodeManager` retains shallow copies of the last-published hierarchy and attributes:
```csharp
private List<GalaxyObjectInfo>? _lastHierarchy;
private List<GalaxyAttributeInfo>? _lastAttributes;
```
These are updated at the end of every `BuildAddressSpace` or `SyncAddressSpace` call via `new List<T>(source)` to create independent copies. The copies serve as the baseline for the next diff comparison.
On the first call (when `_lastHierarchy` is null), `SyncAddressSpace` falls through to a full `BuildAddressSpace` since there is no baseline to diff against.
## AddressSpaceDiff
`AddressSpaceDiff` is a static helper class that computes the set of changed Galaxy object IDs between two snapshots.
### FindChangedGobjectIds
This method compares old and new hierarchy+attributes and returns a `HashSet<int>` of gobject IDs that have any difference. It detects three categories of changes:
**Added objects** -- Present in new hierarchy but not in old:
```csharp
foreach (var id in newObjects.Keys)
    if (!oldObjects.ContainsKey(id))
        changed.Add(id);
```
**Removed objects** -- Present in old hierarchy but not in new:
```csharp
foreach (var id in oldObjects.Keys)
    if (!newObjects.ContainsKey(id))
        changed.Add(id);
```
**Modified objects** -- Present in both but with different properties. `ObjectsEqual` compares `TagName`, `BrowseName`, `ContainedName`, `ParentGobjectId`, and `IsArea`.
**Attribute set changes** -- For objects that exist in both snapshots, attributes are grouped by `GobjectId` and compared pairwise. `AttributeSetsEqual` sorts both lists by `FullTagReference` and `PrimitiveName`, then checks each pair via `AttributesEqual`, which compares `AttributeName`, `FullTagReference`, `MxDataType`, `IsArray`, `ArrayDimension`, `PrimitiveName`, `SecurityClassification`, `IsHistorized`, and `IsAlarm`. A difference in count or any field mismatch marks the owning gobject as changed.
Objects already marked as changed by hierarchy comparison are skipped during attribute comparison to avoid redundant work.
### ExpandToSubtrees
When a Galaxy object changes, its children must also be rebuilt because they may reference the parent's node or have inherited attribute changes. `ExpandToSubtrees` performs a BFS traversal from each changed ID, adding all descendants:
```csharp
public static HashSet<int> ExpandToSubtrees(HashSet<int> changed,
    List<GalaxyObjectInfo> hierarchy)
{
    var childrenByParent = hierarchy.GroupBy(h => h.ParentGobjectId)
        .ToDictionary(g => g.Key, g => g.Select(h => h.GobjectId).ToList());
    var expanded = new HashSet<int>(changed);
    var queue = new Queue<int>(changed);
    while (queue.Count > 0)
    {
        var id = queue.Dequeue();
        if (childrenByParent.TryGetValue(id, out var children))
            foreach (var childId in children)
                if (expanded.Add(childId))
                    queue.Enqueue(childId);
    }
    return expanded;
}
```
The expansion runs against both the old and new hierarchy. This is necessary because a removed parent's children appear in the old hierarchy (for teardown) while an added parent's children appear in the new hierarchy (for construction).
## SyncAddressSpace Flow
`SyncAddressSpace` orchestrates the incremental update inside the OPC UA framework `Lock`:
1. **Diff** -- Call `FindChangedGobjectIds` with the cached and new snapshots. If no changes are detected, update the cached snapshots and return early.
2. **Expand** -- Call `ExpandToSubtrees` on both old and new hierarchies to include descendant objects.
3. **Snapshot subscriptions** -- Before teardown, iterate `_gobjectToTagRefs` for each changed gobject ID and record the current MXAccess subscription ref-counts. These are needed to restore subscriptions after rebuild.
4. **Teardown** -- Call `TearDownGobjects` to remove the old nodes and clean up tracking state.
5. **Rebuild** -- Filter the new hierarchy and attributes to only the changed gobject IDs, then call `BuildSubtree` to create the replacement nodes.
6. **Restore subscriptions** -- For each previously subscribed tag reference that still exists in `_tagToVariableNode` after rebuild, re-open the MXAccess subscription and restore the original ref-count.
7. **Update cache** -- Replace `_lastHierarchy` and `_lastAttributes` with shallow copies of the new data.
## TearDownGobjects
`TearDownGobjects` removes all OPC UA nodes and tracking state for a set of gobject IDs:
For each gobject ID, it processes the associated tag references from `_gobjectToTagRefs`:
1. **Unsubscribe** -- If the tag has an active MXAccess subscription (entry in `_subscriptionRefCounts`), call `UnsubscribeAsync` and remove the ref-count entry.
2. **Remove alarm tracking** -- Find any `_alarmInAlarmTags` entries whose `SourceTagReference` matches the tag. For each, unsubscribe the InAlarm, Priority, and DescAttrName tags, then remove the alarm entry.
3. **Delete variable node** -- Call `DeleteNode` on the variable's `NodeId`, remove from `_tagToVariableNode`, clean up `_nodeIdToTagReference` and `_tagMetadata`, and decrement `VariableNodeCount`.
4. **Delete object/folder node** -- Remove the gobject's entry from `_nodeMap` and call `DeleteNode`. Non-folder nodes decrement `ObjectNodeCount`.
All MXAccess calls and `DeleteNode` calls are wrapped in try/catch with ignored exceptions, since teardown must complete even if individual cleanup steps fail.
## BuildSubtree
`BuildSubtree` creates OPC UA nodes for a subset of the Galaxy hierarchy, reusing existing parent nodes from `_nodeMap`.
The method first topologically sorts the input hierarchy (same `TopologicalSort` used by `BuildAddressSpace`) to ensure parents are created before children. For each object:
1. **Find parent** -- Look up `ParentGobjectId` in `_nodeMap`. If the parent was not part of the changed set, it already exists from the previous build. If no parent is found, fall back to the root `ZB` folder. This is the key difference from `BuildAddressSpace` -- subtree builds reuse the existing node tree rather than starting from the root.
2. **Create node** -- Areas become `FolderState` with `Organizes` reference; non-areas become `BaseObjectState` with `HasComponent` reference. The node is added to `_nodeMap`.
3. **Create variable nodes** -- Attributes are processed with the same primitive-grouping logic as `BuildAddressSpace`, creating `BaseDataVariableState` nodes via `CreateAttributeVariable`.
4. **Alarm tracking** -- If `_alarmTrackingEnabled` is set, alarm attributes are detected and `AlarmConditionState` nodes are created using the same logic as the full build. EventNotifier flags are set on parent nodes, and alarm tags are auto-subscribed.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IRediscoverable.cs` — backend-change capability
- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs` — discovery orchestration
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriver.cs` — `ReinitializeAsync` contract
- `src/ZB.MOM.WW.OtOpcUa.Admin/Services/GenerationService.cs` — publish-flow driver
- `docs/v2/config-db-schema.md` — `sp_PublishGeneration` + `sp_ComputeGenerationDiff`
- `docs/v2/admin-ui.md` — DiffViewer + draft-revision-token flow

# MXAccess Bridge
The MXAccess bridge connects the OPC UA server to the AVEVA System Platform runtime through the `ArchestrA.MxAccess` COM API. It handles all COM threading requirements, translates between OPC UA read/write requests and MXAccess operations, and manages connection health.
## STA Thread Requirement
MXAccess is a COM-based API that requires a Single-Threaded Apartment (STA). All COM objects -- `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls -- must execute on the same STA thread. Calling COM objects from the wrong thread causes marshalling failures or silent data corruption.
`StaComThread` provides a dedicated STA thread with the apartment state set before the thread starts:
```csharp
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
```
Work items are queued via `RunAsync(Action)` or `RunAsync<T>(Func<T>)`, which enqueue the work to a `ConcurrentQueue<Action>` and post a `WM_APP` message to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread.
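A stripped-down sketch of the `RunAsync<T>` pattern. An `AutoResetEvent` stands in for the posted `WM_APP` wake-up, and the STA apartment setup is reduced to a comment so the sketch stays portable; the real implementation also runs the message pump described below.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

var mainId = Environment.CurrentManagedThreadId;
var work = new ConcurrentQueue<Action>();
var wake = new AutoResetEvent(false);
var done = false;

// Stand-in for StaComThread: the real thread calls SetApartmentState(ApartmentState.STA)
// before Start and is woken by a posted WM_APP message rather than an event.
var worker = new Thread(() =>
{
    while (!Volatile.Read(ref done))
    {
        wake.WaitOne();
        while (work.TryDequeue(out var item)) item();   // drain the work queue
    }
}) { Name = "MxAccess-STA", IsBackground = true };
worker.Start();

Task<T> RunAsync<T>(Func<T> func)
{
    var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
    work.Enqueue(() =>
    {
        try { tcs.SetResult(func()); }
        catch (Exception ex) { tcs.SetException(ex); }
    });
    wake.Set();
    return tcs.Task;   // awaitable from any thread
}

var workerId = await RunAsync(() => Environment.CurrentManagedThreadId);
Console.WriteLine(workerId != mainId);   // True: the delegate ran on the worker thread
Volatile.Write(ref done, true);
wake.Set();
```

`RunContinuationsAsynchronously` matters here: without it, an awaiting caller's continuation could run inline on the worker thread and stall the queue.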
## Win32 Message Pump
COM callbacks (like `OnDataChange`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump using P/Invoke:
1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
2. `GetMessage` blocks until a message arrives
3. `WM_APP` messages drain the work queue
4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
5. All other messages are passed through `TranslateMessage`/`DispatchMessage` for COM callback delivery
Without this message pump, MXAccess COM callbacks would never fire and the server would receive no live data.
## LMXProxyServer COM Object
`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface. This abstraction allows unit tests to substitute a fake proxy without requiring the ArchestrA runtime.
The COM object lifecycle:
1. **`Register(clientName)`** -- Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, and calls `Register` to obtain a connection handle
2. **`Unregister(handle)`** -- Unwires event handlers, calls `Unregister`, and releases the COM object via `Marshal.ReleaseComObject`
## Register/AddItem/AdviseSupervisory Pattern
Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:
1. **`AddItem(handle, address)`** -- Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
2. **`AdviseSupervisory(handle, itemHandle)`** -- Subscribes the item for supervisory data change callbacks
3. The runtime begins delivering `OnDataChange` events for the item
For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value to the runtime. The `OnWriteComplete` callback confirms or rejects the write.
Cleanup reverses the pattern: `UnAdviseSupervisory` then `RemoveItem`.
## OnDataChange and OnWriteComplete Callbacks
### OnDataChange
Fired by the COM runtime on the STA thread when a subscribed tag value changes. The handler in `MxAccessClient.EventHandlers.cs`:
1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
2. Maps the MXAccess quality code to the internal `Quality` enum
3. Checks `MXSTATUS_PROXY` for error details and adjusts quality accordingly
4. Converts the timestamp to UTC
5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
- The stored per-tag subscription callback
- Any pending one-shot read completions
- The global `OnTagValueChanged` event (consumed by `LmxNodeManager`)
### OnWriteComplete
Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource<bool>` for the item handle. If `MXSTATUS_PROXY.success == 0`, the write is considered failed and the error detail is logged.
## Reconnection Logic
`MxAccessClient` implements automatic reconnection through two mechanisms:
### Monitor loop
`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle:
- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
- If connected and a probe tag is configured, it checks the probe staleness threshold
### Reconnect sequence
`ReconnectAsync` performs a full disconnect-then-connect cycle:
1. Increment the reconnect counter
2. `DisconnectAsync` -- Tears down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detaches COM event handlers, calls `Unregister`, and clears all handle mappings
3. `ConnectAsync` -- Creates a fresh `LMXProxyServer`, registers, replays all stored subscriptions, and re-subscribes the probe tag
Stored subscriptions (`_storedSubscriptions`) persist across reconnects. When `ConnectAsync` succeeds, `ReplayStoredSubscriptionsAsync` iterates all stored entries and calls `AddItem` + `AdviseSupervisory` for each.
## Probe Tag Health Monitoring
A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange` callback.
The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`. If the probe value has not updated within the threshold, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
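The staleness check itself is a one-line predicate; names here mirror the fields above, and the helper signature is illustrative.

```csharp
using System;

// Staleness predicate from the monitor loop: a silent probe past the threshold
// forces a reconnect even though the COM connection still looks alive.
bool ProbeIsStale(DateTime lastProbeValueTimeUtc, DateTime nowUtc, int probeStaleThresholdSeconds) =>
    (nowUtc - lastProbeValueTimeUtc).TotalSeconds > probeStaleThresholdSeconds;

var now = new DateTime(2026, 1, 1, 12, 0, 0, DateTimeKind.Utc);
Console.WriteLine(ProbeIsStale(now.AddSeconds(-30), now, 60)); // False: probe is fresh
Console.WriteLine(ProbeIsStale(now.AddSeconds(-90), now, 60)); // True: force reconnect
```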
## Per-Host Runtime Status Probes (`<Host>.ScanState`)
Separate from the connection-level probe above, the bridge advises `<HostName>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the dashboard can report "this specific Platform / AppEngine is off scan" and the bridge can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MxAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](Configuration.md#mxaccess) for the two config fields.
### How it works
`GalaxyRuntimeProbeManager` is owned by `LmxNodeManager` and operates on a simple three-state machine per host (Unknown / Running / Stopped):
1. **Discovery** — After `BuildAddressSpace` completes, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are bridge-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means **Stopped**.
3. **On-change-only delivery**`ScanState` is delivered **only when the value actually changes**. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped."
5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Without this rollback, a failed subscribe would leave the entry in `Unknown` forever, and `Tick()` would later transition it to `Stopped` after the unknown-resolution timeout, fanning out a **false-negative** host-down signal that invalidates the subtree of a host that was never actually advised. Stability review 2026-04-13 Finding 1.
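Steps 2 and 3 reduce to two small pure functions; the sketch below uses strings for the three states and illustrative signatures.

```csharp
using System;

// Step 2: the transition predicate. Anything other than Good quality + bool true
// (bad quality, comm error, explicit false, non-bool payload) means Stopped.
bool IsRunning(bool qualityIsGood, object? value) =>
    qualityIsGood && value is bool b && b;

// Step 3: the only time-based transition is Unknown → Stopped after the timeout;
// Running entries are never starved by time.
string Tick(string state, DateTime advisedAtUtc, DateTime nowUtc, int unknownTimeoutSeconds) =>
    state == "Unknown" && (nowUtc - advisedAtUtc).TotalSeconds > unknownTimeoutSeconds
        ? "Stopped"
        : state;

var advised = new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc);
Console.WriteLine(IsRunning(true, true));                                  // True
Console.WriteLine(IsRunning(true, 1));                                     // False: non-bool ScanState
Console.WriteLine(Tick("Unknown", advised, advised.AddSeconds(20), 15));   // Stopped
Console.WriteLine(Tick("Running", advised, advised.AddSeconds(9999), 15)); // Running
```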
### Subtree quality invalidation on transition
When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. The reverse happens on **Stopped → Running**: `ClearHostVariablesBadQuality` resets each to `Good` and lets subsequent on-change MxAccess updates repopulate the values.
The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform ends up in **both** the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
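The double-registration described above (a variable landing in both its Engine's and its Platform's list) falls out naturally from walking the hosting chain once per variable. A sketch, with illustrative stand-in types:

```csharp
using System.Collections.Generic;

public sealed class ObjectInfo
{
    public int GobjectId;
    public int? HostedByGobjectId;
    public bool IsPlatformOrEngine;
}

public static class HostedVariableIndex
{
    // Returns hostGobjectId -> variables transitively hosted under that host.
    public static Dictionary<int, List<string>> Build(
        IReadOnlyDictionary<int, ObjectInfo> objects,
        IEnumerable<(int OwnerGobjectId, string VariableNodeId)> variables)
    {
        var map = new Dictionary<int, List<string>>();
        foreach (var (ownerId, nodeId) in variables)
        {
            // Walk the HostedByGobjectId chain, registering the variable under
            // every Platform/Engine ancestor. An Engine-hosted variable inside
            // a Platform therefore appears in BOTH ancestors' lists.
            var current = objects.TryGetValue(ownerId, out var o) ? o : null;
            while (current != null)
            {
                if (current.IsPlatformOrEngine)
                {
                    if (!map.TryGetValue(current.GobjectId, out var list))
                        map[current.GobjectId] = list = new List<string>();
                    list.Add(nodeId);
                }
                current = current.HostedByGobjectId is int parent &&
                          objects.TryGetValue(parent, out var p) ? p : null;
            }
        }
        return map;
    }
}
```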
### Read-path short-circuit (`IsTagUnderStoppedHost`)
`LmxNodeManager.Read` override is called by the OPC UA SDK for both direct Read requests and monitored-item sampling. It previously called `_mxAccessClient.ReadAsync(tagRef)` unconditionally and returned whatever VTQ the runtime reported. That created a gap: MxAccess happily serves the last cached value as Good on a tag whose hosting Engine has gone off scan.
The Read override now checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MxAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MxAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
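The reverse-index check is cheap by design (two dictionary hops, no I/O). A sketch with simplified stand-in shapes for the real members:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class StoppedHostGate
{
    private readonly Dictionary<string, int[]> _hostIdsByTagRef;
    // Stands in for GalaxyRuntimeProbeManager.IsHostStopped.
    private readonly Func<int, bool> _isHostStopped;

    public StoppedHostGate(
        Dictionary<string, int[]> hostIdsByTagRef,
        Func<int, bool> isHostStopped)
    {
        _hostIdsByTagRef = hostIdsByTagRef;
        _isHostStopped = isHostStopped;
    }

    // tagRef -> hosting gobject ids -> is any of them Stopped?
    public bool IsTagUnderStoppedHost(string tagRef)
        => _hostIdsByTagRef.TryGetValue(tagRef, out var hostIds)
           && hostIds.Any(_isHostStopped);
}
```

When this returns true, the caller short-circuits with `BadOutOfService` on the cached value instead of issuing the MxAccess read.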
### Deferred dispatch: the STA deadlock
**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the `LmxNodeManager.Lock`, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern.
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — **outside** any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural `Lock` acquisition. No circular wait, no STA dispatch involvement.
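The enqueue/drain split can be sketched as below. The point is structural: the STA-delivered callback does O(1) lock-free work and signals; only the dispatch thread, outside any lock the STA path can be waiting on, applies the transition. Types and names are illustrative.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class DeferredTransitionQueue
{
    private readonly ConcurrentQueue<(int GobjectId, bool Stopped)> _pending = new();
    private readonly AutoResetEvent _signal;

    public DeferredTransitionQueue(AutoResetEvent dispatchSignal) => _signal = dispatchSignal;

    // Called from the STA OnDataChange path: enqueue and signal, take no locks.
    public void Enqueue(int gobjectId, bool stopped)
    {
        _pending.Enqueue((gobjectId, stopped));
        _signal.Set();
    }

    // Called from the dispatch thread inside its existing 100ms WaitOne loop.
    // markBad / clearBad stand in for MarkHostVariablesBadQuality /
    // ClearHostVariablesBadQuality; the dispatch thread acquires the node
    // manager Lock itself, with no STA involvement.
    public void Drain(Action<int> markBad, Action<int> clearBad)
    {
        while (_pending.TryDequeue(out var transition))
        {
            if (transition.Stopped) markBad(transition.GobjectId);
            else clearBad(transition.GobjectId);
        }
    }
}
```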
See the `runtimestatus.md` plan file and the `service_info.md` entry for the in-flight debugging that led to this pattern.
### Dashboard + health surface
- Dashboard **Galaxy Runtime** panel between Galaxy Info and Historian shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MxAccess transport disconnected).
- Subscriptions panel gains a `Probes: N (bridge-owned runtime status)` line when at least one probe is active, so operators can distinguish bridge-owned probe count from client-driven subscriptions.
- `HealthCheckService.CheckHealth` Rule 2e rolls overall health to `Degraded` when any host is Stopped, ordered after the MxAccess-transport check (Rule 1) so a transport outage stays `Unhealthy` without double-messaging.
See [Status Dashboard](StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](Configuration.md#mxaccess) for the two new config fields.
## Request Timeout Safety Backstop
Every sync-over-async site on the OPC UA stack thread that calls into MxAccess (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). This is a backstop: `MxAccessClient.Read/Write` already enforce inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path. The outer wrapper exists so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because MxAccess clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.
`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
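The abandon-on-timeout behavior can be sketched as a bounded wait that returns rather than blocks forever. This is an illustrative shape, not the project's exact `SyncOverAsync.WaitSync` signature.

```csharp
using System;
using System.Threading.Tasks;

public static class SyncOverAsync
{
    // Returns false on timeout; the caller maps false to StatusCodes.BadTimeout.
    public static bool TryWaitSync<T>(Task<T> task, TimeSpan timeout, out T result)
    {
        // Task.Wait with a timeout unblocks the stack thread even if the
        // async path never completes (scheduler stall, hung reconnect, ...).
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }

        // The task is NOT cancelled: it runs to completion on the thread pool
        // and is abandoned. Observe its eventual exception so a late fault
        // does not become an unobserved-task exception.
        _ = task.ContinueWith(t => _ = t.Exception, TaskScheduler.Default);
        result = default!;
        return false;
    }
}
```

A fault inside the timeout still propagates to the caller (via `Wait`), which is what surfaces driver errors on the normal path.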
## Why Marshal.ReleaseComObject Is Needed
The .NET garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to drop the COM reference count deterministically, so the underlying COM server is freed before a reconnect attempt creates a new instance.
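A sketch of the pattern. The `finally` placement guarantees release even when the unregister call throws; one common variant (shown here as an assumption, not necessarily the project's exact code) loops until the runtime-callable wrapper's count reaches zero.

```csharp
using System.Runtime.InteropServices;

public sealed class ComProxyHolder
{
    private object? _lmxProxy; // the MXAccess COM proxy (RCW)

    public void Unregister()
    {
        try
        {
            // ... unadvise / unregister calls against _lmxProxy ...
        }
        finally
        {
            if (_lmxProxy != null)
            {
                // ReleaseComObject decrements the RCW reference count and
                // returns the remaining count; loop so the underlying COM
                // server is torn down before a reconnect creates a new one.
                while (Marshal.ReleaseComObject(_lmxProxy) > 0) { }
                _lmxProxy = null;
            }
        }
    }
}
```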
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/StaComThread.cs` -- STA thread and Win32 message pump
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.cs` -- Core client class (partial)
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Connection.cs` -- Connect, disconnect, reconnect
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Subscription.cs` -- Subscribe, unsubscribe, replay
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs` -- Read and write operations
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.EventHandlers.cs` -- OnDataChange and OnWriteComplete handlers
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs` -- Background health monitor
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxProxyAdapter.cs` -- COM object wrapper
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` -- Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/GalaxyRuntimeStatus.cs` -- Per-host DTO
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/GalaxyRuntimeState.cs` -- `Unknown` / `Running` / `Stopped` enum
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/IMxAccessClient.cs` -- Client interface

# OPC UA Server
The OPC UA server component hosts the Galaxy-backed namespace on a configurable TCP endpoint and exposes deployed System Platform objects and attributes to OPC UA clients.
The OPC UA server component (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs`) hosts the OPC UA stack and exposes one browsable subtree per registered driver. The server itself is driver-agnostic — Galaxy/MXAccess, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, and OPC UA Client are all plugged in as `IDriver` implementations via the capability interfaces in `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/`.
## Composition
`OtOpcUaServer` subclasses the OPC Foundation `StandardServer` and wires:
- A `DriverHost` (`src/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs`) which registers drivers and holds the per-instance `IDriver` references.
- One `DriverNodeManager` per registered driver (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`), constructed in `CreateMasterNodeManager`. Each manager owns its own namespace URI (`urn:OtOpcUa:{DriverInstanceId}`) and exposes the driver as a subtree under the standard `Objects` folder.
- A `CapabilityInvoker` (`src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs`) per driver instance, keyed on `(DriverInstanceId, HostName, DriverCapability)` against the shared `DriverResiliencePipelineBuilder`. Every Read/Write/Discovery/Subscribe/HistoryRead/AlarmSubscribe call on the driver flows through this invoker so the Polly pipeline (retry / timeout / breaker / bulkhead) applies. The OTOPCUA0001 Roslyn analyzer enforces the wrapping at compile time.
- An `IUserAuthenticator` (LDAP in production, injected stub in tests) for `UserName` token validation in the `ImpersonateUser` hook.
- Optional `AuthorizationGate` + `NodeScopeResolver` (Phase 6.2) that sit in front of every dispatch call. In lax mode the gate passes through when the identity lacks LDAP groups so existing integration tests keep working; strict mode (`Authorization:StrictMode = true`) denies those cases.
`OtOpcUaServer.DriverNodeManagers` exposes the materialized list so the hosting layer can walk each one post-start and call `GenericDriverNodeManager.BuildAddressSpaceAsync(manager)` — the manager is passed as its own `IAddressSpaceBuilder`.
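The per-driver invoker keying described above (one resilience pipeline per `(DriverInstanceId, HostName, DriverCapability)`) can be sketched as a keyed registry. The types here are illustrative stand-ins, not the project's actual `CapabilityInvoker`.

```csharp
using System.Collections.Concurrent;

public enum DriverCapability { Read, Write, Discovery, Subscribe, HistoryRead, AlarmSubscribe }

public sealed class InvokerRegistry
{
    // One pipeline object per key, so retry/timeout/breaker/bulkhead state
    // is isolated per driver instance, per host, per capability.
    private readonly ConcurrentDictionary<
        (string DriverInstanceId, string HostName, DriverCapability Capability), object> _pipelines = new();

    public object GetPipeline(string driverInstanceId, string hostName, DriverCapability capability)
        => _pipelines.GetOrAdd(
            (driverInstanceId, hostName, capability),
            _ => new object() /* built by the shared DriverResiliencePipelineBuilder */);
}
```

Keying on host as well as driver means one misbehaving PLC trips only its own breaker, not the whole driver instance.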
## Configuration
`OpcUaConfiguration` defines the server endpoint and session settings. All properties have sensible defaults:
| Property | Default | Description |
|----------|---------|-------------|
| `BindAddress` | `0.0.0.0` | IP address or hostname the server binds to |
| `Port` | `4840` | TCP port the server listens on |
| `EndpointPath` | `/LmxOpcUa` | URI path appended to the base address |
| `ServerName` | `LmxOpcUa` | Application name presented to clients |
| `GalaxyName` | `ZB` | Galaxy name used in the namespace URI |
| `MaxSessions` | `100` | Maximum concurrent client sessions |
| `SessionTimeoutMinutes` | `30` | Idle session timeout |
| `AlarmTrackingEnabled` | `false` | Enables `AlarmConditionState` nodes for alarm attributes |
| `AlarmFilter.ObjectFilters` | `[]` | Wildcard template-name patterns that scope alarm tracking to matching objects and their descendants (see [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter)) |
The resulting endpoint URL is `opc.tcp://{BindAddress}:{Port}{EndpointPath}`, e.g., `opc.tcp://0.0.0.0:4840/LmxOpcUa`.
Server wiring used to live in `appsettings.json`. It now flows from the SQL Server **Config DB**: `ServerInstance` + `DriverInstance` + `Tag` + `NodeAcl` rows are published as a *generation* via `sp_PublishGeneration` and loaded into the running process by the generation applier. The Admin UI (Blazor Server, `docs/v2/admin-ui.md`) is the operator surface — drafts accumulate edits; `sp_ComputeGenerationDiff` drives the DiffViewer preview; a UNS drag-reorder carries a `DraftRevisionToken` so Confirm re-checks against the current draft and returns 409 if it advanced (decision #161). See `docs/v2/config-db-schema.md` for the schema.
Environmental knobs that aren't per-tenant (bind address, port, PKI path) still live in `appsettings.json` on the Server project; everything tenant-scoped moved to the Config DB.
## Transport
The namespace URI follows the pattern `urn:{GalaxyName}:LmxOpcUa` and is used as the `ProductUri`. The `ApplicationUri` can be set independently via `OpcUa.ApplicationUri` to support redundant deployments where each instance needs a unique identity. When `ApplicationUri` is null, it defaults to the namespace URI.
The server binds one TCP endpoint per `ServerInstance` (default `opc.tcp://0.0.0.0:4840`). The `ApplicationConfiguration` is built programmatically in the `OpcUaApplicationHost` — there are no UA XML files. Security profiles (`None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`) are resolved from the `ServerInstance.Security` JSON at startup; the default profile is still `None` for backward compatibility. User token policies (`Anonymous`, `UserName`) are attached based on whether LDAP is configured. See `docs/security.md` for hardening.
## Programmatic ApplicationConfiguration
`OpcUaServerHost` builds the entire `ApplicationConfiguration` in code. There are no XML configuration files. This keeps deployment simple on factory-floor machines where editing XML is error-prone.
The configuration covers:
- **ServerConfiguration** -- base address, session limits, security policies, and user token policies
- **SecurityConfiguration** -- certificate store paths under `%LOCALAPPDATA%\OPC Foundation\pki\`, auto-accept enabled
- **TransportQuotas** -- 4 MB max message/string/byte-string size, 120-second operation timeout, 1-hour security token lifetime
- **TraceConfiguration** -- OPC Foundation SDK tracing is disabled (output path `null`, trace masks `0`); all logging goes through Serilog instead
## Session impersonation
`OtOpcUaServer.OnImpersonateUser` handles the three token types:
- `AnonymousIdentityToken` → default anonymous `UserIdentity`.
- `UserNameIdentityToken` → `IUserAuthenticator.AuthenticateAsync` validates the credential (`LdapUserAuthenticator` in production). On success, the resolved display name + LDAP-derived roles are wrapped in a `RoleBasedIdentity` that implements `IRoleBearer`. `DriverNodeManager.OnWriteValue` reads these roles via `context.UserIdentity is IRoleBearer` and applies `WriteAuthzPolicy` per write.
- Anything else → `BadIdentityTokenInvalid`.
The Phase 6.2 `AuthorizationGate` runs on top of this baseline: when configured it consults the cluster's permission trie (loaded from `NodeAcl` rows) using the session's `UserAuthorizationState` and can deny Read / HistoryRead / Write / Browse independently per tag. See `docs/v2/acl-design.md`.
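The three-way token dispatch in `OnImpersonateUser` can be sketched as below. All types here are simplified stand-ins, not the OPC Foundation SDK classes.

```csharp
using System;
using System.Threading.Tasks;

public abstract record IdentityToken;
public sealed record AnonymousToken : IdentityToken;
public sealed record UserNameToken(string UserName, string Password) : IdentityToken;

public interface IUserAuthenticatorStub
{
    Task<(string DisplayName, string[] Roles)> AuthenticateAsync(string user, string password);
}

public sealed class ImpersonationHook
{
    private readonly IUserAuthenticatorStub _authenticator;
    public ImpersonationHook(IUserAuthenticatorStub authenticator) => _authenticator = authenticator;

    public (string Name, string[] Roles) Resolve(IdentityToken token) => token switch
    {
        AnonymousToken => ("Anonymous", Array.Empty<string>()),
        UserNameToken u => Validate(u),
        // Any other token type is rejected outright.
        _ => throw new InvalidOperationException("BadIdentityTokenInvalid"),
    };

    private (string, string[]) Validate(UserNameToken u)
    {
        // LDAP in production; the roles returned here are what feed the
        // RoleBasedIdentity / IRoleBearer write-authorization path.
        var (name, roles) = _authenticator.AuthenticateAsync(u.UserName, u.Password)
                                          .GetAwaiter().GetResult();
        return (name, roles);
    }
}
```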
## Dispatch
Every service call the stack hands to `DriverNodeManager` is translated to the driver's capability interface and routed through `CapabilityInvoker`:
| Service | Capability | Invoker method |
|---|---|---|
| Read | `IReadable.ReadAsync` | `ExecuteAsync(DriverCapability.Read, host, …)` |
| Write | `IWritable.WriteAsync` | `ExecuteWriteAsync(host, isIdempotent, …)` — honors `WriteIdempotentAttribute` (#143) |
| CreateMonitoredItems / DeleteMonitoredItems | `ISubscribable.SubscribeAsync/UnsubscribeAsync` | `ExecuteAsync(DriverCapability.Subscribe, host, …)` |
| HistoryRead (raw / processed / at-time / events) | `IHistoryProvider.*Async` | `ExecuteAsync(DriverCapability.HistoryRead, host, …)` |
| ConditionRefresh / Acknowledge | `IAlarmSource.*Async` | via `AlarmSurfaceInvoker` (fans out per host) |
The host name fed to the invoker comes from `IPerCallHostResolver.ResolveHost(fullReference)` when the driver implements it (multi-host drivers: AB CIP, Modbus with per-device options). Single-host drivers fall back to `DriverInstanceId`, preserving pre-Phase-6.1 pipeline-key semantics (decision #144).
## Security Profiles
The server supports configurable transport security profiles controlled by the `Security` section in `appsettings.json`. The default configuration exposes only `MessageSecurityMode.None` for backward compatibility.
Supported Phase 1 profiles:
| Profile Name | SecurityPolicy URI | MessageSecurityMode |
|---|---|---|
| `None` | `SecurityPolicy#None` | `None` |
| `Basic256Sha256-Sign` | `SecurityPolicy#Basic256Sha256` | `Sign` |
| `Basic256Sha256-SignAndEncrypt` | `SecurityPolicy#Basic256Sha256` | `SignAndEncrypt` |
`SecurityProfileResolver` maps configured profile names to `ServerSecurityPolicy` instances at startup. Unknown names are skipped with a warning, and an empty or invalid list falls back to `None`.
For production deployments, configure `["Basic256Sha256-SignAndEncrypt"]` or `["None", "Basic256Sha256-SignAndEncrypt"]` and set `AutoAcceptClientCertificates` to `false`. See the [Security Guide](security.md) for hardening details.
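The profile-name resolution rules (skip unknown names with a warning, fall back to `None` when nothing resolves) can be sketched as below, with a tuple standing in for `ServerSecurityPolicy`.

```csharp
using System;
using System.Collections.Generic;

public static class SecurityProfileResolverSketch
{
    private static readonly Dictionary<string, (string PolicyUri, string Mode)> Known = new()
    {
        ["None"] =
            ("http://opcfoundation.org/UA/SecurityPolicy#None", "None"),
        ["Basic256Sha256-Sign"] =
            ("http://opcfoundation.org/UA/SecurityPolicy#Basic256Sha256", "Sign"),
        ["Basic256Sha256-SignAndEncrypt"] =
            ("http://opcfoundation.org/UA/SecurityPolicy#Basic256Sha256", "SignAndEncrypt"),
    };

    public static List<(string PolicyUri, string Mode)> Resolve(
        IEnumerable<string> configuredNames, Action<string> warn)
    {
        var resolved = new List<(string PolicyUri, string Mode)>();
        foreach (var name in configuredNames)
        {
            if (Known.TryGetValue(name, out var policy)) resolved.Add(policy);
            else warn($"Unknown security profile '{name}' skipped"); // skip, don't fail
        }
        if (resolved.Count == 0) resolved.Add(Known["None"]); // empty/invalid -> None
        return resolved;
    }
}
```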
## Redundancy
`Redundancy.Enabled = true` on the `ServerInstance` activates the `RedundancyCoordinator` + `ServiceLevelCalculator` (`src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`). Standard OPC UA redundancy nodes (`Server/ServerRedundancy/RedundancySupport`, `ServerUriArray`, `Server/ServiceLevel`) are populated on startup; `ServiceLevel` recomputes whenever any driver's `DriverHealth` changes. The apply-lease mechanism prevents two instances from concurrently applying a generation. See `docs/Redundancy.md`.
When `Redundancy.Enabled = true`, `LmxOpcUaServer` exposes the standard OPC UA redundancy nodes on startup:
- `Server/ServerRedundancy/RedundancySupport` — set to `Warm` or `Hot` based on configuration
- `Server/ServerRedundancy/ServerUriArray` — populated with the configured `ServerUris`
- `Server/ServiceLevel` — computed dynamically from role and runtime health
The `ServiceLevel` is updated whenever MXAccess connection state changes or Galaxy DB health changes. See [Redundancy Guide](Redundancy.md) for full details.
## Server class hierarchy
### OtOpcUaServer extends StandardServer
- **`CreateMasterNodeManager`** — Iterates `_driverHost.RegisteredDriverIds`, builds one `DriverNodeManager` per driver with its own `CapabilityInvoker` + resilience options (tier from `DriverTypeRegistry`, per-instance JSON overrides from `DriverInstance.ResilienceConfig` via `DriverResilienceOptionsParser`). The managers are wrapped in a `MasterNodeManager` with no additional core managers.
- **`OnServerStarted`** — Hooks `SessionManager.ImpersonateUser` for LDAP auth. Redundancy + server-capability population happens via `OpcUaApplicationHost`.
- **`LoadServerProperties`** — Manufacturer `OtOpcUa`, Product `OtOpcUa.Server`, ProductUri `urn:OtOpcUa:Server`.
### ServerCapabilities
`OpcUaApplicationHost` populates `Server/ServerCapabilities` with `StandardUA2017`, `en` locale, 100 ms `MinSupportedSampleRate`, 4 MB message caps, and per-operation limits (1000 per Read/Write/Browse/TranslateBrowsePaths/MonitoredItems/HistoryRead; 0 for MethodCall/NodeManagement/HistoryUpdate).
### User token policies
`UserTokenPolicies` are dynamically configured based on the `Authentication` settings in `appsettings.json`:
- An `Anonymous` user token policy is added when `AllowAnonymous` is `true` (the default).
- A `UserName` user token policy is added when an authentication provider is configured (LDAP or injected).
Both policies can be active simultaneously, allowing clients to connect with or without credentials.
### Session impersonation
When a client presents `UserName` credentials, the server validates them through `IUserAuthenticationProvider`. If the provider also implements `IRoleProvider` (as `LdapAuthenticationProvider` does), LDAP group membership is resolved once during authentication and mapped to custom OPC UA role `NodeId`s in a dedicated `urn:zbmom:lmxopcua:roles` namespace. These role NodeIds are added to the session's `RoleBasedIdentity.GrantedRoleIds`.
Anonymous sessions receive `WellKnownRole_Anonymous`. Authenticated sessions receive `WellKnownRole_AuthenticatedUser` plus any LDAP-derived role NodeIds. Permission checks in `LmxNodeManager` inspect `GrantedRoleIds` directly — no username extraction or side-channel cache is needed.
`AnonymousCanWrite` controls whether anonymous sessions can write, regardless of whether LDAP is enabled.
## Certificate handling
On startup, `OpcUaServerHost.StartAsync` calls `CheckApplicationInstanceCertificate(false, minKeySize)` to locate or create a self-signed certificate meeting the configured minimum key size (default 2048). The certificate subject defaults to `CN={ServerName}, O=ZB MOM, DC=localhost` but can be overridden via `Security.CertificateSubject`. Certificate stores use the directory-based store type under the configured `Security.PkiRootPath` (default `%LOCALAPPDATA%\OPC Foundation\pki\`):
| Store | Path suffix |
|-------|-------------|
| Own | `pki/own` |
| Trusted issuers | `pki/issuer` |
| Trusted peers | `pki/trusted` |
| Rejected | `pki/rejected` |
`AutoAcceptUntrustedCertificates` is controlled by `Security.AutoAcceptClientCertificates` (default `true`). Set to `false` in production to enforce client certificate trust. When `RejectSHA1Certificates` is `true` (default), client certificates signed with SHA-1 are rejected. Certificate validation events are logged for visibility into accepted and rejected client connections.
## Server class hierarchy
### LmxOpcUaServer extends StandardServer
`LmxOpcUaServer` inherits from the OPC Foundation `StandardServer` base class and overrides two methods:
- **`CreateMasterNodeManager`** -- Instantiates `LmxNodeManager` with the Galaxy namespace URI, the `IMxAccessClient` for runtime I/O, performance metrics, and an optional `IHistorianDataSource` (supplied by the runtime-loaded historian plugin, see [Historical Data Access](HistoricalDataAccess.md)). The node manager is wrapped in a `MasterNodeManager` with no additional core node managers.
- **`OnServerStarted`** -- Configures redundancy, history capabilities, and server capabilities at startup. Called after the server is fully initialized.
- **`LoadServerProperties`** -- Returns server metadata: manufacturer `ZB MOM`, product `LmxOpcUa Server`, and the assembly version as the software version.
### ServerCapabilities
`ConfigureServerCapabilities` populates the `ServerCapabilities` node at startup:
- **ServerProfileArray** -- `StandardUA2017`
- **LocaleIdArray** -- `en`
- **MinSupportedSampleRate** -- 100ms
- **MaxBrowseContinuationPoints** -- 100
- **MaxHistoryContinuationPoints** -- 100
- **MaxArrayLength** -- 65535
- **MaxStringLength / MaxByteStringLength** -- 4MB
- **OperationLimits** -- 1000 nodes per Read/Write/Browse/RegisterNodes/TranslateBrowsePaths/MonitoredItems/HistoryRead; 0 for MethodCall/NodeManagement/HistoryUpdate (not supported)
- **ServerDiagnostics.EnabledFlag** -- `true` (SDK tracks session/subscription counts automatically)
### Session tracking
`LmxOpcUaServer` exposes `ActiveSessionCount` by querying `ServerInternal.SessionManager.GetSessions().Count`. `OpcUaServerHost` surfaces this for status reporting.
## Startup and Shutdown
`OpcUaServerHost.StartAsync` performs the following sequence:
1. Build `ApplicationConfiguration` programmatically
2. Validate the configuration via `appConfig.Validate(ApplicationType.Server)`
3. Create `ApplicationInstance` and check/create the application certificate
4. Instantiate `LmxOpcUaServer` and start it via `ApplicationInstance.Start`
`OpcUaServerHost.Stop` calls `_server.Stop()` and nulls both the server and application instance references. The class implements `IDisposable`, delegating to `Stop`.
`Security.AutoAcceptClientCertificates` (default `true`) and `RejectSHA1Certificates` (default `true`) are honored. The server certificate is always created — even for `None`-only deployments — because `UserName` token encryption needs it.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OpcUaServerHost.cs` -- Application lifecycle and programmatic configuration
- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/LmxOpcUaServer.cs` -- StandardServer subclass and node manager creation
- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/SecurityProfileResolver.cs` -- Profile-name to ServerSecurityPolicy mapping
- `src/ZB.MOM.WW.OtOpcUa.Host/Configuration/OpcUaConfiguration.cs` -- Configuration POCO
- `src/ZB.MOM.WW.OtOpcUa.Host/Configuration/SecurityProfileConfiguration.cs` -- Security configuration POCO
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs` — `StandardServer` subclass + `ImpersonateUser` hook
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — per-driver `CustomNodeManager2` + dispatch surface
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs` — programmatic `ApplicationConfiguration` + lifecycle
- `src/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs` — driver registration
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs` — Polly pipeline entry point
- `src/ZB.MOM.WW.OtOpcUa.Core/Authorization/` — Phase 6.2 permission trie + evaluator
- `src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs` — stack-to-evaluator bridge

docs/README.md (new file)
# OtOpcUa documentation
Two tiers of documentation live here:
- **Current reference** at the top level (`docs/*.md`) — describes what's shipped today. Start here for operator + integrator reference.
- **Implementation history + design notes** at `docs/v2/*.md` — the authoritative plan + decision log the current reference is built from. Start here when you need the *why* behind an architectural choice, or when a top-level doc says "see plan.md § X".
The project was originally called **LmxOpcUa** (a single-driver Galaxy/MXAccess OPC UA server) and has since become **OtOpcUa**, a multi-driver OPC UA server platform. Any lingering `LmxOpcUa` string in a path you see in the docs is a deliberate residual (executable name `lmxopcua-cli`, client PKI folder `{LocalAppData}/LmxOpcUaClient/`) — fixing those requires migration shims and is tracked as follow-up work.
## Platform overview
- **Core** owns the OPC UA stack, address space, session/security/subscription machinery.
- **Drivers** plug in via capability interfaces in `ZB.MOM.WW.OtOpcUa.Core.Abstractions`: `IDriver`, `IReadable`, `IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`, `IAlarmSource`, `IHistoryProvider`, `IPerCallHostResolver`. Each driver opts into whichever it supports.
- **Server** is the OPC UA endpoint process (net10, x64). Hosts every driver except Galaxy in-process; talks to Galaxy via a named pipe because MXAccess COM is 32-bit-only.
- **Admin** is the Blazor Server operator UI (net10, x64). Owns the Config DB draft/publish flow, ACL + role-grant authoring, fleet status + `/metrics` scrape endpoint.
- **Galaxy.Host** is a .NET Framework 4.8 x86 Windows service that wraps MXAccess COM on an STA thread for the Galaxy driver.
## Where to find what
### Architecture + data-path reference
| Doc | Covers |
|-----|--------|
| [OpcUaServer.md](OpcUaServer.md) | Top-level server architecture — Core, driver dispatch, Config DB, generations |
| [AddressSpace.md](AddressSpace.md) | `GenericDriverNodeManager` + `ITagDiscovery` + `IAddressSpaceBuilder` |
| [ReadWriteOperations.md](ReadWriteOperations.md) | OPC UA Read/Write → `CapabilityInvoker` → `IReadable`/`IWritable` |
| [Subscriptions.md](Subscriptions.md) | Monitored items → `ISubscribable` + per-driver subscription refcount |
| [AlarmTracking.md](AlarmTracking.md) | `IAlarmSource` + `AlarmSurfaceInvoker` + OPC UA alarm conditions |
| [DataTypeMapping.md](DataTypeMapping.md) | Per-driver `DriverAttributeInfo` → OPC UA variable types |
| [IncrementalSync.md](IncrementalSync.md) | Address-space rebuild on redeploy + `sp_ComputeGenerationDiff` |
| [HistoricalDataAccess.md](HistoricalDataAccess.md) | `IHistoryProvider` as a per-driver optional capability |
### Drivers
| Doc | Covers |
|-----|--------|
| [drivers/README.md](drivers/README.md) | Index of the seven shipped drivers + capability matrix |
| [drivers/Galaxy.md](drivers/Galaxy.md) | Galaxy driver — MXAccess bridge, Host/Proxy split, named-pipe IPC |
| [drivers/Galaxy-Repository.md](drivers/Galaxy-Repository.md) | Galaxy-specific discovery via the ZB SQL database |
For Modbus / S7 / AB CIP / AB Legacy / TwinCAT / FOCAS / OPC UA Client specifics, see [v2/driver-specs.md](v2/driver-specs.md).
### Operational
| Doc | Covers |
|-----|--------|
| [Configuration.md](Configuration.md) | appsettings bootstrap + Config DB + Admin UI draft/publish |
| [security.md](security.md) | Transport security profiles, LDAP auth, ACL trie, role grants, OTOPCUA0001 analyzer |
| [Redundancy.md](Redundancy.md) | `RedundancyCoordinator`, `ServiceLevelCalculator`, apply-lease, Prometheus metrics |
| [ServiceHosting.md](ServiceHosting.md) | Three-process deploy (Server + Admin + Galaxy.Host) install/uninstall |
| [StatusDashboard.md](StatusDashboard.md) | Pointer — superseded by [v2/admin-ui.md](v2/admin-ui.md) |
### Client tooling
| Doc | Covers |
|-----|--------|
| [Client.CLI.md](Client.CLI.md) | `lmxopcua-cli` — command-line client |
| [Client.UI.md](Client.UI.md) | Avalonia desktop client |
### Requirements
| Doc | Covers |
|-----|--------|
| [reqs/HighLevelReqs.md](reqs/HighLevelReqs.md) | HLRs — numbered system-level requirements |
| [reqs/OpcUaServerReqs.md](reqs/OpcUaServerReqs.md) | OPC UA server-layer reqs |
| [reqs/ServiceHostReqs.md](reqs/ServiceHostReqs.md) | Per-process hosting reqs |
| [reqs/ClientRequirements.md](reqs/ClientRequirements.md) | Client CLI + UI reqs |
| [reqs/GalaxyRepositoryReqs.md](reqs/GalaxyRepositoryReqs.md) | Galaxy-scoped repository reqs |
| [reqs/MxAccessClientReqs.md](reqs/MxAccessClientReqs.md) | Galaxy-scoped MXAccess reqs |
| [reqs/StatusDashboardReqs.md](reqs/StatusDashboardReqs.md) | Pointer — superseded by Admin UI |
## Implementation history (`docs/v2/`)
Design decisions + phase plans + execution notes. Load-bearing cross-references from the top-level docs:
- [v2/plan.md](v2/plan.md) — authoritative v2 vision doc + numbered decision log (referenced as "decision #N" elsewhere)
- [v2/admin-ui.md](v2/admin-ui.md) — Admin UI spec
- [v2/acl-design.md](v2/acl-design.md) — data-plane ACL + permission-trie design (Phase 6.2)
- [v2/config-db-schema.md](v2/config-db-schema.md) — Config DB schema reference
- [v2/driver-specs.md](v2/driver-specs.md) — per-driver addressing + quirks for every shipped protocol
- [v2/dev-environment.md](v2/dev-environment.md) — dev-box bootstrap
- [v2/test-data-sources.md](v2/test-data-sources.md) — integration-test simulator matrix (includes the pinned libplctag `ab_server` version for AB CIP tests)
- [v2/implementation/phase-*-*.md](v2/implementation/) — per-phase execution plans with exit-gate evidence

# Read/Write Operations
`LmxNodeManager` overrides the OPC UA `Read` and `Write` methods to translate client requests into MXAccess runtime calls. Each override resolves the OPC UA `NodeId` to a Galaxy tag reference, performs the I/O through `IMxAccessClient`, and returns the result with appropriate status codes.
`DriverNodeManager` (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`) wires the OPC UA stack's per-variable `OnReadValue` and `OnWriteValue` hooks to each driver's `IReadable` and `IWritable` capabilities. Every dispatch flows through `CapabilityInvoker` so the Polly pipeline (retry / timeout / breaker / bulkhead) applies uniformly across Galaxy, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, and OPC UA Client drivers.
## Read Override
## OnReadValue
The `Read` override in `LmxNodeManager` intercepts value attribute reads for nodes in the Galaxy namespace.
The hook is registered on every `BaseDataVariableState` created by the `IAddressSpaceBuilder.Variable(...)` call during discovery. When the stack dispatches a Read for a node in this namespace:
### Resolution flow
1. If the driver does not implement `IReadable`, the hook returns `BadNotReadable`.
2. The node's `NodeId.Identifier` is used directly as the driver-side full reference — it matches `DriverAttributeInfo.FullName` registered at discovery time.
3. (Phase 6.2) If an `AuthorizationGate` + `NodeScopeResolver` are wired, the gate is consulted first via `IsAllowed(identity, OpcUaOperation.Read, scope)`. A denied read never hits the driver.
4. The call is wrapped by `_invoker.ExecuteAsync(DriverCapability.Read, ResolveHostFor(fullRef), …)`. The resolved host is `IPerCallHostResolver.ResolveHost(fullRef)` for multi-host drivers; single-host drivers fall back to `DriverInstanceId` (decision #144).
5. The first `DataValueSnapshot` from the batch populates the outgoing `value` / `statusCode` / `timestamp`. An empty batch surfaces `BadNoData`; any exception surfaces `BadInternalError`.
1. The base class `Read` runs first, handling non-value attributes (DisplayName, DataType, etc.) through the standard node manager.
2. For each `ReadValueId` where `AttributeId == Attributes.Value`, the override checks whether the node belongs to this namespace (`NamespaceIndex` match).
3. The string-typed `NodeId.Identifier` is looked up in `_nodeIdToTagReference` to find the corresponding `FullTagReference` (e.g., `DelmiaReceiver_001.DownloadPath`).
4. `_mxAccessClient.ReadAsync(tagRef)` retrieves the current value, timestamp, and quality from MXAccess. The async call is synchronously awaited because the OPC UA SDK `Read` override is synchronous.
5. The returned `Vtq` is converted to a `DataValue` via `CreatePublishedDataValue`, which normalizes array values through `NormalizePublishedValue` (substituting a default typed array when the value is null for array nodes).
6. On success, `errors[i]` is set to `ServiceResult.Good`. On exception, the error is set to `BadInternalError`.
The hook is synchronous — the async invoker call is bridged with `AsTask().GetAwaiter().GetResult()` because the OPC UA SDK's value-hook signature is sync. Idempotent-by-construction reads mean this bridge is safe to retry inside the Polly pipeline.
## OnWriteValue
`OnWriteValue` follows the same shape with two additional concerns: authorization and idempotence.
### Authorization (two layers)
1. **SecurityClassification gate.** Every variable stores its `SecurityClassification` in `_securityByFullRef` at registration time (populated from `DriverAttributeInfo.SecurityClass`). `WriteAuthzPolicy.IsAllowed(classification, userRoles)` runs first, consulting the session's roles via `context.UserIdentity is IRoleBearer`. `FreeAccess` passes anonymously, `ViewOnly` denies everyone, and `Operate / Tune / Configure / SecuredWrite / VerifiedWrite` require `WriteOperate / WriteTune / WriteConfigure` roles respectively. Denial returns `BadUserAccessDenied` without consulting the driver — drivers never enforce ACLs themselves; they only report classification as discovery metadata (feedback `feedback_acl_at_server_layer.md`).
2. **Phase 6.2 permission-trie gate.** When `AuthorizationGate` is wired, it re-runs with the operation derived from `WriteAuthzPolicy.ToOpcUaOperation(classification)`. The gate consults the per-cluster permission trie loaded from `NodeAcl` rows, enforcing fine-grained per-tag ACLs on top of the role-based classification policy. See `docs/v2/acl-design.md`.
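The first gate's classification-to-role mapping can be reduced to a pure function. The sketch below is illustrative only — the enum values and role strings mirror the prose, but the real policy lives in `Security/WriteAuthzPolicy.cs`, also covers `SecuredWrite`/`VerifiedWrite`, and feeds the Phase 6.2 trie gate afterwards:

```csharp
using System;
using System.Collections.Generic;

// Illustrative reduction of the first (classification) gate only.
enum SecurityClassification { FreeAccess, ViewOnly, Operate, Tune, Configure }

static class WriteGateSketch
{
    public static bool IsAllowed(SecurityClassification c, ISet<string> roles) => c switch
    {
        SecurityClassification.FreeAccess => true,                             // anonymous passes
        SecurityClassification.ViewOnly   => false,                            // nobody writes
        SecurityClassification.Operate    => roles.Contains("WriteOperate"),
        SecurityClassification.Tune       => roles.Contains("WriteTune"),
        SecurityClassification.Configure  => roles.Contains("WriteConfigure"),
        _ => false,                                                            // deny by default
    };

    static void Main()
    {
        var roles = new HashSet<string> { "WriteOperate" };
        Console.WriteLine(IsAllowed(SecurityClassification.Operate, roles));  // True
        Console.WriteLine(IsAllowed(SecurityClassification.ViewOnly, roles)); // False
    }
}
```

Denial at this layer returns `BadUserAccessDenied` without the driver ever being consulted.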
### Dispatch
`_invoker.ExecuteWriteAsync(host, isIdempotent, callSite, …)` honors the `WriteIdempotentAttribute` semantics per decisions #44-45 and #143:
- `isIdempotent = true` (tag flagged `WriteIdempotent` in the Config DB) → runs through the standard `DriverCapability.Write` pipeline; retry may apply per the tier configuration.
- `isIdempotent = false` (default) → the invoker builds a one-off pipeline with `RetryCount = 0`. A timeout may fire after the device already accepted the pulse / alarm-ack / counter-increment; replay is the caller's decision, not the server's.
The `_writeIdempotentByFullRef` lookup is populated at discovery time from the `DriverAttributeInfo.WriteIdempotent` field.
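The dispatch rule reduces to one decision: idempotent writes inherit the tier's retry budget, non-idempotent writes get zero retries. A minimal sketch (`tierRetryCount` stands in for the configured resilience tier; the real logic lives inside `CapabilityInvoker.ExecuteWriteAsync`):

```csharp
using System;

// Hedged sketch of the idempotence dispatch rule, not the real invoker.
static class WriteDispatchSketch
{
    public static int EffectiveRetryCount(bool isIdempotent, int tierRetryCount)
        => isIdempotent ? tierRetryCount : 0; // replay is the caller's decision

    static void Main()
    {
        Console.WriteLine(EffectiveRetryCount(true, 3));  // 3 — standard Write pipeline
        Console.WriteLine(EffectiveRetryCount(false, 3)); // 0 — one-off pipeline, no retry
    }
}
```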
### Per-write status
`IWritable.WriteAsync` returns `IReadOnlyList<WriteResult>` — one numeric `StatusCode` per requested write. A non-zero code is surfaced directly to the client; exceptions become `BadInternalError`. The OPC UA stack's pattern of batching per-service is preserved through the full chain.
## Array element writes
Array-element writes via OPC UA `IndexRange` are driver-specific. The OPC UA stack hands the dispatch an unwrapped `NumericRange` on the `indexRange` parameter of `OnWriteValue`; `DriverNodeManager` passes the full `value` object to `IWritable.WriteAsync` and the driver decides whether to support partial writes. Galaxy performs a read-modify-write inside the Galaxy driver (MXAccess has no element-level writes); other drivers generally accept only full-array writes today.
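The Galaxy-style read-modify-write can be sketched with plain arrays. This is a simplified, hypothetical illustration — index parsing and type coercion are reduced to `int`, and the "write back" is just the return value:

```csharp
using System;

// Sketch of read-modify-write for a backend that only accepts whole-array writes.
static class ArrayElementWriteSketch
{
    public static int[] WriteElement(int[] current, string indexRange, int value)
    {
        // Parse the IndexRange as a zero-based element index; a bad range maps
        // to an error status (BadIndexRangeInvalid) in the real dispatch.
        if (!int.TryParse(indexRange, out var index) || index < 0 || index >= current.Length)
            throw new ArgumentOutOfRangeException(nameof(indexRange));

        var next = (int[])current.Clone(); // never mutate the cached copy in place
        next[index] = value;
        return next;                       // sent back as one full-array write
    }

    static void Main()
    {
        var updated = WriteElement(new[] { 10, 20, 30 }, "1", 99);
        Console.WriteLine(string.Join(",", updated)); // 10,99,30
    }
}
```

The clone-before-set step matters: mutating the cached array directly would make subscribed readers observe a half-applied write.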
## HistoryRead
`DriverNodeManager.HistoryReadRawModified`, `HistoryReadProcessed`, `HistoryReadAtTime`, and `HistoryReadEvents` route through the driver's `IHistoryProvider` capability with `DriverCapability.HistoryRead`. Drivers without `IHistoryProvider` surface `BadHistoryOperationUnsupported` per node. See `docs/HistoricalDataAccess.md`.
## Failure isolation
Per decision #12, exceptions in the driver's capability call are logged and converted to a per-node `BadInternalError` — they never unwind into the master node manager. This keeps one driver's outage from disrupting sibling drivers in the same server process.
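The isolation boundary is a per-node try/catch at the capability call. A minimal sketch, assuming an illustrative `BadInternalError` constant and eliding the logging:

```csharp
using System;

// Hedged sketch of per-node failure isolation (decision #12): a driver
// exception becomes a per-node status, never an unwound exception.
static class FailureIsolationSketch
{
    public const uint Good = 0;
    public const uint BadInternalError = 0x80020000; // illustrative status value

    public static uint SafeRead(Func<double> driverCall, out double value)
    {
        try { value = driverCall(); return Good; }
        catch (Exception)
        {
            value = default;         // real code logs the exception here
            return BadInternalError; // sibling drivers are unaffected
        }
    }

    static void Main()
    {
        var status = SafeRead(() => throw new TimeoutException(), out _);
        Console.WriteLine(status == BadInternalError); // True
    }
}
```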
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — `OnReadValue` / `OnWriteValue` hooks
- `src/ZB.MOM.WW.OtOpcUa.Server/Security/WriteAuthzPolicy.cs` — classification-to-role policy
- `src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs` — Phase 6.2 trie gate
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs` — `ExecuteAsync` / `ExecuteWriteAsync`
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IReadable.cs`, `IWritable.cs`, `WriteIdempotentAttribute.cs`

## Overview
OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two (or more) OtOpcUa Server processes run side-by-side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` that each server publishes.
The redundancy surface lives in `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`:
| Class | Role |
|---|---|
| `RedundancyCoordinator` | Process-singleton; owns the current `RedundancyTopology` loaded from the `ClusterNode` table. `RefreshAsync` re-reads after `sp_PublishGeneration` so operator role swaps take effect without a process restart. CAS-style swap (`Interlocked.Exchange`) means readers always see a coherent snapshot. |
| `RedundancyTopology` | Immutable `(ClusterId, Self, Peers, ServerUriArray, ValidityFlags)` snapshot. |
| `ApplyLeaseRegistry` | Tracks in-progress `sp_PublishGeneration` apply leases keyed on `(ConfigGenerationId, PublishRequestId)`. `await using` the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than `ApplyMaxDuration` (default 10 minutes) so a crashed publisher can't pin the node at `PrimaryMidApply`. |
| `PeerReachabilityTracker` | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP `/healthz`. Both must succeed for `peerReachable = true`. |
| `RecoveryStateManager` | Gates transitions out of the `Recovering*` bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. |
| `ServiceLevelCalculator` | Pure function `(role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte`. |
| `RedundancyStatePublisher` | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA `ServiceLevel` variable via an edge-triggered `OnStateChanged` event, and fires `OnServerUriArrayChanged` when the topology's `ServerUriArray` shifts. |
## Data model
Per-node redundancy state lives in the Config DB `ClusterNode` table (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs`):
| Column | Role |
|---|---|
| `NodeId` | Unique node identity; matches `Node:NodeId` in the server's bootstrap `appsettings.json`. |
| `ClusterId` | Foreign key into `ServerCluster`. |
| `RedundancyRole` | `Primary`, `Secondary`, or `Standalone` (`RedundancyRole` enum in `Configuration/Enums`). |
| `ServiceLevelBase` | Per-node base value used to bias nominal ServiceLevel output. |
| `ApplicationUri` | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |
`ServerUriArray` is derived from the set of peer `ApplicationUri` values at topology-load time and republished when the topology changes.
## ServiceLevel matrix
`ServiceLevelCalculator` produces one of the following bands (see `ServiceLevelBand` enum in the same file):
| Band | Byte | Meaning |
|---|---|---|
| `Maintenance` | 0 | Operator-declared maintenance. |
| `NoData` | 1 | Self-reported unhealthy (`/healthz` fails). |
| `InvalidTopology` | 2 | More than one Primary detected; both nodes self-demote. |
| `RecoveringBackup` | 30 | Backup post-fault, dwell not met. |
| `BackupMidApply` | 50 | Backup inside a publish-apply window. |
| `IsolatedBackup` | 80 | Primary unreachable; Backup says "take over if asked" — does **not** auto-promote (non-transparent model). |
| `AuthoritativeBackup` | 100 | Backup nominal. |
| `RecoveringPrimary` | 180 | Primary post-fault, dwell not met. |
| `PrimaryMidApply` | 200 | Primary inside a publish-apply window. |
| `IsolatedPrimary` | 230 | Primary with unreachable peer, retains authority. |
| `AuthoritativePrimary` | 255 | Primary nominal. |
The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 3..255 so spec-compliant clients that treat "&lt;3 = unhealthy" keep working.
Standalone nodes (single-instance deployments) report `AuthoritativePrimary` when healthy and `PrimaryMidApply` during publish.
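The band table reduces to a small pure function. The sketch below is an illustrative reduction, not the real `ServiceLevelCalculator` — the precedence among the operational bands is an assumption here, and the real function also distinguishes the two peer probes and the standalone role:

```csharp
using System;

// Hedged sketch of the band decision; reserved bands win, then role-specific
// operational bands. Bytes follow the matrix above.
static class ServiceLevelSketch
{
    public static byte Compute(bool isPrimary, bool selfHealthy, bool peerReachable,
                               bool applyInProgress, bool recoveryDwellMet,
                               bool topologyValid, bool maintenance)
    {
        if (maintenance) return 0;                        // Maintenance
        if (!selfHealthy) return 1;                       // NoData
        if (!topologyValid) return 2;                     // InvalidTopology
        if (isPrimary)
        {
            if (!recoveryDwellMet) return 180;            // RecoveringPrimary
            if (applyInProgress) return 200;              // PrimaryMidApply
            return peerReachable ? (byte)255 : (byte)230; // Authoritative / Isolated
        }
        if (!recoveryDwellMet) return 30;                 // RecoveringBackup
        if (applyInProgress) return 50;                   // BackupMidApply
        return peerReachable ? (byte)100 : (byte)80;      // Authoritative / Isolated
    }

    static void Main()
    {
        Console.WriteLine(Compute(true, true, true, false, true, true, false));   // 255
        Console.WriteLine(Compute(false, true, false, false, true, true, false)); // 80
    }
}
```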
## Publish fencing and split-brain prevention
Any Admin-triggered `sp_PublishGeneration` acquires an apply lease through `ApplyLeaseRegistry.BeginApplyLease`. While the lease is held:
- The calculator reports `PrimaryMidApply` / `BackupMidApply` — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
- `RedundancyCoordinator.RefreshAsync` is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
- The watchdog force-closes any lease older than `ApplyMaxDuration`; a stuck publisher therefore cannot strand a node at `PrimaryMidApply`.
Because role transitions are **operator-driven** (write `RedundancyRole` in the Config DB + publish), the Backup never auto-promotes. An `IsolatedBackup` at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).
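The lease-scope guarantee from `ApplyLeaseRegistry` hinges on `await using`: every exit path, including exceptions, releases the lease. A self-contained sketch (the counter stands in for the registry; the real type also keys leases on `(ConfigGenerationId, PublishRequestId)` and runs the stale-lease watchdog):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hedged sketch of the disposable apply-lease scope.
sealed class ApplyLeaseScope : IAsyncDisposable
{
    static int _active;
    public static int Active => _active;

    public static ApplyLeaseScope Begin()
    {
        Interlocked.Increment(ref _active);
        return new ApplyLeaseScope();
    }

    public ValueTask DisposeAsync()
    {
        Interlocked.Decrement(ref _active); // runs on success, exception, or cancellation
        return default;
    }
}

static class LeaseDemo
{
    public static async Task Run()
    {
        try
        {
            await using var lease = ApplyLeaseScope.Begin();
            throw new InvalidOperationException("publish failed mid-apply");
        }
        catch (InvalidOperationException) { /* surfaced to the publisher */ }
    }

    static async Task Main()
    {
        await Run();
        Console.WriteLine(ApplyLeaseScope.Active); // 0 — released despite the exception
    }
}
```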
## Metrics
`RedundancyMetrics` in `src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs` registers the `ZB.MOM.WW.OtOpcUa.Redundancy` meter on the Admin process. Instruments:
| Name | Kind | Tags | Description |
|---|---|---|---|
| `otopcua.redundancy.role_transition` | Counter<long> | `cluster.id`, `node.id`, `from_role`, `to_role` | Incremented every time `FleetStatusPoller` observes a `RedundancyRole` change on a `ClusterNode` row. |
| `otopcua.redundancy.primary_count` | ObservableGauge<long> | `cluster.id` | Primary-role nodes per cluster — should be exactly 1 in nominal state. |
| `otopcua.redundancy.secondary_count` | ObservableGauge<long> | `cluster.id` | Secondary-role nodes per cluster. |
| `otopcua.redundancy.stale_count` | ObservableGauge<long> | `cluster.id` | Nodes whose `LastSeenAt` exceeded the stale threshold. |
Admin `Program.cs` wires OpenTelemetry to the Prometheus exporter when `Metrics:Prometheus:Enabled=true` (default), exposing the meter under `/metrics`. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.
## Real-time notifications (Admin UI)
`FleetStatusPoller` in `src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/` polls the `ClusterNode` table, records role transitions, updates `RedundancyMetrics.SetClusterCounts`, and pushes a `RoleChanged` SignalR event onto `FleetStatusHub` when a transition is observed. `RedundancyTab.razor` subscribes with `_hub.On<RoleChangedMessage>("RoleChanged", …)` so connected Admin sessions see role swaps the moment they happen.
## Configuring a redundant pair
Redundancy is configured **in the Config DB, not appsettings.json**. The fields that must differ between the two instances:
| Field | Location | Instance 1 | Instance 2 |
|---|---|---|---|
| `NodeId` | `appsettings.json` `Node:NodeId` (bootstrap) | `node-a` | `node-b` |
| `ClusterNode.ApplicationUri` | Config DB | `urn:node-a:OtOpcUa` | `urn:node-b:OtOpcUa` |
| `ClusterNode.RedundancyRole` | Config DB | `Primary` | `Secondary` |
| `ClusterNode.ServiceLevelBase` | Config DB | typically 255 | typically 100 |
Shared between instances: `ClusterId`, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances.
Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI `RedundancyTab` — the operator edits the `ClusterNode` row in a draft generation and publishes. `RedundancyCoordinator.RefreshAsync` picks up the new topology without a process restart.
## Client-side failover
The OtOpcUa Client CLI at `src/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md) for the command reference.
## Depth reference
For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see `docs/v2/plan.md` §Phase 6.3.

## Overview
A production OtOpcUa deployment runs **three processes**, each with a distinct runtime, platform target, and install surface:
| Process | Project | Runtime | Platform | Responsibility |
|---|---|---|---|---|
| **OtOpcUa Server** | `src/ZB.MOM.WW.OtOpcUa.Server` | .NET 10 | x64 | Hosts the OPC UA endpoint; loads every non-Galaxy driver in-process; exposes `/healthz`. |
| **OtOpcUa Admin** | `src/ZB.MOM.WW.OtOpcUa.Admin` | .NET 10 (ASP.NET Core / Blazor Server) | x64 | Operator UI for Config DB editing + fleet status, SignalR hubs (`FleetStatusHub`, `AlertHub`), Prometheus `/metrics`. |
| **OtOpcUa Galaxy.Host** | `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host` | .NET Framework 4.8 | x86 (32-bit) | Hosts MXAccess COM on a dedicated STA thread with a Win32 message pump; exposes a named-pipe IPC surface consumed by `Driver.Galaxy.Proxy` inside the Server process. |
The x86 / .NET Framework 4.8 constraint applies **only** to Galaxy.Host because the MXAccess toolkit DLLs (`Program Files (x86)\ArchestrA\Framework\bin`) are 32-bit-only COM. Every other driver (Modbus, S7, OpcUaClient, AbCip, AbLegacy, TwinCAT, FOCAS) runs in-process in the 64-bit Server.
## Server process
`src/ZB.MOM.WW.OtOpcUa.Server/Program.cs` uses the generic host:
```csharp
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddSerilog();
builder.Services.AddWindowsService(o => o.ServiceName = "OtOpcUa");
builder.Services.AddHostedService<OpcUaServerService>();
builder.Services.AddHostedService<HostStatusPublisher>();
```
`OpcUaServerService` is a `BackgroundService` (decision #30 — TopShelf from v1 was replaced by the generic-host `AddWindowsService` wrapper; no TopShelf dependency remains in any csproj). It owns:
1. Config bootstrap — reads `Node:NodeId`, `Node:ClusterId`, `Node:ConfigDbConnectionString`, `Node:LocalCachePath` from `appsettings.json`.
2. `NodeBootstrap` — pulls the latest published generation from the Config DB into the LiteDB local cache (`LiteDbConfigCache`) so the node starts even if the central DB is briefly unreachable.
3. `DriverHost` — instantiates configured driver instances from the generation, wires each through `CapabilityInvoker` resilience pipelines.
4. `OpcUaApplicationHost` — builds the OPC UA endpoint, applies `OpcUaServerOptions` + `LdapOptions`, registers `AuthorizationGate` at dispatch.
5. `HostStatusPublisher` — a second hosted service that heartbeats `DriverHostStatus` rows so the Admin UI Fleet view sees the node.
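The cache-fallback in step 2 is the part worth pinning down: the node prefers the central Config DB but degrades to the local LiteDB cache rather than failing startup. A hedged sketch with delegates standing in for the real `NodeBootstrap` / `LiteDbConfigCache` types:

```csharp
using System;

// Hedged sketch of the bootstrap fallback; delegates stand in for the real
// Config DB reader and LiteDB cache, and the cache refresh is elided.
static class BootstrapSketch
{
    public static string LoadGeneration(Func<string> central, Func<string> cache)
    {
        try
        {
            return central();     // preferred: latest published generation
        }
        catch (Exception)
        {
            return cache();       // degraded start: last cached generation
        }
    }

    static void Main()
    {
        var gen = LoadGeneration(() => throw new TimeoutException("db down"),
                                 () => "gen-41 (cached)");
        Console.WriteLine(gen); // gen-41 (cached)
    }
}
```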
### Installation
Same executable, different modes driven by the .NET generic-host `AddWindowsService` wrapper:
| Mode | Invocation |
|---|---|
| Console | `ZB.MOM.WW.OtOpcUa.Server.exe` |
| Install as Windows service | `sc create OtOpcUa binPath="C:\Program Files\OtOpcUa\Server\ZB.MOM.WW.OtOpcUa.Server.exe" start=auto` |
| Start | `sc start OtOpcUa` |
| Stop | `sc stop OtOpcUa` |
| Uninstall | `sc delete OtOpcUa` |
### Health endpoints
The Server exposes `/healthz` + `/readyz` used by (a) the Admin `FleetStatusPoller` as input to Fleet status and (b) `PeerReachabilityTracker` in a peer Server process as the HTTP side of the peer-reachability probe.
## Admin process
`src/ZB.MOM.WW.OtOpcUa.Admin/Program.cs` is a stock `WebApplication`. Highlights:
1. **Load configuration** -- The production constructor reads `appsettings.json`, optional environment overlay, and environment variables, then binds each section to its typed configuration class.
2. **Validate configuration** -- `ConfigurationValidator.ValidateAndLog()` logs all resolved values and checks required constraints (port range, non-empty names and connection strings). If validation fails, the service throws `InvalidOperationException`.
3. **Register exception handler** -- Registers `AppDomain.CurrentDomain.UnhandledException` to log fatal unhandled exceptions with `IsTerminating` context.
4. **Create performance metrics** -- Creates the `PerformanceMetrics` instance and a `CancellationTokenSource` for coordinating shutdown.
5. **Create and connect MXAccess client** -- Starts the STA COM thread, creates the `MxAccessClient`, and attempts an initial connection. If the connection fails, the service logs a warning and continues -- the monitor loop will retry in the background.
6. **Start MXAccess monitor** -- Starts the connectivity monitor loop that probes the runtime connection at the configured interval and handles auto-reconnect.
7. **Test Galaxy repository connection** -- Calls `TestConnectionAsync()` on the Galaxy repository to verify the SQL Server database is reachable. If it fails, the service continues without initial address-space data.
8. **Create OPC UA server host** -- Creates `OpcUaServerHost` with the effective MXAccess client (real, override, or null fallback), performance metrics, and an optional `IHistorianDataSource` obtained from `HistorianPluginLoader.TryLoad` when `Historian.Enabled=true` (returns `null` if the plugin is absent or fails to load).
9. **Query Galaxy hierarchy** -- Fetches the object hierarchy and attribute definitions from the Galaxy repository database, recording object and attribute counts.
10. **Start server and build address space** -- Starts the OPC UA server, retrieves the `LmxNodeManager`, and calls `BuildAddressSpace()` with the queried hierarchy and attributes. If the query or build fails, the server still starts with an empty address space.
11. **Start change detection** -- Creates and starts `ChangeDetectionService`, which polls `galaxy.time_of_last_deploy` at the configured interval. When a change is detected, it triggers an address-space rebuild via the `OnGalaxyChanged` event.
12. **Start status dashboard** -- Creates the `HealthCheckService` and `StatusReportService`, wires in all live components, and starts the `StatusWebServer` HTTP listener if the dashboard is enabled. If `StatusWebServer.Start()` returns `false` (port already bound, insufficient permissions, etc.), the service logs a warning, disposes the unstarted instance, sets `OpcUaService.DashboardStartFailed = true`, and continues in degraded mode. This matches the warning-and-continue policy applied to the MxAccess connect, Galaxy DB connect, and initial address-space build steps (stability review 2026-04-13, Finding 2).
13. **Log startup complete** -- Logs "LmxOpcUa service started successfully" at `Information` level.
- Cookie auth (`CookieAuthenticationDefaults`, scheme name `OtOpcUa.Admin`) + Blazor Server (`AddInteractiveServerComponents`) + SignalR.
- Authorization policies gated by `AdminRoles`: `ConfigViewer`, `ConfigEditor`, `FleetAdmin` (see `Services/AdminRoles.cs`). `CanEdit` policy requires `ConfigEditor` or `FleetAdmin`; `CanPublish` requires `FleetAdmin`.
- `OtOpcUaConfigDbContext` registered against `ConnectionStrings:ConfigDb`.
- Scoped services: `ClusterService`, `GenerationService`, `EquipmentService`, `UnsService`, `NamespaceService`, `DriverInstanceService`, `NodeAclService`, `PermissionProbeService`, `AclChangeNotifier`, `ReservationService`, `DraftValidationService`, `AuditLogService`, `HostStatusService`, `ClusterNodeService`, `EquipmentImportBatchService`, `ILdapGroupRoleMappingService`.
- Singleton `RedundancyMetrics` (meter name `ZB.MOM.WW.OtOpcUa.Redundancy`) + `CertTrustService` (promotes rejected client certs in the Server's PKI store to trusted via the Admin Certificates page).
- `LdapAuthService` bound to `Authentication:Ldap` — same LDAP flow as ScadaLink CentralUI for visual parity.
- SignalR hubs mapped at `/hubs/fleet` and `/hubs/alerts`; `FleetStatusPoller` runs as a hosted service and pushes `RoleChanged`, host status, and alert events.
- OpenTelemetry → Prometheus exporter at `/metrics` when `Metrics:Prometheus:Enabled=true` (default). Pull-based means no Collector required in the common K8s deploy.
## Shutdown Sequence
### Installation
`OpcUaService.Stop()` tears down components in reverse dependency order:
Deployed as an ASP.NET Core service; the generic-host `AddWindowsService` wrapper (or IIS reverse-proxy for multi-node fleets) provides install/uninstall. Listens on whatever `ASPNETCORE_URLS` specifies.
1. **Cancel operations** -- Signals the `CancellationTokenSource` to stop all background loops.
2. **Stop change detection** -- Stops the Galaxy deploy polling loop.
3. **Stop OPC UA server** -- Shuts down the OPC UA server host, disconnecting all client sessions.
4. **Stop MXAccess monitor** -- Stops the connectivity monitor loop.
5. **Disconnect MXAccess** -- Disconnects the MXAccess client and releases COM resources.
6. **Dispose STA thread** -- Shuts down the dedicated STA COM thread and its message pump.
7. **Stop dashboard** -- Disposes the `StatusWebServer` HTTP listener.
8. **Dispose metrics** -- Releases the performance metrics collector.
9. **Dispose change detection** -- Releases the change detection service.
10. **Unregister exception handler** -- Removes the `AppDomain.UnhandledException` handler.
## Galaxy.Host process
The entire shutdown is wrapped in a `try/catch` that logs warnings for errors during cleanup, ensuring the service exits even if a component fails to dispose cleanly.
`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Program.cs` is a .NET Framework 4.8 x86 console executable. Configuration comes from environment variables supplied by the supervisor (`Driver.Galaxy.Proxy.Supervisor`):
## Error Handling
| Env var | Purpose |
|---|---|
| `OTOPCUA_GALAXY_PIPE` | Pipe name the host listens on (default `OtOpcUaGalaxy`). |
| `OTOPCUA_ALLOWED_SID` | SID of the Server process's principal; anyone else is refused during the handshake. |
| `OTOPCUA_GALAXY_SECRET` | Per-spawn shared secret the client must present in the Hello frame. |
| `OTOPCUA_GALAXY_BACKEND` | `mxaccess` (default), `db` (ZB-only, no COM), `stub` (in-memory; for tests). |
| `OTOPCUA_GALAXY_ZB_CONN` | SQL connection string to the ZB Galaxy repository. |
| `OTOPCUA_HISTORIAN_*` | Optional Wonderware Historian SDK config if Historian is enabled for this node. |
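The defaulting behavior in the table can be sketched with a small helper (the name `GalaxyHostEnv` is illustrative, not the real `Program.cs` code):

```csharp
using System;

// Sketch: resolve Galaxy.Host settings from the environment block the
// supervisor provides, applying the documented defaults when a variable
// is absent or empty.
static class GalaxyHostEnv
{
    public static string Get(string name, string fallback)
    {
        string value = Environment.GetEnvironmentVariable(name);
        return string.IsNullOrEmpty(value) ? fallback : value;
    }
}

// Usage:
// string pipe    = GalaxyHostEnv.Get("OTOPCUA_GALAXY_PIPE", "OtOpcUaGalaxy");
// string backend = GalaxyHostEnv.Get("OTOPCUA_GALAXY_BACKEND", "mxaccess");
```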
### Unhandled exceptions
The host spins up `StaPump` (the STA thread with message pump), creates the MXAccess `LMXProxyServer` COM object on that thread, and handles all COM calls there; the IPC layer marshals work items via `PostThreadMessage`.
`AppDomain.CurrentDomain.UnhandledException` is registered at startup and removed at shutdown. The handler logs the exception at `Fatal` level with the `IsTerminating` flag:
### Pipe security
```csharp
Log.Fatal(e.ExceptionObject as Exception,
    "Unhandled exception (IsTerminating={IsTerminating})", e.IsTerminating);
```
`PipeServer` builds a `PipeAcl` from the provided `SecurityIdentifier` + uses `NamedPipeServerStream` with `maxNumberOfServerInstances: 1`. The handshake requires a matching shared secret in the first Hello frame; callers whose SID doesn't match `OTOPCUA_ALLOWED_SID` are rejected before any frame is processed. **By design the pipe ACL denies BUILTIN\Administrators** — live smoke tests must therefore run from a non-elevated shell that matches the allowed principal. The installed dev host (`OtOpcUaGalaxyHost`) runs as `dohertj2` with the secret at `.local/galaxy-host-secret.txt`.
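The ACL construction described above might look like the following sketch — an assumed shape, not the real `PipeServer` code — showing the allow-one-SID / deny-Administrators / single-instance combination on .NET Framework 4.8:

```csharp
using System.IO.Pipes;
using System.Security.AccessControl;
using System.Security.Principal;

static class PipeAclSketch
{
    // Allow exactly the supervisor-supplied SID, deny BUILTIN\Administrators,
    // and cap the pipe at one server instance.
    public static NamedPipeServerStream CreateSecuredPipe(
        string pipeName, SecurityIdentifier allowedSid)
    {
        var acl = new PipeSecurity();
        acl.AddAccessRule(new PipeAccessRule(
            allowedSid, PipeAccessRights.ReadWrite, AccessControlType.Allow));
        acl.AddAccessRule(new PipeAccessRule(
            new SecurityIdentifier(WellKnownSidType.BuiltinAdministratorsSid, null),
            PipeAccessRights.FullControl, AccessControlType.Deny));

        // .NET Framework overload that accepts a PipeSecurity directly.
        return new NamedPipeServerStream(
            pipeName,
            PipeDirection.InOut,
            1,                            // maxNumberOfServerInstances
            PipeTransmissionMode.Message,
            PipeOptions.Asynchronous,
            0, 0,                         // default buffer sizes
            acl);
    }
}
```

The OS-level ACL filters callers before any frame is read; the shared-secret check in the Hello frame is the second gate on top of it.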
### Startup resilience
### Installation
The startup sequence is designed to degrade gracefully rather than fail entirely:
- If MXAccess connection fails, the service continues with a `NullMxAccessClient` that returns bad-quality values for all reads.
- If the Galaxy repository database is unreachable, the OPC UA server starts with an empty address space.
- If the status dashboard port is in use, the dashboard logs a warning and does not start, but the OPC UA server continues.
### Fatal startup failure
If a critical step (configuration validation, OPC UA server start) throws, `Start()` catches the exception, logs it at `Fatal`, and re-throws to let TopShelf report the failure.
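The two failure policies — warn-and-continue for optional integrations, fatal re-throw for critical steps — can be condensed into a sketch (illustrative names only; the real `Start()` inlines this logic):

```csharp
using System;

static class StartupPolicy
{
    public enum StepPolicy { Critical, WarnAndContinue }

    // Runs one startup step. Optional steps swallow failures and report
    // degraded mode; critical steps propagate so the service host reports
    // the failure.
    public static bool RunStep(string name, StepPolicy policy,
        Func<bool> step, Action<string> warn)
    {
        bool ok;
        try
        {
            ok = step();
        }
        catch when (policy == StepPolicy.WarnAndContinue)
        {
            ok = false; // optional integration failed; keep starting
        }
        if (!ok && policy == StepPolicy.Critical)
            throw new InvalidOperationException($"Startup step failed: {name}");
        if (!ok)
            warn($"{name} failed; continuing in degraded mode");
        return ok;
    }
}
```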
## Logging
The service uses Serilog with two sinks configured in `Program.Main()`:
```csharp
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .WriteTo.Console()
    .WriteTo.File(
        path: "logs/lmxopcua-.log",
        rollingInterval: RollingInterval.Day,
        retainedFileCountLimit: 31)
    .CreateLogger();
```
| Sink | Details |
|------|---------|
| Console | Writes to stdout, useful when running as a console application |
| Rolling file | Writes to `logs/lmxopcua-{date}.log`, rolls daily, retains 31 days of history |
Log files are written relative to the executable directory (see Working Directory above). Each component creates its own contextual logger using `Log.ForContext<T>()` or `Log.ForContext(typeof(T))`.
`Log.CloseAndFlush()` is called in the `finally` block of `Program.Main()` to ensure all buffered log entries are written before process exit.
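A contextual logger in the style described above (Serilog's `ForContext`; `MxAccessClient` stands in here for any bridge component):

```csharp
using Serilog;

class MxAccessClient
{
    // SourceContext ("...MxAccessClient") is stamped on every event,
    // so per-component filtering works in both sinks.
    private static readonly ILogger Log = Serilog.Log.ForContext<MxAccessClient>();

    public void OnConnected(int attempts) =>
        Log.Information("Connected to runtime after {Attempts} attempt(s)", attempts);
}
```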
## Multi-Instance Deployment
The service supports running multiple instances for redundancy. Each instance requires:
- A unique Windows service name (e.g., `LmxOpcUa`, `LmxOpcUa2`)
- A unique OPC UA port and dashboard port
- A unique `OpcUa.ApplicationUri` and `OpcUa.ServerName`
- A unique `MxAccess.ClientName`
- Matching `Redundancy.ServerUris` arrays on all instances
Install additional instances using TopShelf's `-servicename` flag:

```bash
cd C:\publish\lmxopcua\instance2
ZB.MOM.WW.OtOpcUa.Host.exe install -servicename "LmxOpcUa2" -displayname "LMX OPC UA Server (Instance 2)"
```

The Galaxy host is wrapped with NSSM (the Non-Sucking Service Manager) because the executable itself is a plain console app, not a `ServiceBase` Windows service. The supervisor then adopts the child process over the pipe after install. Install/uninstall commands follow the NSSM pattern:

```bash
nssm install OtOpcUaGalaxyHost "C:\Program Files (x86)\OtOpcUa\Galaxy.Host\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe"
nssm set OtOpcUaGalaxyHost ObjectName .\dohertj2 <password>
nssm set OtOpcUaGalaxyHost AppEnvironmentExtra OTOPCUA_GALAXY_BACKEND=mxaccess OTOPCUA_GALAXY_SECRET= OTOPCUA_ALLOWED_SID=
nssm start OtOpcUaGalaxyHost
```
See [Redundancy Guide](Redundancy.md) for full deployment details.
(Exact values for the environment block are generated by the Admin UI + committed alongside `.local/galaxy-host-secret.txt` on the dev box.)
## Required Runtime Assemblies
The build uses Costura.Fody to embed all NuGet dependencies into the single `ZB.MOM.WW.OtOpcUa.Host.exe`. The only native dependency that must sit alongside the executable in every deployment is the MXAccess COM toolkit:
| Assembly | Purpose |
|----------|---------|
| `ArchestrA.MxAccess.dll` | MXAccess COM interop — runtime data access to Galaxy tags |
The Wonderware Historian SDK is packaged as a **runtime-loaded plugin** so hosts that will not use historical data access do not need the SDK installed. The plugin lives in a `Historian/` subfolder next to `ZB.MOM.WW.OtOpcUa.Host.exe`:
```
ZB.MOM.WW.OtOpcUa.Host.exe
ArchestrA.MxAccess.dll
Historian/
    ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll
    aahClientManaged.dll
    aahClientCommon.dll
    aahClient.dll
    Historian.CBE.dll
    Historian.DPAPI.dll
    ArchestrA.CloudHistorian.Contract.dll
```

## Inter-process communication

```
┌──────────────────────────┐ LDAP bind (Authentication:Ldap) ┌──────────────────────────┐
│ OtOpcUa Admin (x64) │ ─────────────────────────────────────────────▶│ LDAP / AD │
│ Blazor Server + SignalR │ └──────────────────────────┘
│ /metrics (Prometheus)    │ FleetStatusPoller → ClusterNode poll
│ ─────────────────────────────────────────────▶┌──────────────────────────┐
│ Cluster/Generation/ACL writes │ Config DB (SQL Server) │
└──────────────────────────┘ ─────────────────────────────────────────────▶│ OtOpcUaConfigDbContext │
▲ └──────────────────────────┘
│ SignalR ▲
│ (role change, │ sp_GetCurrentGenerationForCluster
│ host status, │ sp_PublishGeneration
│ alerts) │
┌──────────────────────────┐ │
│ OtOpcUa Server (x64) │ ──────────────────────────────────────────────────────────┘
│ OPC UA endpoint │
│ Non-Galaxy drivers │ Named pipe (OtOpcUaGalaxy) ┌──────────────────────────┐
│ Driver.Galaxy.Proxy │ ─────────────────────────────────────────────▶│ Galaxy.Host (x86 .NFx) │
│ │ SID + shared-secret handshake │ STA + message pump │
│ /healthz /readyz │ │ MXAccess COM │
└──────────────────────────┘ │ Historian SDK (opt) │
└──────────────────────────┘
```
At startup, if `Historian.Enabled=true` in `appsettings.json`, `HistorianPluginLoader` probes `Historian/ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll` via `Assembly.LoadFrom` and instantiates the plugin's entry point. An `AppDomain.AssemblyResolve` handler redirects the SDK assembly lookups (`aahClientManaged`, `aahClientCommon`, …) to the same subfolder so the CLR can resolve them when the plugin first JITs. If the plugin directory is absent or any SDK dependency fails to load, the loader logs a warning and the server continues to run with history support disabled — `LmxNodeManager` returns `BadHistoryOperationUnsupported` for every history call.
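The probe-and-redirect behavior might be sketched as follows (assumed shape; the real `HistorianPluginLoader` also records `LastOutcome` and logs warnings):

```csharp
using System;
using System.IO;
using System.Reflection;

static class HistorianPluginLoaderSketch
{
    // Probes Historian/ for the plugin and redirects SDK assembly lookups
    // (aahClientManaged, aahClientCommon, ...) into the same subfolder.
    public static Assembly TryLoad(string baseDir)
    {
        string historianDir = Path.Combine(baseDir, "Historian");

        AppDomain.CurrentDomain.AssemblyResolve += (sender, args) =>
        {
            string file = new AssemblyName(args.Name).Name + ".dll";
            string candidate = Path.Combine(historianDir, file);
            return File.Exists(candidate) ? Assembly.LoadFrom(candidate) : null;
        };

        string pluginPath = Path.Combine(
            historianDir, "ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll");

        // null => server continues with history support disabled.
        return File.Exists(pluginPath) ? Assembly.LoadFrom(pluginPath) : null;
    }
}
```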
## appsettings.json boundary
Deployment matrix:
Each process reads its own `appsettings.json` for **bootstrap only** — connection strings, LDAP bind config, transport security profile, redundancy node id, logging. The authoritative configuration tree (drivers, UNS, tags, ACLs) lives in the Config DB and is edited through the Admin UI. See [`Configuration.md`](Configuration.md) for the split.
| Scenario | Host exe | `ArchestrA.MxAccess.dll` | `Historian/` subfolder |
|----------|----------|--------------------------|------------------------|
| `Historian.Enabled=false` | required | required | **omit** |
| `Historian.Enabled=true` | required | required | required |
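Under that split, a node's bootstrap `appsettings.json` stays small — for example (illustrative values; only the section names are quoted from this document, and the `Ldap` keys are deliberately left unspecified):

```json
{
  "ConnectionStrings": {
    "ConfigDb": "Server=localhost;Database=OtOpcUaConfig;Integrated Security=true"
  },
  "Authentication": {
    "Ldap": { }
  },
  "Metrics": {
    "Prometheus": { "Enabled": true }
  }
}
```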
## Development bootstrap
`ArchestrA.MxAccess.dll` and the historian SDK DLLs are not redistributable — they are provided by the AVEVA System Platform and Historian installations on the target machine. The copies in `lib/` are taken from `Program Files (x86)\ArchestrA\Framework\bin` on a machine with the platform installed.
## Platform Target
The service must be compiled and run as x86 (32-bit). The MXAccess COM toolkit DLLs in `Program Files (x86)\ArchestrA\Framework\bin` are 32-bit only. Running the service as x64 or AnyCPU (64-bit preferred) causes COM interop failures when creating the `LMXProxyServer` object on the STA thread.
For the Windows install steps (SQL Server in Docker, .NET 10 SDK, .NET Framework 4.8 SDK, Docker Desktop WSL 2 backend, EF Core CLI, first-run migration), see [`docs/v2/dev-environment.md`](v2/dev-environment.md).


@@ -1,274 +1,16 @@
# Status Dashboard
# Status Dashboard — Superseded
## Overview
This document has been superseded.
The service hosts an embedded HTTP status dashboard that surfaces real-time health, connection state, subscription counts, data change throughput, and Galaxy metadata. Operators access it through a browser to verify the bridge is functioning without needing an OPC UA client. The dashboard is enabled by default on port 8081 and can be disabled via configuration.
The single-process, HTTP-listener "Status Dashboard" (`StatusWebServer` bound to port 8081) belonged to v1 LmxOpcUa, where one process owned the OPC UA endpoint, the MXAccess bridge, and the operator surface. In the multi-process OtOpcUa platform the operator surface has moved into the **OtOpcUa Admin** app — a Blazor Server UI that talks to the shared Config DB and to every deployed node over SignalR (`FleetStatusHub`, `AlertHub`). Prometheus scraping lives on the Admin app's `/metrics` endpoint via OpenTelemetry (`Metrics:Prometheus:Enabled`).
## HTTP Server
Operator surfaces now covered by the Admin UI:
`StatusWebServer` wraps a `System.Net.HttpListener` bound to `http://+:{port}/`. It starts a background task that accepts requests in a loop and dispatches them by path. Only `GET` requests are accepted; all other methods return `405 Method Not Allowed`. Responses include `Cache-Control: no-cache` headers to prevent stale data in the browser.
- Fleet health, per-node role/ServiceLevel, crash-loop detection (`Fleet.razor`, `Hosts.razor`, `FleetStatusPoller`)
- Redundancy state + role transitions (`RedundancyMetrics`, `otopcua.redundancy.*`)
- Cluster + node + credential management (`ClusterService`, `ClusterNodeService`)
- Draft/publish generation editor, diff viewer, CSV import, UnsTab, IdentificationFields, RedundancyTab, AclsTab with Probe-this-permission
- Certificate trust management (`CertTrustService` promotes rejected client certs to trusted)
- Audit log viewer (`AuditLogService`)
### Endpoints
| Path | Content-Type | Description |
|------|-------------|-------------|
| `/` | `text/html` | Operator dashboard with auto-refresh |
| `/health` | `text/html` | Focused health page with service-level badge and component cards |
| `/api/status` | `application/json` | Full status snapshot as JSON (`StatusData`) |
| `/api/health` | `application/json` | Health endpoint (`HealthEndpointData`) -- returns `503` when status is `Unhealthy`, `200` otherwise |
Any other path returns `404 Not Found`.
## Health Check Logic
`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, 2d, and 2e only fire when the corresponding integration is enabled and a non-null snapshot is passed:
1. **Rule 1 -- Unhealthy**: MXAccess connection state is not `Connected`. Returns a red banner with the current state.
2. **Rule 2b -- Degraded**: `Historian.Enabled=true` but the plugin load outcome is not `Loaded`. Returns a yellow banner citing the plugin status (`NotFound`, `LoadFailed`) and the error message if one is available.
3. **Rule 2 / 2c -- Degraded**: Any recorded operation has a low success rate. The sample threshold depends on the operation category:
- Regular operations (`Read`, `Write`, `Subscribe`, `AlarmAcknowledge`): >100 invocations and <50% success rate.
- Historian operations (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads.
4. **Rule 2d -- Degraded (latched)**: `AlarmTrackingEnabled=true` and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts.
5. **Rule 2e -- Degraded**: `RuntimeStatus.StoppedCount > 0` -- at least one Galaxy runtime host (`$WinPlatform` / `$AppEngine`) is currently reported Stopped by the runtime probe manager. The rule names the stopped hosts in the message. Ordered after Rule 1 so an MxAccess transport outage stays `Unhealthy` via Rule 1 and this rule never double-messages; the probe manager also forces every entry to `Unknown` when the transport is disconnected, so the `StoppedCount` is always 0 in that case.
6. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational."
The `/api/health` endpoint returns `200` for both Healthy and Degraded states, and `503` only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection.
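The documented thresholds and the status-code mapping can be condensed into a sketch (illustrative helper, not the real `HealthCheckService`):

```csharp
using System;

static class HealthRules
{
    // Regular ops: >100 invocations and <50% success;
    // historian ops (HistoryRead*): >10 invocations and <50% success.
    public static bool IsLowSuccess(string op, long count, double successRate)
    {
        bool historian = op.StartsWith("HistoryRead", StringComparison.Ordinal);
        return count > (historian ? 10 : 100) && successRate < 0.5;
    }

    // /api/health: 503 only for Unhealthy; Healthy and Degraded both return 200.
    public static int ToHttpStatus(string status) =>
        status == "Unhealthy" ? 503 : 200;
}
```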
## Status Data Model
`StatusReportService` aggregates data from all bridge components into a `StatusData` DTO, which is then rendered as HTML or serialized to JSON. The DTO contains the following sections:
### Connection
| Field | Type | Description |
|-------|------|-------------|
| `State` | `string` | Current MXAccess connection state (Connected, Disconnected, Connecting) |
| `ReconnectCount` | `int` | Number of reconnect attempts since startup |
| `ActiveSessions` | `int` | Number of active OPC UA client sessions |
### Health
| Field | Type | Description |
|-------|------|-------------|
| `Status` | `string` | Healthy, Degraded, or Unhealthy |
| `Message` | `string` | Operator-facing explanation |
| `Color` | `string` | CSS color token (green, yellow, red, gray) |
### Subscriptions
| Field | Type | Description |
|-------|------|-------------|
| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions (includes bridge-owned runtime status probes — see `ProbeCount`) |
| `ProbeCount` | `int` | Subset of `ActiveCount` attributable to bridge-owned runtime status probes (`<Host>.ScanState` per deployed `$WinPlatform` / `$AppEngine`). Rendered as a separate `Probes: N (bridge-owned runtime status)` line on the dashboard so operators can distinguish probe overhead from client-driven subscription load |
### Galaxy
| Field | Type | Description |
|-------|------|-------------|
| `GalaxyName` | `string` | Name of the Galaxy being bridged |
| `DbConnected` | `bool` | Whether the Galaxy repository database is reachable |
| `LastDeployTime` | `DateTime?` | Most recent deploy timestamp from the Galaxy |
| `ObjectCount` | `int` | Number of Galaxy objects in the address space |
| `AttributeCount` | `int` | Number of Galaxy attributes as OPC UA variables |
| `LastRebuildTime` | `DateTime?` | UTC timestamp of the last completed address-space rebuild |
### Data change
| Field | Type | Description |
|-------|------|-------------|
| `EventsPerSecond` | `double` | Rate of MXAccess data change events per second |
| `AvgBatchSize` | `double` | Average items processed per dispatch cycle |
| `PendingItems` | `int` | Items waiting in the dispatch queue |
| `TotalEvents` | `long` | Total MXAccess data change events since startup |
### Galaxy Runtime
Populated from the `GalaxyRuntimeProbeManager` that advises `<Host>.ScanState` on every deployed `$WinPlatform` and `$AppEngine`. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) for the probe machinery, state machine, and the subtree quality invalidation that fires on transitions. Disabled when `MxAccess.RuntimeStatusProbesEnabled = false`; the panel is suppressed entirely from the HTML when `Total == 0`.
| Field | Type | Description |
|-------|------|-------------|
| `Total` | `int` | Number of runtime hosts tracked (Platforms + AppEngines) |
| `RunningCount` | `int` | Hosts whose last probe callback reported `ScanState = true` with Good quality |
| `StoppedCount` | `int` | Hosts whose last probe callback reported `ScanState != true` or a failed item status, or whose initial probe timed out in Unknown state |
| `UnknownCount` | `int` | Hosts still awaiting initial probe resolution, or rewritten to Unknown when the MxAccess transport is Disconnected |
| `Hosts` | `List<GalaxyRuntimeStatus>` | Per-host detail rows, sorted alphabetically by `ObjectName` |
Each `GalaxyRuntimeStatus` entry:
| Field | Type | Description |
|-------|------|-------------|
| `ObjectName` | `string` | Galaxy `tag_name` of the host (e.g., `DevPlatform`, `DevAppEngine`) |
| `GobjectId` | `int` | Galaxy `gobject_id` of the host |
| `Kind` | `string` | `$WinPlatform` or `$AppEngine` |
| `State` | `enum` | `Unknown`, `Running`, or `Stopped` |
| `LastStateCallbackTime` | `DateTime?` | UTC time of the most recent probe callback, whether good or bad |
| `LastStateChangeTime` | `DateTime?` | UTC time of the most recent Running↔Stopped transition; backs the dashboard "Since" column |
| `LastScanState` | `bool?` | Last `ScanState` value received; `null` before the first callback |
| `LastError` | `string?` | Detail message from the most recent failure callback (e.g., `"ScanState = false (OffScan)"`); cleared on successful recovery |
| `GoodUpdateCount` | `long` | Cumulative count of `ScanState = true` callbacks |
| `FailureCount` | `long` | Cumulative count of `ScanState != true` callbacks or failed item statuses |
The HTML panel renders a per-host table with Name / Kind / State / Since / Last Error columns. Panel color reflects aggregate state: green when every host is `Running`, yellow when any host is `Unknown` with zero `Stopped`, red when any host is `Stopped`, gray when the MxAccess transport is disconnected (the Connection panel is the primary signal in that case and every row is force-rewritten to `Unknown`).
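The aggregate panel-color rule quoted above, as a sketch (illustrative helper; the real evaluation lives in the dashboard renderer):

```csharp
static class RuntimePanel
{
    // gray: transport disconnected (Connection panel is the primary signal);
    // red: any host Stopped; yellow: any Unknown with zero Stopped; green otherwise.
    public static string Color(bool transportConnected, int stopped, int unknown)
    {
        if (!transportConnected) return "gray";
        if (stopped > 0) return "red";
        if (unknown > 0) return "yellow";
        return "green";
    }
}
```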
### Operations
A dictionary of `MetricsStatistics` keyed by operation name. Each entry contains:
- `TotalCount` -- total invocations
- `SuccessRate` -- fraction of successful operations
- `AverageMilliseconds`, `MinMilliseconds`, `MaxMilliseconds`, `Percentile95Milliseconds` -- latency distribution
The instrumented operation names are:
| Name | Source |
|---|---|
| `Read` | MXAccess live tag reads (`MxAccessClient.ReadWrite.cs`) |
| `Write` | MXAccess live tag writes |
| `Subscribe` | MXAccess subscription attach |
| `HistoryReadRaw` | `LmxNodeManager.HistoryReadRawModified` -> historian plugin |
| `HistoryReadProcessed` | `LmxNodeManager.HistoryReadProcessed` -> historian plugin (aggregates) |
| `HistoryReadAtTime` | `LmxNodeManager.HistoryReadAtTime` -> historian plugin (interpolated) |
| `HistoryReadEvents` | `LmxNodeManager.HistoryReadEvents` -> historian plugin (alarm/event history) |
| `AlarmAcknowledge` | `LmxNodeManager.OnAlarmAcknowledge` -> MXAccess AckMsg write |
New operation names are auto-registered on first use, so the `Operations` dictionary only contains entries for features that have actually been exercised since startup.
### Historian
`HistorianStatusInfo` -- reflects the outcome of the runtime-loaded historian plugin and the runtime query-health counters. See [Historical Data Access](HistoricalDataAccess.md) for the plugin architecture and the [Runtime Health Counters](HistoricalDataAccess.md#runtime-health-counters) section for the data source instrumentation.
| Field | Type | Description |
|-------|------|-------------|
| `Enabled` | `bool` | Whether `Historian.Enabled` is set in configuration |
| `PluginStatus` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` — load-time outcome from `HistorianPluginLoader.LastOutcome` |
| `PluginError` | `string?` | Exception message from the last load attempt when `PluginStatus=LoadFailed`; otherwise `null` |
| `PluginPath` | `string` | Absolute path the loader probed for the plugin assembly |
| `ServerName` | `string` | Legacy single-node hostname from `Historian.ServerName`; ignored when `ServerNames` is non-empty |
| `Port` | `int` | Configured historian TCP port |
| `QueryTotal` | `long` | Total historian read queries attempted since startup (raw + aggregate + at-time + events) |
| `QuerySuccesses` | `long` | Queries that completed without an exception |
| `QueryFailures` | `long` | Queries that raised an exception — each failure also triggers the plugin's reconnect path |
| `ConsecutiveFailures` | `int` | Failures since the last success. Resets to zero on any successful query. Drives the `Degraded` health rule at threshold 3 |
| `LastSuccessTime` | `DateTime?` | UTC timestamp of the most recent successful query, or `null` when no query has succeeded since startup |
| `LastFailureTime` | `DateTime?` | UTC timestamp of the most recent failure |
| `LastQueryError` | `string?` | Exception message from the most recent failure. Prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call failed |
| `ProcessConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **process** silo (historical value queries — `ReadRaw`, `ReadAggregate`, `ReadAtTime`). See [Two SDK connection silos](HistoricalDataAccess.md#two-sdk-connection-silos) |
| `EventConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **event** silo (alarm history queries — `ReadEvents`). Separate from the process connection because the SDK requires distinct query channels |
| `ActiveProcessNode` | `string?` | Cluster node currently serving the process silo, or `null` when no process connection is open |
| `ActiveEventNode` | `string?` | Cluster node currently serving the event silo, or `null` when no event connection is open |
| `NodeCount` | `int` | Total configured historian cluster nodes. 1 for a legacy single-node deployment |
| `HealthyNodeCount` | `int` | Nodes currently eligible for new connections (not in failure cooldown) |
| `Nodes` | `List<HistorianClusterNodeState>` | Per-node cluster state in configuration order. Each entry carries `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime` |
The operator dashboard renders a cluster table inside the Historian panel when `NodeCount > 1`. Legacy single-node deployments render a compact `Node: <hostname>` line and no table. Panel color reflects combined load-time + runtime health: green when everything is healthy; yellow when any cluster node is in cooldown or 1-4 consecutive query failures have accumulated; red when the plugin is unloaded, all cluster nodes have failed, or 5+ consecutive failures have accumulated.
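The combined color rule with the documented thresholds, sketched as an illustrative helper:

```csharp
static class HistorianPanel
{
    // red: plugin unloaded / all nodes failed / 5+ consecutive failures;
    // yellow: any node in cooldown or 1-4 consecutive failures; green otherwise.
    public static string Color(
        string pluginStatus, int nodeCount, int healthyNodeCount, int consecutiveFailures)
    {
        bool loaded = pluginStatus == "Loaded";
        if (!loaded || healthyNodeCount == 0 || consecutiveFailures >= 5)
            return "red";
        if (healthyNodeCount < nodeCount || consecutiveFailures >= 1)
            return "yellow";
        return "green";
    }
}
```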
### Alarms
`AlarmStatusInfo` -- surfaces alarm-condition tracking health and dispatch counters.
| Field | Type | Description |
|-------|------|-------------|
| `TrackingEnabled` | `bool` | Whether `OpcUa.AlarmTrackingEnabled` is set in configuration |
| `ConditionCount` | `int` | Number of distinct alarm conditions currently tracked |
| `ActiveAlarmCount` | `int` | Number of alarms currently in the `InAlarm=true` state |
| `TransitionCount` | `long` | Total `InAlarm` transitions observed in the dispatch loop since startup |
| `AckEventCount` | `long` | Total alarm acknowledgement transitions observed since startup |
| `AckWriteFailures` | `long` | Total MXAccess AckMsg writes that have failed while processing alarm acknowledges. Any non-zero value latches the service into Degraded (see Rule 2d). |
| `FilterEnabled` | `bool` | Whether `OpcUa.AlarmFilter.ObjectFilters` has any patterns configured |
| `FilterPatternCount` | `int` | Number of compiled filter patterns (after comma-splitting and trimming) |
| `FilterIncludedObjectCount` | `int` | Number of Galaxy objects included by the filter during the most recent address-space build. Zero when the filter is disabled. |
When the filter is active, the operator dashboard's Alarms panel renders an extra line `Filter: N pattern(s), M object(s) included` so operators can verify scope at a glance. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) for the matching rules and resolution algorithm.
### Redundancy
`RedundancyInfo` -- only populated when `Redundancy.Enabled=true` in configuration. Shows mode, role, computed service level, application URI, and the set of peer server URIs. See [Redundancy](Redundancy.md) for the full guide.
### Footer
| Field | Type | Description |
|-------|------|-------------|
| `Timestamp` | `DateTime` | UTC time when the snapshot was generated |
| `Version` | `string` | Service assembly version |
## `/api/health` Payload
The health endpoint returns a `HealthEndpointData` document distinct from the full dashboard snapshot. It is designed for load balancers and external monitoring probes that only need an up/down signal plus component-level detail:
| Field | Type | Description |
|-------|------|-------------|
| `Status` | `string` | `Healthy`, `Degraded`, or `Unhealthy` (drives the HTTP status code) |
| `ServiceLevel` | `byte` | OPC UA-style 0-255 service level. 255 when healthy non-redundant; 0 when MXAccess is down; redundancy-adjusted otherwise |
| `RedundancyEnabled` | `bool` | Whether redundancy is configured |
| `RedundancyRole` | `string?` | `Primary` or `Secondary` when redundancy is enabled; `null` otherwise |
| `RedundancyMode` | `string?` | `Warm` or `Hot` when redundancy is enabled; `null` otherwise |
| `Components.MxAccess` | `string` | `Connected` or `Disconnected` |
| `Components.Database` | `string` | `Connected` or `Disconnected` |
| `Components.OpcUaServer` | `string` | `Running` or `Stopped` |
| `Components.Historian` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` -- matches `HistorianStatusInfo.PluginStatus` |
| `Components.Alarms` | `string` | `Disabled` or `Enabled` -- mirrors `OpcUa.AlarmTrackingEnabled` |
| `Uptime` | `string` | Formatted service uptime (e.g., `3d 5h 20m`) |
| `Timestamp` | `DateTime` | UTC time the snapshot was generated |
Monitoring tools should:
- Alert on `Status=Unhealthy` (HTTP 503) for hard outages.
- Alert on `Status=Degraded` (HTTP 200) for latched or cumulative failures -- a degraded status means the server is still operating but a subsystem needs attention (historian plugin missing, alarm ack writes failing, history read error rate too high, etc.).
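The documented `ServiceLevel` endpoints can be sketched as follows (the redundancy-adjusted middle band is deployment logic not specified here, so it is passed through as an input):

```csharp
static class ServiceLevelSketch
{
    // 0 when MXAccess is down; 255 when healthy and non-redundant;
    // otherwise whatever the redundancy subsystem computes.
    public static byte Compute(bool mxConnected, bool redundancyEnabled, byte redundancyAdjusted)
    {
        if (!mxConnected) return 0;
        if (!redundancyEnabled) return 255;
        return redundancyAdjusted;
    }
}
```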
## HTML Dashboards
### `/` -- Operator dashboard
Monospace, dark background, color-coded panels. Panels: Connection, Health, Redundancy (when enabled), Subscriptions, Data Change Dispatch, Galaxy Info, **Historian**, **Alarms**, Operations (table), Footer. Each panel border color reflects component state (green, yellow, red, or gray).
The page includes a `<meta http-equiv='refresh'>` tag set to the configured `RefreshIntervalSeconds` (default 10 seconds), so the browser polls automatically without JavaScript.
### `/health` -- Focused health view
Large status badge, computed `ServiceLevel` value, redundancy summary (when enabled), and a row of component cards: MXAccess, Galaxy Database, OPC UA Server, **Historian**, **Alarm Tracking**. Each card turns red when its component is in a failure state and grey when disabled. Best for wallboards and quick at-a-glance monitoring.
## Configuration
The dashboard is configured through the `Dashboard` section in `appsettings.json`:
```json
{
"Dashboard": {
"Enabled": true,
"Port": 8081,
"RefreshIntervalSeconds": 10
}
}
```
Setting `Enabled` to `false` prevents the `StatusWebServer` from starting. The `StatusReportService` is still created so that other components can query health programmatically, but no HTTP listener is opened.
### Dashboard start failures are non-fatal
If the dashboard is enabled but the configured port is already bound (e.g., a previous instance did not clean up, another service is squatting on the port, or the user lacks URL-reservation rights), `StatusWebServer.Start()` logs the listener exception at Error level and returns `false`. `OpcUaService` then logs a Warning, disposes the unstarted instance, sets `DashboardStartFailed = true`, and continues in degraded mode — the OPC UA endpoint still starts. Operators can detect the failure by searching the service log for:
```
[WRN] Status dashboard failed to bind on port {Port}; service continues without dashboard
```
This non-fatal fallback addresses Stability review 2026-04-13, Finding 2.
## Component Wiring
`StatusReportService` is initialized after all other service components are created. `OpcUaService.Start()` calls `SetComponents()` to supply the live references, including the historian configuration so the dashboard can label the plugin target and evaluate Rule 2b:
```csharp
StatusReportInstance.SetComponents(
effectiveMxClient,
Metrics,
GalaxyStatsInstance,
ServerHost,
NodeManagerInstance,
_config.Redundancy,
_config.OpcUa.ApplicationUri,
_config.Historian);
```
This deferred wiring allows the report service to be constructed before the MXAccess client or node manager are fully initialized. If a component is `null`, the report service falls back to default values (e.g., `ConnectionState.Disconnected`, zero counts, `HistorianPluginStatus.Disabled`).
The historian plugin status is sourced from `HistorianPluginLoader.LastOutcome`, which is updated on every load attempt. `OpcUaService` explicitly calls `HistorianPluginLoader.MarkDisabled()` when `Historian.Enabled=false` so the dashboard can distinguish "feature off" from "load failed" without ambiguity.
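The deferred-wiring fallback can be sketched as follows (names here are illustrative, not the actual `StatusReportService` members): every accessor degrades to a safe default until `SetComponents` supplies a live reference, so reads during the startup window never throw.

```csharp
using System;

// Illustrative enum standing in for the real connection-state type.
enum ConnectionState { Disconnected, Connected }

sealed class StatusSnapshotSource
{
    private Func<ConnectionState>? _mxState;   // null until SetComponents runs

    public void SetComponents(Func<ConnectionState> mxState) => _mxState = mxState;

    // Reads during the startup window report Disconnected instead of throwing.
    public ConnectionState MxAccessState
        => _mxState?.Invoke() ?? ConnectionState.Disconnected;
}
```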
See [`docs/v2/admin-ui.md`](v2/admin-ui.md) for the current operator surface and [`docs/ServiceHosting.md`](ServiceHosting.md) for the three-process layout.

# Subscriptions
Driver-side data-change subscriptions live behind `ISubscribable` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ISubscribable.cs`). The interface is deliberately mechanism-agnostic: it covers native subscriptions (Galaxy MXAccess advisory, OPC UA monitored items on an upstream server, TwinCAT ADS notifications) and driver-internal polled subscriptions (Modbus, AB CIP, S7, FOCAS). Core sees the same event shape regardless — drivers fire `OnDataChange` and Core dispatches to the matching OPC UA monitored items.
## ISubscribable surface
```csharp
Task<ISubscriptionHandle> SubscribeAsync(
    IReadOnlyList<string> fullReferences,
    TimeSpan publishingInterval,
    CancellationToken cancellationToken);

Task UnsubscribeAsync(ISubscriptionHandle handle, CancellationToken cancellationToken);

event EventHandler<DataChangeEventArgs>? OnDataChange;
```
A single `SubscribeAsync` call may batch many attributes and returns an opaque handle the caller passes back to `UnsubscribeAsync`. The driver may emit an immediate `OnDataChange` for each subscribed reference (the OPC UA initial-data convention) and then a push per change.
Every subscribe / unsubscribe call goes through `CapabilityInvoker.ExecuteAsync(DriverCapability.Subscribe, host, …)` so the per-host pipeline applies.
## Reference counting at Core
Multiple OPC UA clients can monitor the same variable simultaneously. Rather than open duplicate driver subscriptions, Core maintains a ref-count per `(driver, fullReference)` pair: the first OPC UA monitored item for a reference triggers `ISubscribable.SubscribeAsync` with that single reference; each additional monitored item just increments the count; decrement-to-zero triggers `UnsubscribeAsync`. Transferred subscriptions (client reconnect → resume session) replay against the same ref-count map so active driver subscriptions are preserved across session migration.
## Threading
The STA thread story is now driver-specific, not a server-wide concern:
- **Galaxy** runs its MXAccess COM objects on a dedicated STA thread with a Win32 message pump (`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Sta/StaPump.cs`) inside the standalone `Driver.Galaxy.Host` Windows service. The Proxy driver (`Driver.Galaxy.Proxy`) connects to the Host via named pipe and re-exposes the data on a free-threaded surface to Core. Core never touches COM.
- **Modbus / S7 / AB CIP / AB Legacy / TwinCAT / FOCAS** are free-threaded — they run their polling loops on ordinary `Task`s. Their `OnDataChange` fires on thread-pool threads.
- **OPC UA Client** delegates to the OPC Foundation stack's subscription loop.
The common contract: drivers are responsible for marshalling from whatever native thread the backend uses onto thread-pool threads before raising `OnDataChange`. Core's dispatch path acquires the OPC UA framework `Lock` and calls `ClearChangeMasks` on the corresponding `BaseDataVariableState` to notify subscribed clients.
## Dispatch
Core's subscription dispatch path:
1. `ISubscribable.OnDataChange` fires on a thread-pool thread with a `DataChangeEventArgs(subscriptionHandle, fullReference, DataValueSnapshot)`.
2. Core looks up the variable by `fullReference` in the driver's `DriverNodeManager` variable map.
3. Under the OPC UA framework `Lock`, the variable's `Value` / `StatusCode` / `Timestamp` are updated and `ClearChangeMasks(SystemContext, false)` is called.
4. The OPC Foundation stack then enqueues data-change notifications for every monitored item attached to that variable, honoring each subscription's sampling + filter configuration.
Batch coalescing — collapsing multiple pushes for the same reference between publish cycles — is done driver-side when the backend natively supports it (Galaxy keeps the v1 coalescing dictionary); otherwise the SDK's own data-change filter suppresses no-change notifications.
## Initial values
A freshly-built variable carries `StatusCode = BadWaitingForInitialData` until the driver delivers the first value. Drivers whose backends supply an initial read (Galaxy `AdviseSupervisory`, TwinCAT `AddDeviceNotification`) fire `OnDataChange` immediately after `SubscribeAsync` returns. Polled drivers fire the first push when their first poll cycle completes.
## Transferred subscription restoration
When an OPC UA session is resumed (client reconnect with `TransferSubscriptions`), Core walks the transferred monitored items and ensures every referenced `(driver, fullReference)` has a live driver subscription. References already active (in-process migration) skip re-subscribing; references that lost their driver-side handle during the session gap are re-subscribed via `SubscribeAsync`.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ISubscribable.cs` — capability contract
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs` — pipeline wrapping
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Sta/StaPump.cs` — Galaxy STA thread + message pump
- Per-driver subscribe implementations in each `Driver.*` project
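The ref-counting scheme described above can be sketched as follows (illustrative only: the real map also keys on the driver and the open/close callbacks drive `ISubscribable`):

```csharp
using System;
using System.Collections.Generic;

// Illustrative ref-count map for driver subscriptions. The boolean return
// tells the caller whether a driver-side subscribe/unsubscribe is needed.
sealed class SubscriptionRefCounter
{
    private readonly object _lock = new();
    private readonly Dictionary<string, int> _counts =
        new(StringComparer.OrdinalIgnoreCase);

    // True when this is the first monitored item for the reference.
    public bool AddRef(string fullReference)
    {
        lock (_lock)
        {
            if (_counts.TryGetValue(fullReference, out var n))
            {
                _counts[fullReference] = n + 1;
                return false;
            }
            _counts[fullReference] = 1;
            return true;
        }
    }

    // True when the last monitored item for the reference went away.
    public bool Release(string fullReference)
    {
        lock (_lock)
        {
            if (!_counts.TryGetValue(fullReference, out var n)) return false;
            if (n <= 1) { _counts.Remove(fullReference); return true; }
            _counts[fullReference] = n - 1;
            return false;
        }
    }
}
```

The private lock serializes count updates without ever holding the OPC UA framework `Lock`.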

# AB Legacy test fixture
Coverage map + gap inventory for the AB Legacy (PCCC) driver — SLC 500 /
MicroLogix / PLC-5 / LogixPccc-mode.
**TL;DR:** Docker integration-test scaffolding lives at
`tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.IntegrationTests/` (task #224),
reusing the AB CIP `ab_server` image in PCCC mode with per-family
compose profiles (`slc500` / `micrologix` / `plc5`). Scaffold passes
the skip-when-absent contract cleanly. **Wire-level round-trip against
`ab_server` PCCC mode currently fails** with `BadCommunicationError`
on read/write (verified 2026-04-20) — ab_server's PCCC server-side
coverage is narrower than libplctag's PCCC client expects. The smoke
tests target the correct shape for real hardware + should pass when
`AB_LEGACY_ENDPOINT` points at a real SLC 5/05 / MicroLogix. Unit tests
via `FakeAbLegacyTag` still carry the contract coverage.
## What the fixture is
**Integration layer** (task #224, scaffolded with a known ab_server
gap):
`tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.IntegrationTests/` with
`AbLegacyServerFixture` (TCP-probes `localhost:44818`) + three smoke
tests (parametric read across families, SLC500 write-then-read). Reuses
the AB CIP `otopcua-ab-server:libplctag-release` image via a relative
`build:` context in `Docker/docker-compose.yml` — one image, different
`--plc` flags. See `Docker/README.md` §Known limitations for the
ab_server PCCC round-trip gap + resolution paths.
**Unit layer**: `tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests/` is
still the primary coverage. All tests tagged `[Trait("Category", "Unit")]`.
The driver accepts `IAbLegacyTagFactory` via ctor DI; every test
supplies a `FakeAbLegacyTag`.
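The DI seam works roughly like this (hypothetical shapes — the real `IAbLegacyTagFactory` and `FakeAbLegacyTag` members differ; this only shows why the driver never needs a wire to be unit-testable):

```csharp
using System;

// Hypothetical minimal contract: a tag that can be read, and a factory the
// driver receives via ctor injection.
interface IAbLegacyTag : IDisposable { int ReadInt(); }
interface IAbLegacyTagFactory { IAbLegacyTag Create(string address); }

// Test double: returns a canned value instead of sending a PCCC frame.
sealed class FakeAbLegacyTag : IAbLegacyTag
{
    private readonly int _value;
    public FakeAbLegacyTag(int value) => _value = value;
    public int ReadInt() => _value;
    public void Dispose() { }
}

sealed class FakeAbLegacyTagFactory : IAbLegacyTagFactory
{
    public IAbLegacyTag Create(string address) => new FakeAbLegacyTag(42);
}
```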
## What it actually covers (unit only)
- `AbLegacyAddressTests` — PCCC address parsing for SLC / MicroLogix / PLC-5
/ LogixPccc-mode (`N7:0`, `F8:12`, `B3:0/5`, etc.)
- `AbLegacyCapabilityTests` — data type mapping, read-only enforcement
- `AbLegacyReadWriteTests` — read + write happy + error paths against the fake
- `AbLegacyBitRmwTests` — bit-within-DINT read-modify-write serialization via
per-parent `SemaphoreSlim` (mirrors the AB CIP + FOCAS PMC-bit pattern from #181)
- `AbLegacyHostAndStatusTests` — probe + host-status transitions driven by
fake-returned statuses
- `AbLegacyDriverTests` — `IDriver` lifecycle
Capability surfaces whose contract is verified: `IDriver`, `IReadable`,
`IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`,
`IPerCallHostResolver`.
## What it does NOT cover
### 1. Wire-level PCCC
No PCCC frame is sent by the test suite. libplctag's PCCC subset (DF1,
ControlNet-over-EtherNet, PLC-5 native EtherNet) is untested here;
driver-side correctness depends on libplctag being correct.
### 2. Family-specific behavior
- SLC 500 timeout + retry thresholds (SLC's comm module has known slow-response
edges) — unit fakes don't simulate timing.
- MicroLogix 1100 / 1400 max-connection-count limits — not stressed.
- PLC-5 native EtherNet connection setup (PCCC-encapsulated-in-CIP vs raw
CSPv4) — routing covered at parse level only.
### 3. Multi-device routing
`IPerCallHostResolver` contract is verified; real PCCC wire routing across
multiple gateways is not.
### 4. Alarms / history
PCCC has no alarm object + no history object. Driver doesn't implement
`IAlarmSource` or `IHistoryProvider` — no test coverage is the correct shape.
### 5. File-type coverage
PCCC has many file types (N, F, B, T, C, R, S, ST, A) — the parser tests
cover the common ones but uncommon ones (`R` counters, `S` status files,
`A` ASCII strings) have thin coverage.
## When to trust AB Legacy tests, when to reach for a rig
| Question | Unit tests | Real PLC |
| --- | --- | --- |
| "Does `N7:0/5` parse correctly?" | yes | - |
| "Does bit-in-word RMW serialize concurrent writers?" | yes | yes |
| "Does the driver lifecycle hang / crash?" | yes | yes |
| "Does a real read against an SLC 500 return correct bytes?" | no | yes (required) |
| "Does MicroLogix 1100 respect its connection-count cap?" | no | yes (required) |
| "Do PLC-5 ST-files round-trip correctly?" | no | yes (required) |
## Follow-up candidates
1. **Fix ab_server PCCC coverage upstream** — the scaffold lands the
Docker infrastructure; the wire-level round-trip gap is in ab_server
itself. Filing a patch to `libplctag/libplctag` to expand PCCC
server-side opcode coverage would make the scaffolded smoke tests
pass without a golden-box tier.
2. **Rockwell RSEmulate 500 golden-box tier** — Rockwell's real emulator
for SLC/MicroLogix/PLC-5. Would close UDT-equivalent (integer-file
indirection), timer/counter decomposition, and real ladder execution
gaps. Costs: RSLinx OEM license, Windows-only, Hyper-V conflict
matching TwinCAT XAR + Logix Emulate, no clean PR-diffable project
format (SLC/ML save as binary `.RSS`). Scaffold like the Logix
Emulate tier when operationally worth it.
3. **Lab rig** — used SLC 5/05 or MicroLogix 1100 on a dedicated
network; parts are end-of-life but still available. PLC-5 +
LogixPccc-mode behaviour + DF1 serial need specific controllers.
## Key fixture / config files
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.IntegrationTests/AbLegacyServerFixture.cs`
— TCP probe + skip attributes + env-var parsing
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.IntegrationTests/AbLegacyReadSmokeTests.cs`
— three wire-level smoke tests (currently blocked by ab_server PCCC gap)
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.IntegrationTests/Docker/docker-compose.yml`
— compose profiles reusing AB CIP Dockerfile
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.IntegrationTests/Docker/README.md`
— known-limitations write-up + resolution paths
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests/FakeAbLegacyTag.cs`
— in-process fake + factory
- `src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/AbLegacyDriver.cs` — scope remarks
at the top of the file
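The fixture's TCP probe can be sketched as follows (a sketch under assumptions: a bounded connect wait against the compose-published port; the real `AbLegacyServerFixture` also parses `AB_LEGACY_ENDPOINT` and records a skip reason):

```csharp
using System;
using System.Net.Sockets;

static class AbLegacyProbe
{
    // Returns true when something is listening; a missing container fails fast.
    public static bool IsReachable(string host, int port, int timeoutMs = 500)
    {
        try
        {
            using var client = new TcpClient();
            // Bound the connect so an absent simulator doesn't hang the suite.
            return client.ConnectAsync(host, port).Wait(timeoutMs) && client.Connected;
        }
        catch (AggregateException)
        {
            return false;   // connection refused -> skip, don't fail
        }
    }
}
```

Skip-when-absent keeps a fresh clone green: tests consult the probe result instead of assuming the Docker stack is up.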

# ab_server test fixture
Coverage map + gap inventory for the AB CIP integration fixture backed by
libplctag's `ab_server` simulator.
**TL;DR:** `ab_server` is a connectivity + atomic-read smoke harness for the AB
CIP driver. It does **not** benchmark UDTs, alarms, or any family-specific
quirk. UDT / alarm / quirk behavior is verified only by unit tests with
`FakeAbCipTagRuntime`.
## What the fixture is
- **Binary**: `ab_server` — a C program in libplctag's
`src/tools/ab_server/` ([libplctag/libplctag](https://github.com/libplctag/libplctag),
MIT).
- **Launcher**: Docker (only supported path). `Docker/Dockerfile`
multi-stage-builds `ab_server` from source against a pinned libplctag
tag + copies the binary into a slim runtime image.
`Docker/docker-compose.yml` has per-family services (`controllogix`
/ `compactlogix` / `micro800` / `guardlogix`); all bind `:44818`.
- **Lifecycle**: `AbServerFixture` TCP-probes `127.0.0.1:44818` at
collection init + records a skip reason when unreachable. Tests skip
via `[AbServerFact]` / `[AbServerTheory]` which check the same probe.
- **Profiles**: `KnownProfiles.{ControlLogix, CompactLogix, Micro800, GuardLogix}`
in `AbServerProfile.cs` — thin Family + ComposeProfile + Notes records;
the compose file is the canonical source of truth for which tags get
seeded + which `--plc` mode the simulator boots in.
- **Tests**: one smoke, `AbCipReadSmokeTests.Driver_reads_seeded_DInt_from_ab_server`,
parametrized over all four profiles via `[AbServerTheory]` + `[MemberData]`.
- **Endpoint override**: `AB_SERVER_ENDPOINT=host:port` points the
fixture at a real PLC instead of the local container.
## What it actually covers
- Read path: driver → libplctag → CIP-over-EtherNet/IP → simulator → back.
- Atomic Logix types per seed: `DINT`, `REAL`, `BOOL`, `SINT`, `STRING`.
- One `DINT[16]` array tag (ControlLogix profile only).
- `--plc controllogix` and `--plc compactlogix` mode dispatch.
- The skip-on-missing-binary behavior (`AbServerFactAttribute`) so a fresh
clone without the simulator stays green.
## What it does NOT cover
Each gap below is either stated explicitly in the profile's `Notes` field or
inferable from the seed-tag set + smoke-test surface.
### 1. UDTs / CIP Template Object (class 0x6C)
ControlLogix profile `Notes`: *"ab_server lacks full UDT emulation."*
Unverified against `ab_server`:
- PR 6 structured read/write (`AbCipStructureMember` fan-out)
- #179 Template Object shape reader (`CipTemplateObjectDecoder` + `FetchUdtShapeAsync`)
- #194 whole-UDT read optimization (`AbCipUdtReadPlanner` +
`AbCipUdtMemberLayout` + the `ReadGroupAsync` path in `AbCipDriver`)
Unit coverage: `AbCipFetchUdtShapeTests`, `CipTemplateObjectDecoderTests`,
`AbCipUdtMemberLayoutTests`, `AbCipUdtReadPlannerTests`,
`AbCipDriverWholeUdtReadTests` — all with golden Template-Object byte buffers
+ offset-keyed `FakeAbCipTag` values.
### 2. ALMD / ALMA alarm projection (#177)
Depends on the ALMD UDT shape, which `ab_server` cannot emulate. The
`OnAlarmEvent` raise/clear path + ack-write semantics are not exercised
end-to-end.
Unit coverage: `AbCipAlarmProjectionTests` — fakes feed `InFaulted` /
`Severity` via `ValuesByOffset` + assert the emitted `AlarmEventArgs`.
### 3. Micro800 unconnected-only path
Micro800 profile `Notes`: *"ab_server has no --plc micro800 — falls back to
controllogix emulation."*
The empty routing path + unconnected-session requirement (PR 11) is unit-tested
but never challenged at the CIP wire level. Real Micro800 (2080-series) on a
lab rig would be the authoritative benchmark.
### 4. GuardLogix safety subsystem
GuardLogix profile `Notes`: *"ab_server doesn't emulate the safety
subsystem."*
Only the `_S`-suffix naming classifier (PR 12, `SecurityClassification.ViewOnly`
forced on safety tags) runs. Actual safety-partition write rejection — what
happens when a non-safety write lands on a real `1756-L8xS` — is not exercised.
### 5. CompactLogix narrow ConnectionSize cap
CompactLogix profile `Notes`: *"ab_server lacks the narrower limit itself."*
Driver-side `AbCipPlcFamilyProfile` caps `ConnectionSize` at the CompactLogix
value per PR 10, but `ab_server` accepts whatever the client asks for — the
cap's correctness is trusted from its unit test, never stressed against a
simulator that rejects oversized requests.
### 6. BOOL-within-DINT read-modify-write (#181)
The `AbCipDriver.WriteBitInDIntAsync` RMW path + its per-parent `SemaphoreSlim`
serialization is unit-tested only (`AbCipBoolInDIntRmwTests`). `ab_server`
seeds a plain `TestBOOL` tag; the `.N` bit-within-DINT syntax that triggers
the RMW path is not exercised end-to-end.
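The per-parent serialization pattern (#181, shared with the AB Legacy and FOCAS PMC-bit paths) looks roughly like this — an illustrative sketch, not the real `AbCipDriver.WriteBitInDIntAsync` signature; `current` stands in for the wire read:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

sealed class BitRmwSerializer
{
    // One gate per parent DINT so concurrent bit-writers can't interleave
    // their read-modify-write cycles.
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _locks = new();

    public async Task<int> WriteBitAsync(string parentTag, int current, int bit, bool value)
    {
        var gate = _locks.GetOrAdd(parentTag, _ => new SemaphoreSlim(1, 1));
        await gate.WaitAsync();
        try
        {
            // Modify the whole DINT; the real path writes this back to the PLC.
            return value ? current | (1 << bit) : current & ~(1 << bit);
        }
        finally { gate.Release(); }
    }
}
```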
### 7. Capability surfaces beyond read
No smoke test for:
- `IWritable.WriteAsync`
- `ITagDiscovery.DiscoverAsync` (`@tags` walker)
- `ISubscribable.SubscribeAsync` (poll-group engine)
- `IHostConnectivityProbe` state transitions under wire failure
- `IPerCallHostResolver` multi-device routing
The driver implements all of these + they have unit coverage, but the only
end-to-end path `ab_server` validates today is atomic `ReadAsync`.
## Logix Emulate golden-box tier
Rockwell Studio 5000 Logix Emulate sits **above** ab_server in fidelity +
**below** real hardware. When an operator has Emulate running on a
reachable Windows box + sets two env vars, the suite promotes several
behaviours from unit-only to end-to-end wire-level coverage:
```powershell
$env:AB_SERVER_PROFILE = 'emulate'
$env:AB_SERVER_ENDPOINT = '<emulate-pc-ip>:44818'
dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests
```
With `AB_SERVER_PROFILE` unset or `abserver`, the Emulate-tier classes
skip cleanly + the ab_server Docker fixture runs as usual.
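A minimal sketch of what a gate like `AbServerProfileGate.SkipUnless` can look like (the helper shape below is hypothetical; only the env-var name and the default `abserver` tier come from this doc):

```csharp
using System;

static class ProfileGate
{
    // Returns null when the test may run, otherwise a human-readable skip reason.
    public static string? SkipUnless(string requiredProfile)
    {
        var actual = Environment.GetEnvironmentVariable("AB_SERVER_PROFILE") ?? "abserver";
        return string.Equals(actual, requiredProfile, StringComparison.OrdinalIgnoreCase)
            ? null
            : $"requires AB_SERVER_PROFILE={requiredProfile}";
    }
}
```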
| Gap this fixture doc calls out | ab_server | Logix Emulate | Real hardware |
|---|---|---|---|
| UDT / CIP Template Object (task #194) | no | **yes** | yes |
| ALMD alarm projection (task #177) | no | **yes** | yes |
| `@tags` Symbol Object walk with `Program:` scope | partial | **yes** | yes |
| Add-On Instructions | no | **yes** | yes |
| GuardLogix safety-partition write rejection | no | **yes** (Emulate 5580) | yes |
| CompactLogix narrow ConnectionSize enforcement | no | **yes** (5370 firmware) | yes |
| EtherNet/IP embedded-switch behaviour | no | no | yes |
| Redundant chassis failover (1756-RM) | no | no | yes |
| Motion control timing | no | no | yes |
**Tests that promote to Emulate** (gated on `AB_SERVER_PROFILE=emulate`
via `AbServerProfileGate.SkipUnless`):
- `AbCipEmulateUdtReadTests.WholeUdt_read_decodes_each_member_at_its_Template_Object_offset`
  — #194 whole-UDT optimization, verified against real Template Object
  bytes
- `AbCipEmulateAlmdTests.Real_ALMD_raise_fires_OnAlarmEvent_through_the_driver_projection`
  — #177 ALMD projection, verified against the real ALMD instruction
**Required Studio 5000 project state** is documented in
[`tests/…/AbCip.IntegrationTests/LogixProject/README.md`](../../tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/LogixProject/README.md);
the `.L5X` export lands there once the Emulate PC is on-site + the
project is authored.
**Costs to accept**:
- **Rockwell TechConnect or per-seat license** — not redistributable;
not CI-runnable. Each operator licenses their own Emulate install.
- **Windows-only + Hyper-V conflict** — Emulate can't coexist with
Docker Desktop's WSL 2 backend on the same OS, same way TwinCAT XAR
can't (see `docs/v2/dev-environment.md` §Integration host).
- **Manual lifecycle** — no `docker compose up` equivalent; the operator
  opens Emulate, loads the L5X, and clicks Run. The L5X in the repo keeps
  project state reproducible; starting the runtime remains a human step.
## When to trust ab_server, when to reach for a rig
| Question | ab_server | Unit tests | Logix Emulate | Lab rig |
| --- | --- | --- | --- | --- |
| "Does the driver talk CIP at all?" | yes | - | yes | - |
| "Is my atomic read path wired correctly?" | yes | yes | yes | yes |
| "Does whole-UDT grouping work?" | no | yes | **yes** | yes |
| "Do ALMD alarms raise + clear?" | no | yes | **yes** | yes |
| "Is Micro800 unconnected-only enforced wire-side?" | no (emulated as CLX) | partial | yes | yes (required) |
| "Does GuardLogix reject non-safety writes on safety tags?" | no | no | yes (Emulate 5580) | yes |
| "Does CompactLogix refuse oversized ConnectionSize?" | no | partial | yes (5370 firmware) | yes |
| "Does BOOL-in-DINT RMW race against concurrent writers?" | no | yes | partial | yes (stress) |
| "Does EtherNet/IP embedded-switch behave correctly?" | no | no | no | yes (required) |
| "Does redundant-chassis failover work?" | no | no | no | yes (required) |
## Follow-up candidates
If integration-level UDT / alarm / quirk proof becomes a shipping gate, the
options are roughly:
1. **Logix Emulate golden-box tier** (scaffolded; see the section above) —
highest-fidelity path short of real hardware. Closes UDT / ALMD / AOI /
optimized-DB gaps in one license + one Windows PC.
2. **Extend `ab_server`** upstream — the project accepts PRs + already
carries a CIP framing layer that UDT emulation could plug into.
3. **Stand up a lab rig** — physical `1756-L7x` / `5069-L3x` / `2080-LC30`
/ `1756-L8xS` controllers. The only path that covers safety partitions
across nodes, redundant chassis, embedded-switch behaviour, and motion
timing.
See also:
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbServerFixture.cs`
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbServerProfile.cs`
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbServerProfileGate.cs`
— `AB_SERVER_PROFILE` tier gate
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbCipReadSmokeTests.cs`
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/Docker/` — ab_server
image + compose
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/Emulate/` — Logix
Emulate tier tests
- `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/LogixProject/README.md`
— L5X project state the Emulate tier expects
- `docs/v2/test-data-sources.md` §2 — the broader test-data-source picking
rationale this fixture slots into

# FOCAS test fixture
Coverage map + gap inventory for the FANUC FOCAS2 CNC driver.
**TL;DR: there is no integration fixture.** Every test uses a
`FakeFocasClient` injected via `IFocasClientFactory`. Fanuc's FOCAS library
(`Fwlib32.dll`) is closed-source proprietary with no public simulator;
CNC-side behavior is trusted from field deployments.
## What the fixture is
Nothing at the integration layer.
`tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/` is unit-only. The driver ships
as Tier C (process-isolated) per `docs/v2/driver-stability.md` because the
FANUC DLL has known crash modes; tests can't replicate those in-process.
## What it actually covers (unit only)
- `FocasCapabilityTests` — data-type mapping (PMC bit / word / float,
macro variable types, parameter types)
- `FocasCapabilityMatrixTests` — per-CNC-series range validation (macro
/ parameter / PMC letter + number) across 16i / 0i-D / 0i-F /
30i / PowerMotion. See [`docs/v2/focas-version-matrix.md`](../v2/focas-version-matrix.md)
for the authoritative matrix. 46 theory cases lock every documented
range boundary — widening a range without updating the doc fails a
test.
- `FocasReadWriteTests` — read + write against the fake, FOCAS native status
→ OPC UA StatusCode mapping
- `FocasScaffoldingTests` — `IDriver` lifecycle + multi-device routing
- `FocasPmcBitRmwTests` — PMC bit read-modify-write synchronization (per-byte
`SemaphoreSlim`, mirrors the AB / Modbus pattern from #181)
- `FwlibNativeHelperTests` — `Focas32.dll` → `Fwlib32.dll` bridge validation
  + P/Invoke signature validation
Capability surfaces whose contract is verified: `IDriver`, `IReadable`,
`IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`,
`IPerCallHostResolver`.
Pre-flight validation runs in `FocasDriver.InitializeAsync` — configs
referencing out-of-range addresses fail at load time with a diagnostic
message naming the CNC series + documented limit. This closes the
cheap half of the hardware-free stability gap; Tier-C process
isolation (task #220) closes the expensive half — see
[`docs/v2/implementation/focas-isolation-plan.md`](../v2/implementation/focas-isolation-plan.md).
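A sketch of that load-time failure mode, assuming hypothetical member and exception names around the documented `FocasCapabilityMatrix`:

```csharp
// Hypothetical shape of the pre-flight walk in FocasDriver.InitializeAsync:
// every configured address is checked against FocasCapabilityMatrix before
// any CNC I/O happens. Exception and member names here are illustrative.
foreach (var tag in config.Tags)
{
    if (!_capabilityMatrix.IsInRange(config.CncSeries, tag.Address, out var limit))
    {
        throw new DriverConfigurationException(
            $"Tag '{tag.Name}': address '{tag.Address}' is outside the documented " +
            $"{config.CncSeries} range ({limit}); see docs/v2/focas-version-matrix.md");
    }
}
```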
## What it does NOT cover
### 1. FOCAS wire traffic
No FOCAS TCP frame is sent. `Fwlib32.dll`'s TCP-to-FANUC-gateway exchange is
closed-source; the driver trusts the P/Invoke layer per #193. Real CNC
correctness is trusted from field deployments.
### 2. Alarm / parameter-change callbacks
FOCAS has no push model — the driver polls via the shared `PollGroupEngine`.
There are no CNC-initiated callbacks to test; the absence is by design.
### 3. Macro / ladder variable types
FANUC has CNC-specific extensions (macro variables `#100-#999`, system
variables `#1000-#5000`, PMC timers / counters / keep-relays) whose
per-address semantics differ across 0i-F / 30i / 31i / 32i Series. Driver
covers the common address shapes; per-model quirks are not stressed.
### 4. Model-specific behavior
- Alarm retention across power cycles (model-specific CNC behavior)
- Parameter range enforcement (CNC rejects out-of-range writes)
- MTB (machine tool builder) custom screens that expose non-standard data
### 5. Tier-C process isolation — architecture shipped, Fwlib32 integration hardware-gated
The Tier-C architecture is now in place as of PRs #169–#173 (FOCAS
PR AE, task #220):
- `Driver.FOCAS.Shared` carries MessagePack IPC contracts
- `Driver.FOCAS.Host` (.NET 4.8 x86 Windows service via NSSM) accepts
a connection on a strictly-ACL'd named pipe + dispatches frames to
an `IFocasBackend`
- `Driver.FOCAS.Ipc.IpcFocasClient` implements the `IFocasClient` DI
seam by forwarding over IPC — swap the DI registration and the
driver runs Tier-C with zero other changes
- `Driver.FOCAS.Supervisor.FocasHostSupervisor` owns the spawn +
heartbeat + respawn + 3-in-5min crash-loop breaker + sticky alert
- `Driver.FOCAS.Host.Stability.PostMortemMmf` +
  `Driver.FOCAS.Supervisor.PostMortemReader` — ring-buffer of the
last ~1000 IPC operations survives a Host crash
The one remaining gap is the production `FwlibHostedBackend`: an
`IFocasBackend` implementation that wraps the licensed
`Fwlib32.dll` P/Invoke. That's hardware-gated on task #222 — we
need a CNC on the bench (or the licensed FANUC developer kit DLL
with a test harness) to validate it. Until then, the Host ships
`FakeFocasBackend` + `UnconfiguredFocasBackend`. Setting
`OTOPCUA_FOCAS_BACKEND=fake` lets operators smoke-test the whole
Tier-C pipeline end-to-end without any CNC.
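The backend switch can be pictured as follows — illustrative wiring only; the backend type names and the `OTOPCUA_FOCAS_BACKEND` variable come from this doc, everything else is an assumption:

```csharp
// Illustrative backend selection in the Host; the real composition may differ.
IFocasBackend backend = Environment.GetEnvironmentVariable("OTOPCUA_FOCAS_BACKEND") switch
{
    "fake" => new FakeFocasBackend(),            // full Tier-C pipeline, no CNC
    // "fwlib" => new FwlibHostedBackend(...),   // hardware-gated on task #222
    _ => new UnconfiguredFocasBackend(),         // rejects requests with a diagnostic
};
```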
## When to trust FOCAS tests, when to reach for a rig
| Question | Unit tests | Real CNC |
| --- | --- | --- |
| "Does PMC address `R100.3` route to the right bit?" | yes | yes |
| "Does the FANUC status → OPC UA StatusCode map cover every documented code?" | yes (contract) | yes |
| "Does a real read against a 30i Series return correct bytes?" | no | yes (required) |
| "Does `Fwlib32.dll` crash on concurrent reads?" | no | yes (stress) |
| "Do macro variables round-trip across power cycles?" | no | yes (required) |
## Follow-up candidates
1. **Nothing public** — Fanuc's FOCAS Developer Kit ships an emulator DLL
but it's under NDA + tied to licensed dev-kit installations; can't
redistribute for CI.
2. **Lab rig** — used FANUC 0i-F simulator controller (or a retired machine
tool) on a dedicated network; only path that covers real CNC behavior.
3. **Process isolation first** — before trusting FOCAS in production at
scale, shipping the Tier-C out-of-process Host architecture (similar to
Galaxy) is higher value than a CI simulator.
## Key fixture / config files
- `tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/FakeFocasClient.cs` —
  in-process fake implementing `IFocasClient`
- `tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/FocasCapabilityMatrixTests.cs`
— parameterized theories locking the per-series matrix
- `src/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/FocasDriver.cs` — ctor takes
`IFocasClientFactory`
- `src/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/FocasCapabilityMatrix.cs` —
  per-CNC-series range validator (the matrix the doc describes)
- `docs/v2/focas-version-matrix.md` — authoritative range reference
- `docs/v2/implementation/focas-isolation-plan.md` — Tier-C isolation
plan (task #220)
- `docs/v2/driver-stability.md` — Tier C scope + process-isolation rationale

# Galaxy Repository — Tag Discovery for the Galaxy Driver
`GalaxyRepositoryService` reads the Galaxy object hierarchy and attribute metadata from the System Platform Galaxy Repository SQL Server database. It is the Galaxy driver's implementation of **`ITagDiscovery.DiscoverAsync`** — every driver has its own discovery source, and the Galaxy driver's is a direct SQL query against the Galaxy Repository (the `ZB` database). Other drivers use completely different mechanisms:
| Driver | `ITagDiscovery` source |
|--------|------------------------|
| Galaxy | ZB SQL hierarchy + attribute queries (this doc) |
| AB CIP | `@tags` walker against the PLC controller |
| AB Legacy | Data-table scan via PCCC `LogicalRead` on the PLC |
| TwinCAT | Beckhoff `SymbolLoaderFactory` — uploads the full symbol tree from the ADS runtime |
| S7 | Config-DB enumeration (no native symbol upload for S7comm) |
| Modbus | Config-DB enumeration (flat register map, user-authored) |
| FOCAS | CNC queries (`cnc_rdaxisname`, `cnc_rdmacroinfo`, …) + optional Config-DB overlays |
| OPC UA Client | `Session.Browse` against the remote server |
`GalaxyRepositoryService` lives in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/` — Host-side, .NET Framework 4.8 x86, same process that owns the MXAccess COM objects. The Proxy forwards discovery over IPC the same way it forwards reads and writes.
## Connection Configuration
The connection uses Windows Authentication.
## SQL Queries
All queries are embedded as `const string` fields in `GalaxyRepositoryService`. No dynamic SQL is used. Project convention `GR-006` requires `const string` SQL queries; any new query must be added as a named constant rather than built at runtime.
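For illustration, a GR-006-style declaration — the field name `PlatformLookupSql` and the columns it reads are documented below; the exact SQL text shown is an assumption:

```csharp
// GR-006 style: every query is a named const, never built at runtime.
private const string PlatformLookupSql =
    "SELECT platform_gobject_id, node_name FROM platform";
```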
### Hierarchy query
Returns deployed Galaxy objects with their parent relationships and browse names. The query:
- Marks objects with `category_id = 13` as areas
- Filters to `is_template = 0` (instances only, not templates)
- Filters to `deployed_package_id <> 0` (deployed objects only)
- Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](../AlarmTracking.md#template-based-alarm-object-filter).
- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `<Host>.ScanState` probe advised. Also used during the hosted-variables walk to identify Platform/Engine ancestors.
- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different — an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. The driver walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [Galaxy driver — Per-Host Runtime Status Probes](Galaxy.md#per-host-runtime-status-probes-hostscanstate).
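The hosted-by walk can be sketched as follows — helper name and dictionary shape are illustrative; `HostedByGobjectId` and `CategoryId` are the documented fields:

```csharp
// Sketch of the hosted-by walk: follow HostedByGobjectId (the runtime host),
// not ParentGobjectId (browse containment), until a Platform/Engine is found.
static GalaxyObjectInfo FindRuntimeHost(
    GalaxyObjectInfo obj, IReadOnlyDictionary<int, GalaxyObjectInfo> byId)
{
    var current = obj;
    while (current.HostedByGobjectId != 0 &&
           byId.TryGetValue(current.HostedByGobjectId, out current))
    {
        // CategoryId 1 = $WinPlatform, 3 = $AppEngine (from the hierarchy query)
        if (current.CategoryId is 1 or 3)
            return current;
    }
    return null;
}
```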
### Attributes query (standard)
Returns user-defined dynamic attributes for deployed objects.
When `ExtendedAttributes = true`, a more comprehensive query runs that unions two sources:
1. **Primitive attributes** — Joins through `primitive_instance` and `attribute_definition` to include system-level attributes from primitive components. Each attribute carries its `primitive_name` so the address space can group them under their parent variable.
2. **Dynamic attributes** — The same CTE-based query as the standard path, with an empty `primitive_name`.
The `full_tag_reference` for primitive attributes follows the pattern `tag_name.primitive_name.attribute_name` (e.g., `TestMachine_001.AlarmAttr.InAlarm`).
A single-column query: `SELECT time_of_last_deploy FROM galaxy`.
The Galaxy maintains two package references for each object:
- `checked_in_package_id` — the latest saved version, which may include undeployed configuration changes
- `deployed_package_id` — the version currently running on the target platform
The queries filter on `deployed_package_id <> 0` because the OPC UA address space must mirror what is actually running in the Galaxy runtime. Using `checked_in_package_id` would expose attributes and objects that exist in the IDE but have not been deployed, causing mismatches between the OPC UA address space and the MXAccess runtime.
## Platform Scope Filter
When `Scope` is set to `LocalPlatform`, the repository applies a post-query C# filter.
### How it works
1. **Platform lookup** — A separate `const string` SQL query (`PlatformLookupSql`) reads `platform_gobject_id` and `node_name` from the `platform` table for all deployed platforms. This runs once per hierarchy load.
2. **Platform matching** — The configured `PlatformName` (or `Environment.MachineName` when null) is matched case-insensitively against the `node_name` column. If no match is found, a warning is logged listing the available platforms and the address space is empty.
3. **Host chain collection** — The filter collects the matching platform's `gobject_id`, then iterates the hierarchy to find all `$AppEngine` (category 3) objects whose `HostedByGobjectId` equals the platform. This produces the full set of host gobject_ids under the local platform.
4. **Object inclusion** — All non-area objects whose `HostedByGobjectId` is in the host set are included, along with the hosts themselves.
5. **Area retention** — `ParentGobjectId` chains are walked upward from included objects to pull in ancestor areas, keeping the browse tree connected. Areas that contain no local descendants are excluded.
6. **Attribute filtering** — The set of included `gobject_id` values is cached after `GetHierarchyAsync` and reused by `GetAttributesAsync` to filter attributes to the same scope.
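Steps 2–4 above reduce to roughly the following sketch — only `PlatformName`, the case-insensitive `node_name` match, `HostedByGobjectId`, and `CategoryId` come from this doc; other names are illustrative:

```csharp
// Condensed sketch of the LocalPlatform filter (platform matching through
// object inclusion); not the production PlatformScopeFilter code.
var platformName = config.PlatformName ?? Environment.MachineName;
var platform = platforms.FirstOrDefault(p =>
    string.Equals(p.NodeName, platformName, StringComparison.OrdinalIgnoreCase));
if (platform is null)
{
    logger.Warning("Platform '{Platform}' not found; available: {Available}",
        platformName, string.Join(", ", platforms.Select(p => p.NodeName)));
    return EmptyHierarchy;   // empty address space, as step 2 describes
}

var hostIds = hierarchy
    .Where(o => o.CategoryId == 3 && o.HostedByGobjectId == platform.GobjectId)
    .Select(o => o.GobjectId)
    .ToHashSet();
hostIds.Add(platform.GobjectId);

var included = hierarchy
    .Where(o => !o.IsArea && hostIds.Contains(o.HostedByGobjectId))
    .ToList();
```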
### Design rationale
The filter is applied in C# rather than SQL because project convention `GR-006` requires `const string` SQL queries with no dynamic SQL. The hierarchy query already returns `HostedByGobjectId` and `CategoryId` on every row, so all information needed for filtering is already in memory after the query runs. The only new SQL is the lightweight platform lookup query.
### Configuration
```json
{
  "Scope": "LocalPlatform",
  "PlatformName": null
}
```
- Set `Scope` to `"LocalPlatform"` to enable filtering. Default is `"Galaxy"` (load everything).
- Set `PlatformName` to an explicit hostname to target a specific platform, or leave null to use the local machine name.
### Startup log
```
GetAttributesAsync returned 4206 attributes (extended=true)
Scope filter retained 2100 of 4206 attributes
```
## Change Detection Polling and IRediscoverable
`ChangeDetectionService` runs a background polling loop in the Host process that calls `GetLastDeployTimeAsync` at the configured interval. It compares the returned timestamp against the last known value:
- On the first poll (no previous state), the timestamp is recorded and `OnGalaxyChanged` fires unconditionally
- On subsequent polls, `OnGalaxyChanged` fires only when `time_of_last_deploy` differs from the cached value
When the event fires, the Host re-runs the hierarchy and attribute queries and pushes the result back to the Server via an IPC `RediscoveryNeeded` message. That surfaces on `GalaxyProxyDriver` as the **`IRediscoverable.OnRediscoveryNeeded`** event; the Server's `DriverNodeManager` consumes it and calls `SyncAddressSpace` to compute the diff against the live address space.
The polling approach is used because the Galaxy Repository database does not provide change notifications. The `galaxy.time_of_last_deploy` column updates only on completed deployments, so the polling interval controls how quickly the OPC UA address space reflects Galaxy changes.
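The poll loop reduces to roughly this sketch — only `GetLastDeployTimeAsync` and the first-poll-fires-unconditionally behavior are documented; the loop shape is an assumption:

```csharp
// Sketch of the ChangeDetectionService loop: record the timestamp, fire
// OnGalaxyChanged on the first poll and on every subsequent difference.
DateTime? lastDeploy = null;
while (!ct.IsCancellationRequested)
{
    var current = await repository.GetLastDeployTimeAsync(ct);
    if (lastDeploy is null || current != lastDeploy.Value)
    {
        lastDeploy = current;
        OnGalaxyChanged?.Invoke();   // Host re-queries, then pushes RediscoveryNeeded
    }
    await Task.Delay(config.PollInterval, ct);
}
```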
## TestConnection
`TestConnectionAsync` runs `SELECT 1` against the configured database. This is used at Host startup to verify connectivity before attempting the full hierarchy query.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/GalaxyRepositoryService.cs` — SQL queries and data access
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/PlatformScopeFilter.cs` — Platform-based hierarchy and attribute filtering
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/ChangeDetectionService.cs` — Deploy timestamp polling loop
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Configuration/GalaxyRepositoryConfiguration.cs` — Connection, polling, and scope settings
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Domain/PlatformInfo.cs` — Platform-to-hostname DTO
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/Contracts/DiscoveryResponse.cs` — IPC DTO the Host uses to return hierarchy + attribute results across the pipe

# Galaxy test fixture
Coverage map + gap inventory for the Galaxy driver — out-of-process Host
(net48 x86 MXAccess COM) + Proxy (net10) + Shared protocol.
**TL;DR: Galaxy has the richest test harness in the fleet** — real Host
subprocess spawn, real ZB SQL queries, IPC parity checks against the v1
LmxProxy reference, + live-smoke tests when MXAccess runtime is actually
installed. Gaps are live-plant + failover-shaped: the E2E suite covers the
representative ~50-tag deployment but not large-site discovery stress, real
Rockwell/Siemens PLC enumeration through MXAccess, or ZB SQL Always-On
replica failover.
## What the fixture is
Multi-project test topology:
- **E2E parity** —
`tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/ParityFixture.cs` spawns the
production `OtOpcUa.Driver.Galaxy.Host.exe` as a subprocess, opens the
named-pipe IPC, connects `GalaxyProxyDriver` + runs hierarchy / stability
parity tests against both.
- **Host.Tests** —
`tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Tests/` — direct Host process
testing (18+ test classes covering alarm discovery, AVEVA prerequisite
checks, IPC dispatcher, alarm tracker, probe manager, historian
cluster/quality/wiring, history read, OPC UA attribute mapping,
subscription lifecycle, reconnect, multi-host proxy, ADS address routing,
expression evaluation) + `GalaxyRepositoryLiveSmokeTests` that hit real
ZB SQL.
- **Proxy.Tests** — `GalaxyProxyDriver` client contract tests.
- **Shared.Tests** — shared protocol + address model.
- **TestSupport** — test helpers reused across the above.
## How tests skip
- **E2E parity**: `ParityFixture.SkipIfUnavailable()` runs at class init and
checks Windows-only, non-admin user, ZB SQL reachable on
`localhost:1433`, Host EXE built in the expected `bin/` folder. Any miss
→ tests skip.
- **Live-smoke** (`GalaxyRepositoryLiveSmokeTests`): `Assert.Skip` when ZB is
  unreachable. A dev-box note (`project_galaxy_host_installed`) records that
  the MXAccess runtime is installed and the pipe ACL denies Admins, so live
  tests must run from a non-elevated shell.
- **Unit** tests (Shared, Proxy contract, most Host.Tests) have no skip —
they run anywhere.
## What it actually covers
### E2E parity suite
- `HierarchyParityTests` — Host address-space hierarchy vs v1 LmxProxy
reference (same ZB, same Galaxy, same shape)
- `StabilityFindingsRegressionTests` — probe subscription failure
handling + host-status mutation guard from the v1 stability findings
backlog
### Host.Tests (representative)
- Alarm discovery → subsystem setup
- AVEVA prerequisite checks (runtime installed, platform deployed, etc.)
- IPC dispatcher — request/response routing over the named pipe
- Alarm tracker state machine
- Probe manager — per-runtime probe subscription + reconnect
- Historian cluster / quality / wiring — Aveva Historian integration
- OPC UA attribute mapping
- Subscription lifecycle + reconnect
- Multi-host proxy routing
- ADS address routing + expression evaluation (Galaxy's legacy expression
language)
### Live-smoke
- `GalaxyRepositoryLiveSmokeTests` — real SQL against ZB database, verifies
the ZB schema + `LocalPlatform` scope filter + change-detection query
shape match production.
### Capability surfaces hit
All of them: `IDriver`, `IReadable`, `IWritable`, `ITagDiscovery`,
`ISubscribable`, `IHostConnectivityProbe`, `IPerCallHostResolver`,
`IAlarmSource`, `IHistoryProvider`. Galaxy is the only driver where every
interface sees both contract + real-integration coverage.
## What it does NOT cover
### 1. MXAccess COM by default
The E2E parity suite backs subscriptions via the DB-only path; MXAccess COM
integration opts in via a separate live-smoke. So "does the MXAccess STA
pump correctly handle real Wonderware runtime events" is exercised only
when the operator runs live smoke on a machine with MXAccess installed.
### 2. Real Rockwell / Siemens PLC enumeration
Galaxy runtime talks to PLCs through MXAccess (Device Integration Objects).
The CI parity suite uses a representative ~50-tag deployment; large sites
(1000+ tag hierarchies, multi-Galaxy replication, deeply-nested templates)
are not stressed.
### 3. ZB SQL Always-On failover
Live-smoke hits a single SQL instance. Real production ZB often runs on
Always-On availability groups; replica failover behavior is not tested.
### 4. Galaxy replication / backup-restore
Galaxy supports backup + partial replication across platforms — these
rewrite the ZB schema in ways that change the `contained_name` vs
`tag_name` mapping. Not exercised.
### 5. Historian failover
Aveva Historian can be clustered. `historian cluster / quality` tests
verify the cluster-config query; they don't exercise actual failover
(primary dies → secondary takes over mid-HistoryRead).
### 6. AVEVA runtime version matrix
MXAccess COM contract varies subtly across System Platform 2017 / 2020 /
2023. The live-smoke runs against whatever version is installed on the dev
box; CI has no AVEVA installed at all (licensing + footprint).
## When to trust the Galaxy suite, when to reach for a live plant
| Question | E2E parity | Live-smoke | Real plant |
| --- | --- | --- | --- |
| "Does Host spawn + IPC round-trip work?" | yes | yes | yes |
| "Does the ZB schema query match production shape?" | partial | yes | yes |
| "Does MXAccess COM handle runtime reconnect correctly?" | no | yes | yes |
| "Does the driver scale to 1000+ tags on one Galaxy?" | no | partial | yes (required) |
| "Does historian failover mid-read return a clean error?" | no | no | yes (required) |
| "Does System Platform 2023's MXAccess differ from 2020?" | no | partial | yes (required) |
| "Does ZB Always-On replica failover preserve generation?" | no | no | yes (required) |
## Follow-up candidates
1. **System Platform 2023 live-smoke matrix** — set up a second dev box
running SP2023; run the same live-smoke against both to catch COM-contract
drift early.
2. **Synthetic large-site fixture** — script a ZB populator that creates a
1000-Equipment / 20000-tag hierarchy, run the parity suite against it.
Catches O(N) → O(N²) discovery regressions.
3. **Historian failover scripted test** — with a two-node AVEVA Historian
cluster, tear down primary mid-HistoryRead + verify the driver's failover
behavior + error surface.
4. **ZB Always-On CI** — SQL Server 2022 on Linux supports Always-On;
could stand up a two-replica group for replica-failover coverage.
This is already the best-tested driver; the remaining work is site-scale
+ production-topology coverage, not capability coverage.
## Key fixture / config files
- `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/ParityFixture.cs` — E2E fixture
that spawns Host + connects Proxy
- `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Tests/GalaxyRepositoryLiveSmokeTests.cs`
— live ZB smoke with `Assert.Skip` gate
- `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.TestSupport/` — shared helpers
- `docs/drivers/Galaxy.md` — COM bridge + STA pump + IPC architecture
- `docs/drivers/Galaxy-Repository.md` — ZB SQL reader + `LocalPlatform`
scope filter + change detection
- `docs/v2/aveva-system-platform-io-research.md` — MXAccess + Wonderware
background

# Galaxy Driver
The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the `ArchestrA.MxAccess` COM API plus the Galaxy Repository SQL database. It is one driver of seven in the OtOpcUa platform (see [drivers/README.md](README.md) for the full list); most of the others run in-process in the main Server (.NET 10 x64). Galaxy is an exception — it runs as its own Windows service and talks to the Server over a local named pipe.
For the decision record on why Galaxy is out-of-process and how the refactor was staged, see [docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver](../v2/plan.md). For the full driver spec (addressing, data-type map, config shape), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md).
## Project Split
Galaxy ships as three projects:
| Project | Target | Role |
|---------|--------|------|
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` | .NET Standard 2.0 | IPC contracts (MessagePack records + `MessageKind` enum) referenced by both sides |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` | .NET Framework 4.8 **x86** | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` | .NET 10 (matches Server) | `GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe` — loaded in-process by the Server; every call forwards over the pipe to the Host |
The Shared assembly is the **only** contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process.
## Why Out-of-Process
Two reasons drive the split, per `docs/v2/plan.md`:
1. **Bitness constraint.** MXAccess is 32-bit COM only — `ArchestrA.MxAccess.dll` in `Program Files (x86)\ArchestrA\Framework\bin` has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit.
2. **Tier-C stability isolation.** Galaxy is classified Tier C in [docs/v2/driver-stability.md](../v2/driver-stability.md) — the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host with crash-loop circuit-breaker.
The same Tier-C isolation story applies to FOCAS (decision record in `docs/v2/plan.md` §7), which is the second out-of-process driver.
## IPC Transport
`GalaxyProxyDriver``GalaxyIpcClient` → named pipe → `Galaxy.Host` pipe server.
- Pipe name: `otopcua-galaxy-{DriverInstanceId}` (localhost-only, no TCP surface)
- Wire format: MessagePack-CSharp, length-prefixed frames
- ACL: pipe is created with a DACL that grants only the Server's service identity; the Admins group is explicitly denied so a live-smoke test running from an elevated shell fails fast rather than silently bypassing the handshake
- Handshake: Proxy presents a shared secret at `OpenSessionRequest`; Host rejects anything else with `MessageKind.OpenSessionResponse{Success=false}`
- Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host
Every capability call on `GalaxyProxyDriver` (Read, Write, Subscribe, HistoryRead*, etc.) serializes a `*Request`, awaits the matching `*Response` via a `CallAsync<TReq, TResp>` helper, and rehydrates the result into the `Core.Abstractions` shape the Server expects.
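The `CallAsync<TReq, TResp>` shape can be sketched like this — the correlation-id scheme and pending-map are assumptions, not the production `GalaxyIpcClient` implementation:

```csharp
// Illustrative request/response correlation: serialize the request into a
// length-prefixed MessagePack frame, park a TCS, and let the pipe read loop
// complete it when the matching response frame arrives.
private async Task<TResp> CallAsync<TReq, TResp>(MessageKind kind, TReq request)
{
    var correlationId = Interlocked.Increment(ref _nextCorrelationId);
    var tcs = new TaskCompletionSource<TResp>(
        TaskCreationOptions.RunContinuationsAsynchronously);
    _pending[correlationId] = resp => tcs.TrySetResult((TResp)resp);

    await WriteFrameAsync(kind, correlationId,
        MessagePackSerializer.Serialize(request));   // length-prefixed frame
    return await tcs.Task;                           // completed by the read loop
}
```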
## STA Thread Requirement (Host-side)
MXAccess COM objects — `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls — must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption.
`StaComThread` in the Host provides that thread with the apartment state set before the thread starts:
```csharp
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
```
Work items queue via `RunAsync(Action)` or `RunAsync<T>(Func<T>)` into a `ConcurrentQueue<Action>` and post `WM_APP` to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread — including the IPC handler thread that receives the inbound pipe request.
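A minimal sketch of that queue-and-wake pattern — field names and the exact `PostThreadMessage` arguments are illustrative:

```csharp
// Marshal work onto the STA thread: enqueue, wake the pump with WM_APP,
// and hand the caller a Task that completes with the work item's result.
public Task<T> RunAsync<T>(Func<T> work)
{
    var tcs = new TaskCompletionSource<T>(
        TaskCreationOptions.RunContinuationsAsynchronously);
    _queue.Enqueue(() =>
    {
        try { tcs.SetResult(work()); }
        catch (Exception ex) { tcs.SetException(ex); }
    });
    // WM_APP wakes the pump so the STA thread drains the queue.
    PostThreadMessage(_threadId, WM_APP, IntPtr.Zero, IntPtr.Zero);
    return tcs.Task;
}
```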
## Win32 Message Pump (Host-side)
COM callbacks (`OnDataChange`, `OnWriteComplete`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump via P/Invoke:
1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
2. `GetMessage` blocks until a message arrives
3. `WM_APP` drains the work queue
4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
5. All other messages go through `TranslateMessage` / `DispatchMessage` for COM callback delivery
Without this pump MXAccess callbacks never fire and the driver delivers no live data.
## LMXProxyServer COM Object
`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle:
1. **`Register(clientName)`** — Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, calls `Register` to obtain a connection handle
2. **`Unregister(handle)`** — Unwires event handlers, calls `Unregister`, releases the COM object via `Marshal.ReleaseComObject`
## Register / AddItem / AdviseSupervisory Pattern
Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:
1. **`AddItem(handle, address)`** — Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
2. **`AdviseSupervisory(handle, itemHandle)`** — Subscribes the item for supervisory data-change callbacks
3. The runtime begins delivering `OnDataChange` events
For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value; `OnWriteComplete` confirms or rejects. Cleanup reverses: `UnAdviseSupervisory` then `RemoveItem`.
## OnDataChange and OnWriteComplete Callbacks
### OnDataChange
Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in `MxAccessClient.EventHandlers.cs`:
1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
2. Maps the MXAccess quality code to the internal `Quality` enum
3. Checks `MXSTATUS_PROXY` for error details and adjusts quality
4. Converts the timestamp to UTC
5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
- The stored per-tag subscription callback
- Any pending one-shot read completions
- The global `OnTagValueChanged` event (consumed by the Host's subscription dispatcher, which packages changes into `DataChangeEventArgs` and forwards them over the pipe to `GalaxyProxyDriver.OnDataChange`)
### OnWriteComplete
Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource<bool>` for the item handle. If `MXSTATUS_PROXY.success == 0` the write is considered failed and the error detail is logged.
## Reconnection Logic
`MxAccessClient` implements automatic reconnection through two mechanisms.
### Monitor loop
`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle:
- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
- If connected and a probe tag is configured, it checks the probe staleness threshold
### Reconnect sequence
`ReconnectAsync` performs a full disconnect-then-connect cycle:
1. Increment the reconnect counter
2. `DisconnectAsync` — tear down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detach COM event handlers, call `Unregister`, clear all handle mappings
3. `ConnectAsync` — create a fresh `LMXProxyServer`, register, replay all stored subscriptions, re-subscribe the probe tag
Stored subscriptions (`_storedSubscriptions`) persist across reconnects. `ReplayStoredSubscriptionsAsync` iterates the stored entries and calls `AddItem` + `AdviseSupervisory` for each.
## Probe Tag Health Monitoring
A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange`. The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
## Per-Host Runtime Status Probes (`<Host>.ScanState`)
Separate from the connection-level probe, the driver advises `<HostName>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](../Configuration.md#mxaccess) for the two config fields.
### How it works
`GalaxyRuntimeProbeManager` lives in `Driver.Galaxy.Host` alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped):
1. **Discovery** — After the Host completes `BuildAddressSpace`, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors) means **Stopped**.
3. **On-change-only delivery** — `ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. `Tick()` does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown`. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped".
5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Stability review 2026-04-13 Finding 1.
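The transition predicate in step 2 is a small pure function; a sketch with hypothetical minimal types standing in for the driver's `Vtq`/`Quality` shapes:

```csharp
using System;

// Hypothetical stand-ins for the driver's quality enum and Vtq record.
public enum Quality { Good, Bad, Uncertain }
public sealed record Vtq(object Value, DateTime Timestamp, Quality Quality);

public static class ProbePredicate
{
    // Running only when quality is Good AND the payload is boolean true.
    // Explicit false, bad quality, or a non-bool payload all mean Stopped.
    public static bool IsRunning(Vtq vtq)
        => vtq.Quality == Quality.Good && vtq.Value is bool b && b;
}
```

The pattern-match on `bool` is deliberate: a probe that resolves to the wrong attribute type degrades to Stopped rather than throwing in the callback path.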
### Subtree quality invalidation on transition
When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. **Stopped → Running** calls `ClearHostVariablesBadQuality` to reset each to `Good` so the next on-change MXAccess update repopulates the value.
The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
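The chain walk can be sketched as follows — hypothetical row and map shapes, assuming each Galaxy row exposes its hosting parent and whether it is a Platform/Engine:

```csharp
using System.Collections.Generic;

// Hypothetical row shape: hosting parent id plus a Platform/Engine flag.
public sealed record GalaxyRow(int GobjectId, int? HostedByGobjectId, bool IsPlatformOrEngine);

public static class HostedVariablesMap
{
    // A variable lands in the list of EVERY Platform/Engine ancestor on its
    // hosting chain, so stopping a Platform transitively covers the variables
    // hosted by each of its Engines.
    public static Dictionary<int, List<string>> Build(
        IReadOnlyDictionary<int, GalaxyRow> rows,
        IEnumerable<(string VariableId, int OwnerGobjectId)> variables)
    {
        var map = new Dictionary<int, List<string>>();
        foreach (var (variableId, owner) in variables)
        {
            int? current = owner;
            while (current is int id && rows.TryGetValue(id, out var row))
            {
                if (row.IsPlatformOrEngine)
                {
                    if (!map.TryGetValue(id, out var list)) map[id] = list = new List<string>();
                    list.Add(variableId);
                }
                current = row.HostedByGobjectId; // climb toward the Platform
            }
        }
        return map;
    }
}
```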
### Read-path short-circuit (`IsTagUnderStoppedHost`)
The Host's Read handler checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]``GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MXAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
### Deferred dispatch — the STA deadlock
**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the subscription dispatcher lock, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern.
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — outside any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural lock acquisition. No circular wait, no STA involvement.
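The enqueue-and-signal side versus the drain side can be sketched as follows (hypothetical class and method names; the real drain happens inside the dispatcher's existing 100ms wait loop):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch of the deferred-dispatch pattern: the STA-side probe callback only
// enqueues and signals; the dispatch thread drains OUTSIDE any lock the STA
// path could be waiting on, breaking the circular wait.
public sealed class DeferredTransitionQueue
{
    private readonly ConcurrentQueue<(int GobjectId, bool Stopped)> _queue = new();
    private readonly AutoResetEvent _signal = new(false);

    // Called on the STA callback path — must never block or take dispatcher locks.
    public void Enqueue(int gobjectId, bool stopped)
    {
        _queue.Enqueue((gobjectId, stopped));
        _signal.Set();
    }

    // Called on the dispatch thread inside its wait loop; returns the number
    // of transitions applied this cycle.
    public int Drain(Action<int, bool> applyTransition, int waitMs = 100)
    {
        _signal.WaitOne(waitMs);
        var applied = 0;
        while (_queue.TryDequeue(out var item))
        {
            // Mark/ClearHostVariablesBadQuality would run here, under the
            // dispatch thread's own natural lock acquisition.
            applyTransition(item.GobjectId, item.Stopped);
            applied++;
        }
        return applied;
    }
}
```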
### Dashboard and health surface
- Admin UI **Galaxy Runtime** panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected)
- `HealthCheckService.CheckHealth` rolls overall driver health to `Degraded` when any host is Stopped
See [Status Dashboard](../StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](../Configuration.md#mxaccess) for the config fields.
## Request Timeout Safety Backstop
Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). Inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.
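The abandon-on-timeout shape can be sketched as (a minimal sketch with hypothetical names — the real `SyncOverAsync.WaitSync` also handles fault unwrapping and logging):

```csharp
using System;
using System.Threading.Tasks;

// Sketch of the bounded sync-over-async backstop. On timeout the underlying
// task is deliberately NOT cancelled — it runs to completion on the thread
// pool and is abandoned — and the caller maps the miss to BadTimeout.
public static class SyncOverAsyncSketch
{
    public static bool TryWaitSync<T>(Task<T> task, TimeSpan timeout, out T result)
    {
        // Task.Wait(TimeSpan) returns false when the timeout elapses first.
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }
        result = default!;
        return false; // caller returns StatusCodes.BadTimeout to the stack
    }
}
```

Abandoning rather than cancelling is only safe under the stated invariant: the Galaxy IPC client is a shared singleton and the orphaned continuation captures no request-scoped state.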
`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
All capability calls at the Server dispatch layer are additionally wrapped by `CapabilityInvoker` (Core/Resilience/) which runs them through a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. `OTOPCUA0001` analyzer enforces the wrap at build time.
## Why Marshal.ReleaseComObject Is Needed
The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately drive the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
## Tag Discovery and Historical Data
Tag discovery (the Galaxy Repository SQL reader + `LocalPlatform` scope filter) is covered in [Galaxy-Repository.md](Galaxy-Repository.md). The Galaxy driver is `ITagDiscovery` for the Server's bootstrap path and `IRediscoverable` for the on-change-redeploy path.
Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the `aahClientManaged` SDK and is exposed through the Galaxy driver's `IHistoryProvider` implementation. See [HistoricalDataAccess.md](../HistoricalDataAccess.md).
## Key source files
Host-side (`.NET 4.8 x86`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/`):
- `Backend/MxAccess/StaComThread.cs` — STA thread and Win32 message pump
- `Backend/MxAccess/MxAccessClient.cs` — Core client (partial)
- `Backend/MxAccess/MxAccessClient.Connection.cs` — Connect / disconnect / reconnect
- `Backend/MxAccess/MxAccessClient.Subscription.cs` — Subscribe / unsubscribe / replay
- `Backend/MxAccess/MxAccessClient.ReadWrite.cs` — Read and write operations
- `Backend/MxAccess/MxAccessClient.EventHandlers.cs``OnDataChange` / `OnWriteComplete` handlers
- `Backend/MxAccess/MxAccessClient.Monitor.cs` — Background health monitor
- `Backend/MxAccess/MxProxyAdapter.cs` — COM object wrapper
- `Backend/MxAccess/GalaxyRuntimeProbeManager.cs` — Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `Backend/Historian/HistorianDataSource.cs``aahClientManaged` SDK wrapper (see [HistoricalDataAccess.md](../HistoricalDataAccess.md))
- `Ipc/GalaxyIpcServer.cs` — Named-pipe server, message dispatch
- `Domain/IMxAccessClient.cs` — Client interface
Shared (`.NET Standard 2.0`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/`):
- `Contracts/MessageKind.cs` — IPC message kinds (`ReadRequest`, `HistoryReadRequest`, `OpenSessionResponse`, …)
- `Contracts/*.cs` — MessagePack DTOs for every request/response pair
Proxy-side (`.NET 10`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/`):
- `GalaxyProxyDriver.cs``IDriver`/`ITagDiscovery`/`IReadable`/`IWritable`/`ISubscribable`/`IAlarmSource`/`IHistoryProvider`/`IRediscoverable`/`IHostConnectivityProbe` implementation; every method forwards via `GalaxyIpcClient`
- `Ipc/GalaxyIpcClient.cs` — Named-pipe client, `CallAsync<TReq, TResp>`, reconnect on broken pipe
- `GalaxyProxySupervisor.cs` — Host-process monitor, crash-loop circuit-breaker, Host relaunch

# Modbus test fixture
Coverage map + gap inventory for the Modbus TCP driver's integration-test
harness backed by `pymodbus` simulator profiles per PLC family.
**TL;DR:** Modbus is the best-covered driver — a real `pymodbus` server on
localhost with per-family seed-register profiles, plus a skip-gate when the
simulator port isn't reachable. Covers DL205 / Mitsubishi MELSEC / Siemens
S7-1500 family quirks end-to-end. Gaps are mostly error-path + alarm/history
shaped (neither is a Modbus-side concept).
## What the fixture is
- **Simulator**: `pymodbus` (Python, BSD) launched as a pinned Docker
container at
`tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/Docker/`.
Docker is the only supported launch path.
- **Lifecycle**: `ModbusSimulatorFixture` (collection-scoped) TCP-probes
`localhost:5020` on first use. `MODBUS_SIM_ENDPOINT` env var overrides the
endpoint so the same suite can target a real PLC.
- **Profiles**: `DL205Profile`, `MitsubishiProfile`, `S7_1500Profile`
each composes device-specific register-format + quirk-seed JSON for pymodbus.
Profile JSONs live under `Docker/profiles/` and are baked into the image.
- **Compose services**: one per profile (`standard` / `dl205` /
`mitsubishi` / `s7_1500`); only one binds `:5020` at a time.
- **Tests skip** via `Assert.Skip(sim.SkipReason)` when the probe fails; no
custom FactAttribute needed because `ModbusSimulatorCollection` carries the
skip reason.
## What it actually covers
### DL205 (Automation Direct)
- `DL205SmokeTests` — FC16 write → FC03 read round-trip on holding register
- `DL205CoilMappingTests` — Y-output / C-relay / X-input address mapping
(octal → Modbus offset)
- `DL205ExceptionCodeTests` — Modbus exception 0x02 → OPC UA `BadOutOfRange` against the dl205 profile (natural out-of-range path)
- `ExceptionInjectionTests` — every other exception code in the mapping table (0x01 / 0x03 / 0x04 / 0x05 / 0x06 / 0x0A / 0x0B) against the `exception_injection` profile on both read + write paths
- `DL205FloatCdabQuirkTests` — CDAB word-swap float encoding
- `DL205StringQuirkTests` — packed-string V-memory layout
- `DL205VMemoryQuirkTests` — V-memory octal addressing
- `DL205XInputTests` — X-register read-only enforcement
### Mitsubishi MELSEC
- `MitsubishiSmokeTests` — read + write round-trip
- `MitsubishiQuirkTests` — word-order, device-code mapping (D/M/X/Y ranges)
### Siemens S7-1500 (Modbus gateway flavor)
- `S7_1500SmokeTests` — read + write round-trip
- `S7_ByteOrderTests` — ABCD/DCBA/BADC/CDAB byte-order matrix
### Capability surfaces hit
- `IReadable` + `IWritable` — full round-trip
- `ISubscribable` — via the shared `PollGroupEngine` (polled subscription)
- `IHostConnectivityProbe` — TCP-reach transitions
## What it does NOT cover
### 1. No `ITagDiscovery`
Modbus has no symbol table — the driver requires a static tag map from
`DriverConfig`. There is no discovery path to test + none in the fixture.
### 2. Error-path fuzzing
`pymodbus` serves the seeded values happily. Exception responses (codes 0x01–0x0B) are now injectable via the `exception_injection` profile and exercised end-to-end by `ExceptionInjectionTests` (with `DL205ExceptionCodeTests` covering the natural out-of-range 0x02 path), but the fixture still can't emit malformed or truncated PDUs, so framing-level error handling remains unit-only.
### 3. Variant-specific quirks beyond the three profiles
- FX5U / QJ71MT91 Mitsubishi variants — profile scaffolds exist, no tests yet
- Non-S7-1500 Siemens (S7-1200 / ET200SP) — byte-order covered but
connection-pool + fragmentation quirks untested
- DL205-family cousins (DL06, DL260) — no dedicated profile
### 4. Subscription stress
`PollGroupEngine` is unit-tested standalone but the simulator doesn't exercise
it under multi-register packing stress (FC03 with 125-register batches,
boundary splits, etc.).
### 5. Alarms / history
Not a Modbus concept. Driver doesn't implement `IAlarmSource` or
`IHistoryProvider`; no test coverage is the correct shape.
## When to trust the Modbus fixture, when to reach for a rig
| Question | Fixture | Unit tests | Real PLC |
| --- | --- | --- | --- |
| "Does FC03/FC06/FC16 work end-to-end?" | yes | - | yes |
| "Does DL205 octal addressing map correctly?" | yes | yes | yes |
| "Does float CDAB word-swap round-trip?" | yes | yes | yes |
| "Does the driver handle exception responses?" | no | yes | yes (required) |
| "Does packing 125 regs into one FC03 work?" | no | no | yes (required) |
| "Does FX5U behave like Q-series?" | no | no | yes (required) |
## Follow-up candidates
1. Add `MODBUS_SIM_ENDPOINT` override documentation to
`docs/v2/test-data-sources.md` so operators can point the suite at a lab rig.
2. ~~Extend `pymodbus` profiles to inject exception responses~~**shipped**
via the `exception_injection` compose profile + standalone
`exception_injector.py` server. Rules in
`Docker/profiles/exception_injection.json` map `(fc, address)` to an
exception code; `ExceptionInjectionTests` exercises every code in
`MapModbusExceptionToStatus` (0x01 / 0x02 / 0x03 / 0x04 / 0x05 / 0x06 /
0x0A / 0x0B) end-to-end on both read (FC03) and write (FC06) paths.
3. Add an FX5U profile once a lab rig is available; the scaffolding is in place.
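The `(fc, address)` → exception-code rules mentioned in item 2 might look like the following — an illustrative sketch only; the field names are guesses and the actual schema is defined by `Docker/profiles/exception_injection.json`:

```json
{
  "rules": [
    { "fc": 3, "address": 100, "exception": "0x02" },
    { "fc": 6, "address": 200, "exception": "0x04" }
  ]
}
```

Each rule tells the standalone `exception_injector.py` server to answer the matching function code + address with the given Modbus exception instead of data.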
## Key fixture / config files
- `tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/ModbusSimulatorFixture.cs`
- `tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/DL205/DL205Profile.cs`
- `tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/Mitsubishi/MitsubishiProfile.cs`
- `tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/S7/S7_1500Profile.cs`
- `tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/Docker/`
Dockerfile + compose + per-family JSON profiles

# OPC UA Client test fixture
Coverage map + gap inventory for the OPC UA Client (gateway / aggregation)
driver.
**TL;DR:** Wire-level coverage now exists via
[opc-plc](https://github.com/Azure-Samples/iot-edge-opc-plc) — Microsoft
Industrial IoT's OPC UA PLC simulator running in Docker (task #215). Real
Secure Channel, real Session, real MonitoredItem exchange against an
independent server implementation. Unit tests still carry the exhaustive
capability matrix (cert auth / security policies / reconnect / failover /
attribute mapping). Gaps remaining: upstream-server-specific quirks
(historian aggregates, typed ConditionType events, SDK-publish-queue edge
behavior under load) — opc-plc uses the same OPCFoundation stack internally
so fully-independent-stack coverage needs `open62541/open62541` as a second
image (follow-up).
## What the fixture is
**Integration layer** (task #215):
`tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests/` stands up
`mcr.microsoft.com/iotedge/opc-plc:2.14.10` via `Docker/docker-compose.yml`
on `opc.tcp://localhost:50000`. `OpcPlcFixture` probes the port at
collection init + skips tests with a clear message when the container's
not running (matches the Modbus/pymodbus + S7/python-snap7 skip pattern).
Docker is the launcher — no PowerShell wrapper needed because opc-plc
ships pre-containerized. Compose-file flags: `--ut` (unsecured transport
advertised), `--aa` (auto-accept client certs — opc-plc's cert trust store
resets on each spin-up), `--alm` (alarm simulation for IAlarmSource
follow-up coverage), `--pn=50000` (port).
**Unit layer**:
`tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/` is still the primary
coverage. Tests inject fakes through the driver's construction path; the
OPCFoundation.NetStandard `Session` surface is wrapped behind an interface
the tests mock.
## What it actually covers
### Integration (opc-plc Docker, task #215)
- `OpcUaClientSmokeTests.Client_connects_and_reads_StepUp_node_through_real_OPC_UA_stack`
  — full Secure Channel + Session + `ns=3;s=StepUp` Read round-trip
- `OpcUaClientSmokeTests.Client_reads_batch_of_varied_types_from_live_simulator`
  — batch Read of UInt32 / Int32 / Boolean; asserts `bool`-specific Variant
  decoding to catch a common attribute-mapping regression
- `OpcUaClientSmokeTests.Client_subscribe_receives_StepUp_data_changes_from_live_server`
  — real `MonitoredItem` subscription against `ns=3;s=FastUInt1` (ticks every
  100 ms); asserts `OnDataChange` fires within 3 s of subscribe
Wire-level surfaces verified: `IDriver` + `IReadable` + `ISubscribable` +
`IHostConnectivityProbe` (via the Secure Channel exchange).
### Unit
The surface is broad because `OpcUaClientDriver` is the richest-capability
driver in the fleet (it's a gateway for another OPC UA server, so it
mirrors the full capability matrix):
- `OpcUaClientDriverScaffoldTests``IDriver` lifecycle
- `OpcUaClientReadWriteTests` — read + write lifecycle
- `OpcUaClientSubscribeAndProbeTests` — monitored-item subscription + probe
state transitions
- `OpcUaClientDiscoveryTests``GetEndpoints` + endpoint selection
- `OpcUaClientAttributeMappingTests` — OPC UA node attribute → driver value
mapping
- `OpcUaClientSecurityPolicyTests``SignAndEncrypt` / `Sign` / `None`
policy negotiation contract
- `OpcUaClientCertAuthTests` — cert store paths, revocation-list config
- `OpcUaClientReconnectTests` — SDK reconnect hook + `TransferSubscriptions`
across the disconnect boundary
- `OpcUaClientFailoverTests` — primary → secondary session fallback per
driver config
- `OpcUaClientAlarmTests` — A&E severity bucket (11000 → Low / Medium /
High / Critical), subscribe / unsubscribe / ack contract
- `OpcUaClientHistoryTests` — historical data read + interpolation contract
Capability surfaces whose contract is verified: `IDriver`, `ITagDiscovery`,
`IReadable`, `IWritable`, `ISubscribable`, `IHostConnectivityProbe`,
`IAlarmSource`, `IHistoryProvider`.
## What it does NOT cover
### 1. Real stack exchange (unit layer)
The unit layer never opens a UA Secure Channel. Every unit test mocks `Session.ReadAsync`, `Session.CreateSubscription`, `Session.AddItem`, etc. — the SDK itself is trusted. The opc-plc smoke tests above cover the happy-path exchange, but certificate validation, signing, nonce handling, chunk assembly, and keep-alive cadence remain SDK-internal and uncharacterized here.
### 2. Subscription transfer across reconnect
Contract test: "after a simulated reconnect, `TransferSubscriptions` is
called with the right handles." Real behavior: SDK re-publishes against the
new channel and some events can be lost depending on publish-queue state.
The lossy window is not characterized.
### 3. Large-scale subscription stress
100+ monitored items with heterogeneous publish intervals under a single
session — the shape that breaks publish-queue-size tuning in the wild — is
not exercised.
### 4. Real historian mappings
`IHistoryProvider.ReadRawAsync` + `ReadProcessedAsync` +
`ReadAtTimeAsync` + `ReadEventsAsync` are contract-mocked. Against a real
historian (AVEVA Historian, Prosys historian, Kepware LocalHistorian) each
has specific interpolation + bad-quality-handling quirks the contract test
doesn't see.
### 5. Real A&E events
Alarm subscription is mocked via filtered monitored items; the actual
`EventFilter` select-clause behavior against a server that exposes typed
ConditionType events (non-base `BaseEventType`) is not verified.
### 6. Authentication variants
- Anonymous, UserName/Password, X509 cert tokens — each is contract-tested
but not exchanged against a server that actually enforces each.
- LDAP-backed `UserName` (matching this repo's server-side
`LdapUserAuthenticator`) requires a live LDAP round-trip; not tested.
## When to trust OpcUaClient tests, when to reach for a server
| Question | Unit tests | Real upstream server |
| --- | --- | --- |
| "Does severity 750 bucket as High?" | yes | yes |
| "Does the driver call `TransferSubscriptions` after reconnect?" | yes | yes |
| "Does a real OPC UA read/write round-trip work?" | no | yes (required) |
| "Does event-filter-based alarm subscription return ConditionType events?" | no | yes (required) |
| "Does history read from AVEVA Historian return correct aggregates?" | no | yes (required) |
| "Does the SDK's publish queue lose notifications under load?" | no | yes (stress) |
## Follow-up candidates
The easiest win here is to **wire the client driver tests against this
repo's own server**. The integration test project
`tests/ZB.MOM.WW.OtOpcUa.Server.Tests/OpcUaServerIntegrationTests.cs`
already stands up a real OPC UA server on a non-default port with a seeded
FakeDriver. An `OpcUaClientLiveLoopbackTests` that connects the client
driver to that server would give:
- Real Secure Channel negotiation
- Real Session / Subscription / MonitoredItem exchange
- Real read/write round-trip
- Real certificate validation (the integration test already sets up PKI)
It wouldn't cover upstream-server-specific quirks (AVEVA Historian, Kepware,
Prosys), but it would cover 80% of the SDK surface the driver sits on top
of.
Beyond that:
1. **Prosys OPC UA Simulation Server** — free, Windows-available, scriptable.
2. **UaExpert Server-Side Simulator** — Unified Automation's sample server;
good coverage of typed ConditionType events.
3. **Dedicated historian integration lab** — only path for
historian-specific coverage.
## Key fixture / config files
- `tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/` — unit tests with
mocked `Session`
- `src/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient/OpcUaClientDriver.cs` — ctor +
session-factory seam tests mock through
- `tests/ZB.MOM.WW.OtOpcUa.Server.Tests/OpcUaServerIntegrationTests.cs`
the server-side integration harness a future loopback client test could
piggyback on

docs/drivers/README.md
# Drivers
OtOpcUa is a multi-driver OPC UA server. The Core (`ZB.MOM.WW.OtOpcUa.Core` + `Core.Abstractions` + `Server`) owns the OPC UA stack, address space, session/security/subscription machinery, resilience pipeline, and namespace kinds (Equipment + SystemPlatform). Drivers plug in through **capability interfaces** defined in `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/`:
- `IDriver` — lifecycle (`InitializeAsync`, `ReinitializeAsync`, `ShutdownAsync`, `GetHealth`)
- `IReadable` / `IWritable` — one-shot reads and writes
- `ITagDiscovery` — address-space enumeration
- `ISubscribable` — driver-pushed data-change streams
- `IHostConnectivityProbe` — per-host reachability events
- `IPerCallHostResolver` — multi-host drivers that route each call to a target endpoint at dispatch time
- `IAlarmSource` — driver-emitted OPC UA A&C events
- `IHistoryProvider` — raw / processed / at-time / events HistoryRead (see [HistoricalDataAccess.md](../HistoricalDataAccess.md))
- `IRediscoverable` — driver-initiated address-space rebuild notifications
Each driver opts into only the capabilities it supports. Every async capability call at the Server dispatch layer goes through `CapabilityInvoker` (`Core/Resilience/CapabilityInvoker.cs`), which wraps it in a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. The `OTOPCUA0001` analyzer enforces the wrap at build time. Drivers themselves never depend on Polly; they just implement the capability interface and let the Core wrap it.
Driver type metadata is registered at startup in `DriverTypeRegistry` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverTypeRegistry.cs`). The registry records each type's allowed namespace kinds (`Equipment` / `SystemPlatform` / `Simulated`), its JSON Schema for `DriverConfig` / `DeviceConfig` / `TagConfig` columns, and its stability tier per [docs/v2/driver-stability.md](../v2/driver-stability.md).
## Ground-truth driver list
| Driver | Project path | Tier | Wire / library | Capabilities | Notable quirk |
|--------|--------------|:----:|----------------|--------------|---------------|
| [Galaxy](Galaxy.md) | `Driver.Galaxy.{Shared, Host, Proxy}` | C | MXAccess COM + `aahClientManaged` + SqlClient | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe | Out-of-process — Host is its own Windows service (.NET 4.8 x86 for the COM bitness constraint); Proxy talks to Host over a named pipe |
| Modbus TCP | `Driver.Modbus` | A | NModbus-derived in-house client | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe | Polled subscriptions via the shared `PollGroupEngine`. DL205 PLCs are covered by `AddressFormat=DL205` (octal V/X/Y/C/T/CT translation) — no separate driver |
| Siemens S7 | `Driver.S7` | A | [S7netplus](https://github.com/S7NetPlus/s7netplus) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe | Single S7netplus `Plc` instance per PLC serialized with `SemaphoreSlim` — the S7 CPU's comm mailbox is scanned at most once per cycle, so parallel reads don't help |
| AB CIP | `Driver.AbCip` | A | libplctag CIP | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | ControlLogix / CompactLogix. Tag discovery uses the `@tags` walker to enumerate controller-scoped + program-scoped symbols; UDT member resolution via the UDT template reader |
| AB Legacy | `Driver.AbLegacy` | A | libplctag PCCC | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | SLC 500 / MicroLogix. File-based addressing (`N7:0`, `F8:0`) — no symbol table, tag list is user-authored in the config DB |
| TwinCAT | `Driver.TwinCAT` | B | Beckhoff `TwinCAT.Ads` (`TcAdsClient`) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | The only native-notification driver outside Galaxy — ADS delivers `ValueChangedCallback` events the driver forwards straight to `ISubscribable.OnDataChange` without polling. Symbol tree uploaded via `SymbolLoaderFactory` |
| FOCAS | `Driver.FOCAS` | C | FANUC FOCAS2 (`Fwlib32.dll` P/Invoke) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | Tier C — FOCAS DLL has crash modes that warrant process isolation. CNC-shaped data model (axes, spindle, PMC, macros, alarms) not a flat tag map |
| OPC UA Client | `Driver.OpcUaClient` | B | OPCFoundation `Opc.Ua.Client` | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe | Gateway/aggregation driver. Opens a single `Session` against a remote OPC UA server and re-exposes its address space. Owns its own `ApplicationConfiguration` (distinct from `Client.Shared`) because it's always-on with keep-alive + `TransferSubscriptions` across SDK reconnect, not an interactive CLI |
## Per-driver documentation
- **Galaxy** has its own docs in this folder because the out-of-process architecture + MXAccess COM rules + Galaxy Repository SQL + Historian + runtime probe manager don't fit a single table row:
- [Galaxy.md](Galaxy.md) — COM bridge, STA pump, IPC, runtime probes
- [Galaxy-Repository.md](Galaxy-Repository.md) — ZB SQL reader, `LocalPlatform` scope filter, change detection
- **All other drivers** share a single per-driver specification in [docs/v2/driver-specs.md](../v2/driver-specs.md) — addressing, data-type maps, connection settings, and quirks live there. That file is the authoritative per-driver reference; this index points at it rather than duplicating.
## Test-fixture coverage maps
Each driver has a dedicated fixture doc that lays out what the integration / unit harness actually covers vs. what's trusted from field deployments. Read the relevant one before claiming "green suite = production-ready" for a driver.
- [AB CIP](AbServer-Test-Fixture.md) — Dockerized `ab_server` (multi-stage build from libplctag source); atomic-read smoke across 4 families; UDT / ALMD / family quirks unit-only
- [Modbus](Modbus-Test-Fixture.md) — Dockerized `pymodbus` + per-family JSON profiles (4 compose profiles); best-covered driver, gaps are error-path-shaped
- [Siemens S7](S7-Test-Fixture.md) — Dockerized `python-snap7` server; DB/MB read + write round-trip verified end-to-end on `:1102`
- [AB Legacy](AbLegacy-Test-Fixture.md) — Docker scaffold via `ab_server` PCCC mode (task #224); wire-level round-trip currently blocked by ab_server's PCCC coverage gap, docs call out RSEmulate 500 + lab-rig resolution paths
- [TwinCAT](TwinCAT-Test-Fixture.md) — XAR-VM integration scaffolding (task #221); three smoke tests skip when VM unreachable. Unit via `FakeTwinCATClient` with native-notification harness
- [FOCAS](FOCAS-Test-Fixture.md) — no integration fixture, unit-only via `FakeFocasClient`; Tier C out-of-process isolation scoped but not shipped
- [OPC UA Client](OpcUaClient-Test-Fixture.md) — no integration fixture, unit-only via mocked `Session`; loopback against this repo's own server is the obvious next step
- [Galaxy](Galaxy-Test-Fixture.md) — richest harness: E2E Host subprocess + ZB SQL live-smoke + MXAccess opt-in
## Related cross-driver docs
- [HistoricalDataAccess.md](../HistoricalDataAccess.md) — `IHistoryProvider` dispatch, aggregate mapping, continuation points. The Galaxy driver's Aveva Historian implementation is the first; OPC UA Client forwards to the upstream server; other drivers do not implement the interface and return `BadHistoryOperationUnsupported`.
- [AlarmTracking.md](../AlarmTracking.md) — `IAlarmSource` event model and filtering.
- [Subscriptions.md](../Subscriptions.md) — how the Server multiplexes subscriptions onto `ISubscribable.OnDataChange`.
- [docs/v2/driver-stability.md](../v2/driver-stability.md) — tier system (A / B / C), shared `CapabilityPolicy` defaults per tier × capability, `MemoryTracking` hybrid formula, and process-level recycle rules.
- [docs/v2/plan.md](../v2/plan.md) — authoritative vision, architecture decisions, migration strategy.


@@ -0,0 +1,121 @@
# Siemens S7 test fixture
Coverage map + gap inventory for the S7 driver.
**TL;DR:** S7 now has a wire-level integration fixture backed by
[python-snap7](https://github.com/gijzelaerr/python-snap7)'s `Server` class
(task #216). Atomic reads (u16 / i16 / i32 / f32 / bool-with-bit) + DB
write-then-read round-trip are exercised end-to-end through S7netplus +
real ISO-on-TCP on `localhost:1102`. Unit tests still carry everything
else (address parsing, error-branch handling, probe-loop contract). Gaps
remaining are variant-quirk-shaped: Optimized-DB symbolic access, PG/OP
session types, PUT/GET-disabled enforcement — all need real hardware.
## What the fixture is
**Integration layer** (task #216):
`tests/ZB.MOM.WW.OtOpcUa.Driver.S7.IntegrationTests/` stands up a
python-snap7 `Server` via `Docker/docker-compose.yml --profile s7_1500`
on `localhost:1102` (pinned `python:3.12-slim-bookworm` base +
`python-snap7>=2.0`). Docker is the only supported launch path.
`Snap7ServerFixture` probes the port at collection init + skips with a
clear message when unreachable (matches the pymodbus pattern).
`server.py` (baked into the image under `Docker/`) reads a JSON profile
+ seeds DB/MB bytes at declared offsets; seeds are typed (`u16` / `i16`
/ `i32` / `f32` / `bool` / `ascii` for S7 STRING).
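The probe-then-skip gate can be sketched as follows. This is a hypothetical helper (the repo's `Snap7ServerFixture` internals may differ): it attempts a bounded TCP connect and reports reachability, which the fixture turns into a skip message instead of a test failure.

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical sketch of the probe-then-skip gate used at collection init.
public static class PortProbe
{
    // True when a TCP connect to host:port succeeds within timeoutMs.
    // A refused or timed-out connect yields false; the fixture then skips
    // the wire-level suite with a clear message instead of failing it.
    public static async Task<bool> IsReachableAsync(string host, int port, int timeoutMs = 500)
    {
        using var client = new TcpClient();
        try
        {
            var connect = client.ConnectAsync(host, port);
            var finished = await Task.WhenAny(connect, Task.Delay(timeoutMs));
            return finished == connect
                && connect.Status == TaskStatus.RanToCompletion
                && client.Connected;
        }
        catch (SocketException)
        {
            return false; // e.g. host-resolution failure
        }
    }
}
```

The same probe shape serves any dockerized fixture that follows the pymodbus pattern: probe once per collection, cache the result, skip fast.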
**Unit layer**: `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/` covers
everything the wire-level suite doesn't — address parsing, error
branches, probe-loop contract. All tests tagged
`[Trait("Category", "Unit")]`.
The driver ctor change that made this possible:
`Plc(CpuType, host, port, rack, slot)` — S7netplus 0.20's 5-arg overload
— wires `S7DriverOptions.Port` through so the simulator can bind 1102
(non-privileged) instead of 102 (root / Firewall-prompt territory).
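Assuming the S7netplus package, that wiring looks roughly like this. The `S7DriverOptions` property names are assumptions based on this doc; the `Plc` overload and read call are S7netplus API.

```csharp
using S7.Net; // S7netplus NuGet package

// Sketch only: the configurable port reaches S7netplus's 5-arg Plc
// overload so tests can target the simulator on a non-privileged port.
// Property names on the options object are assumptions.
var options = new { Host = "localhost", Port = 1102, Rack = 0, Slot = 1 };

using var plc = new Plc(CpuType.S71500, options.Host, options.Port,
                        (short)options.Rack, (short)options.Slot);
plc.Open();                      // real ISO-on-TCP handshake against the simulator
var raw = plc.Read("DB1.DBW0");  // seeded u16 at offset 0
```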
## What it actually covers
### Integration (python-snap7, task #216)
- `S7_1500SmokeTests.Driver_reads_seeded_u16_through_real_S7comm` — DB1.DBW0
read via real S7netplus over TCP + simulator; proves handshake + read path
- `S7_1500SmokeTests.Driver_reads_seeded_typed_batch` — i16, i32, f32,
bool-with-bit in one batch call; proves typed decode per S7DataType
- `S7_1500SmokeTests.Driver_write_then_read_round_trip_on_scratch_word` —
`DB1.DBW100` write → read-back; proves write path + buffer visibility
### Unit
- `S7AddressParserTests` — S7 address syntax (`DB1.DBD0`, `M10.3`, `IW4`, etc.)
- `S7DriverScaffoldTests` — `IDriver` lifecycle (init / reinit / shutdown / health)
- `S7DriverReadWriteTests` — error paths (uninitialized read/write, bad
addresses, transport exceptions)
- `S7DiscoveryAndSubscribeTests` — `ITagDiscovery.DiscoverAsync` + polled
`ISubscribable` contract with the shared `PollGroupEngine`
Capability surfaces whose contract is verified: `IDriver`, `ITagDiscovery`,
`IReadable`, `IWritable`, `ISubscribable`, `IHostConnectivityProbe`.
Wire-level surfaces verified: `IReadable`, `IWritable`.
## What it does NOT cover
### 1. Wire-level coverage beyond the simulator
The integration suite exchanges real ISO-on-TCP frames, but only against
python-snap7 on localhost — and it skips entirely when the Docker fixture is
unreachable, in which case no frame is sent at all. S7netplus has no
in-process fake mode, so the unit layer contract-tests via `IS7Client` rather
than patching into S7netplus internals; real-CPU wire behavior (PDU
negotiation, firmware quirks) stays uncovered.
### 2. Read/write happy path at the unit layer
Every `S7DriverReadWriteTests` case exercises error branches; at that layer a
successful read returns whatever the fake says it is. Happy-path coverage
against real data comes only from the simulator-backed smoke tests, never
from hardware.
### 3. Mailbox serialization under concurrent reads
The driver's `SemaphoreSlim` serializes S7netplus calls because the S7 CPU's
comm mailbox is scanned at most once per cycle. Contention behavior under
real PLC latency is not exercised.
### 4. Variant quirks
S7-1200 vs S7-1500 vs S7-300/400 connection semantics (PG vs OP vs S7-Basic)
not differentiated at test time.
### 5. Data types beyond the scalars
UDT fan-out, `STRING` with length-prefix quirks, `DTL` / `DATE_AND_TIME`,
arrays of structs — not covered.
## When to trust the S7 tests, when to reach for a rig
| Question | Unit tests | Real PLC |
| --- | --- | --- |
| "Does the address parser accept X syntax?" | yes | - |
| "Does the driver lifecycle hang / crash?" | yes | yes |
| "Does a real read against an S7-1500 return correct bytes?" | no | yes (required) |
| "Does mailbox serialization actually prevent PG timeouts?" | no | yes (required) |
| "Does a UDT fan-out produce usable member variables?" | no | yes (required) |
## Follow-up candidates
1. **Native Snap7 server** — largely superseded: the shipped fixture (task
   #216) already runs [Snap7](https://snap7.sourceforge.net/)'s server in CI
   via the python-snap7 wrapper. A pinned native build would harden the
   image but not widen coverage.
2. **Plcsim Advanced** — Siemens' paid emulator. Licensed per-seat; fits a
lab rig but not CI.
3. **Real S7 lab rig** — cheapest physical PLC (CPU 1212C) on a dedicated
network port, wired via self-hosted runner.
Without any of these, S7 driver correctness against real hardware is trusted
from field deployments, not from the test suite.
## Key fixture / config files
- `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.IntegrationTests/` — `Snap7ServerFixture`
  + smoke tests; Docker profile `s7_1500` in `Docker/docker-compose.yml`
- `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/` — unit tests only, no harness
- `src/ZB.MOM.WW.OtOpcUa.Driver.S7/S7Driver.cs` — ctor takes
  `IS7ClientFactory` which tests fake


@@ -0,0 +1,158 @@
# TwinCAT test fixture
Coverage map + gap inventory for the Beckhoff TwinCAT ADS driver.
**TL;DR:** Integration-test scaffolding lives at
`tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/` (task #221).
`TwinCATXarFixture` probes TCP 48898 on an operator-supplied VM; three
smoke tests (read / write / native notification) run end-to-end through
the real ADS stack when the VM is reachable, skip cleanly otherwise.
**Remaining operational work**: stand up a TwinCAT 3 XAR runtime in a
Hyper-V VM, author the `.tsproj` project documented at
`TwinCatProject/README.md`, rotate the 7-day trial license (or buy a
paid runtime). Unit tests via `FakeTwinCATClient` still carry the
exhaustive contract coverage.
TwinCAT is the only driver outside Galaxy that uses **native
notifications** (no polling) for `ISubscribable`, and the fake exposes a
fire-event harness so notification routing is contract-tested rigorously
at the unit layer.
## What the fixture is
**Integration layer** (task #221, scaffolded):
`tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/`
`TwinCATXarFixture` TCP-probes ADS port 48898 on the host specified by
`TWINCAT_TARGET_HOST` + requires `TWINCAT_TARGET_NETID` (AmsNetId of the
VM). No fixture-owned lifecycle — XAR can't run in Docker because it
bypasses the Windows kernel scheduler, so the VM stays
operator-managed. `TwinCatProject/README.md` documents the required
`.tsproj` project state; the file itself lands once the XAR VM is up +
the project is authored. Three smoke tests:
`Driver_reads_seeded_DINT_through_real_ADS`,
`Driver_write_then_read_round_trip_on_scratch_REAL`, and
`Driver_subscribe_receives_native_ADS_notifications_on_counter_changes`
— all skip cleanly via `[TwinCATFact]` when the runtime isn't
reachable.
**Unit layer**: `tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Tests/` is
still the primary coverage. `FakeTwinCATClient` also fakes the
`AddDeviceNotification` flow so tests can trigger callbacks without a
running runtime.
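The fire-event harness can be sketched as follows. `ITwinCATClient` and `FakeTwinCATClient` exist in the repo, but the member shapes below are assumptions for illustration: the driver registers a callback with the (fake) ADS client, and the fake exposes a `Fire(...)` helper so unit tests can push a value change without a running runtime.

```csharp
using System;
using System.Collections.Generic;

// Illustrative member shapes only; the repo's abstraction differs in detail.
public interface ITwinCATClient
{
    int AddDeviceNotification(string symbolPath, Action<string, object> callback);
    void DeleteDeviceNotification(int handle);
}

public sealed class FakeTwinCATClient : ITwinCATClient
{
    private readonly Dictionary<int, (string Path, Action<string, object> Callback)> _subs = new();
    private int _nextHandle = 1;

    public int AddDeviceNotification(string symbolPath, Action<string, object> callback)
    {
        int handle = _nextHandle++;
        _subs[handle] = (symbolPath, callback);
        return handle;
    }

    public void DeleteDeviceNotification(int handle) => _subs.Remove(handle);

    // Test harness: simulate the runtime delivering a cycle-boundary notification.
    public void Fire(string symbolPath, object value)
    {
        foreach (var (path, callback) in _subs.Values)
            if (path == symbolPath)
                callback(symbolPath, value);
    }
}
```

A notification test then registers through the driver, calls `Fire(...)`, and asserts the value surfaced via `ISubscribable.OnDataChange`; unsubscribe is verified by firing again and asserting silence.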
## What it actually covers
### Integration (XAR VM, task #221 — code scaffolded, needs VM + project)
- `TwinCAT3SmokeTests.Driver_reads_seeded_DINT_through_real_ADS` — real AMS
handshake + ADS read of `GVL_Fixture.nCounter` (seeded at 1234, MAIN
increments each cycle)
- `TwinCAT3SmokeTests.Driver_write_then_read_round_trip_on_scratch_REAL` —
real ADS write + read on `GVL_Fixture.rSetpoint`
- `TwinCAT3SmokeTests.Driver_subscribe_receives_native_ADS_notifications_on_counter_changes`
— real `AddDeviceNotification` against the cycle-incrementing counter;
observes `OnDataChange` firing within 3 s of subscribe
All three gated on `TWINCAT_TARGET_HOST` + `TWINCAT_TARGET_NETID` env
vars; skip cleanly via `[TwinCATFact]` when the VM isn't reachable or
vars are unset.
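An env-var-gated xUnit fact in the style of `[TwinCATFact]` can be sketched as a `FactAttribute` subclass that sets `Skip` in its constructor. The real attribute also probes the runtime over TCP; shown here is only the env-var gate, and the internals are assumptions, not the repo's code.

```csharp
using System;
using Xunit;

// Assumed shape: FactAttribute subclass; setting Skip in the ctor makes
// xUnit report the test as skipped (with the reason) instead of failing.
public sealed class TwinCATFactAttribute : FactAttribute
{
    public TwinCATFactAttribute()
    {
        if (string.IsNullOrEmpty(Environment.GetEnvironmentVariable("TWINCAT_TARGET_HOST")) ||
            string.IsNullOrEmpty(Environment.GetEnvironmentVariable("TWINCAT_TARGET_NETID")))
        {
            Skip = "TWINCAT_TARGET_HOST / TWINCAT_TARGET_NETID not set; XAR VM unavailable.";
        }
    }
}
```

Tests decorated `[TwinCATFact]` then transition skip → pass the moment the VM and env vars appear, with no code change.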
### Unit
- `TwinCATAmsAddressTests` — `ads://<netId>:<port>` parsing + routing
- `TwinCATCapabilityTests` — data-type mapping (primitives + declared UDTs),
read-only classification
- `TwinCATReadWriteTests` — read + write through the fake, status mapping
- `TwinCATSymbolPathTests` — symbol-path routing for nested struct members
- `TwinCATSymbolBrowserTests` — `ITagDiscovery.DiscoverAsync` via
`ReadSymbolsAsync` (#188) + system-symbol filtering
- `TwinCATNativeNotificationTests` — `AddDeviceNotification` (#189)
registration, callback-delivery-to-`OnDataChange` wiring, unregister on
unsubscribe
- `TwinCATDriverTests` — `IDriver` lifecycle
Capability surfaces whose contract is verified: `IDriver`, `IReadable`,
`IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`,
`IPerCallHostResolver`.
## What it does NOT cover
### 1. AMS / ADS wire traffic
No real AMS router frame is exchanged. Beckhoff's `TwinCAT.Ads` NuGet (their
own .NET SDK, not libplctag-style OSS) has no in-process fake; tests stub
the `ITwinCATClient` abstraction above it.
### 2. Multi-route AMS
ADS supports chained routes (`<localNetId> → <routerNetId> → <targetNetId>`)
for PLCs behind an EC master / IPC gateway. Parse coverage exists; wire-path
coverage doesn't.
### 3. Notification reliability under jitter
`AddDeviceNotification` delivers at the runtime's cycle boundary; under high
CPU load or network jitter real notifications can coalesce. The fake fires
one callback per test invocation — real callback-coalescing behavior is
untested.
### 4. TC2 vs TC3 variant handling
TwinCAT 2 (ADS v1) and TwinCAT 3 (ADS v2) have subtly different
`GetSymbolInfoByName` semantics + symbol-table layouts. Driver targets TC3;
TC2 compatibility is not exercised.
### 5. Cycle-time alignment for `ISubscribable`
Native ADS notifications fire on the PLC cycle boundary. The fake test
harness assumes notifications fire on a timer the test controls;
cycle-aligned firing under real PLC control is not verified.
### 6. Alarms / history
Driver doesn't implement `IAlarmSource` or `IHistoryProvider` — not in
scope for this driver family. TwinCAT 3's TcEventLogger could theoretically
back an `IAlarmSource`, but shipping that is a separate feature.
## When to trust TwinCAT tests, when to reach for a rig
| Question | Unit tests | Real TwinCAT runtime |
| --- | --- | --- |
| "Does the AMS address parser accept X?" | yes | - |
| "Does notification → `OnDataChange` wire correctly?" | yes (contract) | yes |
| "Does symbol browsing filter TwinCAT internals?" | yes | yes |
| "Does a real ADS read return correct bytes?" | no | yes (required) |
| "Do notifications coalesce under load?" | no | yes (required) |
| "Does a TC2 PLC work the same as TC3?" | no | yes (required) |
## Follow-up candidates
1. **XAR VM live-population** — scaffolding is in place (this PR); the
remaining work is operational: stand up the Hyper-V VM, install XAR,
author the `.tsproj` per `TwinCatProject/README.md`, configure the
bilateral ADS route, set `TWINCAT_TARGET_HOST` + `TWINCAT_TARGET_NETID`
on the dev box. Then the three smoke tests transition skip → pass.
Tracked as #221.
2. **License-rotation automation** — XAR's 7-day trial expires on
schedule. Either automate `TcActivate.exe /reactivate` via a
scheduled task on the VM (not officially supported; reportedly works
for some TC3 builds), or buy a paid runtime license (~$1k one-time
per runtime per CPU) to kill the rotation. The doc at
`TwinCatProject/README.md` §License rotation walks through both.
3. **Lab rig** — cheapest IPC (CX7000 / CX9020) on a dedicated network;
the only route that covers TC2 + real EtherCAT I/O timing + cycle
jitter under CPU load.
## Key fixture / config files
- `tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/TwinCATXarFixture.cs`
— TCP probe + skip-attributes + env-var parsing
- `tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/TwinCAT3SmokeTests.cs`
— three wire-level smoke tests
- `tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/TwinCatProject/README.md`
— project spec + VM setup + license-rotation notes
- `tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Tests/FakeTwinCATClient.cs` —
in-process fake with the notification-fire harness used by
`TwinCATNativeNotificationTests`
- `src/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT/TwinCATDriver.cs` — ctor takes
`ITwinCATClientFactory`


@@ -1,8 +1,10 @@
# OPC UA Client Requirements
## Overview
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). The Client surface (shared library + CLI + UI) shipped for v2 is preserved; this refresh restructures the document into numbered, directly-verifiable requirements (CLI-* and UI-* prefixes) layered on top of the existing detailed design content. Requirement coverage added for the `redundancy` command, alarm subscribe/ack round-trip, history-read, and UI tree-browser drag-to-subscribe behaviors. Original design-spec material for `ConnectionSettings`, `IOpcUaClientService`, models, and view-models is retained as reference-level details below the numbered requirements.
Three new .NET 10 cross-platform projects providing a shared OPC UA client library, a CLI tool, and an Avalonia desktop UI. All projects target Windows and macOS.
Parent: [HLR-001](HighLevelReqs.md#hlr-001-opc-ua-server), [HLR-009](HighLevelReqs.md#hlr-009-transport-security-and-authentication), [HLR-013](HighLevelReqs.md#hlr-013-cluster-redundancy)
See also: `docs/Client.CLI.md`, `docs/Client.UI.md`.
## Projects
@@ -10,134 +12,161 @@ Three new .NET 10 cross-platform projects providing a shared OPC UA client libra
|---------|------|---------|
| `ZB.MOM.WW.OtOpcUa.Client.Shared` | Class library | Core OPC UA client, models, interfaces |
| `ZB.MOM.WW.OtOpcUa.Client.CLI` | Console app | Command-line interface using CliFx |
| `ZB.MOM.WW.OtOpcUa.Client.UI` | Avalonia app | Desktop UI |
| `ZB.MOM.WW.OtOpcUa.Client.Shared.Tests` | Test project | Shared-library unit tests |
| `ZB.MOM.WW.OtOpcUa.Client.CLI.Tests` | Test project | CLI command tests |
| `ZB.MOM.WW.OtOpcUa.Client.UI.Tests` | Test project | ViewModel unit tests |
## Shared Requirements (Client.Shared)
### SHR-001: Single Service Interface
The Client.Shared library shall expose a single service interface `IOpcUaClientService` covering connect, disconnect, read, write, browse, subscribe, alarm-subscribe, alarm-ack, history-read-raw, history-read-aggregate, and get-redundancy-info operations.
### SHR-002: ConnectionSettings Model
The library shall expose a `ConnectionSettings` record with the fields: `EndpointUrl` (required), `FailoverUrls[]`, `Username`, `Password`, `SecurityMode` (None/Sign/SignAndEncrypt; default None), `SessionTimeoutSeconds` (default 60), `AutoAcceptCertificates` (default true), `CertificateStorePath`.
### SHR-003: Automatic Failover
The library shall monitor session keep-alive and automatically fail over across `FailoverUrls` when the primary endpoint is unreachable, emitting a `ConnectionStateChanged` event on each transition (Disconnected / Connecting / Connected / Reconnecting).
### SHR-004: Cross-Platform Certificate Store
The library shall auto-generate a client certificate on first use and store it in a cross-platform path (default `{AppData}/OtOpcUaClient/pki/`). Server certificates are auto-accepted when `AutoAcceptCertificates = true`.
### SHR-005: Type-Coercing Write
The library's `WriteValueAsync(NodeId, object)` shall read the node's current value to determine target type and coerce the input value before sending.
### SHR-006: UI-Thread Dispatch Neutrality
The library shall not assume any specific synchronization context. Events (`DataChanged`, `AlarmEvent`, `ConnectionStateChanged`) are raised on the OPC UA stack thread; the consuming CLI / UI is responsible for dispatching to its UI thread.
---
## CLI Requirements (Client.CLI)
### CLI-001: Command Surface
The CLI shall expose the following commands: `connect`, `read`, `write`, `browse`, `subscribe`, `historyread`, `alarms`, `redundancy`.
### CLI-002: Common Options
All CLI commands shall accept the options `-u, --url` (required), `-U, --username`, `-P, --password`, `-S, --security none|sign|encrypt`, `-F, --failover-urls` (comma-separated), `--verbose`.
### CLI-003: Connect Command
The `connect` command shall attempt to establish a session using the supplied options and print `Connected` plus the resolved endpoint's `ServerUriArray` and `ApplicationUri` on success, or a diagnostic error message on failure.
### CLI-004: Read Command
The `read -n <NodeId>` command shall print `NodeId`, `Value`, `StatusCode`, `SourceTimestamp`, `ServerTimestamp` one per line.
### CLI-005: Write Command
The `write -n <NodeId> -v <value>` command shall coerce the value to the node's current type (per SHR-005) and print the resulting `StatusCode`. A `Bad_UserAccessDenied` result is printed verbatim so operators see the authorization outcome.
### CLI-006: Browse Command
The `browse [-n <parent>] [-r] [-d <depth>]` command shall list child nodes under `parent` (or the `Objects` folder if omitted). `-r` enables recursion up to `-d` depth (default 1).
### CLI-007: Subscribe Command
The `subscribe -n <NodeId> -i <intervalMs>` command shall create a monitored item at `intervalMs` publishing interval, print each `DataChanged` event as `<timestamp> <nodeId> <value> <status>` until Ctrl-C, then cleanly unsubscribe.
### CLI-008: Historyread Command
The `historyread -n <NodeId> --start <utc> --end <utc> [--max <n>] [--aggregate <type> --interval <ms>]` command shall print raw values or aggregate buckets. Supported aggregate types: Average, Minimum, Maximum, Count, Start, End.
### CLI-009: Alarms Command
The `alarms [-n <source>] [-i <intervalMs>]` command shall subscribe to alarm events, print each event as `<time> <source> <condition> <severity> <state> <acked> <message>`, accept `ack <conditionId>` commands interactively, and support `refresh` to trigger `RequestConditionRefreshAsync`.
### CLI-010: Redundancy Command
The `redundancy` command shall call `GetRedundancyInfoAsync` and print `Mode`, `ServiceLevel`, `ApplicationUri`, and `ServerUris` (one per line). Suitable for redundancy-failover smoke tests.
### CLI-011: Logging
The CLI shall use Serilog console sink at `Warning` minimum by default; `--verbose` raises to `Debug`.
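A minimal sketch of that bootstrap, assuming the Serilog console sink package (this is an illustrative shape, not the shipped CLI code):

```csharp
using Serilog;
using Serilog.Core;
using Serilog.Events;

// Sketch of the CLI-011 logger bootstrap: console sink at Warning by
// default, raised to Debug when --verbose is passed.
public static class CliLogging
{
    public static Logger Create(bool verbose) =>
        new LoggerConfiguration()
            .MinimumLevel.Is(verbose ? LogEventLevel.Debug : LogEventLevel.Warning)
            .WriteTo.Console()
            .CreateLogger();
}
```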
---
## UI Requirements (Client.UI)
### UI-001: Connection Panel
The UI shall present a top-bar connection panel with fields for Endpoint URL, Username, Password, Security mode, and a Connect / Disconnect button. The resolved `RedundancyInfo` is displayed next to the bar on successful connect.
### UI-002: Tree Browser
The UI shall present a left-pane tree browser backed by `IOpcUaClientService.BrowseAsync`, lazy-loading children on node expansion (one level per `BrowseAsync` call).
### UI-003: Read/Write Tab
The UI shall provide a Read/Write tab that auto-reads the selected tree node's current value, displays `Value` + `StatusCode` + `SourceTimestamp`, and accepts a write value with a Send button.
### UI-004: Subscriptions Tab
The UI shall provide a Subscriptions tab that lists active monitored items (columns: NodeId, Value, Status, Timestamp), supports Add and Remove, and dispatches `DataChanged` events to the Avalonia UI thread via `Dispatcher.UIThread.Post`.
### UI-005: Alarms Tab
The UI shall provide an Alarms tab that supports SubscribeAlarms / UnsubscribeAlarms / RefreshConditions commands, displays live alarm events, and supports `Acknowledge` on selected events. Acknowledgment failure (including `Bad_UserAccessDenied`) is surfaced to the user.
### UI-006: History Tab
The UI shall provide a History tab with inputs for StartTime, EndTime, MaxValues, AggregateType, Interval, a Read command, and a results table with columns (Timestamp, Value, Status).
### UI-007: Connection State Reflects in UI
All tabs shall reflect the connection state — when disconnected, all action commands are disabled; the status bar shows `Disconnected` / `Connecting` / `Connected` / `Reconnecting` tied to the `ConnectionStateChanged` event.
### UI-008: Cross-Platform
The UI shall build and run on Windows (win-x64) and macOS (osx-arm64 / osx-x64). No platform-specific OPC UA stack APIs are used.
---
## Technology Stack
- .NET 10, C#
- OPC UA: `OPCFoundation.NetStandard.Opc.Ua.Client`
- Logging: Serilog
- CLI: CliFx
- UI: Avalonia 11.x with CommunityToolkit.Mvvm
- Tests: xUnit 3, Shouldly, Microsoft.Testing.Platform runner
## Client.Shared — Design Detail
### ConnectionSettings Model
```
EndpointUrl: string (required)
FailoverUrls: string[] (optional)
Username: string? (optional, first-class property)
Password: string? (optional, first-class property)
SecurityMode: enum (None, Sign, SignAndEncrypt) — default None
SessionTimeoutSeconds: int — default 60
AutoAcceptCertificates: bool — default true
CertificateStorePath: string? — default platform-appropriate location
```
### IOpcUaClientService Interface (reference)
**Lifecycle:** `ConnectAsync(ConnectionSettings)`, `DisconnectAsync()`, `IsConnected`.
**Read/Write:** `ReadValueAsync(NodeId)`, `WriteValueAsync(NodeId, object)`.
**Browse:** `BrowseAsync(NodeId? parent)` → `BrowseResult[]` (NodeId, DisplayName, NodeClass, HasChildren); lazy-load compatible.
**Subscribe:** `SubscribeAsync(NodeId, int intervalMs)`, `UnsubscribeAsync(NodeId)`, `event DataChanged(NodeId, DataValue)`.
**Alarms:** `SubscribeAlarmsAsync(NodeId? source, int intervalMs)`, `UnsubscribeAlarmsAsync()`, `AcknowledgeAsync(conditionId, comment)`, `RequestConditionRefreshAsync()`, `event AlarmEvent(AlarmEventArgs)`.
**History:** `HistoryReadRawAsync(NodeId, start, end, maxValues)`, `HistoryReadAggregateAsync(NodeId, start, end, AggregateType, intervalMs)`.
**Failover:** automatic failover across `FailoverUrls` with keep-alive monitoring; `event ConnectionStateChanged` fires on connect / disconnect / failover.
**Redundancy:** `GetRedundancyInfoAsync()` → `RedundancyInfo` (Mode, ServiceLevel, ServerUris, ApplicationUri).
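As a reference sketch, the settings model could be expressed as a C# record. The record shape, defaults, and validation below follow SHR-002 but are assumptions, not the shipped type:

```csharp
using System;

// Sketch matching SHR-002's field list; defaults are the documented ones.
public enum SecurityMode { None, Sign, SignAndEncrypt }

public sealed record ConnectionSettings(
    string EndpointUrl,                  // required
    string[]? FailoverUrls = null,
    string? Username = null,
    string? Password = null,
    SecurityMode SecurityMode = SecurityMode.None,
    int SessionTimeoutSeconds = 60,
    bool AutoAcceptCertificates = true,
    string? CertificateStorePath = null) // null => platform default pki path
{
    public ConnectionSettings Validate()
    {
        if (string.IsNullOrWhiteSpace(EndpointUrl))
            throw new ArgumentException("EndpointUrl is required.");
        return this;
    }
}
```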
### Models
- `BrowseResult` — NodeId, DisplayName, NodeClass, HasChildren
- `AlarmEventArgs` — SourceName, ConditionName, Severity, Message, Retain, ActiveState, AckedState, Time
- `RedundancyInfo` — Mode, ServiceLevel, ServerUris, ApplicationUri
- `ConnectionState` — enum (Disconnected, Connecting, Connected, Reconnecting)
- `AggregateType` — enum (Average, Minimum, Maximum, Count, Start, End)
### Type Conversion
Port the existing `ConvertValue` logic from the CLI tool: reads the current node value to determine the target type, then coerces the input value.
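A minimal sketch of that read-then-coerce approach (the helper name is hypothetical; the shipped `ConvertValue` handles more types):

```csharp
using System;
using System.Globalization;

// Hypothetical helper illustrating SHR-005: the target type comes from the
// node's current value, then the string input is coerced to that type
// before the write request is sent.
public static class WriteCoercion
{
    public static object CoerceToCurrentType(object currentValue, string input) =>
        Convert.ChangeType(input, currentValue.GetType(), CultureInfo.InvariantCulture);
}
```

With this shape, `write -n ns=3;s=Tag -v 42.5` lands as a `float` when the node currently holds a `float`, and as a `bool` when it holds a `bool`.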
### Certificate Management
- Cross-platform certificate store path (default: `{AppData}/OtOpcUaClient/pki/`, per SHR-004)
- Auto-generate client certificate on first use
- Auto-accept untrusted server certificates (configurable)
### Logging
Serilog with `ILogger` passed via constructor or `Log.ForContext<T>()`. No sinks configured in the library — consumers configure sinks.
## Client.CLI
### Commands
All 8 commands:
| Command | Description |
|---------|-------------|
| `connect` | Test server connectivity |
| `read` | Read a node value |
| `write` | Write a value to a node |
| `browse` | Browse address space (with depth/recursive) |
| `subscribe` | Monitor node for value changes |
| `historyread` | Read historical data (raw + aggregates) |
| `alarms` | Subscribe to alarm events |
| `redundancy` | Query redundancy state |
All commands use the shared `IOpcUaClientService`. Each command:
1. Creates `ConnectionSettings` from CLI options
2. Creates `OpcUaClientService`
3. Calls the appropriate method
4. Formats and prints results
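Under those four steps, a command sketch with CliFx might look like this. `OpcUaClientService`, `ReadValueAsync`, and the `DataValue` shape are illustrative assumptions standing in for the shared library; the attribute usage is CliFx's real API.

```csharp
using System.Threading.Tasks;
using CliFx;
using CliFx.Attributes;
using CliFx.Infrastructure;

// Illustrative only; not the shipped ReadCommand. Option names follow
// CLI-002 / CLI-004.
[Command("read", Description = "Read a node value")]
public class ReadCommand : ICommand
{
    [CommandOption("url", 'u', IsRequired = true)]
    public string Url { get; init; } = "";

    [CommandOption("node", 'n', IsRequired = true)]
    public string NodeId { get; init; } = "";

    public async ValueTask ExecuteAsync(IConsole console)
    {
        // 1. settings from options  2. service  3. call  4. print
        var service = new OpcUaClientService(new ConnectionSettings(Url)); // hypothetical
        var dv = await service.ReadValueAsync(NodeId);
        console.Output.WriteLine($"NodeId: {NodeId}");
        console.Output.WriteLine($"Value: {dv.Value}");
        console.Output.WriteLine($"StatusCode: {dv.StatusCode}");
    }
}
```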
### Common Options (all commands)
- `-u, --url` (required): Endpoint URL
- `-U, --username`: Username
- `-P, --password`: Password
- `-S, --security`: Security mode (none/sign/encrypt)
- `-F, --failover-urls`: Comma-separated failover endpoints
### Logging
Serilog console sink at Warning level by default, with `--verbose` flag for Debug.
## Client.UI — View Layout (reference)
Single-window Avalonia application:
```
┌─────────────────────────────────────────────────────────┐
│ [Endpoint URL] [User] [Pass] [Security▼] [Connect]      │
│ Redundancy: Mode=Warm ServiceLevel=200 AppUri=...       │
├──────────────┬──────────────────────────────────────────┤
│              │ ┌Read/Write─Subscriptions─Alarms─History┐│
│ Address      │ │ Node: ns=3;s=Tag.Attr                 ││
│ Space        │ │ Value: 42.5   Status: Good            ││
│ Tree         │ │ [Write: ____] [Send]                  ││
│ Browser      │ └───────────────────────────────────────┘│
├──────────────┴──────────────────────────────────────────┤
│ Status: Connected | Session: abc123 | 3 subscriptions   │
└─────────────────────────────────────────────────────────┘
```
### Views and ViewModels (CommunityToolkit.Mvvm)
### ViewModels (CommunityToolkit.Mvvm)
**MainWindowViewModel:**
- Connection settings properties (bound to top bar inputs)
- ConnectCommand / DisconnectCommand (RelayCommand)
- ConnectionState property
- RedundancyInfo property
- SelectedTreeNode property
- StatusMessage property
**BrowseTreeViewModel:**
- Root nodes collection (ObservableCollection)
- Lazy-load children on expand via `BrowseAsync`
- TreeNodeViewModel: NodeId, DisplayName, NodeClass, Children, IsExpanded, HasChildren
**ReadWriteViewModel:**
- SelectedNode (from tree selection)
- CurrentValue, Status, SourceTimestamp
- WriteValue input + WriteCommand
- Auto-read on node selection
**SubscriptionsViewModel:**
- ActiveSubscriptions collection (ObservableCollection)
- AddSubscription / RemoveSubscription commands
- Live value updates dispatched to UI thread
- Columns: NodeId, Value, Status, Timestamp
**AlarmsViewModel:**
- AlarmEvents collection (ObservableCollection)
- SubscribeCommand / UnsubscribeCommand / RefreshCommand
- MonitoredNode property
- Live alarm events dispatched to UI thread
**HistoryViewModel:**
- SelectedNode (from tree selection)
- StartTime, EndTime, MaxValues, AggregateType, Interval
- ReadCommand
- Results collection (ObservableCollection)
- Columns: Timestamp, Value, Status
### UI Thread Dispatch
All events from `IOpcUaClientService` must be dispatched to the Avalonia UI thread via `Dispatcher.UIThread.Post()` before updating ObservableCollections.
- `MainWindowViewModel` — connection fields, connect/disconnect commands, `ConnectionState`, `RedundancyInfo`, `SelectedTreeNode`, `StatusMessage`.
- `BrowseTreeViewModel` — root collection (`ObservableCollection<TreeNodeViewModel>`), lazy-load on expand.
- `ReadWriteViewModel` — auto-read on selection, `WriteValue` + `WriteCommand`.
- `SubscriptionsViewModel` — `ActiveSubscriptions`, `AddSubscription`, `RemoveSubscription`, live `DataChanged` dispatch to UI thread.
- `AlarmsViewModel` — `AlarmEvents`, Subscribe / Unsubscribe / Refresh / Acknowledge commands.
- `HistoryViewModel` — `StartTime`, `EndTime`, `MaxValues`, `AggregateType`, `Interval`, `ReadCommand`, `Results`.
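The dispatch rule above can be sketched as follows. This is a minimal illustration only — `DataChangedEventArgs`, its properties, and the subscription-row type are hypothetical placeholders, not the project's actual types:

```csharp
using System.Linq;
using Avalonia.Threading;

public partial class SubscriptionsViewModel
{
    // Hypothetical handler shape; the real event/args may differ.
    private void OnDataChanged(object? sender, DataChangedEventArgs e)
    {
        // Service events arrive on a background thread; ObservableCollection
        // must only be mutated on the Avalonia UI thread.
        Dispatcher.UIThread.Post(() =>
        {
            var row = ActiveSubscriptions.FirstOrDefault(r => r.NodeId == e.NodeId);
            if (row is null) return;
            row.Value = e.Value;
            row.Status = e.Status;
            row.Timestamp = e.SourceTimestamp;
        });
    }
}
```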
## Test Projects
### Client.Shared.Tests
- `ConnectionSettings` validation
- Type conversion
- `BrowseResult` / `AlarmEventArgs` / `RedundancyInfo` model construction
- FailoverUrl parsing
### Client.CLI.Tests
- Command option parsing (via CliFx test infrastructure)
- Output formatting for each command
### Client.UI.Tests
- ViewModel property-change notifications
- Command `CanExecute` logic
- Tree lazy-load behavior (with mocked `IOpcUaClientService`)
### Test Framework
- xUnit 3 with Microsoft.Testing.Platform runner
- Shouldly
- No live OPC UA server required — mock `IOpcUaClientService` for unit tests


@@ -1,106 +1,113 @@
# Galaxy Driver — Galaxy Repository Requirements
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). Scope clarified: this document is **Galaxy-driver-specific**. Galaxy is one of seven drivers in the OtOpcUa platform; the requirements below describe the SQL-side of the Galaxy driver (hierarchy/attribute/change-detection queries against the ZB database) that backs the Galaxy driver's `ITagDiscovery.DiscoverAsync` and `IRediscoverable` implementations. All Galaxy-specific SQL runs inside `OtOpcUa.Galaxy.Host` (.NET 4.8 x86 Windows service); the in-server `Driver.Galaxy.Proxy` calls it over a named pipe. For platform-wide tag discovery requirements see `OpcUaServerReqs.md` OPC-002. For deeper spec see `docs/GalaxyRepository.md` and `docs/v2/driver-specs.md`.
Parent: [HLR-002](HighLevelReqs.md#hlr-002-multi-driver-plug-in-model), [HLR-003](HighLevelReqs.md#hlr-003-address-space-composition-per-namespace), [HLR-006](HighLevelReqs.md#hlr-006-change-detection-and-rediscovery)
Driver scope: Galaxy only. Namespace kind: `SystemPlatform`.
## GR-001: Hierarchy Extraction
The Galaxy driver's `ITagDiscovery.DiscoverAsync` implementation shall query the ZB Galaxy Repository database to extract all deployed objects with their parent-child containment relationships, contained names, and tag names.
### Acceptance Criteria
- Executes `queries/hierarchy.sql` against the ZB database from within `OtOpcUa.Galaxy.Host`.
- Returns a list of objects with: `gobject_id`, `tag_name`, `contained_name`, `browse_name`, `parent_gobject_id`, `is_area`.
- Objects with `parent_gobject_id = 0` become children of the root ZB node inside the `SystemPlatform` namespace.
- Only deployed, non-template objects matching the category filter (areas, engines, user-defined objects, etc.) are returned.
- Query completes within 10 seconds on a typical Galaxy (hundreds of objects). Log Warning if it takes longer.
### Details
- Results are ordered by `parent_gobject_id, tag_name` for deterministic tree building.
- Empty result → Warning logged (Galaxy may have no deployed objects, or the DB connection may be misconfigured).
- Orphan detection: a row referencing a non-existent `parent_gobject_id` (and not 0) is skipped with a Warning.
- Streamed to the core via `IAddressSpaceBuilder.AddFolder` / `AddObject` calls over the Galaxy named pipe; no in-memory full-tree buffering on the Host side.
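The orphan rule above can be sketched with a hypothetical row record (an assumption for illustration — the real code streams rows through `IAddressSpaceBuilder` rather than materializing a list):

```csharp
using System.Collections.Generic;
using System.Linq;
using Serilog;

public sealed record HierarchyRow(
    int GObjectId, string TagName, string ContainedName, int ParentGObjectId, bool IsArea);

public static class HierarchyFilter
{
    // parent_gobject_id = 0 means "child of the root ZB node"; any other
    // parent must exist in the result set, otherwise the row is skipped.
    public static IEnumerable<HierarchyRow> SkipOrphans(
        IReadOnlyList<HierarchyRow> rows, ILogger log)
    {
        var known = new HashSet<int>(rows.Select(r => r.GObjectId));
        foreach (var row in rows)
        {
            if (row.ParentGObjectId != 0 && !known.Contains(row.ParentGObjectId))
            {
                log.Warning("Skipping orphan {Tag}: unknown parent {Parent}",
                            row.TagName, row.ParentGObjectId);
                continue;
            }
            yield return row;
        }
    }
}
```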
---
## GR-002: Attribute Extraction
The Galaxy driver shall query user-defined (dynamic) attributes for deployed objects, including data type, array flag, and array dimensions.
### Acceptance Criteria
- Executes `queries/attributes.sql` using the template chain CTE to resolve inherited attributes.
- Returns: `gobject_id`, `tag_name`, `attribute_name`, `full_tag_reference`, `mx_data_type`, `is_array`, `array_dimension`, `security_classification`.
- Attributes starting with `_` are filtered out by the query.
- `array_dimension` is extracted from the `mx_value` hex bytes (positions 13-16, little-endian uint16).
### Details
- CTE recursion depth is limited to 10 levels.
- `mx_data_type` not in the known set (1-8, 13-16) defaults to String.
- `gobject_id` that doesn't match a hierarchy object is skipped (object may not be deployed).
- Each emitted attribute is reported via `DriverAttributeInfo` to the core through `IAddressSpaceBuilder.AddVariable`.
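The `array_dimension` decode can be sketched as below. Reading "positions 13-16" as 1-based hex-character positions (substring index 12, length 4) is an assumption about the wording; verify against the actual `mx_value` layout:

```csharp
using System;

public static class MxValueDecoder
{
    // Sketch: a little-endian uint16 stored at hex-character positions 13-16
    // of the mx_value blob (assumed 1-based here).
    public static ushort DecodeArrayDimension(string mxValueHex)
    {
        byte lo = Convert.ToByte(mxValueHex.Substring(12, 2), 16); // chars 13-14
        byte hi = Convert.ToByte(mxValueHex.Substring(14, 2), 16); // chars 15-16
        return (ushort)(lo | (hi << 8));                           // little-endian
    }
}
```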
---
## GR-003: Change Detection and IRediscoverable
The Galaxy driver shall implement `IRediscoverable` by polling `galaxy.time_of_last_deploy` on a configurable interval to detect when a new deployment has occurred.
### Acceptance Criteria
- Polls `SELECT time_of_last_deploy FROM galaxy` at a configurable interval (`Galaxy:ChangeDetectionIntervalSeconds`, default 30 seconds).
- Compares the returned timestamp to the last known value stored in memory.
- If different, raises the `IRediscoverable.RediscoveryNeeded` signal so the core re-runs `ITagDiscovery.DiscoverAsync` and surgically rebuilds the Galaxy namespace subtree (per OPC-017).
- First poll after startup always triggers an initial discovery.
- Query failure → Warning logged; no rediscovery triggered; retry at next interval.
### Details
- Polling runs on a background `Task` inside `OtOpcUa.Galaxy.Host`, not on the STA message-pump thread.
- `time_of_last_deploy` is a `datetime` column; compared using exact equality (not a range).
- Signal delivery to the Proxy happens via a server-push message on the Galaxy named pipe.
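The polling contract above might be implemented roughly like this; `RaiseRediscoveryNeeded`, `connString`, `pollInterval`, `log`, and `ct` are illustrative stand-ins, not the actual member names:

```csharp
// Sketch only: exact-equality compare, first poll always triggers discovery,
// query failure logs Warning and never triggers rediscovery.
DateTime? lastDeploy = null;
while (!ct.IsCancellationRequested)
{
    try
    {
        using (var conn = new SqlConnection(connString))
        using (var cmd = new SqlCommand("SELECT time_of_last_deploy FROM galaxy", conn))
        {
            await conn.OpenAsync(ct);
            var current = (DateTime)await cmd.ExecuteScalarAsync(ct);
            if (lastDeploy == null || current != lastDeploy.Value)
            {
                lastDeploy = current;
                RaiseRediscoveryNeeded(); // server-push message on the named pipe
            }
        }
    }
    catch (Exception ex)
    {
        log.Warning(ex, "time_of_last_deploy poll failed; retrying next interval");
        // no rediscovery on failure
    }
    await Task.Delay(pollInterval, ct);
}
```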
---
## GR-004: Rediscovery Data Flow
On a deployment change, the Galaxy driver shall re-query hierarchy + attributes and stream the updated structure to the core for surgical namespace rebuild.
### Acceptance Criteria
- On change signal, re-run `GR-001` (hierarchy) and `GR-002` (attributes) queries.
- Stream the new tree to the core via `IAddressSpaceBuilder` over the named pipe.
- Log at Information level: `"Galaxy deployment change detected. Rebuilding. ({ObjectCount} objects, {AttributeCount} attributes)"`.
- Log total rediscovery duration at Information level.
- On re-query failure: Error logged; existing Galaxy subtree is retained.
### Details
- Rediscovery is not atomic from the DB perspective — hierarchy and attributes are two separate queries. Acceptable; Galaxy deployment is an infrequent operation.
- The core owns the diff/surgical apply per OPC-017; the Galaxy driver only streams the new authoritative tree.
---
## GR-005: Connection Configuration
Galaxy DB connection parameters shall be configurable via environment variables passed from the `OtOpcUa.Galaxy.Host` supervisor at spawn time.
### Acceptance Criteria
- Connection string via `OTOPCUA_GALAXY_ZB_CONN` environment variable.
- Default: `Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;` (Windows Auth).
- ADO.NET `SqlConnection` used for queries (.NET Framework 4.8).
- Connection is opened per-query (not kept open). Connection pooling handles efficiency.
- If the initial connection test at startup fails, log Error with the connection string sanitized and continue attempting (change-detection polls keep retrying).
### Details
- Command timeout: `Galaxy:CommandTimeoutSeconds` in Config DB driver JSON (default 30 seconds).
- No ORM. Raw ADO.NET with `SqlCommand` and `SqlDataReader`. SQL text embedded as constants.
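The per-query connection pattern can be sketched as below, doubling as the GR-007 startup probe. The `OTOPCUA_GALAXY_ZB_CONN` variable and default connection string come from GR-005; the method name and return-value handling are illustrative:

```csharp
using System;
using System.Data.SqlClient;

public static class GalaxyDbProbe
{
    // Sketch: connection opened per query (pooling handles reuse),
    // configurable command timeout, SELECT 1 as the validation query.
    public static bool TryProbe(int commandTimeoutSeconds = 30)
    {
        var connString = Environment.GetEnvironmentVariable("OTOPCUA_GALAXY_ZB_CONN")
            ?? "Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;";
        try
        {
            using (var conn = new SqlConnection(connString))
            using (var cmd = new SqlCommand("SELECT 1", conn))
            {
                cmd.CommandTimeout = commandTimeoutSeconds;
                conn.Open();
                return (int)cmd.ExecuteScalar() == 1;
            }
        }
        catch (SqlException)
        {
            return false; // caller logs Error (sanitized) and keeps retrying
        }
    }
}
```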
---
## GR-006: Query Safety
All Galaxy SQL queries shall be static read-only SELECT statements. No writes to the Galaxy Repository database.
### Acceptance Criteria
@@ -112,10 +119,23 @@ All SQL queries shall be static read-only SELECT statements. No writes to the Ga
## GR-007: Startup Validation
On startup, the Galaxy driver's DB component inside `OtOpcUa.Galaxy.Host` shall validate database connectivity.
### Acceptance Criteria
- Execute a simple test query (`SELECT 1`) against the configured Galaxy DB.
- If the database is unreachable, log Error but do not prevent Host startup.
- The Galaxy driver runs in degraded mode (empty SystemPlatform namespace) until the database becomes available and the next change-detection poll succeeds.
- In degraded mode the Galaxy driver instance reports `DriverHealth.Unavailable`, causing its Polly circuit state to be open until the first successful discovery.
---
## GR-008: Capability Wrapping
All calls into the Galaxy DB component from the Proxy side shall route through `CapabilityInvoker.InvokeAsync(DriverCapability.Discover, …)`.
### Acceptance Criteria
- `Driver.Galaxy.Proxy.DiscoverAsync` is a thin capability-invoker call that sends a MessagePack request over the named pipe to the Host's DB component.
- Roslyn analyzer **OTOPCUA0001** validates there are no direct discovery calls bypassing the invoker.
- Polly pipeline for `DriverCapability.Discover` on the Galaxy driver instance carries Timeout + Retry + CircuitBreaker.
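The wrapping rule above implies the Proxy's discovery entry point reduces to a thin invoker call. The shape below is a hypothetical sketch — the `InvokeAsync` signature, pipe-client helper, and message types are assumptions, not the real API:

```csharp
// GR-008 sketch: never call the pipe directly; route through
// CapabilityInvoker so Timeout + Retry + CircuitBreaker always apply.
public Task<DiscoveryResult> DiscoverAsync(CancellationToken ct) =>
    _capabilityInvoker.InvokeAsync(
        DriverCapability.Discover,
        _instanceId,
        token => _pipe.SendAsync<DiscoverRequest, DiscoveryResult>(new DiscoverRequest(), token),
        ct);
```

Any direct call to `ITagDiscovery.DiscoverAsync` outside the invoker would trip analyzer OTOPCUA0001 at build time.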


@@ -1,47 +1,94 @@
# High-Level Requirements
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). The original 2025 text described a single-process Galaxy/MXAccess server called LmxOpcUa. Today the project is the **OtOpcUa** multi-driver OPC UA platform deployed as three cooperating processes (Server, Admin, Galaxy.Host). The Galaxy integration is one of seven shipped drivers. HLR-001 through HLR-008 have been rewritten driver-agnostically; HLR-009 has been retired (the embedded Status Dashboard is superseded by the Admin UI). HLR-010 through HLR-017 are new and cover plug-in drivers, resilience, Config DB / draft-publish, cluster redundancy, fleet-wide identifier uniqueness, Admin UI, audit logging, metrics, and the Roslyn capability-wrapping analyzer.
## HLR-001: OPC UA Server
The system shall expose an OPC UA server endpoint that OPC UA clients can connect to for browsing, reading, writing, subscribing, acknowledging alarms, and reading historical values. Data is sourced from one or more **driver instances** that plug into the common core; OPC UA clients see a single unified address space per endpoint regardless of how many drivers are active behind it.
## HLR-002: Multi-Driver Plug-In Model
The system shall support pluggable driver modules that bind to specific data sources. v2.0 ships seven drivers: Galaxy (AVEVA System Platform via MXAccess), Modbus TCP (including DL205 via `AddressFormat=DL205`), Allen-Bradley CIP (ControlLogix/CompactLogix), Allen-Bradley Legacy (SLC/MicroLogix via PCCC), Siemens S7, Beckhoff TwinCAT (ADS), FANUC FOCAS, and OPC UA Client (aggregation/gateway). Drivers implement only the capability interfaces (`IDriver`, `ITagDiscovery`, `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IHostConnectivityProbe`, `IPerCallHostResolver`, `IRediscoverable`) defined in `ZB.MOM.WW.OtOpcUa.Core.Abstractions` that apply to their protocol. Multiple instances of the same driver type are supported; each instance binds to its own OPC UA namespace index.
## HLR-003: Address Space Composition per Namespace
The system shall build the OPC UA address space by composing per-driver subtrees into a single endpoint. Each driver instance owns one namespace and registers its nodes via the core-provided `IAddressSpaceBuilder` streaming API. The Galaxy driver continues to mirror the deployed ArchestrA object hierarchy (contained-name browse paths) in a namespace of kind `SystemPlatform`. Native-protocol drivers populate a namespace of kind `Equipment` whose browse structure conforms to the canonical 5-level Unified Namespace (`Enterprise / Site / Area / Line / Equipment / Signal`).
## HLR-004: Data Type Mapping
Each driver shall map its native data types to OPC UA built-in types via `DriverDataType` conversions, including support for arrays (ValueRank=1 with ArrayDimensions). Type mapping is driver-specific — `docs/DataTypeMapping.md` covers Galaxy/MXAccess; each other driver's spec in `docs/v2/driver-specs.md` covers its own mapping. Unknown/unmapped driver types shall default to String per the driver's spec.
## HLR-005: Live Data Access
For every data-path operation (read, write, subscribe notification, alarm event, history read, tag rediscovery, host connectivity probe), the system shall route the call through the capability interface owned by the target driver instance. Reads and subscriptions shall deliver a `DataValueSnapshot` carrying value, OPC UA `StatusCode`, and source timestamp regardless of the underlying protocol. Every async capability invocation at dispatch shall pass through `Core.Resilience.CapabilityInvoker`.
## HLR-006: Change Detection and Rediscovery
Drivers whose backend has a native change signal (e.g. Galaxy's `time_of_last_deploy`, OPC UA Client receiving `ServerStatusChange`) shall implement the optional `IRediscoverable` interface so the core can rebuild only the affected subtree. Drivers whose tag set is static relative to a published config generation are not required to implement `IRediscoverable`; their address-space structure changes only via a new published Config DB generation (see HLR-012).
## HLR-007: Service Hosting
The system shall be deployed as three cooperating Windows services:
- **OtOpcUa.Server** — .NET 10 x64, `Microsoft.Extensions.Hosting` + `AddWindowsService`, hosts all non-Galaxy drivers in-process and the OPC UA endpoint.
- **OtOpcUa.Admin** — .NET 10 x64 Blazor Server web app, hosts the admin UI, SignalR hubs for live updates, `/metrics` Prometheus endpoint, and audit log writers.
- **OtOpcUa.Galaxy.Host** — .NET Framework 4.8 x86 (TopShelf), hosts MXAccess COM + Galaxy Repository SQL + Historian plugin. Talks to `Driver.Galaxy.Proxy` inside `OtOpcUa.Server` via a named pipe (MessagePack over length-prefixed frames, per-process shared secret, SID-restricted ACL).
## HLR-008: Logging
The system shall log operational events to rolling daily file sinks using Serilog on every process. Plain-text is on by default; structured JSON (CompactJsonFormatter) is opt-in via `Serilog:WriteJson = true` so SIEMs (Splunk, Datadog) can ingest without a regex parser.
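The JSON opt-in might look like this in a process's `appsettings.json`. Only the `Serilog:WriteJson` flag comes from the requirement; the file-sink arguments are standard Serilog configuration shown for illustration, with a hypothetical path:

```json
{
  "Serilog": {
    "WriteJson": true,
    "WriteTo": [
      {
        "Name": "File",
        "Args": { "path": "logs/opcua-.log", "rollingInterval": "Day" }
      }
    ]
  }
}
```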
## HLR-009: Transport Security and Authentication
The system shall support configurable OPC UA transport-security profiles (`None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`, `Aes128_Sha256_RsaOaep-Sign`, `Aes128_Sha256_RsaOaep-SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign`, `Aes256_Sha256_RsaPss-SignAndEncrypt`) resolved at startup by `SecurityProfileResolver`. UserName-token authentication shall be validated against LDAP (production: Active Directory; dev: GLAuth). The server certificate is always created even for `None`-only deployments because UserName token encryption depends on it.
## HLR-010: Per-Driver-Instance Resilience
Every async capability call at dispatch shall pass through `Core.Resilience.CapabilityInvoker`, which runs a Polly v8 pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. Retry and circuit-breaker strategies are per capability per decision #143: Read / Discover / Probe / Subscribe / AlarmSubscribe / HistoryRead retry automatically; Write and AlarmAcknowledge do **not** retry unless the tag or capability is explicitly marked with `WriteIdempotentAttribute`. A driver-instance circuit-breaker trip sets Bad quality on that instance's nodes only; other drivers are unaffected (decision #144 — per-host Polly isolation).
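The per-capability retry split from decision #143 can be sketched as a predicate. `MayRetry` is a hypothetical helper (not the real `CapabilityInvoker` API), and the enum members shown are assumptions matching the capability names in the text:

```csharp
using System;

// Assumed member names, mirroring the capabilities listed above.
public enum DriverCapability
{ Read, Write, Discover, Probe, Subscribe, AlarmSubscribe, AlarmAcknowledge, HistoryRead }

public static class RetryPolicySketch
{
    // Write and AlarmAcknowledge retry only when explicitly marked
    // idempotent; every other capability retries automatically.
    public static bool MayRetry(DriverCapability capability, bool markedWriteIdempotent) =>
        capability switch
        {
            DriverCapability.Write or DriverCapability.AlarmAcknowledge => markedWriteIdempotent,
            _ => true
        };
}
```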
## HLR-011: Config DB and Draft/Publish
Cluster topology, driver instances, namespaces, UNS hierarchy, equipment, tags, node ACLs, poll groups, and role grants shall live in a central MSSQL Config DB, not in `appsettings.json`. Changes accumulate in a draft generation that is validated and then atomically published. Each published generation gets a monotonically increasing `GenerationNumber` scoped per cluster. Nodes poll the DB for new published generations and diff-apply surgically against an atomic snapshot. `appsettings.json` is reduced to bootstrap-only fields (Config DB connection, NodeId, ClusterId, LDAP, security profile, redundancy role, logging, local cache path).
## HLR-012: Local Cache Fallback
Each node shall maintain a sealed LiteDB local cache of the most recent successfully applied generation. If the central Config DB is unreachable at startup, the node shall boot from its cached generation and log a warning. Cache reads are the Polly `Fallback` leg of the Config DB pipeline.
## HLR-013: Cluster Redundancy
The system shall support non-transparent OPC UA redundancy via 2-node clusters sharing a Config DB generation. `RedundancyCoordinator` + `ServiceLevelCalculator` compute a dynamic OPC UA `ServiceLevel` reflecting role (Primary/Secondary), publish state (current generation applied vs mid-apply), health (driver circuit-breaker state), and apply-lease state. Clients select an endpoint by `ServerUriArray` + `ServiceLevel` per the OPC UA spec; there is no VIP or load balancer. Single-node deployments use the same model with `NodeCount = 1`.
## HLR-014: Fleet-Wide Identifier Uniqueness
Equipment identifiers that integrate with external systems (`ZTag` for ERP, `SAPID` for SAP PM) shall be unique fleet-wide (across all clusters), not just within a cluster. The Admin UI enforces this at draft-publish time via the `ExternalIdReservation` table, which reserves external IDs across clusters so two clusters cannot publish the same ZTag or SAPID. `EquipmentUuid` is immutable and globally unique (UUIDv4). `EquipmentId` and `MachineCode` are unique within a cluster.
## HLR-015: Admin UI Operator Surface
The system shall provide a Blazor Server Admin UI (`OtOpcUa.Admin`) as the sole write path into the Config DB. Capabilities include: cluster + node management, driver-instance CRUD with schemaless JSON editors, UNS drag-and-drop hierarchy editor, CSV-driven equipment import with fleet-wide external-id reservation, draft/publish with a 6-section diff viewer (Drivers / Namespaces / UNS / Equipment / Tags / ACLs), node-ACL editor producing a permission trie, LDAP role grants, redundancy tab, live cluster-generation state via SignalR, audit log viewer. Users authenticate via cookie-auth over LDAP bind; three admin roles (`ConfigViewer`, `ConfigEditor`, `FleetAdmin`) gate UI operations.
## HLR-016: Audit Logging
Every publish event and every ACL / role-grant change shall produce an immutable audit log row in the Config DB via `AuditLogService` with the acting principal, timestamp, action, before/after generation numbers, and affected entity ids. Audit rows are never mutated or deleted.
## HLR-017: Prometheus Metrics
The Admin service shall expose a `/metrics` endpoint using OpenTelemetry → Prometheus. Core / Server shall emit driver health (per `DriverInstanceId`), Polly circuit-breaker states (per `DriverInstanceId` + `HostName` + `DriverCapability`), capability-call duration histograms, subscription counts, session counts, memory-tracking gauges (Phase 6.1), publish durations, and Config-DB apply-status gauges.
## HLR-018: Roslyn Analyzer OTOPCUA0001
All direct call sites to capability-interface methods (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync`, `IAlarmSource.SubscribeAlarmsAsync` / `AcknowledgeAsync`, `IHistoryProvider.*`, `IHostConnectivityProbe.*`) made outside `Core.Resilience.CapabilityInvoker` shall produce Roslyn diagnostic **OTOPCUA0001** at build time. The analyzer is shipped in `ZB.MOM.WW.OtOpcUa.Analyzers` and referenced by every project that could host a capability call, guaranteeing that resilience cannot be accidentally bypassed.
## Retired HLRs
- **HLR-009 (Status Dashboard)** — retired. Superseded by the Admin UI (HLR-015). See `docs/v2/admin-ui.md`.
## Component-Level Requirements
Detailed requirements are broken out into the following documents:
- [OPC UA Server Requirements](OpcUaServerReqs.md)
- [Galaxy Driver — Repository Requirements](GalaxyRepositoryReqs.md) (Galaxy driver only)
- [Galaxy Driver — MXAccess Client Requirements](MxAccessClientReqs.md) (Galaxy driver only)
- [Service Host Requirements](ServiceHostReqs.md) (all three processes)
- [Client Requirements](ClientRequirements.md) (Client CLI + Client UI)
- [Status Dashboard Requirements](StatusDashboardReqs.md) (retired — pointer only)


@@ -1,6 +1,10 @@
# Galaxy Driver — MXAccess Client Requirements
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). Scope narrowed: this document covers the MXAccess surface **inside `OtOpcUa.Galaxy.Host`** (.NET Framework 4.8 x86 Windows service). The in-server `Driver.Galaxy.Proxy` implements the `IReadable` / `IWritable` / `ISubscribable` / `IAlarmSource` / `IHistoryProvider` capability interfaces and routes every wire call through the named pipe to this Host process. The STA thread + reconnect playback + subscription refcount requirements from v1 are preserved; what changed is where they live (Host service, not the Server process). MXA-010 (proxy-side wrapping) and MXA-011 (pipe ACL / shared secret) are new.
Parent: [HLR-002](HighLevelReqs.md#hlr-002-multi-driver-plug-in-model), [HLR-005](HighLevelReqs.md#hlr-005-live-data-access), [HLR-007](HighLevelReqs.md#hlr-007-service-hosting)
Driver scope: Galaxy only. Process scope: `OtOpcUa.Galaxy.Host` (Host side) and `Driver.Galaxy.Proxy` (server-side forwarder).
## MXA-001: STA Thread with Message Pump
@@ -8,165 +12,194 @@ All MXAccess COM objects shall be created and called on a dedicated STA thread r
### Acceptance Criteria
- A dedicated thread is created with `ApartmentState.STA` before any MXAccess COM object is instantiated; implementation lives in `StaPump` inside `OtOpcUa.Galaxy.Host`.
- The thread runs a Win32 message pump using `GetMessage` / `TranslateMessage` / `DispatchMessage`.
- Work items are marshalled to the STA thread via `PostThreadMessage(WM_APP)` and a concurrent queue.
- The STA thread processes work items between message pump iterations.
- All COM object creation (`LMXProxyServer`), method calls, and event callbacks happen on this thread.
- Thread name `Galaxy.Sta` (for diagnostics).
### Details
- If the STA thread dies unexpectedly, log Fatal and trigger Host service shutdown. The supervisor restarts the Host under its driver-stability policy (`docs/v2/driver-stability.md`). COM objects on the dead thread are unrecoverable; no in-process recovery is attempted.
- `RunAsync(Action)` returns a `Task` that completes when the action executes on the STA thread. Callers can `await` it.
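The `RunAsync` contract can be sketched as below. The field names, the `WM_APP` wake-up convention, and the surrounding class shape are assumptions reconstructed from the requirement text; the message-pump loop itself is omitted:

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Illustrative fragment of StaPump (MXA-001): queue the work item, wake the
// pump with WM_APP, complete the returned Task once the STA thread ran it.
public sealed partial class StaPump
{
    private const uint WM_APP = 0x8000;
    private readonly ConcurrentQueue<Action> _workQueue = new ConcurrentQueue<Action>();
    private uint _staThreadId; // captured when the STA thread starts

    [DllImport("user32.dll")]
    private static extern bool PostThreadMessage(uint threadId, uint msg, IntPtr wParam, IntPtr lParam);

    public Task RunAsync(Action action)
    {
        var tcs = new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
        _workQueue.Enqueue(() =>
        {
            try { action(); tcs.SetResult(null); }
            catch (Exception ex) { tcs.SetException(ex); }
        });
        PostThreadMessage(_staThreadId, WM_APP, IntPtr.Zero, IntPtr.Zero); // wake the pump
        return tcs.Task;
    }
}
```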
---
## MXA-002: Connection Lifecycle
The Host shall support Register/Unregister lifecycle with the `LMXProxyServer` COM object, tracking the connection handle.
### Acceptance Criteria
- `Register(clientName)` is called on the STA thread and returns a positive connection handle on success.
- Handle ≤ 0 → descriptive error thrown; Host reports `DriverHealth.Unavailable` via the pipe so the Proxy reports Bad quality to the core.
- `Unregister(handle)` is called during disconnect after all subscriptions are removed.
- Client name comes from `OTOPCUA_GALAXY_CLIENT_NAME` environment variable; default `OtOpcUa-Galaxy.Host`. Must be unique per MXAccess registration (a cluster's Primary and Secondary each get their own client-name suffix via node override).
- Connection state transitions: Disconnected → Connecting → Connected → Disconnecting → Disconnected (and Error from any state).
### Details
- `ConnectedSince` (UTC) recorded after successful Register.
- `ReconnectCount` tracked for diagnostics and `/metrics`.
- State changes are emitted over the pipe as `DriverHealth` updates.
---
## MXA-003: Tag Subscription
The Host shall support subscribing to tags via AddItem + AdviseSupervisory, receiving value updates through OnDataChange callbacks.
### Acceptance Criteria
- Subscribe sequence: `AddItem(handle, address)` returns item handle, then `AdviseSupervisory(handle, itemHandle)` starts the subscription.
- `OnDataChange` callback delivers value, quality, timestamp, and MXSTATUS_PROXY array.
- Item address format: `tag_name.AttributeName` for scalars, `tag_name.AttributeName[]` for whole arrays.
- AddItem failure → Warning logged, failure propagated over the pipe to the Proxy.
- Bidirectional maps of `address ↔ itemHandle` maintained for callback resolution.
- Multi-client refcounting: two Proxy-side subscribe calls for the same address produce one MXAccess subscription; refcount decrement on the last unsubscribe triggers `UnAdvise` / `RemoveItem`.
### Details
- `AdviseSupervisory` (not `Advise`) is used because this is a background service without an interactive user session; AdviseSupervisory permits secured/verified writes without user authentication.
- Stored subscriptions dictionary maps address → callback for reconnect replay.
- On reconnect, every entry in stored subscriptions is re-subscribed (AddItem + AdviseSupervisory with new handles).
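The multi-client refcounting contract is language-agnostic: first subscribe per address performs the real work, last unsubscribe releases it. A Python sketch with hypothetical names; `advise`/`unadvise` stand in for the AddItem + AdviseSupervisory and UnAdvise + RemoveItem COM sequences:

```python
class RefCountedSubscriptions:
    """One driver-side subscription per address, however many consumers."""

    def __init__(self, advise, unadvise):
        self._advise = advise      # address -> item handle (does the real subscribe)
        self._unadvise = unadvise  # item handle -> None (does the real teardown)
        self._refs = {}            # address -> (handle, refcount)

    def subscribe(self, address):
        if address in self._refs:
            handle, count = self._refs[address]
            self._refs[address] = (handle, count + 1)
            return handle
        handle = self._advise(address)   # first subscriber: real AddItem/Advise
        self._refs[address] = (handle, 1)
        return handle

    def unsubscribe(self, address):
        handle, count = self._refs[address]
        if count == 1:
            del self._refs[address]
            self._unadvise(handle)       # last subscriber: real UnAdvise/RemoveItem
        else:
            self._refs[address] = (handle, count - 1)
```

Two Proxy-side subscribes to the same address therefore touch MXAccess exactly once, and the decrement on the second unsubscribe is what triggers the teardown.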
---
## MXA-004: Tag Read/Write
The Host shall support synchronous-style read and write operations, marshalled to the STA thread, with configurable timeouts.
### Acceptance Criteria
- Read pattern: prefer cached subscription value; fall back to subscribe-get-first-value-unsubscribe (AddItem → AdviseSupervisory → wait for OnDataChange → UnAdvise → RemoveItem).
- Write: AddItem → AdviseSupervisory → `Write()` → await `OnWriteComplete` callback → cleanup.
- Read timeout: `Galaxy:ReadTimeoutSeconds` in driver config (default 5 seconds) — enforced on the Host side in addition to the Proxy-side Polly `Timeout` leg.
- Write timeout: `Galaxy:WriteTimeoutSeconds` (default 5 seconds) — enforced similarly.
- Concurrent operation limit: configurable semaphore (`Galaxy:MaxConcurrentOperations`, default 10).
- All operations marshalled to the STA thread.
### Details
- Write uses security classification `-1` (no security). Galaxy runtime enforces security; OtOpcUa authorization is enforced server-side before the call ever reaches the pipe (per OPC-014 `AuthorizationGate`).
- `OnWriteComplete`: check `MXSTATUS_PROXY.success`. If 0, extract detail code and propagate as an error over the pipe.
- COM exceptions translated to meaningful error messages.
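The await-`OnWriteComplete` pattern is a one-shot completion slot per item handle with a timeout; the abandoned-on-timeout behavior is what MXA-007's "pending write TCS entries abandoned" refers to. A hedged Python sketch (the C# Host would use a `TaskCompletionSource`; these names are hypothetical):

```python
import threading


class PendingWrites:
    """One event + result slot per item handle; OnWriteComplete resolves it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}  # item_handle -> (event, result slot)

    def begin(self, item_handle):
        ev, slot = threading.Event(), {}
        with self._lock:
            self._pending[item_handle] = (ev, slot)
        return ev, slot

    def on_write_complete(self, item_handle, success, detail):
        with self._lock:
            entry = self._pending.pop(item_handle, None)
        if entry:  # late callbacks after a timeout are silently ignored
            ev, slot = entry
            slot["success"], slot["detail"] = success, detail
            ev.set()

    def await_result(self, item_handle, ev, slot, timeout_s=5.0):
        if not ev.wait(timeout_s):
            with self._lock:
                self._pending.pop(item_handle, None)  # abandon on timeout
            raise TimeoutError(f"write on item {item_handle} timed out")
        return slot
```

The default `timeout_s=5.0` mirrors `Galaxy:WriteTimeoutSeconds`.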
---
## MXA-005: Auto-Reconnect
The Host shall monitor connection health and automatically reconnect on failure, replaying all stored subscriptions after reconnect.
### Acceptance Criteria
- Monitor loop runs on a background thread at `Galaxy:MonitorIntervalSeconds` (default 5 seconds).
- On disconnect, attempt reconnect. On success, replay all stored subscriptions.
- On reconnect failure, log Warning and retry at next interval (no exponential backoff inside the Host; the Proxy-side Polly pipeline handles cross-process backoff against pipe failures).
- Reconnect count is incremented on each successful reconnect.
- Monitor loop is cancellable for clean Host shutdown.
### Details
- Reconnect cleans up old COM objects before creating new ones.
- After reconnect, probe subscription (MXA-006) is re-established first, then stored subscriptions.
- No max retry limit — keep trying indefinitely until the Host service is stopped.
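The monitor loop's shape (fixed interval, no backoff, no retry cap, cancellable) can be sketched as follows; `is_connected` and `reconnect` are stand-ins for the real Host calls, and `stop.wait` doubles as both the interval timer and the cancellation point:

```python
import threading


def run_monitor(is_connected, reconnect, stop: threading.Event, interval_s=5.0):
    """Fixed-interval reconnect loop per MXA-005. Returns the attempt count
    (for the ReconnectCount diagnostic). No exponential backoff, no max
    retries; exits only when `stop` is set at shutdown."""
    attempts = 0
    while not stop.wait(interval_s):  # returns True (and exits) once stop is set
        if not is_connected():
            attempts += 1
            try:
                reconnect()  # on success, replays stored subscriptions
            except Exception:
                pass  # log Warning; retry at the next interval
    return attempts
```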
---
## MXA-006: Probe-Based Health Monitoring
The Host shall optionally subscribe to a configurable probe tag and use OnDataChange callback staleness to detect silent connection failures.
### Acceptance Criteria
- Subscribe to a configurable probe tag (a known-good Galaxy attribute that changes periodically).
- Probe tag address configured via `Galaxy:ProbeTag`. If unset, probe monitoring is disabled.
- Track `_lastProbeValueTime` (UTC) updated on each OnDataChange for the probe tag.
- If `DateTime.UtcNow - _lastProbeValueTime > staleThreshold`, force disconnect and reconnect.
- Stale threshold: `Galaxy:ProbeStaleThresholdSeconds` (default 60 seconds).
- Implements `IHostConnectivityProbe` on the Proxy side so the core's `CapabilityInvoker` records probe outcomes with `DriverCapability.Probe` telemetry.
### Details
- The probe tag should be an attribute the Galaxy runtime updates regularly (e.g., a platform heartbeat or area-level timestamp). The specific tag is site-dependent.
- After forced reconnect, reset `_lastProbeValueTime` to `DateTime.UtcNow` so the new connection gets a full threshold window.
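The staleness check itself is a single timestamp comparison; a minimal Python sketch (class and member names illustrative, not the Host's actual members):

```python
from datetime import datetime, timedelta, timezone


class ProbeMonitor:
    """Detects silent connection failure via probe-callback staleness."""

    def __init__(self, stale_threshold_s=60):
        self._threshold = timedelta(seconds=stale_threshold_s)
        self._last = datetime.now(timezone.utc)

    def on_probe_data_change(self):
        # Called from the probe tag's OnDataChange callback.
        self._last = datetime.now(timezone.utc)

    def on_reconnected(self):
        # Reset so the new connection gets a full threshold window.
        self._last = datetime.now(timezone.utc)

    def is_stale(self, now=None):
        """True when the monitor loop should force disconnect + reconnect."""
        now = now or datetime.now(timezone.utc)
        return now - self._last > self._threshold
```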
---
## MXA-007: COM Cleanup
On disconnect or disposal, the Host shall unwire event handlers, unadvise/remove all items, unregister, and release COM objects via `Marshal.ReleaseComObject`.
### Acceptance Criteria
- Cleanup order: UnAdvise all active subscriptions → RemoveItem all items → unwire OnDataChange and OnWriteComplete handlers → Unregister → `Marshal.ReleaseComObject`.
- On dispose: run disconnect if still connected, then dispose STA thread.
- Each cleanup step wrapped in try/catch (cleanup must not throw).
- After cleanup: handle maps cleared, pending write TCS entries abandoned, COM reference set to null.
### Details
- Stored subscriptions are NOT cleared on disconnect (preserved for reconnect replay). Only cleared on Dispose.
- Event handlers unwired BEFORE Unregister (else callbacks may fire on a dead object).
- `Marshal.ReleaseComObject` in a `finally` block, always.
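The two cleanup invariants (each step isolated so cleanup never throws; the COM release always runs) compose like this; a Python sketch with the five callables standing in for the real COM calls:

```python
def disconnect(unadvise_all, remove_all, unwire_handlers, unregister, release_com):
    """MXA-007 cleanup order. Each step is isolated (a failure is recorded,
    never rethrown) and release_com always runs, mirroring the
    Marshal.ReleaseComObject-in-finally rule."""
    errors = []
    try:
        steps = [
            ("UnAdvise", unadvise_all),
            ("RemoveItem", remove_all),
            ("UnwireHandlers", unwire_handlers),  # must precede Unregister
            ("Unregister", unregister),
        ]
        for name, step in steps:
            try:
                step()
            except Exception as exc:
                errors.append((name, exc))  # log Warning; keep going
    finally:
        try:
            release_com()  # always runs, even if every step above failed
        except Exception as exc:
            errors.append(("ReleaseComObject", exc))
    return errors
```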
---
## MXA-008: Operation Metrics
The MXAccess Host shall record timing and success/failure for Read, Write, and Subscribe operations.
### Acceptance Criteria
- Each operation records duration (ms) + success/failure.
- Metrics exposed over the pipe to the Proxy, which re-publishes them via OpenTelemetry → Prometheus under `DriverInstanceId = "galaxy-*"`, `HostName = "galaxy.host"`.
- Rolling 1000-entry buffer for percentile calculation.
- Metrics surface as count, success rate, and avg/min/max/P95 latency.
### Details
- Uses an `ITimingScope` pattern: `using (var scope = metrics.BeginOperation("read")) { ... }`.
- Metrics are periodically logged at Debug level for diagnostics.
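The rolling-buffer statistics can be sketched in a few lines; a Python illustration (not the Host's C# implementation) using a bounded deque so only the newest 1000 latencies feed the percentile:

```python
import math
from collections import deque


class OperationMetrics:
    """Rolling latency buffer with success tracking, per MXA-008."""

    def __init__(self, capacity=1000):
        self._latencies = deque(maxlen=capacity)  # oldest entries fall off
        self._count = 0
        self._successes = 0

    def record(self, duration_ms, success):
        self._latencies.append(duration_ms)
        self._count += 1
        self._successes += bool(success)

    def snapshot(self):
        data = sorted(self._latencies)
        # nearest-rank P95 over the buffered window
        p95 = data[max(0, math.ceil(0.95 * len(data)) - 1)] if data else None
        return {
            "count": self._count,
            "success_rate": self._successes / self._count if self._count else None,
            "avg": sum(data) / len(data) if data else None,
            "min": data[0] if data else None,
            "max": data[-1] if data else None,
            "p95": p95,
        }
```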
---
## MXA-009: Error Code Translation
The Host shall translate known MXAccess error codes from `MXSTATUS_PROXY.detail` into human-readable messages for logging and OPC UA status propagation.
### Acceptance Criteria
- Error 1008 → "User lacks security permission"
- Error 1012 → "Secured write required (one signature)"
- Error 1013 → "Verified write required (two signatures)"
- Unknown error codes logged with their numeric value.
- Translated messages flow back through the pipe and surface in OPC UA `StatusCode` descriptions and Server logs.
- Errors 1008 / 1012 / 1013 on write operations map to `Bad_UserAccessDenied` at the OPC UA surface.
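The translation table and the access-denied mapping are small enough to state directly. A Python sketch of the criteria above; the `Bad_InternalError` fallback for other write failures is an assumption, not fixed by this requirement:

```python
# Known MXAccess detail codes per MXA-009; unknown codes surface numerically.
KNOWN_ERRORS = {
    1008: "User lacks security permission",
    1012: "Secured write required (one signature)",
    1013: "Verified write required (two signatures)",
}
USER_ACCESS_DENIED = {1008, 1012, 1013}  # map to Bad_UserAccessDenied on writes


def translate(detail_code: int) -> str:
    return KNOWN_ERRORS.get(detail_code, f"Unknown MXAccess error {detail_code}")


def opcua_status_for_write(detail_code: int) -> str:
    # Fallback status for non-security failures is an assumption here.
    if detail_code in USER_ACCESS_DENIED:
        return "Bad_UserAccessDenied"
    return "Bad_InternalError"
```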
---
## MXA-010: Proxy-Side Capability Wrapping
`Driver.Galaxy.Proxy` shall implement the capability interfaces as thin forwarders that serialize every call over the named pipe, with every call routed through `CapabilityInvoker`.
### Acceptance Criteria
- `Driver.Galaxy.Proxy` implements `IDriver` + `IReadable` + `IWritable` + `ISubscribable` + `ITagDiscovery` + `IRediscoverable` + `IAlarmSource` + `IHistoryProvider` + `IHostConnectivityProbe`.
- Each implementation uses `CapabilityInvoker.InvokeAsync(DriverCapability.<...>, …)` — direct pipe calls bypassing the invoker are caught by Roslyn **OTOPCUA0001**.
- Each method serializes a MessagePack request frame, sends over the pipe, awaits the response frame, deserializes, returns.
- Pipe disconnect mid-call → `CapabilityInvoker`'s circuit breaker counts the failure; sustained disconnect opens the circuit and Galaxy nodes surface Bad quality until the pipe reconnects.
- Proxy tolerates Host service restarts — it automatically reconnects and replays subscription setup (parallel to MXA-005 but across the IPC boundary).
---
## MXA-011: Pipe Security
The named pipe between Proxy and Host shall be restricted to the Server's runtime principal via SID-based ACL and authenticated with a per-process shared secret.
### Acceptance Criteria
- Pipe name from `OTOPCUA_GALAXY_PIPE` environment variable; default `OtOpcUaGalaxy`.
- Allowed SID passed as `OTOPCUA_ALLOWED_SID` — only the declared principal (typically the Server service account) can open the pipe; `Administrators` is explicitly NOT granted (per the `project_galaxy_host_installed` memory note).
- Shared secret passed via `OTOPCUA_GALAXY_SECRET` at spawn time; the Proxy must present the matching secret on the opening handshake.
- Secret is process-scoped (regenerated per Host restart) and never persisted to disk or Config DB.
- Pipe ACL denials are logged as Warning with the rejected principal SID.
### Details
- Environment variables are passed by the supervisor launching the Host (`docs/v2/driver-stability.md`).
- Dev-box secret is stored at `.local/galaxy-host-secret.txt` for NSSM-wrapped development runs (memory note: `project_galaxy_host_installed`).
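The handshake check should compare secrets in constant time so a malicious local process cannot learn the secret byte-by-byte from timing. A minimal Python sketch (helper names hypothetical; in the real system the secret travels via `OTOPCUA_GALAXY_SECRET` and the comparison happens in the C# Host):

```python
import hmac
import os


def new_process_secret() -> bytes:
    """Per-Host-restart secret; regenerated at spawn, never persisted."""
    return os.urandom(32)


def handshake_ok(expected: bytes, presented: bytes) -> bool:
    # hmac.compare_digest is constant-time, so prefix matches leak no timing.
    return hmac.compare_digest(expected, presented)
```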

# OPC UA Server — Component Requirements
Parent: [HLR-001](HighLevelReqs.md#hlr-001-opc-ua-server), [HLR-003](HighLevelReqs.md#hlr-003-address-space-composition-per-namespace), [HLR-009](HighLevelReqs.md#hlr-009-transport-security-and-authentication), [HLR-010](HighLevelReqs.md#hlr-010-per-driver-instance-resilience), [HLR-013](HighLevelReqs.md#hlr-013-cluster-redundancy)
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). OPC-001…OPC-013 have been rewritten driver-agnostically — they now describe how the core OPC UA server composes multiple driver subtrees, enforces authorization, and invokes capabilities through the Polly-wrapped dispatch path. OPC-014 through OPC-022 are new and cover capability dispatch, per-host Polly isolation, idempotence-aware write retry, `AuthorizationGate`, `ServiceLevel` reporting, the alarm surface, history surface, server-certificate management, and the transport-security profile matrix. Galaxy-specific behavior has been moved out to `GalaxyRepositoryReqs.md` and `MxAccessClientReqs.md`.
## OPC-001: Server Endpoint
The OPC UA server shall listen on a configurable TCP endpoint using the OPC Foundation .NET Standard stack and expose a single endpoint URL per cluster node.
### Acceptance Criteria
- Endpoint URL comes from `ClusterNode.EndpointUrl` in the Config DB (default form `opc.tcp://<hostname>:<port>/OtOpcUa`).
- `ApplicationName` and `ApplicationUri` come from `ClusterNode` fields; `ApplicationUri` is unique per node so redundancy `ServerUriArray` entries are distinguishable.
- Port defaults to 4840. If the port is in use at startup the server shall log Error and fail to start (no silent port reassignment).
- Uses `OPCFoundation.NetStandard.Opc.Ua.Server` NuGet.
- Endpoint URL logged at Information level on startup.
### Details
- Node-local `appsettings.json` only carries the `Config DB connection + NodeId + ClusterId` bootstrap — actual endpoint topology comes from the Config DB per HLR-011.
---
## OPC-002: Address Space Composition
The server shall compose an address space by mounting each active driver instance's subtree under a dedicated OPC UA namespace.
### Acceptance Criteria
- Each `DriverInstance` in the current published generation registers one `IDriver` implementation in the core.
- Each driver instance gets its own namespace index; `NamespaceUri` comes from the `Namespace` row in the Config DB.
- Each cluster has at most one namespace per `Kind` (`Equipment`, `SystemPlatform`, future `Simulated`); enforced by UNIQUE on `(ClusterId, Kind)` in the DB.
### Details
- Each driver's `ITagDiscovery.DiscoverAsync` result is streamed into the core via `IAddressSpaceBuilder` `AddFolder` / `AddVariable` calls; the driver does not buffer the whole tree.
- Galaxy driver subtree preserves the contained-name browse structure from the deployed Galaxy (moved to `GalaxyRepositoryReqs.md`).
- Equipment-kind drivers populate the canonical 5-level UNS structure (`Enterprise/Site/Area/Line/Equipment/Signal`).
---
## OPC-003: Variable Nodes and Access Levels
Each tag produced by a driver's `ITagDiscovery` shall become an OPC UA variable node.
### Acceptance Criteria
- Variable node `BrowseName` and `DisplayName` come from `DriverAttributeInfo`.
- `DataType` is resolved from `DriverDataType` per each driver's spec in `docs/v2/driver-specs.md`.
- Scalar attributes produce `ValueRank = Scalar`; array attributes produce `ValueRank = OneDimension` with `ArrayDimensions` set from the driver's attribute info.
### Details
- `AccessLevel` and `UserAccessLevel` are derived from the tag's `SecurityClassification` and the session's effective permissions walked through the node-ACL permission trie (see OPC-017 `AuthorizationGate`).
---
## OPC-004: Namespace Index Allocation
The server shall register one OPC UA namespace per active driver instance.
### Acceptance Criteria
- Namespace index 0 remains the standard OPC UA namespace.
- Each driver instance's `Namespace.Uri` becomes a registered namespace; its index is assigned deterministically at startup from the published generation's driver ordering.
- All variable NodeIds use the driver's namespace index; NodeId identifiers are string-shaped and stable across restarts of the same generation.
### Details
- Namespace index reshuffles are a publish-time concern; clients that persist server-relative NodeIds must re-resolve namespace URIs after a new generation is applied.
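The deterministic assignment amounts to enumerating the generation's namespace URIs in driver order, starting after the reserved standard namespace. An illustrative Python sketch (the server's actual registration API in the OPC Foundation stack differs; URI values here are hypothetical):

```python
def assign_namespace_indexes(namespace_uris):
    """Index 0 stays the standard OPC UA namespace; application namespaces
    get 1..N in the published generation's driver ordering, so the same
    generation always yields the same indexes."""
    table = {"http://opcfoundation.org/UA/": 0}
    for i, uri in enumerate(namespace_uris, start=1):
        table[uri] = i
    return table
```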
---
## OPC-005: Read Operations
The server shall fulfill OPC UA `Read` requests by invoking `IReadable.ReadAsync` on the target driver instance, dispatched through `CapabilityInvoker`.
### Acceptance Criteria
- Every read call at dispatch passes through `Core.Resilience.CapabilityInvoker.InvokeAsync(DriverCapability.Read, …)`.
- Returned `DataValueSnapshot` is converted to an OPC UA `DataValue` with `StatusCode`, source timestamp, and server timestamp.
- If the owning driver instance's Polly circuit is open, the read returns Bad quality immediately without hitting the wire.
- Reads on a node for which the session has no `Read` bit in the permission trie return `Bad_UserAccessDenied` before the capability is invoked (OPC-017).
### Details
- Read timeout is the Polly timeout leg on the `Read` capability; its duration is per-`(DriverInstanceId, HostName)` and comes from the Config DB.
---
## OPC-006: Write Operations
The server shall fulfill OPC UA `Write` requests by invoking `IWritable.WriteAsync` through `CapabilityInvoker` with **idempotence-aware** retry policy.
### Acceptance Criteria
- Writes dispatch through `CapabilityInvoker.InvokeAsync(DriverCapability.Write, …)`.
- Writes **do not auto-retry** unless the tag's `TagConfig.WriteIdempotent = true` or the driver's capability is marked with `[WriteIdempotent]` (decision #143).
- Writes on a node for which the session lacks the required permission bit (`WriteOperate`, `WriteTune`, or `WriteConfigure`, derived from the tag's `SecurityClassification`) return `Bad_UserAccessDenied` before the capability runs.
- A write into an open circuit returns a driver-shaped error (`Bad_NoCommunication` / `Bad_ServerNotConnected`) without hitting the wire.
- The server shall coerce the written OPC UA value to the driver's expected native type using the node's `DriverDataType` before calling `WriteAsync`.
- Writes to a NodeId not currently in the address space return `Bad_NodeIdUnknown`.
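The idempotence gate boils down to: retries are permitted only when the tag (or capability) is declared write-idempotent; everything else gets exactly one attempt. A Python sketch of that policy shape; the retry count and helper names are illustrative, and the real path runs inside Polly:

```python
def attempt_write(write_once, is_idempotent: bool, max_retries: int = 2):
    """Idempotence-aware write retry (decision #143). `write_once` raises on
    transient failure. Non-idempotent writes never retry, so a duplicate
    side effect (e.g., a pulse command) cannot be replayed."""
    attempts = 1 + (max_retries if is_idempotent else 0)
    last_exc = None
    for _ in range(attempts):
        try:
            return write_once()
        except Exception as exc:
            last_exc = exc
    raise last_exc
```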
---
## OPC-007: Subscriptions and Monitored Items
The server shall map OPC UA `CreateMonitoredItems` / `DeleteMonitoredItems` to `ISubscribable.SubscribeAsync` / `UnsubscribeAsync` on the owning driver instance.
### Acceptance Criteria
- Subscription setup dispatches through `CapabilityInvoker.InvokeAsync(DriverCapability.Subscribe, …)`.
- Two OPC UA monitored items against the same tag produce exactly one driver-side subscription (ref-counted); the last unsubscribe releases the driver-side resource.
- `OnDataChange` callbacks from the driver arrive as `DataValueSnapshot` and are forwarded to all OPC UA monitored items on that tag.
- Driver-side quality maps to OPC UA `StatusCode` per the driver's spec.
- When the owning driver's circuit opens, subscribed items publish Bad quality; when it resets, resumption publishes the cached or freshly-sampled value.
- Across generation applies that preserve a tag's NodeId, existing OPC UA monitored items are preserved (no client-side re-subscribe required).
---
## OPC-008: Alarm Surface
The server shall expose the OPC UA alarm and condition model backed by each driver's `IAlarmSource` (where implemented).
### Acceptance Criteria
- Drivers implementing `IAlarmSource` (today: Galaxy, FOCAS, OPC UA Client) produce alarm events that the core maps onto OPC UA `ConditionType` / `AlarmConditionType` instances in the driver's namespace.
- `AlarmSubscribe` dispatches through `CapabilityInvoker.InvokeAsync(DriverCapability.AlarmSubscribe, …)` and retries on transient failure.
- `AlarmAcknowledge` from the OPC UA client dispatches through `CapabilityInvoker.InvokeAsync(DriverCapability.AlarmAcknowledge, …)` and **does not retry** (decision #143 — ack is a write-shaped operation).
- Alarm-ack requires the `AlarmAck` permission bit for the tag / equipment node; otherwise `Bad_UserAccessDenied`.
- Drivers that do not implement `IAlarmSource` contribute no alarm nodes; the core does not synthesize placeholder conditions.
---
## OPC-009: Historical Access
The server shall surface OPC UA Historical Access (HA) via each driver's `IHistoryProvider` (where implemented).
### Acceptance Criteria
- `HistoryRead` for `Raw`, `Processed`, `AtTime`, and `Events` dispatches through `CapabilityInvoker.InvokeAsync(DriverCapability.HistoryRead, …)`.
- Drivers implementing `IHistoryProvider` today: Galaxy (Wonderware Historian), OPC UA Client (proxy to remote historian).
- Drivers not implementing `IHistoryProvider` return `Bad_HistoryOperationUnsupported` for history requests on their nodes.
- History reads require the `Read` permission bit on the target node.
---
## OPC-010: Transport Security Profiles
The server shall offer OPC UA transport-security profiles resolved at startup by `SecurityProfileResolver`.
### Acceptance Criteria
- When Galaxy Repository detects a deployment change, the OPC UA address space is updated.
- Only changed gobject subtrees are torn down and rebuilt; unchanged nodes, subscriptions, and alarm tracking remain intact.
- Existing OPC UA client sessions are preserved — clients stay connected.
- Subscriptions for tags on unchanged objects continue to work without interruption.
- Subscriptions for tags that no longer exist receive a Bad_NodeIdUnknown status notification.
- Sync is logged at Information level with the number of changed gobjects.
### Details
- Uses incremental subtree sync: compares previous hierarchy+attributes with new, identifies changed gobject IDs, expands to include child subtrees, tears down only affected subtrees, and rebuilds them.
- First build (no cached state) performs a full build.
- If no changes are detected, the sync is a no-op (logged and skipped).
- Alarm tracking and MXAccess subscriptions for unchanged objects are not disrupted.
- Falls back to full rebuild behavior if the entire hierarchy changes.
- Supported profiles: `None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`, `Aes128_Sha256_RsaOaep-Sign`, `Aes128_Sha256_RsaOaep-SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign`, `Aes256_Sha256_RsaPss-SignAndEncrypt`.
- Active profile list comes from `OpcUa.SecurityProfile` in `appsettings.json` (bootstrap config) or Config DB (per-cluster override).
- Server certificate is created at first startup even when only `None` is enabled, because UserName-token encryption depends on an ApplicationInstanceCertificate.
- Certificate store root path is configurable (default `%ProgramData%/OtOpcUa/pki/`).
- `AutoAcceptUntrustedClientCertificates` is a config flag; production deployments set it to `false` and operators add trusted client certs via the Admin UI Cert Trust screen.
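As a concrete illustration, a bootstrap fragment enabling two of the profiles above might look like the following — the array shape of `OpcUa:SecurityProfile` and the `CertificateStorePath` key name are assumptions for illustration; only the profile names, the store-path default, and the `AutoAcceptUntrustedClientCertificates` flag come from this requirement:

```json
{
  "OpcUa": {
    "SecurityProfile": [ "None", "Basic256Sha256-SignAndEncrypt" ],
    "CertificateStorePath": "%ProgramData%/OtOpcUa/pki/",
    "AutoAcceptUntrustedClientCertificates": false
  }
}
```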
---
## OPC-011: UserName Authentication
The server shall validate `UserNameIdentityToken` credentials against LDAP (production: Active Directory; dev: GLAuth).
### Acceptance Criteria
- If `Ldap.Enabled = false`, all UserName tokens are rejected (`BadUserAccessDenied`).
- When enabled, the server performs an LDAP bind using the supplied credentials via `LdapUserAuthenticator`.
- On successful bind, group memberships resolved from LDAP are mapped through `LdapOptions.GroupToRole` to produce the session's permission bits (`ReadOnly`, `WriteOperate`, `WriteTune`, `WriteConfigure`, `AlarmAck`).
- `LdapAuthenticationProvider` implements both `IUserAuthenticationProvider` and `IRoleProvider`.
- UserName tokens are always carried on an encrypted secure channel (either Sign-and-Encrypt transport, or encrypted token using the server certificate even on a `None` channel).
---
## OPC-012: Capability Dispatch via CapabilityInvoker
Every async capability-interface call the server makes shall route through `Core.Resilience.CapabilityInvoker`.
### Acceptance Criteria
- `CapabilityInvoker.InvokeAsync` resolves a Polly resilience pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`.
- Read / Discover / Probe / Subscribe / AlarmSubscribe / HistoryRead pipelines carry Timeout + Retry + CircuitBreaker strategies.
- Write / AlarmAcknowledge pipelines carry Timeout + CircuitBreaker only; Retry is enabled only when the tag or capability carries `[WriteIdempotent]` (decision #143).
- Roslyn diagnostic **OTOPCUA0001** fires on any direct call to a capability-interface method from outside `CapabilityInvoker` (enforced via `ZB.MOM.WW.OtOpcUa.Analyzers`).
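The keying contract can be sketched as follows. This is a hypothetical illustration, not the real `CapabilityInvoker`: the type names `PipelineKey` and `CapabilityInvokerSketch` are invented, and Polly is stood in for by a plain delegate where the real invoker resolves a Polly resilience pipeline per key.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative capability set; the real enum lives in the driver contracts.
public enum DriverCapability { Read, Write, Discover, Probe, Subscribe, AlarmSubscribe, AlarmAcknowledge, HistoryRead }

// One pipeline per (driver instance, host, capability) triple.
public sealed record PipelineKey(Guid DriverInstanceId, string HostName, DriverCapability Capability);

public sealed class CapabilityInvokerSketch
{
    // One cached pipeline per key: a fault on host A never trips host B's breaker.
    private readonly ConcurrentDictionary<PipelineKey, Func<Func<Task<object>>, Task<object>>> _pipelines = new();

    public Task<object> InvokeAsync(PipelineKey key, Func<Task<object>> capabilityCall)
    {
        // Real pipelines carry Timeout + Retry + CircuitBreaker for reads, and
        // Timeout + CircuitBreaker (no Retry unless [WriteIdempotent]) for writes;
        // a pass-through delegate stands in for the Polly pipeline here.
        var pipeline = _pipelines.GetOrAdd(key, _ => call => call());
        return pipeline(capabilityCall);
    }
}
```

A driver serving `N` hosts thus materializes up to `N × capabilityCount` cached entries, which is exactly the isolation granularity OPC-013 requires.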
---
## OPC-013: Per-Host Polly Isolation
Polly pipelines shall be keyed per `(DriverInstanceId, HostName, DriverCapability)` so that a failing device in one driver does not trip the circuit for another device on the same driver or any other driver (decision #144).
### Acceptance Criteria
- A driver serving `N` devices has `N × capabilityCount` distinct pipelines.
- Circuit-breaker state transitions are telemetry-published per pipeline and appear on the Admin UI + `/metrics`.
- A host-scope fault (e.g. shared PLC gateway) naturally trips all devices behind that host but leaves other hosts untouched.
---
## OPC-014: Authorization Gate and Permission Trie
`Security.AuthorizationGate` shall enforce node-level permissions on every browse, read, write, subscribe, alarm-ack, and history call before dispatch.
### Acceptance Criteria
- Permission bits for the session are assembled at login from LDAP group → role → permission mapping plus Config-DB `NodeAcl` rows that modify permission inheritance along the browse tree.
- The permission trie walks from the addressed node toward the root, inheriting permissions unless a `NodeAcl` overrides; first match wins.
- Missing `Read` bit → `Bad_UserAccessDenied` on Read / Subscribe / HistoryRead.
- Missing `Write*` bit (matching the tag's `SecurityClassification`) → `Bad_UserAccessDenied` on Write.
- Missing `AlarmAck` bit → `Bad_UserAccessDenied` on acknowledge.
- Authorization decisions are made at the server layer only — drivers never enforce authorization and only expose `SecurityClassification` metadata.
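The root-ward walk can be sketched in a few lines. Everything here except the inheritance semantics is an assumption — the flag names mirror the permission bits listed above, but the parent map, the `NodeAcl` lookup shape, and the `AclWalkSketch` type are invented for illustration:

```csharp
using System;
using System.Collections.Generic;

[Flags]
public enum Permission { None = 0, Read = 1, WriteOperate = 2, WriteTune = 4, WriteConfigure = 8, AlarmAck = 16 }

public static class AclWalkSketch
{
    // Walks from the addressed node toward the root; the first (nearest)
    // NodeAcl override wins, otherwise the session's role-derived bits apply.
    public static Permission Resolve(
        string nodeId,
        IReadOnlyDictionary<string, string> parentOf,     // child -> parent edges of the browse tree
        IReadOnlyDictionary<string, Permission> nodeAcls, // explicit NodeAcl rows from the Config DB
        Permission sessionDefault)                        // assembled at login from LDAP group -> role
    {
        var current = nodeId;
        while (current != null)
        {
            if (nodeAcls.TryGetValue(current, out var acl))
                return acl;                               // first match wins
            parentOf.TryGetValue(current, out current);   // becomes null once past the root
        }
        return sessionDefault;                            // no override anywhere on the path
    }
}
```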
---
## OPC-015: ServiceLevel Reporting
The server shall expose a dynamic `ServiceLevel` value computed by `RedundancyCoordinator` + `ServiceLevelCalculator`.
### Acceptance Criteria
- `ServiceLevel` reflects: redundancy role (Primary higher than Secondary), publish state (current generation applied > mid-apply > failed-apply), driver health (any driver instance in open circuit lowers the value), apply-lease state.
- `ServiceLevel` is exposed as a Variable under the standard `Server` object and is readable by any authenticated client.
- Clients that observe the Primary's `ServiceLevel` drop below the Secondary's should fail over per the OPC UA spec.
- Single-node deployments (`NodeCount = 1`) always publish their node as Primary.
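A minimal sketch of how the inputs above could combine into a `ServiceLevel` byte. The requirement names the factors but not the arithmetic, so the weights and the `ServiceLevelSketch` type below are invented; only the ordering constraints (Primary above Secondary, open circuits and mid-apply lowering the value) come from this section:

```csharp
using System;

public enum RedundancyRole { Secondary, Primary }
public enum PublishState { FailedApply, MidApply, CurrentApplied }

public static class ServiceLevelSketch
{
    public static byte Calculate(RedundancyRole role, PublishState publish, bool anyCircuitOpen, bool holdsApplyLease)
    {
        int level = role == RedundancyRole.Primary ? 200 : 100; // Primary outranks Secondary
        level += publish switch
        {
            PublishState.CurrentApplied => 50,  // current generation applied
            PublishState.MidApply       => 20,  // temporarily lowered during apply (OPC-017)
            _                           => 0,   // failed apply contributes nothing
        };
        if (anyCircuitOpen) level -= 40;        // any driver in open circuit lowers the value
        if (!holdsApplyLease) level -= 5;       // apply-lease state nudges the value
        return (byte)Math.Clamp(level, 0, 255);
    }
}
```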
---
## OPC-016: Session Management
The server shall support multiple concurrent OPC UA client sessions.
### Acceptance Criteria
- Maximum concurrent sessions and session timeout come from Config DB cluster settings (default 100 sessions, 30-minute idle timeout).
- Expired sessions are cleaned up and their subscriptions and monitored items removed.
- Active session count is reported as a Prometheus gauge on the Admin `/metrics` endpoint.
---
## OPC-017: Address Space Rebuild on Generation Apply
When a new Config DB generation is applied, the server shall surgically update only the affected driver subtrees.
### Acceptance Criteria
- Apply compares the previous generation to the incoming generation and produces per-driver add / modify / remove sets.
- Existing OPC UA sessions, subscriptions, and monitored items are preserved across apply whenever the target NodeId survives the generation change.
- Tags that no longer exist post-apply emit `Bad_NodeIdUnknown` on their subscribed monitored items.
- During apply, the node's `ServiceLevel` is lowered (per `ServiceLevelCalculator`) so redundancy partners temporarily take precedence.
- Galaxy subtree rebuilds triggered by `IRediscoverable` (Galaxy deployment change) are scoped to the Galaxy driver's namespace and follow the same preservation rule (OPC-006 from the v1 file, now subsumed).
---
## OPC-018: Server Diagnostics Nodes
The server shall expose standard OPC UA `Server` object nodes required by the spec.
### Acceptance Criteria
- `ServerStatus` / `ServerState` / `CurrentTime` / `StartTime` populated and compliant with the OPC UA 1.05 spec.
- `ServerCapabilities` declares historical access capabilities for namespaces that have an `IHistoryProvider`-backed driver.
- `ServerRedundancy.RedundancySupport` reflects the cluster's redundancy mode (`None` / `Warm` / `Hot`).
- `ServerRedundancy.ServerUriArray` lists both cluster members' `ApplicationUri` values.
---
## OPC-019: Observability Hooks
The server shall emit OpenTelemetry metrics consumed by the Admin `/metrics` Prometheus endpoint.
### Acceptance Criteria
- Counters: capability calls per `DriverInstanceId` + `DriverCapability`, OPC UA requests per method, alarm events emitted, history reads, generation apply attempts.
- Histograms: capability-call duration per `DriverInstanceId` + `DriverCapability`, OPC UA request duration per method.
- Gauges: circuit-breaker state per pipeline, active OPC UA sessions, active monitored items, subscription queue depth, `ServiceLevel` value, memory-tracking watermarks (Phase 6.1).
- Metric cardinality is bounded — `DriverInstanceId` and `HostName` are the only high-cardinality labels, both controlled by the Config DB.
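The counters and histograms above would typically be declared through the BCL `System.Diagnostics.Metrics` API, which the OpenTelemetry SDK consumes. A sketch under assumptions — the meter and instrument names are invented here; the real names live in the server's telemetry wiring, not in this requirement:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class ServerMetricsSketch
{
    private static readonly Meter Meter = new("OtOpcUa.Server");

    // Counter: capability calls per DriverInstanceId + DriverCapability.
    public static readonly Counter<long> CapabilityCalls =
        Meter.CreateCounter<long>("otopcua_capability_calls_total");

    // Histogram: capability-call duration per DriverInstanceId + DriverCapability.
    public static readonly Histogram<double> CapabilityDurationMs =
        Meter.CreateHistogram<double>("otopcua_capability_duration_ms");

    public static void RecordCall(string driverInstanceId, string capability, double elapsedMs)
    {
        // DriverInstanceId is the only high-cardinality label, bounded by the Config DB.
        var tags = new[]
        {
            new KeyValuePair<string, object?>("DriverInstanceId", driverInstanceId),
            new KeyValuePair<string, object?>("DriverCapability", capability),
        };
        CapabilityCalls.Add(1, tags);
        CapabilityDurationMs.Record(elapsedMs, tags);
    }
}
```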

# Service Host — Component Requirements
Parent: [HLR-007](HighLevelReqs.md#hlr-007-service-hosting), [HLR-008](HighLevelReqs.md#hlr-008-logging), [HLR-011](HighLevelReqs.md#hlr-011-config-db-and-draft-publish)
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). v1 was a single Windows service; v2 ships **three cooperating Windows services**, and the service-host requirements are rewritten per process. SVC-001…SVC-006 from v1 are preserved in spirit (TopShelf, Serilog, config loading, graceful shutdown, startup sequence, unhandled-exception handling) but are now scoped to the process they apply to. SRV-* prefixes the Server process, ADM-* the Admin process, GHX-* the Galaxy Host process. A shared-requirements section at the top covers cross-process concerns (Serilog, log rotation, bootstrap-config scope).
## Shared Requirements (all three processes)
### SVC-SHARED-001: Serilog Logging
Every process shall use Serilog with a rolling daily file sink at Information level minimum, plus a console sink and an opt-in CompactJsonFormatter file sink.
#### Acceptance Criteria
- Console sink active on every process (for interactive / debug mode).
- Rolling daily file sink:
  - Server: `logs/otopcua-YYYYMMDD.log`
  - Admin: `logs/otopcua-admin-YYYYMMDD.log`
  - Galaxy Host: `%ProgramData%\OtOpcUa\galaxy-host-YYYYMMDD.log`
- Retention count and minimum level configurable via `Serilog:*` in each process's `appsettings.json`.
- JSON sink opt-in via `Serilog:WriteJson = true` (emits `*.json.log` alongside the plain-text file) for SIEM ingestion.
- `Log.CloseAndFlush()` invoked in a `finally` block on shutdown.
- Structured logging (Serilog message templates) — no `string.Format`.
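A minimal sketch of this pipeline for the Server process, assuming the `Serilog`, `Serilog.Sinks.Console`, `Serilog.Sinks.File`, and `Serilog.Formatting.Compact` packages; in the real processes the minimum level, retention count, and the JSON opt-in are read from the `Serilog:*` configuration section rather than hard-coded:

```csharp
using Serilog;
using Serilog.Formatting.Compact;

Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .WriteTo.Console()
    .WriteTo.File(
        "logs/otopcua-.log",                  // Serilog inserts the yyyyMMdd stamp before the extension
        rollingInterval: RollingInterval.Day,
        retainedFileCountLimit: 31)
    // Opt-in JSON sink (Serilog:WriteJson = true) alongside the plain-text file:
    .WriteTo.File(new CompactJsonFormatter(), "logs/otopcua-.json.log",
        rollingInterval: RollingInterval.Day)
    .CreateLogger();

try
{
    Log.Information("Process starting as {ProcessName}", "OtOpcUa.Server"); // message template, not string.Format
}
finally
{
    Log.CloseAndFlush();   // always flushed in a finally block on shutdown
}
```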
---
### SVC-SHARED-002: Bootstrap Configuration Scope
`appsettings.json` is bootstrap-only per HLR-011. Operational configuration (clusters, drivers, namespaces, tags, ACLs, poll groups) lives in the Config DB.
#### Acceptance Criteria
- `appsettings.json` may contain only: Config DB connection string, `Node:NodeId`, `Node:ClusterId`, `Node:LocalCachePath`, `OpcUa:*` security bootstrap fields, `Ldap:*` bootstrap fields, `Serilog:*`, `Redundancy:*` role id.
- Any attempt to configure driver instances, tags, or equipment through `appsettings.json` shall be rejected at startup with a descriptive error.
- Invalid or missing required bootstrap fields are detected at startup with a clear error (`"Node:NodeId not configured"` style).
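A hypothetical bootstrap file restricted to the allowed sections. The section names come from the criteria above; the connection-string name, the `Redundancy:RoleId` key, and all values are illustrative assumptions:

```json
{
  "ConnectionStrings": {
    "ConfigDb": "Server=sql01;Database=OtOpcUaConfig;Integrated Security=true"
  },
  "Node": {
    "NodeId": "node-a",
    "ClusterId": "cluster-1",
    "LocalCachePath": "C:/ProgramData/OtOpcUa/cache"
  },
  "OpcUa": { "SecurityProfile": [ "Basic256Sha256-SignAndEncrypt" ] },
  "Ldap": { "Enabled": true },
  "Serilog": { "MinimumLevel": "Information", "WriteJson": false },
  "Redundancy": { "RoleId": "Primary" }
}
```

Adding a `Drivers` or `Tags` section here would be rejected at startup per the criterion above.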
---
## OtOpcUa.Server — Service Host Requirements (SRV-*)
### SRV-001: Microsoft.Extensions.Hosting + AddWindowsService
The Server shall use `Host.CreateApplicationBuilder(args)` with `AddWindowsService(o => o.ServiceName = "OtOpcUa")` to run as a Windows service.
#### Acceptance Criteria
- Service name `OtOpcUa`.
- Installs via standard `sc.exe` tooling or the build-provided installer.
- Runs as a configured service account (typically a domain service account with Config DB read access; Windows Auth to SQL Server).
- Console mode (running `ZB.MOM.WW.OtOpcUa.Server.exe` with no Windows service context) works for development and debugging.
- Platform target: .NET 10 x64 (default per decision in `plan.md` §3).
---
### SRV-002: Startup Sequence
The Server shall start components in a defined order, with failure handling at each step.
#### Acceptance Criteria
- Startup sequence:
1. Load `appsettings.json` bootstrap configuration + initialize Serilog.
2. Validate bootstrap fields (NodeId, ClusterId, Config DB connection).
3. Initialize `OpcUaApplicationHost` (server-certificate resolution via `SecurityProfileResolver`).
4. Connect to Config DB; request current published generation for `ClusterId`.
5. If unreachable, fall back to `LiteDbConfigCache` (latest applied generation).
6. Apply generation: register driver instances, build namespaces, wire capability pipelines.
7. Start `OpcUaServerService` hosted service (opens endpoint listener).
8. Start `HostStatusPublisher` (pushes `ClusterNodeGenerationState` to Config DB for Admin UI SignalR consumers).
9. Start `RedundancyCoordinator` + `ServiceLevelCalculator`.
- Failure in steps 1-3 prevents startup.
- Failure in steps 4-6 logs Error and enters degraded mode (empty namespaces, `DriverHealth.Unavailable` on every driver, `ServiceLevel = 0`).
- Failure in steps 7-9 logs Error and shuts down (endpoint is non-optional).
---
### SRV-003: Graceful Shutdown
On service stop, the Server shall gracefully shut down all driver instances, the OPC UA listener, and flush logs before exiting.
#### Acceptance Criteria
- `IHostApplicationLifetime.ApplicationStopping` triggers orderly shutdown.
- Shutdown sequence: stop `HostStatusPublisher` → stop driver instances (disconnect each via `IDriver.DisposeAsync`, which for Galaxy tears down the named pipe) → stop OPC UA server (stop accepting new sessions, complete pending reads/writes) → flush Serilog.
- Shutdown completes within 30 seconds (Windows SCM timeout).
- All `IDisposable` / `IAsyncDisposable` components disposed in reverse-creation order.
- Final log entry: `"OtOpcUa.Server shutdown complete"` at Information level.
---
### SRV-004: Unhandled Exception Handling
The Server shall handle unexpected crashes gracefully.
#### Acceptance Criteria
- Registers `AppDomain.CurrentDomain.UnhandledException` handler that logs Fatal before the process terminates.
- Windows service recovery configured: restart on failure with 60-second delay.
- Fatal log entry includes full exception details.
---
### SRV-005: Drivers Hosted In-Process
All drivers except Galaxy run in-process within the Server.
#### Acceptance Criteria
- Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client drivers are resolved from the DI container and managed by `DriverHost`.
- Galaxy driver in-process component is `Driver.Galaxy.Proxy`, which forwards to `OtOpcUa.Galaxy.Host` over the named pipe (see GHX-*).
- Each driver instance's lifecycle (connect, discover, subscribe, dispose) is orchestrated by `DriverHost`.
---
### SRV-006: Redundancy-Node Bootstrap
The Server shall bootstrap its redundancy identity from `appsettings.json` and the Config DB.
#### Acceptance Criteria
- `Node:NodeId` + `Node:ClusterId` identify this node uniquely; the `Redundancy` coordinator looks up `ClusterNode.RedundancyRole` (Primary / Secondary) from the Config DB.
- Two nodes of the same cluster connect to the same Config DB and the same ClusterId but have different NodeIds and different `ApplicationUri` values.
- Missing or ambiguous `(ClusterId, NodeId)` causes startup failure.
---
## OtOpcUa.Admin — Service Host Requirements (ADM-*)
### ADM-001: ASP.NET Core Blazor Server
The Admin app shall use `WebApplication.CreateBuilder` with Razor Components (`AddRazorComponents().AddInteractiveServerComponents()`), SignalR, and cookie authentication.
#### Acceptance Criteria
- Blazor Server (not WebAssembly) per `plan.md` §Tech Stack.
- Hosts SignalR hubs for live cluster state (used by `ClusterNodeGenerationState` views, crash-loop alerts, etc.).
- Runs as a Windows service via `AddWindowsService` OR as a standard ASP.NET Core process behind IIS / reverse proxy (site decides).
- Platform target: .NET 10 x64.
---
### ADM-002: Authentication and Authorization
Admin users authenticate via LDAP bind with cookie auth; three admin roles gate operations.
#### Acceptance Criteria
- Cookie auth scheme: `OtOpcUa.Admin`, 8-hour expiry, path `/login` for challenge.
- LDAP bind via `LdapAuthService`; user group memberships map to admin roles (`ConfigViewer`, `ConfigEditor`, `FleetAdmin`).
- Authorization policies:
- `CanEdit` requires `ConfigEditor` or `FleetAdmin`.
- `CanPublish` requires `FleetAdmin`.
- View-only access requires `ConfigViewer` (or higher).
- Unauthenticated requests to any Admin page redirect to `/login`.
- Per-cluster role grants layer on top: a `ConfigEditor` with no grant for cluster X can view it but not edit.
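The cookie scheme and policies can be sketched with standard ASP.NET Core wiring. The `OtOpcUa.Admin` scheme name, `/login` path, 8-hour expiry, role names, and policy names come from this requirement; treating per-cluster grants as a separate check layered on top is an assumption about the implementation:

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddAuthentication("OtOpcUa.Admin")
    .AddCookie("OtOpcUa.Admin", o =>
    {
        o.LoginPath = "/login";                    // unauthenticated requests redirect here
        o.ExpireTimeSpan = TimeSpan.FromHours(8);  // 8-hour cookie expiry
    });

builder.Services.AddAuthorization(o =>
{
    o.AddPolicy("CanEdit",    p => p.RequireRole("ConfigEditor", "FleetAdmin"));
    o.AddPolicy("CanPublish", p => p.RequireRole("FleetAdmin"));
    o.AddPolicy("CanView",    p => p.RequireRole("ConfigViewer", "ConfigEditor", "FleetAdmin"));
});
// Per-cluster role grants would be enforced by an additional per-resource check
// (e.g. an IAuthorizationHandler), not by these role policies alone.
```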
---
### ADM-003: Config DB as Sole Write Path
The Admin service shall be the only process with write access to the Config DB.
#### Acceptance Criteria
- EF Core `OtOpcUaConfigDbContext` configured with the SQL login / connection string that has read+write permission on config tables.
- Server nodes connect with a read-only principal (`grant SELECT` only).
- Admin writes produce draft-generation rows; publish writes are atomic and transactional.
- Every write is audited via `AuditLogService` per ADM-006.
---
### ADM-004: Prometheus /metrics Endpoint
The Admin service shall expose an OpenTelemetry → Prometheus metrics endpoint at `/metrics`.
#### Acceptance Criteria
- `OpenTelemetry.Metrics` registered with Prometheus exporter.
- `/metrics` scrapeable without authentication (standard Prometheus pattern) OR gated behind an infrastructure allow-list (site-configurable).
- Exports metrics from Server nodes of managed clusters (aggregated via Config DB heartbeat telemetry) plus Admin-local metrics (login attempts, publish duration, active sessions).
---
### ADM-005: Graceful Shutdown
On shutdown, the Admin service shall disconnect SignalR clients cleanly, finish in-flight DB writes, and flush Serilog.
#### Acceptance Criteria
- `IHostApplicationLifetime.ApplicationStopping` closes SignalR hub connections gracefully.
- In-flight publish transactions are allowed to complete up to 30 seconds.
- Final log entry: `"OtOpcUa.Admin shutdown complete"`.
---
### ADM-006: Audit Logging
Every publish and every ACL / role-grant change shall produce an immutable audit row via `AuditLogService`.
#### Acceptance Criteria
- Audit rows include: timestamp (UTC), acting principal (LDAP DN + display name), action, entity kind + id, before/after generation number where applicable, session id, source IP.
- Audit rows are never mutated or deleted by application code.
- Audit table schema enforces immutability via DB permissions (no UPDATE / DELETE granted to the Admin app's principal).
---
## OtOpcUa.Galaxy.Host — Service Host Requirements (GHX-*)
### GHX-001: TopShelf Windows Service Hosting
The Galaxy Host shall use TopShelf for Windows service lifecycle (install, uninstall, start, stop) and interactive console mode.
#### Acceptance Criteria
- Service name `OtOpcUaGalaxyHost`, display name `OtOpcUa Galaxy Host`.
- Installs via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe install`.
- Uninstalls via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe uninstall`.
- Runs as a configured user account (typically the same account as the Server, or a dedicated Galaxy service account with ArchestrA platform access).
- Interactive console mode (no args) for development / debugging.
- Platform target: **.NET Framework 4.8 x86** — required for MXAccess COM 32-bit interop.
- Development deployments may use NSSM in place of TopShelf.
### Details
- Service description: "OtOpcUa Galaxy Host — MXAccess + Galaxy Repository backend for the Galaxy driver, named-pipe IPC to OtOpcUa.Server."
---
### GHX-002: Named-Pipe IPC Bootstrap
The Host shall open a named pipe on startup whose name, ACL, and shared secret come from environment variables supplied by the supervisor at spawn time.
#### Acceptance Criteria
- `OTOPCUA_GALAXY_PIPE` → pipe name (default `OtOpcUaGalaxy`).
- `OTOPCUA_ALLOWED_SID` → SID of the principal allowed to connect; any other principal is denied at the ACL layer.
- `OTOPCUA_GALAXY_SECRET` → per-process shared secret; `Driver.Galaxy.Proxy` must present it on handshake.
- `OTOPCUA_GALAXY_BACKEND` → `stub` / `db` / `mxaccess` (default `mxaccess`) — selects which backend implementation is loaded.
- Missing `OTOPCUA_ALLOWED_SID` or `OTOPCUA_GALAXY_SECRET` at startup throws with a descriptive error.
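The bootstrap contract above can be sketched as follows. The variable names and defaults come from this requirement; the `GalaxyHostBootstrapSketch` type, the helper, and the error wording are illustrative, and the Windows pipe ACL (`OTOPCUA_ALLOWED_SID` → `PipeSecurity`) plus the secret handshake are elided:

```csharp
using System;

public static class GalaxyHostBootstrapSketch
{
    public static (string PipeName, string AllowedSid, string Secret, string Backend) ReadEnvironment()
    {
        // Required variables throw a descriptive error; optional ones fall back
        // to their documented defaults.
        static string Require(string name) =>
            Environment.GetEnvironmentVariable(name)
            ?? throw new InvalidOperationException($"{name} not set — refusing to start Galaxy Host");

        return (
            PipeName:   Environment.GetEnvironmentVariable("OTOPCUA_GALAXY_PIPE") ?? "OtOpcUaGalaxy",
            AllowedSid: Require("OTOPCUA_ALLOWED_SID"),
            Secret:     Require("OTOPCUA_GALAXY_SECRET"),
            Backend:    Environment.GetEnvironmentVariable("OTOPCUA_GALAXY_BACKEND") ?? "mxaccess");
    }
}
```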
---
### GHX-003: Backend Lifecycle
The Host shall instantiate the STA pump + MXAccess backend + Galaxy Repository + optional Historian plugin in a defined order and tear them down cleanly on shutdown.
#### Acceptance Criteria
- Startup (mxaccess backend): initialize Serilog → resolve env vars → create `PipeServer` → start `StaPump` → create `MxAccessClient` on STA thread → initialize `GalaxyRepository` → optionally initialize Historian plugin → begin pipe request handling.
- Shutdown: stop pipe → dispose MxAccessClient (MXA-007 COM cleanup) → dispose STA pump → flush Serilog.
- Shutdown must complete within 30 seconds (Windows SCM timeout).
- `Console.CancelKeyPress` triggers the same sequence in console mode.
---
### GHX-004: Unhandled Exception Handling
The Host shall log Fatal on crash and let the supervisor restart it.
#### Acceptance Criteria
- `AppDomain.CurrentDomain.UnhandledException` handler logs Fatal with full exception details before termination.
- The supervisor's driver-stability policy (`docs/v2/driver-stability.md`) governs restart behavior — backoff, crash-loop detection, and alerting live there, not in the Host.
- Server-side: `Driver.Galaxy.Proxy` detects pipe disconnect, opens its capability circuit, reports Bad quality on Galaxy nodes; reconnects automatically when the Host is back.

# Status Dashboard — Retired
Parent: [HLR-009](HighLevelReqs.md#hlr-009-status-dashboard)
> **Revision** — Retired 2026-04-19 (task #205). The embedded HTTP Status Dashboard hosted inside the v1 LmxOpcUa service (`Dashboard:Port 8081`) has been **superseded by the Admin UI** introduced in OtOpcUa v2. The requirements formerly numbered DASH-001 through DASH-009 no longer apply.
## What replaces it
Operator surface is now the **OtOpcUa Admin** Blazor Server web app:
- Canonical design doc: `docs/v2/admin-ui.md`
- High-level operator surface requirement: [HLR-015](HighLevelReqs.md#hlr-015-admin-ui-operator-surface)
- Service-host requirements for the Admin process: [ServiceHostReqs.md → ADM-*](ServiceHostReqs.md#otopcua-admin---service-host-requirements-adm-)
- Cross-cluster metrics endpoint: `/metrics` on the Admin app — see [HLR-017](HighLevelReqs.md#hlr-017-prometheus-metrics).
- Audit log: see [HLR-016](HighLevelReqs.md#hlr-016-audit-logging) and `AuditLogService`.
## Mapping from retired DASH-* requirements to today's surface
| Retired requirement | Replacement |
|---------------------|-------------|
| DASH-001 Embedded HTTP listener | Admin UI (Blazor Server) hosted in the `OtOpcUa.Admin` process. |
| DASH-002 Connection panel | Admin UI cluster-node view (live via SignalR) shows per-driver connection state. |
| DASH-003 Health panel | Admin UI renders `DriverHealth` + Polly circuit state per driver instance; cluster-level rollup on the cluster dashboard. |
| DASH-004 Subscriptions panel | Prometheus gauges (session count, monitored-item count, driver-subscription count) exposed via `/metrics`. |
| DASH-005 Operations table | Capability-call duration histograms + counts exposed via `/metrics`; Admin UI renders latency summaries per `DriverInstanceId`. |
| DASH-006 Footer (last-updated + version) | Admin UI footer; version stamped from the assembly version of the Admin app. |
| DASH-007 Auto-refresh | Admin UI uses SignalR push for live updates — no meta-refresh. |
| DASH-008 JSON status API | Prometheus `/metrics` endpoint is the programmatic surface. |
| DASH-009 Galaxy info panel | Admin UI Galaxy-driver-instance detail view (driver config, last discovery time, Galaxy DB connection state, MXAccess pipe health). |
### Details
- HTTP prefix: `http://+:{port}/` to bind to all interfaces.
- If HttpListener fails to start (port conflict, missing URL reservation), log Error and continue service startup without the dashboard.
- HTML page is self-contained: inline CSS, no external resources (no CDN, no JavaScript frameworks).
---
## DASH-002: Connection Panel
The dashboard shall display a Connection panel showing MXAccess connection state.
### Acceptance Criteria
- Shows: **Connected** (True/False), **State** (Connected/Disconnected/Reconnecting/Error), **Connected Since** (UTC timestamp).
- Green left border when Connected, red when Disconnected/Error, yellow when Reconnecting.
- "Connected Since" shows "N/A" when not connected.
- Data sourced from MXAccess client's connection state properties.
### Details
- Timestamp format: `yyyy-MM-dd HH:mm:ss UTC`.
- Panel title: "Connection".
---
## DASH-003: Health Panel
The dashboard shall display a Health panel showing overall service health.
### Acceptance Criteria
- Three states: **Healthy** (green text), **Degraded** (yellow text), **Unhealthy** (red text).
- Includes a health message string explaining the status.
- Health rules:
- Not connected to MXAccess → Unhealthy
- Success rate < 50% with > 100 total operations → Degraded
- Connected with acceptable success rate → Healthy
### Details
- Health message examples: "LmxOpcUa is healthy", "MXAccess client is not connected", "Average success rate is below 50%".
- Green left border for Healthy, yellow for Degraded, red for Unhealthy.
---
## DASH-004: Subscriptions Panel
The dashboard shall display a Subscriptions panel showing subscription statistics.
### Acceptance Criteria
- Shows: **Clients** (connected OPC UA client count), **Tags** (total variable nodes in address space), **Active** (active MXAccess subscriptions), **Delivered** (cumulative data change notifications delivered).
- Values update on each dashboard refresh.
- Zero values shown as "0", not blank.
### Details
- "Tags" is the count of variable nodes, not object/folder nodes.
- "Active" is the count of distinct MXAccess item subscriptions (after ref-counting — the number of actual AdviseSupervisory calls, not the number of OPC UA monitored items).
- "Delivered" is a running counter since service start (not reset on reconnect).
---
## DASH-005: Operations Table
The dashboard shall display an operations metrics table showing performance statistics.
### Acceptance Criteria
- Table with columns: **Operation**, **Count**, **Success Rate**, **Avg (ms)**, **Min (ms)**, **Max (ms)**, **P95 (ms)**.
- Rows: Read, Write, Subscribe, Browse.
- Empty cells show em-dash ("—") when no data available (count = 0).
- Success rate displayed as percentage (e.g., "99.8%").
- Latency values rounded to 1 decimal place.
### Details
- Metrics sourced from the PerformanceMetrics component (1000-entry rolling buffer for percentile calculation).
- "Browse" row tracks OPC UA browse operations.
- "Subscribe" row tracks OPC UA CreateMonitoredItems operations.
---
## DASH-006: Footer
The dashboard shall display a footer with last-updated time and service identification.
### Acceptance Criteria
- Format: "Last updated: {timestamp} UTC | Service: ZB.MOM.WW.OtOpcUa.Host v{version}".
- Timestamp is the server-side UTC time when the HTML was generated.
- Version is read from the assembly version (`Assembly.GetExecutingAssembly().GetName().Version`).
---
## DASH-007: Auto-Refresh
The dashboard page shall auto-refresh to show current status without manual reload.
### Acceptance Criteria
- HTML page includes `<meta http-equiv="refresh" content="10">` for 10-second auto-refresh.
- No JavaScript required for refresh (pure HTML meta-refresh).
- Refresh interval: configurable via `Dashboard:RefreshIntervalSeconds`, default 10 seconds.
---
## DASH-008: JSON Status API
The `/api/status` endpoint shall return a JSON object with all dashboard data for programmatic consumption.
### Acceptance Criteria
- Response Content-Type: `application/json`.
- JSON structure includes: connection state, health status, subscription statistics, and operation metrics.
- Same data as the HTML dashboard, structured for machine consumption.
- Suitable for integration with external monitoring tools.
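The requirement does not pin down exact property names; a response shape like the following would satisfy it (all field names and values illustrative):

```json
{
  "connectionState": "Connected",
  "health": { "status": "Healthy", "message": "LmxOpcUa is healthy" },
  "subscriptions": { "clients": 2, "tags": 1500, "active": 420, "delivered": 103245 },
  "operations": {
    "read": { "count": 1200, "successRate": 99.8, "avgMs": 3.1, "minMs": 0.8, "maxMs": 41.2, "p95Ms": 7.9 }
  }
}
```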
---
## DASH-009: Galaxy Info Panel
The dashboard shall display a Galaxy Info panel showing Galaxy Repository state.
### Acceptance Criteria
- Shows: **Galaxy Name** (e.g., ZB), **DB Status** (Connected/Disconnected), **Last Deploy** (timestamp from `galaxy.time_of_last_deploy`), **Objects** (count), **Attributes** (count), **Last Rebuild** (timestamp of last address space rebuild).
- Provides visibility into the Galaxy Repository component's state independently of MXAccess connection status.
### Details
- "DB Status" reflects whether the most recent change detection poll succeeded.
- "Last Deploy" shows the raw `time_of_last_deploy` value from the Galaxy database.
- "Objects" and "Attributes" show counts from the most recent successful hierarchy/attribute query.
A formal requirements-level doc for the Admin UI (AdminUiReqs.md) is not yet written — the design doc at `docs/v2/admin-ui.md` serves as the authoritative reference until formal cert-compliance requirements are needed.

# Security
## Overview
OtOpcUa has four independent security concerns. This document covers all four:
1. **Transport security** — OPC UA secure channel (signing, encryption, X.509 trust).
2. **OPC UA authentication** — Anonymous / UserName / X.509 session identities; UserName tokens authenticated by LDAP bind.
3. **Data-plane authorization** — who can browse, read, subscribe, write, acknowledge alarms on which nodes. Evaluated by `PermissionTrie` against the Config DB `NodeAcl` tree.
4. **Control-plane authorization** — who can view or edit fleet configuration in the Admin UI. Gated by the `AdminRole` (`ConfigViewer` / `ConfigEditor` / `FleetAdmin`) claim from `LdapGroupRoleMapping`.
Transport security and OPC UA authentication are per-node concerns configured in the Server's bootstrap `appsettings.json`. Data-plane ACLs and Admin role grants live in the Config DB.
---
## Transport Security
### Overview
The OtOpcUa Server supports configurable OPC UA transport security profiles that control how data is protected on the wire between OPC UA clients and the server.
There are two distinct layers of security in OPC UA:
- **Transport security** -- secures the communication channel itself using TLS-style certificate exchange, message signing, and encryption. This is what the `OpcUaServer:SecurityProfile` setting controls.
- **UserName token encryption** -- protects user credentials (username/password) sent during session activation. The OPC UA stack encrypts UserName tokens using the server's application certificate regardless of the transport security mode. UserName authentication therefore works on `None` endpoints too — the credentials themselves are always encrypted. A secure transport profile adds protection against message-level tampering and eavesdropping of data payloads.
### Supported security profiles
The server supports seven transport security profiles:
| Profile | Security Policy | Message Security Mode | Description |
|---|---|---|---|
| `None` | None | None | No transport security. Development and diagnostics only. |
| `Basic256Sha256-Sign` | Basic256Sha256 | Sign | Messages signed but not encrypted. |
| `Basic256Sha256-SignAndEncrypt` | Basic256Sha256 | SignAndEncrypt | Messages signed and encrypted. Common production baseline. |
| `Aes128_Sha256_RsaOaep-Sign` | Aes128_Sha256_RsaOaep | Sign | AES-128 with RSA-OAEP key wrap; signing only. |
| `Aes128_Sha256_RsaOaep-SignAndEncrypt` | Aes128_Sha256_RsaOaep | SignAndEncrypt | AES-128 with RSA-OAEP; signed and encrypted. |
| `Aes256_Sha256_RsaPss-Sign` | Aes256_Sha256_RsaPss | Sign | Strongest profile with AES-256 and RSA-PSS signatures. |
| `Aes256_Sha256_RsaPss-SignAndEncrypt` | Aes256_Sha256_RsaPss | SignAndEncrypt | Strongest profile. Recommended for high-security deployments. |
The server exposes a separate endpoint for each configured profile, and clients select the one they prefer during connection.
If no valid profiles are configured (or all names are unrecognized), the server falls back to `None` with a warning in the log.
### Configuration
Transport security is configured in the `OpcUaServer` section of the Server process's bootstrap `appsettings.json`:
```json
{
"OpcUaServer": {
"EndpointUrl": "opc.tcp://0.0.0.0:4840/OtOpcUa",
"ApplicationName": "OtOpcUa Server",
"ApplicationUri": "urn:node-a:OtOpcUa",
"PkiStoreRoot": "C:/ProgramData/OtOpcUa/pki",
"AutoAcceptUntrustedClientCertificates": false,
"SecurityProfile": "Basic256Sha256-SignAndEncrypt"
}
}
```
The server certificate is auto-generated on first start if none exists in `{PkiStoreRoot}/own/`. It is always generated, even for `None`-only deployments, because UserName token encryption depends on it.
### PKI directory layout
The server stores certificates in a directory-based PKI store rooted at the configured `PkiStoreRoot`:
```
{PkiStoreRoot}/
own/ Server's own application certificate and private key
issuer/ CA certificates that issued trusted client certificates
trusted/ Explicitly trusted client (peer) certificates
rejected/ Certificates that were presented but not trusted
```
### Certificate trust flow
When a client connects using a secure profile (`Sign` or `SignAndEncrypt`), the following trust evaluation occurs:
1. The client presents its application certificate during the secure channel handshake.
2. The server checks whether the certificate exists in the `trusted/` store.
3. If found, the connection proceeds.
4. If not found and `AutoAcceptUntrustedClientCertificates` is `true`, the certificate is automatically copied to `trusted/` and the connection proceeds.
5. If not found and `AutoAcceptUntrustedClientCertificates` is `false`, the certificate is copied to `rejected/` and the connection is refused.
The Admin UI `Certificates.razor` page uses `CertTrustService` (singleton reading `CertTrustOptions` for the Server's `PkiStoreRoot`) to promote rejected client certs to trusted without operators having to file-copy manually.
### Production hardening
The default settings prioritize ease of development. Before deploying to production:
- Set `AutoAcceptUntrustedClientCertificates = false`.
- Drop `None` from the profile set.
- Use the Admin UI to promote trusted client certs rather than the auto-accept fallback.
- Periodically audit the `rejected/` directory; an unexpected entry is often a misconfigured client or a probe attempt.
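Applied to the bootstrap config above, the first two bullets look like this (profile choice illustrative):

```json
{
  "OpcUaServer": {
    "SecurityProfile": "Aes256_Sha256_RsaPss-SignAndEncrypt",
    "AutoAcceptUntrustedClientCertificates": false
  }
}
```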
---
## OPC UA Authentication
The Server accepts three OPC UA identity-token types:
| Token | Handler | Notes |
|---|---|---|
| Anonymous | `IUserAuthenticator.AuthenticateAsync(username: "", password: "")` | Refused in strict mode unless explicit anonymous grants exist; allowed in lax mode for backward compatibility. |
| UserName/Password | `LdapUserAuthenticator` (`src/ZB.MOM.WW.OtOpcUa.Server/Security/LdapUserAuthenticator.cs`) | LDAP bind + group lookup; resolved `LdapGroups` flow into the session's identity bearer (`ILdapGroupsBearer`). |
| X.509 Certificate | Stack-level acceptance + role mapping via CN | X.509 identity carries `AuthenticatedUser` + read roles; finer-grain authorization happens through the data-plane ACLs. |
### LDAP bind flow (`LdapUserAuthenticator`)
`Program.cs` in the Server registers the authenticator based on `OpcUaServer:Ldap`:
```csharp
builder.Services.AddSingleton<IUserAuthenticator>(sp => ldapOptions.Enabled
? new LdapUserAuthenticator(ldapOptions, sp.GetRequiredService<ILogger<LdapUserAuthenticator>>())
: new DenyAllUserAuthenticator());
```
`LdapUserAuthenticator`:
1. Refuses to bind over plain-LDAP unless `AllowInsecureLdap = true` (dev/test only).
2. Connects to `Server:Port`, optionally upgrades to TLS (`UseTls = true`, port 636 for AD).
3. Binds as the service account; searches `SearchBase` for `UserNameAttribute = username`.
4. Rebinds as the resolved user DN with the supplied password (the actual credential check).
5. Reads `GroupAttribute` (default `memberOf`) and strips the leading `CN=` so operators configure friendly group names in `GroupToRole`.
6. Returns a `UserAuthResult` carrying the validated username + the set of LDAP groups. The set flows through to the session identity via `ILdapGroupsBearer.LdapGroups`.
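Step 5's CN-stripping can be sketched as follows — the helper name and the simple comma-split parsing are illustrative, not the authenticator's actual code:

```csharp
using System;

public static class MemberOfParser
{
    // Turns a memberOf DN into the friendly common name used as a
    // GroupToRole key. Illustrative sketch, not the real implementation.
    public static string ToGroupName(string memberOfDn)
    {
        string firstRdn = memberOfDn.Split(',')[0].Trim();
        return firstRdn.StartsWith("CN=", StringComparison.OrdinalIgnoreCase)
            ? firstRdn.Substring(3)
            : firstRdn;
    }
}
```

`ToGroupName("CN=OPCUA-Operators,OU=Groups,DC=corp,DC=example,DC=com")` yields `OPCUA-Operators`, matching the friendly keys operators configure in `GroupToRole`.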
### Active Directory configuration
Production deployments typically point at Active Directory instead of GLAuth. Only four properties differ from the dev defaults: `Server`, `Port`, `UserNameAttribute`, and `ServiceAccountDn`. The same `GroupToRole` mechanism works — map your AD security groups to OPC UA roles.
Configuration example (Active Directory production):
```json
{
  "OpcUaServer": {
    "Ldap": {
      "Port": 636,
"UseTls": true,
"AllowInsecureLdap": false,
"SearchBase": "DC=corp,DC=example,DC=com",
"ServiceAccountDn": "CN=OtOpcUaSvc,OU=Service Accounts,DC=corp,DC=example,DC=com",
"ServiceAccountPassword": "<from your secret store>",
"DisplayNameAttribute": "displayName",
"GroupAttribute": "memberOf",
"UserNameAttribute": "sAMAccountName",
"GroupToRole": {
"OPCUA-Operators": "WriteOperate",
"OPCUA-Engineers": "WriteConfigure",
"OPCUA-Tuners": "WriteTune",
"OPCUA-AlarmAck": "AlarmAck"
}
}
}
}
```
Notes:
- `UserNameAttribute: "sAMAccountName"` is the critical AD override — the default `uid` is not populated on AD user entries, so the user-DN lookup returns no results without it. Use `userPrincipalName` instead if operators log in with `user@corp.example.com` form.
- `Port: 636` + `UseTls: true` is required under AD's LDAP-signing enforcement. AD increasingly rejects plain-LDAP bind; set `AllowInsecureLdap: false` to refuse fallback.
- `ServiceAccountDn` should name a dedicated read-only service principal — not a privileged admin. The account needs read access to user and group entries in the search base.
- `memberOf` values come back as full DNs like `CN=OPCUA-Operators,OU=OPC UA Security Groups,OU=Groups,DC=corp,DC=example,DC=com`. The authenticator strips the leading `CN=` RDN value so operators configure `GroupToRole` with readable group common-names.
- Nested group membership is **not** expanded — assign users directly to the role-mapped groups, or pre-flatten membership in AD. `LDAP_MATCHING_RULE_IN_CHAIN` / `tokenGroups` expansion is an authenticator enhancement, not a config change.
The same options bind the Admin's `LdapAuthService` (cookie auth / login form) so operators authenticate with a single credential across both processes.
---
## Data-Plane Authorization
Data-plane authorization is the check run on every OPC UA operation against an OtOpcUa endpoint: *can this authenticated user Browse / Read / Subscribe / Write / HistoryRead / AckAlarm / Call on this specific node?*
Per decision #129 the model is **additive-only — no explicit Deny**. Grants at each hierarchy level union; absence of a grant is the default-deny.
### Hierarchy
ACLs are evaluated against the UNS path:
```
ClusterId → Namespace → UnsArea → UnsLine → Equipment → Tag
```
Each level can carry `NodeAcl` rows (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/NodeAcl.cs`) that grant a permission bundle to a set of `LdapGroups`.
### Permission flags
```csharp
[Flags]
public enum NodePermissions : uint
{
Browse = 1 << 0,
Read = 1 << 1,
Subscribe = 1 << 2,
HistoryRead = 1 << 3,
WriteOperate = 1 << 4,
WriteTune = 1 << 5,
WriteConfigure = 1 << 6,
AlarmRead = 1 << 7,
AlarmAcknowledge = 1 << 8,
AlarmConfirm = 1 << 9,
AlarmShelve = 1 << 10,
MethodCall = 1 << 11,
ReadOnly = Browse | Read | Subscribe | HistoryRead | AlarmRead,
Operator = ReadOnly | WriteOperate | AlarmAcknowledge | AlarmConfirm,
Engineer = Operator | WriteTune | AlarmShelve,
Admin = Engineer | WriteConfigure | MethodCall,
}
```
The three Write tiers map to Galaxy's v1 `SecurityClassification`: `FreeAccess`/`Operate` → `WriteOperate`, `Tune` → `WriteTune`, `Configure` → `WriteConfigure`. `SecuredWrite` / `VerifiedWrite` / `ViewOnly` classifications remain read-only from OPC UA regardless of grant.
### Evaluator — `PermissionTrie`
`src/ZB.MOM.WW.OtOpcUa.Core/Authorization/`:
| Class | Role |
|---|---|
| `PermissionTrie` | Cluster-scoped trie; each node carries `(GroupId → NodePermissions)` grants. |
| `PermissionTrieBuilder` | Builds a trie from the current `NodeAcl` rows in one pass. |
| `PermissionTrieCache` | Per-cluster memoised trie; invalidated via `AclChangeNotifier` when the Admin publishes a draft that touches ACLs. |
| `TriePermissionEvaluator` | Implements `IPermissionEvaluator.Authorize(session, operation, scope)` — walks from the root to the leaf for the supplied `NodeScope`, unions grants along the path, compares required permission to the union. |
`NodeScope` carries `(ClusterId, NamespaceId, AreaId, LineId, EquipmentId, TagId)`; any suffix may be null — a tag-level ACL is more specific than an area-level ACL but both contribute via union.
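The union-along-the-path rule can be illustrated with a flattened sketch (the real evaluator walks the trie; the trimmed `Perms` copy and its `None = 0` fold seed are additions for the sketch):

```csharp
using System;
using System.Collections.Generic;

[Flags]
public enum Perms : uint   // trimmed copy of NodePermissions for the sketch
{
    None = 0,
    Browse = 1 << 0,
    Read = 1 << 1,
    Subscribe = 1 << 2,
    WriteOperate = 1 << 4,
}

public static class PathUnionSketch
{
    // Additive-only: union every grant found from root to leaf, then require
    // all requested flags in the union. No grant anywhere = default deny.
    public static bool IsAllowed(IEnumerable<Perms> grantsAlongPath, Perms required)
    {
        Perms union = Perms.None;
        foreach (var g in grantsAlongPath) union |= g;   // grants only add, never subtract
        return (union & required) == required;
    }
}
```

An area-level grant of `Browse | Read` plus a tag-level grant of `WriteOperate` unions to allow a write at the tag; with no `WriteOperate` grant anywhere on the path, the same write is denied.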
### Dispatch gate — `AuthorizationGate`
`src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs` bridges the OPC UA stack's `ISystemContext.UserIdentity` to the evaluator. `DriverNodeManager` holds exactly one reference to it and calls `IsAllowed(identity, OpcUaOperation.*, NodeScope)` on every Read, Write, HistoryRead, Browse, Subscribe, AckAlarm, Call path. A false return short-circuits the dispatch with `BadUserAccessDenied`.
Key properties:
- **Driver-agnostic.** No driver-level code participates in authorization decisions. Drivers report `SecurityClassification` as metadata on tag discovery; everything else flows through `AuthorizationGate`.
- **Fail-open-during-transition.** `StrictMode = false` (default during ACL rollouts) lets sessions without resolved LDAP groups proceed; flip `Authorization:StrictMode = true` in production once ACLs are populated.
- **Evaluator stays pure.** `TriePermissionEvaluator` has no OPC UA stack dependency — it's tested directly from xUnit.
### Probe-this-permission (Admin UI)
`PermissionProbeService` (`src/ZB.MOM.WW.OtOpcUa.Admin/Services/PermissionProbeService.cs`) lets an operator ask "if a user with groups X, Y, Z asked to do operation O on node N, would it succeed?" The answer is rendered in the AclsTab "Probe" dialog — same evaluator, same trie, so the Admin UI answer and the live Server answer cannot disagree.
### Full model
See [`docs/v2/acl-design.md`](v2/acl-design.md) for the complete design: trie invalidation, flag semantics, per-path override rules, and the reasoning behind additive-only (no Deny).
---
## Control-Plane Authorization
Control-plane authorization governs **the Admin UI** — who can view fleet config, edit drafts, publish generations, manage cluster nodes + credentials.
Per decision #150 control-plane roles are **deliberately independent of data-plane ACLs**. An operator who can read every OPC UA tag in production may not be allowed to edit cluster config; conversely a ConfigEditor may not have any data-plane grants at all.
### Roles
`src/ZB.MOM.WW.OtOpcUa.Admin/Services/AdminRoles.cs`:
| Role | Capabilities |
|---|---|
| `ConfigViewer` | Read-only access to drafts, generations, audit log, fleet status. |
| `ConfigEditor` | ConfigViewer plus draft editing (UNS, equipment, tags, ACLs, driver instances, reservations, CSV imports). Cannot publish. |
| `FleetAdmin` | ConfigEditor plus publish, cluster/node CRUD, credential management, role-grant management. |
Policies registered in Admin `Program.cs`:
```csharp
builder.Services.AddAuthorizationBuilder()
.AddPolicy("CanEdit", p => p.RequireRole(AdminRoles.ConfigEditor, AdminRoles.FleetAdmin))
.AddPolicy("CanPublish", p => p.RequireRole(AdminRoles.FleetAdmin));
```
Razor pages and API endpoints gate with `[Authorize(Policy = "CanEdit")]` / `"CanPublish"`; nav-menu sections hide via `<AuthorizeView>`.
### Role grant source
Admin reads `LdapGroupRoleMapping` rows from the Config DB (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/LdapGroupRoleMapping.cs`) — the same pattern as the data-plane `NodeAcl` but scoped to Admin roles + (optionally) cluster scope for multi-site fleets. The `RoleGrants.razor` page lets FleetAdmins edit these mappings without leaving the UI.
---
## OTOPCUA0001 Analyzer — Compile-Time Guard
Per-capability resilience (retry, timeout, circuit-breaker, bulkhead) is applied by `CapabilityInvoker` in `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/`. A driver-capability call made **outside** the invoker bypasses resilience entirely — which in production looks like inconsistent timeouts, un-wrapped retries, and unbounded blocking.
`OTOPCUA0001` (Roslyn analyzer at `src/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs`) fires as a compile-time **warning** when an `async`/`Task`-returning method on one of the seven guarded capability interfaces (`IReadable`, `IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`, `IAlarmSource`, `IHistoryProvider`) is invoked **outside** a lambda passed to `CapabilityInvoker.ExecuteAsync` / `ExecuteWriteAsync` / `AlarmSurfaceInvoker.*`. The analyzer walks up the syntax tree from the call site, finds any enclosing invoker invocation, and verifies the call lives transitively inside that invocation's anonymous-function argument — a sibling pattern (do the call, then invoke `ExecuteAsync` on something unrelated nearby) does not satisfy the rule.
Five xUnit-v3 + Shouldly tests at `tests/ZB.MOM.WW.OtOpcUa.Analyzers.Tests` cover the common fail/pass shapes + the sibling-pattern regression guard.
The rule is intentionally scoped to async surfaces — pure in-memory accessors like `IHostConnectivityProbe.GetHostStatuses()` return synchronously and do not require the invoker wrap.
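The call shapes the analyzer distinguishes can be sketched with stand-in types (simplified, illustrative signatures — not the real `Core` interfaces or invoker):

```csharp
using System;
using System.Threading.Tasks;

// Stand-ins for the guarded surface — illustrative shapes, not real Core signatures.
public interface IReadable { Task<int> ReadAsync(string tag); }

public sealed class FakeDriver : IReadable
{
    public Task<int> ReadAsync(string tag) => Task.FromResult(42);
}

public static class CapabilityInvoker
{
    // Resilience (retry / timeout / breaker / bulkhead) would wrap `body` here.
    public static Task<T> ExecuteAsync<T>(Func<Task<T>> body) => body();
}

public static class Shapes
{
    // ✅ Compliant — the capability call lives inside the invoker's lambda.
    public static Task<int> Wrapped(IReadable d) =>
        CapabilityInvoker.ExecuteAsync(() => d.ReadAsync("Line1/Speed"));

    // ❌ OTOPCUA0001 — sibling pattern: the read happens first, the invoker
    // wraps something unrelated, so resilience never applies to the read.
    public static async Task<int> Sibling(IReadable d)
    {
        var value = await d.ReadAsync("Line1/Speed"); // analyzer would warn here
        await CapabilityInvoker.ExecuteAsync(() => Task.FromResult(0));
        return value;
    }
}
```

`Sibling` compiles and runs fine — which is exactly why the rule needs the transitive-containment check rather than an "an invoker call exists nearby" heuristic.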
---
## Audit Logging
- **Server**: Serilog `AUDIT:` prefix on every authentication success/failure, certificate validation result, write access denial. Written alongside the regular rolling file sink.
- **Admin**: `AuditLogService` writes `ConfigAuditLog` rows to the Config DB for every publish, rollback, cluster-node CRUD, credential rotation. Visible in the Audit page for operators with `ConfigViewer` or above.
---
## Troubleshooting
### Certificate trust failure
Check `{PkiStoreRoot}/rejected/` for the client's cert. Promote via Admin UI Certificates page, or copy the `.der` file manually to `trusted/`.
### LDAP users can connect but fail authorization
Verify (a) `OpcUaServer:Ldap:GroupAttribute` returns groups in the form `CN=MyGroup,…` (OtOpcUa strips the `CN=` for matching), (b) a `NodeAcl` grant exists at any level of the node's UNS path that unions to the required permission, (c) `Authorization:StrictMode` is correctly set for the deployment stage.
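For (a), the stripping amounts to taking the DN's first RDN and dropping its `CN=` prefix case-insensitively; an illustrative sketch (hypothetical helper, not the shipped LDAP code):

```csharp
using System;

public static class LdapGroupName
{
    // Illustrative: extract the group name from a DN like
    // "CN=MyGroup,OU=Groups,DC=example,DC=com" by taking the first RDN
    // and stripping "CN=" case-insensitively ("cn=MyGroup,…" matches too).
    public static string FromDn(string dn)
    {
        var firstRdn = dn.Split(',')[0].Trim();
        return firstRdn.StartsWith("CN=", StringComparison.OrdinalIgnoreCase)
            ? firstRdn.Substring(3)
            : firstRdn;
    }
}
```

If the directory returns bare names instead of DNs, the input passes through unchanged — so a mismatch here usually shows up as symptom (b) never finding a grant.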
### LDAP bind rejected as "insecure"
Set `UseTls = true` + `Port = 636`, or temporarily flip `AllowInsecureLdap = true` in dev. Production Active Directory increasingly refuses plain-LDAP bind under LDAP-signing enforcement.
### `AuthorizationGate` denies every call after a publish
`AclChangeNotifier` invalidates the `PermissionTrieCache` on publish; a stuck cache is usually a missed notification. Restart the Server as a quick mitigation and file a bug — the design is to stay fresh without restarts.
---
Dev credentials in this inventory are convenience defaults, not secrets.
| Resource | Purpose | Type | Default port | Default credentials | Owner |
|----------|---------|------|--------------|---------------------|-------|
| **Docker Desktop for Windows** | Host for every driver test-fixture simulator (Modbus / AB CIP / S7 / OpcUaClient) + SQL Server | Install | (Hyper-V required; not compatible with TwinCAT runtime — see TwinCAT row below for the workaround) | n/a | Integration host admin |
| **Modbus fixture — `otopcua-pymodbus:3.13.0`** | Modbus driver integration tests | Docker image (local build, see `tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/Docker/`); 4 compose profiles: `standard` / `dl205` / `mitsubishi` / `s7_1500` | 5020 (non-privileged) | n/a (no auth in protocol) | Developer (per machine) |
| **AB CIP fixture — `otopcua-ab-server:libplctag-release`** | AB CIP driver integration tests | Docker image (multi-stage build of libplctag's `ab_server` from source, pinned to the `release` tag; see `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/Docker/`); 4 compose profiles: `controllogix` / `compactlogix` / `micro800` / `guardlogix` | 44818 (CIP / EtherNet/IP) | n/a | Developer (per machine) |
| **S7 fixture — `otopcua-python-snap7:1.0`** | S7 driver integration tests | Docker image (local build, `python-snap7>=2.0`; see `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.IntegrationTests/Docker/`); 1 compose profile: `s7_1500` | 1102 (non-privileged; driver honours `S7DriverOptions.Port`) | n/a | Developer (per machine) |
| **OPC UA Client fixture — `mcr.microsoft.com/iotedge/opc-plc:2.14.10`** | OpcUaClient driver integration tests | Docker image (Microsoft-maintained, pinned; see `tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests/Docker/`) | 50000 (OPC UA) | Anonymous (`--daa` off); auto-accept certs (`--aa`) | Developer (per machine) |
| **TwinCAT XAR runtime VM** | TwinCAT ADS testing (per `test-data-sources.md` §5; Beckhoff XAR cannot coexist with Hyper-V on the same OS) | Hyper-V VM with Windows + TwinCAT XAR installed under 7-day renewable trial | 48898 (ADS over TCP) | TwinCAT default route credentials configured per Beckhoff docs | Integration host admin |
| **OPC Foundation reference server** | OPC UA Client driver test source (per `test-data-sources.md` §"OPC UA Client") | Built from `OPCFoundation/UA-.NETStandard` `ConsoleReferenceServer` project | 62541 (default for the reference server) | Anonymous + Username (`user1` / `password1`) per the reference server's built-in user list | Integration host admin |
| **Rockwell Studio 5000 Logix Emulate** | AB CIP golden-box tier — closes UDT / ALMD / AOI / GuardLogix-safety / CompactLogix-ConnectionSize gaps the ab_server simulator can't cover. Loads the L5X project documented at `tests/.../AbCip.IntegrationTests/LogixProject/README.md`. Tests gated on `AB_SERVER_PROFILE=emulate` + `AB_SERVER_ENDPOINT=<ip>:44818`; see `docs/drivers/AbServer-Test-Fixture.md` §Logix Emulate golden-box tier | Windows-only install; **Hyper-V conflict** — can't coexist with Docker Desktop's WSL 2 backend on the same OS, same story as TwinCAT XAR. Runs on a dedicated Windows PC reachable on the LAN | 44818 (CIP / EtherNet/IP) | None required at the CIP layer; Studio 5000 project credentials per Rockwell install | Integration host admin (license + install); Developer (per session — open Emulate, load L5X, click Run) |
| **FOCAS TCP stub** (`Driver.Focas.TestStub`) | FOCAS functional testing (per `test-data-sources.md` §6) | Local .NET 10 console app from this repo | 8193 (FOCAS) | n/a | Developer / integration host (run on demand) |
| **FOCAS FaultShim** (`Driver.Focas.FaultShim`) | FOCAS native-fault injection (per `test-data-sources.md` §6) | Test-only native DLL named `Fwlib64.dll`, loaded via DLL search path in the test fixture | n/a (in-process) | n/a | Developer / integration host (test-only) |
### Docker fixtures — quick reference
Every driver's integration-test simulator ships as a Docker image (or pulls
one from MCR). Start the one you need, run `dotnet test`, stop it.
Container lifecycle is always manual — fixtures TCP-probe at collection
init + skip cleanly when nothing's running.
| Driver | Fixture image | Compose file | Bring up |
|---|---|---|---|
| Modbus | local-build `otopcua-pymodbus:3.13.0` | `tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/Docker/docker-compose.yml` | `docker compose -f <compose> --profile <standard\|dl205\|mitsubishi\|s7_1500> up -d` |
| AB CIP | local-build `otopcua-ab-server:libplctag-release` | `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/Docker/docker-compose.yml` | `docker compose -f <compose> --profile <controllogix\|compactlogix\|micro800\|guardlogix> up -d` |
| S7 | local-build `otopcua-python-snap7:1.0` | `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.IntegrationTests/Docker/docker-compose.yml` | `docker compose -f <compose> --profile s7_1500 up -d` |
| OpcUaClient | `mcr.microsoft.com/iotedge/opc-plc:2.14.10` (pinned) | `tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests/Docker/docker-compose.yml` | `docker compose -f <compose> up -d` |
First build of a local-build image takes 15 minutes; subsequent runs use
layer cache. `ab_server` is the slowest (multi-stage build clones
libplctag + compiles C). Stop with `docker compose -f <compose> --profile <…> down`.
**Endpoint overrides** — every fixture respects an env var to point at a
real PLC instead of the simulator:
- `MODBUS_SIM_ENDPOINT` (default `localhost:5020`)
- `AB_SERVER_ENDPOINT` (no default; overrides the local container endpoint)
- `S7_SIM_ENDPOINT` (default `localhost:1102`)
- `OPCUA_SIM_ENDPOINT` (default `opc.tcp://localhost:50000`)
No native launchers — Docker is the only supported path for these
fixtures. A fresh clone needs Docker Desktop and nothing else; fixture
TCP probes skip tests cleanly when the container isn't running.
See each driver's `docs/drivers/*-Test-Fixture.md` for the full coverage
map + gap inventory.
### D. Cloud / external services
| Resource | Purpose | Type | Access | Owner |
---
# FOCAS version / capability matrix
Authoritative source for the per-CNC-series ranges that
[`FocasCapabilityMatrix`](../../src/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/FocasCapabilityMatrix.cs)
enforces at driver init time. Every row cites the Fanuc FOCAS Developer
Kit function whose documented input range determines the ceiling.
**Why this exists** — we have no FOCAS hardware on the bench and no
working simulator. Fwlib32 returns `EW_NUMBER` / `EW_PARAM` when you
hand it an address outside the controller's supported range; the
driver would map that to a per-read `BadOutOfRange` at steady state.
Catching at `InitializeAsync` with this matrix surfaces operator
typos + mismatched series declarations as config errors before any
session is opened, which is the only feedback loop available without
a live CNC to read against.
**Who declares the series** — `FocasDeviceOptions.Series` in
`appsettings.json`. Defaults to `Unknown`, which is permissive — every
address passes validation. Pre-matrix configs don't break on upgrade.
---
## Series covered
| Enum value | Controller family | Typical era |
| --- | --- | --- |
| `Unknown` | (legacy / not declared) | permissive fallback |
| `Sixteen_i` | 16i / 18i / 21i | 1997-2008 |
| `Zero_i_D` | 0i-D | 2008-2013 |
| `Zero_i_F` | 0i-F | 2013-present, general-purpose |
| `Zero_i_MF` | 0i-MF | 0i-F milling variant |
| `Zero_i_TF` | 0i-TF | 0i-F turning variant |
| `Thirty_i` | 30i-A / 30i-B | 2007-present, high-end |
| `ThirtyOne_i` | 31i-A / 31i-B | simpler 30i variant |
| `ThirtyTwo_i` | 32i-A / 32i-B | compact 30i variant |
| `PowerMotion_i` | Power Motion i-A / i-MODEL A | motion-only controller |
## Macro variable range (`cnc_rdmacro` / `cnc_wrmacro`)
Common macros `1-33` + `100-199` + `500-999` are universal across all
series. Extended macros (`#10000+`) exist only on higher-end series.
The numbers below reflect the extended ceiling per series per the
DevKit range tables.
| Series | Min | Max | Notes |
| --- | ---: | ---: | --- |
| `Sixteen_i` | 0 | 999 | legacy ceiling — no extended |
| `Zero_i_D` | 0 | 999 | 0i-D still at legacy ceiling |
| `Zero_i_F` / `Zero_i_MF` / `Zero_i_TF` | 0 | 9999 | extended added on 0i-F |
| `Thirty_i` / `ThirtyOne_i` / `ThirtyTwo_i` | 0 | 99999 | full extended set |
| `PowerMotion_i` | 0 | 999 | atypical — limited macro coverage |
## Parameter range (`cnc_rdparam` / `cnc_wrparam`)
| Series | Min | Max |
| --- | ---: | ---: |
| `Sixteen_i` | 0 | 9999 |
| `Zero_i_D` / `Zero_i_F` / `Zero_i_MF` / `Zero_i_TF` | 0 | 14999 |
| `Thirty_i` / `ThirtyOne_i` / `ThirtyTwo_i` | 0 | 29999 |
| `PowerMotion_i` | 0 | 29999 |
## PMC letters (`pmc_rdpmcrng` / `pmc_wrpmcrng`)
Addresses are letter + number (e.g. `R100`, `F50.3`). Legacy
controllers omit the `F`/`G` signal groups that 30i-family ladder
programs use, and only the 30i-family exposes `K` (keep-relay) +
`T` (timer).
| Letter | 16i | 0i-D | 0i-F family | 30i family | Power Motion-i |
| --- | :-: | :-: | :-: | :-: | :-: |
| `X` | yes | yes | yes | yes | yes |
| `Y` | yes | yes | yes | yes | yes |
| `R` | yes | yes | yes | yes | yes |
| `D` | yes | yes | yes | yes | yes |
| `E` | — | yes | yes | yes | — |
| `A` | — | yes | yes | yes | — |
| `F` | — | — | yes | yes | — |
| `G` | — | — | yes | yes | — |
| `M` | — | — | yes | yes | — |
| `C` | — | — | yes | yes | — |
| `K` | — | — | — | yes | — |
| `T` | — | — | — | yes | — |
Letter match is case-insensitive. `FocasAddress.PmcLetter` is carried
as a string (not char) so the matrix can do ordinal-ignore-case
comparison.
## PMC address-number ceiling
PMC addresses are byte-addressed on read + bit-addressed on write;
`FocasAddress` carries the bit index separately, so these are byte
ceilings.
| Series | Max byte | Notes |
| --- | ---: | --- |
| `Sixteen_i` | 999 | legacy |
| `Zero_i_D` | 1999 | doubled since 16i |
| `Zero_i_F` family | 9999 | |
| `Thirty_i` family | 59999 | highest density |
| `PowerMotion_i` | 1999 | |
## Error surface
When a tag fails validation, `FocasDriver.InitializeAsync` throws
`InvalidOperationException` with a message of the form:
```
FOCAS tag '<name>' (<address>) rejected by capability matrix: <reason>
```
`<reason>` is the verbatim string from `FocasCapabilityMatrix.Validate`
and always names the series + the documented limit so the operator
can either raise the limit (if wrong) or correct the CNC series they
declared (if mismatched). Sample:
```
FOCAS tag 'X_axis_macro_ext' (MACRO:50000) rejected by capability
matrix: Macro variable #50000 is outside the documented range
[0, 9999] for Zero_i_F.
```
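The macro half of the matrix plus the message shape can be sketched directly from the tables above (hypothetical `MacroRange` helper — the real `FocasCapabilityMatrix.Validate` also covers parameters and PMC):

```csharp
using System;
using System.Collections.Generic;

public static class MacroRange
{
    // Macro ceilings per series, copied from the macro-range table above.
    private static readonly Dictionary<string, (int Min, int Max)> Ranges = new()
    {
        ["Sixteen_i"]     = (0, 999),
        ["Zero_i_D"]      = (0, 999),
        ["Zero_i_F"]      = (0, 9999),
        ["Zero_i_MF"]     = (0, 9999),
        ["Zero_i_TF"]     = (0, 9999),
        ["Thirty_i"]      = (0, 99999),
        ["ThirtyOne_i"]   = (0, 99999),
        ["ThirtyTwo_i"]   = (0, 99999),
        ["PowerMotion_i"] = (0, 999),
    };

    // Null = valid; otherwise a reason string shaped like the sample above.
    public static string Validate(string series, int macro)
    {
        if (!Ranges.TryGetValue(series, out var r))
            return null; // Unknown / undeclared series → permissive
        return macro >= r.Min && macro <= r.Max
            ? null
            : $"Macro variable #{macro} is outside the documented range " +
              $"[{r.Min}, {r.Max}] for {series}.";
    }
}
```

Note the `Unknown` fallback returns null unconditionally — the permissive pre-matrix behaviour described at the top of this doc.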
## How this matrix stays honest
- Every row is covered by a parameterized test in
[`FocasCapabilityMatrixTests.cs`](../../tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/FocasCapabilityMatrixTests.cs)
— 46 cases across macro / parameter / PMC-letter / PMC-number
boundaries + unknown-series permissiveness + rejection-message
content + case-insensitivity.
- Widening or narrowing a range in the matrix without updating this
doc will fail a test, because the theories cite the specific row
they reflect in their `InlineData`.
- The matrix is not comprehensive — it encodes only the subset of
FOCAS surface the driver currently exposes (Macro / Parameter /
PMC). When the driver gains a new capability (e.g. tool management,
alarm history), add its series-specific range tables here + matching
tests at the same time.
## Follow-up
This validation closes the cheap half of the FOCAS hardware-free
stability gap — config errors now fail at load instead of per-read.
The expensive half is Tier-C process isolation so that a crashing
`Fwlib32.dll` doesn't take the main OPC UA server down with it. See
[`docs/v2/implementation/focas-isolation-plan.md`](implementation/focas-isolation-plan.md)
for that plan (task #220).
---
# ADR-001 — Equipment node walker: how driver tags bind to the UNS address space
**Status:** Accepted 2026-04-20 — Option A (Config-primary); Option D deferred to v2.1
**Related tasks:** [#195 IdentificationFolderBuilder wire-in](../../../) (blocked on this)
**Related decisions in `plan.md`:** #110 (Tag belongs to Equipment via FK in Equipment ns),
#116 / #117 / #121 (five identifiers as properties, `Equipment.Name` as path segment),
#120 (UNS hierarchy mandatory in Equipment ns; SystemPlatform ns exempt).
## Context
Today the `DriverNodeManager` builds its address space by calling
`ITagDiscovery.DiscoverAsync` on each registered driver. Every driver returns whatever
browse shape its wire protocol produces — Galaxy returns gobjects with attributes, Modbus
returns whatever tag configs the operator authored, AB CIP returns controller-walk output,
etc. The result is a per-driver subtree, rooted under the driver's own namespace, with no
UNS levels.
The Config DB meanwhile carries the authoritative UNS model for every Equipment-kind
namespace:
```
ConfigGeneration
└─ ServerCluster
└─ Namespace (Kind=Equipment)
└─ UnsArea
└─ UnsLine
└─ Equipment (carries 9 OPC 40010 Identification fields + 5 identifiers)
└─ Tag (EquipmentId FK when Kind=Equipment; DriverInstanceId + FolderPath when Kind=SystemPlatform)
```
Decision #110 already binds `Tag → Equipment` by foreign key. Decision #120 requires the
Equipment-namespace browse tree to conform to `Enterprise/Site/Area/Line/Equipment/TagName`.
The building blocks exist:
- `IdentificationFolderBuilder.Build(equipmentBuilder, row)` — pure function that hangs
nine OPC 40010 properties under an Equipment node. Shipped, untested in integration.
- `Equipment` table rows with `UnsLineId` FK + the 9 identification columns + the 5
identifier columns.
- `Tag` table rows with nullable `EquipmentId` + a `TagConfig` JSON column carrying the
wire-level address.
- `NodeScopeResolver` — Phase-1 stub that returns a cluster-level scope only, with an
explicit "future resolver will join against the Configuration DB" note.
What's missing is the **walker**: server-side code that reads the UNS + Equipment + Tag
rows for the current published generation, traverses them in UNS order, materializes each
level as an OPC UA folder, and wires `IdentificationFolderBuilder.Build` + the 5-identifier
properties under each Equipment node.
The walker isn't pure bookkeeping — it has to decide **how driver-discovered tags bind to
UNS Equipment nodes**. That's the decision this ADR resolves.
## Open question
> For an Equipment-kind driver, is the published OPC UA surface driven by (a) the Config
> DB's `Tag` rows, (b) the driver's `ITagDiscovery.DiscoverAsync` output, or (c) some
> combination?
SystemPlatform-kind drivers (Galaxy only, today) are unambiguous: decision #120 exempts
them from UNS + they keep their v1 native hierarchy. The walker does not touch
SystemPlatform namespaces beyond the existing driver-discovery path. This ADR only decides
Equipment-kind composition.
## Options
### Option A — Config-primary
The `Tag` table is the sole source of truth for what gets published. `ITagDiscovery`
becomes a validation + enrichment surface, not a discovery surface.
**Walker flow:**
1. Read `UnsArea` / `UnsLine` / `Equipment` / `Tag` for the published generation.
2. Walk Area → Line → Equipment, materializing each level as an OPC UA folder.
3. Under each Equipment node:
- Add the 5 identifier properties (`EquipmentId`, `EquipmentUuid`, `MachineCode`,
`ZTag`, `SAPID`) as OPC UA properties per decision #121.
- Call `IdentificationFolderBuilder.Build` to add the `Identification` sub-folder with
the 9 OPC 40010 fields.
- For each `Tag` row bound to this Equipment: ask the driver's `IReadable` /
`IWritable` surface whether it can address `Tag.TagConfig.address`; if yes, create a
variable node. If no, create a `BadNotFound` placeholder with a diagnostic so
operators see the mismatch instead of a silent drop.
4. `ITagDiscovery.DiscoverAsync` is re-purposed to **enrich** — driver may return schema
hints (data type, bounds, description) that operators missed when authoring the Tag
row. The Admin UI surfaces them as "driver suggests" hints for next-draft edits.
**Trade-offs:**
- ✅ Matches decision #110's framing cleanly. `Tag` rows carry the contract; nothing gets
published that's not explicitly authored.
- ✅ Same model for every Equipment-kind driver. Modbus / S7 / AB CIP / AB Legacy /
TwinCAT / FOCAS / OpcUaClient all compose identically.
- ✅ UNS hierarchy is always exactly as-authored. No race between "driver added a tag at
runtime" and "operator hasn't approved it yet."
- ✅ Aligns with the Config-DB-first operator story the Admin UI already tells.
- ❌ Drivers with large native schemas (TwinCAT PLCs with thousands of symbols, AB CIP
controllers with full @tags walkers) can't "just publish everything" — operators must
author Tag rows. This is a pure workflow cost, not a correctness cost.
- ❌ A Tag row whose driver can't address it produces a placeholder node at runtime
(BadNotFound), not a publish-time validation failure. Mitigation: `sp_ValidateDraft`
already validates per-driver references at publish — extend it to call each driver's
existence check, or keep it as runtime-visible with an Admin UI indicator.
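The Option A flow reduces to a fold over four row sets; a compressed sketch with stand-in record types (names are illustrative, not the shipped Config entities):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-ins for the Config DB rows — illustrative shapes only.
public record Area(int Id, string Name);
public record Line(int Id, int AreaId, string Name);
public record Equip(int Id, int LineId, string Name);
public record TagRow(int EquipmentId, string Name, string Address);

public static class WalkerSketch
{
    // Yields the browse path of every node Option A would materialize,
    // suffixing "!" where a Tag row's address isn't driver-reachable.
    public static IEnumerable<string> Walk(
        IEnumerable<Area> areas, IEnumerable<Line> lines,
        IEnumerable<Equip> equips, IEnumerable<TagRow> tags,
        Func<string, bool> driverCanAddress)
    {
        foreach (var a in areas)
        foreach (var l in lines.Where(l => l.AreaId == a.Id))
        foreach (var e in equips.Where(e => e.LineId == l.Id))
        {
            // Identification folder + the 5 identifier properties hang here.
            yield return $"{a.Name}/{l.Name}/{e.Name}";
            foreach (var t in tags.Where(t => t.EquipmentId == e.Id))
                yield return driverCanAddress(t.Address)
                    ? $"{a.Name}/{l.Name}/{e.Name}/{t.Name}"
                    : $"{a.Name}/{l.Name}/{e.Name}/{t.Name}!";
        }
    }
}
```

The `!` suffix stands in for step 3's `BadNotFound` placeholder; the real walker creates a variable node with a diagnostic rather than a marked path.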
### Option B — Discovery-primary
`ITagDiscovery.DiscoverAsync` is the source of truth for what gets published. The walker
joins discovered tags against Config-DB Equipment rows to assemble the UNS tree.
**Walker flow:**
1. Driver runs `ITagDiscovery.DiscoverAsync` — returns its native tag graph.
2. Walker reads `Equipment` + `Tag` rows; uses `Tag.TagConfig.address` to match against
discovered references.
3. For each match: materialize the UNS path + attach the discovered variable under the
bound Equipment node.
4. Discovered tags with no matching `Tag` row: silently dropped (or surfaced under a
`Unmapped/` diagnostic folder).
5. `Tag` rows with no discovered match: hidden (or surfaced as `BadNotFound` placeholder
same as Option A).
**Trade-offs:**
- ✅ Lets drivers with rich discovery (TwinCAT `SymbolLoaderFactory`, AB CIP `@tags`)
publish live controller state without operator-authored Tag rows for every symbol.
- ✅ Driver-native metadata (real OPC UA data types, real bounds) is authoritative.
- ❌ Conflicts with the Config-DB-first publish workflow. Operators publish a generation
+ discover a different set at runtime + the two don't necessarily match. Diff tooling
becomes harder.
- ❌ Galaxy's SystemPlatform-namespace path still uses Option-B-like discovery — so the
codebase would host two compositions regardless. But adding a second
discovery-primary composition for Equipment-kind would double the surface operators
have to reason about.
- ❌ Requires each driver to emit tag identifiers that stably match `Tag.TagConfig.address`
shape across re-discovery. Works for Galaxy (attribute full refs are stable); harder for
AB CIP where the @tags walker may return tags operators haven't declared.
- ❌ Operator-visible symptom of "my tag didn't publish" splits between two places: the
Tag row exists (Config DB) + the driver can't find it (runtime discovery). Option A
surfaces the same gap as a single `BadNotFound` placeholder; B multiplies it.
### Option C — Parallel namespaces
Driver tags are always published under a driver-native folder hierarchy (discovery-driven,
same as today). A secondary UNS "view" namespace is overlaid, containing Equipment nodes
with Identification sub-folders + `Organizes` references pointing at the driver-native tag
nodes.
**Walker flow:**
1. Driver's native discovery publishes `ns=2;s={DriverInstanceId}/{...driver shape}` as
today.
2. Walker reads UNS + Equipment + Tag rows.
3. For each Equipment, creates a node under the UNS namespace (`ns=3;s=UNS/Site/Area/Line/Equipment`)
+ adds Identification properties + creates `Organizes` references from the Equipment
node to the matching driver-native variable nodes.
**Trade-offs:**
- ✅ Preserves the discovery-first driver shape — no change to what Modbus / S7 / AB CIP
publish natively; those projects keep working identically.
- ✅ UNS tree becomes an overlay that operators can opt into or out of. External consumers
that want UNS addressing browse via the UNS namespace; consumers that want driver-native
addressing keep using the driver namespace.
- ❌ Doubles the OPC UA node count for every Equipment-kind tag (one node in driver ns,
one reference in UNS ns). OPC UA clients handle it but it inflates browse-result sizes.
- ❌ Contradicts decision #120: "Equipment namespace browse paths must conform to the
canonical 5-level Unified Namespace structure." Option C makes the driver namespace
browse path NOT conform; the UNS namespace is a second view. An external client that
reads the Equipment namespace in driver-native shape doesn't see UNS at all.
- ❌ Identification ACL semantics get complicated — the sub-folder lives in the UNS ns,
but the tag data lives in the driver ns. Two different scope ids; two grants to author.
### Option D — Config-primary with driver-discovery-assist
Same as Option A, but `ITagDiscovery.DiscoverAsync` is called during *draft authoring*
(not at server runtime) to populate an Admin UI "discovered tags available" panel that
operators can one-click-add to the draft Tag table. At publish time the Tag rows drive
the server as in Option A — discovery runs only as an offline helper.
**Trade-offs:**
- ✅ Keeps Option A's runtime semantics — Config DB is the sole publish-time truth.
- ✅ Closes Option A's only real workflow weakness (authoring Tag rows for large
controllers) by letting operators import discovered tags with a click.
- ✅ Draws a clean line between author-time discovery (optional, offline) and publish-time
resolution (strict, Config-DB-driven).
- ❌ Adds work that isn't on the Phase 6.4 checklist — Admin UI needs a "pull discovered
tags from this driver" flow, which means the Admin host needs to proxy a DiscoverAsync
call through the Server process (or directly into the driver — more complex deployment
topology). v2.1 work, not v2.
## Recommendation
**Pick Option A.** Ship the walker as Config-primary immediately; defer Option D's
Admin-UI discovery-assist to v2.1 once the walker is proven.
Reasons:
1. **Decision #110 already points here.** `Tag.EquipmentId` + `Tag.TagConfig` are the
published contract. Option A is the straight-line implementation of that contract.
2. **Identical composition across seven drivers.** Every Equipment-kind driver uses the
same walker code path. New drivers (e.g. a future OPC UA Client gateway mode) plug in
without touching the walker.
3. **Phase 6.4 Admin UI already authors Tag rows.** CSV import, UnsTab drag/drop, draft
diff — all operate on Tag rows. The walker being Tag-row-driven means the Admin UI
and the server see the same surface.
4. **BadNotFound is a clean failure mode.** An operator publishes a Tag row whose
address the driver can't reach → client sees a `BadNotFound` variable with a
diagnostic, operator fixes the Tag row + republishes. This is legible + easy to
debug. Options B and C smear the failure across multiple namespaces.
5. **Option D is additive, not alternative.** Nothing in A blocks adding D later; the
walker contract stays the same, Admin UI just gets a discovery-assist panel.
The walker implementation lands under two tasks this ADR spawns (if accepted):
- **Task A** — Build `EquipmentNodeWalker` in `Core.OpcUa` that drives the
`ClusterNode → Namespace → UnsArea → UnsLine → Equipment → Tag` traversal, calls
`IdentificationFolderBuilder.Build` per Equipment, materializes the 5 identifier
properties, and creates variable nodes for each bound Tag row. Writes integration
tests covering the happy path + BadNotFound placeholder.
- **Task B** — Extend `NodeScopeResolver` to join against Config DB + populate the
full `NodeScope` path (UnsAreaId / UnsLineId / EquipmentId / TagId). Unblocks the
Phase 6.2 finer-grained ACL (per-Equipment, per-UnsLine grants). Add ACL integration
test per task #195 — browse `Equipment/Identification` as unauthorized user,
assert `BadUserAccessDenied`.
Task #195 closes on Task B's landing.
## Consequences if we don't decide
- Task #195 stays blocked. The `IdentificationFolderBuilder` exists but is dead code
reachable only from its unit tests.
- `NodeScopeResolver` stays at cluster-level scope. Per-Equipment / per-UnsLine ACL
grants work at the Admin UI authoring layer + the data-plane evaluator, but the
runtime scope resolution never populates anything below `ClusterId + TagId` — so
finer grants are effectively cluster-wide at dispatch. Phase 6.2's rollout plan calls
this out as a rollout limitation; it's not a correctness bug but it's a feature gap.
- Equipment metadata (the 9 OPC 40010 fields, the 5 identifiers) ships in the Config DB
+ the Admin UI editor but never surfaces on the OPC UA endpoint. External consumers
(ERP, SAP PM) can't resolve equipment via OPC UA properties as decision #121 promises.
## Next step
Accept this ADR + spawn Task A + Task B.
If the recommendation is rejected, the alternative options (B / C / D) are ranked by
implementation cost in the Trade-offs sections above. My strong preference is A + defer D.
---
# ADR-002 — Driver-vs-virtual dispatch: how `DriverNodeManager` routes reads, writes, and subscriptions across driver tags and virtual (scripted) tags
**Status:** Accepted 2026-04-20 — Option B (single NodeManager + NodeSource tag on the resolver output); Options A and C explicitly rejected.
**Related phase:** [Phase 7 — Scripting Runtime + Scripted Alarms](phase-7-scripting-and-alarming.md) Stream G.
**Related tasks:** #237 Phase 7 Stream G — Address-space integration.
**Related ADRs:** [ADR-001 — Equipment node walker](adr-001-equipment-node-walker.md) (this ADR extends the walker + resolver it established).
## Context
Phase 7 introduces **virtual tags** — OPC UA variables whose values are computed by user-authored C# scripts against other tags (driver or virtual). Per design decision #2 in the Phase 7 plan, virtual tags **live in the Equipment tree alongside driver tags** (not a separate `/Virtual/...` namespace). An operator browsing `Enterprise/Site/Area/Line/Equipment/` sees a flat list of children that includes both driver-sourced variables (e.g. `SpeedSetpoint` coming from a Modbus tag) and virtual variables (e.g. `LineRate` computed from `SpeedSetpoint × 0.95`).
From the operator's perspective there is no difference. From the server's perspective there is a big one: a read / write / subscribe on a driver node must dispatch to a driver's `IReadable` / `IWritable` / `ISubscribable` implementation; the same operation on a virtual node must dispatch to the `VirtualTagEngine`. The existing `DriverNodeManager` (shipped in Phase 1, extended by ADR-001) only knows about the driver case today.
The question is how the dispatch should branch. Three options considered.
## Options
### Option A — A separate `VirtualTagNodeManager` sibling to `DriverNodeManager`
Register a second `INodeManager` with the OPC UA stack dedicated to virtual-tag nodes. Each tag landed under an Equipment folder would be owned by whichever NodeManager materialized it; mixed folders would have children belonging to two different managers.
**Pros:**
- Clean separation — virtual-tag code never touches driver code paths.
- Independent lifecycle: restart the virtual-tag engine without touching drivers.
**Cons:**
- ADR-001's `EquipmentNodeWalker` was designed as a single walker producing a single tree under one NodeManager. Forking into two walkers (one per source) risks the UNS / Equipment folders existing twice (once per manager) with different child sets, and the OPC UA stack treating them as distinct nodes.
- Mixed equipment folders: when a Line has 3 driver tags + 2 virtual tags, a client browsing the Line folder expects to see 5 children. Two NodeManagers each claiming ownership of the same folder introduces a browse-merge problem the stack doesn't handle cleanly.
- ACL binding (Phase 6.2 trie): one scope per Equipment folder, resolved by `NodeScopeResolver`. Two NodeManagers means two resolution paths or shared resolution logic — cross-manager coupling that defeats the separation.
- Audit pathways (Phase 6.2 `IAuditLogger`) and resilience wrappers (Phase 6.1 `CapabilityInvoker`) are wired into the existing `DriverNodeManager`. Duplicating them into a second manager doubles the surface that the Roslyn analyzer from Phase 6.1 Stream A follow-up must keep honest.
**Rejected** because the sharing of folders (Equipment nodes owning both kinds of children) is the common case, not the exception. Two NodeManagers would fight for ownership on every Equipment node.
### Option B — Single `DriverNodeManager`, `NodeScopeResolver` returns a `NodeSource` tag, dispatch branches on source
`NodeScopeResolver` (established in ADR-001) already joins nodes against the config DB to produce a `ScopeId` for ACL enforcement. Extend it to **also return a `NodeSource` enum** (`Driver` or `Virtual`). `DriverNodeManager` dispatch methods check the source and route:
```csharp
internal sealed class DriverNodeManager : CustomNodeManager2
{
private readonly IReadOnlyDictionary<string, IDriver> _drivers;
private readonly IVirtualTagEngine _virtualTagEngine;
private readonly NodeScopeResolver _resolver;
protected override async Task<DataValue> ReadValueAsync(NodeId nodeId, ...)
{
var scope = _resolver.Resolve(nodeId);
// ... ACL check via Phase 6.2 trie (unchanged)
return scope.Source switch
{
NodeSource.Driver => await _drivers[scope.DriverInstanceId].ReadAsync(...),
NodeSource.Virtual => await _virtualTagEngine.ReadAsync(scope.VirtualTagId, ...),
_ => throw new InvalidOperationException($"Unknown NodeSource: {scope.Source}"),
};
}
}
```
**Pros:**
- Single address-space tree. `EquipmentNodeWalker` emits one folder per Equipment node and hangs both driver and virtual children under it. Browse / subscribe fan-out / ACL resolution all happen in one NodeManager with one mental model.
- ACL binding works identically for both kinds. A user with `ReadEquipment` on `Line1/Pump_7` can read every child, driver-sourced or virtual.
- Phase 6.1 resilience wrapping + Phase 6.2 audit logging apply uniformly. The `CapabilityInvoker` analyzer stays correct without new exemptions.
- Adding future source kinds (e.g. a "derived tag" that's neither a driver read nor a script evaluation) is a single-enum-case addition — no new NodeManager.
**Cons:**
- `NodeScopeResolver` becomes slightly chunkier — it now carries dispatch metadata in addition to ACL scope. We own that complexity; the payoff is one tree, one lifecycle.
- A bug in the dispatch branch could leak a driver call into the virtual path or vice versa. Mitigated by an xUnit theory in Stream G.4 that mixes both kinds in one Equipment folder and asserts each routes correctly.
**Accepted.**
### Option C — Virtual tag engine registers as a synthetic `IDriver`
Implement a `VirtualTagDriverAdapter` that wraps `VirtualTagEngine` and registers it alongside real drivers through the existing `DriverTypeRegistry`. Then `DriverNodeManager` dispatches everything through driver plumbing — virtual tags are just "a driver with no wire."
**Pros:**
- Reuses every existing `IDriver` pathway without modification.
- Dispatch branch is trivial because there's no branch — everything routes through driver plumbing.
**Cons:**
- `DriverInstance` is the wrong shape for virtual-tag config: no `DriverType`, no `HostAddress`, no connectivity probe, no lifecycle-initialization parameters, no NSSM wrapper. Forcing it to fit means adding null columns / sentinel values everywhere.
- `IDriver.InitializeAsync` / `IRediscoverable` semantics don't match a scripting engine — the engine doesn't "discover" tags against a wire, it compiles scripts against a config snapshot.
- The resilience Polly wrappers are calibrated for network-bound calls (timeout / retry / circuit breaker). Applying them to a script evaluation is either a pointless passthrough or wrong tuning.
- The Admin UI would need special-casing in every driver-config screen to hide fields that don't apply. The shape mismatch leaks everywhere.
**Rejected** because the fit is worse than Option B's lightweight dispatch branch. The pretense of uniformity would cost more than the branch it avoids.
## Decision
**Option B is accepted.**
`NodeScopeResolver.Resolve(nodeId)` returns a `NodeScope` record with:
```csharp
public sealed record NodeScope(
string ScopeId, // ACL scope ID — unchanged from ADR-001
NodeSource Source, // NEW: Driver or Virtual
string? DriverInstanceId, // populated when Source=Driver
string? VirtualTagId); // populated when Source=Virtual
public enum NodeSource
{
Driver,
Virtual,
}
```
`DriverNodeManager` holds a single reference to `IVirtualTagEngine` alongside its driver dictionary. Read / Write / Subscribe dispatch pattern-matches on `scope.Source` and routes accordingly. Writes to a virtual node from an OPC UA client return `BadUserAccessDenied` because per Phase 7 decision #6, virtual tags are writable **only** from scripts via `ctx.SetVirtualTag`. That check lives in `DriverNodeManager` before the dispatch branch — a dedicated ACL rule rather than a capability of the engine.
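The ordering of that guard can be sketched as a pure routing function — a minimal illustration with hypothetical names (`RouteClientWrite`, string results standing in for OPC UA ServiceResults), not the shipped `DriverNodeManager` code:

```csharp
// Minimal sketch of the pre-dispatch write guard. Names and string results are
// illustrative; the real check returns OPC UA ServiceResults inside DriverNodeManager.
string RouteClientWrite(string nodeSource) =>
    nodeSource == "Virtual"
        ? "BadUserAccessDenied"   // decision #6: virtual tags writable only via ctx.SetVirtualTag
        : "DispatchToDriver";     // driver writes fall through into the dispatch branch

Console.WriteLine(RouteClientWrite("Virtual")); // BadUserAccessDenied
Console.WriteLine(RouteClientWrite("Driver"));  // DispatchToDriver
```

Because the denial happens before the branch, a test can assert that a client write to a virtual node never reaches the engine at all.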
Dispatch tests (Phase 7 Stream G.4) must cover at minimum:
- Mixed Equipment folder (driver + virtual children) browses with all children visible
- Read routes to the correct backend for each source kind
- Subscribe delivers changes from both kinds on the same subscription
- OPC UA client write to a virtual node returns `BadUserAccessDenied` without invoking the engine
- Script-driven write to a virtual node (via `ctx.SetVirtualTag`) updates the value + fires subscription notifications
## Consequences
- `EquipmentNodeWalker` (ADR-001) gains an extra input channel: the config DB's `VirtualTag` table alongside the existing `Tag` table. Walker emits both kinds of children under each Equipment folder with the `NodeSource` tag set per row.
- `NodeScopeResolver` gains a `NodeSource` return value. The change is additive (ADR-001's `ScopeId` field is unchanged), so Phase 6.2's ACL trie keeps working without modification.
- `DriverNodeManager` gains a dispatch branch but the shape of every `I*` call into drivers is unchanged. Phase 6.1's resilience wrapping applies identically to the driver branch; the virtual branch wraps separately (virtual tag evaluation errors map to `BadInternalError` per Phase 7 decision #11, not through the Polly pipeline).
- Adding a future source kind (e.g. an alias tag, a cross-cluster federation tag) is one enum case + one dispatch arm + the equivalent walker extension. The architecture is extensible without rewrite.
## Not Decided (revisitable)
- **Whether `IVirtualTagEngine` should live alongside `IDriver` in `Core.Abstractions` or stay in the Phase 7 project.** Plan currently keeps it in Phase 7's `Core.VirtualTags` project because it's not a driver capability. If Phase 7 Stream G discovers significant shared surface, promote later — not blocking.
- **Whether server-side method calls from OPC UA clients (e.g. a future "force-recompute-this-virtual-tag" admin method) should route through the same dispatch.** Out of scope — virtual tags have no method nodes today; scripted alarm method calls (`OneShotShelve` etc.) route through their own `ScriptedAlarmEngine` path per Phase 7 Stream C.6.
## References
- [Phase 7 — Scripting Runtime + Scripted Alarms](phase-7-scripting-and-alarming.md) Stream G
- [ADR-001 — Equipment node walker](adr-001-equipment-node-walker.md)
- [`docs/v2/plan.md`](../plan.md) decision #110 (Tag-to-Equipment binding)
- [`docs/v2/plan.md`](../plan.md) decision #120 (UNS hierarchy requirements)
- Phase 6.2 `NodeScopeResolver` ACL join
# Phase 2 Close-Out (2026-04-20)
> Supersedes `exit-gate-phase-2-final.md` (2026-04-18) which captured the state at PR 2
> merge. Between that doc and today, PR 4 closed all open high + medium findings, PR 13
> shipped the probe manager, PR 14 shipped the alarm subsystem, and PR 61 closed the v1
> archive deletion. Phase 2 is closed.
## Status: **CLOSED**
Every stream in Phase 2 is complete. Every finding from the 2026-04-18 adversarial review
is resolved. The v1 archive is deleted. The Galaxy driver runs the full
`Shared` / `Host` / `Proxy` topology against live MXAccess on the dev box with all 9
capability interfaces wired end-to-end.
## Stream-by-stream
| Stream | Plan §reference | Status | Close commit |
|---|---|---|---|
| A — Driver.Galaxy.Shared | §A.1–A.3 | ✅ Complete | PR 1 |
| B — Driver.Galaxy.Host | §B.1–B.10 | ✅ Complete — real Win32 pump, Tier C protections, all 3 IGalaxyBackend impls (Stub / DbBacked / MxAccess), probe manager, alarm tracker, Historian wire-up | PR 1 + PR 4 + PR 12 + PR 13 + PR 14 |
| C — Driver.Galaxy.Proxy | §C.1–C.4 | ✅ Complete — all 9 capability interfaces, supervisor (Backoff + CircuitBreaker + HeartbeatMonitor), subscription push frames | PR 1 + PR 4 |
| D — Retire legacy Host | §D.1–D.3 | ✅ Complete — archive markings landed in PR 2, source tree deletion in Phase 3 PR 18, status doc closed in PR 61 | PR 2 → Phase 3 PR 18 → PR 61 |
| E — Parity validation | §E.1–E.4 | ✅ Complete — E2E suite + 4 stability-finding regression tests + `HostSubprocessParityTests` cross-FX integration | PR 2 |
## 2026-04-18 adversarial findings — resolved
All four `High` + `Medium` items flagged as OPEN at the 2026-04-18 exit gate closed in PR 4
(`caa9cb8 Phase 2 PR 4 — close the 4 open high/medium MXAccess findings from
exit-gate-phase-2-final.md`):
| ID | Finding | Resolution |
|----|---------|------------|
| High 1 | MxAccess Read subscription-leak on cancellation | One-shot read now wraps subscribe → first `OnDataChange` → unsubscribe in try/finally. Per-tag callback always detached. If the read installed the underlying subscription (prior `_addressToHandle` key was absent) it tears it down on the way out — no leaked probe item handles on caller cancel or timeout. |
| High 2 | No MXAccess reconnect loop, only supervisor-driven recycle | `MxAccessClient` gains `MxAccessClientOptions { AutoReconnect, MonitorInterval=5s, StaleThreshold=60s }` + a background `MonitorLoopAsync` started on first `ConnectAsync`. Checks `_lastObservedActivityUtc` each interval (bumped by every `OnDataChange` callback); if stale, probes the proxy with a no-op COM `AddItem("$Heartbeat")` on the StaPump; on probe failure does reconnect-with-replay — Unregister (best-effort), Register, snapshot `_addressToHandle.Keys`, clear, re-AddItem every previously-active subscription. `ConnectionStateChanged` fires on the false→true transition; `ReconnectCount` bumps. |
| Medium 3 | `SubscribeAsync` doesn't push `OnDataChange` frames yet | `IGalaxyBackend` gains `OnDataChange` / `OnAlarmEvent` / `OnHostStatusChanged` events. New `IFrameHandler.AttachConnection(FrameWriter)` called per-connection by `PipeServer` after Hello. `GalaxyFrameHandler.ConnectionSink` subscribes the events for the connection lifetime, fire-and-forgets pushes as `MessageKind.OnDataChangeNotification` / `AlarmEvent` / `RuntimeStatusChange` frames through the writer, swallows `ObjectDisposedException` for dispose race, unsubscribes on Dispose. `MxAccessGalaxyBackend.SubscribeAsync` wires `OnTagValueChanged` that fans values out per-tag to every subscription listening (one MXAccess subscription, multi-fan-out via `_refToSubs` reverse map). `UnsubscribeAsync` only calls `mx.UnsubscribeAsync` when the last sub for a tag drops. |
| Medium 4 | `WriteValuesAsync` doesn't await `OnWriteComplete` | `MxAccessClient.WriteAsync` rewritten to return `Task<bool>` via the v1-style TCS-keyed-by-item-handle pattern in `_pendingWrites`. TCS added before the `Write` call, awaited with configurable timeout (default 5s), removed in finally. Returns true only when `OnWriteComplete` reported success. `MxAccessGalaxyBackend.WriteValuesAsync` reports per-tag `Bad_InternalError` ("MXAccess runtime reported write failure") when the bool returns false. |
## Cross-cutting deferrals — resolved
| Deferral | Resolution |
|----------|------------|
| Deletion of v1 archive | Phase 3 PR 18 deleted the source trees; PR 61 closed `V1_ARCHIVE_STATUS.md` |
| Wonderware Historian SDK plugin port | `Driver.Galaxy.Host/Backend/Historian/` ports the 10 source files (`HistorianDataSource`, `HistorianClusterEndpointPicker`, `HistorianHealthSnapshot`, etc.). `MxAccessGalaxyBackend` implements `HistoryReadAsync` / `HistoryReadProcessedAsync` / `HistoryReadAtTimeAsync` / `HistoryReadEventsAsync`. `GalaxyProxyDriver.MapAggregateToColumn` translates `HistoryAggregateType` → `AnalogSummaryQuery` column names on the proxy side so Host stays OPC-UA-free. |
| MxAccess subscription push frames | Closed under Medium 3 above |
| Wonderware Historian-backed HistoryRead | Closed under the Historian port row |
| Alarm subsystem wire-up | PR 14. `GalaxyAlarmTracker` in `Backend/Alarms/` advises the four Galaxy alarm-state attributes per `IsAlarm=true` attribute (`.InAlarm`, `.Priority`, `.DescAttrName`, `.Acked`), runs the OPC UA Part 9 lifecycle simplified for the Galaxy AlarmExtension model, raises `AlarmTransition` events (Active / Acknowledged / Inactive) forwarded through the existing `OnAlarmEvent` IPC frame. `AcknowledgeAlarmAsync` writes operator comment to `<tag>.AckMsg` through the PR 4 TCS-by-handle write path. |
| Reconnect-without-recycle in MxAccessClient | Closed under High 2 (reconnect-with-replay loop is the "without-recycle" path — supervisor recycle remains the fallback). |
| Real downstream-consumer cutover | Out of scope for this repo; phased Year-3 rollout per `docs/v2/plan.md` §Rollout — not a Phase 2 deliverable. |
## 2026-04-20 test baseline
Full-solution `dotnet test ZB.MOM.WW.OtOpcUa.slnx` on `v2` tip:
| Project | Pass | Skip | Target |
|---|---:|---:|---|
| Core.Abstractions.Tests | 37 | 0 | net10 |
| Client.Shared.Tests | 136 | 0 | net10 |
| Client.CLI.Tests | 52 | 0 | net10 |
| Client.UI.Tests | 98 | 0 | net10 |
| Driver.S7.Tests | 58 | 0 | net10 |
| Driver.Modbus.Tests | 182 | 0 | net10 |
| Driver.Modbus.IntegrationTests | 2 | 21 | net10 (Docker-gated) |
| Driver.AbLegacy.Tests | 96 | 0 | net10 |
| Driver.AbCip.Tests | 211 | 0 | net10 |
| Driver.AbCip.IntegrationTests | 11 | 1 | net10 (ab_server-gated) |
| Driver.TwinCAT.Tests | 110 | 0 | net10 |
| Driver.OpcUaClient.Tests | 78 | 0 | net10 |
| Driver.FOCAS.Tests | 119 | 0 | net10 |
| Driver.Galaxy.Shared.Tests | 6 | 0 | net10 |
| Driver.Galaxy.Proxy.Tests | 18 | 7 | net10 (live-Galaxy-gated) |
| **Driver.Galaxy.Host.Tests** | **107** | **0** | **net48 x86** |
| Analyzers.Tests | 5 | 0 | net10 |
| Core.Tests | 182 | 0 | net10 |
| Configuration.Tests | 71 | 0 | net10 |
| Admin.Tests | 92 | 0 | net10 |
| Server.Tests | 173 | 0 | net10 |
| **Total** | **1844** | **29** | |
**Observed flake**: one Configuration.Tests failure on the first full-solution run turned
green on re-run. Not a stable regression; logged as a known flake until it reproduces.
**Skips are all infra-gated**:
- Modbus 21 skips — oitc/modbus-server Docker container not started.
- AbCip 1 skip — libplctag `ab_server` binary not on PATH.
- Galaxy.Proxy 7 skips — live Galaxy stack not reachable from the current shell (admin-token pipe ACL).
## What "Phase 2 closed" means for Phase 3 and later
- Galaxy runs as first-class v2 driver, same capability-interface contract as Modbus / S7 /
AbCip / AbLegacy / TwinCAT / FOCAS / OpcUaClient.
- No v1 code path remains. Anything invoking the `ZB.MOM.WW.LmxOpcUa.*` namespaces is
historical; any future work routes through `Driver.Galaxy.Proxy` + the named-pipe IPC.
- The 2026-04-13 stability findings live on as named regression tests under
`tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/StabilityFindingsRegressionTests.cs` — a
future refactor that reintroduces any of those four defects trips the test.
- Aveva Historian integration is wired end-to-end; new driver families don't need
Historian-specific plumbing in the IPC — they just implement `IHistoryProvider`.
## Outstanding — not Phase 2 blockers
- **AB CIP whole-UDT read optimization** (task #194) — niche performance win for large UDT
reads; current per-member fan-out works correctly.
- **AB CIP `IAlarmSource` via tag-projected ALMA/ALMD** (task #177) — AB CIP driver doesn't
currently expose alarms; feature-flagged follow-up.
- **IdentificationFolderBuilder wire-in** (task #195) — blocked on Equipment node walker.
- **UnsTab Playwright E2E** (task #199) — infra setup PR.
None of these are Phase 2 scope; all are tracked independently.
# Phase 2 Final Exit Gate (2026-04-18)
> **⚠️ Superseded by [`exit-gate-phase-2-closed.md`](exit-gate-phase-2-closed.md) (2026-04-20).**
> This doc captures the snapshot at PR 2 merge — when the four `High` + `Medium` findings
> in the adversarial review were still OPEN and Historian port + alarm subsystem were still
> deferred. All of those closed subsequently (PR 4 + PR 12 + PR 13 + PR 14 + PR 61). Kept
> as historical evidence; consult the close-out doc for current Phase 2 status.
> Supersedes `phase-2-partial-exit-evidence.md` and `exit-gate-phase-2.md`. Captures the
> as-built state at the close of Phase 2 work delivered across two PRs.
# FOCAS Tier-C isolation — plan for task #220
> **Status**: PRs A–E shipped. Architecture is in place; the only
> remaining FOCAS work is the hardware-dependent production
> integration of `Fwlib32.dll` into a real `IFocasBackend`
> (`FwlibHostedBackend`), which needs an actual CNC on the bench
> and is tracked as a follow-up on #220.
>
> **Pre-reqs shipped**: version matrix + pre-flight validation
> (PR #168 — the cheap half of the hardware-free stability gap).
## Why isolate
`Fwlib32.dll` is a proprietary Fanuc library with no source, no
symbols, and a documented habit of crashing the hosting process on
network errors, malformed responses, and during handle recycling.
Today the FOCAS driver runs in-process with the OPC UA server —
a crash inside the Fanuc DLL takes every driver down with it,
including ones that have nothing to do with FOCAS. Galaxy has the
same class of problem and solved it with the Tier-C pattern (host
service + proxy driver + named-pipe IPC); FOCAS should follow that
playbook.
## Topology (target)
```
+-------------------------------------+ +--------------------------+
| OtOpcUa.Server (.NET 10 x64) | | OtOpcUaFocasHost |
| | pipe | (.NET 4.8 x86 Windows |
| ZB.MOM.WW.OtOpcUa.Driver.FOCAS | <-----> | service) |
| - FocasProxyDriver (in-proc) | | |
| - supervisor / respawn / BackPr | | Fwlib32.dll + session |
| | | handles + STA thread |
+-------------------------------------+ +--------------------------+
```
Why .NET 4.8 x86 for the host: `Fwlib32.dll` ships as 32-bit only.
The Galaxy.Host is already .NET 4.8 x86 for the same reason
(MXAccess COM bitness), so the NSSM wrapper pattern transfers
directly.
## Three new projects
| Project | TFM | Role |
| --- | --- | --- |
| `ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Shared` | `netstandard2.0` | MessagePack DTOs — `FocasReadRequest`, `FocasReadResponse`, `FocasSubscribeRequest`, `FocasPmcBitWriteRequest`, etc. Same assembly referenced by .NET 10 + .NET 4.8 so the wire format stays identical. |
| `ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Host` | `net48` x86 | Windows service. Owns the Fwlib32 session handles + STA thread + handle-recycling loop. Pipe server + per-call auth (same ACL + caller SID + shared secret pattern as Galaxy.Host). |
| `ZB.MOM.WW.OtOpcUa.Driver.FOCAS` (existing) | `net10.0` | Collapses to a proxy that forwards each `IReadable` / `IWritable` / `ISubscribable` call over the pipe. `FocasCapabilityMatrix` + `FocasAddress` stay here — pre-flight runs before any IPC. |
## Supervisor responsibilities (in the Proxy)
Mirrors Galaxy.Proxy 1:1:
1. Start the Host process on first `InitializeAsync` (NSSM-wrapped
service in production, direct spawn in dev) + heartbeat every
5s.
2. If heartbeat misses 3× in a row, fan out `BadCommunicationError`
to every subscription and respawn with exponential backoff
(1s / 2s / 4s / max 30s).
3. Crash-loop circuit breaker: 5 respawns in 60s → drop to
`BadDeviceFailure` steady state until operator resets.
4. Post-mortem MMF: on Host exit, Host writes its last-N operations
+ session state to an MMF the Proxy reads to log context.
## IPC surface (approximate)
Every `FocasDriver` method that today calls into Fwlib32 directly
becomes an `ExecuteAsync` call with a typed request:
| Today (in-process) | Tier-C (IPC) |
| --- | --- |
| `FocasTagReader.Read(tag)` | `client.Execute(new FocasReadRequest(session, address))` |
| `FocasTagWriter.Write(tag, value)` | `client.Execute(new FocasWriteRequest(...))` |
| `FocasPmcBitRmw.Write(tag, bit, value)` | `client.Execute(new FocasPmcBitWriteRequest(...))` — RMW happens in Host so the critical section stays on one process |
| `FocasConnectivityProbe.ProbeAsync` | `client.Execute(new FocasProbeRequest())` |
| `FocasSubscriber.Subscribe(tags)` | `client.Execute(new FocasSubscribeRequest(tags))` — Host owns the poll loop + streams changes back as `FocasDataChangedNotification` over the pipe |
Subscription streaming is the non-obvious piece: the Host polls on
its own timer + pushes change notifications so the Proxy doesn't
round-trip per poll. Matches `Driver.Galaxy.Host` subscription
forwarding.
## PR sequence — shipped
1. **PR A (#169) — shared contracts**
`Driver.FOCAS.Shared` netstandard2.0 with MessagePack DTOs for every
IPC surface (Hello/Heartbeat/OpenSession/Read/Write/PmcBitWrite/
Subscribe/Probe/RuntimeStatus/Recycle/ErrorResponse) + FrameReader/
FrameWriter + 24 round-trip tests.
2. **PR B (#170) — Host project skeleton**
`Driver.FOCAS.Host` net48 x86 Windows Service entry point,
`PipeAcl` + `PipeServer` + `IFrameHandler` + `StubFrameHandler`.
ACL denies LocalSystem/Administrators; Hello verifies
shared-secret + protocol major. 3 handshake tests.
3. **PR C (#171) — IPC path end-to-end**
Proxy `Ipc/FocasIpcClient` + `Ipc/IpcFocasClient` (implements
IFocasClient via IPC). Host `Backend/IFocasBackend` +
`FakeFocasBackend` + `UnconfiguredFocasBackend` +
`Ipc/FwlibFrameHandler` replacing the stub. 13 new round-trip
tests via in-memory loopback.
4. **PR D (#172) — Supervisor + respawn**
`Supervisor/Backoff` (5s→15s→60s) + `CircuitBreaker` (3-in-5min →
1h→4h→manual) + `HeartbeatMonitor` + `IHostProcessLauncher` +
`FocasHostSupervisor`. 14 tests.
5. **PR E — Ops glue** ✅ (this PR)
`ProcessHostLauncher` (real Process.Start + FocasIpcClient
connect), `Host/Stability/PostMortemMmf` (magic 'OFPC') +
Proxy `Supervisor/PostMortemReader`, `scripts/install/
Install-FocasHost.ps1` + `Uninstall-FocasHost.ps1` NSSM wrappers.
7 tests (4 MMF round-trip + 3 reader format compatibility).
**Post-shipment totals: 189 FOCAS driver tests + 24 Shared tests + 13 Host tests = 226 FOCAS-family tests green.**
What remains is hardware-dependent: wiring `Fwlib32.dll` P/Invoke
into a real `FwlibHostedBackend` implementation of `IFocasBackend`
+ validating against a live CNC. The architecture is all the
plumbing that work needs.
## Testing without hardware
Same constraint as today: no CNC, no simulator. The isolation work
itself is verifiable without Fwlib32 actually being called:
- **Pipe contract**: PR A's MessagePack round-trip tests cover every
DTO.
- **Supervisor**: PR D uses a `FakeFocasHost` stub that can be told
to crash, hang, or miss heartbeats. The supervisor's respawn +
circuit-breaker behaviour is fully testable against the stub.
- **IPC ACL + auth**: reuse the Galaxy.Host's existing test harness
pattern — negative tests attempt to connect as the wrong user and
assert rejection.
- **Fwlib32 integration itself**: still untestable without hardware.
When a real CNC becomes available, the smoke tests already
scaffolded in `tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.IntegrationTests/`
run against it via `FOCAS_ENDPOINT`.
## Decisions to confirm before starting
- **Sharing transport code with Galaxy.Host** — should the pipe
server + ACL + shared-secret + MMF plumbing go into a common
`Core.Hosting.Tier-C` project both hosts reference? Probably yes;
deferred until PR B is drafted because the right abstraction only
becomes visible after two uses.
- **Handle-recycling cadence** — Fwlib32 session handles leak
memory over weeks per the Fanuc-published defect list. Galaxy
recycles MXAccess handles on a 24h timer; FOCAS should mirror but
the trigger point (idle vs scheduled) needs operator input.
- **Per-CNC Host process vs one Host serving N CNCs** — one-per-CNC
isolates blast radius but scales poorly past ~20 machines; shared
Host scales but one bad CNC can wedge the lot. Start with shared
Host + document the blast-radius trade; revisit if operators hit
it.
## Non-goals
- Simulator work. `open_focas` + other OSS FOCAS simulators are
untested + not maintained; not worth chasing vs. waiting for real
hardware.
- Changing the public `FocasDriverOptions` shape beyond what
already shipped (the `Series` knob). Operator config continues to
look the same after the split — the Tier-C topology is invisible
from `appsettings.json`.
- Historian / long-term history integration. FOCAS driver doesn't
implement `IHistoryProvider` + there's no plan to add it.
## References
- [`docs/v2/implementation/phase-2-galaxy-out-of-process.md`](phase-2-galaxy-out-of-process.md)
— the working Tier-C template this plan follows.
- [`docs/drivers/FOCAS-Test-Fixture.md`](../../drivers/FOCAS-Test-Fixture.md)
— what's covered today + what stays blocked on hardware.
- [`docs/v2/focas-version-matrix.md`](../focas-version-matrix.md) —
the capability matrix that pre-flights configs before IPC runs.
# Phase 7 — Scripting Runtime, Virtual Tags, and Scripted Alarms
> **Status**: DRAFT — planning output from the 2026-04-20 interactive planning session. Pending review before work begins. Task #230 tracks the draft; #231–#238 are the stream placeholders.
>
> **Branch**: `v2/phase-7-scripting-and-alarming`
> **Estimated duration**: 10–12 weeks (scope-comparable to Phase 6; largest single phase outside Phase 2 Galaxy split)
> **Predecessor**: Phase 6.4 (Admin UI completion) — reuses the tab-plugin pattern + draft/publish flow
> **Successor**: v2 release-readiness capstone
## Phase Objective
Add two **additive** runtime capabilities on top of the existing driver + Equipment address-space foundation:
1. **Virtual (calculated) tags** — OPC UA variables whose values are computed by user-authored C# scripts against other tags (driver or virtual), evaluated on change and/or timer. They live in the existing Equipment/UNS tree alongside driver tags and behave identically to clients (browse, subscribe, historize).
2. **Scripted alarms** — OPC UA Part 9 alarms whose condition is a user-authored C# predicate. Full state machine (EnabledState / ActiveState / AckedState / ConfirmedState / ShelvingState) with persistent operator-supplied state across restarts. Complement the existing Galaxy-native and AB CIP ALMD alarm sources — they do not replace them.
Tie-in capability — **historian alarm sink**:
3. **Aveva Historian as alarm system of record** — every qualifying alarm transition (activation, ack, confirm, clear, shelve, disable, comment) from **any `IAlarmSource`** (scripted + Galaxy + ALMD) routes through a new local SQLite store-and-forward queue to Galaxy.Host, which uses its already-loaded `aahClientManaged` DLLs to write to the Historian's alarm schema. Per-alarm `HistorizeToAveva` toggle gates which sources flow (default off for Galaxy-native since Galaxy itself already historizes them). Plant operators query one uniform historical alarm timeline.
**Why it's additive, not a rewrite**: every `IAlarmSource` implementation shipped in Phase 6.x stays unchanged; scripted alarms register as an additional source in the existing fan-out. The Equipment node walker built in ADR-001 gains a "virtual" source kind alongside "driver" without removing anything. Operator-facing semantics for existing driver tags and alarms are unchanged.
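The store-and-forward contract behind the historian sink can be sketched with an in-memory queue standing in for the SQLite file — a hedged illustration of the enqueue-then-drain behavior, not the shipped queue (which also carries dead-lettering and retention, per the decisions below):

```csharp
// Hypothetical sketch of the store-and-forward contract: every alarm transition
// is enqueued first (durability), then the backlog drains in order while the
// historian accepts writes; the first failure stops draining until the next attempt.
var queue = new Queue<string>();

void RecordTransition(string evt, Func<string, bool> tryWriteToHistorian)
{
    queue.Enqueue(evt);                          // durable-first: enqueue before attempting
    while (queue.Count > 0 && tryWriteToHistorian(queue.Peek()))
        queue.Dequeue();                         // drain oldest-first; stop on first failure
}

var historianUp = false;
RecordTransition("ALM1 Active", _ => historianUp); // historian down -> event stays queued
Console.WriteLine(queue.Count); // 1
historianUp = true;
RecordTransition("ALM1 Acked", _ => historianUp);  // recovery drains the backlog in order
Console.WriteLine(queue.Count); // 0
```

Enqueue-before-attempt is what makes the config DB the source of truth and the Historian a best-effort projection: a Historian outage never blocks the transition, it only grows the queue.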
## Design Decisions (locked in the 2026-04-20 planning session)
| # | Decision | Rationale |
|---|---------|-----------|
| 1 | Script language = **C# via Roslyn scripting** | Developer audience, strong typing, AST walkable for dependency inference, existing .NET 10 runtime in main server. |
| 2 | Virtual tags live in the **Equipment tree** alongside driver tags (not a separate `/Virtual/...` namespace) | Operator mental model stays unified; calculated `LineRate` shows up under the Line1 folder next to the driver-sourced `SpeedSetpoint` it's derived from. |
| 3 | Evaluation trigger = **change-driven + timer-driven**; operator chooses per-tag | Change-driven is cheap at steady state; timer is the escape hatch for polling derivations that don't have a discrete "input changed" signal. |
| 4 | Script shape = **Shape A — one script per virtual tag/alarm**; `return` produces the value (or `bool` for alarm condition) | Minimal surface; no predicate/action split. Alarm side-effects (severity, message) configured out-of-band, not in the script. |
| 5 | Alarm fidelity = **full OPC UA Part 9** | Uniform with Galaxy + ALMD on the wire; client-side tooling (HMIs, historians, event pipelines) gets one shape. |
| 6 | Sandbox = **read-only context**; scripts can only read any tag + write to virtual tags | Strict Roslyn `ScriptOptions` allow-list. No HttpClient / File / Process / reflection. |
| 7 | Dependency declaration = **AST inference**; operator doesn't maintain a separate dependency list | `CSharpSyntaxWalker` extracts `ctx.GetTag("path")` string-literal calls at compile time; dynamic paths rejected at publish. |
| 8 | Config storage = **config DB with generation-sealed cache** (same as driver instances) | Virtual tags + alarms publish atomically in the same generation as the driver instance config they may depend on. |
| 9 | Script return value shape (`ctx.GetTag`) = **`DataValue { Value, StatusCode, Timestamp }`** | Scripts branch on quality naturally without separate `ctx.GetQuality(...)` calls. |
| 10 | Historize virtual tags = **per-tag checkbox** | Writes flow through the same history-write path as driver tags. Consumed by existing `IHistoryProvider`. |
| 11 | Per-tag error isolation — a throwing script sets that tag's quality to `BadInternalError`; engine keeps running for every other tag | Mirrors Phase 6.1 Stream B's per-surface error handling. |
| 12 | Dedicated Serilog sink = `scripts-*.log` rolling file; structured-property `ScriptName` for filtering | Keeps noisy script logs out of the main `opcua-*.log`. `ctx.Logger.Info/Warning/Error/Debug` bound in the script context. |
| 13 | Alarm message = **template with substitution** (`"Reactor temp {Reactor/Temp} exceeded {Limit}"`) | Middle ground between static and separate message-script; engine resolves `{path}` tokens at event emission. |
| 14 | Alarm state persistence — `ActiveState` recomputed from tag values on startup; `EnabledState / AckedState / ConfirmedState / ShelvingState` + audit trail persist to config DB | Operators don't re-ack after restart; ack history survives for compliance (GxP / 21 CFR Part 11). |
| 15 | Historian sink scope = **all `IAlarmSource` implementations**, not just scripted; per-alarm `HistorizeToAveva` toggle | Plant gets one consolidated alarm timeline; Galaxy-native alarms default off to avoid duplication. |
| 16 | Historian failure mode = **SQLite store-and-forward queue on the node**; config DB is source of truth, Historian is best-effort projection | Operators never blocked by Historian downtime; failed writes queue + retry when Historian recovers. |
| 17 | Historian ingestion path = **IPC to Galaxy.Host**, which calls the already-loaded `aahClientManaged` DLLs | Reuses existing bitness / licensing / Tier-C isolation. No new 32-bit DLL load in the main server. |
| 18 | Admin UI code editor = **Monaco** via the Admin project's asset pipeline | Industry default for C# editing in a browser; ~3 MB bundle acceptable given Admin is operator-facing only, not public. Revisitable if bundle size becomes a deployment constraint. |
| 19 | Cascade evaluation order = **serial** for v1; parallel promoted to a Phase 7 follow-up | Deterministic, easier to reason about, simplifies cycle + ordering bugs in the rollout. Parallel becomes a tuning knob when real 1000+ virtual-tag deployments measure contention. |
| 20 | Shelving UX = **OPC UA method calls only** (`OneShotShelve` / `TimedShelve` / `Unshelve` on the `AlarmConditionType` node); **no Admin UI shelve controls** | Plant HMIs + OPC UA clients already speak these methods by spec; reinventing the UI adds surface without operator value. Admin still renders current shelve state + audit trail read-only on the alarm detail page. |
| 21 | Dead-lettered historian events retained for **30 days** in the SQLite queue; Admin `/alarms/historian` exposes a "Retry dead-lettered" button | Long enough for a Historian outage or licensing glitch to be resolved + operator to investigate; short enough that the SQLite file doesn't grow unbounded. Configurable via `AlarmHistorian:DeadLetterRetentionDays` for deployments with stricter compliance windows. |
| 22 | Test harness synthetic inputs = **declared inputs only** (from the AST walker's extracted dependency set) | Enforces the dependency declaration — if a path can't be supplied to the harness, the AST walker didn't see it and the script can't reference it at runtime. Catches dependency-inference drift at test time, not publish time. |
## Scope — What Changes
| Concern | Change |
|---------|--------|
| **New project `OtOpcUa.Core.Scripting`** (.NET 10) | Roslyn-based script engine. Compiles user C# scripts with a sandboxed `ScriptOptions` allow-list (numeric / string / datetime / `ScriptContext` API only — no reflection / File / Process / HttpClient). `DependencyExtractor` uses `CSharpSyntaxWalker` to enumerate `ctx.GetTag("...")` literal-string calls; rejects non-literal paths at publish time. Per-script compile cache keyed by source hash. Per-evaluation timeout. Exception in script → tag goes `BadInternalError`; engine unaffected for other tags. `ctx.Logger` is a Serilog `ILogger` bound to the `scripts-*.log` rolling sink with structured property `ScriptName`. |
| **New project `OtOpcUa.Core.VirtualTags`** (.NET 10) | `VirtualTagEngine` consumes the `DependencyExtractor` output, builds a topological dependency graph spanning driver tags + other virtual tags (cycle detection at publish time), schedules re-evaluation on change + on timer, propagates results through an `IVirtualTagSource` that implements `IReadable` + `ISubscribable` so `DriverNodeManager` routes reads / subscriptions uniformly. Per-tag `Historize` flag routes to the same history-write path driver tags use. |
| **New project `OtOpcUa.Core.ScriptedAlarms`** (.NET 10) | `ScriptedAlarmEngine` materializes each configured alarm as an OPC UA `AlarmConditionType` (or `LimitAlarmType` / `OffNormalAlarmType`). On startup, re-evaluates every predicate against current tag values to rebuild `ActiveState` — no persistence needed for the active flag. Persistent state: `EnabledState`, `AckedState`, `ConfirmedState`, `ShelvingState`, branch stack, ack audit (user/time/comment). Template message substitution resolves `{TagPath}` tokens at event emission. Ack / Confirm / Shelve method nodes bound to the engine; transitions audit-logged via the existing `IAuditLogger` (Phase 6.2). Registers as an additional `IAlarmSource` — no change to the existing fan-out. |
| **New project `OtOpcUa.Core.AlarmHistorian`** (.NET 10) | `IAlarmHistorianSink` abstraction + `SqliteStoreAndForwardSink` default implementation. Every qualifying `IAlarmSource` emission (per-alarm `HistorizeToAveva` toggle) persists to a local SQLite queue (`%ProgramData%\OtOpcUa\alarm-historian-queue.db`). Background drain worker reads unsent rows + forwards over IPC to Galaxy.Host. Failed writes keep the row pending with exponential backoff. Queue capacity bounded (default 1M events, oldest-dropped with a structured warning log). |
| **`Driver.Galaxy.Shared`** — new IPC contracts | `HistorianAlarmEventRequest` (activation / ack / confirm / clear / shelve / disable / comment payloads matching the Aveva Historian alarm schema) + `HistorianAlarmEventResponse` (ack / retry-please / permanent-fail). `HistorianConnectivityStatusNotification` so the main server can surface "Historian disconnected" on the Admin `/hosts` page. |
| **`Driver.Galaxy.Host`** — new frame handler for alarm writes | Reuses the already-loaded `aahClientManaged.dll` + `aahClientCommon.dll`. Maps the IPC request DTOs to the historian SDK's alarm-event API (exact method TBD during Stream D.2 — needs a live-historian smoke to confirm the right SDK entry point). Errors map to structured response codes so the main server's backoff logic can distinguish "transient" from "permanent". |
| **Config DB schema** — new tables | `VirtualTag (Id, EquipmentPath, Name, DataType, IntervalMs?, ChangeTriggerEnabled, Historize, ScriptId)`; `Script (Id, SourceCode, CompiledHash, Language='CSharp')`; `ScriptedAlarm (Id, EquipmentPath, Name, AlarmType, Severity, MessageTemplate, HistorizeToAveva, PredicateScriptId)`; `ScriptedAlarmState (AlarmId, EnabledState, AckedState, ConfirmedState, ShelvingState, ShelvingExpiresUtc?, LastAckUser, LastAckComment, LastAckUtc, BranchStack_JSON)`. Every write goes through `sp_PublishGeneration` + `IAuditLogger`. |
| **Address-space build** — Phase 6 `EquipmentNodeWalker` extension | Emits virtual-tag nodes alongside driver-sourced nodes under the same Equipment folder. `NodeScopeResolver` gains a `Virtual` source kind alongside `Driver`. `DriverNodeManager` dispatch routes reads / writes / subscriptions to the `VirtualTagEngine` when the source is virtual. |
| **Admin UI** — new tabs | `/virtual-tags` and `/scripted-alarms` tabs under the existing draft/publish flow. Monaco-based C# code editor (syntax highlighting, IntelliSense against a hand-written type stub for `ScriptContext`). Dependency preview panel shows the inferred input list from the AST walker. Test-harness lets operator supply synthetic `DataValue` inputs + see script output + logger emissions without publishing. Per-alarm controls: `AlarmType`, `Severity`, `MessageTemplate`, `HistorizeToAveva`. New `/alarms/historian` diagnostics view: queue depth, drain rate, last-successful-write, per-alarm "last routed to historian" timestamp. |
| **`DriverTypeRegistry`** — no change | Scripting is not a driver — it doesn't register as a `DriverType`. The engine hangs off the same `SealedBootstrap` as drivers but through a different composition root. |
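The `Core.Scripting` row above can be made concrete with a minimal sketch of the sandboxed compile path. This is illustrative only — `ScriptCompiler` and the `ScriptContext` stub are hypothetical names; the real allow-list lives in Stream A.1 — but the Roslyn calls (`ScriptOptions.WithReferences` / `WithImports`, `CSharpScript.Create` with a `globalsType`) are the standard `Microsoft.CodeAnalysis.CSharp.Scripting` API:

```csharp
using System.Linq;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;

// Hypothetical stand-in for the real Core.Scripting globals type.
public class ScriptContext { }

public static class ScriptCompiler
{
    // Allow-list: anything not referenced/imported here (File, Process,
    // HttpClient, reflection) fails at compile time with a missing-reference
    // diagnostic — no runtime sandbox needed.
    private static readonly ScriptOptions Sandbox = ScriptOptions.Default
        .WithReferences(
            typeof(object).Assembly,        // core runtime types
            typeof(Enumerable).Assembly,    // LINQ
            typeof(ScriptContext).Assembly) // Core.Scripting itself
        .WithImports("System", "System.Linq");

    public static Script<object> Compile(string source) =>
        CSharpScript.Create<object>(source, Sandbox, globalsType: typeof(ScriptContext));
}
```

Compilation diagnostics from `Compile(...).Compile()` are what the publish flow surfaces back to the operator.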
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| Existing `IAlarmSource` implementations (Galaxy, AB CIP ALMD) | Scripted alarms register as an *additional* source; existing sources pass through unchanged. Default `HistorizeToAveva=false` for Galaxy alarms avoids duplicating records the Galaxy historian wiring already captures. |
| Driver capability surface (`IReadable` / `IWritable` / `ISubscribable` / etc.) | Virtual tags implement the same interfaces — drivers and virtual tags are interchangeable from the node manager's perspective. No new capability. |
| Config DB publication flow (`sp_PublishGeneration` + sealed cache) | Virtual tag + alarm tables plug in as additional rows. Atomic publish semantics unchanged. |
| Authorization trie (Phase 6.2) | Virtual-tag nodes inherit the Equipment scope's grants — same treatment as the Phase 6.4 Identification sub-folder. No new scope level. |
| Tier-C isolation topology | Scripting engine runs in the main .NET 10 server process. Roslyn scripts are already sandboxed via `ScriptOptions`; no need for process isolation because they have no unmanaged reach. Galaxy.Host's existing Tier-C boundary already owns the historian SDK writes. |
| Galaxy alarm ingestion path into the historian | Galaxy writes alarms directly via `aahClientManaged` today; Phase 7 Stream D gives it a *second* path (via the new sink) when a Galaxy alarm has `HistorizeToAveva=true`, but the direct path stays for the default case. |
| OPC UA wire protocol / AddressSpace schema | Clients see new nodes under existing folders + new alarm conditions. No new namespaces, no new ObjectTypes beyond what Part 9 already defines. |
## Entry Gate Checklist
- [ ] All Phase 6.x exit gates cleared (#133, #142, #151, #158)
- [ ] Equipment node walker wired into `DriverNodeManager` (task #212 — done)
- [ ] `IAuditLogger` surface live (Phase 6.2 Stream A)
- [ ] `sp_PublishGeneration` + sealed-cache flow verified on the existing driver-config tables
- [ ] Dev Aveva Historian reachable from the dev box (for Stream D.2 smoke)
- [ ] `v2` branch clean + baseline tests green
- [ ] Blazor editor component library picked (Monaco confirmed vs alternatives — decision #18 above)

- [ ] Review this plan — decisions #1–#17 signed off, no open questions
## Task Breakdown
### Stream A — `Core.Scripting` (Roslyn engine + sandbox + AST inference + logger) — **2 weeks**
1. **A.1** Project scaffold + NuGet `Microsoft.CodeAnalysis.CSharp.Scripting`. `ScriptOptions` allow-list (`typeof(object).Assembly`, `typeof(Enumerable).Assembly`, the Core.Scripting assembly itself — nothing else). Hand-written `ScriptContext` base class with `GetTag(string)` / `SetVirtualTag(string, object)` / `Logger` / `Now` / `Deadband(double, double, double)` helpers.
2. **A.2** `DependencyExtractor : CSharpSyntaxWalker`. Visits every `InvocationExpressionSyntax` targeting `ctx.GetTag` / `ctx.SetVirtualTag`; accepts only a `LiteralExpressionSyntax` argument. Non-literal arguments (concat, variable, method call) → publish-time rejection with an actionable error pointing the operator at the exact span. Outputs `IReadOnlySet<string> Inputs` + `IReadOnlySet<string> Outputs`.
3. **A.3** Compile cache. `(source_hash) → compiled Script<T>`. Recompile only when source changes. Warm on `SealedBootstrap`.
4. **A.4** Per-evaluation timeout wrapper (default 250ms; configurable per tag). Timeout = tag quality `BadInternalError` + structured warning log. Keeps a single runaway script from starving the engine.
5. **A.5** Serilog sink wiring. New `scripts-*.log` rolling-file sink; `ctx.Logger` returns an `ILogger` with `ForContext("ScriptName", ...)`. Main `opcua-*.log` gets a companion entry at WARN level if a script logs ERROR, so the operator sees it in the primary log.
6. **A.6** Tests: AST extraction unit tests (30+ cases covering literal / concat / variable / null / method-returned paths); sandbox escape tests (attempt `typeof`, `Assembly.Load`, `File.OpenRead` — all must fail at compile); exception isolation (throwing script doesn't kill the engine); timeout behavior; logger structured-property binding.
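A.2's walker shape can be sketched directly against the Roslyn syntax API. This is a simplified illustration (it matches on the method name only, without binding the receiver to `ctx` via the semantic model, and the error format is hypothetical), but the visitor pattern and the literal-vs-non-literal split are as described above:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Sketch: collect literal tag paths passed to GetTag / SetVirtualTag,
// rejecting any non-literal argument (concat, variable, method call).
public sealed class DependencyExtractor : CSharpSyntaxWalker
{
    public HashSet<string> Inputs { get; } = new();
    public HashSet<string> Outputs { get; } = new();
    public List<string> Errors { get; } = new();

    public override void VisitInvocationExpression(InvocationExpressionSyntax node)
    {
        if (node.Expression is MemberAccessExpressionSyntax member &&
            member.Name.Identifier.Text is "GetTag" or "SetVirtualTag")
        {
            var arg = node.ArgumentList.Arguments.FirstOrDefault()?.Expression;
            if (arg is LiteralExpressionSyntax lit &&
                lit.IsKind(SyntaxKind.StringLiteralExpression))
            {
                var set = member.Name.Identifier.Text == "GetTag" ? Inputs : Outputs;
                set.Add(lit.Token.ValueText);
            }
            else
            {
                // Publish-time rejection, pointing the operator at the span.
                Errors.Add($"Non-literal tag path at {node.GetLocation().GetLineSpan()}");
            }
        }
        base.VisitInvocationExpression(node);
    }
}
```

Running this over a parsed script tree (`CSharpSyntaxTree.ParseText(source).GetRoot()`) yields the `Inputs`/`Outputs` sets the virtual-tag engine consumes.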
### Stream B — Virtual tag engine (dependency graph + change/timer schedulers + historize) — **1.5 weeks**
1. **B.1** `VirtualTagEngine`. Ingests the set of compiled scripts + their inputs/outputs; builds a directed dependency graph (driver tag ID → virtual tag ID → virtual tag ID). Cycle detection at publish-time via Tarjan; publish rejects with a clear error message listing the cycle.
2. **B.2** `ChangeTriggerDispatcher`. Subscribes to every referenced driver tag via the existing `ISubscribable` fan-out. On a `DataValueSnapshot` delta (value / status / timestamp — any of the three), enqueues affected virtual tags for re-evaluation in topological order.
3. **B.3** `TimerTriggerDispatcher`. Per-tag `IntervalMs` scheduled via a shared timer-wheel. Independent of change triggers — a tag can have both, either, or neither.
4. **B.4** `EvaluationPipeline`. Serial evaluation per cascade (parallel promoted to a follow-up — avoids cross-tag ordering bugs on first rollout). Exception handling per A.4; propagates results via `IVirtualTagSource`.
5. **B.5** `IVirtualTagSource` implementation. Implements `IReadable` + `ISubscribable`. Reads return the most recent evaluated value; subscriptions receive `OnDataChange` events on each re-evaluation.
6. **B.6** History routing. Per-tag `Historize` flag emits the value + timestamp to the existing history-write path used by drivers.
7. **B.7** Tests: dependency graph (happy + cycle); change cascade through two levels of virtual tags; timer-only tag ignores input changes; change + timer both configured; error propagation; historize on/off.
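B.1/B.2 need both a publish-time cycle check and a topological evaluation order. The plan names Tarjan for cycle reporting; the sketch below uses a Kahn-style sort instead because it produces the evaluation order and the cycle check in a single pass — treat it as an illustrative alternative, not the committed algorithm:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DependencyGraph
{
    // dependsOn: virtual tag → the paths it reads. Driver-tag inputs have no
    // entry of their own and are treated as always ready.
    public static List<string> TopologicalOrder(Dictionary<string, List<string>> dependsOn)
    {
        var inDegree = dependsOn.Keys.ToDictionary(
            tag => tag,
            tag => dependsOn[tag].Count(dependsOn.ContainsKey));

        var dependents = new Dictionary<string, List<string>>();
        foreach (var (tag, deps) in dependsOn)
            foreach (var dep in deps.Where(dependsOn.ContainsKey))
            {
                if (!dependents.TryGetValue(dep, out var list))
                    dependents[dep] = list = new List<string>();
                list.Add(tag);
            }

        var ready = new Queue<string>(inDegree.Where(kv => kv.Value == 0).Select(kv => kv.Key));
        var order = new List<string>();
        while (ready.Count > 0)
        {
            var tag = ready.Dequeue();
            order.Add(tag);
            foreach (var dependent in dependents.GetValueOrDefault(tag, new List<string>()))
                if (--inDegree[dependent] == 0) ready.Enqueue(dependent);
        }

        // Anything never reaching in-degree 0 sits on (or behind) a cycle.
        if (order.Count != dependsOn.Count)
            throw new InvalidOperationException("Cycle detected among: " +
                string.Join(", ", inDegree.Where(kv => kv.Value > 0).Select(kv => kv.Key)));
        return order;
    }
}
```

The change-trigger dispatcher then re-evaluates affected tags in this order, which is what makes the A → B → C cascade in the compliance checks deterministic.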
### Stream C — Scripted alarm engine + Part 9 state machine + template messages — **2.5 weeks**
1. **C.1** Alarm config model + `ScriptedAlarmEngine` skeleton. Alarms materialize as `AlarmConditionType` (or subtype — `LimitAlarm`, `OffNormal`) nodes under their configured Equipment path. Severity loaded from config.
2. **C.2** `Part9StateMachine`. Tracks `EnabledState`, `ActiveState`, `AckedState`, `ConfirmedState`, `ShelvingState` per condition ID. Shelving has `OneShotShelving` + `TimedShelving` variants + an `UnshelveTime` timer.
3. **C.3** Predicate evaluation. On any input change (same trigger mechanism as Stream B), run the `bool` predicate. On `false → true` transition, activate (increment branch stack if prior Ack-but-not-Confirmed state exists). On `true → false`, clear (but keep condition visible if retain flag set).
4. **C.4** Startup recovery. For every configured alarm, run the predicate against current tag values to rebuild `ActiveState` *only*. Load `EnabledState` / `AckedState` / `ConfirmedState` / `ShelvingState` + audit from the `ScriptedAlarmState` table. No re-acknowledgment required for conditions that were acked before restart.
5. **C.5** Template substitution. Engine resolves `{TagPath}` tokens in `MessageTemplate` at event emission time using current tag values. Unresolvable tokens (bad path, missing tag) emit a structured error log + substitute `{?}` so the event still fires.
6. **C.6** OPC UA method binding. `Acknowledge`, `Confirm`, `AddComment`, `OneShotShelve`, `TimedShelve`, `Unshelve` methods on each condition node route to the engine + persist via audit-logged writes to `ScriptedAlarmState`.
7. **C.7** `IAlarmSource` implementation. Emits Part 9-shaped events through the existing fan-out the `AlarmTracker` composes.
8. **C.8** Tests: every transition (all 32 state combinations the state machine can produce); startup recovery (seed table with varied ack/confirm/shelve state, restart, verify correct recovery); template substitution (literal path, nested path, bad path); shelving timer expiry; OPC UA method calls via Client.CLI.
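The C.5 token substitution is small enough to sketch whole. Assumed shape: a `{...}` token is a tag path resolved through a caller-supplied lookup; a `null` result (bad path, missing tag) substitutes `{?}` so the event still fires, with the structured error log left to the engine:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlarmMessageTemplate
{
    private static readonly Regex Token = new(@"\{([^{}]+)\}", RegexOptions.Compiled);

    // Resolve {TagPath} tokens at event-emission time against current values.
    public static string Render(string template, Func<string, object?> tryGetTagValue) =>
        Token.Replace(template, match =>
        {
            var value = tryGetTagValue(match.Groups[1].Value);
            return value?.ToString() ?? "{?}"; // unresolvable token → placeholder
        });
}
```

For example, `Render("Reactor temp {Reactor/Temp} exceeded {Limit}", lookup)` yields the rendered message with both tokens replaced, or `{?}` for any path the lookup cannot supply.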
### Stream D — Historian alarm sink (SQLite store-and-forward + Galaxy.Host IPC) — **2 weeks**
1. **D.1** `Core.AlarmHistorian` project. `IAlarmHistorianSink` interface; `SqliteStoreAndForwardSink` default implementation using Microsoft.Data.Sqlite. Schema: `Queue (RowId, AlarmId, EventType, PayloadJson, EnqueuedUtc, LastAttemptUtc?, AttemptCount, DeadLettered)`. Queue capacity bounded; oldest-dropped on overflow with structured warning.
2. **D.2** **Live-historian smoke** against the dev box's Aveva Historian. Identify the exact `aahClientManaged` alarm-write API entry point (likely `IAlarmsDatabase.WriteAlarmEvent` or equivalent — verify with a throwaway Galaxy.Host test hook). Document in a short `docs/v2/historian-alarm-api.md` artifact.
3. **D.3** `Driver.Galaxy.Shared` contract additions. `HistorianAlarmEventRequest` / `HistorianAlarmEventResponse` / `HistorianConnectivityStatusNotification`. Round-trip tests in `Driver.Galaxy.Shared.Tests`.
4. **D.4** `Driver.Galaxy.Host` handler. Translates incoming `HistorianAlarmEventRequest` to the SDK call identified in D.2. Returns structured response (Ack / RetryPlease / PermanentFail). Connectivity notifications sent proactively when the SDK's session drops.
5. **D.5** Drain worker in the main server. Polls the SQLite queue; batches up to 100 events per IPC round-trip; exponential backoff on `RetryPlease` (1s → 2s → 5s → 15s → 60s cap); `PermanentFail` dead-letters the row + structured error log.
6. **D.6** Per-alarm toggle wired through: `HistorizeToAveva` column on both `ScriptedAlarm` + a new `AlarmHistorizationPolicy` projection the Galaxy / ALMD alarm sources consult (default `false` for Galaxy, `true` for scripted, operator-adjustable per-alarm).
7. **D.7** `/alarms/historian` diagnostics view in Admin. Queue depth, drain rate, last-successful-write, last-error, per-alarm last-routed timestamp.
8. **D.8** Tests: SQLite queue round-trip; drain worker with fake IPC (success / retry / perm-fail); overflow eviction; Galaxy.Host handler against a stub historian API; end-to-end with the live historian on the dev box (non-CI — operator-invoked).
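The D.5 backoff ladder (1s → 2s → 5s → 15s → 60s cap) maps cleanly from the queue row's `AttemptCount`; a minimal sketch, with the class name hypothetical:

```csharp
using System;

public static class DrainBackoff
{
    private static readonly TimeSpan[] Ladder =
    {
        TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(5),
        TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(60),
    };

    // RetryPlease responses walk the ladder; the cap holds at 60s until the
    // historian recovers. PermanentFail bypasses this and dead-letters the row.
    public static TimeSpan DelayFor(int attemptCount) =>
        Ladder[Math.Min(attemptCount, Ladder.Length - 1)];
}
```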
### Stream E — Config DB schema + generation-sealed cache extensions — **1 week**
1. **E.1** EF migration for new tables. Foreign keys from `VirtualTag.ScriptId` / `ScriptedAlarm.PredicateScriptId` to `Script.Id`.
2. **E.2** `sp_PublishGeneration` extension. Sealed-cache snapshot includes virtual tags + scripted alarms + their scripts. Atomic publish guarantees the address-space build sees a consistent view.
3. **E.3** CRUD services. `VirtualTagService`, `ScriptedAlarmService`, `ScriptService`. Each audit-logged; Ack / Confirm / Shelve persist through `ScriptedAlarmStateService` with full audit trail (who / when / comment / previous state).
4. **E.4** Tests: migration up / down; publish atomicity (concurrent writes to different alarm rows don't leak into an in-flight publish); audit trail on every mutation.
### Stream F — Admin UI scripting tab — **2 weeks**
1. **F.1** Monaco editor Razor component. CSS-isolated; loads Monaco via NPM + the Admin project's existing asset pipeline. C# syntax highlighting (Monaco ships it). IntelliSense via a hand-written `ScriptContext.cs` type stub delivered with the editor (not the compiled Core.Scripting DLL — keeps the browser bundle small).
2. **F.2** `/virtual-tags` tab. List view (Equipment path / Name / DataType / inputs-summary / Historize / actions). Edit pane splits: Monaco editor left, dependency preview panel right (live-updates from a debounced `/api/scripting/analyze` endpoint that runs the `DependencyExtractor`). Publish button gated by Phase 6.2 `WriteConfigure` permission.
3. **F.3** `/scripted-alarms` tab. Same editor shape + extra controls: AlarmType dropdown, Severity slider, MessageTemplate textbox with live-preview showing `{path}` token resolution against latest tag values, `HistorizeToAveva` checkbox. **Alarm detail page displays current `ShelvingState` + `LastAckUser / LastAckUtc / LastAckComment` read-only** — no shelve/unshelve / ack / confirm buttons per decision #20. Operators drive state transitions via OPC UA method calls from plant HMIs or the Client.CLI.
4. **F.4** Test harness. Modal that lets the operator supply synthetic `DataValue` inputs for the dependency set + see script output + logger emissions (rendered in a virtual terminal). Enables testing without publishing.
5. **F.5** Script log viewer. SignalR stream of the `scripts-*.log` sink filtered by the script under edit (using the structured `ScriptName` property). Tail-last-200 + "load more".
6. **F.6** `/alarms/historian` diagnostics view per Stream D.7.
7. **F.7** Playwright smoke. Author a calc tag, publish, verify it appears in the equipment tree via a probe OPC UA read. Author an alarm, verify it appears in `AlarmsAndConditions`.
### Stream G — Address-space integration — **1 week**
1. **G.1** `EquipmentNodeWalker` extension. Current walker iterates driver tags per equipment; extend to also iterate virtual tags + alarms. `NodeScopeResolver` returns `NodeSource.Virtual` for virtual nodes and `NodeSource.Driver` for existing.
2. **G.2** `DriverNodeManager` dispatch. Read / Write / Subscribe operations check the resolved source and route to `VirtualTagEngine` or the driver as appropriate. Writes to virtual tags allowed only from scripts (per decision #6) — OPC UA client writes to a virtual node return `BadUserAccessDenied`.
3. **G.3** `AlarmTracker` composition. The `ScriptedAlarmEngine` registers as an additional `IAlarmSource` — no new composition code, the existing fan-out already accepts multiple sources.
4. **G.4** Tests: mixed equipment folder (driver tag + virtual tag + driver-native alarm + scripted alarm) browsable via Client.CLI; read / subscribe round-trip for the virtual tag; scripted alarm transitions visible in the alarm event stream.
### Stream H — Exit gate — **1 week**
1. **H.1** Compliance script real-checks: schema migrations applied; new tables populated from a draft→publish cycle; sealed-generation snapshot includes virtual tags + alarms; SQLite alarm queue initialized; `scripts-*.log` sink emitting; `AlarmConditionType` nodes materialize in the address space; per-alarm `HistorizeToAveva` toggle enforced end-to-end.
2. **H.2** Full-solution `dotnet test` baseline. Target: Phase 6 baseline + ~300 new tests across Streams A–G.
3. **H.3** `docs/v2/plan.md` Migration Strategy §6 update — add Phase 7.
4. **H.4** Phase-status memory update.
5. **H.5** Merge `v2/phase-7-scripting-and-alarming` → `v2`.
## Compliance Checks (run at exit gate)
- [ ] **Sandbox escape**: attempts to reference `System.IO.File`, `System.Net.Http.HttpClient`, `System.Diagnostics.Process`, or `typeof(X).Assembly.Load` fail at script compile with an actionable error.
- [ ] **Dependency inference**: `ctx.GetTag(myStringVar)` (non-literal path) is rejected at publish with a span-pointed error; `ctx.GetTag("Line1/Speed")` is accepted + appears in the inferred input set.
- [ ] **Change cascade**: tag A → virtual tag B → virtual tag C. When A changes, B recomputes, then C recomputes. Single change event triggers the full cascade in topological order within one evaluation pass.
- [ ] **Cycle rejection**: publish a config where virtual tag B depends on A and A depends on B. Publish fails pre-commit with a clear cycle message.
- [ ] **Startup recovery**: seed `ScriptedAlarmState` with one acked+confirmed alarm + one shelved alarm + one clean alarm, restart, verify operator does NOT see ack prompts for the first two, shelving remains in effect, clean alarm is clear.
- [ ] **Ack audit**: acknowledge an alarm; `IAuditLogger` captures user / timestamp / comment / prior state; row persists through restart.
- [ ] **Historian queue durability**: take Galaxy.Host offline, fire 10 alarm transitions, bring Galaxy.Host back; queue drains all 10 in order.
- [ ] **Per-alarm historian toggle**: Galaxy-native alarm with `HistorizeToAveva=false` does NOT enqueue; scripted alarm with `HistorizeToAveva=true` DOES enqueue.
- [ ] **Script timeout**: infinite-loop script times out at 250ms; tag quality `BadInternalError`; other tags unaffected.
- [ ] **Log isolation**: `ctx.Logger.Error("test")` lands in `scripts-*.log` with structured property `ScriptName=<name>`; main `opcua-*.log` gets a WARN companion entry.
- [ ] **ACL binding**: virtual tag under an Equipment scope inherits the Equipment's grants. User without the Equipment grant reads the virtual tag and gets `BadUserAccessDenied`.
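The **log isolation** check above is implemented by the companion sink from Stream A.5: a Serilog `ILogEventSink` that mirrors Error+ script events into the main logger at Warning, preserving the script name, original level, rendered message, and exception. A sketch of the expected shape (threshold default and `"unknown"` fallback per the Stream A design):

```csharp
using System;
using Serilog;
using Serilog.Core;
using Serilog.Events;

public sealed class ScriptLogCompanionSink : ILogEventSink
{
    private readonly ILogger _main;
    private readonly LogEventLevel _threshold;

    public ScriptLogCompanionSink(ILogger main, LogEventLevel threshold = LogEventLevel.Error)
    {
        _main = main ?? throw new ArgumentNullException(nameof(main));
        _threshold = threshold;
    }

    public void Emit(LogEvent logEvent)
    {
        if (logEvent.Level < _threshold) return; // Info/Debug stay in scripts-*.log

        // Defensive fallback: a logger built outside the factory has no ScriptName.
        var scriptName = logEvent.Properties.TryGetValue("ScriptName", out var prop)
            ? prop.ToString().Trim('"')
            : "unknown";

        // Downgrade to Warning in the main log: needs attention, but the
        // server itself is healthy. Exception kept for the full stack trace.
        _main
            .ForContext("ScriptName", scriptName)
            .ForContext("OriginalLevel", logEvent.Level)
            .Warning(logEvent.Exception,
                "Script {ScriptName} logged {OriginalLevel}: {ScriptMessage}",
                scriptName, logEvent.Level, logEvent.RenderMessage());
    }
}
```

Wiring it into the `scripts-*.log` pipeline (`WriteTo.Sink(new ScriptLogCompanionSink(mainLogger))`) is the Stream F/G composition step; the sink itself is a library-level building block.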
## Decisions Resolved in Plan Review
Every open question from the initial draft was resolved in the 2026-04-20 plan review — see decisions #18–#22 in the decisions table above. No pending questions block Stream A.
## References
- [`docs/v2/plan.md`](../plan.md) §6 Migration Strategy — add Phase 7 as the final additive phase before v2 release readiness.
- [`docs/v2/implementation/overview.md`](overview.md) — phase gate conventions.
- [`docs/v2/implementation/phase-6-2-authorization-runtime.md`](phase-6-2-authorization-runtime.md) — `IAuditLogger` surface reused for Ack/Confirm/Shelve + script edits.
- [`docs/v2/implementation/phase-6-4-admin-ui-completion.md`](phase-6-4-admin-ui-completion.md) — draft/publish flow, diff viewer, tab-plugin pattern reused.
- [`docs/v2/implementation/phase-2-galaxy-out-of-process.md`](phase-2-galaxy-out-of-process.md) — Galaxy.Host IPC shape + shared-contract conventions reused for Stream D.
- [`docs/v2/driver-specs.md`](../driver-specs.md) §Alarm semantics — Part 9 fidelity requirements.
- [`docs/v2/driver-stability.md`](../driver-stability.md) — per-surface error handling, crash-loop breaker patterns Stream A.4 mirrors.
- [`docs/v2/config-db-schema.md`](../config-db-schema.md) — add Phase 7 §§ for `VirtualTag`, `Script`, `ScriptedAlarm`, `ScriptedAlarmState`.

View File

@@ -13,9 +13,9 @@ confirmed DL205 quirk lands in a follow-up PR as a named test in that project.
## Harness
**Chosen simulator: pymodbus 3.13.0** (`pip install 'pymodbus[simulator]==3.13.0'`).
Replaced ModbusPal in PR 43 — see `tests/.../Pymodbus/README.md` for the
trade-off rationale. Headline reasons:
**Chosen simulator: pymodbus 3.13.0** packaged as a pinned Docker image
under `tests/.../Modbus.IntegrationTests/Docker/`. See that folder's
`README.md` for image-build notes + compose profiles. Headline reasons:
- **Headless** pure-Python CLI; no Java GUI, runs cleanly on a CI runner.
- **Maintained** — current stable 3.13.0; ModbusPal 1.6b is abandoned.
@@ -26,17 +26,18 @@ trade-off rationale. Headline reasons:
- **Per-register raw uint16 seeding** — encoding the DL205 string-byte-order
/ BCD / CDAB-float quirks stays explicit (the quirk math lives in the
`_quirk` JSON-comment fields next to each register).
- Pip-installable on Windows; sidesteps the privileged-port admin
requirement by defaulting to TCP **5020** instead of 502.
- **Dockerized** — pinned image means the CI simulator surface is
reproducible + no `pip install` step on the dev box.
- Defaults to TCP **5020** (matches the compose port-map + the fixture
default endpoint; sidesteps the Windows Firewall prompt on 502).
**Setup pattern**:
1. `pip install "pymodbus[simulator]==3.13.0"`.
2. Start the simulator with one of the in-repo profiles:
`tests\.../Pymodbus\serve.ps1 -Profile standard` (or `-Profile dl205`).
3. `dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests`
1. `docker compose -f tests\...\Modbus.IntegrationTests\Docker\docker-compose.yml --profile <standard|dl205|mitsubishi|s7_1500> up -d`.
2. `dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests`
tests auto-skip when the endpoint is unreachable. Default endpoint is
`localhost:5020`; override via `MODBUS_SIM_ENDPOINT` for a real PLC on its
native port 502.
3. `docker compose -f ... --profile <…> down` when finished.
## Per-device quirk catalog
@@ -113,9 +114,11 @@ vendors get promoted into driver defaults or opt-in options:
- **PR 42 — ModbusPal `.xmpp` profiles** — **SUPERSEDED by PR 43**. Replaced
with pymodbus JSON because ModbusPal 1.6b is abandoned, GUI-only, and only
exposes 2 of the 4 standard tables.
- **PR 43 — pymodbus JSON profiles** — **DONE**. `Pymodbus/standard.json` +
`Pymodbus/dl205.json` + `Pymodbus/serve.ps1` runner. Both bind TCP 5020.
- **PR 43 — pymodbus JSON profiles** — **DONE**. Dockerized under
`Docker/profiles/` (standard.json, dl205.json, mitsubishi.json,
s7_1500.json); compose file launches each via a named profile.
All bind TCP 5020.
- **PR 44+**: one PR per confirmed DL205 quirk, landing the named test + any
driver-side adjustment (string byte order, BCD decoder, V-memory address
helper, FC16 cap-per-device-family) needed to pass it. Each quirk's value
is already pre-encoded in `Pymodbus/dl205.json`.
is already pre-encoded in `Docker/profiles/dl205.json`.

View File

@@ -736,7 +736,7 @@ Each step leaves the system runnable. The generic extraction is effectively free
6. **Wire `Server`** — bootstrap from Configuration using an instance-bound credential (cert/gMSA/SQL login), fail fast if the credential is rejected, register drivers, start Core.
7. **Scaffold `Admin`** — Blazor Server app with: instance + credential management, draft/publish/rollback generation workflow (diff viewer, "publish to fleet", per-instance override), and core CRUD for drivers/devices/tags. Driver-specific config screens deferred to later phases.
**Phase 2 — Galaxy driver (prove the refactor)**
**Phase 2 — Galaxy driver (prove the refactor) — ✅ CLOSED 2026-04-20** (see [`implementation/exit-gate-phase-2-closed.md`](implementation/exit-gate-phase-2-closed.md))
8. **Build `Galaxy.Shared`** — .NET Standard 2.0 IPC message contracts
9. **Build `Galaxy.Host`** — .NET 4.8 x86 process hosting MxAccessBridge, GalaxyRepository, alarms, HDA with IPC server
10. **Build `Galaxy.Proxy`** — .NET 10 in-process proxy implementing IDriver interfaces, forwarding over IPC

View File

@@ -189,6 +189,33 @@ Modbus has no native String, DateTime, or Int64 — those rows are skipped on th
- **ab_server tag-type coverage is finite** (BOOL, DINT, REAL, arrays, basic strings). UDTs and `Program:` scoping are not fully implemented. Document an "ab_server-supported tag set" in the harness and exclude the rest from default CI; UDT coverage moves to the Studio 5000 Emulate golden-box tier.
- CIP has no native subscriptions, so polling behavior matches real hardware.
### CI fixture (task #180)
The integration harness at `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/` is Docker-only — `ab_server` is a source-only tool under libplctag's `src/tools/ab_server/`, and the fixture's multi-stage `Docker/Dockerfile` is the only supported reproducible build path.
- **`AbServerFixture(AbServerProfile)`** — thin TCP probe against `127.0.0.1:44818` (or `AB_SERVER_ENDPOINT` override). Does not spawn the simulator; the operator brings up the compose service for whichever family the test class targets (`controllogix` / `compactlogix` / `micro800` / `guardlogix`).
- **`KnownProfiles.{ControlLogix, CompactLogix, Micro800, GuardLogix}`** — thin `(Family, ComposeProfile, Notes)` records. The compose file (`Docker/docker-compose.yml`) is the canonical source of truth for which tags each family seeds + which `--plc` mode the simulator boots in. `Micro800` uses the dedicated `--plc=Micro800` mode; `GuardLogix` uses `ControlLogix` emulation because ab_server has no safety subsystem (the `_S`-suffixed seed tag triggers driver-side ViewOnly classification only).
**Pinned version**: the `Docker/Dockerfile` clones libplctag at a pinned tag (currently the `release` branch) via its `LIBPLCTAG_TAG` build-arg and compiles `ab_server` from source. Bump deliberately alongside a driver-side change that needs the newer simulator.
**CI step:**
```yaml
# GitHub Actions step placed before `dotnet test`:
- name: Start ab_server Docker container
shell: pwsh
run: |
docker compose -f tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/Docker/docker-compose.yml `
--profile controllogix up -d --build
# Wait for :44818 to accept connections (compose healthcheck-equivalent)
for ($i = 0; $i -lt 30; $i++) {
if ((Test-NetConnection -ComputerName localhost -Port 44818 -WarningAction SilentlyContinue).TcpTestSucceeded) { break }
Start-Sleep -Seconds 1
}
```
Tests skip via `AbServerFactAttribute` / `AbServerTheoryAttribute` when the probe fails, so fresh-clone runs without Docker still pass all unit suites in this project.
---
## 3. Allen-Bradley Legacy (SLC 500 / MicroLogix, PCCC)

View File

@@ -0,0 +1,108 @@
<#
.SYNOPSIS
Registers the OtOpcUaFocasHost Windows service. Optional companion to
Install-Services.ps1 — only run this on nodes where FOCAS driver instances will run
with Tier-C process isolation enabled.
.DESCRIPTION
FOCAS PR #220 / Tier-C isolation plan. Wraps OtOpcUa.Driver.FOCAS.Host.exe (net48 x86)
as a Windows service using NSSM, running under the same service account as the main
OtOpcUa service so the named-pipe ACL works. Passes the per-process shared secret via
environment variable at service-start time so it never hits disk.
.PARAMETER InstallRoot
Where the FOCAS Host binaries live (typically
C:\Program Files\OtOpcUa\Driver.FOCAS.Host).
.PARAMETER ServiceAccount
Service account SID or DOMAIN\name. Must match the main OtOpcUa server account so the
PipeAcl match succeeds.
.PARAMETER FocasSharedSecret
Per-process secret passed via env var. Generated freshly per install if not supplied.
.PARAMETER FocasBackend
Backend selector for the Host process. One of:
fwlib32 (default — real Fanuc Fwlib32.dll integration; requires licensed DLL on PATH)
fake (in-memory; smoke-test mode)
unconfigured (safe default returning structured errors; use until hardware is wired)
.PARAMETER FocasPipeName
Pipe name the Host listens on. Default: OtOpcUaFocas.
.EXAMPLE
.\Install-FocasHost.ps1 -InstallRoot 'C:\Program Files\OtOpcUa\Driver.FOCAS.Host' `
-ServiceAccount 'OTOPCUA\svc-otopcua' -FocasBackend fwlib32
#>
[CmdletBinding()]
param(
[Parameter(Mandatory)] [string]$InstallRoot,
[Parameter(Mandatory)] [string]$ServiceAccount,
[string]$FocasSharedSecret,
[ValidateSet('fwlib32','fake','unconfigured')] [string]$FocasBackend = 'unconfigured',
[string]$FocasPipeName = 'OtOpcUaFocas',
[string]$ServiceName = 'OtOpcUaFocasHost',
[string]$NssmPath = 'C:\Program Files\nssm\nssm.exe'
)
$ErrorActionPreference = 'Stop'
function Resolve-Sid {
param([string]$Account)
if ($Account -match '^S-\d-\d+(-\d+)*$') { return $Account }
try {
$nt = New-Object System.Security.Principal.NTAccount($Account)
return $nt.Translate([System.Security.Principal.SecurityIdentifier]).Value
} catch {
throw "Could not resolve '$Account' to a SID. Pass an explicit SID or check the account name."
}
}
if (-not (Test-Path $NssmPath)) {
throw "nssm.exe not found at '$NssmPath'. Install NSSM or pass -NssmPath."
}
$hostExe = Join-Path $InstallRoot 'OtOpcUa.Driver.FOCAS.Host.exe'
if (-not (Test-Path $hostExe)) {
throw "FOCAS Host binary not found at '$hostExe'. Publish the Driver.FOCAS.Host project first."
}
if (-not $FocasSharedSecret) {
$FocasSharedSecret = [System.Guid]::NewGuid().ToString('N')
Write-Host "Generated FocasSharedSecret: $FocasSharedSecret — store it alongside the OtOpcUa service config."
}
$allowedSid = Resolve-Sid $ServiceAccount
# Idempotent install — remove + re-create if present.
$existing = Get-Service -Name $ServiceName -ErrorAction SilentlyContinue
if ($existing) {
Write-Host "Removing existing '$ServiceName' service..."
& $NssmPath stop $ServiceName | Out-Null
& $NssmPath remove $ServiceName confirm | Out-Null
}
& $NssmPath install $ServiceName $hostExe | Out-Null
& $NssmPath set $ServiceName DisplayName 'OT-OPC-UA FOCAS Host (Tier-C isolated Fwlib32)' | Out-Null
& $NssmPath set $ServiceName Description 'Out-of-process Fwlib32.dll host for OtOpcUa FOCAS driver. Crash-isolated from the main OPC UA server.' | Out-Null
& $NssmPath set $ServiceName ObjectName $ServiceAccount | Out-Null
& $NssmPath set $ServiceName Start SERVICE_AUTO_START | Out-Null
$logDir = Join-Path $env:ProgramData 'OtOpcUa'
New-Item -ItemType Directory -Path $logDir -Force | Out-Null
& $NssmPath set $ServiceName AppStdout (Join-Path $logDir 'focas-host-stdout.log') | Out-Null
& $NssmPath set $ServiceName AppStderr (Join-Path $logDir 'focas-host-stderr.log') | Out-Null
& $NssmPath set $ServiceName AppRotateFiles 1 | Out-Null
& $NssmPath set $ServiceName AppRotateBytes 10485760 | Out-Null
& $NssmPath set $ServiceName AppEnvironmentExtra `
"OTOPCUA_FOCAS_PIPE=$FocasPipeName" `
"OTOPCUA_ALLOWED_SID=$allowedSid" `
"OTOPCUA_FOCAS_SECRET=$FocasSharedSecret" `
"OTOPCUA_FOCAS_BACKEND=$FocasBackend" | Out-Null
& $NssmPath set $ServiceName DependOnService OtOpcUa | Out-Null
Write-Host "Installed '$ServiceName' under '$ServiceAccount' (SID=$allowedSid)."
Write-Host "Pipe: \\.\pipe\$FocasPipeName Backend: $FocasBackend"
Write-Host "Start the service with: Start-Service $ServiceName"
Write-Host ""
Write-Host "NOTE: the Fwlib32 backend requires the licensed Fwlib32.dll on PATH"
Write-Host "alongside the Host exe. See docs/v2/focas-deployment.md."

View File

@@ -0,0 +1,27 @@
<#
.SYNOPSIS
Removes the OtOpcUaFocasHost Windows service.
.DESCRIPTION
Companion to Install-FocasHost.ps1. Stops + unregisters the service via NSSM.
Idempotent — succeeds silently if the service doesn't exist.
.EXAMPLE
.\Uninstall-FocasHost.ps1
#>
[CmdletBinding()]
param(
[string]$ServiceName = 'OtOpcUaFocasHost',
[string]$NssmPath = 'C:\Program Files\nssm\nssm.exe'
)
$ErrorActionPreference = 'Stop'
$svc = Get-Service -Name $ServiceName -ErrorAction SilentlyContinue
if (-not $svc) { Write-Host "Service '$ServiceName' not present — nothing to do."; return }
if (-not (Test-Path $NssmPath)) { throw "nssm.exe not found at '$NssmPath'." }
& $NssmPath stop $ServiceName | Out-Null
& $NssmPath remove $ServiceName confirm | Out-Null
Write-Host "Removed '$ServiceName'."

View File

@@ -10,6 +10,7 @@
<li class="nav-item"><a class="nav-link text-light" href="/clusters">Clusters</a></li>
<li class="nav-item"><a class="nav-link text-light" href="/reservations">Reservations</a></li>
<li class="nav-item"><a class="nav-link text-light" href="/certificates">Certificates</a></li>
<li class="nav-item"><a class="nav-link text-light" href="/role-grants">Role grants</a></li>
</ul>
<div class="mt-5">

View File

@@ -1,7 +1,13 @@
@using Microsoft.AspNetCore.SignalR.Client
@using ZB.MOM.WW.OtOpcUa.Admin.Hubs
@using ZB.MOM.WW.OtOpcUa.Admin.Services
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@using ZB.MOM.WW.OtOpcUa.Configuration.Enums
@using ZB.MOM.WW.OtOpcUa.Core.Authorization
@inject NodeAclService AclSvc
@inject PermissionProbeService ProbeSvc
@inject NavigationManager Nav
@implements IAsyncDisposable
<div class="d-flex justify-content-between mb-3">
<h4>Access-control grants</h4>
@@ -29,6 +35,95 @@ else
</table>
}
@* Probe-this-permission — task #196 slice 1 *@
<div class="card mt-4 mb-3">
<div class="card-header">
<strong>Probe this permission</strong>
<span class="small text-muted ms-2">
Ask the trie "if LDAP group X asks for permission Y on node Z, would it be granted?" —
answers the same way the live server does at request time.
</span>
</div>
<div class="card-body">
<div class="row g-2 align-items-end">
<div class="col-md-3">
<label class="form-label small">LDAP group</label>
<input class="form-control form-control-sm" @bind="_probeGroup" placeholder="cn=fleet-admin,…"/>
</div>
<div class="col-md-2">
<label class="form-label small">Namespace</label>
<input class="form-control form-control-sm" @bind="_probeNamespaceId" placeholder="ns-1"/>
</div>
<div class="col-md-2">
<label class="form-label small">UnsArea</label>
<input class="form-control form-control-sm" @bind="_probeUnsAreaId"/>
</div>
<div class="col-md-2">
<label class="form-label small">UnsLine</label>
<input class="form-control form-control-sm" @bind="_probeUnsLineId"/>
</div>
<div class="col-md-1">
<label class="form-label small">Equipment</label>
<input class="form-control form-control-sm" @bind="_probeEquipmentId"/>
</div>
<div class="col-md-1">
<label class="form-label small">Tag</label>
<input class="form-control form-control-sm" @bind="_probeTagId"/>
</div>
<div class="col-md-1">
<label class="form-label small">Permission</label>
<select class="form-select form-select-sm" @bind="_probePermission">
@foreach (var p in Enum.GetValues<NodePermissions>())
{
if (p == NodePermissions.None) continue;
<option value="@p">@p</option>
}
</select>
</div>
</div>
<div class="mt-3">
<button class="btn btn-sm btn-outline-primary" @onclick="RunProbeAsync" disabled="@_probing">Probe</button>
@if (_probeResult is not null)
{
<span class="ms-3">
@if (_probeResult.Granted)
{
<span class="badge bg-success">Granted</span>
}
else
{
<span class="badge bg-danger">Denied</span>
}
<span class="small ms-2">
Required <code>@_probeResult.Required</code>,
Effective <code>@_probeResult.Effective</code>
</span>
</span>
}
</div>
@if (_probeResult is not null && _probeResult.Matches.Count > 0)
{
<table class="table table-sm mt-3 mb-0">
<thead><tr><th>LDAP group matched</th><th>Level</th><th>Flags contributed</th></tr></thead>
<tbody>
@foreach (var m in _probeResult.Matches)
{
<tr>
<td><code>@m.LdapGroup</code></td>
<td>@m.Scope</td>
<td><code>@m.PermissionFlags</code></td>
</tr>
}
</tbody>
</table>
}
else if (_probeResult is not null)
{
<div class="mt-2 small text-muted">No matching grants for this (group, scope) — effective permission is <code>None</code>.</div>
}
</div>
</div>
@if (_showForm)
{
<div class="card">
@@ -80,6 +175,64 @@ else
private string _preset = "Read";
private string? _error;
// Probe-this-permission state
private string _probeGroup = string.Empty;
private string _probeNamespaceId = string.Empty;
private string _probeUnsAreaId = string.Empty;
private string _probeUnsLineId = string.Empty;
private string _probeEquipmentId = string.Empty;
private string _probeTagId = string.Empty;
private NodePermissions _probePermission = NodePermissions.Read;
private PermissionProbeResult? _probeResult;
private bool _probing;
private async Task RunProbeAsync()
{
if (string.IsNullOrWhiteSpace(_probeGroup)) { _probeResult = null; return; }
_probing = true;
try
{
var scope = new NodeScope
{
ClusterId = ClusterId,
NamespaceId = NullIfBlank(_probeNamespaceId),
UnsAreaId = NullIfBlank(_probeUnsAreaId),
UnsLineId = NullIfBlank(_probeUnsLineId),
EquipmentId = NullIfBlank(_probeEquipmentId),
TagId = NullIfBlank(_probeTagId),
Kind = NodeHierarchyKind.Equipment,
};
_probeResult = await ProbeSvc.ProbeAsync(GenerationId, _probeGroup.Trim(), scope, _probePermission, CancellationToken.None);
}
finally { _probing = false; }
}
private static string? NullIfBlank(string s) => string.IsNullOrWhiteSpace(s) ? null : s;
private HubConnection? _hub;
protected override async Task OnAfterRenderAsync(bool firstRender)
{
if (!firstRender || _hub is not null) return;
_hub = new HubConnectionBuilder()
.WithUrl(Nav.ToAbsoluteUri("/hubs/fleet-status"))
.WithAutomaticReconnect()
.Build();
_hub.On<NodeAclChangedMessage>("NodeAclChanged", async msg =>
{
if (msg.ClusterId != ClusterId || msg.GenerationId != GenerationId) return;
_acls = await AclSvc.ListAsync(GenerationId, CancellationToken.None);
await InvokeAsync(StateHasChanged);
});
await _hub.StartAsync();
await _hub.SendAsync("SubscribeCluster", ClusterId);
}
public async ValueTask DisposeAsync()
{
if (_hub is not null) { await _hub.DisposeAsync(); _hub = null; }
}
protected override async Task OnParametersSetAsync() =>
_acls = await AclSvc.ListAsync(GenerationId, CancellationToken.None);
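For reference, the shape of `PermissionProbeResult` that this markup binds against — a hedged reconstruction from the fields used above, not the actual service contract:

```csharp
using System.Collections.Generic;

// Reconstructed from the bindings in AclsTab — the real types live in
// ZB.MOM.WW.OtOpcUa.Admin.Services and may carry additional members.
public sealed record PermissionProbeMatch(
    string LdapGroup,        // grant that matched the probed (group, scope) pair
    string Scope,            // hierarchy level the grant was attached at
    string PermissionFlags); // flags this grant contributed to the effective set

public sealed record PermissionProbeResult(
    bool Granted,
    string Required,   // permission the probe asked for
    string Effective,  // union of flags from all matching grants
    IReadOnlyList<PermissionProbeMatch> Matches);
```

The key invariant the UI relies on: `Matches` can be empty while the result is still non-null, which is why the markup renders a "No matching grants" hint instead of hiding the panel.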

View File

@@ -52,6 +52,7 @@ else
<li class="nav-item"><button class="nav-link @Tab("namespaces")" @onclick='() => _tab = "namespaces"'>Namespaces</button></li>
<li class="nav-item"><button class="nav-link @Tab("drivers")" @onclick='() => _tab = "drivers"'>Drivers</button></li>
<li class="nav-item"><button class="nav-link @Tab("acls")" @onclick='() => _tab = "acls"'>ACLs</button></li>
<li class="nav-item"><button class="nav-link @Tab("redundancy")" @onclick='() => _tab = "redundancy"'>Redundancy</button></li>
<li class="nav-item"><button class="nav-link @Tab("audit")" @onclick='() => _tab = "audit"'>Audit</button></li>
</ul>
@@ -92,6 +93,10 @@ else
{
<AclsTab GenerationId="@_currentDraft.GenerationId" ClusterId="@ClusterId"/>
}
else if (_tab == "redundancy")
{
<RedundancyTab ClusterId="@ClusterId"/>
}
else if (_tab == "audit")
{
<AuditTab ClusterId="@ClusterId"/>

View File

@@ -0,0 +1,90 @@
@using ZB.MOM.WW.OtOpcUa.Admin.Services
@* Per-section diff renderer — the base used by DiffViewer for every known TableName. Caps
output at RowCap rows so a pathological draft (e.g. 20k tags churned) can't freeze the
Blazor render; overflow banner tells operator how many rows were hidden. *@
<div class="card mb-3">
<div class="card-header d-flex justify-content-between align-items-center">
<div>
<strong>@Title</strong>
<small class="text-muted ms-2">@Description</small>
</div>
<div>
@if (_added > 0) { <span class="badge bg-success me-1">+@_added</span> }
@if (_removed > 0) { <span class="badge bg-danger me-1">-@_removed</span> }
@if (_modified > 0) { <span class="badge bg-warning text-dark me-1">~@_modified</span> }
@if (_total == 0) { <span class="badge bg-secondary">no changes</span> }
</div>
</div>
@if (_total == 0)
{
<div class="card-body text-muted small">No changes in this section.</div>
}
else
{
@if (_total > RowCap)
{
<div class="alert alert-warning mb-0 small rounded-0">
Showing the first @RowCap of @_total rows — the cap protects the browser from megabyte-class
diffs. Inspect the remainder by running the SQL proc <code>sp_ComputeGenerationDiff</code> directly.
</div>
}
<div class="table-responsive" style="max-height: 400px; overflow-y: auto;">
<table class="table table-sm table-hover mb-0">
<thead class="table-light">
<tr><th>LogicalId</th><th style="width: 120px;">Change</th></tr>
</thead>
<tbody>
@foreach (var r in _visibleRows)
{
<tr>
<td><code>@r.LogicalId</code></td>
<td>
@switch (r.ChangeKind)
{
case "Added": <span class="badge bg-success">@r.ChangeKind</span> break;
case "Removed": <span class="badge bg-danger">@r.ChangeKind</span> break;
case "Modified": <span class="badge bg-warning text-dark">@r.ChangeKind</span> break;
default: <span class="badge bg-secondary">@r.ChangeKind</span> break;
}
</td>
</tr>
}
</tbody>
</table>
</div>
}
</div>
@code {
/// <summary>Default row-cap per section — matches task #156's acceptance criterion.</summary>
public const int DefaultRowCap = 1000;
[Parameter, EditorRequired] public string Title { get; set; } = string.Empty;
[Parameter] public string Description { get; set; } = string.Empty;
[Parameter, EditorRequired] public IReadOnlyList<DiffRow> Rows { get; set; } = [];
[Parameter] public int RowCap { get; set; } = DefaultRowCap;
private int _total;
private int _added;
private int _removed;
private int _modified;
private List<DiffRow> _visibleRows = [];
protected override void OnParametersSet()
{
_total = Rows.Count;
_added = 0; _removed = 0; _modified = 0;
foreach (var r in Rows)
{
switch (r.ChangeKind)
{
case "Added": _added++; break;
case "Removed": _removed++; break;
case "Modified": _modified++; break;
}
}
_visibleRows = _total > RowCap ? Rows.Take(RowCap).ToList() : Rows.ToList();
}
}

View File

@@ -28,36 +28,44 @@ else if (_rows.Count == 0)
}
else
{
<table class="table table-hover table-sm">
<thead><tr><th>Table</th><th>LogicalId</th><th>ChangeKind</th></tr></thead>
<tbody>
@foreach (var r in _rows)
{
<tr>
<td>@r.TableName</td>
<td><code>@r.LogicalId</code></td>
<td>
@switch (r.ChangeKind)
{
case "Added": <span class="badge bg-success">@r.ChangeKind</span> break;
case "Removed": <span class="badge bg-danger">@r.ChangeKind</span> break;
case "Modified": <span class="badge bg-warning text-dark">@r.ChangeKind</span> break;
default: <span class="badge bg-secondary">@r.ChangeKind</span> break;
}
</td>
</tr>
}
</tbody>
</table>
<p class="small text-muted mb-3">
@_rows.Count row@(_rows.Count == 1 ? "" : "s") across @_sectionsWithChanges of @Sections.Count sections.
Each section is capped at @DiffSection.DefaultRowCap rows to keep the browser responsive on pathological drafts.
</p>
@foreach (var sec in Sections)
{
<DiffSection Title="@sec.Title"
Description="@sec.Description"
Rows="@RowsFor(sec.TableName)"/>
}
}
@code {
[Parameter] public string ClusterId { get; set; } = string.Empty;
[Parameter] public long GenerationId { get; set; }
/// <summary>
/// Ordered section definitions — each maps a <c>TableName</c> emitted by
/// <c>sp_ComputeGenerationDiff</c> to a human label + description. The proc currently
/// emits Namespace/DriverInstance/Equipment/Tag; UnsLine + NodeAcl entries render as
/// empty "no changes" cards until the proc is extended (tracked in tasks #196 + #156
/// follow-up). Six sections total matches the task #156 target.
/// </summary>
private static readonly IReadOnlyList<SectionDef> Sections = new[]
{
new SectionDef("Namespace", "Namespaces", "OPC UA namespace URIs + enablement"),
new SectionDef("DriverInstance", "Driver instances","Per-cluster driver configuration rows"),
new SectionDef("Equipment", "Equipment", "UNS level-5 rows + identification fields"),
new SectionDef("Tag", "Tags", "Per-device tag definitions + poll-group binding"),
new SectionDef("UnsLine", "UNS structure", "Site / Area / Line hierarchy (proc-extension pending)"),
new SectionDef("NodeAcl", "ACLs", "LDAP-group → node-scope permission grants (logical id = LdapGroup|ScopeKind|ScopeId)"),
};
private List<DiffRow>? _rows;
private string _fromLabel = "(empty)";
private string? _error;
private int _sectionsWithChanges;
protected override async Task OnParametersSetAsync()
{
@@ -67,7 +75,13 @@ else
var from = all.FirstOrDefault(g => g.Status == GenerationStatus.Published);
_fromLabel = from is null ? "(empty)" : $"gen {from.GenerationId}";
_rows = await GenerationSvc.ComputeDiffAsync(from?.GenerationId ?? 0, GenerationId, CancellationToken.None);
_sectionsWithChanges = Sections.Count(s => _rows.Any(r => r.TableName == s.TableName));
}
catch (Exception ex) { _error = ex.Message; }
}
private IReadOnlyList<DiffRow> RowsFor(string tableName) =>
_rows?.Where(r => r.TableName == tableName).ToList() ?? [];
private sealed record SectionDef(string TableName, string Title, string Description);
}

View File

@@ -27,7 +27,7 @@
<div class="row">
<div class="col-md-8">
@if (_tab == "equipment") { <EquipmentTab GenerationId="@GenerationId"/> }
@if (_tab == "equipment") { <EquipmentTab GenerationId="@GenerationId" ClusterId="@ClusterId"/> }
else if (_tab == "uns") { <UnsTab GenerationId="@GenerationId" ClusterId="@ClusterId"/> }
else if (_tab == "namespaces") { <NamespacesTab GenerationId="@GenerationId" ClusterId="@ClusterId"/> }
else if (_tab == "drivers") { <DriversTab GenerationId="@GenerationId" ClusterId="@ClusterId"/> }

View File

@@ -2,10 +2,14 @@
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@using ZB.MOM.WW.OtOpcUa.Configuration.Validation
@inject EquipmentService EquipmentSvc
@inject NavigationManager Nav
<div class="d-flex justify-content-between mb-3">
<h4>Equipment (draft gen @GenerationId)</h4>
<button class="btn btn-primary btn-sm" @onclick="StartAdd">Add equipment</button>
<div>
<button class="btn btn-outline-primary btn-sm me-2" @onclick="GoImport">Import CSV…</button>
<button class="btn btn-primary btn-sm" @onclick="StartAdd">Add equipment</button>
</div>
</div>
@if (_equipment is null)
@@ -96,6 +100,9 @@ else if (_equipment.Count > 0)
@code {
[Parameter] public long GenerationId { get; set; }
[Parameter] public string ClusterId { get; set; } = string.Empty;
private void GoImport() => Nav.NavigateTo($"/clusters/{ClusterId}/draft/{GenerationId}/import-equipment");
private List<Equipment>? _equipment;
private bool _showForm;
private bool _editMode;

View File

@@ -0,0 +1,200 @@
@page "/clusters/{ClusterId}/draft/{GenerationId:long}/import-equipment"
@using Microsoft.AspNetCore.Components.Authorization
@using ZB.MOM.WW.OtOpcUa.Admin.Services
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@inject DriverInstanceService DriverSvc
@inject UnsService UnsSvc
@inject EquipmentImportBatchService BatchSvc
@inject NavigationManager Nav
@inject AuthenticationStateProvider AuthProvider
<div class="d-flex justify-content-between align-items-center mb-3">
<div>
<h1 class="mb-0">Equipment CSV import</h1>
<small class="text-muted">Cluster <code>@ClusterId</code> · draft generation @GenerationId</small>
</div>
<a class="btn btn-outline-secondary" href="/clusters/@ClusterId/draft/@GenerationId">Back to draft</a>
</div>
<div class="alert alert-info small mb-3">
Accepts <code>@EquipmentCsvImporter.VersionMarker</code>-headered CSV per Stream B.3.
Required columns: @string.Join(", ", EquipmentCsvImporter.RequiredColumns).
Optional columns cover the OPC 40010 Identification fields. Paste the file contents
or upload directly — the parser runs client-stream-side and shows a row-level preview
before anything lands in the draft. ZTag + SAPID uniqueness across the fleet is NOT
enforced here yet (see task #197); for now the finalise may fail at commit time if a
reservation conflict exists.
</div>
<div class="card mb-3">
<div class="card-body">
<div class="row g-3">
<div class="col-md-5">
<label class="form-label">Target driver instance (for every accepted row)</label>
<select class="form-select" @bind="_driverInstanceId">
<option value="">-- select driver --</option>
@if (_drivers is not null)
{
@foreach (var d in _drivers) { <option value="@d.DriverInstanceId">@d.DriverInstanceId</option> }
}
</select>
</div>
<div class="col-md-5">
<label class="form-label">Target UNS line (for every accepted row)</label>
<select class="form-select" @bind="_unsLineId">
<option value="">-- select line --</option>
@if (_unsLines is not null)
{
@foreach (var l in _unsLines) { <option value="@l.UnsLineId">@l.UnsLineId — @l.Name</option> }
}
</select>
</div>
<div class="col-md-2 pt-4">
<InputFile OnChange="HandleFileAsync" class="form-control form-control-sm" accept=".csv,.txt"/>
</div>
</div>
<div class="mt-3">
<label class="form-label">CSV content (paste or uploaded)</label>
<textarea class="form-control font-monospace" rows="8" @bind="_csvText"
placeholder="# OtOpcUaCsv v1&#10;ZTag,MachineCode,SAPID,EquipmentId,…"/>
</div>
<div class="mt-3">
<button class="btn btn-sm btn-outline-primary" @onclick="ParseAsync" disabled="@_busy">Parse</button>
<button class="btn btn-sm btn-primary ms-2" @onclick="StageAndFinaliseAsync"
disabled="@(_parseResult is null || _parseResult.AcceptedRows.Count == 0 || string.IsNullOrWhiteSpace(_driverInstanceId) || string.IsNullOrWhiteSpace(_unsLineId) || _busy)">
Stage + Finalise
</button>
@if (_parseError is not null) { <span class="alert alert-danger ms-3 py-1 px-2 small">@_parseError</span> }
@if (_result is not null) { <span class="alert alert-success ms-3 py-1 px-2 small">@_result</span> }
</div>
</div>
</div>
@if (_parseResult is not null)
{
<div class="row g-3">
<div class="col-md-6">
<div class="card">
<div class="card-header bg-success text-white">
Accepted (@_parseResult.AcceptedRows.Count)
</div>
<div class="card-body p-0" style="max-height: 400px; overflow-y: auto;">
@if (_parseResult.AcceptedRows.Count == 0)
{
<p class="text-muted p-3 mb-0">No accepted rows.</p>
}
else
{
<table class="table table-sm table-striped mb-0">
<thead>
<tr><th>ZTag</th><th>Machine</th><th>Name</th><th>Line</th></tr>
</thead>
<tbody>
@foreach (var r in _parseResult.AcceptedRows)
{
<tr>
<td><code>@r.ZTag</code></td>
<td>@r.MachineCode</td>
<td>@r.Name</td>
<td>@r.UnsLineName</td>
</tr>
}
</tbody>
</table>
}
</div>
</div>
</div>
<div class="col-md-6">
<div class="card">
<div class="card-header bg-danger text-white">
Rejected (@_parseResult.RejectedRows.Count)
</div>
<div class="card-body p-0" style="max-height: 400px; overflow-y: auto;">
@if (_parseResult.RejectedRows.Count == 0)
{
<p class="text-muted p-3 mb-0">No rejections.</p>
}
else
{
<table class="table table-sm table-striped mb-0">
<thead><tr><th>Line</th><th>Reason</th></tr></thead>
<tbody>
@foreach (var e in _parseResult.RejectedRows)
{
<tr>
<td>@e.LineNumber</td>
<td class="small">@e.Reason</td>
</tr>
}
</tbody>
</table>
}
</div>
</div>
</div>
</div>
}
@code {
[Parameter] public string ClusterId { get; set; } = string.Empty;
[Parameter] public long GenerationId { get; set; }
private List<DriverInstance>? _drivers;
private List<UnsLine>? _unsLines;
private string _driverInstanceId = string.Empty;
private string _unsLineId = string.Empty;
private string _csvText = string.Empty;
private EquipmentCsvParseResult? _parseResult;
private string? _parseError;
private string? _result;
private bool _busy;
protected override async Task OnInitializedAsync()
{
_drivers = await DriverSvc.ListAsync(GenerationId, CancellationToken.None);
_unsLines = await UnsSvc.ListLinesAsync(GenerationId, CancellationToken.None);
}
private async Task HandleFileAsync(InputFileChangeEventArgs e)
{
// 5 MiB cap — refuses pathological uploads that would OOM the server.
using var stream = e.File.OpenReadStream(maxAllowedSize: 5 * 1024 * 1024);
using var reader = new StreamReader(stream);
_csvText = await reader.ReadToEndAsync();
}
private void ParseAsync()
{
_parseError = null;
_parseResult = null;
_result = null;
try { _parseResult = EquipmentCsvImporter.Parse(_csvText); }
catch (InvalidCsvFormatException ex) { _parseError = ex.Message; }
catch (Exception ex) { _parseError = $"Parse failed: {ex.Message}"; }
}
private async Task StageAndFinaliseAsync()
{
if (_parseResult is null) return;
_busy = true;
_result = null;
_parseError = null;
try
{
var auth = await AuthProvider.GetAuthenticationStateAsync();
var createdBy = auth.User.Identity?.Name ?? "unknown";
var batch = await BatchSvc.CreateBatchAsync(ClusterId, createdBy, CancellationToken.None);
await BatchSvc.StageRowsAsync(batch.Id, _parseResult.AcceptedRows, _parseResult.RejectedRows, CancellationToken.None);
await BatchSvc.FinaliseBatchAsync(batch.Id, GenerationId, _driverInstanceId, _unsLineId, CancellationToken.None);
_result = $"Finalised batch {batch.Id:N} — {_parseResult.AcceptedRows.Count} rows added.";
// Pause 600 ms so the success banner is visible, then navigate back.
await Task.Delay(600);
Nav.NavigateTo($"/clusters/{ClusterId}/draft/{GenerationId}");
}
catch (Exception ex) { _parseError = $"Finalise failed: {ex.Message}"; }
finally { _busy = false; }
}
}
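A hedged usage sketch of the parser contract this page relies on — the method name and result members are taken from the markup above; the column set and row values shown are illustrative, not the full required list from `EquipmentCsvImporter.RequiredColumns`:

```csharp
// Illustrative only — EquipmentCsvImporter is the real parser used by this page;
// the sample rows and column subset below are made up for demonstration.
var csv = """
    # OtOpcUaCsv v1
    ZTag,MachineCode,SAPID,EquipmentId
    Z-0001,MC-12,10004711,press-01
    """;

var result = EquipmentCsvImporter.Parse(csv);

// Accepted rows are staged into the batch; rejected rows carry a
// line number + reason, which is exactly what the right-hand card renders.
Console.WriteLine($"Accepted: {result.AcceptedRows.Count}, Rejected: {result.RejectedRows.Count}");
foreach (var rejected in result.RejectedRows)
    Console.WriteLine($"line {rejected.LineNumber}: {rejected.Reason}");
```

Note that `Parse` throws `InvalidCsvFormatException` on a structurally broken file (missing version marker, wrong header), which is why the page wraps it in a dedicated catch before the generic fallback.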

View File

@@ -0,0 +1,175 @@
@using Microsoft.AspNetCore.SignalR.Client
@using ZB.MOM.WW.OtOpcUa.Admin.Hubs
@using ZB.MOM.WW.OtOpcUa.Admin.Services
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@using ZB.MOM.WW.OtOpcUa.Configuration.Enums
@inject ClusterNodeService NodeSvc
@inject NavigationManager Nav
@implements IAsyncDisposable
<h4>Redundancy topology</h4>
@if (_roleChangedBanner is not null)
{
<div class="alert alert-info small mb-2">@_roleChangedBanner</div>
}
<p class="text-muted small">
One row per <code>ClusterNode</code> in this cluster. Role, <code>ApplicationUri</code>,
and <code>ServiceLevelBase</code> are authored separately; the Admin UI shows them read-only
here so operators can confirm the published topology without touching it. LastSeen older than
@((int)ClusterNodeService.StaleThreshold.TotalSeconds)s is flagged Stale — the node has
stopped heart-beating and is likely down. Role swap goes through the server-side
<code>RedundancyCoordinator</code> apply-lease flow, not direct DB edits.
</p>
@if (_nodes is null)
{
<p>Loading…</p>
}
else if (_nodes.Count == 0)
{
<div class="alert alert-warning">
No ClusterNode rows for this cluster. The server process needs at least one entry
(with a non-blank <code>ApplicationUri</code>) before it can start up per OPC UA spec.
</div>
}
else
{
var primaries = _nodes.Count(n => n.RedundancyRole == RedundancyRole.Primary);
var secondaries = _nodes.Count(n => n.RedundancyRole == RedundancyRole.Secondary);
var standalone = _nodes.Count(n => n.RedundancyRole == RedundancyRole.Standalone);
var staleCount = _nodes.Count(ClusterNodeService.IsStale);
<div class="row g-3 mb-4">
<div class="col-md-3"><div class="card"><div class="card-body">
<h6 class="text-muted mb-1">Nodes</h6>
<div class="fs-3">@_nodes.Count</div>
</div></div></div>
<div class="col-md-3"><div class="card border-success"><div class="card-body">
<h6 class="text-muted mb-1">Primary</h6>
<div class="fs-3 text-success">@primaries</div>
</div></div></div>
<div class="col-md-3"><div class="card border-info"><div class="card-body">
<h6 class="text-muted mb-1">Secondary</h6>
<div class="fs-3 text-info">@secondaries</div>
</div></div></div>
<div class="col-md-3"><div class="card @(staleCount > 0 ? "border-warning" : "")"><div class="card-body">
<h6 class="text-muted mb-1">Stale</h6>
<div class="fs-3 @(staleCount > 0 ? "text-warning" : "")">@staleCount</div>
</div></div></div>
</div>
@if (primaries == 0 && standalone == 0)
{
<div class="alert alert-danger small mb-3">
No Primary or Standalone node — the cluster has no authoritative write target. Secondaries
stay read-only until one of them gets promoted via <code>RedundancyCoordinator</code>.
</div>
}
else if (primaries > 1)
{
<div class="alert alert-danger small mb-3">
<strong>Split-brain:</strong> @primaries nodes claim the Primary role. Apply-lease
enforcement should have made this impossible at the coordinator level. Investigate
immediately — one of the rows was likely hand-edited.
</div>
}
<table class="table table-sm table-hover align-middle">
<thead>
<tr>
<th>Node</th>
<th>Role</th>
<th>Host</th>
<th class="text-end">OPC UA port</th>
<th class="text-end">ServiceLevel base</th>
<th>ApplicationUri</th>
<th>Enabled</th>
<th>Last seen</th>
</tr>
</thead>
<tbody>
@foreach (var n in _nodes)
{
<tr class="@RowClass(n)">
<td><code>@n.NodeId</code></td>
<td><span class="badge @RoleBadge(n.RedundancyRole)">@n.RedundancyRole</span></td>
<td>@n.Host</td>
<td class="text-end"><code>@n.OpcUaPort</code></td>
<td class="text-end">@n.ServiceLevelBase</td>
<td class="small text-break"><code>@n.ApplicationUri</code></td>
<td>
@if (n.Enabled) { <span class="badge bg-success">Enabled</span> }
else { <span class="badge bg-secondary">Disabled</span> }
</td>
<td class="small @(ClusterNodeService.IsStale(n) ? "text-warning fw-bold" : "")">
@(n.LastSeenAt is null ? "never" : FormatAge(n.LastSeenAt.Value))
@if (ClusterNodeService.IsStale(n)) { <span class="badge bg-warning text-dark ms-1">Stale</span> }
</td>
</tr>
}
</tbody>
</table>
}
@code {
[Parameter] public string ClusterId { get; set; } = string.Empty;
private List<ClusterNode>? _nodes;
private HubConnection? _hub;
private string? _roleChangedBanner;
protected override async Task OnParametersSetAsync()
{
_nodes = await NodeSvc.ListByClusterAsync(ClusterId, CancellationToken.None);
if (_hub is null) await ConnectHubAsync();
}
private async Task ConnectHubAsync()
{
_hub = new HubConnectionBuilder()
.WithUrl(Nav.ToAbsoluteUri("/hubs/fleet-status"))
.WithAutomaticReconnect()
.Build();
_hub.On<RoleChangedMessage>("RoleChanged", async msg =>
{
if (msg.ClusterId != ClusterId) return;
_roleChangedBanner = $"Role changed on {msg.NodeId}: {msg.FromRole} → {msg.ToRole} at {msg.ObservedAtUtc:HH:mm:ss 'UTC'}";
_nodes = await NodeSvc.ListByClusterAsync(ClusterId, CancellationToken.None);
await InvokeAsync(StateHasChanged);
});
await _hub.StartAsync();
await _hub.SendAsync("SubscribeCluster", ClusterId);
}
public async ValueTask DisposeAsync()
{
if (_hub is not null)
{
await _hub.DisposeAsync();
_hub = null;
}
}
private static string RowClass(ClusterNode n) =>
ClusterNodeService.IsStale(n) ? "table-warning" :
!n.Enabled ? "table-secondary" : "";
private static string RoleBadge(RedundancyRole r) => r switch
{
RedundancyRole.Primary => "bg-success",
RedundancyRole.Secondary => "bg-info",
RedundancyRole.Standalone => "bg-primary",
_ => "bg-secondary",
};
private static string FormatAge(DateTime t)
{
var age = DateTime.UtcNow - t;
if (age.TotalSeconds < 60) return $"{(int)age.TotalSeconds}s ago";
if (age.TotalMinutes < 60) return $"{(int)age.TotalMinutes}m ago";
if (age.TotalHours < 24) return $"{(int)age.TotalHours}h ago";
return t.ToString("yyyy-MM-dd HH:mm 'UTC'");
}
}

View File

@@ -2,6 +2,13 @@
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@inject UnsService UnsSvc
<div class="alert alert-info small mb-3">
Drag any line in the <strong>UNS Lines</strong> table onto an area row in <strong>UNS Areas</strong>
to re-parent it. A preview modal shows the impact (equipment re-home count) + lets you confirm
or cancel. If another operator modifies the draft while you're confirming, you'll see a 409
refresh-required modal instead of clobbering their work.
</div>
<div class="row">
<div class="col-md-6">
<div class="d-flex justify-content-between mb-2">
@@ -14,11 +21,20 @@
else
{
<table class="table table-sm">
<thead><tr><th>AreaId</th><th>Name</th></tr></thead>
<thead><tr><th>AreaId</th><th>Name</th><th class="small text-muted">(drop target)</th></tr></thead>
<tbody>
@foreach (var a in _areas)
{
<tr><td><code>@a.UnsAreaId</code></td><td>@a.Name</td></tr>
<tr class="@(_hoverAreaId == a.UnsAreaId ? "table-primary" : "")"
@ondragover="e => OnAreaDragOver(e, a.UnsAreaId)"
@ondragover:preventDefault
@ondragleave="() => _hoverAreaId = null"
@ondrop="() => OnLineDroppedAsync(a.UnsAreaId)"
@ondrop:preventDefault>
<td><code>@a.UnsAreaId</code></td>
<td>@a.Name</td>
<td class="small text-muted">drop here</td>
</tr>
}
</tbody>
</table>
@@ -35,6 +51,7 @@
</div>
}
</div>
<div class="col-md-6">
<div class="d-flex justify-content-between mb-2">
<h4>UNS Lines</h4>
@@ -50,7 +67,14 @@
<tbody>
@foreach (var l in _lines)
{
<tr><td><code>@l.UnsLineId</code></td><td><code>@l.UnsAreaId</code></td><td>@l.Name</td></tr>
<tr draggable="true"
@ondragstart="() => _dragLineId = l.UnsLineId"
@ondragend="() => { _dragLineId = null; _hoverAreaId = null; }"
style="cursor: grab;">
<td><code>@l.UnsLineId</code></td>
<td><code>@l.UnsAreaId</code></td>
<td>@l.Name</td>
</tr>
}
</tbody>
</table>
@@ -75,6 +99,64 @@
</div>
</div>
@* Preview / confirm modal for a pending drag-drop move *@
@if (_pendingPreview is not null)
{
<div class="modal show d-block" tabindex="-1" style="background-color: rgba(0,0,0,0.5);">
<div class="modal-dialog">
<div class="modal-content">
<div class="modal-header">
<h5 class="modal-title">Confirm UNS move</h5>
<button type="button" class="btn-close" @onclick="CancelMove"></button>
</div>
<div class="modal-body">
<p>@_pendingPreview.HumanReadableSummary</p>
<p class="text-muted small">
Equipment re-homed: <strong>@_pendingPreview.AffectedEquipmentCount</strong>.
Tags re-parented: <strong>@_pendingPreview.AffectedTagCount</strong>.
</p>
@if (_pendingPreview.CascadeWarnings.Count > 0)
{
<div class="alert alert-warning small mb-0">
<ul class="mb-0">
@foreach (var w in _pendingPreview.CascadeWarnings) { <li>@w</li> }
</ul>
</div>
}
</div>
<div class="modal-footer">
<button class="btn btn-secondary" @onclick="CancelMove">Cancel</button>
<button class="btn btn-primary" @onclick="ConfirmMoveAsync" disabled="@_committing">Confirm move</button>
</div>
</div>
</div>
</div>
}
@* 409 concurrent-edit modal — another operator changed the draft between preview + commit *@
@if (_conflictMessage is not null)
{
<div class="modal show d-block" tabindex="-1" style="background-color: rgba(0,0,0,0.5);">
<div class="modal-dialog">
<div class="modal-content border-danger">
<div class="modal-header bg-danger text-white">
<h5 class="modal-title">Draft changed — refresh required</h5>
</div>
<div class="modal-body">
<p>@_conflictMessage</p>
<p class="small text-muted">
The per-DraftRevisionToken concurrency guard prevented overwriting the peer
operator's edit. Reload the tab + redo the move on the current draft state.
</p>
</div>
<div class="modal-footer">
<button class="btn btn-primary" @onclick="ReloadAfterConflict">Reload draft</button>
</div>
</div>
</div>
</div>
}
@code {
[Parameter] public long GenerationId { get; set; }
[Parameter] public string ClusterId { get; set; } = string.Empty;
@@ -87,6 +169,13 @@
private string _newLineName = string.Empty;
private string _newLineAreaId = string.Empty;
private string? _dragLineId;
private string? _hoverAreaId;
private UnsImpactPreview? _pendingPreview;
private UnsMoveOperation? _pendingMove;
private bool _committing;
private string? _conflictMessage;
protected override async Task OnParametersSetAsync() => await ReloadAsync();
private async Task ReloadAsync()
@@ -112,4 +201,72 @@
_showLineForm = false;
await ReloadAsync();
}
private void OnAreaDragOver(DragEventArgs _, string areaId) => _hoverAreaId = areaId;
private async Task OnLineDroppedAsync(string targetAreaId)
{
var lineId = _dragLineId;
_hoverAreaId = null;
_dragLineId = null;
if (string.IsNullOrWhiteSpace(lineId)) return;
var line = _lines?.FirstOrDefault(l => l.UnsLineId == lineId);
if (line is null || line.UnsAreaId == targetAreaId) return;
var snapshot = await UnsSvc.LoadSnapshotAsync(GenerationId, CancellationToken.None);
var move = new UnsMoveOperation(
Kind: UnsMoveKind.LineMove,
SourceClusterId: ClusterId,
TargetClusterId: ClusterId,
SourceLineId: lineId,
TargetAreaId: targetAreaId);
try
{
_pendingPreview = UnsImpactAnalyzer.Analyze(snapshot, move);
_pendingMove = move;
}
catch (Exception ex)
{
_conflictMessage = ex.Message; // CrossCluster or validation failure surfaces here
}
}
private void CancelMove()
{
_pendingPreview = null;
_pendingMove = null;
}
private async Task ConfirmMoveAsync()
{
if (_pendingPreview is null || _pendingMove is null) return;
_committing = true;
try
{
await UnsSvc.MoveLineAsync(
GenerationId,
_pendingPreview.RevisionToken,
_pendingMove.SourceLineId!,
_pendingMove.TargetAreaId!,
CancellationToken.None);
_pendingPreview = null;
_pendingMove = null;
await ReloadAsync();
}
catch (DraftRevisionConflictException ex)
{
_pendingPreview = null;
_pendingMove = null;
_conflictMessage = ex.Message;
}
finally { _committing = false; }
}
private async Task ReloadAfterConflict()
{
_conflictMessage = null;
await ReloadAsync();
}
}
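
The server-side check that produces this 409 path lives in `UnsService.MoveLineAsync`, which is not part of this hunk. A minimal, self-contained sketch of the preview-token guard pattern the component relies on — `RevisionGuard` and `DraftConflictException` are illustrative stand-ins, not the project's real types:

```csharp
using System;

// Commit-time optimistic-concurrency check: the token captured at preview time must
// still match the token recomputed from the live draft, or another operator changed it.
sealed class DraftConflictException(string message) : Exception(message);

static class RevisionGuard
{
    public static void EnsureUnchanged(string previewToken, string currentToken)
    {
        // Ordinal comparison — tokens are hex digests, not culture-sensitive text.
        if (!string.Equals(previewToken, currentToken, StringComparison.Ordinal))
            throw new DraftConflictException(
                "Draft changed since preview — reload and redo the move.");
    }
}
```

In the real flow the thrown exception surfaces in `ConfirmMoveAsync` as `DraftRevisionConflictException` and opens the refresh-required modal above.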

View File

@@ -0,0 +1,192 @@
@page "/role-grants"
@using Microsoft.AspNetCore.Components.Web
@using Microsoft.AspNetCore.SignalR.Client
@using ZB.MOM.WW.OtOpcUa.Admin.Hubs
@using ZB.MOM.WW.OtOpcUa.Admin.Services
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@using ZB.MOM.WW.OtOpcUa.Configuration.Enums
@using ZB.MOM.WW.OtOpcUa.Configuration.Services
@inject ILdapGroupRoleMappingService RoleSvc
@inject ClusterService ClusterSvc
@inject AclChangeNotifier Notifier
@inject NavigationManager Nav
@implements IAsyncDisposable
<h1 class="mb-4">LDAP group → Admin role grants</h1>
<div class="alert alert-info small mb-4">
Maps LDAP groups to Admin UI roles (ConfigViewer / ConfigEditor / FleetAdmin). Control-plane
only — OPC UA data-path authorization reads <code>NodeAcl</code> rows directly and is
unaffected by these mappings (see decision #150). A fleet-wide grant applies across every
cluster; a cluster-scoped grant only binds within the named cluster. The same LDAP group
may hold different roles on different clusters.
</div>
<div class="d-flex justify-content-end mb-3">
<button class="btn btn-primary btn-sm" @onclick="StartAdd">Add grant</button>
</div>
@if (_rows is null)
{
<p>Loading…</p>
}
else if (_rows.Count == 0)
{
<p class="text-muted">No role grants defined yet. Without at least one FleetAdmin grant,
only the bootstrap admin can publish drafts.</p>
}
else
{
<table class="table table-sm table-hover">
<thead>
<tr><th>LDAP group</th><th>Role</th><th>Scope</th><th>Created</th><th>Notes</th><th></th></tr>
</thead>
<tbody>
@foreach (var r in _rows)
{
<tr>
<td><code>@r.LdapGroup</code></td>
<td><span class="badge bg-secondary">@r.Role</span></td>
<td>@(r.IsSystemWide ? "Fleet-wide" : $"Cluster: {r.ClusterId}")</td>
<td class="small">@r.CreatedAtUtc.ToString("yyyy-MM-dd")</td>
<td class="small text-muted">@r.Notes</td>
<td><button class="btn btn-sm btn-outline-danger" @onclick="() => DeleteAsync(r.Id)">Revoke</button></td>
</tr>
}
</tbody>
</table>
}
@if (_showForm)
{
<div class="card mt-3">
<div class="card-body">
<h5>New role grant</h5>
<div class="row g-3">
<div class="col-md-4">
<label class="form-label">LDAP group (DN)</label>
<input class="form-control" @bind="_group" placeholder="cn=fleet-admin,ou=groups,dc=…"/>
</div>
<div class="col-md-3">
<label class="form-label">Role</label>
<select class="form-select" @bind="_role">
@foreach (var r in Enum.GetValues<AdminRole>())
{
<option value="@r">@r</option>
}
</select>
</div>
<div class="col-md-2 pt-4">
<div class="form-check">
<input class="form-check-input" type="checkbox" id="systemWide" @bind="_isSystemWide"/>
<label class="form-check-label" for="systemWide">Fleet-wide</label>
</div>
</div>
<div class="col-md-3">
<label class="form-label">Cluster @(_isSystemWide ? "(disabled)" : "")</label>
<select class="form-select" @bind="_clusterId" disabled="@_isSystemWide">
<option value="">-- select --</option>
@if (_clusters is not null)
{
@foreach (var c in _clusters)
{
<option value="@c.ClusterId">@c.ClusterId</option>
}
}
</select>
</div>
<div class="col-12">
<label class="form-label">Notes (optional)</label>
<input class="form-control" @bind="_notes"/>
</div>
</div>
@if (_error is not null) { <div class="alert alert-danger mt-3">@_error</div> }
<div class="mt-3">
<button class="btn btn-sm btn-primary" @onclick="SaveAsync">Save</button>
<button class="btn btn-sm btn-secondary ms-2" @onclick="() => _showForm = false">Cancel</button>
</div>
</div>
</div>
}
@code {
private IReadOnlyList<LdapGroupRoleMapping>? _rows;
private List<ServerCluster>? _clusters;
private bool _showForm;
private string _group = string.Empty;
private AdminRole _role = AdminRole.ConfigViewer;
private bool _isSystemWide;
private string _clusterId = string.Empty;
private string? _notes;
private string? _error;
protected override async Task OnInitializedAsync() => await ReloadAsync();
private async Task ReloadAsync()
{
_rows = await RoleSvc.ListAllAsync(CancellationToken.None);
_clusters = await ClusterSvc.ListAsync(CancellationToken.None);
}
private void StartAdd()
{
_group = string.Empty;
_role = AdminRole.ConfigViewer;
_isSystemWide = false;
_clusterId = string.Empty;
_notes = null;
_error = null;
_showForm = true;
}
private async Task SaveAsync()
{
_error = null;
try
{
var row = new LdapGroupRoleMapping
{
LdapGroup = _group.Trim(),
Role = _role,
IsSystemWide = _isSystemWide,
ClusterId = _isSystemWide ? null : (string.IsNullOrWhiteSpace(_clusterId) ? null : _clusterId),
Notes = string.IsNullOrWhiteSpace(_notes) ? null : _notes,
};
await RoleSvc.CreateAsync(row, CancellationToken.None);
await Notifier.NotifyRoleGrantsChangedAsync(CancellationToken.None);
_showForm = false;
await ReloadAsync();
}
catch (Exception ex) { _error = ex.Message; }
}
private async Task DeleteAsync(Guid id)
{
await RoleSvc.DeleteAsync(id, CancellationToken.None);
await Notifier.NotifyRoleGrantsChangedAsync(CancellationToken.None);
await ReloadAsync();
}
private HubConnection? _hub;
protected override async Task OnAfterRenderAsync(bool firstRender)
{
if (!firstRender || _hub is not null) return;
_hub = new HubConnectionBuilder()
.WithUrl(Nav.ToAbsoluteUri("/hubs/fleet"))
.WithAutomaticReconnect()
.Build();
_hub.On<RoleGrantsChangedMessage>("RoleGrantsChanged", _ =>
// Hub callbacks run off the Blazor sync context — marshal the reload + re-render.
InvokeAsync(async () =>
{
await ReloadAsync();
StateHasChanged();
}));
await _hub.StartAsync();
await _hub.SendAsync("SubscribeFleet");
}
public async ValueTask DisposeAsync()
{
if (_hub is not null) { await _hub.DisposeAsync(); _hub = null; }
}
}

View File

@@ -1,7 +1,9 @@
using Microsoft.AspNetCore.SignalR;
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Admin.Services;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Admin.Hubs;
@@ -14,11 +16,13 @@ public sealed class FleetStatusPoller(
IServiceScopeFactory scopeFactory,
IHubContext<FleetStatusHub> fleetHub,
IHubContext<AlertHub> alertHub,
ILogger<FleetStatusPoller> logger) : BackgroundService
ILogger<FleetStatusPoller> logger,
RedundancyMetrics redundancyMetrics) : BackgroundService
{
public TimeSpan PollInterval { get; init; } = TimeSpan.FromSeconds(5);
private readonly Dictionary<string, NodeStateSnapshot> _last = new();
private readonly Dictionary<string, RedundancyRole> _lastRole = new(StringComparer.Ordinal);
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
@@ -42,6 +46,10 @@ public sealed class FleetStatusPoller(
using var scope = scopeFactory.CreateScope();
var db = scope.ServiceProvider.GetRequiredService<OtOpcUaConfigDbContext>();
var nodes = await db.ClusterNodes.AsNoTracking().ToListAsync(ct);
await PollRolesAsync(nodes, ct);
UpdateClusterGauges(nodes);
var rows = await db.ClusterNodeGenerationStates.AsNoTracking()
.Join(db.ClusterNodes.AsNoTracking(), s => s.NodeId, n => n.NodeId, (s, n) => new { s, n.ClusterId })
.ToListAsync(ct);
@@ -85,9 +93,63 @@ public sealed class FleetStatusPoller(
}
/// <summary>Exposed for tests — forces a snapshot reset so stub data re-seeds.</summary>
internal void ResetCache() => _last.Clear();
internal void ResetCache()
{
_last.Clear();
_lastRole.Clear();
}
private async Task PollRolesAsync(IReadOnlyList<ClusterNode> nodes, CancellationToken ct)
{
foreach (var n in nodes)
{
var hadPrior = _lastRole.TryGetValue(n.NodeId, out var priorRole);
if (hadPrior && priorRole == n.RedundancyRole) continue;
_lastRole[n.NodeId] = n.RedundancyRole;
if (!hadPrior) continue; // first-observation bootstrap — not a transition
redundancyMetrics.RecordRoleTransition(
clusterId: n.ClusterId, nodeId: n.NodeId,
fromRole: priorRole.ToString(), toRole: n.RedundancyRole.ToString());
var msg = new RoleChangedMessage(
ClusterId: n.ClusterId, NodeId: n.NodeId,
FromRole: priorRole.ToString(), ToRole: n.RedundancyRole.ToString(),
ObservedAtUtc: DateTime.UtcNow);
await fleetHub.Clients.Group(FleetStatusHub.GroupName(n.ClusterId))
.SendAsync("RoleChanged", msg, ct);
await fleetHub.Clients.Group(FleetStatusHub.FleetGroup)
.SendAsync("RoleChanged", msg, ct);
}
}
private void UpdateClusterGauges(IReadOnlyList<ClusterNode> nodes)
{
var staleCutoff = DateTime.UtcNow - ClusterNodeService.StaleThreshold;
foreach (var group in nodes.GroupBy(n => n.ClusterId))
{
var primary = group.Count(n => n.RedundancyRole == RedundancyRole.Primary);
var secondary = group.Count(n => n.RedundancyRole == RedundancyRole.Secondary);
var stale = group.Count(n => n.LastSeenAt is null || n.LastSeenAt.Value < staleCutoff);
redundancyMetrics.SetClusterCounts(group.Key, primary, secondary, stale);
}
}
private readonly record struct NodeStateSnapshot(
string NodeId, string ClusterId, long? GenerationId,
string? Status, string? Error, DateTime? AppliedAt, DateTime? SeenAt);
}
/// <summary>
/// Pushed by <see cref="FleetStatusPoller"/> when it observes a change in
/// <see cref="ClusterNode.RedundancyRole"/>. Consumed by the Admin RedundancyTab to trigger
/// an instant reload instead of waiting for the next on-parameter-set poll.
/// </summary>
public sealed record RoleChangedMessage(
string ClusterId,
string NodeId,
string FromRole,
string ToRole,
DateTime ObservedAtUtc);
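
The RedundancyTab consumer referenced in the summary is not part of this diff. A hedged sketch of what a subscriber looks like, assuming a `HubConnection` already built against the fleet-status hub and the usual Blazor component members (`InvokeAsync`, `StateHasChanged`, a hypothetical `ReloadAsync`):

```csharp
// Subscribe to role-change pushes so the tab reloads immediately instead of
// waiting for its next on-parameter-set poll. Sketch only — member names are
// assumptions about the RedundancyTab, which is outside this diff.
hub.On<RoleChangedMessage>("RoleChanged", msg =>
    InvokeAsync(async () =>
    {
        // msg.FromRole/msg.ToRole could also drive a toast; here we just refresh.
        await ReloadAsync();
        StateHasChanged();
    }));
```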

View File

@@ -1,6 +1,7 @@
using Microsoft.AspNetCore.Authentication;
using Microsoft.AspNetCore.Authentication.Cookies;
using Microsoft.EntityFrameworkCore;
using OpenTelemetry.Metrics;
using Serilog;
using ZB.MOM.WW.OtOpcUa.Admin.Components;
using ZB.MOM.WW.OtOpcUa.Admin.Hubs;
@@ -44,10 +45,17 @@ builder.Services.AddScoped<UnsService>();
builder.Services.AddScoped<NamespaceService>();
builder.Services.AddScoped<DriverInstanceService>();
builder.Services.AddScoped<NodeAclService>();
builder.Services.AddScoped<PermissionProbeService>();
builder.Services.AddScoped<AclChangeNotifier>();
builder.Services.AddScoped<ReservationService>();
builder.Services.AddScoped<DraftValidationService>();
builder.Services.AddScoped<AuditLogService>();
builder.Services.AddScoped<HostStatusService>();
builder.Services.AddScoped<ClusterNodeService>();
builder.Services.AddSingleton<RedundancyMetrics>();
builder.Services.AddScoped<EquipmentImportBatchService>();
builder.Services.AddScoped<ZB.MOM.WW.OtOpcUa.Configuration.Services.ILdapGroupRoleMappingService,
ZB.MOM.WW.OtOpcUa.Configuration.Services.LdapGroupRoleMappingService>();
// Cert-trust management — reads the OPC UA server's PKI store root so rejected client certs
// can be promoted to trusted via the Admin UI. Singleton: no per-request state, just
@@ -63,6 +71,19 @@ builder.Services.AddScoped<ILdapAuthService, LdapAuthService>();
// SignalR real-time fleet status + alerts (admin-ui.md §"Real-Time Updates").
builder.Services.AddHostedService<FleetStatusPoller>();
// OpenTelemetry Prometheus exporter — Meter stream from RedundancyMetrics + any future
// Admin-side instrumentation lands on the /metrics endpoint Prometheus scrapes. Pull-based
// means no OTel Collector deployment required for the common deploy-in-a-K8s case; appsettings
// Metrics:Prometheus:Enabled=false disables the endpoint entirely for locked-down deployments.
var metricsEnabled = builder.Configuration.GetValue("Metrics:Prometheus:Enabled", true);
if (metricsEnabled)
{
builder.Services.AddOpenTelemetry()
.WithMetrics(m => m
.AddMeter(RedundancyMetrics.MeterName)
.AddPrometheusExporter());
}
var app = builder.Build();
app.UseSerilogRequestLogging();
@@ -80,6 +101,15 @@ app.MapPost("/auth/logout", async (HttpContext ctx) =>
app.MapHub<FleetStatusHub>("/hubs/fleet");
app.MapHub<AlertHub>("/hubs/alerts");
if (metricsEnabled)
{
// Prometheus scrape endpoint — expose instrumentation registered in the OTel MeterProvider
// above. Emits text-format metrics at /metrics; auth is intentionally NOT required (Prometheus
// scrape jobs typically run on a trusted network). Operators who need auth put the endpoint
// behind a reverse-proxy basic-auth gate per fleet-ops convention.
app.MapPrometheusScrapingEndpoint();
}
app.MapRazorComponents<App>().AddInteractiveServerRenderMode();
await app.RunAsync();

View File

@@ -0,0 +1,49 @@
using Microsoft.AspNetCore.SignalR;
using ZB.MOM.WW.OtOpcUa.Admin.Hubs;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// Thin SignalR push helper for ACL + role-grant invalidation — slice 2 of task #196.
/// Lets the Admin services + razor pages invalidate connected peers' views without each
/// one having to know the hub wiring. Two message kinds: <c>NodeAclChanged</c> (cluster-scoped)
/// and <c>RoleGrantsChanged</c> (fleet-wide — role mappings cross cluster boundaries).
/// </summary>
/// <remarks>
/// Intentionally fire-and-forget — a failed hub send doesn't roll back the DB write that
/// triggered it. Worst case, an operator sees stale data until their next poll or manual
/// refresh; better than a transient hub blip blocking the authoritative write path.
/// </remarks>
public sealed class AclChangeNotifier(IHubContext<FleetStatusHub> fleetHub, ILogger<AclChangeNotifier> logger)
{
public async Task NotifyNodeAclChangedAsync(string clusterId, long generationId, CancellationToken ct)
{
try
{
var msg = new NodeAclChangedMessage(ClusterId: clusterId, GenerationId: generationId, ObservedAtUtc: DateTime.UtcNow);
await fleetHub.Clients.Group(FleetStatusHub.GroupName(clusterId))
.SendAsync("NodeAclChanged", msg, ct).ConfigureAwait(false);
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
logger.LogWarning(ex, "NodeAclChanged push failed for cluster {ClusterId} gen {GenerationId}", clusterId, generationId);
}
}
public async Task NotifyRoleGrantsChangedAsync(CancellationToken ct)
{
try
{
var msg = new RoleGrantsChangedMessage(ObservedAtUtc: DateTime.UtcNow);
await fleetHub.Clients.Group(FleetStatusHub.FleetGroup)
.SendAsync("RoleGrantsChanged", msg, ct).ConfigureAwait(false);
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
logger.LogWarning(ex, "RoleGrantsChanged push failed");
}
}
}
public sealed record NodeAclChangedMessage(string ClusterId, long GenerationId, DateTime ObservedAtUtc);
public sealed record RoleGrantsChangedMessage(DateTime ObservedAtUtc);

View File

@@ -0,0 +1,28 @@
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// Read-side service for ClusterNode rows + their cluster-scoped redundancy view. Consumed
/// by the RedundancyTab on the cluster detail page. Writes (role swap, node enable/disable)
/// are not supported here — role swap happens through the RedundancyCoordinator apply-lease
/// flow on the server side and would conflict with any direct DB mutation from Admin.
/// </summary>
public sealed class ClusterNodeService(OtOpcUaConfigDbContext db)
{
/// <summary>Stale threshold matching <c>HostStatusService.StaleThreshold</c> — 30s of clock
/// tolerance covers a missed heartbeat plus publisher GC pauses.</summary>
public static readonly TimeSpan StaleThreshold = TimeSpan.FromSeconds(30);
public Task<List<ClusterNode>> ListByClusterAsync(string clusterId, CancellationToken ct) =>
db.ClusterNodes.AsNoTracking()
.Where(n => n.ClusterId == clusterId)
.OrderByDescending(n => n.ServiceLevelBase)
.ThenBy(n => n.NodeId)
.ToListAsync(ct);
public static bool IsStale(ClusterNode node) =>
node.LastSeenAt is null || DateTime.UtcNow - node.LastSeenAt.Value > StaleThreshold;
}

View File

@@ -2,6 +2,7 @@ using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Admin.Services;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
@@ -152,14 +153,37 @@ public sealed class EquipmentImportBatchService(OtOpcUaConfigDbContext db)
try
{
foreach (var row in batch.Rows.Where(r => r.IsAccepted))
// Snapshot active reservations that overlap this batch's ZTag + SAPID set — one
// round-trip instead of N. Released rows (ReleasedAt IS NOT NULL) are ignored so
// an explicitly-released value can be reused.
var accepted = batch.Rows.Where(r => r.IsAccepted).ToList();
var zTags = accepted.Where(r => !string.IsNullOrWhiteSpace(r.ZTag))
.Select(r => r.ZTag).Distinct(StringComparer.OrdinalIgnoreCase).ToList();
var sapIds = accepted.Where(r => !string.IsNullOrWhiteSpace(r.SAPID))
.Select(r => r.SAPID).Distinct(StringComparer.OrdinalIgnoreCase).ToList();
var existingReservations = await db.ExternalIdReservations
.Where(r => r.ReleasedAt == null &&
((r.Kind == ReservationKind.ZTag && zTags.Contains(r.Value)) ||
(r.Kind == ReservationKind.SAPID && sapIds.Contains(r.Value))))
.ToListAsync(ct).ConfigureAwait(false);
var resByKey = existingReservations.ToDictionary(
r => (r.Kind, r.Value.ToLowerInvariant()),
r => r);
var nowUtc = DateTime.UtcNow;
var firstPublishedBy = batch.CreatedBy;
foreach (var row in accepted)
{
var equipmentUuid = Guid.TryParse(row.EquipmentUuid, out var u) ? u : Guid.NewGuid();
db.Equipment.Add(new Equipment
{
EquipmentRowId = Guid.NewGuid(),
GenerationId = generationId,
EquipmentId = row.EquipmentId,
EquipmentUuid = Guid.TryParse(row.EquipmentUuid, out var u) ? u : Guid.NewGuid(),
EquipmentUuid = equipmentUuid,
DriverInstanceId = driverInstanceIdForRows,
UnsLineId = unsLineIdForRows,
Name = row.Name,
@@ -176,10 +200,25 @@ public sealed class EquipmentImportBatchService(OtOpcUaConfigDbContext db)
ManufacturerUri = row.ManufacturerUri,
DeviceManualUri = row.DeviceManualUri,
});
MergeReservation(row.ZTag, ReservationKind.ZTag, equipmentUuid, batch.ClusterId,
firstPublishedBy, nowUtc, resByKey);
MergeReservation(row.SAPID, ReservationKind.SAPID, equipmentUuid, batch.ClusterId,
firstPublishedBy, nowUtc, resByKey);
}
batch.FinalisedAtUtc = DateTime.UtcNow;
await db.SaveChangesAsync(ct).ConfigureAwait(false);
batch.FinalisedAtUtc = nowUtc;
try
{
await db.SaveChangesAsync(ct).ConfigureAwait(false);
}
catch (DbUpdateException ex) when (IsReservationUniquenessViolation(ex))
{
throw new ExternalIdReservationConflictException(
"Finalise rejected: one or more ZTag/SAPID values were reserved by another operator " +
"between batch preview and commit. Inspect active reservations + retry after resolving the conflict.",
ex);
}
if (tx is not null) await tx.CommitAsync(ct).ConfigureAwait(false);
}
catch
@@ -193,6 +232,71 @@ public sealed class EquipmentImportBatchService(OtOpcUaConfigDbContext db)
}
}
/// <summary>
/// Merge one external-ID reservation for an equipment row. Four outcomes:
/// (1) value is empty → skip; (2) reservation exists for same <paramref name="equipmentUuid"/>
/// → bump <c>LastPublishedAt</c>; (3) reservation exists for a different EquipmentUuid
/// → throw <see cref="ExternalIdReservationConflictException"/> with the conflicting UUID
/// so the caller sees which equipment already owns the value; (4) no reservation → create new.
/// </summary>
private void MergeReservation(
string? value,
ReservationKind kind,
Guid equipmentUuid,
string clusterId,
string firstPublishedBy,
DateTime nowUtc,
Dictionary<(ReservationKind, string), ExternalIdReservation> cache)
{
if (string.IsNullOrWhiteSpace(value)) return;
var key = (kind, value.ToLowerInvariant());
if (cache.TryGetValue(key, out var existing))
{
if (existing.EquipmentUuid != equipmentUuid)
throw new ExternalIdReservationConflictException(
$"{kind} '{value}' is already reserved by EquipmentUuid {existing.EquipmentUuid} " +
$"(first published {existing.FirstPublishedAt:u} on cluster '{existing.ClusterId}'). " +
$"Refusing to re-assign to {equipmentUuid}.");
existing.LastPublishedAt = nowUtc;
return;
}
var fresh = new ExternalIdReservation
{
ReservationId = Guid.NewGuid(),
Kind = kind,
Value = value,
EquipmentUuid = equipmentUuid,
ClusterId = clusterId,
FirstPublishedAt = nowUtc,
FirstPublishedBy = firstPublishedBy,
LastPublishedAt = nowUtc,
};
db.ExternalIdReservations.Add(fresh);
cache[key] = fresh;
}
/// <summary>
/// True when the <see cref="DbUpdateException"/> root cause was a violation of the filtered
/// unique index <c>UX_ExternalIdReservation_KindValue_Active</c> — i.e. another transaction
/// won the race between our cache-load + commit. SQL Server surfaces this as 2601 / 2627.
/// </summary>
private static bool IsReservationUniquenessViolation(DbUpdateException ex)
{
for (Exception? inner = ex; inner is not null; inner = inner.InnerException)
{
if (inner is Microsoft.Data.SqlClient.SqlException sql &&
(sql.Number == 2601 || sql.Number == 2627) &&
sql.Message.Contains("UX_ExternalIdReservation_KindValue_Active", StringComparison.OrdinalIgnoreCase))
{
return true;
}
}
return false;
}
/// <summary>List batches created by the given user. Finalised batches are archived; include them on demand.</summary>
public async Task<IReadOnlyList<EquipmentImportBatch>> ListByUserAsync(string createdBy, bool includeFinalised, CancellationToken ct)
{
@@ -205,3 +309,16 @@ public sealed class EquipmentImportBatchService(OtOpcUaConfigDbContext db)
public sealed class ImportBatchNotFoundException(string message) : Exception(message);
public sealed class ImportBatchAlreadyFinalisedException(string message) : Exception(message);
/// <summary>
/// Thrown when a <c>FinaliseBatchAsync</c> call detects that one of its ZTag/SAPID values is
/// already reserved by a different EquipmentUuid — either from a prior published generation
/// or a concurrent finalise that won the race. The operator sees the message + the conflicting
/// equipment ownership so they can resolve the conflict (pick a new ZTag, release the existing
/// reservation via <c>sp_ReleaseExternalIdReservation</c>, etc.) and retry the finalise.
/// </summary>
public sealed class ExternalIdReservationConflictException : Exception
{
public ExternalIdReservationConflictException(string message) : base(message) { }
public ExternalIdReservationConflictException(string message, Exception inner) : base(message, inner) { }
}
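
The filtered unique index itself is created by a migration outside this diff. Assuming EF Core fluent configuration, it would look roughly like the following — the filter predicate and index columns are inferred from the index name and the `ReleasedAt == null` semantics in the query above, not confirmed by the source:

```csharp
// Hypothetical OnModelCreating fragment — column/filter details are assumptions.
modelBuilder.Entity<ExternalIdReservation>(e =>
{
    e.HasIndex(r => new { r.Kind, r.Value })
        .IsUnique()
        .HasFilter("[ReleasedAt] IS NULL")   // released rows free the value for reuse
        .HasDatabaseName("UX_ExternalIdReservation_KindValue_Active");
});
```

This is what makes the 2601/2627 race detection safe: two concurrent finalises can both pass the in-memory cache check, but only one insert survives the index.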

View File

@@ -5,7 +5,7 @@ using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
public sealed class NodeAclService(OtOpcUaConfigDbContext db)
public sealed class NodeAclService(OtOpcUaConfigDbContext db, AclChangeNotifier? notifier = null)
{
public Task<List<NodeAcl>> ListAsync(long generationId, CancellationToken ct) =>
db.NodeAcls.AsNoTracking()
@@ -31,6 +31,10 @@ public sealed class NodeAclService(OtOpcUaConfigDbContext db)
};
db.NodeAcls.Add(acl);
await db.SaveChangesAsync(ct);
if (notifier is not null)
await notifier.NotifyNodeAclChangedAsync(clusterId, draftId, ct);
return acl;
}
@@ -40,5 +44,8 @@ public sealed class NodeAclService(OtOpcUaConfigDbContext db)
if (row is null) return;
db.NodeAcls.Remove(row);
await db.SaveChangesAsync(ct);
if (notifier is not null)
await notifier.NotifyNodeAclChangedAsync(row.ClusterId, row.GenerationId, ct);
}
}

View File

@@ -0,0 +1,63 @@
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
using ZB.MOM.WW.OtOpcUa.Core.Authorization;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// Runs an ad-hoc permission probe against a draft or published generation's NodeAcl rows —
/// "if LDAP group X asks for permission Y on node Z, would the trie grant it, and which
/// rows contributed?" Powers the AclsTab "Probe this permission" form per the #196 sub-slice.
/// </summary>
/// <remarks>
/// Thin wrapper over <see cref="PermissionTrieBuilder"/> + <see cref="PermissionTrie.CollectMatches"/> —
/// the same code path the Server's dispatch layer uses at request time, so a probe result
/// is guaranteed to match what the live server would decide. The probe is read-only + has
/// no side effects; failing probes do NOT generate audit log rows.
/// </remarks>
public sealed class PermissionProbeService(OtOpcUaConfigDbContext db)
{
/// <summary>
/// Evaluate <paramref name="required"/> against the NodeAcl rows of
/// <paramref name="generationId"/> for a request by <paramref name="ldapGroup"/> at
/// <paramref name="scope"/>. Returns whether the permission would be granted + the list
/// of matching grants so the UI can show *why*.
/// </summary>
public async Task<PermissionProbeResult> ProbeAsync(
long generationId,
string ldapGroup,
NodeScope scope,
NodePermissions required,
CancellationToken ct)
{
ArgumentException.ThrowIfNullOrWhiteSpace(ldapGroup);
ArgumentNullException.ThrowIfNull(scope);
var rows = await db.NodeAcls.AsNoTracking()
.Where(a => a.GenerationId == generationId && a.ClusterId == scope.ClusterId)
.ToListAsync(ct).ConfigureAwait(false);
var trie = PermissionTrieBuilder.Build(scope.ClusterId, generationId, rows);
var matches = trie.CollectMatches(scope, [ldapGroup]);
var effective = NodePermissions.None;
foreach (var m in matches)
effective |= m.PermissionFlags;
var granted = (effective & required) == required;
return new PermissionProbeResult(
Granted: granted,
Required: required,
Effective: effective,
Matches: matches);
}
}
/// <summary>Outcome of a <see cref="PermissionProbeService.ProbeAsync"/> call.</summary>
public sealed record PermissionProbeResult(
bool Granted,
NodePermissions Required,
NodePermissions Effective,
IReadOnlyList<MatchedGrant> Matches);

View File

@@ -0,0 +1,102 @@
using System.Diagnostics.Metrics;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// OpenTelemetry-compatible instrumentation for the redundancy surface. Uses in-box
/// <see cref="System.Diagnostics.Metrics"/> so no NuGet dependency is required to emit —
/// any MeterListener (dotnet-counters, OpenTelemetry.Extensions.Hosting OTLP exporter,
/// Prometheus exporter, etc.) picks up the instruments by the <see cref="MeterName"/>.
/// </summary>
/// <remarks>
/// Exporter configuration (OTLP, Prometheus, etc.) is intentionally NOT wired here —
/// that's a deployment-ops decision that belongs in <c>Program.cs</c> behind an
/// <c>appsettings</c> toggle. This class owns only the Meter + instruments so the
/// production data stream exists regardless of exporter availability.
///
/// Counter + gauge names follow the otel-semantic-conventions pattern:
/// <c>otopcua.redundancy.*</c> with tags for ClusterId + (for transitions) FromRole/ToRole/NodeId.
/// </remarks>
public sealed class RedundancyMetrics : IDisposable
{
public const string MeterName = "ZB.MOM.WW.OtOpcUa.Redundancy";
private readonly Meter _meter;
private readonly Counter<long> _roleTransitions;
private readonly object _gaugeLock = new();
private readonly Dictionary<string, ClusterGaugeState> _gaugeState = new();
public RedundancyMetrics()
{
_meter = new Meter(MeterName, version: "1.0.0");
_roleTransitions = _meter.CreateCounter<long>(
"otopcua.redundancy.role_transition",
unit: "{transition}",
description: "Observed RedundancyRole changes per node — tagged FromRole, ToRole, NodeId, ClusterId.");
// Observable gauges — the callbacks report whatever the last SetClusterCounts call stashed.
_meter.CreateObservableGauge(
"otopcua.redundancy.primary_count",
ObservePrimaryCounts,
unit: "{node}",
description: "Count of Primary-role nodes per cluster (should be 1 for N+1 redundant clusters, 0 during failover).");
_meter.CreateObservableGauge(
"otopcua.redundancy.secondary_count",
ObserveSecondaryCounts,
unit: "{node}",
description: "Count of Secondary-role nodes per cluster.");
_meter.CreateObservableGauge(
"otopcua.redundancy.stale_count",
ObserveStaleCounts,
unit: "{node}",
description: "Count of cluster nodes whose LastSeenAt is older than StaleThreshold.");
}
/// <summary>
/// Update the per-cluster snapshot consumed by the ObservableGauges. Poller calls this
/// at the end of every tick so the collectors see fresh numbers on the next observation
/// window (by default 1s for dotnet-counters, configurable per exporter).
/// </summary>
public void SetClusterCounts(string clusterId, int primary, int secondary, int stale)
{
lock (_gaugeLock)
{
_gaugeState[clusterId] = new ClusterGaugeState(primary, secondary, stale);
}
}
/// <summary>
/// Increment the role_transition counter when a node's RedundancyRole changes. Tags
/// allow breakdowns by from/to roles (e.g. Primary → Secondary for planned failover vs
/// Primary → Standalone for emergency recovery) + by cluster for multi-site fleets.
/// </summary>
public void RecordRoleTransition(string clusterId, string nodeId, string fromRole, string toRole)
{
_roleTransitions.Add(1,
new KeyValuePair<string, object?>("cluster.id", clusterId),
new KeyValuePair<string, object?>("node.id", nodeId),
new KeyValuePair<string, object?>("from_role", fromRole),
new KeyValuePair<string, object?>("to_role", toRole));
}
public void Dispose() => _meter.Dispose();
private IEnumerable<Measurement<long>> ObservePrimaryCounts() => SnapshotGauge(s => s.Primary);
private IEnumerable<Measurement<long>> ObserveSecondaryCounts() => SnapshotGauge(s => s.Secondary);
private IEnumerable<Measurement<long>> ObserveStaleCounts() => SnapshotGauge(s => s.Stale);
private IEnumerable<Measurement<long>> SnapshotGauge(Func<ClusterGaugeState, int> selector)
{
List<Measurement<long>> results;
lock (_gaugeLock)
{
results = new List<Measurement<long>>(_gaugeState.Count);
foreach (var (cluster, state) in _gaugeState)
results.Add(new Measurement<long>(selector(state),
new KeyValuePair<string, object?>("cluster.id", cluster)));
}
return results;
}
private readonly record struct ClusterGaugeState(int Primary, int Secondary, int Stale);
}
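
As a usage sketch (not part of the diff): the instruments above can be driven and observed with the BCL's `MeterListener`, mirroring how dotnet-counters pulls the observable gauges. `RedundancyMetrics` is the class above; the cluster/node ids are placeholders.

```csharp
using System;
using System.Diagnostics.Metrics;

// Stash a snapshot + record a transition, then pull the gauges once.
var metrics = new RedundancyMetrics();
metrics.SetClusterCounts("plant-a", primary: 1, secondary: 2, stale: 0);
metrics.RecordRoleTransition("plant-a", "node-3", "Secondary", "Primary");

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to this meter's instruments.
    if (instrument.Meter.Name == RedundancyMetrics.MeterName)
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, _) =>
    Console.WriteLine($"{instrument.Name} = {value}"));
listener.Start();
listener.RecordObservableInstruments(); // invokes the Observe* gauge callbacks
```

The counter measurement surfaces immediately on `RecordRoleTransition`; the three gauges only report when a collector calls `RecordObservableInstruments` (or on the exporter's observation interval).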

View File

@@ -1,3 +1,5 @@
using System.Security.Cryptography;
using System.Text;
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
@@ -47,4 +49,132 @@ public sealed class UnsService(OtOpcUaConfigDbContext db)
await db.SaveChangesAsync(ct);
return line;
}
/// <summary>
/// Build the full UNS tree snapshot for the analyzer. Walks areas + lines in the draft
/// and counts equipment per line (TagCount is reported as 0 until tag aggregation lands).
/// Returns the snapshot plus a deterministic revision token computed by SHA-256'ing the
/// sorted (kind, id, parent, name, notes) tuples — stable across processes + changes
/// whenever any row is added / modified / deleted.
/// </summary>
public async Task<UnsTreeSnapshot> LoadSnapshotAsync(long generationId, CancellationToken ct)
{
var areas = await db.UnsAreas.AsNoTracking()
.Where(a => a.GenerationId == generationId)
.OrderBy(a => a.UnsAreaId)
.ToListAsync(ct);
var lines = await db.UnsLines.AsNoTracking()
.Where(l => l.GenerationId == generationId)
.OrderBy(l => l.UnsLineId)
.ToListAsync(ct);
var equipmentCounts = await db.Equipment.AsNoTracking()
.Where(e => e.GenerationId == generationId)
.GroupBy(e => e.UnsLineId)
.Select(g => new { LineId = g.Key, Count = g.Count() })
.ToListAsync(ct);
var equipmentByLine = equipmentCounts.ToDictionary(x => x.LineId, x => x.Count, StringComparer.OrdinalIgnoreCase);
var lineSummaries = lines.Select(l =>
new UnsLineSummary(
LineId: l.UnsLineId,
Name: l.Name,
EquipmentCount: equipmentByLine.GetValueOrDefault(l.UnsLineId),
TagCount: 0)).ToList();
var areaSummaries = areas.Select(a =>
new UnsAreaSummary(
AreaId: a.UnsAreaId,
Name: a.Name,
LineIds: lines.Where(l => string.Equals(l.UnsAreaId, a.UnsAreaId, StringComparison.OrdinalIgnoreCase))
.Select(l => l.UnsLineId).ToList())).ToList();
return new UnsTreeSnapshot
{
DraftGenerationId = generationId,
RevisionToken = ComputeRevisionToken(areas, lines),
Areas = areaSummaries,
Lines = lineSummaries,
};
}
/// <summary>
/// Atomic re-parent of a line to a new area inside the same draft. The caller must pass
/// the revision token it observed at preview time — a mismatch raises
/// <see cref="DraftRevisionConflictException"/> so the UI can show the 409 concurrent-edit
/// modal instead of silently overwriting a peer's work.
/// </summary>
public async Task MoveLineAsync(
long generationId,
DraftRevisionToken expected,
string lineId,
string targetAreaId,
CancellationToken ct)
{
ArgumentNullException.ThrowIfNull(expected);
ArgumentException.ThrowIfNullOrWhiteSpace(lineId);
ArgumentException.ThrowIfNullOrWhiteSpace(targetAreaId);
var supportsTx = db.Database.IsRelational();
Microsoft.EntityFrameworkCore.Storage.IDbContextTransaction? tx = null;
if (supportsTx) tx = await db.Database.BeginTransactionAsync(ct).ConfigureAwait(false);
try
{
var areas = await db.UnsAreas
.Where(a => a.GenerationId == generationId)
.OrderBy(a => a.UnsAreaId)
.ToListAsync(ct);
var lines = await db.UnsLines
.Where(l => l.GenerationId == generationId)
.OrderBy(l => l.UnsLineId)
.ToListAsync(ct);
var current = ComputeRevisionToken(areas, lines);
if (!current.Matches(expected))
throw new DraftRevisionConflictException(
$"Draft {generationId} changed since preview. Expected revision {expected.Value}, saw {current.Value}. " +
"Refresh + redo the move.");
var line = lines.FirstOrDefault(l => string.Equals(l.UnsLineId, lineId, StringComparison.OrdinalIgnoreCase))
?? throw new InvalidOperationException($"Line '{lineId}' not found in draft {generationId}.");
if (!areas.Any(a => string.Equals(a.UnsAreaId, targetAreaId, StringComparison.OrdinalIgnoreCase)))
throw new InvalidOperationException($"Target area '{targetAreaId}' not found in draft {generationId}.");
if (string.Equals(line.UnsAreaId, targetAreaId, StringComparison.OrdinalIgnoreCase))
return; // no-op drop — same area
line.UnsAreaId = targetAreaId;
await db.SaveChangesAsync(ct);
if (tx is not null) await tx.CommitAsync(ct).ConfigureAwait(false);
}
catch
{
if (tx is not null) await tx.RollbackAsync(ct).ConfigureAwait(false);
throw;
}
finally
{
if (tx is not null) await tx.DisposeAsync().ConfigureAwait(false);
}
}
private static DraftRevisionToken ComputeRevisionToken(IReadOnlyList<UnsArea> areas, IReadOnlyList<UnsLine> lines)
{
var sb = new StringBuilder(capacity: 256 + (areas.Count + lines.Count) * 80);
foreach (var a in areas.OrderBy(a => a.UnsAreaId, StringComparer.Ordinal))
sb.Append("A:").Append(a.UnsAreaId).Append('|').Append(a.Name).Append('|').Append(a.Notes ?? "").Append(';');
foreach (var l in lines.OrderBy(l => l.UnsLineId, StringComparer.Ordinal))
sb.Append("L:").Append(l.UnsLineId).Append('|').Append(l.UnsAreaId).Append('|').Append(l.Name).Append('|').Append(l.Notes ?? "").Append(';');
var hash = SHA256.HashData(Encoding.UTF8.GetBytes(sb.ToString()));
return new DraftRevisionToken(Convert.ToHexStringLower(hash)[..16]);
}
}
/// <summary>Thrown when a UNS move's expected revision token no longer matches the live draft
/// — another operator mutated the draft between preview + commit. Caller surfaces a 409-style
/// "refresh required" modal in the Admin UI.</summary>
public sealed class DraftRevisionConflictException(string message) : Exception(message);
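
The preview-then-move flow can be sketched as follows, assuming the `UnsService` and `DraftRevisionConflictException` above. The `uns` instance, the ids, and `ShowRefreshRequiredModalAsync` are hypothetical placeholders for the Admin UI's wiring.

```csharp
// Preview: capture the revision token the operator saw.
var snapshot = await uns.LoadSnapshotAsync(generationId: 42, ct);

try
{
    // Commit: pass the observed token so a concurrent edit is detected, not overwritten.
    await uns.MoveLineAsync(42, snapshot.RevisionToken, "line-7", "area-packaging", ct);
}
catch (DraftRevisionConflictException ex)
{
    // Another operator mutated the draft between preview + commit:
    // surface the 409-style modal and let the user reload + retry.
    await ShowRefreshRequiredModalAsync(ex.Message);
}
```

The token is recomputed server-side from the live rows, so the comparison costs two reads plus a hash; no extra version column or row locking is needed.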

View File

@@ -16,10 +16,13 @@
<PackageReference Include="Novell.Directory.Ldap.NETStandard" Version="3.6.0"/>
<PackageReference Include="Microsoft.AspNetCore.SignalR.Client" Version="10.0.0"/>
<PackageReference Include="Serilog.AspNetCore" Version="9.0.0"/>
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.15.2"/>
<PackageReference Include="OpenTelemetry.Exporter.Prometheus.AspNetCore" Version="1.15.2-beta.1"/>
</ItemGroup>
<ItemGroup>
<ProjectReference Include="..\ZB.MOM.WW.OtOpcUa.Configuration\ZB.MOM.WW.OtOpcUa.Configuration.csproj"/>
<ProjectReference Include="..\ZB.MOM.WW.OtOpcUa.Core\ZB.MOM.WW.OtOpcUa.Core.csproj"/>
</ItemGroup>
<ItemGroup>

View File

@@ -23,5 +23,10 @@
},
"Serilog": {
"MinimumLevel": "Information"
},
"Metrics": {
"Prometheus": {
"Enabled": true
}
}
}

View File

@@ -0,0 +1,10 @@
; Shipped analyzer releases.
; See https://github.com/dotnet/roslyn-analyzers/blob/main/src/Microsoft.CodeAnalysis.Analyzers/ReleaseTrackingAnalyzers.Help.md
## Release 1.0
### New Rules
Rule ID | Category | Severity | Notes
--------|----------|----------|-------
OTOPCUA0001 | OtOpcUa.Resilience | Warning | Direct driver-capability call bypasses CapabilityInvoker

View File

@@ -0,0 +1,2 @@
; Unshipped analyzer release.
; See https://github.com/dotnet/roslyn-analyzers/blob/main/src/Microsoft.CodeAnalysis.Analyzers/ReleaseTrackingAnalyzers.Help.md

View File

@@ -0,0 +1,143 @@
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Diagnostics;
using Microsoft.CodeAnalysis.Operations;
namespace ZB.MOM.WW.OtOpcUa.Analyzers;
/// <summary>
/// Diagnostic analyzer that flags direct invocations of Phase 6.1-wrapped driver-capability
/// methods when the call is NOT already running inside a <c>CapabilityInvoker.ExecuteAsync</c>,
/// <c>CapabilityInvoker.ExecuteWriteAsync</c>, or <c>AlarmSurfaceInvoker.*Async</c> lambda.
/// The wrapping is what gives us per-host breaker isolation, retry semantics, bulkhead-depth
/// accounting, and alarm-ack idempotence guards — raw calls bypass all of that.
/// </summary>
/// <remarks>
/// The analyzer matches by receiver-interface identity using Roslyn's semantic model, not by
/// method name, so a driver with an unusually-named method implementing <c>IReadable.ReadAsync</c>
/// still trips the rule. Lambda-context detection walks up the syntax tree from the call site
/// + checks whether any enclosing <c>InvocationExpressionSyntax</c> targets a member whose
/// containing type is <c>CapabilityInvoker</c> or <c>AlarmSurfaceInvoker</c>. The rule is
/// intentionally narrow: it does NOT try to enforce the capability argument matches the
/// method (e.g. ReadAsync wrapped in <c>ExecuteAsync(DriverCapability.Write, ...)</c> still
/// passes) — that'd require flow analysis beyond single-expression scope.
/// </remarks>
[DiagnosticAnalyzer(LanguageNames.CSharp)]
public sealed class UnwrappedCapabilityCallAnalyzer : DiagnosticAnalyzer
{
public const string DiagnosticId = "OTOPCUA0001";
/// <summary>Interfaces whose methods must be called through the capability invoker.</summary>
private static readonly string[] GuardedInterfaces =
[
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.IReadable",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.IWritable",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.ITagDiscovery",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.ISubscribable",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.IHostConnectivityProbe",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.IAlarmSource",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.IHistoryProvider",
];
/// <summary>Wrapper types whose lambda arguments are the allowed home for guarded calls.</summary>
private static readonly string[] WrapperTypes =
[
"ZB.MOM.WW.OtOpcUa.Core.Resilience.CapabilityInvoker",
"ZB.MOM.WW.OtOpcUa.Core.Resilience.AlarmSurfaceInvoker",
];
private static readonly DiagnosticDescriptor Rule = new(
id: DiagnosticId,
title: "Driver capability call must be wrapped in CapabilityInvoker",
messageFormat: "Call to '{0}' is not wrapped in CapabilityInvoker.ExecuteAsync / ExecuteWriteAsync / AlarmSurfaceInvoker.*. Without the wrapping, Phase 6.1 resilience (retry, breaker, bulkhead, tracker telemetry) is bypassed for this call.",
category: "OtOpcUa.Resilience",
defaultSeverity: DiagnosticSeverity.Warning,
isEnabledByDefault: true,
description: "Phase 6.1 Stream A requires every IReadable/IWritable/ITagDiscovery/ISubscribable/IHostConnectivityProbe/IAlarmSource/IHistoryProvider call to route through the shared Polly pipeline. Direct calls skip the pipeline + lose per-host isolation, retry semantics, and telemetry. If the caller is Core/Server/Driver dispatch code, wrap the call in CapabilityInvoker.ExecuteAsync. If the caller is a unit test invoking the driver directly to test its wire-level behavior, either suppress with a pragma or move the suppression into a NoWarn for the test project.");
public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics { get; } = ImmutableArray.Create(Rule);
public override void Initialize(AnalysisContext context)
{
context.ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None);
context.EnableConcurrentExecution();
context.RegisterOperationAction(AnalyzeInvocation, OperationKind.Invocation);
}
private static void AnalyzeInvocation(OperationAnalysisContext context)
{
var invocation = (IInvocationOperation)context.Operation;
var method = invocation.TargetMethod;
// Narrow the rule to async wire calls. Synchronous accessors like
// IHostConnectivityProbe.GetHostStatuses() are pure in-memory snapshots + would never
// benefit from the Polly pipeline; flagging them just creates false-positives.
if (!IsAsyncReturningType(method.ReturnType)) return;
if (!ImplementsGuardedInterface(method)) return;
if (IsInsideWrapperLambda(invocation.Syntax, context.Operation.SemanticModel, context.CancellationToken)) return;
var diag = Diagnostic.Create(Rule, invocation.Syntax.GetLocation(), $"{method.ContainingType.Name}.{method.Name}");
context.ReportDiagnostic(diag);
}
private static bool IsAsyncReturningType(ITypeSymbol type)
{
var name = type.OriginalDefinition.ToDisplayString(SymbolDisplayFormat.FullyQualifiedFormat);
return name is "global::System.Threading.Tasks.Task"
or "global::System.Threading.Tasks.Task<TResult>"
or "global::System.Threading.Tasks.ValueTask"
or "global::System.Threading.Tasks.ValueTask<TResult>";
}
private static bool ImplementsGuardedInterface(IMethodSymbol method)
{
foreach (var iface in method.ContainingType.AllInterfaces.Concat(new[] { method.ContainingType }))
{
var ifaceFqn = iface.OriginalDefinition.ToDisplayString(SymbolDisplayFormat.FullyQualifiedFormat)
.Replace("global::", string.Empty);
if (!GuardedInterfaces.Contains(ifaceFqn)) continue;
foreach (var member in iface.GetMembers().OfType<IMethodSymbol>())
{
var impl = method.ContainingType.FindImplementationForInterfaceMember(member);
if (SymbolEqualityComparer.Default.Equals(impl, method) ||
SymbolEqualityComparer.Default.Equals(method.OriginalDefinition, member))
return true;
}
}
return false;
}
private static bool IsInsideWrapperLambda(SyntaxNode startNode, SemanticModel? semanticModel, System.Threading.CancellationToken ct)
{
if (semanticModel is null) return false;
for (var node = startNode.Parent; node is not null; node = node.Parent)
{
// We only care about an enclosing invocation — the call we're auditing must literally
// live inside a lambda (ParenthesizedLambda / SimpleLambda / AnonymousMethod) that is
// an argument of a CapabilityInvoker.Execute* / AlarmSurfaceInvoker.* call.
if (node is not InvocationExpressionSyntax outer) continue;
var sym = semanticModel.GetSymbolInfo(outer, ct).Symbol as IMethodSymbol;
if (sym is null) continue;
var outerTypeFqn = sym.ContainingType.OriginalDefinition.ToDisplayString(SymbolDisplayFormat.FullyQualifiedFormat)
.Replace("global::", string.Empty);
if (!WrapperTypes.Contains(outerTypeFqn)) continue;
// The call is wrapped IFF our startNode is transitively inside one of the outer
// call's argument lambdas. Walk the outer invocation's argument list + check whether
// any lambda body contains the startNode's position.
foreach (var arg in outer.ArgumentList.Arguments)
{
if (arg.Expression is not AnonymousFunctionExpressionSyntax lambda) continue;
if (lambda.Span.Contains(startNode.Span)) return true;
}
}
return false;
}
}
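
Concretely, the rule distinguishes call sites like the following. This is an illustrative fragment, not code from the repo: the `IReadable`/`CapabilityInvoker` signatures, `DriverCapability.Read`, and `TagValue` are assumed shapes of the Core.Abstractions / Core.Resilience types the analyzer guards.

```csharp
// Flagged: a raw driver call — no breaker, retry, bulkhead, or tracker telemetry.
public async Task<TagValue> ReadUnwrapped(IReadable driver, string tagId, CancellationToken ct)
    => await driver.ReadAsync(tagId, ct); // OTOPCUA0001

// Allowed: the same call inside a CapabilityInvoker lambda argument, which is
// exactly the shape IsInsideWrapperLambda walks up the syntax tree to find.
public async Task<TagValue> ReadWrapped(CapabilityInvoker invoker, IReadable driver, string tagId, CancellationToken ct)
    => await invoker.ExecuteAsync(DriverCapability.Read,
        token => driver.ReadAsync(tagId, token), ct);
```

Note the remarks' caveat applies here: wrapping `ReadAsync` under `DriverCapability.Write` would still pass, since the analyzer checks only the enclosing-lambda shape, not the capability argument.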

View File

@@ -0,0 +1,24 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<!-- Roslyn analyzers ship as netstandard2.0 so they load into any compiler host:
both the .NET Framework-based host (Visual Studio / MSBuild) and the modern .NET
host used by `dotnet build` resolve netstandard2.0. -->
<TargetFramework>netstandard2.0</TargetFramework>
<Nullable>enable</Nullable>
<LangVersion>latest</LangVersion>
<IsPackable>false</IsPackable>
<IsRoslynComponent>true</IsRoslynComponent>
<EnforceExtendedAnalyzerRules>true</EnforceExtendedAnalyzerRules>
<RootNamespace>ZB.MOM.WW.OtOpcUa.Analyzers</RootNamespace>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="Microsoft.CodeAnalysis.CSharp" Version="5.3.0" PrivateAssets="all"/>
</ItemGroup>
<ItemGroup>
<AdditionalFiles Include="AnalyzerReleases.Shipped.md"/>
<AdditionalFiles Include="AnalyzerReleases.Unshipped.md"/>
</ItemGroup>
</Project>

View File

@@ -9,7 +9,7 @@ return await new CliApplicationBuilder()
if (type.IsSubclassOf(typeof(CommandBase))) return Activator.CreateInstance(type, CommandBase.DefaultFactory)!;
return Activator.CreateInstance(type)!;
})
.SetExecutableName("lmxopcua-cli")
.SetDescription("LmxOpcUa CLI - command-line client for the LmxOpcUa OPC UA server")
.SetExecutableName("otopcua-cli")
.SetDescription("OtOpcUa CLI - command-line client for the OtOpcUa OPC UA server")
.Build()
.RunAsync(args);

View File

@@ -18,8 +18,8 @@ internal sealed class DefaultApplicationConfigurationFactory : IApplicationConfi
var config = new ApplicationConfiguration
{
ApplicationName = "LmxOpcUaClient",
ApplicationUri = "urn:localhost:LmxOpcUaClient",
ApplicationName = "OtOpcUaClient",
ApplicationUri = "urn:localhost:OtOpcUaClient",
ApplicationType = ApplicationType.Client,
SecurityConfiguration = new SecurityConfiguration
{
@@ -60,7 +60,7 @@ internal sealed class DefaultApplicationConfigurationFactory : IApplicationConfi
{
var app = new ApplicationInstance
{
ApplicationName = "LmxOpcUaClient",
ApplicationName = "OtOpcUaClient",
ApplicationType = ApplicationType.Client,
ApplicationConfiguration = config
};

View File

@@ -0,0 +1,90 @@
namespace ZB.MOM.WW.OtOpcUa.Client.Shared;
/// <summary>
/// Resolves the canonical under-LocalAppData folder for the shared OPC UA client's PKI
/// store + persisted settings. Renamed from <c>LmxOpcUaClient</c> to <c>OtOpcUaClient</c>
/// in task #208; a one-shot migration shim moves a pre-rename folder in place on first
/// resolution so existing developer boxes keep their trusted server certs + saved
/// connection settings on upgrade.
/// </summary>
/// <remarks>
/// Thread-safe: the rename uses <see cref="Directory.Move"/> which is atomic on NTFS
/// within the same volume. The lock guarantees the migration runs at most once per
/// process even under concurrent first-touch from CLI + UI.
/// </remarks>
public static class ClientStoragePaths
{
/// <summary>Canonical client folder name. Post-#208.</summary>
public const string CanonicalFolderName = "OtOpcUaClient";
/// <summary>Pre-#208 folder name. Used only by the migration shim.</summary>
public const string LegacyFolderName = "LmxOpcUaClient";
private static readonly Lock _migrationLock = new();
private static bool _migrationChecked;
/// <summary>
/// Absolute path to the client's top-level folder under LocalApplicationData. Runs the
/// one-shot legacy-folder migration before returning so callers that depend on this
/// path (PKI store, settings file) find their existing state at the canonical name.
/// </summary>
public static string GetRoot()
{
var localAppData = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData);
var canonical = Path.Combine(localAppData, CanonicalFolderName);
MigrateLegacyFolderIfNeeded(localAppData, canonical);
return canonical;
}
/// <summary>Subfolder for the application's PKI store — used by both CLI + UI.</summary>
public static string GetPkiPath() => Path.Combine(GetRoot(), "pki");
/// <summary>
/// Run the one-shot legacy migration on demand — exposed for tests + for callers that
/// want a deterministic migration point before touching the store. Returns true when a
/// legacy folder existed + was moved to canonical, false when no migration was needed,
/// canonical was already present, or the migration was already attempted this process.
/// </summary>
public static bool TryRunLegacyMigration()
{
var localAppData = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData);
var canonical = Path.Combine(localAppData, CanonicalFolderName);
return MigrateLegacyFolderIfNeeded(localAppData, canonical);
}
private static bool MigrateLegacyFolderIfNeeded(string localAppData, string canonical)
{
// Fast-path out of the lock when the migration has already been attempted this process
// — saves the IO on every subsequent call, + the migration is idempotent within the
// same process anyway.
if (_migrationChecked) return false;
lock (_migrationLock)
{
if (_migrationChecked) return false;
_migrationChecked = true;
var legacy = Path.Combine(localAppData, LegacyFolderName);
// Only migrate when the legacy folder is present + canonical isn't. Either of the
// other three combinations (neither / only-canonical / both) means migration
// should NOT run: no-op fresh install, already-migrated, or manual state the
// developer has set up — don't clobber.
if (!Directory.Exists(legacy)) return false;
if (Directory.Exists(canonical)) return false;
try
{
Directory.Move(legacy, canonical);
return true;
}
catch (Exception ex) when (ex is IOException or UnauthorizedAccessException)
{
// Another process moved it concurrently, the move crossed a volume boundary, or
// permissions blocked it — leave the legacy folder alone; callers that need it
// can re-run the migration manually or point CertificateStorePath explicitly.
return false;
}
}
}
}
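
A quick usage sketch of the public surface above, showing the per-process short-circuit. Output comments assume a box where no legacy folder exists.

```csharp
using System;
using System.IO;

// First touch resolves the canonical root and runs the one-shot migration probe.
var root = ClientStoragePaths.GetRoot();
Console.WriteLine(root.EndsWith(ClientStoragePaths.CanonicalFolderName)); // True

// The PKI path hangs off the same root, so it also benefits from the migration.
Console.WriteLine(ClientStoragePaths.GetPkiPath() == Path.Combine(root, "pki")); // True

// Second call is a no-op: the _migrationChecked flag short-circuits before any IO.
Console.WriteLine(ClientStoragePaths.TryRunLegacyMigration()); // False
```

This is why CLI + UI can both call `GetRoot()` on startup without coordinating: whichever touches first pays the single `Directory.Exists` probe, and everyone else gets the cached answer.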

View File

@@ -41,11 +41,11 @@ public sealed class ConnectionSettings
public bool AutoAcceptCertificates { get; set; } = true;
/// <summary>
/// Path to the certificate store. Defaults to a subdirectory under LocalApplicationData.
/// Path to the certificate store. Defaults to a subdirectory under LocalApplicationData
/// resolved via <see cref="ClientStoragePaths"/> so the one-shot legacy-folder migration
/// runs before the path is returned.
/// </summary>
public string CertificateStorePath { get; set; } = Path.Combine(
Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
"LmxOpcUaClient", "pki");
public string CertificateStorePath { get; set; } = ClientStoragePaths.GetPkiPath();
/// <summary>
/// Validates the settings and throws if any required values are missing or invalid.

View File

@@ -425,7 +425,7 @@ public sealed class OpcUaClientService : IOpcUaClientService
: new UserIdentity();
var sessionTimeoutMs = (uint)(settings.SessionTimeoutSeconds * 1000);
return await _sessionFactory.CreateSessionAsync(config, endpoint, "LmxOpcUaClient", sessionTimeoutMs, identity,
return await _sessionFactory.CreateSessionAsync(config, endpoint, "OtOpcUaClient", sessionTimeoutMs, identity,
ct);
}

View File

@@ -1,4 +1,5 @@
using System.Text.Json;
using ZB.MOM.WW.OtOpcUa.Client.Shared;
namespace ZB.MOM.WW.OtOpcUa.Client.UI.Services;
@@ -7,9 +8,9 @@ namespace ZB.MOM.WW.OtOpcUa.Client.UI.Services;
/// </summary>
public sealed class JsonSettingsService : ISettingsService
{
private static readonly string SettingsDir = Path.Combine(
Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
"LmxOpcUaClient");
// ClientStoragePaths.GetRoot runs the one-shot legacy-folder migration so pre-#208
// developer boxes pick up their existing settings.json on first launch post-rename.
private static readonly string SettingsDir = ClientStoragePaths.GetRoot();
private static readonly string SettingsPath = Path.Combine(SettingsDir, "settings.json");

View File

@@ -21,9 +21,7 @@ public partial class MainWindowViewModel : ObservableObject
[ObservableProperty] private bool _autoAcceptCertificates = true;
[ObservableProperty] private string _certificateStorePath = Path.Combine(
Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
"LmxOpcUaClient", "pki");
[ObservableProperty] private string _certificateStorePath = ClientStoragePaths.GetPkiPath();
[ObservableProperty]
[NotifyCanExecuteChangedFor(nameof(ConnectCommand))]

View File

@@ -0,0 +1,172 @@
using Microsoft.EntityFrameworkCore.Migrations;
#nullable disable
namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
{
/// <summary>
/// Extends <c>dbo.sp_ComputeGenerationDiff</c> to emit <c>NodeAcl</c> rows alongside the
/// existing Namespace/DriverInstance/Equipment/Tag output — closes the final slice of
/// task #196 (DiffViewer ACL section). Logical id for NodeAcl is a composite
/// <c>LdapGroup|ScopeKind|ScopeId</c> triple so a Change row surfaces whether the grant
/// shifted permissions, moved scope, or was added/removed outright.
/// </summary>
/// <inheritdoc />
public partial class ExtendComputeGenerationDiffWithNodeAcl : Migration
{
/// <inheritdoc />
protected override void Up(MigrationBuilder migrationBuilder)
{
migrationBuilder.Sql(Procs.ComputeGenerationDiffV2);
}
/// <inheritdoc />
protected override void Down(MigrationBuilder migrationBuilder)
{
migrationBuilder.Sql(Procs.ComputeGenerationDiffV1);
}
private static class Procs
{
/// <summary>V2 — adds the NodeAcl section to the diff output.</summary>
public const string ComputeGenerationDiffV2 = @"
CREATE OR ALTER PROCEDURE dbo.sp_ComputeGenerationDiff
@FromGenerationId bigint,
@ToGenerationId bigint
AS
BEGIN
SET NOCOUNT ON;
CREATE TABLE #diff (TableName nvarchar(32), LogicalId nvarchar(128), ChangeKind nvarchar(16));
WITH f AS (SELECT NamespaceId AS LogicalId, CHECKSUM(NamespaceUri, Kind, Enabled, Notes) AS Sig FROM dbo.Namespace WHERE GenerationId = @FromGenerationId),
t AS (SELECT NamespaceId AS LogicalId, CHECKSUM(NamespaceUri, Kind, Enabled, Notes) AS Sig FROM dbo.Namespace WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Namespace', CONVERT(nvarchar(128), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT DriverInstanceId AS LogicalId, CHECKSUM(ClusterId, NamespaceId, Name, DriverType, Enabled, CONVERT(varchar(max), DriverConfig)) AS Sig FROM dbo.DriverInstance WHERE GenerationId = @FromGenerationId),
t AS (SELECT DriverInstanceId AS LogicalId, CHECKSUM(ClusterId, NamespaceId, Name, DriverType, Enabled, CONVERT(varchar(max), DriverConfig)) AS Sig FROM dbo.DriverInstance WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'DriverInstance', CONVERT(nvarchar(128), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT EquipmentId AS LogicalId, CHECKSUM(EquipmentUuid, DriverInstanceId, UnsLineId, Name, MachineCode, ZTag, SAPID, EquipmentClassRef, Manufacturer, Model, SerialNumber) AS Sig FROM dbo.Equipment WHERE GenerationId = @FromGenerationId),
t AS (SELECT EquipmentId AS LogicalId, CHECKSUM(EquipmentUuid, DriverInstanceId, UnsLineId, Name, MachineCode, ZTag, SAPID, EquipmentClassRef, Manufacturer, Model, SerialNumber) AS Sig FROM dbo.Equipment WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Equipment', CONVERT(nvarchar(128), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT TagId AS LogicalId, CHECKSUM(DriverInstanceId, DeviceId, EquipmentId, PollGroupId, FolderPath, Name, DataType, AccessLevel, WriteIdempotent, CONVERT(varchar(max), TagConfig)) AS Sig FROM dbo.Tag WHERE GenerationId = @FromGenerationId),
t AS (SELECT TagId AS LogicalId, CHECKSUM(DriverInstanceId, DeviceId, EquipmentId, PollGroupId, FolderPath, Name, DataType, AccessLevel, WriteIdempotent, CONVERT(varchar(max), TagConfig)) AS Sig FROM dbo.Tag WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Tag', CONVERT(nvarchar(128), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
-- NodeAcl section. Logical id is the (LdapGroup, ScopeKind, ScopeId) triple so the diff
-- distinguishes same row with new permissions (Modified via CHECKSUM on PermissionFlags + Notes)
-- from a scope move (which surfaces as Added + Removed of different logical ids).
WITH f AS (
SELECT CONVERT(nvarchar(128), LdapGroup + '|' + CONVERT(nvarchar(16), ScopeKind) + '|' + ISNULL(ScopeId, '(cluster)')) AS LogicalId,
CHECKSUM(ClusterId, PermissionFlags, Notes) AS Sig
FROM dbo.NodeAcl WHERE GenerationId = @FromGenerationId),
t AS (
SELECT CONVERT(nvarchar(128), LdapGroup + '|' + CONVERT(nvarchar(16), ScopeKind) + '|' + ISNULL(ScopeId, '(cluster)')) AS LogicalId,
CHECKSUM(ClusterId, PermissionFlags, Notes) AS Sig
FROM dbo.NodeAcl WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'NodeAcl', COALESCE(f.LogicalId, t.LogicalId),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
SELECT TableName, LogicalId, ChangeKind FROM #diff;
DROP TABLE #diff;
END
";
/// <summary>V1 — exact proc shipped in migration 20260417215224_StoredProcedures. Restored on Down().</summary>
public const string ComputeGenerationDiffV1 = @"
CREATE OR ALTER PROCEDURE dbo.sp_ComputeGenerationDiff
@FromGenerationId bigint,
@ToGenerationId bigint
AS
BEGIN
SET NOCOUNT ON;
CREATE TABLE #diff (TableName nvarchar(32), LogicalId nvarchar(64), ChangeKind nvarchar(16));
WITH f AS (SELECT NamespaceId AS LogicalId, CHECKSUM(NamespaceUri, Kind, Enabled, Notes) AS Sig FROM dbo.Namespace WHERE GenerationId = @FromGenerationId),
t AS (SELECT NamespaceId AS LogicalId, CHECKSUM(NamespaceUri, Kind, Enabled, Notes) AS Sig FROM dbo.Namespace WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Namespace', CONVERT(nvarchar(64), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT DriverInstanceId AS LogicalId, CHECKSUM(ClusterId, NamespaceId, Name, DriverType, Enabled, CONVERT(varchar(max), DriverConfig)) AS Sig FROM dbo.DriverInstance WHERE GenerationId = @FromGenerationId),
t AS (SELECT DriverInstanceId AS LogicalId, CHECKSUM(ClusterId, NamespaceId, Name, DriverType, Enabled, CONVERT(varchar(max), DriverConfig)) AS Sig FROM dbo.DriverInstance WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'DriverInstance', CONVERT(nvarchar(64), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT EquipmentId AS LogicalId, CHECKSUM(EquipmentUuid, DriverInstanceId, UnsLineId, Name, MachineCode, ZTag, SAPID, EquipmentClassRef, Manufacturer, Model, SerialNumber) AS Sig FROM dbo.Equipment WHERE GenerationId = @FromGenerationId),
t AS (SELECT EquipmentId AS LogicalId, CHECKSUM(EquipmentUuid, DriverInstanceId, UnsLineId, Name, MachineCode, ZTag, SAPID, EquipmentClassRef, Manufacturer, Model, SerialNumber) AS Sig FROM dbo.Equipment WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Equipment', CONVERT(nvarchar(64), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT TagId AS LogicalId, CHECKSUM(DriverInstanceId, DeviceId, EquipmentId, PollGroupId, FolderPath, Name, DataType, AccessLevel, WriteIdempotent, CONVERT(varchar(max), TagConfig)) AS Sig FROM dbo.Tag WHERE GenerationId = @FromGenerationId),
t AS (SELECT TagId AS LogicalId, CHECKSUM(DriverInstanceId, DeviceId, EquipmentId, PollGroupId, FolderPath, Name, DataType, AccessLevel, WriteIdempotent, CONVERT(varchar(max), TagConfig)) AS Sig FROM dbo.Tag WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Tag', CONVERT(nvarchar(64), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
SELECT TableName, LogicalId, ChangeKind FROM #diff;
DROP TABLE #diff;
END
";
}
}
}


@@ -0,0 +1,83 @@
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Source-hash-keyed compile cache for user scripts. Roslyn compilation is the most
/// expensive step in the evaluator pipeline (5-20ms per script depending on size);
/// re-compiling on every value-change event would starve the virtual-tag engine.
/// The cache is generic on the <see cref="ScriptContext"/> subclass + result type so
/// different engines (virtual-tag / alarm-predicate / future alarm-action) each get
/// their own cache instance — there's no cross-type pollution.
/// </summary>
/// <remarks>
/// <para>
/// Concurrent-safe: <see cref="ConcurrentDictionary{TKey, TValue}"/> of
/// <see cref="Lazy{T}"/> means a miss on two threads compiles exactly once.
/// <see cref="LazyThreadSafetyMode.ExecutionAndPublication"/> guarantees other
/// threads block on the in-flight compile rather than racing to duplicate work.
/// </para>
/// <para>
/// Cache is keyed on SHA-256 of the UTF-8 bytes of the source — collision-free in
/// practice. A whitespace-only edit therefore misses the cache by design: the
/// operator pays one re-compile on the first evaluation after a format-only
/// change, which is rare and benign.
/// </para>
/// <para>
/// No capacity bound. Virtual-tag + alarm scripts are operator-authored and
/// bounded by config DB (typically low thousands). If that changes in v3, add an
/// LRU eviction policy — the API stays the same.
/// </para>
/// </remarks>
public sealed class CompiledScriptCache<TContext, TResult>
where TContext : ScriptContext
{
private readonly ConcurrentDictionary<string, Lazy<ScriptEvaluator<TContext, TResult>>> _cache = new();
/// <summary>
/// Return the compiled evaluator for <paramref name="scriptSource"/>, compiling
/// on first sight + reusing thereafter. If the source fails to compile, the
/// original Roslyn / sandbox exception propagates; the cache entry is removed so
/// the next call retries (useful during Admin UI authoring when the operator is
/// still fixing syntax).
/// </summary>
public ScriptEvaluator<TContext, TResult> GetOrCompile(string scriptSource)
{
if (scriptSource is null) throw new ArgumentNullException(nameof(scriptSource));
var key = HashSource(scriptSource);
var lazy = _cache.GetOrAdd(key, _ => new Lazy<ScriptEvaluator<TContext, TResult>>(
() => ScriptEvaluator<TContext, TResult>.Compile(scriptSource),
LazyThreadSafetyMode.ExecutionAndPublication));
try
{
return lazy.Value;
}
catch
{
// Failed compile — evict so a retry with corrected source can succeed.
_cache.TryRemove(key, out _);
throw;
}
}
/// <summary>Current entry count. Exposed for Admin UI diagnostics / tests.</summary>
public int Count => _cache.Count;
/// <summary>Drop every cached compile. Used on config generation publish + tests.</summary>
public void Clear() => _cache.Clear();
/// <summary>True when the exact source has been compiled at least once + is still cached.</summary>
public bool Contains(string scriptSource)
=> _cache.ContainsKey(HashSource(scriptSource));
private static string HashSource(string source)
{
var bytes = Encoding.UTF8.GetBytes(source);
var hash = SHA256.HashData(bytes);
return Convert.ToHexString(hash);
}
}
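
The ConcurrentDictionary-of-Lazy single-flight pattern the cache relies on can be shown in isolation. This is a minimal, self-contained sketch, not the PR's class: `CountingCache` is an illustrative name and the "compile" step is a stand-in counter rather than Roslyn.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

// Sketch of the compile-once pattern: the key is the SHA-256 hex of the source,
// and Lazy<T> under ExecutionAndPublication means concurrent misses on the same
// key run the expensive factory exactly once while other threads block on it.
public sealed class CountingCache
{
    private readonly ConcurrentDictionary<string, Lazy<string>> _cache = new();
    private int _compileCount;

    // How many times the expensive step actually ran.
    public int CompileCount => _compileCount;

    public string GetOrCompile(string source)
    {
        var key = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(source)));
        var lazy = _cache.GetOrAdd(key, _ => new Lazy<string>(
            () =>
            {
                Interlocked.Increment(ref _compileCount);
                return "compiled:" + key;
            },
            LazyThreadSafetyMode.ExecutionAndPublication));
        return lazy.Value;
    }
}
```

Two calls with byte-identical source share one entry; a whitespace-only edit hashes differently and runs the factory again, matching the documented miss-by-design behavior.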


@@ -0,0 +1,137 @@
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Text;
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Parses a script's source text + extracts every <c>ctx.GetTag("literal")</c> and
/// <c>ctx.SetVirtualTag("literal", ...)</c> call. Outputs the static dependency set
/// the virtual-tag engine uses to build its change-trigger subscription graph (Phase
/// 7 plan decision #7 — AST inference, operator doesn't maintain a separate list).
/// </summary>
/// <remarks>
/// <para>
/// The tag-path argument MUST be a literal string expression. Variables,
/// concatenation, interpolation, and method-returned strings are rejected because
/// the extractor can't statically know what tag they'll resolve to at evaluation
/// time — the dependency graph needs to know every possible input up front.
/// Rejections carry the exact source span so the Admin UI can point at the offending
/// token.
/// </para>
/// <para>
/// Identifier matching is by spelling: the extractor looks for
/// <c>ctx.GetTag(...)</c> / <c>ctx.SetVirtualTag(...)</c> literally. A deliberately
/// misspelled method call (<c>ctx.GetTagz</c>) is not picked up but will also fail
/// to compile against <see cref="ScriptContext"/>, so there's no way to smuggle a
/// dependency past the extractor while still having a working script.
/// </para>
/// </remarks>
public static class DependencyExtractor
{
/// <summary>
/// Parse <paramref name="scriptSource"/> + return the inferred read + write tag
/// paths, or a list of rejection messages if non-literal paths were used.
/// </summary>
public static DependencyExtractionResult Extract(string scriptSource)
{
if (string.IsNullOrWhiteSpace(scriptSource))
return new DependencyExtractionResult(
Reads: new HashSet<string>(StringComparer.Ordinal),
Writes: new HashSet<string>(StringComparer.Ordinal),
Rejections: []);
var tree = CSharpSyntaxTree.ParseText(scriptSource, options:
new CSharpParseOptions(kind: SourceCodeKind.Script));
var root = tree.GetRoot();
var walker = new Walker();
walker.Visit(root);
return new DependencyExtractionResult(
Reads: walker.Reads,
Writes: walker.Writes,
Rejections: walker.Rejections);
}
private sealed class Walker : CSharpSyntaxWalker
{
private readonly HashSet<string> _reads = new(StringComparer.Ordinal);
private readonly HashSet<string> _writes = new(StringComparer.Ordinal);
private readonly List<DependencyRejection> _rejections = [];
public IReadOnlySet<string> Reads => _reads;
public IReadOnlySet<string> Writes => _writes;
public IReadOnlyList<DependencyRejection> Rejections => _rejections;
public override void VisitInvocationExpression(InvocationExpressionSyntax node)
{
// Only interested in member-access form: ctx.GetTag(...) / ctx.SetVirtualTag(...).
// Anything else (free functions, chained calls, static calls) is ignored — but
// still visit children in case a ctx.GetTag call is nested inside.
if (node.Expression is MemberAccessExpressionSyntax member)
{
var methodName = member.Name.Identifier.ValueText;
if (methodName is nameof(ScriptContext.GetTag) or nameof(ScriptContext.SetVirtualTag))
{
HandleTagCall(node, methodName);
}
}
base.VisitInvocationExpression(node);
}
private void HandleTagCall(InvocationExpressionSyntax node, string methodName)
{
var args = node.ArgumentList.Arguments;
if (args.Count == 0)
{
_rejections.Add(new DependencyRejection(
Span: node.Span,
Message: $"Call to ctx.{methodName} has no arguments. " +
"The tag path must be the first argument."));
return;
}
var pathArg = args[0].Expression;
if (pathArg is not LiteralExpressionSyntax literal
|| !literal.Token.IsKind(SyntaxKind.StringLiteralToken))
{
_rejections.Add(new DependencyRejection(
Span: pathArg.Span,
Message: $"Tag path passed to ctx.{methodName} must be a string literal. " +
$"Dynamic paths (variables, concatenation, interpolation, method " +
$"calls) are rejected at publish so the dependency graph can be " +
$"built statically. Got: {pathArg.Kind()} ({pathArg})"));
return;
}
var path = (string?)literal.Token.Value ?? string.Empty;
if (string.IsNullOrWhiteSpace(path))
{
_rejections.Add(new DependencyRejection(
Span: literal.Span,
Message: $"Tag path passed to ctx.{methodName} is empty or whitespace."));
return;
}
if (methodName == nameof(ScriptContext.GetTag))
_reads.Add(path);
else
_writes.Add(path);
}
}
}
/// <summary>Output of <see cref="DependencyExtractor.Extract"/>.</summary>
public sealed record DependencyExtractionResult(
IReadOnlySet<string> Reads,
IReadOnlySet<string> Writes,
IReadOnlyList<DependencyRejection> Rejections)
{
/// <summary>True when no rejections were recorded — safe to publish.</summary>
public bool IsValid => Rejections.Count == 0;
}
/// <summary>A single non-literal-path rejection with the exact source span for UI pointing.</summary>
public sealed record DependencyRejection(TextSpan Span, string Message);


@@ -0,0 +1,152 @@
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Text;
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Post-compile sandbox guard. <c>ScriptOptions</c> alone can't reliably
/// constrain the type surface a script can reach because .NET 10's type-forwarding
/// system resolves many BCL types through multiple assemblies — restricting the
/// reference list doesn't stop <c>System.Net.Http.HttpClient</c> from being found if
/// any transitive reference forwards to <c>System.Net.Http</c>. This analyzer walks
/// the script's syntax tree after compile, uses the <see cref="SemanticModel"/> to
/// resolve every type / member reference, and rejects any whose containing namespace
/// matches a deny-list pattern.
/// </summary>
/// <remarks>
/// <para>
/// Deny-list is the authoritative Phase 7 plan decision #6 set:
/// <c>System.IO</c>, <c>System.Net</c>, <c>System.Diagnostics.Process</c>,
/// <c>System.Reflection</c>, <c>System.Threading.Thread</c>,
/// <c>System.Runtime.InteropServices</c>. <c>System.Environment</c> (for process
/// env-var read) is explicitly left allowed — it's read-only process state, doesn't
/// persist outside, and the test file pins this compromise so tightening later is
/// a deliberate plan decision.
/// </para>
/// <para>
/// Deny-list prefix match. <c>System.Net</c> catches <c>System.Net.Http</c>,
/// <c>System.Net.Sockets</c>, <c>System.Net.NetworkInformation</c>, etc. — every
/// subnamespace. If a script needs something under a denied prefix, Phase 7's
/// operator audience authors it through a helper the plan team adds as part of
/// the <see cref="ScriptContext"/> surface, not by unlocking the namespace.
/// </para>
/// </remarks>
public static class ForbiddenTypeAnalyzer
{
/// <summary>
/// Namespace prefixes scripts are NOT allowed to reference. Each string is
/// matched as a prefix against the resolved symbol's namespace name (dot-
/// delimited), so <c>System.IO</c> catches <c>System.IO.File</c>,
/// <c>System.IO.Pipes</c>, and any future subnamespace without needing explicit
/// enumeration.
/// </summary>
public static readonly IReadOnlyList<string> ForbiddenNamespacePrefixes =
[
"System.IO",
"System.Net",
"System.Diagnostics", // catches Process, ProcessStartInfo, EventLog, Trace/Debug file sinks
"System.Reflection",
"System.Threading.Thread", // raw Thread — Tasks stay allowed (different namespace)
"System.Runtime.InteropServices",
"Microsoft.Win32", // registry
];
/// <summary>
/// Scan the <paramref name="compilation"/> for references to forbidden types.
/// Returns empty list when the script is clean; non-empty list means the script
/// must be rejected at publish with the rejections surfaced to the operator.
/// </summary>
public static IReadOnlyList<ForbiddenTypeRejection> Analyze(Compilation compilation)
{
if (compilation is null) throw new ArgumentNullException(nameof(compilation));
var rejections = new List<ForbiddenTypeRejection>();
foreach (var tree in compilation.SyntaxTrees)
{
var semantic = compilation.GetSemanticModel(tree);
var root = tree.GetRoot();
foreach (var node in root.DescendantNodes())
{
switch (node)
{
case ObjectCreationExpressionSyntax obj:
CheckSymbol(semantic.GetSymbolInfo(obj.Type).Symbol, obj.Type.Span, rejections);
break;
case InvocationExpressionSyntax inv when inv.Expression is MemberAccessExpressionSyntax memberAcc:
CheckSymbol(semantic.GetSymbolInfo(memberAcc.Expression).Symbol, memberAcc.Expression.Span, rejections);
CheckSymbol(semantic.GetSymbolInfo(inv).Symbol, inv.Span, rejections);
break;
case MemberAccessExpressionSyntax mem:
// Catches static calls like System.IO.File.ReadAllText(...) — the
// MemberAccess "System.IO.File" resolves to the File type symbol
// whose containing namespace is System.IO, triggering a rejection.
CheckSymbol(semantic.GetSymbolInfo(mem.Expression).Symbol, mem.Expression.Span, rejections);
break;
case IdentifierNameSyntax id when node.Parent is not MemberAccessExpressionSyntax:
CheckSymbol(semantic.GetSymbolInfo(id).Symbol, id.Span, rejections);
break;
}
}
}
return rejections;
}
private static void CheckSymbol(ISymbol? symbol, TextSpan span, List<ForbiddenTypeRejection> rejections)
{
if (symbol is null) return;
var typeSymbol = symbol switch
{
ITypeSymbol t => t,
IMethodSymbol m => m.ContainingType,
IPropertySymbol p => p.ContainingType,
IFieldSymbol f => f.ContainingType,
_ => null,
};
if (typeSymbol is null) return;
var ns = typeSymbol.ContainingNamespace?.ToDisplayString() ?? string.Empty;
var fullName = typeSymbol.ToDisplayString();
foreach (var forbidden in ForbiddenNamespacePrefixes)
{
// Match the containing namespace (so System.Net catches System.Net.Http.HttpClient)
// and also the fully-qualified type name, so a type-level entry such as
// System.Threading.Thread (whose containing namespace is only System.Threading)
// still triggers a rejection instead of never matching.
if (ns == forbidden || ns.StartsWith(forbidden + ".", StringComparison.Ordinal)
|| fullName == forbidden || fullName.StartsWith(forbidden + ".", StringComparison.Ordinal))
{
rejections.Add(new ForbiddenTypeRejection(
Span: span,
TypeName: typeSymbol.ToDisplayString(),
Namespace: ns,
Message: $"Type '{typeSymbol.ToDisplayString()}' is in the forbidden namespace '{ns}'. " +
$"Scripts cannot reach {forbidden}* per Phase 7 sandbox rules."));
return;
}
}
}
}
/// <summary>A single forbidden-type reference in a user script.</summary>
public sealed record ForbiddenTypeRejection(
TextSpan Span,
string TypeName,
string Namespace,
string Message);
/// <summary>Thrown from <see cref="ScriptEvaluator{TContext, TResult}.Compile"/> when the
/// post-compile forbidden-type analyzer finds references to denied namespaces.</summary>
public sealed class ScriptSandboxViolationException : Exception
{
public IReadOnlyList<ForbiddenTypeRejection> Rejections { get; }
public ScriptSandboxViolationException(IReadOnlyList<ForbiddenTypeRejection> rejections)
: base(BuildMessage(rejections))
{
Rejections = rejections;
}
private static string BuildMessage(IReadOnlyList<ForbiddenTypeRejection> rejections)
{
var lines = rejections.Select(r => $" - {r.Message}");
return "Script references types outside the Phase 7 sandbox allow-list:\n"
+ string.Join("\n", lines);
}
}
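
The prefix test itself is worth pinning down, because a naive `StartsWith` would also reject innocent namespaces that merely share the prefix characters. A self-contained sketch of the matching rule; `DenyListDemo` and `IsForbidden` are illustrative names, not part of the PR:

```csharp
using System;

public static class DenyListDemo
{
    private static readonly string[] Prefixes = { "System.IO", "System.Net" };

    // Matches the namespace itself or any dotted sub-namespace, but NOT a
    // sibling that happens to start with the same characters: appending "."
    // before StartsWith keeps a hypothetical "System.Networking" allowed.
    public static bool IsForbidden(string ns)
    {
        foreach (var p in Prefixes)
            if (ns == p || ns.StartsWith(p + ".", StringComparison.Ordinal))
                return true;
        return false;
    }
}
```

So `IsForbidden("System.Net.Http")` and `IsForbidden("System.IO.Pipes")` are true, while a sibling namespace like `"System.Networking"` passes, which is exactly why the analyzer appends `"."` before calling `StartsWith`.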


@@ -0,0 +1,80 @@
using Serilog;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// The API user scripts see as the global <c>ctx</c>. Abstract — concrete subclasses
/// (e.g. <c>VirtualTagScriptContext</c>, <c>AlarmScriptContext</c>) plug in the
/// actual tag-backend + logger + virtual-tag writer for each evaluation. Phase 7 plan
/// decision #6: scripts can read any tag, write only to virtual tags, and have no
/// other .NET reach — no HttpClient, no File, no Process, no reflection.
/// </summary>
/// <remarks>
/// <para>
/// Tag-access member names on this type form a stable contract:
/// <see cref="DependencyExtractor"/> recognizes call sites in the script AST by
/// spelling. Renaming <see cref="GetTag"/> or <see cref="SetVirtualTag"/> is
/// therefore a breaking change for every authored script, and the dependency
/// extractor must update in lockstep.
/// </para>
/// <para>
/// New helpers (<see cref="Now"/>, <see cref="Deadband"/>) are additive: adding a
/// method doesn't invalidate existing scripts. Do not remove or rename without a
/// plan-level decision + migration for authored scripts.
/// </para>
/// </remarks>
public abstract class ScriptContext
{
/// <summary>
/// Read a tag's current value + quality + source timestamp. Path syntax is
/// <c>Enterprise/Site/Area/Line/Equipment/TagName</c> (forward-slash delimited,
/// matching the Equipment-namespace browse tree). Returns a
/// <see cref="DataValueSnapshot"/> so scripts branch on quality without a second
/// call.
/// </summary>
/// <remarks>
/// <paramref name="path"/> MUST be a string literal in the script source — dynamic
/// paths (variables, concatenation, method-returned strings) are rejected at
/// publish by <see cref="DependencyExtractor"/>. This is intentional: the static
/// dependency set is required for the change-driven scheduler to subscribe to the
/// right upstream tags at load time.
/// </remarks>
public abstract DataValueSnapshot GetTag(string path);
/// <summary>
/// Write a value to a virtual tag. Operator scripts cannot write to driver-sourced
/// tags — the OPC UA dispatch in <c>DriverNodeManager</c> rejects that separately
/// per ADR-002 with <c>BadUserAccessDenied</c>. This method is the only write path
/// virtual tags have.
/// </summary>
/// <remarks>
/// Path rules identical to <see cref="GetTag"/> — literal only, dependency
/// extractor tracks the write targets so the engine knows what downstream
/// subscribers to notify.
/// </remarks>
public abstract void SetVirtualTag(string path, object? value);
/// <summary>
/// Current UTC timestamp. Prefer this over <see cref="DateTime.UtcNow"/> in
/// scripts so the harness can supply a deterministic clock for tests.
/// </summary>
public abstract DateTime Now { get; }
/// <summary>
/// Per-script Serilog logger. Output lands in the dedicated <c>scripts-*.log</c>
/// sink with structured property <c>ScriptName</c> = the script's configured name.
/// Use at error level to surface problems; main <c>opcua-*.log</c> receives a
/// companion WARN entry so operators see script errors in the primary log.
/// </summary>
public abstract ILogger Logger { get; }
/// <summary>
/// Deadband helper — returns <c>true</c> when <paramref name="current"/> differs
/// from <paramref name="previous"/> by more than <paramref name="tolerance"/>.
/// Useful for alarm predicates that shouldn't flicker on small noise. Pure
/// function; no side effects.
/// </summary>
public static bool Deadband(double current, double previous, double tolerance)
=> Math.Abs(current - previous) > tolerance;
}
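
`Deadband`'s contract is strictly-greater-than, so a change exactly at the tolerance does not trigger. A standalone copy of the helper for illustration, with made-up numbers; `DeadbandDemo` is not part of the PR:

```csharp
using System;

public static class DeadbandDemo
{
    // Same logic as ScriptContext.Deadband: a delta exactly equal to the
    // tolerance is treated as noise, not a change, because the comparison
    // is strictly greater than.
    public static bool Deadband(double current, double previous, double tolerance)
        => Math.Abs(current - previous) > tolerance;
}
```

For example, `Deadband(101.6, 100.0, 1.5)` flags a change, while `Deadband(101.5, 100.0, 1.5)` does not: the delta sits exactly on the tolerance, so an alarm predicate built on it will not flicker at the boundary.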


@@ -0,0 +1,75 @@
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Compiles + runs user scripts against a <see cref="ScriptContext"/> subclass. Core
/// evaluator — no caching, no timeout, no logging side-effects yet (those land in
/// Stream A.3, A.4, A.5 respectively). Stream B + C wrap this with the dependency
/// scheduler + alarm state machine.
/// </summary>
/// <remarks>
/// <para>
/// Scripts are compiled against <see cref="ScriptGlobals{TContext}"/> so the
/// context member is named <c>ctx</c> in the script, matching the
/// <see cref="DependencyExtractor"/>'s walker and the Admin UI type stub.
/// </para>
/// <para>
/// Compile pipeline is a three-step gate: (1) Roslyn compile — catches syntax
/// errors + type-resolution failures, throws <see cref="CompilationErrorException"/>;
/// (2) <see cref="ForbiddenTypeAnalyzer"/> runs against the semantic model —
/// catches sandbox escapes that slipped past reference restrictions due to .NET's
/// type forwarding, throws <see cref="ScriptSandboxViolationException"/>; (3)
/// delegate creation — throws at this layer only for internal Roslyn bugs, not
/// user error.
/// </para>
/// <para>
/// Runtime exceptions thrown from user code propagate unwrapped. The virtual-tag
/// engine (Stream B) catches them per-tag + maps to <c>BadInternalError</c>
/// quality per Phase 7 decision #11 — this layer doesn't swallow anything so
/// tests can assert on the original exception type.
/// </para>
/// </remarks>
public sealed class ScriptEvaluator<TContext, TResult>
where TContext : ScriptContext
{
private readonly ScriptRunner<TResult> _runner;
private ScriptEvaluator(ScriptRunner<TResult> runner)
{
_runner = runner;
}
public static ScriptEvaluator<TContext, TResult> Compile(string scriptSource)
{
if (scriptSource is null) throw new ArgumentNullException(nameof(scriptSource));
var options = ScriptSandbox.Build(typeof(TContext));
var script = CSharpScript.Create<TResult>(
code: scriptSource,
options: options,
globalsType: typeof(ScriptGlobals<TContext>));
// Step 1: Roslyn compile. Script.Compile() returns diagnostics rather than
// throwing, so surface syntax / type errors as CompilationErrorException here.
var diagnostics = script.Compile();
if (diagnostics.Any(d => d.Severity == Microsoft.CodeAnalysis.DiagnosticSeverity.Error))
throw new CompilationErrorException("Script failed to compile.", diagnostics);
// Step 2 — forbidden-type semantic analysis. Defense-in-depth against reference-list
// leaks due to type forwarding.
var rejections = ForbiddenTypeAnalyzer.Analyze(script.GetCompilation());
if (rejections.Count > 0)
throw new ScriptSandboxViolationException(rejections);
// Step 3 — materialize the callable delegate.
var runner = script.CreateDelegate();
return new ScriptEvaluator<TContext, TResult>(runner);
}
/// <summary>Run against an already-constructed context.</summary>
public Task<TResult> RunAsync(TContext context, CancellationToken ct = default)
{
if (context is null) throw new ArgumentNullException(nameof(context));
var globals = new ScriptGlobals<TContext> { ctx = context };
return _runner(globals, ct);
}
}


@@ -0,0 +1,19 @@
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Wraps a <see cref="ScriptContext"/> as a named field so user scripts see
/// <c>ctx.GetTag(...)</c> instead of the bare <c>GetTag(...)</c> that Roslyn's
/// globalsType convention would produce. Keeps the script ergonomics operators
/// author against consistent with the dependency extractor (which looks for the
/// <c>ctx.</c> prefix) and with the Admin UI hand-written type stub.
/// </summary>
/// <remarks>
/// Generic on <typeparamref name="TContext"/> so alarm predicates can use a richer
/// context (e.g. with an <c>Alarm</c> property carrying the owning condition's
/// metadata) without affecting virtual-tag contexts.
/// </remarks>
public class ScriptGlobals<TContext>
where TContext : ScriptContext
{
public TContext ctx { get; set; } = default!;
}


@@ -0,0 +1,65 @@
using Serilog;
using Serilog.Core;
using Serilog.Events;
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Serilog sink that mirrors script log events at <see cref="LogEventLevel.Error"/>
/// or higher to a companion logger (typically the main <c>opcua-*.log</c>) at
/// <see cref="LogEventLevel.Warning"/>. Lets operators see script errors in the
/// primary server log without drowning it in Debug/Info/Warning noise from scripts.
/// </summary>
/// <remarks>
/// <para>
/// Registered alongside the dedicated <c>scripts-*.log</c> rolling file sink in
/// the root script-logger configuration — events below Error land only in the
/// scripts file; Error/Fatal events land in both the scripts file (at original
/// level) and the main log (downgraded to Warning since the main log's audience
/// is server operators, not script authors).
/// </para>
/// <para>
/// The forwarded message preserves the <c>ScriptName</c> property so operators
/// reading the main log can tell which script raised the error at a glance.
/// Original exception (if any) is attached so the main log's diagnostics keep
/// the full stack trace.
/// </para>
/// </remarks>
public sealed class ScriptLogCompanionSink : ILogEventSink
{
private readonly ILogger _mainLogger;
private readonly LogEventLevel _minMirrorLevel;
public ScriptLogCompanionSink(ILogger mainLogger, LogEventLevel minMirrorLevel = LogEventLevel.Error)
{
_mainLogger = mainLogger ?? throw new ArgumentNullException(nameof(mainLogger));
_minMirrorLevel = minMirrorLevel;
}
public void Emit(LogEvent logEvent)
{
if (logEvent is null) return;
if (logEvent.Level < _minMirrorLevel) return;
var scriptName = "unknown";
if (logEvent.Properties.TryGetValue(ScriptLoggerFactory.ScriptNameProperty, out var prop)
&& prop is ScalarValue sv && sv.Value is string s)
{
scriptName = s;
}
var rendered = logEvent.RenderMessage();
if (logEvent.Exception is not null)
{
_mainLogger.Warning(logEvent.Exception,
"[Script] {ScriptName} emitted {OriginalLevel}: {ScriptMessage}",
scriptName, logEvent.Level, rendered);
}
else
{
_mainLogger.Warning(
"[Script] {ScriptName} emitted {OriginalLevel}: {ScriptMessage}",
scriptName, logEvent.Level, rendered);
}
}
}
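
The sink's two rules, mirror only at or above the threshold and fall back to "unknown" when the `ScriptName` property is absent, can be shown without Serilog. This is a minimal stand-in: `Level`, `CompanionMirrorDemo`, the dictionary-backed properties, and the script name "LineSpeedAvg" are all illustrative substitutes for Serilog's real `LogEvent` API, not part of the PR.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for Serilog's LogEventLevel ordering (Verbose lowest, Fatal highest).
public enum Level { Verbose, Debug, Information, Warning, Error, Fatal }

public static class CompanionMirrorDemo
{
    // Decide whether an event would be forwarded to the main log and under
    // which script name, mirroring ScriptLogCompanionSink.Emit's filtering.
    public static (bool Forward, string ScriptName) Decide(
        Level eventLevel,
        IReadOnlyDictionary<string, object> properties,
        Level minMirrorLevel = Level.Error)
    {
        if (eventLevel < minMirrorLevel) return (false, "");
        // Missing or non-string ScriptName falls back to "unknown" without throwing.
        var name = properties.TryGetValue("ScriptName", out var v) && v is string s
            ? s
            : "unknown";
        return (true, name);
    }
}
```

An Information event is dropped, an Error event is forwarded with its script name, and an event that bypassed the factory (no `ScriptName` property) is still forwarded as "unknown".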


@@ -0,0 +1,48 @@
using Serilog;
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Creates per-script Serilog <see cref="ILogger"/> instances with the
/// <c>ScriptName</c> structured property pre-bound. Every log call from a user
/// script carries the owning virtual-tag or alarm name so operators can filter the
/// dedicated <c>scripts-*.log</c> sink by script in the Admin UI.
/// </summary>
/// <remarks>
/// <para>
/// Factory-based — the engine (Stream B / C) constructs exactly one instance
/// from the root script-logger pipeline at startup, then derives a per-script
/// logger for each <see cref="ScriptContext"/> it builds. No per-evaluation
/// allocation in the hot path.
/// </para>
/// <para>
/// The wrapped root logger is responsible for output wiring — typically a
/// rolling file sink to <c>scripts-*.log</c> plus a
/// <see cref="ScriptLogCompanionSink"/> that forwards Error-or-higher events
/// to the main server log at Warning level so operators see script errors
/// in the primary log without drowning it in Info noise.
/// </para>
/// </remarks>
public sealed class ScriptLoggerFactory
{
/// <summary>Structured property name the enricher binds. Stable for log filtering.</summary>
public const string ScriptNameProperty = "ScriptName";
private readonly ILogger _rootLogger;
public ScriptLoggerFactory(ILogger rootLogger)
{
_rootLogger = rootLogger ?? throw new ArgumentNullException(nameof(rootLogger));
}
/// <summary>
/// Create a per-script logger. Every event it emits carries
/// <c>ScriptName=<paramref name="scriptName"/></c> as a structured property.
/// </summary>
public ILogger Create(string scriptName)
{
if (string.IsNullOrWhiteSpace(scriptName))
throw new ArgumentException("Script name is required.", nameof(scriptName));
return _rootLogger.ForContext(ScriptNameProperty, scriptName);
}
}


@@ -0,0 +1,87 @@
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Factory for the <see cref="ScriptOptions"/> every user script is compiled against.
/// Implements Phase 7 plan decision #6 (read-only sandbox) by whitelisting only the
/// assemblies + namespaces the script API needs; no <c>System.IO</c>, no
/// <c>System.Net</c>, no <c>System.Diagnostics.Process</c>, no
/// <c>System.Reflection</c>. Attempts to reference those types in a script fail at
/// compile with a compiler error that points at the exact span — the operator sees
/// the rejection before publish, not at evaluation.
/// </summary>
/// <remarks>
/// <para>
/// Roslyn's default <see cref="ScriptOptions"/> references <c>mscorlib</c> /
/// <c>System.Runtime</c> transitively, which pulls in every type in the BCL; this
/// class overrides that with an explicit minimal allow-list.
/// </para>
/// <para>
/// Namespaces pre-imported so scripts don't have to write <c>using</c> clauses:
/// <c>System</c> (so <see cref="Math"/>-style statics resolve directly),
/// <c>System.Linq</c>, <c>ZB.MOM.WW.OtOpcUa.Core.Abstractions</c> (so scripts can
/// name <see cref="DataValueSnapshot"/>), and <c>ZB.MOM.WW.OtOpcUa.Core.Scripting</c>.
/// </para>
/// <para>
/// The sandbox cannot prevent a script from allocating unbounded memory or
/// spinning in a tight loop — those are budget concerns, handled by the
/// per-evaluation timeout (Stream A.4) + the test-harness (Stream F.4) that lets
/// operators preview output before publishing.
/// </para>
/// </remarks>
public static class ScriptSandbox
{
/// <summary>
/// Build the <see cref="ScriptOptions"/> used for every virtual-tag / alarm
/// script. <paramref name="contextType"/> is the concrete
/// <see cref="ScriptContext"/> subclass the globals will be of — the compiler
/// uses its type to resolve <c>ctx.GetTag(...)</c> calls.
/// </summary>
public static ScriptOptions Build(Type contextType)
{
if (contextType is null) throw new ArgumentNullException(nameof(contextType));
if (!typeof(ScriptContext).IsAssignableFrom(contextType))
throw new ArgumentException(
$"Script context type must derive from {nameof(ScriptContext)}", nameof(contextType));
// Allow-listed assemblies — each explicitly chosen. Adding here is a
// plan-level decision; do not expand casually. HashSet so adding the
// contextType's assembly is idempotent when it happens to be Core.Scripting
// already.
var allowedAssemblies = new HashSet<System.Reflection.Assembly>
{
// System.Private.CoreLib — primitives (int, double, bool, string, DateTime,
// TimeSpan, Math, Convert, nullable<T>). Can't practically script without it.
typeof(object).Assembly,
// System.Linq — IEnumerable extensions (Where / Select / Sum / Average / etc.).
typeof(System.Linq.Enumerable).Assembly,
// Core.Abstractions — DataValueSnapshot + DriverDataType so scripts can name
// the types they receive from ctx.GetTag.
typeof(DataValueSnapshot).Assembly,
// Core.Scripting itself — ScriptContext base class + Deadband static.
typeof(ScriptContext).Assembly,
// Serilog.ILogger — script-side logger type.
typeof(Serilog.ILogger).Assembly,
// Concrete context type's assembly — production contexts subclass
// ScriptContext in Core.VirtualTags / Core.ScriptedAlarms; tests use their
// own subclass. The globals wrapper is generic on this type so Roslyn must
// be able to resolve it during compilation.
contextType.Assembly,
};
var allowedImports = new[]
{
"System",
"System.Linq",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions",
"ZB.MOM.WW.OtOpcUa.Core.Scripting",
};
return ScriptOptions.Default
.WithReferences(allowedAssemblies)
.WithImports(allowedImports);
}
}


@@ -0,0 +1,102 @@
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Wraps a <see cref="ScriptEvaluator{TContext, TResult}"/> with a per-evaluation
/// wall-clock timeout. Default is 250ms per Phase 7 plan Stream A.4; configurable
/// per tag so deployments with slower backends can widen it.
/// </summary>
/// <remarks>
/// <para>
/// Implemented with <see cref="Task.WaitAsync(TimeSpan, CancellationToken)"/>
/// rather than a cancellation-token-only approach because Roslyn-compiled
/// scripts don't internally poll the cancellation token unless the user code
/// does async work. A CPU-bound infinite loop in a script won't honor a
/// cooperative cancel — <c>WaitAsync</c> returns control when the timeout fires
/// regardless of whether the inner task completes.
/// </para>
/// <para>
/// <b>Known limitation:</b> when a script times out, the underlying ScriptRunner
/// task continues running on a thread-pool thread until the Roslyn runtime
/// returns. In the CPU-bound-infinite-loop case that's effectively "leaked" —
/// the thread is tied up until the runtime decides to return, which it may
/// never do. Phase 7 plan Stream A.4 accepts this as a known trade-off; tighter
/// CPU budgeting would require an out-of-process script runner, which is a v3
/// concern. In practice, the timeout + structured warning log surfaces the
/// offending script so the operator can fix it; the orphan thread is rare.
/// </para>
/// <para>
/// Caller-supplied <see cref="CancellationToken"/> is honored — if the caller
/// cancels before the timeout fires, the caller's cancel wins and the
/// <see cref="OperationCanceledException"/> propagates (not wrapped as
/// <see cref="ScriptTimeoutException"/>). That distinction matters: the
/// virtual-tag engine's shutdown path cancels scripts on dispose; it shouldn't
/// see those as timeouts.
/// </para>
/// </remarks>
public sealed class TimedScriptEvaluator<TContext, TResult>
where TContext : ScriptContext
{
/// <summary>Default timeout per Phase 7 plan Stream A.4 — 250ms.</summary>
public static readonly TimeSpan DefaultTimeout = TimeSpan.FromMilliseconds(250);
private readonly ScriptEvaluator<TContext, TResult> _inner;
/// <summary>Wall-clock budget per evaluation. Script exceeding this throws <see cref="ScriptTimeoutException"/>.</summary>
public TimeSpan Timeout { get; }
public TimedScriptEvaluator(ScriptEvaluator<TContext, TResult> inner)
: this(inner, DefaultTimeout)
{
}
public TimedScriptEvaluator(ScriptEvaluator<TContext, TResult> inner, TimeSpan timeout)
{
_inner = inner ?? throw new ArgumentNullException(nameof(inner));
if (timeout <= TimeSpan.Zero)
throw new ArgumentOutOfRangeException(nameof(timeout), "Timeout must be positive.");
Timeout = timeout;
}
public async Task<TResult> RunAsync(TContext context, CancellationToken ct = default)
{
if (context is null) throw new ArgumentNullException(nameof(context));
// Push evaluation to a thread-pool thread so a CPU-bound script (e.g. a tight
// loop with no async work) doesn't hog the caller's thread before WaitAsync
// gets to register its timeout. Without this, Roslyn's ScriptRunner executes
// synchronously on the calling thread and returns an already-completed Task,
// so WaitAsync sees a completed task and never fires the timeout.
var runTask = Task.Run(() => _inner.RunAsync(context, ct), ct);
try
{
return await runTask.WaitAsync(Timeout, ct).ConfigureAwait(false);
}
catch (TimeoutException)
{
// WaitAsync's synthesized timeout — the inner task may still be running
// on its thread-pool thread (known leak documented in the class summary).
// Wrap so callers can distinguish the evaluator's timeout from a
// TimeoutException thrown by the script's own code.
throw new ScriptTimeoutException(Timeout);
}
}
}
/// <summary>
/// Thrown when a script evaluation exceeds its configured timeout. The virtual-tag
/// engine (Stream B) catches this + maps the owning tag's quality to
/// <c>BadInternalError</c> per Phase 7 plan decision #11, logging a structured
/// warning with the offending script name so operators can locate + fix it.
/// </summary>
public sealed class ScriptTimeoutException : Exception
{
public TimeSpan Timeout { get; }
public ScriptTimeoutException(TimeSpan timeout)
: base($"Script evaluation exceeded the configured timeout of {timeout.TotalMilliseconds:F1} ms. " +
"The script was either CPU-bound or blocked on a slow operation; check ctx.Logger output " +
"around the timeout and consider widening the timeout per tag, simplifying the script, or " +
"moving heavy work out of the evaluation path.")
{
Timeout = timeout;
}
}
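A minimal, self-contained sketch of the `Task.Run` + `WaitAsync` shape the remarks describe — a CPU-bound loop that never polls its token, raced against a wall-clock budget. `BusyLoop` and `RunWithTimeoutAsync` are illustrative names, not part of this PR:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// A CPU-bound "script" that never polls its token — the case the remarks call out.
static int BusyLoop(CancellationToken ct)
{
    while (true) { } // spins forever; cooperative cancel can't stop it
}

static async Task<string> RunWithTimeoutAsync(TimeSpan timeout, CancellationToken ct)
{
    // Same shape as TimedScriptEvaluator.RunAsync: hop to the thread pool first so the
    // spinning work can't complete synchronously on the caller's thread, then race it.
    var runTask = Task.Run(() => BusyLoop(ct), ct);
    try
    {
        var result = await runTask.WaitAsync(timeout, ct).ConfigureAwait(false);
        return $"completed: {result}";
    }
    catch (TimeoutException)
    {
        // Control returns here even though BusyLoop is still spinning
        // on its pool thread — the documented "leaked thread" trade-off.
        return "timed out";
    }
}

var outcome = await RunWithTimeoutAsync(TimeSpan.FromMilliseconds(50), CancellationToken.None);
Console.WriteLine(outcome); // prints "timed out"
```

Without the `Task.Run` hop, a synchronous loop would run to (non-)completion on the calling thread before `WaitAsync` ever registered its timeout — which is exactly the failure mode the inline comment in `RunAsync` guards against.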


@@ -0,0 +1,35 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<TargetFramework>net10.0</TargetFramework>
<Nullable>enable</Nullable>
<ImplicitUsings>enable</ImplicitUsings>
<LangVersion>latest</LangVersion>
<TreatWarningsAsErrors>true</TreatWarningsAsErrors>
<GenerateDocumentationFile>true</GenerateDocumentationFile>
<NoWarn>$(NoWarn);CS1591</NoWarn>
<RootNamespace>ZB.MOM.WW.OtOpcUa.Core.Scripting</RootNamespace>
</PropertyGroup>
<ItemGroup>
<!-- Roslyn scripting API — compiles user C# snippets with a constrained ScriptOptions
allow-list so scripts can't reach Process/File/HttpClient/reflection. Per Phase 7
plan decisions #1 + #6. -->
<PackageReference Include="Microsoft.CodeAnalysis.CSharp.Scripting" Version="4.12.0"/>
<PackageReference Include="Serilog" Version="4.2.0"/>
</ItemGroup>
<ItemGroup>
<ProjectReference Include="..\ZB.MOM.WW.OtOpcUa.Core.Abstractions\ZB.MOM.WW.OtOpcUa.Core.Abstractions.csproj"/>
</ItemGroup>
<ItemGroup>
<InternalsVisibleTo Include="ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests"/>
</ItemGroup>
<ItemGroup>
<NuGetAuditSuppress Include="https://github.com/advisories/GHSA-37gx-xxp4-5rgx"/>
<NuGetAuditSuppress Include="https://github.com/advisories/GHSA-w3x6-4m5h-cxqf"/>
</ItemGroup>
</Project>


@@ -0,0 +1,173 @@
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.OpcUa;
/// <summary>
/// Materializes the canonical Unified Namespace browse tree for an Equipment-kind
/// <see cref="Configuration.Entities.Namespace"/> from the Config DB's
/// <c>UnsArea</c> / <c>UnsLine</c> / <c>Equipment</c> / <c>Tag</c> rows. Runs during
/// address-space build per <see cref="IDriver"/> whose
/// <c>Namespace.Kind = Equipment</c>; SystemPlatform-kind namespaces (Galaxy) are
/// exempt per decision #120 and reach this walker only indirectly through
/// <see cref="ITagDiscovery.DiscoverAsync"/>.
/// </summary>
/// <remarks>
/// <para>
/// <b>Composition strategy.</b> ADR-001 (2026-04-20) accepted Option A — Config
/// primary. The walker treats the supplied <see cref="EquipmentNamespaceContent"/>
/// snapshot as the authoritative published surface. Every Equipment row becomes a
/// folder node at the UNS level-5 segment; every <see cref="Tag"/> bound to an
/// Equipment (non-null <see cref="Tag.EquipmentId"/>) becomes a variable node under
/// it. Driver-discovered tags that have no Config-DB row are not added by this
/// walker — the ITagDiscovery path continues to exist for the SystemPlatform case +
/// for enrichment, but Equipment-kind composition is fully Tag-row-driven.
/// </para>
///
/// <para>
/// <b>Under each Equipment node.</b> Five identifier properties per decision #121
/// (<c>EquipmentId</c>, <c>EquipmentUuid</c>, <c>MachineCode</c>, <c>ZTag</c>,
/// <c>SAPID</c>) are added as OPC UA properties — external systems (ERP, SAP PM)
/// resolve equipment by whichever identifier they natively use without a sidecar.
/// <see cref="IdentificationFolderBuilder.Build"/> materializes the OPC 40010
/// Identification sub-folder with the nine decision-#139 fields when at least one
/// is non-null; when all nine are null the sub-folder is omitted rather than
/// appearing empty.
/// </para>
///
/// <para>
/// <b>Address resolution.</b> Variable nodes carry the driver-side full reference
/// in <see cref="DriverAttributeInfo.FullName"/> copied from <c>Tag.TagConfig</c>
/// (the wire-level address JSON blob whose interpretation is driver-specific). At
/// runtime the dispatch layer routes Read/Write calls through the configured
/// capability invoker; an unreachable address surfaces as an OPC UA Bad status via
/// the natural driver-read failure path, NOT as a build-time reject. The ADR calls
/// this "BadNotFound placeholder" behavior — legible to operators via their Admin
/// UI + OPC UA client inspection of node status.
/// </para>
///
/// <para>
/// <b>Pure function.</b> This class has no dependency on the OPC UA SDK, no
/// Config-DB access, no state. It consumes pre-loaded EF Core rows + streams calls
/// into the supplied <see cref="IAddressSpaceBuilder"/>. The server-side wiring
/// (load snapshot → invoke walker → per-tag capability probe) lives in the Task B
/// PR alongside <c>NodeScopeResolver</c>'s Config-DB join.
/// </para>
/// </remarks>
public static class EquipmentNodeWalker
{
/// <summary>
/// Walk <paramref name="content"/> into <paramref name="namespaceBuilder"/>.
/// The builder is scoped to the Equipment-kind namespace root; the walker emits
/// Area → Line → Equipment folders under it, then identifier properties + the
/// Identification sub-folder + variable nodes per bound Tag under each Equipment.
/// </summary>
/// <param name="namespaceBuilder">
/// The builder scoped to the Equipment-kind namespace root. Caller is responsible for
/// creating this (e.g. <c>rootBuilder.Folder(@namespace.NamespaceId, @namespace.NamespaceUri)</c>).
/// </param>
/// <param name="content">Pre-loaded + pre-filtered rows for a single published generation.</param>
public static void Walk(IAddressSpaceBuilder namespaceBuilder, EquipmentNamespaceContent content)
{
ArgumentNullException.ThrowIfNull(namespaceBuilder);
ArgumentNullException.ThrowIfNull(content);
// Group lines by area + equipment by line + tags by equipment up-front. Avoids an
// O(N·M) re-scan at each UNS level on large fleets.
var linesByArea = content.Lines
.GroupBy(l => l.UnsAreaId, StringComparer.OrdinalIgnoreCase)
.ToDictionary(g => g.Key, g => g.OrderBy(l => l.Name, StringComparer.Ordinal).ToList(), StringComparer.OrdinalIgnoreCase);
var equipmentByLine = content.Equipment
.GroupBy(e => e.UnsLineId, StringComparer.OrdinalIgnoreCase)
.ToDictionary(g => g.Key, g => g.OrderBy(e => e.Name, StringComparer.Ordinal).ToList(), StringComparer.OrdinalIgnoreCase);
var tagsByEquipment = content.Tags
.Where(t => !string.IsNullOrEmpty(t.EquipmentId))
.GroupBy(t => t.EquipmentId!, StringComparer.OrdinalIgnoreCase)
.ToDictionary(g => g.Key, g => g.OrderBy(t => t.Name, StringComparer.Ordinal).ToList(), StringComparer.OrdinalIgnoreCase);
foreach (var area in content.Areas.OrderBy(a => a.Name, StringComparer.Ordinal))
{
var areaBuilder = namespaceBuilder.Folder(area.Name, area.Name);
if (!linesByArea.TryGetValue(area.UnsAreaId, out var areaLines)) continue;
foreach (var line in areaLines)
{
var lineBuilder = areaBuilder.Folder(line.Name, line.Name);
if (!equipmentByLine.TryGetValue(line.UnsLineId, out var lineEquipment)) continue;
foreach (var equipment in lineEquipment)
{
var equipmentBuilder = lineBuilder.Folder(equipment.Name, equipment.Name);
AddIdentifierProperties(equipmentBuilder, equipment);
IdentificationFolderBuilder.Build(equipmentBuilder, equipment);
if (!tagsByEquipment.TryGetValue(equipment.EquipmentId, out var equipmentTags)) continue;
foreach (var tag in equipmentTags)
AddTagVariable(equipmentBuilder, tag);
}
}
}
}
/// <summary>
/// Adds the five operator-facing identifiers from decision #121 as OPC UA properties
/// on the Equipment node. EquipmentId + EquipmentUuid are always populated;
/// MachineCode is required per <see cref="Equipment"/>; ZTag + SAPID are nullable in
/// the data model so they're skipped when null to avoid empty-string noise in the
/// browse tree.
/// </summary>
private static void AddIdentifierProperties(IAddressSpaceBuilder equipmentBuilder, Equipment equipment)
{
equipmentBuilder.AddProperty("EquipmentId", DriverDataType.String, equipment.EquipmentId);
equipmentBuilder.AddProperty("EquipmentUuid", DriverDataType.String, equipment.EquipmentUuid.ToString());
equipmentBuilder.AddProperty("MachineCode", DriverDataType.String, equipment.MachineCode);
if (!string.IsNullOrEmpty(equipment.ZTag))
equipmentBuilder.AddProperty("ZTag", DriverDataType.String, equipment.ZTag);
if (!string.IsNullOrEmpty(equipment.SAPID))
equipmentBuilder.AddProperty("SAPID", DriverDataType.String, equipment.SAPID);
}
/// <summary>
/// Emit a single Tag row as an <see cref="IAddressSpaceBuilder.Variable"/>. The driver
/// full reference lives in <c>Tag.TagConfig</c> (wire-level address, driver-specific
/// JSON blob); the variable node's data type derives from <c>Tag.DataType</c>.
/// Unreachable-address behavior per ADR-001 Option A: the variable is created; the
/// driver's natural Read failure surfaces an OPC UA Bad status at runtime.
/// </summary>
private static void AddTagVariable(IAddressSpaceBuilder equipmentBuilder, Tag tag)
{
var attr = new DriverAttributeInfo(
FullName: tag.TagConfig,
DriverDataType: ParseDriverDataType(tag.DataType),
IsArray: false,
ArrayDim: null,
SecurityClass: SecurityClassification.FreeAccess,
IsHistorized: false);
equipmentBuilder.Variable(tag.Name, tag.Name, attr);
}
/// <summary>
/// Parse <see cref="Tag.DataType"/> (stored as the <see cref="DriverDataType"/> enum
/// name string, decision #138) into the enum value. Unknown names fall back to
/// <see cref="DriverDataType.String"/> so a one-off driver-specific type doesn't
/// abort the whole walk; the underlying driver still sees the original TagConfig
/// address + can surface its own typed value via the OPC UA variant at read time.
/// </summary>
private static DriverDataType ParseDriverDataType(string raw) =>
Enum.TryParse<DriverDataType>(raw, ignoreCase: true, out var parsed) ? parsed : DriverDataType.String;
}
/// <summary>
/// Pre-loaded + pre-filtered snapshot of one Equipment-kind namespace's worth of Config
/// DB rows. All four collections are scoped to the same
/// <see cref="Configuration.Entities.ConfigGeneration"/> + the same
/// <see cref="Configuration.Entities.Namespace"/> row. The walker assumes this filter
/// was applied by the caller + does no cross-generation or cross-namespace validation.
/// </summary>
public sealed record EquipmentNamespaceContent(
IReadOnlyList<UnsArea> Areas,
IReadOnlyList<UnsLine> Lines,
IReadOnlyList<Equipment> Equipment,
IReadOnlyList<Tag> Tags);
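The up-front `GroupBy` + `ToDictionary` indexing that `Walk` uses (parent-id keys compared case-insensitively, children ordered ordinally by name) can be sketched with a stand-in record — `LineRow` here is a simplified placeholder, not the Config-DB `UnsLine` entity:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var lines = new[]
{
    new LineRow("packing", "L2"),
    new LineRow("PACKING", "L1"), // parent ids compare case-insensitively, like the walker
    new LineRow("filling", "L3"),
};

// One pass builds parent-id -> ordered children; each per-area lookup in the
// walk loop is then O(1) instead of re-scanning the full row set per level.
var linesByArea = lines
    .GroupBy(l => l.UnsAreaId, StringComparer.OrdinalIgnoreCase)
    .ToDictionary(
        g => g.Key,
        g => g.OrderBy(l => l.Name, StringComparer.Ordinal).ToList(),
        StringComparer.OrdinalIgnoreCase);

// "packing" and "PACKING" collapse into one bucket, children sorted ordinally.
Console.WriteLine(string.Join(",", linesByArea["Packing"].Select(l => l.Name))); // L1,L2

record LineRow(string UnsAreaId, string Name);
```

The same shape repeats for `equipmentByLine` and `tagsByEquipment`, which is what keeps the nested Area → Line → Equipment → Tag loops linear in the row count.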


@@ -0,0 +1,129 @@
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Resilience;
/// <summary>
/// Wraps the three mutating surfaces of <see cref="IAlarmSource"/>
/// (<see cref="IAlarmSource.SubscribeAlarmsAsync"/>, <see cref="IAlarmSource.UnsubscribeAlarmsAsync"/>,
/// <see cref="IAlarmSource.AcknowledgeAsync"/>) through <see cref="CapabilityInvoker"/> so the
/// Phase 6.1 resilience pipeline runs — retry semantics match
/// <see cref="DriverCapability.AlarmSubscribe"/> (retries by default) and
/// <see cref="DriverCapability.AlarmAcknowledge"/> (does NOT retry per decision #143).
/// </summary>
/// <remarks>
/// <para>Multi-host dispatch: when the driver implements <see cref="IPerCallHostResolver"/>,
/// each source-node-id is resolved individually + grouped by host so a dead PLC inside a
/// multi-device driver doesn't poison the sibling hosts' breakers. Drivers with a single
/// host fall back to <see cref="IDriver.DriverInstanceId"/> as the single-host key.</para>
///
/// <para>Why this lives here + not on <see cref="CapabilityInvoker"/>: alarm surfaces have a
/// handle-returning shape (SubscribeAlarmsAsync returns <see cref="IAlarmSubscriptionHandle"/>)
/// + a per-call fan-out (AcknowledgeAsync gets a batch of
/// <see cref="AlarmAcknowledgeRequest"/>s that may span multiple hosts). Keeping the fan-out
/// logic here keeps the invoker's execute-overloads narrow.</para>
/// </remarks>
public sealed class AlarmSurfaceInvoker
{
private readonly CapabilityInvoker _invoker;
private readonly IAlarmSource _alarmSource;
private readonly IPerCallHostResolver? _hostResolver;
private readonly string _defaultHost;
public AlarmSurfaceInvoker(
CapabilityInvoker invoker,
IAlarmSource alarmSource,
string defaultHost,
IPerCallHostResolver? hostResolver = null)
{
ArgumentNullException.ThrowIfNull(invoker);
ArgumentNullException.ThrowIfNull(alarmSource);
ArgumentException.ThrowIfNullOrWhiteSpace(defaultHost);
_invoker = invoker;
_alarmSource = alarmSource;
_defaultHost = defaultHost;
_hostResolver = hostResolver;
}
/// <summary>
/// Subscribe to alarm events for a set of source node ids, fanning out by resolved host
/// so per-host breakers / bulkheads apply. Returns one handle per host — callers that
/// don't care about per-host separation may concatenate them.
/// </summary>
public async Task<IReadOnlyList<IAlarmSubscriptionHandle>> SubscribeAsync(
IReadOnlyList<string> sourceNodeIds,
CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(sourceNodeIds);
if (sourceNodeIds.Count == 0) return [];
var byHost = GroupByHost(sourceNodeIds);
var handles = new List<IAlarmSubscriptionHandle>(byHost.Count);
foreach (var (host, ids) in byHost)
{
var handle = await _invoker.ExecuteAsync(
DriverCapability.AlarmSubscribe,
host,
async ct => await _alarmSource.SubscribeAlarmsAsync(ids, ct).ConfigureAwait(false),
cancellationToken).ConfigureAwait(false);
handles.Add(handle);
}
return handles;
}
/// <summary>Cancel an alarm subscription. Routes through the AlarmSubscribe pipeline for parity.</summary>
public ValueTask UnsubscribeAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(handle);
return _invoker.ExecuteAsync(
DriverCapability.AlarmSubscribe,
_defaultHost,
async ct => await _alarmSource.UnsubscribeAlarmsAsync(handle, ct).ConfigureAwait(false),
cancellationToken);
}
/// <summary>
/// Acknowledge alarms. Fans out by resolved host; each host's batch runs through the
/// AlarmAcknowledge pipeline (no-retry per decision #143 — an alarm-ack is not idempotent
/// at the plant-floor acknowledgement level even if the OPC UA spec permits re-issue).
/// </summary>
public async Task AcknowledgeAsync(
IReadOnlyList<AlarmAcknowledgeRequest> acknowledgements,
CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(acknowledgements);
if (acknowledgements.Count == 0) return;
var byHost = _hostResolver is null
? new Dictionary<string, List<AlarmAcknowledgeRequest>> { [_defaultHost] = acknowledgements.ToList() }
: acknowledgements
.GroupBy(a => _hostResolver.ResolveHost(a.SourceNodeId))
.ToDictionary(g => g.Key, g => g.ToList());
foreach (var (host, batch) in byHost)
{
var batchSnapshot = batch; // explicit local for the lambda (foreach variables are per-iteration since C# 5, so this is belt-and-braces)
await _invoker.ExecuteAsync(
DriverCapability.AlarmAcknowledge,
host,
async ct => await _alarmSource.AcknowledgeAsync(batchSnapshot, ct).ConfigureAwait(false),
cancellationToken).ConfigureAwait(false);
}
}
private Dictionary<string, List<string>> GroupByHost(IReadOnlyList<string> sourceNodeIds)
{
if (_hostResolver is null)
return new Dictionary<string, List<string>> { [_defaultHost] = sourceNodeIds.ToList() };
var result = new Dictionary<string, List<string>>(StringComparer.Ordinal);
foreach (var id in sourceNodeIds)
{
var host = _hostResolver.ResolveHost(id);
if (!result.TryGetValue(host, out var list))
result[host] = list = new List<string>();
list.Add(id);
}
return result;
}
}
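The per-host fan-out in `GroupByHost` — resolver present means one bucket per resolved host, resolver absent means everything under the default host key — can be sketched with a plain `Func<string, string>` standing in for `IPerCallHostResolver` (the `plc1`/`plc2` ids and the delegate are illustrative assumptions):

```csharp
using System;
using System.Collections.Generic;

Dictionary<string, List<string>> GroupByHost(
    IReadOnlyList<string> ids, Func<string, string>? resolveHost, string defaultHost)
{
    // No resolver -> single-host driver: everything keys off the default host,
    // mirroring the IDriver.DriverInstanceId fallback described above.
    if (resolveHost is null)
        return new Dictionary<string, List<string>> { [defaultHost] = new List<string>(ids) };

    var result = new Dictionary<string, List<string>>(StringComparer.Ordinal);
    foreach (var id in ids)
    {
        var host = resolveHost(id);
        if (!result.TryGetValue(host, out var list))
            result[host] = list = new List<string>();
        list.Add(id);
    }
    return result;
}

var ids = new[] { "plc1/alarmA", "plc2/alarmB", "plc1/alarmC" };
var byHost = GroupByHost(ids, id => id.Split('/')[0], "driver-1");
Console.WriteLine(byHost["plc1"].Count); // 2 — plc2's breaker tripping can't poison these

var fallback = GroupByHost(ids, null, "driver-1");
Console.WriteLine(fallback["driver-1"].Count); // 3 — single-host fallback
```

Each bucket then gets its own `ExecuteAsync` call, which is what lets a dead PLC inside a multi-device driver trip only its own breaker.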


@@ -0,0 +1,232 @@
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip;
/// <summary>
/// Task #177 — projects AB Logix ALMD alarm instructions onto the OPC UA alarm surface by
/// polling the ALMD UDT's <c>InFaulted</c> / <c>Acked</c> / <c>Severity</c> members at a
/// configurable interval + translating state transitions into <c>OnAlarmEvent</c>
/// callbacks on the owning <see cref="AbCipDriver"/>. Feature-flagged off by default via
/// <see cref="AbCipDriverOptions.EnableAlarmProjection"/>; callers that leave the flag off
/// get a no-op subscribe path so capability negotiation still works.
/// </summary>
/// <remarks>
/// <para>ALMD-only in this pass. ALMA (analog alarm) projection is a follow-up because
/// its threshold + limit semantics need more design — ALMD's "is the alarm active + has
/// the operator acked" shape maps cleanly onto the driver-agnostic
/// <see cref="IAlarmSource"/> contract without concessions.</para>
///
/// <para>Polling reuses <see cref="AbCipDriver.ReadAsync"/>, so ALMD reads get the #194
/// whole-UDT optimization for free when the ALMD is declared with its standard members.
/// One poll loop per subscription call; the loop batches every
/// member read across the full source-node set into a single ReadAsync per tick.</para>
///
/// <para>ALMD <c>Acked</c> write semantics on Logix are rising-edge sensitive at the
/// instruction level — writing <c>Acked=1</c> directly is honored by FT View + the
/// standard HMI templates, but some PLC programs read <c>AckCmd</c> + look for the edge
/// themselves. We pick the simpler <c>Acked</c> write for first pass; operators whose
/// ladder watches <c>AckCmd</c> can wire a follow-up "AckCmd 0→1→0" pulse on the client
/// side until a driver-level knob lands.</para>
/// </remarks>
internal sealed class AbCipAlarmProjection : IAsyncDisposable
{
private readonly AbCipDriver _driver;
private readonly TimeSpan _pollInterval;
private readonly Dictionary<long, Subscription> _subs = new();
private readonly Lock _subsLock = new();
private long _nextId;
public AbCipAlarmProjection(AbCipDriver driver, TimeSpan pollInterval)
{
ArgumentNullException.ThrowIfNull(driver);
if (pollInterval <= TimeSpan.Zero)
throw new ArgumentOutOfRangeException(nameof(pollInterval), "Poll interval must be positive.");
_driver = driver;
_pollInterval = pollInterval;
}
public async Task<IAlarmSubscriptionHandle> SubscribeAsync(
IReadOnlyList<string> sourceNodeIds, CancellationToken cancellationToken)
{
var id = Interlocked.Increment(ref _nextId);
var handle = new AbCipAlarmSubscriptionHandle(id);
var cts = new CancellationTokenSource();
var sub = new Subscription(handle, [..sourceNodeIds], cts);
lock (_subsLock) _subs[id] = sub;
sub.Loop = Task.Run(() => RunPollLoopAsync(sub, cts.Token), cts.Token);
await Task.CompletedTask; // no awaits yet — keeps the async signature for future setup work
return handle;
}
public async Task UnsubscribeAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken)
{
if (handle is not AbCipAlarmSubscriptionHandle h) return;
Subscription? sub;
lock (_subsLock)
{
if (!_subs.Remove(h.Id, out sub)) return;
}
try { sub.Cts.Cancel(); } catch { }
try { await sub.Loop.ConfigureAwait(false); } catch { }
sub.Cts.Dispose();
}
public async Task AcknowledgeAsync(
IReadOnlyList<AlarmAcknowledgeRequest> acknowledgements, CancellationToken cancellationToken)
{
if (acknowledgements.Count == 0) return;
// Write Acked=1 per request. IWritable isn't on AbCipAlarmProjection so route through
// the driver's public interface — delegating instead of re-implementing the write path
// keeps the bit-in-DINT + idempotency + per-call-host-resolve knobs intact.
var requests = acknowledgements
.Select(a => new WriteRequest($"{a.SourceNodeId}.Acked", true))
.ToArray();
// Best-effort — the driver's WriteAsync returns per-item status; individual ack
// failures don't poison the batch. Discard the statuses so one faulted ack
// doesn't surface as a batch-level failure to the caller.
_ = await _driver.WriteAsync(requests, cancellationToken).ConfigureAwait(false);
}
public async ValueTask DisposeAsync()
{
List<Subscription> snap;
lock (_subsLock) { snap = _subs.Values.ToList(); _subs.Clear(); }
foreach (var sub in snap)
{
try { sub.Cts.Cancel(); } catch { }
try { await sub.Loop.ConfigureAwait(false); } catch { }
sub.Cts.Dispose();
}
}
/// <summary>
/// Poll-tick body — reads <c>InFaulted</c> + <c>Severity</c> for every source node id
/// in the subscription, diffs each against last-seen state, fires raise/clear events.
/// Extracted so tests can drive one tick without standing up the Task.Run loop.
/// </summary>
internal void Tick(Subscription sub, IReadOnlyList<DataValueSnapshot> results)
{
// results index layout: for each sourceNode, [InFaulted, Severity] in order.
for (var i = 0; i < sub.SourceNodeIds.Count; i++)
{
var nodeId = sub.SourceNodeIds[i];
var inFaultedDv = results[i * 2];
var severityDv = results[i * 2 + 1];
if (inFaultedDv.StatusCode != AbCipStatusMapper.Good) continue;
var nowFaulted = ToBool(inFaultedDv.Value);
var severity = ToInt(severityDv.Value);
var wasFaulted = sub.LastInFaulted.GetValueOrDefault(nodeId, false);
sub.LastInFaulted[nodeId] = nowFaulted;
if (!wasFaulted && nowFaulted)
{
_driver.InvokeAlarmEvent(new AlarmEventArgs(
sub.Handle, nodeId, ConditionId: $"{nodeId}#active",
AlarmType: "ALMD",
Message: $"ALMD {nodeId} raised",
Severity: MapSeverity(severity),
SourceTimestampUtc: DateTime.UtcNow));
}
else if (wasFaulted && !nowFaulted)
{
_driver.InvokeAlarmEvent(new AlarmEventArgs(
sub.Handle, nodeId, ConditionId: $"{nodeId}#active",
AlarmType: "ALMD",
Message: $"ALMD {nodeId} cleared",
Severity: MapSeverity(severity),
SourceTimestampUtc: DateTime.UtcNow));
}
}
}
private async Task RunPollLoopAsync(Subscription sub, CancellationToken ct)
{
var refs = new List<string>(sub.SourceNodeIds.Count * 2);
foreach (var nodeId in sub.SourceNodeIds)
{
refs.Add($"{nodeId}.InFaulted");
refs.Add($"{nodeId}.Severity");
}
while (!ct.IsCancellationRequested)
{
try
{
var results = await _driver.ReadAsync(refs, ct).ConfigureAwait(false);
Tick(sub, results);
}
catch (OperationCanceledException) when (ct.IsCancellationRequested) { break; }
catch { /* per-tick failures are non-fatal; next tick retries */ }
try { await Task.Delay(_pollInterval, ct).ConfigureAwait(false); }
catch (OperationCanceledException) { break; }
}
}
internal static AlarmSeverity MapSeverity(int raw) => raw switch
{
<= 250 => AlarmSeverity.Low,
<= 500 => AlarmSeverity.Medium,
<= 750 => AlarmSeverity.High,
_ => AlarmSeverity.Critical,
};
private static bool ToBool(object? v) => v switch
{
bool b => b,
int i => i != 0,
long l => l != 0,
_ => false,
};
private static int ToInt(object? v) => v switch
{
int i => i,
long l => (int)l,
short s => s,
byte b => b,
_ => 0,
};
internal sealed class Subscription
{
public Subscription(AbCipAlarmSubscriptionHandle handle, IReadOnlyList<string> sourceNodeIds, CancellationTokenSource cts)
{
Handle = handle; SourceNodeIds = sourceNodeIds; Cts = cts;
}
public AbCipAlarmSubscriptionHandle Handle { get; }
public IReadOnlyList<string> SourceNodeIds { get; }
public CancellationTokenSource Cts { get; }
public Task Loop { get; set; } = Task.CompletedTask;
public Dictionary<string, bool> LastInFaulted { get; } = new(StringComparer.Ordinal);
}
}
/// <summary>Handle returned by <see cref="AbCipAlarmProjection.SubscribeAsync"/>.</summary>
public sealed record AbCipAlarmSubscriptionHandle(long Id) : IAlarmSubscriptionHandle
{
public string DiagnosticId => $"abcip-alarm-sub-{Id}";
}
/// <summary>
/// Detects the ALMD / ALMA signature in an <see cref="AbCipTagDefinition"/>'s declared
/// members. Used by both discovery (to stamp <c>IsAlarm=true</c> on the emitted
/// variable) + initial driver setup (to decide which tags the alarm projection owns).
/// </summary>
public static class AbCipAlarmDetector
{
/// <summary>
/// <c>true</c> when <paramref name="tag"/> is a Structure whose declared members match
/// the ALMD signature (<c>InFaulted</c> + <c>Acked</c> present). ALMA detection
/// (analog alarms with <c>HHLimit</c>/<c>HLimit</c>/<c>LLimit</c>/<c>LLLimit</c>)
/// ships as a follow-up.
/// </summary>
public static bool IsAlmd(AbCipTagDefinition tag)
{
if (tag.DataType != AbCipDataType.Structure || tag.Members is null) return false;
var names = tag.Members.Select(m => m.Name).ToHashSet(StringComparer.OrdinalIgnoreCase);
return names.Contains("InFaulted") && names.Contains("Acked");
}
}
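The raise/clear diffing inside `Tick` is a plain rising/falling-edge detector over the last-seen `InFaulted` map. A stand-alone sketch (node id and the `Observe` helper are illustrative, not the driver's API):

```csharp
using System;
using System.Collections.Generic;

var lastInFaulted = new Dictionary<string, bool>(StringComparer.Ordinal);
var events = new List<string>();

// One observation per poll tick: compare current InFaulted against last-seen,
// record last-seen, and fire only on transitions — mirroring Tick()'s branches.
void Observe(string nodeId, bool nowFaulted)
{
    var wasFaulted = lastInFaulted.GetValueOrDefault(nodeId, false);
    lastInFaulted[nodeId] = nowFaulted;
    if (!wasFaulted && nowFaulted) events.Add($"{nodeId} raised");
    else if (wasFaulted && !nowFaulted) events.Add($"{nodeId} cleared");
    // steady state (true->true or false->false) emits nothing
}

Observe("Tank1.HiLevel", true);  // rising edge  -> raised
Observe("Tank1.HiLevel", true);  // steady state -> nothing
Observe("Tank1.HiLevel", false); // falling edge -> cleared
Console.WriteLine(string.Join(" | ", events)); // Tank1.HiLevel raised | Tank1.HiLevel cleared
```

Because a node that starts in the faulted state has no last-seen entry, the `GetValueOrDefault(..., false)` default means the very first faulted poll is reported as a rising edge — which matches how `Tick` surfaces alarms that were already active when the subscription began.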

Some files were not shown because too many files have changed in this diff.