lmxopcua/docs/v2/lmx-followups.md
Joseph Doherty 8adc8f5ab8 Phase 3 PR 37 — End-to-end live-stack Galaxy smoke test. Closes the code side of LMX follow-up #5; once OtOpcUaGalaxyHost is installed + started on the dev box, the suite exercises the full topology GalaxyProxyDriver in-process → named-pipe IPC → running OtOpcUaGalaxyHost Windows service → MxAccessGalaxyBackend → live MXAccess runtime → real deployed Galaxy objects. Never spawns the Host process itself — connects to the already-running service per project_galaxy_host_service.md, which is the only way to exercise the production COM-apartment + service-account + pipe-ACL configuration.
LiveStackConfig resolves the pipe name + per-install shared secret from two sources in order: OTOPCUA_GALAXY_PIPE + OTOPCUA_GALAXY_SECRET env vars first (for CI / benchwork overrides), then the service's per-process Environment registry values under HKLM\SYSTEM\CurrentControlSet\Services\OtOpcUaGalaxyHost (what Install-Services.ps1 writes at install time). Registry read requires the test host to run elevated on most boxes — the skip message says so explicitly so operators see the right remediation. Hard-coded secrets are deliberately avoided: the installer generates 32 fresh random bytes per install, a committed secret would diverge from production the moment the service is re-installed.
LiveStackFixture is an IAsyncLifetime that (1) runs AvevaPrerequisites.CheckAllAsync with CheckGalaxyHostPipe=true + CheckHistorian=false — produces a structured PrerequisiteReport whose SkipReason is the exact operator-facing 'here's what you need to fix' text, (2) resolves LiveStackConfig and surfaces a clear skip when the secret isn't discoverable, (3) instantiates GalaxyProxyDriver + calls InitializeAsync (the IPC handshake), capturing a skip with the exception detail + common-cause hints (secret mismatch, SID not in pipe ACL, Host's backend couldn't connect to ZB) rather than letting a NullRef cascade through every subsequent test. SkipIfUnavailable() translates the captured SkipReason into Assert.Skip at the top of every fact so tests read as cleanly-skipped with a visible reason, not silently-passed or crashed.
LiveStackSmokeTests (5 facts, Collection=LiveStack, Category=LiveGalaxy): Fixture_initialized_successfully (cheapest possible end-to-end assertion — if this passes, the IPC handshake worked); Driver_reports_Healthy_after_IPC_handshake (DriverHealth.State post-connect); DiscoverAsync_returns_at_least_one_variable_from_live_galaxy (captures every Variable() call from DiscoverAsync via CapturingAddressSpaceBuilder and asserts > 0 — zero here usually means the Host couldn't read ZB, the skip message names OTOPCUA_GALAXY_ZB_CONN to check); GetHostStatuses_reports_at_least_one_platform (IHostConnectivityProbe surface — zero means the probe loop hasn't fired or no Platform is deployed locally); Can_read_a_discovered_variable_from_live_galaxy (reads the first discovered attribute's full reference, asserts status != BadInternalError — Galaxy's Uncertain-quality-until-first-Engine-scan is intentionally NOT treated as failure since it depends on runtime state that varies across test runs). Read-only by design; writes need an agreed scratch tag to avoid mutating a process-critical attribute — deferred to a follow-up PR that reuses this fixture.
CapturingAddressSpaceBuilder is a minimal IAddressSpaceBuilder that flattens every Variable() call into a list so tests can inspect what discovery produced without booting the full OPC UA node-manager stack; alarm annotation + property calls are no-ops. Scoped private to the test class.
Galaxy.Proxy.Tests csproj gains a ProjectReference to Driver.Galaxy.TestSupport (PR 36) for AvevaPrerequisites. The NU1702 warning about the Host project being net48-referenced-by-net10 is pre-existing from the HostSubprocessParityTests — Proxy.Tests only needs the Host EXE path for that parity scenario, not type surface.
Test run on THIS machine (OtOpcUaGalaxyHost not yet installed): Skipped! Failed 0, Passed 0, Skipped 5 — each skip message includes the full prerequisites report pointing at the missing service. Once the service is installed + started (scripts\install\Install-Services.ps1), the 5 facts will execute against live Galaxy. Proxy.Tests Unit: 17 pass / 0 fail (unchanged — new tests are Category=LiveGalaxy, separate suite). Full Proxy build clean. Memory already captures the 'live tests run via already-running service, don't spawn' convention (project_galaxy_host_service.md).
lmx-followups.md #5 updated: status is 'IN PROGRESS' across PRs 36 + 37 with the explicit remaining work (install + start services, subscribe-and-receive, write round-trip).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:49:51 -04:00


LMX Galaxy bridge — remaining follow-ups

State after PR 19: the Galaxy driver is functionally at v1 parity through the IDriver abstraction; the OPC UA server runs with LDAP-authenticated Basic256Sha256 endpoints and alarms are observable through AlarmConditionState.ReportEvent. The items below are what remains LMX-specific before the stack can fully replace the v1 deployment, in rough priority order.

1. Proxy-side IHistoryProvider for ReadAtTime / ReadEvents

Status: Capability surface complete (PR 35). The next step is wiring the OPC UA HistoryRead service handlers in DriverNodeManager; the integration test is still pending.

PR 35 extended IHistoryProvider with ReadAtTimeAsync + ReadEventsAsync (default throwing implementations so existing impls keep compiling), added the HistoricalEvent + HistoricalEventsResult records to Core.Abstractions, and implemented both methods in GalaxyProxyDriver on top of the PR 10 / PR 11 IPC messages. Wire-to-domain mapping (ToHistoricalEvent) is unit-tested for field fidelity, null-preservation, and DateTimeKind.Utc.
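The "default throwing implementations" pattern can be sketched in a language-neutral way. The Python below is only an illustration, not the project's C# code: the class names, method parameters, and the probe helper are all hypothetical stand-ins, and the real interface shape may differ.

```python
import asyncio


class HistoryProvider:
    """Hypothetical Python stand-in for the IHistoryProvider capability surface."""

    async def read_at_time(self, full_reference, timestamps_utc):
        # Default body: providers written before this method existed inherit it
        # and fail loudly only when the capability is actually requested.
        raise NotImplementedError("ReadAtTime is not supported by this provider")

    async def read_events(self, source_name, start_utc, end_utc, max_events):
        raise NotImplementedError("ReadEvents is not supported by this provider")


class LegacyProvider(HistoryProvider):
    """Predates the new methods; needs no changes to keep loading."""


class ProxyProvider(HistoryProvider):
    """Overrides one capability; a real driver would forward the call over IPC."""

    async def read_at_time(self, full_reference, timestamps_utc):
        # Placeholder samples: one (timestamp, value) pair per requested time.
        return [(ts, None) for ts in timestamps_utc]


async def supports_read_at_time(provider):
    """Probe the capability with an empty request (hypothetical helper)."""
    try:
        await provider.read_at_time("probe", [])
        return True
    except NotImplementedError:
        return False
```

With this shape, adding a method to the base type never breaks an existing provider; the cost is that an unsupported capability is discovered at call time rather than at compile time.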

Remaining:

  • DriverNodeManager wires the new capability methods onto HistoryRead AtTime + Events service handlers.
  • Integration test: OPC UA client calls HistoryReadAtTime / HistoryReadEvents, value flows through IPC to the Host's HistorianDataSource, back to the client.

2. Write-gating by role — DONE (PR 26)

Landed in PR 26. WriteAuthzPolicy in Server/Security/ maps SecurityClassification → required role (FreeAccess → no role required, Operate/SecuredWrite → WriteOperate, Tune → WriteTune, Configure/VerifiedWrite → WriteConfigure, ViewOnly → deny regardless). DriverNodeManager caches the classification per variable during discovery and checks the session's roles (via IRoleBearer) in OnWriteValue before calling IWritable.WriteAsync. Roles do not cascade: a session with WriteOperate can't write a Tune attribute unless it also carries WriteTune.
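The non-cascading rule is easiest to see as a lookup table. The following is a minimal Python sketch of the mapping described above, not the actual WriteAuthzPolicy code; the enum, the table name, and the `is_write_allowed` helper are illustrative assumptions.

```python
from enum import Enum


class SecurityClassification(Enum):
    FREE_ACCESS = "FreeAccess"
    OPERATE = "Operate"
    SECURED_WRITE = "SecuredWrite"
    TUNE = "Tune"
    CONFIGURE = "Configure"
    VERIFIED_WRITE = "VerifiedWrite"
    VIEW_ONLY = "ViewOnly"


# Each gated classification requires exactly one role; there is no hierarchy.
REQUIRED_ROLE = {
    SecurityClassification.OPERATE: "WriteOperate",
    SecurityClassification.SECURED_WRITE: "WriteOperate",
    SecurityClassification.TUNE: "WriteTune",
    SecurityClassification.CONFIGURE: "WriteConfigure",
    SecurityClassification.VERIFIED_WRITE: "WriteConfigure",
}


def is_write_allowed(classification, session_roles):
    """Return True when a session holding session_roles may write the attribute."""
    if classification is SecurityClassification.VIEW_ONLY:
        return False  # denied regardless of roles
    if classification is SecurityClassification.FREE_ACCESS:
        return True  # no role required
    # Exact membership test: holding a different write role does not imply this one.
    return REQUIRED_ROLE[classification] in session_roles
```

Because the check is a plain membership test, a session carrying only WriteConfigure is still refused on an Operate attribute; an operator who needs several levels must hold each role explicitly.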

See feedback_acl_at_server_layer.md in memory for the architectural directive that authz stays at the server layer and never delegates to driver-specific auth.

3. Admin UI client-cert trust management — DONE (PR 28)

PR 28 shipped /certificates in the Admin UI. CertTrustService reads the OPC UA server's PKI store root (OpcUaServerOptions.PkiStoreRoot — default %ProgramData%\OtOpcUa\pki) and lists rejected + trusted certs by parsing the .der files directly, so it has no Opc.Ua dependency and runs on any Admin host that can reach the shared PKI directory.

Operator actions: Trust (moves rejected/certs/*.der → trusted/certs/*.der), Delete rejected, Revoke trust. The OPC UA stack re-reads the trusted store on each new client handshake, so no explicit reload signal is needed; operators retry the rejected client's connection after trusting.
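Since the Trust action is a file move inside the PKI store, it can be illustrated without any OPC UA dependency. This is a hedged Python sketch, not CertTrustService itself: the function name, its arguments, and the error handling are assumptions, and only the rejected/certs and trusted/certs layout comes from the text above.

```python
import shutil
from pathlib import Path


def trust_certificate(pki_root: Path, file_name: str) -> Path:
    """Move one rejected .der certificate into the trusted store (illustrative)."""
    # Accept a bare file name only, so a crafted name cannot escape the store.
    if Path(file_name).name != file_name:
        raise ValueError("file_name must not contain directory components")
    source = pki_root / "rejected" / "certs" / file_name
    if source.suffix.lower() != ".der" or not source.is_file():
        raise FileNotFoundError(f"no rejected certificate named {file_name}")
    target_dir = pki_root / "trusted" / "certs"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / file_name
    shutil.move(str(source), str(target))
    return target
```

No reload call follows the move, which matches the behavior described above: the next client handshake sees the certificate in the trusted store.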

Deferred: flipping AutoAcceptUntrustedClientCertificates to false as the deployment default. That's a production-hardening config change, not a code gap — the Admin UI is now ready to be the trust gate.

4. Live-LDAP integration test — DONE (PR 31)

PR 31 shipped Server.Tests/LdapUserAuthenticatorLiveTests.cs — 6 live-bind tests against the dev GLAuth instance at localhost:3893, skipped cleanly when the port is unreachable. Covers: valid bind, wrong password, unknown user, empty credentials, single-group → WriteOperate mapping, multi-group admin user surfacing all mapped roles.

Also added UserNameAttribute to LdapOptions (default uid for RFC 2307 compat) so Active Directory deployments can configure sAMAccountName / userPrincipalName without code changes. LdapUserAuthenticatorAdCompatTests (5 unit guards) pins the AD-shape DN parsing + filter escape behaviors. See docs/security.md §"Active Directory configuration" for the AD appsettings snippet.
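To illustrate why the filter-escape guard matters once the user-name attribute is configurable, here is a small Python sketch of building a search filter with RFC 4515 escaping. It is not the LdapUserAuthenticator implementation; both function names are hypothetical, and the real code may escape or structure the filter differently.

```python
# Characters that RFC 4515 requires to be escaped inside a filter assertion value.
_FILTER_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\0": r"\00",
}


def escape_ldap_filter_value(value: str) -> str:
    """Escape user input so it cannot change the structure of the filter."""
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


def build_user_filter(user_name_attribute: str, user_name: str) -> str:
    """Build '(attr=value)' using the configured attribute, e.g. uid or sAMAccountName."""
    return f"({user_name_attribute}={escape_ldap_filter_value(user_name)})"
```

For example, `build_user_filter("sAMAccountName", "jdoe")` yields `(sAMAccountName=jdoe)`, while a login name containing `*` or parentheses is neutralized rather than interpreted as a wildcard or sub-filter.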

Deferred: asserting session.Identity end-to-end on the server side (i.e. drive a full OPC UA session with username/password, then read an IHostConnectivityProbe-style "whoami" node to verify the role surfaced). That needs a test-only address-space node and is a separate PR.

5. Full Galaxy live-service smoke test against the merged v2 stack — IN PROGRESS (PRs 36 + 37)

PR 36 shipped the prerequisites helper (AvevaPrerequisites) that probes every dependency a live smoke test needs and produces actionable skip messages.

PR 37 shipped the live-stack smoke test project structure: tests/Driver.Galaxy.Proxy.Tests/LiveStack/ with LiveStackFixture (connects to the already-running OtOpcUaGalaxyHost Windows service via named pipe; never spawns the Host process) and LiveStackSmokeTests covering:

  • Fixture initializes successfully (IPC handshake succeeds end-to-end).
  • Driver reports DriverState.Healthy post-handshake.
  • DiscoverAsync returns at least one variable from the live Galaxy.
  • GetHostStatuses reports at least one Platform/AppEngine host.
  • ReadAsync on a discovered variable round-trips through Proxy → Host pipe → MXAccess → back without a BadInternalError.

Shared secret + pipe name resolve from OTOPCUA_GALAXY_SECRET / OTOPCUA_GALAXY_PIPE env vars, falling back to reading the service's registry-stored Environment values (requires elevated test host).
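The resolution order can be sketched as follows. This Python is an illustration under stated assumptions rather than the LiveStackConfig source: it assumes the service's Environment value holds `NAME=value` strings under the same two variable names, that both values must come from the same source, and that an unelevated registry read surfaces as a permission error. The `read_service_environment` callable is a hypothetical injection point that keeps the sketch runnable off Windows.

```python
from typing import Callable, Iterable, Mapping, NamedTuple, Optional


class LiveStackConfig(NamedTuple):
    pipe_name: str
    shared_secret: str
    source: str  # "environment" or "registry"


def parse_environment_entries(entries: Iterable[str]) -> dict:
    """Turn 'NAME=value' strings into a dict, ignoring malformed entries."""
    parsed = {}
    for entry in entries:
        name, sep, value = entry.partition("=")
        if sep and name:
            parsed[name] = value
    return parsed


def resolve_live_stack_config(
    env: Mapping[str, str],
    read_service_environment: Callable[[], Iterable[str]],
) -> Optional[LiveStackConfig]:
    """Return the config, or None when the caller should skip with a reason."""
    # 1. Explicit overrides win (CI and bench setups).
    pipe = env.get("OTOPCUA_GALAXY_PIPE")
    secret = env.get("OTOPCUA_GALAXY_SECRET")
    if pipe and secret:
        return LiveStackConfig(pipe, secret, "environment")
    # 2. Fall back to what the installer stored for the service.
    try:
        service_env = parse_environment_entries(read_service_environment())
    except PermissionError:
        return None  # typically means the test host is not elevated
    pipe = service_env.get("OTOPCUA_GALAXY_PIPE")
    secret = service_env.get("OTOPCUA_GALAXY_SECRET")
    if pipe and secret:
        return LiveStackConfig(pipe, secret, "registry")
    return None
```

A caller would pass `os.environ` and a registry reader; returning None instead of raising lets the fixture turn the miss into a readable skip message rather than a crash.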

Remaining:

  • Install + run the OtOpcUaGalaxyHost + OtOpcUa services on the dev box (scripts/install/Install-Services.ps1) so the skip-on-unready tests actually execute and the smoke PR lands green.
  • Subscribe-and-receive-data-change fact (needs a known tag that actually ticks; deferred until operators confirm a scratch tag exists).
  • Write-and-roundtrip fact (needs a test-only UDA or agreed scratch tag so we can't accidentally mutate a process-critical value).

6. Second driver instance on the same server — DONE (PR 32)

Server.Tests/MultipleDriverInstancesIntegrationTests.cs registers two drivers with distinct DriverInstanceIds on one DriverHost, spins up the full OPC UA server, and asserts three behaviors: (1) each driver's namespace URI (urn:OtOpcUa:{id}) resolves to a distinct index in the client's NamespaceUris, (2) browsing one subtree returns that driver's folder and does NOT leak the other driver's folder, (3) reads route to the correct driver — the alpha instance returns 42 while beta returns 99, so a misroute would surface at the assertion layer.
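The routing property under test can be modeled in a few lines. The Python below is a toy model, not the DriverHost or DriverNodeManager code: `DriverRouter` and `StubDriver` are hypothetical, and only the `urn:OtOpcUa:{id}` URI pattern and the 42 / 99 stub values come from the description above.

```python
OPC_UA_NAMESPACE = "http://opcfoundation.org/UA/"  # namespace index 0 by specification


def namespace_uri(driver_instance_id: str) -> str:
    return f"urn:OtOpcUa:{driver_instance_id}"


class StubDriver:
    """Returns a fixed value so a misrouted read is visible in an assertion."""

    def __init__(self, value):
        self.value = value

    def read(self, full_reference):
        return self.value


class DriverRouter:
    """Toy model: one namespace per driver instance, reads routed by index."""

    def __init__(self):
        self.namespace_uris = [OPC_UA_NAMESPACE]
        self._drivers_by_index = {}

    def register(self, driver_instance_id, driver):
        uri = namespace_uri(driver_instance_id)
        if uri in self.namespace_uris:
            raise ValueError(f"duplicate driver instance id: {driver_instance_id}")
        self.namespace_uris.append(uri)
        index = len(self.namespace_uris) - 1
        self._drivers_by_index[index] = driver
        return index

    def read(self, namespace_index, full_reference):
        driver = self._drivers_by_index.get(namespace_index)
        if driver is None:
            raise KeyError(f"no driver owns namespace index {namespace_index}")
        return driver.read(full_reference)
```

Registering an alpha stub returning 42 and a beta stub returning 99 gives two distinct indexes, and a read through each index returns only that driver's value, which is the same shape of assertion the integration test makes against the real server.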

Deferred: the alarm-event multi-driver parity case (two drivers each raising a GalaxyAlarmEvent, assert each condition lands on its owning instance's condition node). Alarm tracking already has its own integration test (AlarmSubscription*); the multi-driver alarm case would need a stub IAlarmSource that's worth its own focused PR.

7. Host-status per-AppEngine granularity → Admin UI dashboard — DONE (PRs 33 + 34)

PR 33 landed the data layer: DriverHostStatus entity + migration with composite key (NodeId, DriverInstanceId, HostName) and two query-supporting indexes (per-cluster drill-down on NodeId, stale-row detection on LastSeenUtc).

PR 34 wired the publisher + consumer. HostStatusPublisher is a BackgroundService in the Server process that walks every registered IHostConnectivityProbe-capable driver every 10s, calls GetHostStatuses(), and upserts rows (LastSeenUtc advances each tick; State + StateChangedUtc update on transitions). Admin UI /hosts page groups by cluster, shows four summary cards (Hosts / Running / Stale / Faulted), and flags rows whose LastSeenUtc is older than 30s as Stale so operators see crashed Servers without waiting for a state change.
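The heartbeat upsert and the stale rule can be sketched together. This Python is an illustrative model only: the in-memory dict stands in for the DriverHostStatus table, the function names are hypothetical, and the behavior exactly at the 30s boundary is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=30)


@dataclass
class HostStatusRow:
    node_id: str
    driver_instance_id: str
    host_name: str
    state: str
    state_changed_utc: datetime
    last_seen_utc: datetime


def upsert_host_status(rows, node_id, driver_instance_id, host_name, state, now):
    """Apply one publisher tick for one host; rows is keyed by the composite key."""
    key = (node_id, driver_instance_id, host_name)
    row = rows.get(key)
    if row is None:
        rows[key] = HostStatusRow(node_id, driver_instance_id, host_name, state, now, now)
        return rows[key]
    row.last_seen_utc = now  # advances on every tick
    if row.state != state:  # state fields move only on a transition
        row.state = state
        row.state_changed_utc = now
    return row


def is_stale(row, now):
    """A row is stale when the publisher has not refreshed it within the window."""
    return now - row.last_seen_utc > STALE_AFTER
```

The separation matters for crash detection: a Server that dies stops advancing LastSeenUtc, so its rows turn stale after 30s even though their State column still shows the last reported value.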

Deferred as follow-ups:

  • Event-driven push (subscribe to OnHostStatusChanged per driver for sub-heartbeat latency). Adds DriverHost lifecycle-event plumbing; 10s polling is fine for operator-scale use.
  • Failure-count column — needs the publisher to track a transition history per host, not just current-state.
  • SignalR fan-out to the Admin page (currently the page polls the DB, not a hub). The DB-polled version is fine at current cadence but a hub push would eliminate the 10s race where a new row sits in the DB before the Admin page notices.