lmxopcua/docs/v2/implementation/phase-7-e2e-smoke.md
Joseph Doherty d11dd0520b Galaxy IPC unblock — live dev-box E2E path
Three root-cause fixes to get an elevated dev-box shell past session open
through to real MXAccess reads:

1. PipeAcl — drop BUILTIN\Administrators deny ACE. UAC's filtered token
   carries the Admins SID as deny-only, so the deny fired even from
   non-elevated admin-account shells. The per-connection SID check in
   PipeServer.VerifyCaller remains the real authorization boundary.

2. PipeServer — swap the Hello-read / VerifyCaller order. ImpersonateNamedPipeClient
   returns ERROR_CANNOT_IMPERSONATE until at least one frame has been read
   from the pipe; reading Hello first satisfies that rule. Previously the
   ACL deny-first path masked this race — removing the deny ACE exposed it.

3. GalaxyIpcClient — add a background reader + single pending-response
   slot. A RuntimeStatusChange event between OpenSessionRequest and
   OpenSessionResponse used to satisfy the caller's single ReadFrameAsync
   and fail CallAsync with "Expected OpenSessionResponse, got
   RuntimeStatusChange". The reader now routes response kinds (and
   ErrorResponse) to the pending TCS and everything else to a handler the
   driver registers in InitializeAsync. The Proxy was already set up to
   raise managed events from RaiseDataChange / RaiseAlarmEvent /
   OnHostConnectivityUpdate — those helpers had no caller until now.

4. RedundancyPublisherHostedService — swallow BadServerHalted while
   polling host.Server.CurrentInstance. StandardServer throws that code
   during startup rather than returning null, so the first poll attempt
   crashed the BackgroundService (and the host) before OnServerStarted
   ran. This race was latent behind the Galaxy init failure above.
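
The routing described in item 3 can be sketched in Python with asyncio. This is an illustrative model, not the GalaxyIpcClient code: the `FrameRouter` class, its method names, and the tuple-shaped frames are assumptions made for the example. Only the behaviour comes from the description above: one pending-response slot, response kinds (and ErrorResponse) complete the pending call, and everything else goes to a registered handler.

```python
import asyncio

# Hypothetical set of response kinds; the real IPC contract defines its own.
RESPONSE_KINDS = {"OpenSessionResponse", "ErrorResponse"}


class FrameRouter:
    """Background reader with a single pending-response slot (illustrative)."""

    def __init__(self, frames: asyncio.Queue, on_event):
        self._frames = frames      # stands in for the pipe's read side
        self._on_event = on_event  # handler for unsolicited frames
        self._pending = None       # at most one in-flight call

    async def run(self):
        while True:
            kind, body = await self._frames.get()
            if kind is None:  # sentinel: peer closed the pipe
                if self._pending and not self._pending.done():
                    self._pending.set_exception(ConnectionError("peer closed"))
                return
            if kind in RESPONSE_KINDS and self._pending and not self._pending.done():
                self._pending.set_result((kind, body))
            else:
                self._on_event(kind, body)

    async def call(self, send, expected_kind):
        self._pending = asyncio.get_running_loop().create_future()
        await send()
        kind, body = await self._pending
        self._pending = None
        if kind == "ErrorResponse":
            raise RuntimeError(body)
        if kind != expected_kind:
            raise RuntimeError(f"Expected {expected_kind}, got {kind}")
        return body


async def demo():
    frames, events = asyncio.Queue(), []
    router = FrameRouter(frames, lambda kind, body: events.append(kind))
    reader = asyncio.create_task(router.run())

    async def send_open_session():
        # An event arrives before the response, as in the original failure.
        await frames.put(("RuntimeStatusChange", "host up"))
        await frames.put(("OpenSessionResponse", "session-1"))

    result = await router.call(send_open_session, "OpenSessionResponse")
    await frames.put((None, None))  # simulate peer close so the reader exits
    await reader
    return result, events
```

Running `asyncio.run(demo())` exercises the event-between-call path: the status-change frame reaches the event handler and the call still receives its response.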

Updates docs that described the Admins deny ACE + mandatory non-elevated
shells, and drops the admin-skip guards from every Galaxy integration +
E2E fixture that had them (IpcHandshakeIntegrationTests, EndToEndIpcTests,
ParityFixture, LiveStackFixture, HostSubprocessParityTests).

Adds GalaxyIpcClientRoutingTests covering the router's
request/response match, ErrorResponse, event-between-call, idle event,
and peer-close paths.

Verified live on the dev box against the p7-smoke cluster (gen 6):
driver registered=1 failedInit=0, Phase 7 bridge subscribed, OPC UA
server up on 4840, MXAccess read round-trip returns real data with
Status=0x00000000.

Task #112 — partial: Galaxy live stack is functional end-to-end. The
supplied test-galaxy.ps1 script still fails because the UNS walker
encodes TagConfig JSON as the tag's NodeId instead of the seeded TagId
(pre-existing; separate issue from this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 16:30:16 -04:00


Phase 7 Live OPC UA E2E Smoke (task #240)

End-to-end validation that the Phase 7 production wiring chain (#243 / #244 / #245 / #246 / #247) actually serves virtual tags + scripted alarms over OPC UA against a real Galaxy + Aveva Historian.

Scope. Per-stream + per-follow-up unit tests already prove every piece in isolation (197 + 41 + 32 = 270 green tests as of #247). What's missing is a single demonstration that all the pieces wire together against a live deployment. This runbook is that demonstration.

Prerequisites

| Component | How to verify |
|---|---|
| AVEVA Galaxy + MXAccess installed | Get-Service ArchestrA* returns at least one running service |
| OtOpcUaGalaxyHost Windows service running | sc query OtOpcUaGalaxyHost → STATE: 4 RUNNING |
| Galaxy.Host shared secret matches .local/galaxy-host-secret.txt | Set during NSSM install; see docs/ServiceHosting.md |
| SQL Server reachable, OtOpcUaConfig DB exists with all migrations applied | sqlcmd -S "localhost,14330" -d OtOpcUaConfig -U sa -P "..." -Q "SELECT COUNT(*) FROM dbo.__EFMigrationsHistory" returns ≥ 11 |
| Server's appsettings.json Node:ConfigDbConnectionString matches your SQL Server | cat src/ZB.MOM.WW.OtOpcUa.Server/appsettings.json |
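
Two of these checks are easy to script. The sketch below is a hypothetical helper, not part of the repository: it parses captured `sc query` output for the running state and applies the ≥ 11 migration threshold. The sample text is illustrative and follows the usual `sc query` layout.

```python
import re


def service_is_running(sc_query_output: str) -> bool:
    """Return True if captured `sc query` output reports STATE 4 (RUNNING)."""
    match = re.search(r"^\s*STATE\s*:\s*(\d+)\s+(\w+)", sc_query_output, re.MULTILINE)
    return bool(match) and match.group(1) == "4" and match.group(2) == "RUNNING"


def migrations_applied(row_count: int, minimum: int = 11) -> bool:
    """The runbook expects at least 11 rows in dbo.__EFMigrationsHistory."""
    return row_count >= minimum


# Illustrative sample, not captured from the dev box.
SAMPLE_RUNNING = """
SERVICE_NAME: OtOpcUaGalaxyHost
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
        WIN32_EXIT_CODE    : 0  (0x0)
"""

SAMPLE_STOPPED = SAMPLE_RUNNING.replace("4  RUNNING", "1  STOPPED")
```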

Galaxy.Host pipe ACL. The pipe allows the configured OTOPCUA_ALLOWED_SID (typically the user that runs OtOpcUaGalaxyHost; dohertj2 on the dev box). Run the Server under the same user. Elevation doesn't matter: PipeAcl.cs no longer denies BUILTIN\Administrators, because UAC's filtered token carries the Admins SID as deny-only, so the deny ACE also blocked non-elevated shells on dev-box admin accounts.

Setup

1. Migrate the Config DB

cd src/ZB.MOM.WW.OtOpcUa.Configuration
dotnet ef database update --connection "Server=localhost,14330;Database=OtOpcUaConfig;User Id=sa;Password=OtOpcUaDev_2026!;TrustServerCertificate=True;Encrypt=False;"

Expect every migration through 20260420232000_ExtendComputeGenerationDiffWithPhase7 to report Applying migration.... Re-running is a no-op.

2. Seed the smoke fixture

sqlcmd -S "localhost,14330" -d OtOpcUaConfig -U sa -P "OtOpcUaDev_2026!" `
       -I -i scripts/smoke/seed-phase-7-smoke.sql

Expected output ends with Phase 7 smoke seed complete. plus a Cluster / Node / Generation summary. Idempotent — re-running wipes the prior smoke state and starts clean.

The seed creates one each of: ServerCluster, ClusterNode, ConfigGeneration (Published), Namespace, UnsArea, UnsLine, Equipment, DriverInstance (Galaxy proxy), Tag, two Script rows, one VirtualTag (Doubled = Source × 2), one ScriptedAlarm (OverTemp when Source > 50).

3. Replace the Galaxy attribute placeholder

scripts/smoke/seed-phase-7-smoke.sql inserts a dbo.Tag.TagConfig JSON with FullName = "REPLACE_WITH_REAL_GALAXY_ATTRIBUTE". Either edit the SQL and re-run the seed, or run UPDATE dbo.Tag SET TagConfig = N'{"FullName":"YourReal.GalaxyAttr","DataType":"Float64"}' WHERE TagId='p7-smoke-tag-source'. Pick an attribute that exists on the running Galaxy and has a numeric value the script can multiply.
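
If you prefer to generate the UPDATE rather than hand-edit JSON inside a SQL string, a small helper along these lines can build it. The function name is hypothetical. The table, column, TagId, and JSON keys are the ones quoted above.

```python
import json


def build_tagconfig_update(tag_id: str, full_name: str, data_type: str = "Float64") -> str:
    """Build the UPDATE statement that swaps in a real Galaxy attribute."""
    config = json.dumps(
        {"FullName": full_name, "DataType": data_type},
        separators=(",", ":"),  # compact form, matching the runbook example
    )
    # Double any single quotes so the values stay valid T-SQL string literals.
    config_literal = config.replace("'", "''")
    tag_literal = tag_id.replace("'", "''")
    return (
        f"UPDATE dbo.Tag SET TagConfig = N'{config_literal}' "
        f"WHERE TagId = '{tag_literal}';"
    )


statement = build_tagconfig_update("p7-smoke-tag-source", "YourReal.GalaxyAttr")
```

The resulting statement can be passed to sqlcmd with `-Q`. For anything beyond a dev-box smoke fixture, a parameterised query would be the safer choice.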

4. Point Server.appsettings at the smoke node

{
  "Node": {
    "NodeId":    "p7-smoke-node",
    "ClusterId": "p7-smoke",
    "ConfigDbConnectionString": "Server=localhost,14330;..."
  }
}
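
The same edit can be applied programmatically. The helper below is a sketch under two assumptions: the function name is made up for illustration, and the file is strict JSON (Python's `json` module rejects comments, which some appsettings files contain). It sets only the three Node keys shown above and leaves every other setting untouched.

```python
import json


def point_at_smoke_node(appsettings_text: str, connection_string: str) -> str:
    """Return appsettings JSON with the Node section aimed at the smoke node."""
    settings = json.loads(appsettings_text)
    node = settings.setdefault("Node", {})
    node["NodeId"] = "p7-smoke-node"
    node["ClusterId"] = "p7-smoke"
    node["ConfigDbConnectionString"] = connection_string
    return json.dumps(settings, indent=2)


# Illustrative input; the "Logging" key is only there to show it is preserved.
original = '{"Logging": {"Level": "Information"}, "Node": {"NodeId": "other"}}'
patched = json.loads(point_at_smoke_node(original, "Server=localhost,14330;..."))
```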

Run

5. Start the Server

dotnet run --project src/ZB.MOM.WW.OtOpcUa.Server

Expected log markers (in order):

Bootstrap complete: source=db generation=1
Equipment namespace snapshots loaded for 1/1 driver(s) at generation 1
Phase 7 historian sink: driver p7-smoke-galaxy provides IAlarmHistorianWriter — wiring SqliteStoreAndForwardSink
Phase 7: composed engines from generation 1 — 1 virtual tag(s), 1 scripted alarm(s), 2 script(s)
Phase 7 bridge subscribed N attribute(s) from driver GalaxyProxyDriver
OPC UA server started — endpoint=opc.tcp://0.0.0.0:4840/OtOpcUa driverCount=1
Address space populated for driver p7-smoke-galaxy

If any line is missing, investigate the stage it belongs to. Each step has its own log signature, so the broken piece is identifiable.
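
Checking the markers by eye is error-prone on a busy log. A minimal sketch of an in-order check follows. The marker prefixes are taken from the expected lines above. The function name and the idea of feeding it captured log lines are assumptions.

```python
# Prefixes of the expected startup lines, in the order they should appear.
EXPECTED_MARKERS = [
    "Bootstrap complete:",
    "Equipment namespace snapshots loaded for",
    "Phase 7 historian sink:",
    "Phase 7: composed engines from generation",
    "Phase 7 bridge subscribed",
    "OPC UA server started",
    "Address space populated for driver",
]


def first_missing_marker(log_lines, markers=EXPECTED_MARKERS):
    """Return the first marker not found in order, or None if all appear."""
    position = 0
    for marker in markers:
        for index in range(position, len(log_lines)):
            if marker in log_lines[index]:
                position = index + 1  # later markers must come after this line
                break
        else:
            return marker
    return None
```

A return value of `None` means every stage logged in sequence. Any other value names the stage to investigate first.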

6. Validate via Client.CLI

dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- browse -u opc.tcp://localhost:4840/OtOpcUa -r -d 5

Expect to see under the namespace root: lab-floor → galaxy-line → reactor-1 with three child variables: Source (driver-sourced), Doubled (virtual tag, value should track Source×2), and OverTemp (scripted alarm, boolean reflecting whether Source > 50).

Read the virtual tag

dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://localhost:4840/OtOpcUa -n "ns=2;s=p7-smoke-vt-derived"

Expected: a Float64 value approximately equal to 2 × Source. Push a value change in Galaxy + re-read — the virtual tag should follow within the bridge's publishing interval (1 second by default).

Read the scripted alarm

dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://localhost:4840/OtOpcUa -n "ns=2;s=p7-smoke-al-overtemp"

Expected: Boolean, false when Source ≤ 50, true when Source > 50.
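
To know what the two reads should return for a given Source value, the seeded rules (Doubled = Source × 2, OverTemp when Source > 50) can be modelled in a few lines. This is a plain Python model for computing expected values, not the scripts the seed installs. The function names and the tolerance are assumptions.

```python
import math


def expected_doubled(source: float) -> float:
    """Virtual tag rule from the seed: Doubled = Source x 2."""
    return source * 2


def expected_over_temp(source: float, threshold: float = 50.0) -> bool:
    """Scripted alarm rule from the seed: active only when Source > 50."""
    return source > threshold


def doubled_matches(source: float, observed: float, rel_tol: float = 1e-6) -> bool:
    """Compare an observed Doubled read against the model, allowing float noise."""
    return math.isclose(observed, expected_doubled(source), rel_tol=rel_tol)
```

Note the strict inequality: a Source of exactly 50 leaves the alarm inactive.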

Drive the alarm + verify historian queue

In Galaxy, push a Source value above 50. Within ~1 second, OverTemp.Read flips to true. The alarm engine emits a transition to Phase7EngineComposer.RouteToHistorianAsync → SqliteStoreAndForwardSink.EnqueueAsync → drain worker (every 2s) → GalaxyHistorianWriter.WriteBatchAsync → Galaxy.Host pipe → Aveva Historian alarm schema.

Verify the queue absorbed the event:

sqlite3 "$env:ProgramData\OtOpcUa\alarm-historian-queue.db" "SELECT COUNT(*) FROM Queue;"

Should return 0 once the drain worker successfully forwards (or a small positive number while in-flight). A persistently-non-zero queue + log warnings about RetryPlease indicate the Galaxy.Host historian write path is failing — check the Host's log file.
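
The queue behaviour described here follows the usual store-and-forward pattern, sketched below with Python's `sqlite3`. Only the table name `Queue` comes from the query above. The column layout, function names, and the boolean "accepted" result from the writer are assumptions, not the SqliteStoreAndForwardSink schema.

```python
import sqlite3


def open_queue(path=":memory:"):
    db = sqlite3.connect(path)
    # Hypothetical schema; the real sink defines its own columns.
    db.execute(
        "CREATE TABLE IF NOT EXISTS Queue (Id INTEGER PRIMARY KEY, Payload TEXT NOT NULL)"
    )
    return db


def enqueue(db, payload):
    db.execute("INSERT INTO Queue (Payload) VALUES (?)", (payload,))
    db.commit()


def drain_once(db, write_batch, batch_size=100):
    """Forward one batch; delete rows only if the writer accepts them."""
    rows = db.execute(
        "SELECT Id, Payload FROM Queue ORDER BY Id LIMIT ?", (batch_size,)
    ).fetchall()
    if not rows:
        return 0
    if not write_batch([payload for _, payload in rows]):
        return 0  # writer asked for a retry, so the rows stay queued
    db.executemany("DELETE FROM Queue WHERE Id = ?", [(row_id,) for row_id, _ in rows])
    db.commit()
    return len(rows)


def queue_depth(db):
    return db.execute("SELECT COUNT(*) FROM Queue").fetchone()[0]
```

Under this model a persistently non-zero `queue_depth` means the writer keeps refusing batches, which is the symptom the paragraph above tells you to chase in the Host's log.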

Verify in Aveva Historian

Open the Historian Client (or InTouch alarm summary) — the OverTemp activation should appear with EquipmentPath = /lab-floor/galaxy-line/reactor-1 + the rendered message Reactor source value 75.3 exceeded 50 (or whatever value tripped it).

Acceptance Checklist

  • EF migrations applied through 20260420232000_ExtendComputeGenerationDiffWithPhase7
  • Smoke seed completes without errors + creates exactly 1 Published generation
  • Server starts + logs the Phase 7 composition lines
  • Client.CLI browse shows the UNS tree with Source / Doubled / OverTemp under reactor-1
  • Read on Doubled returns 2 × Source value
  • Read on OverTemp returns the live boolean truth of Source > 50
  • Pushing Source past 50 in Galaxy flips OverTemp to true within 1 s
  • SQLite queue drains (COUNT(*) returns to 0 within 2 s of an alarm transition)
  • Historian shows the OverTemp activation event with the rendered message

First-run evidence (2026-04-20 dev box)

Ran the smoke against the live dev environment. Captured log signatures prove the Phase 7 wiring chain executes in production:

[INF] Bootstrapped from central DB: generation 1
[INF] Bootstrap complete: source=CentralDb generation=1
[INF] Phase 7 historian sink: no driver provides IAlarmHistorianWriter — using NullAlarmHistorianSink
[INF] VirtualTagEngine loaded 1 tag(s), 1 upstream subscription(s)
[INF] ScriptedAlarmEngine loaded 1 alarm(s)
[INF] Phase 7: composed engines from generation 1 — 1 virtual tag(s), 1 scripted alarm(s), 2 script(s)

Each line corresponds to a piece shipped in #243 / #244 / #245 / #246 / #247 — the composer ran, engines loaded, historian-sink decision fired, scripts compiled.

Two gaps surfaced (filed as new tasks below, NOT Phase 7 regressions):

  1. No driver-instance bootstrap pipeline. The seeded DriverInstance row never materialised an actual IDriver instance in DriverHost: the log showed Equipment namespace snapshots loaded for 0/0 driver(s). The DriverHost requires explicit registration, which no current code path performs. Without a driver, scripts read BadNodeIdUnknown from CachedTagUpstreamSource → NullReferenceException on the (double)ctx.GetTag(...).Value cast. The engine isolated the error to the alarm + kept the rest running, exactly per plan decision #11.
  2. OPC UA endpoint port collision. Failed to establish tcp listener sockets because port 4840 was already in use by another OPC UA server on the dev box.
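
The second gap is cheap to detect before starting the Server. The probe below is a minimal sketch with a hypothetical function name: it tries to bind the port and reports whether that succeeds. Binding behaviour differs slightly between wildcard and specific addresses and between operating systems, so treat the result as a hint rather than a guarantee.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
    return True


if __name__ == "__main__":
    # 4840 is the endpoint port used in this runbook.
    print("port 4840 free:", port_is_free(4840))
```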

Both are pre-Phase-7 deployment-wiring gaps. Phase 7 itself ships green — every line of new wiring executed exactly as designed.

Known limitations + follow-ups

  • Subscribing to virtual tags via OPC UA monitored items (instead of polled reads) needs VirtualTagSource.SubscribeAsync wiring through DriverNodeManager.OnCreateMonitoredItem — covered as part of release-readiness.
  • Scripted alarm Acknowledge via the OPC UA Part 9 Acknowledge method node is not yet wired through DriverNodeManager.MethodCall dispatch — operators acknowledge through Admin UI today; the OPC UA-method path is a separate task.
  • Phase 7 compliance script (scripts/compliance/phase-7-compliance.ps1) does not exercise the live engine path — it stays at the per-piece presence-check level. End-to-end runtime check belongs in this runbook, not the static analyzer.