Files
lmxopcua/docs/v2/implementation/phase-7-e2e-smoke.md
Joseph Doherty d11dd0520b Galaxy IPC unblock — live dev-box E2E path
Three root-cause fixes to get an elevated dev-box shell past session open
through to real MXAccess reads:

1. PipeAcl — drop BUILTIN\Administrators deny ACE. UAC's filtered token
   carries the Admins SID as deny-only, so the deny fired even from
   non-elevated admin-account shells. The per-connection SID check in
   PipeServer.VerifyCaller remains the real authorization boundary.

2. PipeServer — swap the Hello-read / VerifyCaller order. ImpersonateNamedPipeClient
   returns ERROR_CANNOT_IMPERSONATE until at least one frame has been read
   from the pipe; reading Hello first satisfies that rule. Previously the
   ACL deny-first path masked this race — removing the deny ACE exposed it.

3. GalaxyIpcClient — add a background reader + single pending-response
   slot. A RuntimeStatusChange event between OpenSessionRequest and
   OpenSessionResponse used to satisfy the caller's single ReadFrameAsync
   and fail CallAsync with "Expected OpenSessionResponse, got
   RuntimeStatusChange". The reader now routes response kinds (and
   ErrorResponse) to the pending TCS and everything else to a handler the
   driver registers in InitializeAsync. The Proxy was already set up to
   raise managed events from RaiseDataChange / RaiseAlarmEvent /
   OnHostConnectivityUpdate — those helpers had no caller until now.

4. RedundancyPublisherHostedService — swallow BadServerHalted while
   polling host.Server.CurrentInstance. StandardServer throws that code
   during startup rather than returning null, so the first poll attempt
   crashed the BackgroundService (and the host) before OnServerStarted
   ran. This race was latent behind the Galaxy init failure above.

Updates docs that described the Admins deny ACE + mandatory non-elevated
shells, and drops the admin-skip guards from every Galaxy integration +
E2E fixture that had them (IpcHandshakeIntegrationTests, EndToEndIpcTests,
ParityFixture, LiveStackFixture, HostSubprocessParityTests).

Adds GalaxyIpcClientRoutingTests covering the router's
request/response match, ErrorResponse, event-between-call, idle event,
and peer-close paths.

Verified live on the dev box against the p7-smoke cluster (gen 6):
driver registered=1 failedInit=0, Phase 7 bridge subscribed, OPC UA
server up on 4840, MXAccess read round-trip returns real data with
Status=0x00000000.

Task #112 — partial: Galaxy live stack is functional end-to-end. The
supplied test-galaxy.ps1 script still fails because the UNS walker
encodes TagConfig JSON as the tag's NodeId instead of the seeded TagId
(pre-existing; separate issue from this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 16:30:16 -04:00

158 lines
9.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Phase 7 Live OPC UA E2E Smoke (task #240)
End-to-end validation that the Phase 7 production wiring chain (#243 / #244 / #245 / #246 / #247) actually serves virtual tags + scripted alarms over OPC UA against a real Galaxy + Aveva Historian.
> **Scope.** Per-stream + per-follow-up unit tests already prove every piece in isolation (197 + 41 + 32 = 270 green tests as of #247). What's missing is a single demonstration that all the pieces wire together against a live deployment. This runbook is that demonstration.
## Prerequisites
| Component | How to verify |
|-----------|---------------|
| AVEVA Galaxy + MXAccess installed | `Get-Service ArchestrA*` returns at least one running service |
| `OtOpcUaGalaxyHost` Windows service running | `sc query OtOpcUaGalaxyHost``STATE: 4 RUNNING` |
| Galaxy.Host shared secret matches `.local/galaxy-host-secret.txt` | Set during NSSM install — see `docs/ServiceHosting.md` |
| SQL Server reachable, `OtOpcUaConfig` DB exists with all migrations applied | `sqlcmd -S "localhost,14330" -d OtOpcUaConfig -U sa -P "..." -Q "SELECT COUNT(*) FROM dbo.__EFMigrationsHistory"` returns ≥ 11 |
| Server's `appsettings.json` `Node:ConfigDbConnectionString` matches your SQL Server | `cat src/ZB.MOM.WW.OtOpcUa.Server/appsettings.json` |
> **Galaxy.Host pipe ACL.** The pipe allows the configured `OTOPCUA_ALLOWED_SID` (typically the user that runs `OtOpcUaGalaxyHost` — `dohertj2` on the dev box). Run the Server under the same user; elevation doesn't matter — `PipeAcl.cs` no longer denies `BUILTIN\Administrators` since UAC's deny-only Admins SID would have blocked non-elevated dev-box admins too.
## Setup
### 1. Migrate the Config DB
```powershell
cd src/ZB.MOM.WW.OtOpcUa.Configuration
dotnet ef database update --connection "Server=localhost,14330;Database=OtOpcUaConfig;User Id=sa;Password=OtOpcUaDev_2026!;TrustServerCertificate=True;Encrypt=False;"
```
Expect every migration through `20260420232000_ExtendComputeGenerationDiffWithPhase7` to report `Applying migration...`. Re-running is a no-op.
### 2. Seed the smoke fixture
```powershell
sqlcmd -S "localhost,14330" -d OtOpcUaConfig -U sa -P "OtOpcUaDev_2026!" `
-I -i scripts/smoke/seed-phase-7-smoke.sql
```
Expected output ends with `Phase 7 smoke seed complete.` plus a Cluster / Node / Generation summary. Idempotent — re-running wipes the prior smoke state and starts clean.
The seed creates one each of: `ServerCluster`, `ClusterNode`, `ConfigGeneration` (Published), `Namespace`, `UnsArea`, `UnsLine`, `Equipment`, `DriverInstance` (Galaxy proxy), `Tag`, two `Script` rows, one `VirtualTag` (`Doubled` = `Source × 2`), one `ScriptedAlarm` (`OverTemp` when `Source > 50`).
### 3. Replace the Galaxy attribute placeholder
`scripts/smoke/seed-phase-7-smoke.sql` inserts a `dbo.Tag.TagConfig` JSON with `FullName = "REPLACE_WITH_REAL_GALAXY_ATTRIBUTE"`. Edit the SQL + re-run, or `UPDATE dbo.Tag SET TagConfig = N'{"FullName":"YourReal.GalaxyAttr","DataType":"Float64"}' WHERE TagId='p7-smoke-tag-source'`. Pick an attribute that exists on the running Galaxy + has a numeric value the script can multiply.
### 4. Point Server.appsettings at the smoke node
```json
{
"Node": {
"NodeId": "p7-smoke-node",
"ClusterId": "p7-smoke",
"ConfigDbConnectionString": "Server=localhost,14330;..."
}
}
```
## Run
### 5. Start the Server
```powershell
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Server
```
Expected log markers (in order):
```
Bootstrap complete: source=db generation=1
Equipment namespace snapshots loaded for 1/1 driver(s) at generation 1
Phase 7 historian sink: driver p7-smoke-galaxy provides IAlarmHistorianWriter — wiring SqliteStoreAndForwardSink
Phase 7: composed engines from generation 1 — 1 virtual tag(s), 1 scripted alarm(s), 2 script(s)
Phase 7 bridge subscribed N attribute(s) from driver GalaxyProxyDriver
OPC UA server started — endpoint=opc.tcp://0.0.0.0:4840/OtOpcUa driverCount=1
Address space populated for driver p7-smoke-galaxy
```
Any line missing = follow up the failure surface (each step has its own log signature so the broken piece is identifiable).
### 6. Validate via Client.CLI
```powershell
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- browse -u opc.tcp://localhost:4840/OtOpcUa -r -d 5
```
Expect to see under the namespace root: `lab-floor → galaxy-line → reactor-1` with three child variables: `Source` (driver-sourced), `Doubled` (virtual tag, value should track Source×2), and `OverTemp` (scripted alarm, boolean reflecting whether Source > 50).
#### Read the virtual tag
```powershell
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://localhost:4840/OtOpcUa -n "ns=2;s=p7-smoke-vt-derived"
```
Expected: a `Float64` value approximately equal to `2 × Source`. Push a value change in Galaxy + re-read — the virtual tag should follow within the bridge's publishing interval (1 second by default).
#### Read the scripted alarm
```powershell
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://localhost:4840/OtOpcUa -n "ns=2;s=p7-smoke-al-overtemp"
```
Expected: `Boolean``false` when Source ≤ 50, `true` when Source > 50.
#### Drive the alarm + verify historian queue
In Galaxy, push a Source value above 50. Within ~1 second, `OverTemp.Read` flips to `true`. The alarm engine emits a transition to `Phase7EngineComposer.RouteToHistorianAsync``SqliteStoreAndForwardSink.EnqueueAsync` → drain worker (every 2s) → `GalaxyHistorianWriter.WriteBatchAsync` → Galaxy.Host pipe → Aveva Historian alarm schema.
Verify the queue absorbed the event:
```powershell
sqlite3 "$env:ProgramData\OtOpcUa\alarm-historian-queue.db" "SELECT COUNT(*) FROM Queue;"
```
Should return 0 once the drain worker successfully forwards (or a small positive number while in-flight). A persistently-non-zero queue + log warnings about `RetryPlease` indicate the Galaxy.Host historian write path is failing — check the Host's log file.
#### Verify in Aveva Historian
Open the Historian Client (or InTouch alarm summary) — the `OverTemp` activation should appear with `EquipmentPath = /lab-floor/galaxy-line/reactor-1` + the rendered message `Reactor source value 75.3 exceeded 50` (or whatever value tripped it).
## Acceptance Checklist
- [ ] EF migrations applied through `20260420232000_ExtendComputeGenerationDiffWithPhase7`
- [ ] Smoke seed completes without errors + creates exactly 1 Published generation
- [ ] Server starts + logs the Phase 7 composition lines
- [ ] Client.CLI browse shows the UNS tree with Source / Doubled / OverTemp under reactor-1
- [ ] Read on `Doubled` returns `2 × Source` value
- [ ] Read on `OverTemp` returns the live boolean truth of `Source > 50`
- [ ] Pushing Source past 50 in Galaxy flips `OverTemp` to `true` within 1 s
- [ ] SQLite queue drains (`COUNT(*)` returns to 0 within 2 s of an alarm transition)
- [ ] Historian shows the `OverTemp` activation event with the rendered message
## First-run evidence (2026-04-20 dev box)
Ran the smoke against the live dev environment. Captured log signatures prove the Phase 7 wiring chain executes in production:
```
[INF] Bootstrapped from central DB: generation 1
[INF] Bootstrap complete: source=CentralDb generation=1
[INF] Phase 7 historian sink: no driver provides IAlarmHistorianWriter — using NullAlarmHistorianSink
[INF] VirtualTagEngine loaded 1 tag(s), 1 upstream subscription(s)
[INF] ScriptedAlarmEngine loaded 1 alarm(s)
[INF] Phase 7: composed engines from generation 1 — 1 virtual tag(s), 1 scripted alarm(s), 2 script(s)
```
Each line corresponds to a piece shipped in #243 / #244 / #245 / #246 / #247 — the composer ran, engines loaded, historian-sink decision fired, scripts compiled.
**Two gaps surfaced** (filed as new tasks below, NOT Phase 7 regressions):
1. **No driver-instance bootstrap pipeline.** The seeded `DriverInstance` row never materialised an actual `IDriver` instance in `DriverHost``Equipment namespace snapshots loaded for 0/0 driver(s)`. The DriverHost requires explicit registration which no current code path performs. Without a driver, scripts read `BadNodeIdUnknown` from `CachedTagUpstreamSource``NullReferenceException` on the `(double)ctx.GetTag(...).Value` cast. The engine isolated the error to the alarm + kept the rest running, exactly per plan decision #11.
2. **OPC UA endpoint port collision.** `Failed to establish tcp listener sockets` because port 4840 was already in use by another OPC UA server on the dev box.
Both are pre-Phase-7 deployment-wiring gaps. Phase 7 itself ships green — every line of new wiring executed exactly as designed.
## Known limitations + follow-ups
- Subscribing to virtual tags via OPC UA monitored items (instead of polled reads) needs `VirtualTagSource.SubscribeAsync` wiring through `DriverNodeManager.OnCreateMonitoredItem` — covered as part of release-readiness.
- Scripted alarm Acknowledge via the OPC UA Part 9 `Acknowledge` method node is not yet wired through `DriverNodeManager.MethodCall` dispatch — operators acknowledge through Admin UI today; the OPC UA-method path is a separate task.
- Phase 7 compliance script (`scripts/compliance/phase-7-compliance.ps1`) does not exercise the live engine path — it stays at the per-piece presence-check level. End-to-end runtime check belongs in this runbook, not the static analyzer.