Phase 3 PR 37 — End-to-end live-stack Galaxy smoke test #36

Merged
dohertj2 merged 1 commit from phase-3-pr37-live-stack-smoke into v2 2026-04-18 16:56:51 -04:00

Closes the code side of LMX follow-up #5. Once OtOpcUaGalaxyHost is installed + started on a box, this exercises the full topology end-to-end: GalaxyProxyDriver in-process → named-pipe IPC → running service → MxAccessGalaxyBackend → live MXAccess runtime → real deployed Galaxy objects.

Never spawns the Host

Connects to the already-running OtOpcUaGalaxyHost Windows service, per the memory captured after the user clarified the production topology. Spawning a second Host process would bypass the COM-apartment + service-account + pipe-ACL setup and fail differently than production.

Shared-secret resolution

LiveStackConfig.Resolve() checks, in order:

  1. OTOPCUA_GALAXY_PIPE + OTOPCUA_GALAXY_SECRET env vars (CI / benchwork overrides).
  2. The service's registry Environment values (HKLM\SYSTEM\CurrentControlSet\Services\OtOpcUaGalaxyHost\Environment) — what Install-Services.ps1 writes at install time. Requires elevated test host on most boxes; the skip message says so explicitly.

No hard-coded secrets — installer generates 32 fresh random bytes per install; a committed secret would diverge from production the moment the service is re-installed.
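
A minimal sketch of that two-step lookup, under stated assumptions: `LiveStackConfigSketch` and its delegate parameters are hypothetical stand-ins, not the real `LiveStackConfig` API; both values are assumed to be required together; and the registry entries are assumed to be `NAME=value` strings that use the same variable names. Passing the lookups in as delegates keeps the ordering logic testable without touching a real registry.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the real LiveStackConfig; names and shape are assumptions.
public sealed record LiveStackConfigSketch(string PipeName, string SharedSecret, string Source)
{
    // getEnv: returns a process environment variable, or null when unset.
    // readServiceEnvironment: returns the service's "NAME=value" registry entries,
    // or null when the value is missing or unreadable (e.g. test host not elevated).
    public static LiveStackConfigSketch? Resolve(
        Func<string, string?> getEnv,
        Func<IReadOnlyList<string>?> readServiceEnvironment)
    {
        // 1. Environment-variable overrides win when both are present.
        var pipe = getEnv("OTOPCUA_GALAXY_PIPE");
        var secret = getEnv("OTOPCUA_GALAXY_SECRET");
        if (!string.IsNullOrEmpty(pipe) && !string.IsNullOrEmpty(secret))
            return new LiveStackConfigSketch(pipe, secret, "env");

        // 2. Fall back to the service's registry Environment values.
        var entries = readServiceEnvironment();
        if (entries is null) return null;

        var map = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var entry in entries)
        {
            // Split on the first '=' only, so values may themselves contain '='.
            var eq = entry.IndexOf('=');
            if (eq > 0) map[entry.Substring(0, eq)] = entry.Substring(eq + 1);
        }

        return map.TryGetValue("OTOPCUA_GALAXY_PIPE", out var p) && p.Length > 0
            && map.TryGetValue("OTOPCUA_GALAXY_SECRET", out var s) && s.Length > 0
            ? new LiveStackConfigSketch(p, s, "registry")
            : null; // caller turns null into the skip message
    }
}
```

On Windows, the second delegate could wrap `Registry.LocalMachine.OpenSubKey(@"SYSTEM\CurrentControlSet\Services\OtOpcUaGalaxyHost")?.GetValue("Environment") as string[]`, with a catch for `SecurityException` that returns null. A null result is what would drive the 'set env var OR run elevated' skip.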

Fixture flow

LiveStackFixture (xUnit IAsyncLifetime, consumed by LiveStackCollection):

  1. AvevaPrerequisites.CheckAllAsync (from PR 36). Captures a PrerequisiteReport; if not IsLivetestReady, SkipReason is the full operator-facing list.
  2. LiveStackConfig.Resolve. Skip with a clear 'set env var OR run elevated' message when absent.
  3. GalaxyProxyDriver.InitializeAsync. On exception, skip with the error detail plus common-cause hints (secret mismatch, SID not in pipe ACL, Host's ZB connection broken).

SkipIfUnavailable translates the captured reason into Assert.Skip at the top of every fact — no NullRef cascades.
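
The capture-then-skip pattern above can be sketched independently of xUnit. This is an illustration only: `SkipCapturingFixtureSketch` and the step-delegate shape are hypothetical, and the real `LiveStackFixture` may be structured differently. Each step returns null on success or a skip reason, and the first non-null reason stops initialization.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch of a fixture that records why it could not initialize
// instead of throwing; not the real LiveStackFixture.
public sealed class SkipCapturingFixtureSketch
{
    // Null means every step succeeded and tests may run.
    public string? SkipReason { get; private set; }

    public async Task InitializeAsync(IEnumerable<Func<Task<string?>>> steps)
    {
        foreach (var step in steps)
        {
            try
            {
                // A step returns null on success, or an operator-facing reason.
                SkipReason = await step();
            }
            catch (Exception ex)
            {
                // Convert a failure into a readable skip so later tests
                // do not dereference a half-initialized driver.
                SkipReason = $"Initialization failed: {ex.Message}. Common causes: "
                    + "secret mismatch, SID not in pipe ACL, Host's ZB connection broken.";
            }

            if (SkipReason is not null) return; // first failing step wins
        }
    }
}
```

In a fact, the guard would then reduce to something like `if (fixture.SkipReason is not null) Assert.Skip(fixture.SkipReason);`, which is the role the description gives to `SkipIfUnavailable`.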

Tests (Category=LiveGalaxy)

  • Fixture_initialized_successfully — cheapest end-to-end check; IPC handshake worked.
  • Driver_reports_Healthy_after_IPC_handshake — DriverHealth.State post-connect.
  • DiscoverAsync_returns_at_least_one_variable_from_live_galaxy — flattens every Variable() call via CapturingAddressSpaceBuilder, asserts > 0.
  • GetHostStatuses_reports_at_least_one_platform — IHostConnectivityProbe surface.
  • Can_read_a_discovered_variable_from_live_galaxy — reads the first discovered attribute's full reference, asserts status != BadInternalError (Galaxy's Uncertain-until-first-Engine-scan is intentionally not treated as failure).
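
To illustrate why the read test compares against `BadInternalError` rather than requiring Good quality, here is a small sketch over raw OPC UA status codes. The helper names are hypothetical, and masking off the low 16 info bits before comparing is an assumption about how such a check might be written, not a description of the actual test. The numeric values are the standard OPC UA ones: severity lives in the top two bits, and `BadInternalError` is `0x80020000`.

```csharp
// Hypothetical helpers over raw OPC UA status codes; not part of the PR.
public static class StatusCodeSketch
{
    // Standard OPC UA value for BadInternalError.
    public const uint BadInternalError = 0x80020000;

    // Severity is encoded in the top two bits: 00 Good, 01 Uncertain, 10 Bad.
    public static string Severity(uint statusCode) => (statusCode >> 30) switch
    {
        0u => "Good",
        1u => "Uncertain",
        _ => "Bad",
    };

    // The smoke read passes for anything except BadInternalError, so an
    // Uncertain value before the first Engine scan is not a failure.
    // The low 16 bits carry info flags and are ignored in the comparison.
    public static bool PassesSmokeRead(uint statusCode) =>
        (statusCode & 0xFFFF0000) != BadInternalError;
}
```

With this shape, `PassesSmokeRead(0x40000000)` (plain Uncertain) is true while `PassesSmokeRead(0x80020000)` is false, which matches the intent stated for the fifth fact.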

Read-only by design

Writes need an agreed scratch tag or test-only UDA to avoid mutating a process-critical attribute. Deferred to a follow-up PR that reuses this fixture.

Test run on THIS box (OtOpcUaGalaxyHost not installed)

Skipped! - Failed: 0, Passed: 0, Skipped: 5 — Driver.Galaxy.Proxy.Tests.dll

Each skip message surfaces • [OtOpcUaService] service:OtOpcUaGalaxyHost — Not installed (...). Once the service is installed + started (scripts\install\Install-Services.ps1), the 5 facts execute against live Galaxy.

  • Proxy.Tests Unit: 17 / 0 unchanged.
  • Full solution build clean.

LMX #5

Updated to 'IN PROGRESS' across PRs 36 + 37. Remaining: install + start services, subscribe-and-receive-data-change test, write round-trip test.

dohertj2 added 1 commit 2026-04-18 16:56:49 -04:00
LiveStackConfig resolves the pipe name + per-install shared secret from two sources in order: OTOPCUA_GALAXY_PIPE + OTOPCUA_GALAXY_SECRET env vars first (for CI / benchwork overrides), then the service's per-process Environment registry values under HKLM\SYSTEM\CurrentControlSet\Services\OtOpcUaGalaxyHost (what Install-Services.ps1 writes at install time). The registry read requires the test host to run elevated on most boxes — the skip message says so explicitly so operators see the right remediation. Hard-coded secrets are deliberately avoided: the installer generates 32 fresh random bytes per install, so a committed secret would diverge from production the moment the service is re-installed.
LiveStackFixture is an IAsyncLifetime that (1) runs AvevaPrerequisites.CheckAllAsync with CheckGalaxyHostPipe=true + CheckHistorian=false — produces a structured PrerequisiteReport whose SkipReason is the exact operator-facing 'here's what you need to fix' text, (2) resolves LiveStackConfig and surfaces a clear skip when the secret isn't discoverable, (3) instantiates GalaxyProxyDriver + calls InitializeAsync (the IPC handshake), capturing a skip with the exception detail + common-cause hints (secret mismatch, SID not in pipe ACL, Host's backend couldn't connect to ZB) rather than letting a NullRef cascade through every subsequent test. SkipIfUnavailable() translates the captured SkipReason into Assert.Skip at the top of every fact so tests read as cleanly-skipped with a visible reason, not silently-passed or crashed.
LiveStackSmokeTests (5 facts, Collection=LiveStack, Category=LiveGalaxy): Fixture_initialized_successfully (cheapest possible end-to-end assertion — if this passes, the IPC handshake worked); Driver_reports_Healthy_after_IPC_handshake (DriverHealth.State post-connect); DiscoverAsync_returns_at_least_one_variable_from_live_galaxy (captures every Variable() call from DiscoverAsync via CapturingAddressSpaceBuilder and asserts > 0 — zero here usually means the Host couldn't read ZB, the skip message names OTOPCUA_GALAXY_ZB_CONN to check); GetHostStatuses_reports_at_least_one_platform (IHostConnectivityProbe surface — zero means the probe loop hasn't fired or no Platform is deployed locally); Can_read_a_discovered_variable_from_live_galaxy (reads the first discovered attribute's full reference, asserts status != BadInternalError — Galaxy's Uncertain-quality-until-first-Engine-scan is intentionally NOT treated as failure since it depends on runtime state that varies across test runs). Read-only by design; writes need an agreed scratch tag to avoid mutating a process-critical attribute — deferred to a follow-up PR that reuses this fixture.
CapturingAddressSpaceBuilder is a minimal IAddressSpaceBuilder that flattens every Variable() call into a list so tests can inspect what discovery produced without booting the full OPC UA node-manager stack; alarm annotation + property calls are no-ops. Scoped private to the test class.
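
A rough sketch of the capturing idea, assuming a much-simplified builder interface. The real `IAddressSpaceBuilder` members are not shown in this PR, so `IAddressSpaceBuilderSketch`, its method signatures, and `CapturingBuilderSketch` are all hypothetical; only the behaviour (record every `Variable()` call, ignore the rest) follows the description above.

```csharp
using System.Collections.Generic;

// Hypothetical, simplified interface; the real IAddressSpaceBuilder may differ.
public interface IAddressSpaceBuilderSketch
{
    IAddressSpaceBuilderSketch Folder(string browseName);
    void Variable(string browseName, string fullReference);
    void Property(string name, object? value);
}

// Records every Variable() call into one flat list, regardless of folder depth.
public sealed class CapturingBuilderSketch : IAddressSpaceBuilderSketch
{
    public List<(string BrowseName, string FullReference)> Variables { get; } = new();

    // Returning 'this' flattens the hierarchy: nested folders share one list.
    public IAddressSpaceBuilderSketch Folder(string browseName) => this;

    public void Variable(string browseName, string fullReference) =>
        Variables.Add((browseName, fullReference));

    // Properties are irrelevant to the smoke assertions, so this is a no-op.
    public void Property(string name, object? value) { }
}
```

A discovery test could then hand the builder to the driver and assert `Variables.Count > 0`, without booting an OPC UA node manager.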
Galaxy.Proxy.Tests csproj gains a ProjectReference to Driver.Galaxy.TestSupport (PR 36) for AvevaPrerequisites. The NU1702 warning about the Host project being net48-referenced-by-net10 is pre-existing from the HostSubprocessParityTests — Proxy.Tests only needs the Host EXE path for that parity scenario, not type surface.
Test run on THIS machine (OtOpcUaGalaxyHost not yet installed): Skipped! Failed 0, Passed 0, Skipped 5 — each skip message includes the full prerequisites report pointing at the missing service. Once the service is installed + started (scripts\install\Install-Services.ps1), the 5 facts will execute against live Galaxy. Proxy.Tests Unit: 17 pass / 0 fail (unchanged — new tests are Category=LiveGalaxy, separate suite). Full Proxy build clean. Memory already captures the 'live tests run via already-running service, don't spawn' convention (project_galaxy_host_service.md).
lmx-followups.md #5 updated: status is 'IN PROGRESS' across PRs 36 + 37 with the explicit remaining work (install + start services, subscribe-and-receive, write round-trip).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2 merged commit 19bcf20fbe into v2 2026-04-18 16:56:51 -04:00
dohertj2 referenced this issue from a commit 2026-04-19 03:16:40 -04:00
Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments.
After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist).
docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36.
phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry.
phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope).
phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139.
Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted).
Review surfaced real substantive issues that the initial drafts glossed over:
Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule; Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation; Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint;
Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants; Phase 6.2 missing Browse enforcement entirely; Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish;
Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics; Phase 6.3 ServerUriArray excluding self (spec-bug); Phase 6.3 apply-window counter race on cancellation; Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay;
Phase 6.4 stale UNS impact preview overwriting concurrent draft edits; Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2); Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts); Phase 6.4 OPC 40010 field list not matching decision #139.
Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility.
Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes.
Reference: dohertj2/lmxopcua#36