Phase 2 PR 4 — close 4 open MXAccess findings (push frames + reconnect + write-await + read-cancel) #3

Merged
dohertj2 merged 14 commits from phase-2-pr4-findings into v2 2026-04-18 06:57:22 -04:00

14 Commits

Author SHA1 Message Date
Joseph Doherty
caa9cb86f6 Phase 2 PR 4: close the 4 open high/medium MXAccess findings from exit-gate-phase-2-final.md.
High 1 (ReadAsync subscription leak on cancel): the one-shot read now wraps subscribe→first-OnDataChange→unsubscribe in try/finally so the per-tag callback is always detached, and if the read installed the underlying MXAccess subscription itself (the prior _addressToHandle key was absent) it tears it down on the way out, so no probe item handles leak when the caller cancels or times out.
High 2 (no reconnect loop): MxAccessClient gets an MxAccessClientOptions {AutoReconnect, MonitorInterval=5s, StaleThreshold=60s} plus a background MonitorLoopAsync started at first ConnectAsync. The loop wakes every MonitorInterval, checks _lastObservedActivityUtc (bumped by every OnDataChange callback), and if stale probes the proxy with a no-op COM AddItem("$Heartbeat") on the StaPump; if the probe throws or returns false, the loop reconnects with replay: Unregister (best-effort), Register, snapshot _addressToHandle.Keys and clear, re-AddItem every previously-active subscription, fire ConnectionStateChanged events for the false→true transition, and bump ReconnectCount.
Medium 3 (subscriptions don't push frames back to Proxy): IGalaxyBackend gains OnDataChange/OnAlarmEvent/OnHostStatusChanged events; new IFrameHandler.AttachConnection(FrameWriter) is called per-connection by PipeServer after Hello, and the returned IDisposable is disposed at connection close; GalaxyFrameHandler.ConnectionSink subscribes the events for the connection lifetime, fire-and-forget pushes them as MessageKind.OnDataChangeNotification / AlarmEvent / RuntimeStatusChange frames through the writer, swallows ObjectDisposedException for the dispose race, and unsubscribes in Dispose to prevent leaked invocation list refs across reconnects. MxAccessGalaxyBackend's existing SubscribeAsync (which previously discarded values via a (_, __) => {} callback) now wires OnTagValueChanged, which fans out per-tag value changes to every subscription ID listening (one MXAccess subscription, multi-fan-out, via the _refToSubs reverse map). UnsubscribeAsync also reverse-walks the map so that mx.UnsubscribeAsync is called only when the LAST sub for a tag drops. Stub + DbBacked backends declare the events with #pragma warning disable CS0067 because they never raise them but must satisfy the interface (treat-warnings-as-errors would otherwise fail).
Medium 4 (WriteValuesAsync doesn't await OnWriteComplete): MxAccessClient.WriteAsync rewritten to return Task<bool> via the v1-style TaskCompletionSource-keyed-by-item-handle pattern in _pendingWrites: it adds the TCS before the Write call, awaits it with a configurable timeout (default 5s), removes the TCS in finally, and returns true only when OnWriteComplete reported success. MxAccessGalaxyBackend.WriteValuesAsync now reports per-tag Bad_InternalError ("MXAccess runtime reported write failure") when the bool returns false, instead of a false-positive Good.
PipeServer's IFrameHandler interface adds the AttachConnection(FrameWriter):IDisposable method + a public NoopAttachment nested class (net48 doesn't support default interface methods, so the empty-attach is exposed for stub implementations). StubFrameHandler returns IFrameHandler.NoopAttachment.Instance. RunOneConnectionAsync calls AttachConnection after HelloAck and wraps the returned disposable in a using so it is disposed in the connection scope's finally. ConnectionStateChanged event added on MxAccessClient (caller-facing diagnostics for false→true reconnect transitions).
docs/v2/implementation/pr-4-body.md is the Gitea web-UI paste-in for opening PR 4 once pushed; it includes 2 new low-priority adversarial findings (probe item-handle leak; replay loop silently swallows per-subscription failures) flagged as follow-ups, not PR 4 blockers.
Full solution 460 pass / 7 skip (E2E on admin shell) / 1 pre-existing Phase 0 baseline. No regressions vs PR 2's baseline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 01:12:09 -04:00
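The write-await pattern from Medium 4 (register a completion source keyed by item handle before issuing the write, await it with a timeout, remove it in finally) can be sketched language-neutrally in Python. This is a minimal sketch, not the project's C# code: `PendingWrites`, `start_write` and `on_write_complete` are hypothetical names, and treating a timeout as `False` is an assumption, since the commit message does not say how a timeout is surfaced.

```python
import asyncio


class PendingWrites:
    """Hypothetical stand-in for the _pendingWrites map described above."""

    def __init__(self, timeout_s=5.0):
        self._pending = {}          # item handle -> future awaiting completion
        self._timeout_s = timeout_s

    def on_write_complete(self, item_handle, success):
        # Completion callback: resolve the waiter, ignore unknown handles.
        fut = self._pending.get(item_handle)
        if fut is not None and not fut.done():
            fut.set_result(success)

    async def write(self, item_handle, start_write):
        fut = asyncio.get_running_loop().create_future()
        self._pending[item_handle] = fut   # register BEFORE issuing the write
        try:
            start_write(item_handle)       # the completion may arrive synchronously
            return await asyncio.wait_for(fut, self._timeout_s)
        except asyncio.TimeoutError:
            return False                   # assumption: timeout counts as failure
        finally:
            self._pending.pop(item_handle, None)   # never leave a stale entry
```

Registering the future before the write matters because a completion that fires during the write call would otherwise find no waiter and be dropped.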
Joseph Doherty
a3d16a28f1 Phase 2 Stream D Option B: archive v1 surface + new Driver.Galaxy.E2E parity suite.
Non-destructive intermediate state: the v1 OtOpcUa.Host + Historian.Aveva + Tests + IntegrationTests projects all still build (494 v1 unit + 6 v1 integration tests still pass when run explicitly), but solution-level dotnet test ZB.MOM.WW.OtOpcUa.slnx now skips them via IsTestProject=false on the test projects; the src projects carry archive-status PropertyGroup comments. The destructive deletion is reserved for Phase 2 PR 3 with explicit operator review per CLAUDE.md "only use destructive operations when truly the best approach".
tests/ZB.MOM.WW.OtOpcUa.Tests/ renamed via git mv to tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive/; csproj <AssemblyName> kept as the original ZB.MOM.WW.OtOpcUa.Tests so v1 OtOpcUa.Host's [InternalsVisibleTo("ZB.MOM.WW.OtOpcUa.Tests")] still matches and the project rebuilds clean. tests/ZB.MOM.WW.OtOpcUa.IntegrationTests gets <IsTestProject>false</IsTestProject>. src/ZB.MOM.WW.OtOpcUa.Host + src/ZB.MOM.WW.OtOpcUa.Historian.Aveva get PropertyGroup archive-status comments documenting that they're functionally superseded but kept in-build because cascading dependencies (Historian.Aveva → Host; IntegrationTests → Host) give a single-PR deletion a high blast radius.
New tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/ project (.NET 10) with ParityFixture that spawns OtOpcUa.Driver.Galaxy.Host.exe (net48 x86) as a Process.Start subprocess with OTOPCUA_GALAXY_BACKEND=db env vars, waits 2s for the PipeServer to bind, then exposes a connected GalaxyProxyDriver; skips on non-Windows / Administrator shells (PipeAcl denies admins per decision #76) / ZB unreachable / Host EXE not built, and each skip carries a SkipReason string the test method reads via Assert.Skip(SkipReason). RecordingAddressSpaceBuilder captures every Folder/Variable/AddProperty registration so parity tests can assert on the same shape v1 LmxNodeManager produced.
HierarchyParityTests (3) — Discover returns gobjects with attributes; attribute full references match the tag.attribute Galaxy reference grammar; HistoryExtension flag flows through correctly. StabilityFindingsRegressionTests (4) — one test per 2026-04-13 stability finding from commits c76ab8f and 7310925: phantom probe subscription doesn't corrupt unrelated host status; HostStatusChangedEventArgs structurally carries a specific HostName + OldState + NewState (event signature mathematically prevents the v1 cross-host quality-clear bug); all GalaxyProxyDriver capability methods return Task or Task<T> (sync-over-async would deadlock OPC UA stack thread); AcknowledgeAsync completes before returning (no fire-and-forget background work that could race shutdown). Solution test count: 470 pass / 7 skip (E2E on admin shell) / 1 pre-existing Phase 0 baseline. Run archived suites explicitly: dotnet test tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive (494 pass) + dotnet test tests/ZB.MOM.WW.OtOpcUa.IntegrationTests (6 pass). docs/v2/V1_ARCHIVE_STATUS.md inventories every archived surface with run-it-explicitly instructions + a 10-step deletion plan for PR 3 + rollback procedure (git revert restores all four projects). docs/v2/implementation/exit-gate-phase-2-final.md supersedes the two partial-exit docs with the per-stream status table (A/B/C/D/E all addressed, D split across PR 2/3 per safety protocol), the test count breakdown, fresh adversarial review of PR 2 deltas (4 new findings: medium IsTestProject=false safety net loss, medium structural-vs-behavioral stability tests, low backend=db default, low Process.Start env inheritance), the 8 carried-forward findings from exit-gate-phase-2.md, the recommended PR order (1 → 2 → 3 → 4). docs/v2/implementation/pr-2-body.md is the Gitea web-UI paste-in for opening PR 2 once pushed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:56:21 -04:00
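The ParityFixture skip conditions listed in the commit above (non-Windows, Administrator shell, ZB unreachable, Host EXE not built) amount to a "first applicable reason wins" check. A minimal Python sketch follows; the `Environment` type, the order of the checks and the reason strings are all assumptions made for illustration, not the fixture's actual C# code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    """Hypothetical snapshot of the preconditions the fixture probes."""
    is_windows: bool
    is_administrator: bool
    zb_reachable: bool
    host_exe_built: bool


def skip_reason(env: Environment) -> Optional[str]:
    """Return a human-readable skip reason, or None when the suite can run."""
    if not env.is_windows:
        return "requires Windows"
    if env.is_administrator:
        return "pipe ACL denies Administrators"
    if not env.zb_reachable:
        return "ZB database unreachable"
    if not env.host_exe_built:
        return "Host EXE not built"
    return None
```

A test would read the result once and skip with that string, so the report shows why a run was skipped instead of silently passing.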
Joseph Doherty
50f81a156d Doc — PR 1 body for Gitea web UI paste-in. PR title + summary + test matrix + reviewer test plan + follow-up tracking. Source phase-1-configuration → target v2; URL https://gitea.dohertylan.com/dohertj2/lmxopcua/pulls/new/phase-1-configuration. No gh/tea CLI on this box, so the body is staged here for the operator to paste into the Gitea web UI rather than auto-created via API.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:46:23 -04:00
Joseph Doherty
7403b92b72 Phase 2 Stream D progress — non-destructive deliverables: appsettings → DriverConfig migration script, two-service Windows installer scripts, process-spawn cross-FX parity test, Stream D removal procedure doc with both Option A (rewrite 494 v1 tests) and Option B (archive + new v2 E2E suite) spelled out step-by-step. Cannot one-shot the actual legacy-Host deletion in any unattended session — explained in the procedure doc; the parity-defect debug cycle is intrinsically interactive (each iteration requires inspecting a v1↔v2 diff and deciding if it's a legitimate v2 improvement or a regression, then either widening the assertion or fixing the v2 code), and git rm -r src/ZB.MOM.WW.OtOpcUa.Host is destructive enough to need explicit operator authorization on a real PR review. scripts/migration/Migrate-AppSettings-To-DriverConfig.ps1 takes a v1 appsettings.json and emits the v2 DriverInstance.DriverConfig JSON blob (MxAccess/Database/Historian sections) ready to upsert into the central Configuration DB; null-leaf stripping; -DryRun mode; smoke-tested against the dev appsettings.json and produces the expected three-section ordered-dictionary output. scripts/install/Install-Services.ps1 registers the two v2 services with sc.exe — OtOpcUaGalaxyHost first (net48 x86 EXE with OTOPCUA_GALAXY_PIPE/OTOPCUA_ALLOWED_SID/OTOPCUA_GALAXY_SECRET/OTOPCUA_GALAXY_BACKEND/OTOPCUA_GALAXY_ZB_CONN/OTOPCUA_GALAXY_CLIENT_NAME env vars set via HKLM:\SYSTEM\CurrentControlSet\Services\OtOpcUaGalaxyHost\Environment registry), then OtOpcUa with depend=OtOpcUaGalaxyHost; resolves down-level account names to SID for the IPC ACL; generates a fresh 32-byte base64 shared secret per install if not supplied (kept out of registry — operators record offline for service rebinding scenarios); echoes start commands. scripts/install/Uninstall-Services.ps1 stops + removes both services. 
tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy.Tests/HostSubprocessParityTests.cs is the production-shape parity test — Proxy (.NET 10) spawns the actual OtOpcUa.Driver.Galaxy.Host.exe (net48 x86) as a subprocess via Process.Start with backend=db env vars, connects via real named pipe, calls Discover, asserts at least one Galaxy gobject comes back. Skipped when running as Administrator (PipeAcl denies admins, same guard as other IPC integration tests), when the Host EXE hasn't been built, or when the ZB SQL endpoint is unreachable. This is the cross-FX integration that the parity suite genuinely needs — the previous IPC tests all ran in-process; this one validates the production deployment topology where Proxy and Host are separate processes communicating only over the named pipe. docs/v2/implementation/stream-d-removal-procedure.md is the next-session playbook: Option A (rewrite 494 v1 tests via a ProxyMxAccessClientAdapter that implements v1's IMxAccessClient by forwarding to GalaxyProxyDriver — Vtq↔DataValueSnapshot, Quality↔StatusCode, OnTagValueChanged↔OnDataChange mapping; 3-5 days, full coverage), Option B (rename OtOpcUa.Tests → OtOpcUa.Tests.v1Archive with [Trait("Category", "v1Archive")] for opt-in CI runs; new OtOpcUa.Driver.Galaxy.E2E test project with 10-20 representative tests via the HostSubprocessParityTests pattern; 1-2 days, accreted coverage); deletion checklist with eight pre-conditions, ten ordered steps, and a rollback path (git revert restores the legacy Host alongside the v2 stack — both topologies remain installable until the downstream consumer cutover). Full solution 964 pass / 1 pre-existing Phase 0 baseline; the 494 v1 IntegrationTests + 6 v1 IntegrationTests-net48 still pass because legacy OtOpcUa.Host stays untouched until an interactive session executes the procedure doc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:38:44 -04:00
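The null-leaf stripping and three-section ordered output described for Migrate-AppSettings-To-DriverConfig.ps1 can be sketched in Python as below. This is only an illustration of the idea: the real script is PowerShell, the setting names in the usage example are hypothetical, and the choices to keep a section that becomes empty and to leave lists untouched are assumptions.

```python
import json

SECTIONS = ("MxAccess", "Database", "Historian")   # output order from the commit message


def strip_null_leaves(node):
    """Recursively drop dictionary entries whose value is None."""
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            value = strip_null_leaves(value)
            if value is not None:
                cleaned[key] = value
        return cleaned
    return node   # scalars and lists pass through unchanged (assumption)


def to_driver_config(v1_settings):
    """Project v1 settings onto the three sections, in a fixed order."""
    config = {}
    for name in SECTIONS:
        if name in v1_settings:
            config[name] = strip_null_leaves(v1_settings[name])
    return config


if __name__ == "__main__":
    # Hypothetical v1 settings; only the three section names come from the source.
    v1 = {
        "Historian": {"Server": None, "Port": 32568},
        "MxAccess": {"ClientName": "demo", "NodeName": None},
        "Database": {"ConnectionString": "Server=localhost"},
    }
    print(json.dumps(to_driver_config(v1), indent=2))
```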
Joseph Doherty
a7126ba953 Phase 2 — port MXAccess COM client to Galaxy.Host + MxAccessGalaxyBackend (3rd IGalaxyBackend) + live MXAccess smoke + Phase 2 exit-gate doc + adversarial review. The full Galaxy data-plane now flows through the v2 IPC topology end-to-end against live ArchestrA.MxAccess.dll, on this dev box, with 30/30 Host tests + 9/9 Proxy tests + 963/963 solution tests passing alongside the unchanged 494 v1 IntegrationTests baseline. Backend/MxAccess/Vtq is a focused port of v1's Vtq value-timestamp-quality DTO. Backend/MxAccess/IMxProxy abstracts LMXProxyServer (port of v1's IMxProxy with the same Register/Unregister/AddItem/RemoveItem/AdviseSupervisory/UnAdviseSupervisory/Write surface + OnDataChange + OnWriteComplete events); MxProxyAdapter is the concrete COM-backed implementation that does Marshal.ReleaseComObject-loop on Unregister, must be constructed on an STA thread. Backend/MxAccess/MxAccessClient is the focused port of v1's MxAccessClient partials — Connect/Disconnect/Read/Write/Subscribe/Unsubscribe through the new Sta/StaPump (the real Win32 GetMessage pump from the previous commit), ConcurrentDictionary handle tracking, OnDataChange event marshalling to per-tag callbacks, ReadAsync implemented as the canonical subscribe → first-OnDataChange → unsubscribe one-shot pattern. Galaxy.Host csproj flipped to x86 PlatformTarget + Prefer32Bit=true with the ArchestrA.MxAccess HintPath ..\..\lib\ArchestrA.MxAccess.dll reference (lib/ already contains the production DLL). Backend/MxAccessGalaxyBackend is the third IGalaxyBackend implementation (alongside StubGalaxyBackend and DbBackedGalaxyBackend): combines GalaxyRepository (Discover) with MxAccessClient (Read/Write/Subscribe), MessagePack-deserializes inbound write values, MessagePack-serializes outbound read values into ValueBytes, decodes ArrayDimension/SecurityClassification/category_id with the same v1 mapping. 
Program.cs selects between stub|db|mxaccess via OTOPCUA_GALAXY_BACKEND env var (default = mxaccess); OTOPCUA_GALAXY_ZB_CONN overrides the ZB connection string; OTOPCUA_GALAXY_CLIENT_NAME sets the Wonderware client identity; the StaPump and MxAccessClient lifecycles are tied to the server.RunAsync try/finally so a clean Ctrl+C tears down the COM proxy via Marshal.ReleaseComObject before the pump's WM_QUIT. Live MXAccess smoke tests (MxAccessLiveSmokeTests, net48 x86) — skipped when ZB unreachable or aaBootstrap not running, otherwise verify (1) MxAccessClient.ConnectAsync returns a positive LMXProxyServer handle on the StaPump, (2) MxAccessGalaxyBackend.OpenSession + Discover returns at least one gobject with attributes, (3) MxAccessGalaxyBackend.ReadValues against the first discovered attribute returns a response with the correct TagReference shape (value + quality vary by what's running, so we don't assert specific values). All 3 pass on this dev box. EndToEndIpcTests + IpcHandshakeIntegrationTests moved from Galaxy.Proxy.Tests (net10) to Galaxy.Host.Tests (net48 x86) — the previous test placement silently dropped them at xUnit discovery because Host became net48 x86 and net10 process can't load it. Rewritten to use Shared's FrameReader/FrameWriter directly instead of going through Proxy's GalaxyIpcClient (functionally equivalent — same wire protocol, framing primitives + dispatcher are the production code path verbatim). 7 IPC tests now run cleanly: Hello+heartbeat round-trip, wrong-secret rejection, OpenSession session-id assignment, Discover error-response surfacing, WriteValues per-tag bad status, Subscribe id assignment, Recycle grace window. 
Phase 2 exit-gate doc (docs/v2/implementation/exit-gate-phase-2.md) supersedes the partial-exit doc with the as-built state — Streams A/B/C complete; D/E gated only on the legacy-Host removal + parity-test rewrite cycle that fundamentally requires multi-day debug iteration; full adversarial-review section ranking 8 findings (2 high, 3 medium, 3 low) all explicitly deferred to Stream D/E or v2.1 with rationale; Stream-D removal checklist gives the next-session entry point with two policy options for the 494 v1 tests (rewrite-to-use-Proxy vs archive-and-write-smaller-v2-parity-suite). Cannot one-shot Stream D.1 in any single session because deleting OtOpcUa.Host requires the v1 IntegrationTests cycle to be retargeted first; that's the structural blocker, not "needs more code" — and the plan itself budgets 3-4 weeks for it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:23:24 -04:00
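The one-shot read described above (subscribe, wait for the first OnDataChange, unsubscribe) can be illustrated with a small asyncio sketch. `FakeClient` and `read_once` are hypothetical names, not the MxAccessClient API; the try/finally reflects the hardening that the PR 4 commit describes for this pattern.

```python
import asyncio


class FakeClient:
    """Hypothetical in-memory stand-in for a subscription-based client."""

    def __init__(self):
        self.callbacks = {}

    def subscribe(self, tag, callback):
        self.callbacks[tag] = callback

    def unsubscribe(self, tag):
        self.callbacks.pop(tag, None)

    def publish(self, tag, value):
        callback = self.callbacks.get(tag)
        if callback is not None:
            callback(value)


async def read_once(client, tag, timeout_s=1.0):
    """Subscribe, return the first value delivered, and always unsubscribe."""
    first_value = asyncio.get_running_loop().create_future()

    def on_change(value):
        if not first_value.done():     # later updates are ignored
            first_value.set_result(value)

    client.subscribe(tag, on_change)
    try:
        return await asyncio.wait_for(first_value, timeout_s)
    finally:
        client.unsubscribe(tag)        # runs on success, timeout, and cancellation
```

Without the finally block, a caller that cancels or times out would leave the callback attached, which is the leak the High 1 finding describes.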
Joseph Doherty
549cd36662 Phase 2 — port GalaxyRepository to Galaxy.Host + DbBackedGalaxyBackend, smoke-tested against live ZB. Real Galaxy gobject hierarchy + dynamic attributes now flow through the IPC contract end-to-end without any MXAccess code involvement, so the OPC UA address-space build (Stream C.4 acceptance) becomes parity-testable today even before the COM client port lands. Backend/Galaxy/GalaxyRepository.cs is a byte-for-byte port of v1 GalaxyRepositoryService's HierarchySql + AttributesSql (the two SQL bodies, both ~50 lines of recursive CTE template-chain + deployed_package_chain logic, are identical to v1 so the row set is verifiably the same — extended-attributes + scope-filter queries from v1 are intentionally not ported yet, they're refinements not on the Phase 2 critical path); plus TestConnectionAsync (SELECT 1) and GetLastDeployTimeAsync (SELECT time_of_last_deploy FROM galaxy) for the ChangeDetection deploy-watermark path. Backend/Galaxy/GalaxyRepositoryOptions defaults to localhost ZB Integrated Security; runtime override comes from DriverConfig.Database section per plan.md §"Galaxy DriverConfig". Backend/Galaxy/GalaxyHierarchyRow + GalaxyAttributeRow are the row-shape DTOs (no required modifier — net48 lacks RequiredMemberAttribute and we'd need a polyfill shim like the existing IsExternalInit one; default-string init is simpler). System.Data.SqlClient 4.9.0 added (the same package the v1 Host uses; net48-compatible). 
Backend/DbBackedGalaxyBackend wraps the repository: DiscoverAsync builds a real DiscoverHierarchyResponse (groups attributes by gobject, resolves parent-by-tagname, maps category_id → human-readable template-category name mirroring v1 AlarmObjectFilter); ReadValuesAsync/WriteValuesAsync/HistoryReadAsync still surface "MXAccess code lift pending (Phase 2 Task B.1)" because runtime data values genuinely need the COM client; OpenSession/CloseSession/Subscribe/Unsubscribe/AlarmSubscribe/AlarmAck/Recycle return success without backend work (subscription ID is a synthetic counter for now). Live smoke tests (GalaxyRepositoryLiveSmokeTests) skip when localhost ZB is unreachable; when present they verify (1) TestConnection returns true, (2) GetHierarchy returns at least one deployed gobject with a non-empty TagName, (3) GetAttributes returns rows with FullTagReference matching the "tag.attribute" shape, (4) GetLastDeployTime returns a value, (5) DbBackedBackend.DiscoverAsync returns at least one gobject with attributes and a populated TemplateCategory. All 5 pass against the local Galaxy. Full solution 957 pass / 1 pre-existing Phase 0 baseline; the 494 v1 IntegrationTests + 6 v1 IntegrationTests-net48 tests still pass — legacy OtOpcUa.Host untouched. Remaining for the Phase 2 exit gate is the MXAccess COM client port itself (the v1 MxAccessClient partials + IMxProxy abstraction + StaPump-based Connect/Subscribe/Read/Write semantics) — Discover is now solved in DB-backed form, so the lift can focus exclusively on the runtime data-plane.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 23:14:09 -04:00
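The DiscoverAsync shaping step described above (group attributes by gobject, resolve each parent by tag name) could look roughly like this in Python. The row keys (`gobject_id`, `parent_gobject_id`, `tag_name`, `full_tag_reference`) are assumed for illustration; the commit message does not list the actual columns of GalaxyHierarchyRow or GalaxyAttributeRow.

```python
from collections import defaultdict


def build_hierarchy(hierarchy_rows, attribute_rows):
    """Attach attribute references and the parent's tag name to each gobject."""
    tag_by_id = {row["gobject_id"]: row["tag_name"] for row in hierarchy_rows}

    attributes_by_object = defaultdict(list)
    for attr in attribute_rows:
        attributes_by_object[attr["gobject_id"]].append(attr["full_tag_reference"])

    result = []
    for row in hierarchy_rows:
        result.append({
            "tag_name": row["tag_name"],
            # None when the parent id is not a known gobject (for example a root).
            "parent_tag_name": tag_by_id.get(row["parent_gobject_id"]),
            "attributes": list(attributes_by_object.get(row["gobject_id"], [])),
        })
    return result
```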
Joseph Doherty
32eeeb9e04 Phase 2 Streams A+B+C feature-complete — real Win32 pump, all 9 IDriver capabilities, end-to-end IPC dispatch. Streams D+E remain (Galaxy MXAccess code lift + parity-debug cycle, plan-budgeted 3-4 weeks). The 494 v1 IntegrationTests still pass — legacy OtOpcUa.Host untouched. StaPump replaces the BlockingCollection placeholder with a real Win32 message pump lifted from v1 StaComThread per CLAUDE.md "Reference Implementation": dedicated STA Thread with SetApartmentState(STA), GetMessage/PostThreadMessage/PeekMessage/TranslateMessage/DispatchMessage/PostQuitMessage P/Invoke, WM_APP=0x8000 for work-item dispatch, WM_APP+1 for graceful-drain → PostQuitMessage, peek-pm-noremove on entry to force the system to create the thread message queue before signalling Started, IsResponsiveAsync probe still no-op-round-trips through PostThreadMessage so the wedge detection works against the real pump. Concurrent ConcurrentQueue<WorkItem> drains on every WM_APP; fault path on dispose drains-and-faults all pending work-item TCSes with InvalidOperationException("STA pump has exited"). All three StaPumpTests pass against the real pump (apartment state STA, healthy probe true, wedged probe false). GalaxyProxyDriver now implements every Phase 2 Stream C capability — IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe — each forwarding through the matching IPC contract. 
ReadAsync preserves request order even when the Host returns out-of-order values; WriteAsync MessagePack-serializes the value into ValueBytes; SubscribeAsync wraps SubscriptionId in a GalaxySubscriptionHandle record; UnsubscribeAsync uses the new SendOneWayAsync helper on GalaxyIpcClient (fire-and-forget but still gated through the call-semaphore so it doesn't interleave with CallAsync); AlarmSubscribe is one-way and the Host pushes events back via OnAlarmEvent; ReadProcessedAsync short-circuits to NotSupportedException (Galaxy historian only does raw); IRediscoverable's OnRediscoveryNeeded fires when the Host pushes a deploy-watermark notification; IHostConnectivityProbe.GetHostStatuses() snapshots and OnHostStatusChanged fires on Running↔Stopped/Faulted transitions, with IpcHostConnectivityStatus aliased to disambiguate from the Core.Abstractions namespace's same-named type. Internal RaiseDataChange/RaiseAlarmEvent/RaiseRediscoveryNeeded/OnHostConnectivityUpdate methods are the entry points the IPC client will invoke when push frames arrive. 
Host side: new Backend/IGalaxyBackend interface defines the seam between IPC dispatch and the live MXAccess code (so the dispatcher is unit-testable against an in-memory mock without needing live Galaxy); Backend/StubGalaxyBackend returns success for OpenSession/CloseSession/Subscribe/Unsubscribe/AlarmSubscribe/AlarmAck/Recycle and a recognizable "stub: MXAccess code lift pending (Phase 2 Task B.1)"-tagged error for Discover/ReadValues/WriteValues/HistoryRead — keeps the IPC end-to-end testable today and gives the parity team a clear seam to slot the real implementation into; Ipc/GalaxyFrameHandler is the new real dispatcher (replaces StubFrameHandler in Program.cs) — switch on MessageKind, deserialize the matching contract, await backend method, write the response (one-way for Unsubscribe/AlarmSubscribe/AlarmAck/CloseSession), heartbeat handled inline so liveness still works if the backend is sick, exceptions caught and surfaced as ErrorResponse with code "handler-exception" so the Proxy raises GalaxyIpcException instead of disconnecting. End-to-end IPC integration test (EndToEndIpcTests) drives every operation through the full stack — Initialize → Read → Write → Subscribe → Unsubscribe → SubscribeAlarms → AlarmAck → ReadRaw → ReadProcessed (short-circuit) — proving the wire protocol, dispatcher, capability forwarding, and one-way semantics agree end-to-end. Skipped on Windows administrator shells per the same PipeAcl-denies-Administrators reasoning the IpcHandshakeIntegrationTests use. Full solution 952 pass / 1 pre-existing Phase 0 baseline. Phase 2 evidence doc updated: status header now reads "Streams A+B+C complete... 
Streams D+E remain — gated only on the iterative Galaxy code lift + parity-debug cycle"; new Update 2026-04-17 (later) callout enumerates the upgrade with explicit "what's left for the Phase 2 exit gate" — replace StubGalaxyBackend with a MxAccessClient-backed implementation calling on the StaPump, then run the v1 IntegrationTests against the v2 topology and iterate on parity defects until green, then delete legacy OtOpcUa.Host.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 23:02:00 -04:00
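The statement that ReadAsync preserves request order even when the Host returns values out of order comes down to re-indexing the response by tag reference. A minimal Python sketch, assuming each returned item is a (reference, value) pair; the function name and the `missing` placeholder for absent tags are illustrative assumptions, not the driver's behaviour.

```python
def order_like_request(requested_refs, returned_pairs, missing=None):
    """Return (ref, value) pairs in the order the caller asked for them."""
    value_by_ref = dict(returned_pairs)   # index the response by tag reference
    return [(ref, value_by_ref.get(ref, missing)) for ref in requested_refs]
```

For example, requesting `["A.x", "B.y"]` and receiving `[("B.y", 2), ("A.x", 1)]` yields `[("A.x", 1), ("B.y", 2)]`.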
Joseph Doherty
a1e9ed40fb Doc — record that this dev box (DESKTOP-6JL3KKO) hosts the full AVEVA stack required for the LmxOpcUa Phase 2 breakout, removing the "needs live MXAccess runtime" environmental blocker that the partial-exit evidence cited as gating Streams D + E. Inventory verified via Get-Service: 27 ArchestrA / Wonderware / AVEVA services running including aaBootstrap, aaGR (Galaxy Repository), aaLogger, aaUserValidator, aaPim, ArchestrADataStore, AsbServiceManager, AutoBuild_Service; the full Historian set (aahClientAccessPoint, aahGateway, aahInSight, aahSearchIndexer, aahSupervisor, InSQLStorage, InSQLConfiguration, InSQLEventSystem, InSQLIndexing, InSQLIOServer, InSQLManualStorage, InSQLSystemDriver, HistorianSearch-x64); slssvc (Wonderware SuiteLink); MXAccess COM DLL at C:\Program Files (x86)\ArchestrA\Framework\bin\ArchestrA.MXAccess.dll plus the matching .tlb files; OI-Gateway install at C:\Program Files (x86)\Wonderware\OI-Server\OI-Gateway\ — which means the Phase 1 Task E.10 AppServer-via-OI-Gateway smoke test (decision #142) is *also* runnable on the same box, not blocked on a separate AVEVA test machine as the original deferral assumed. dev-environment.md inventory row for "Dev Galaxy" now lists every service and file path; status flips to "Fully available — Phase 2 lift unblocked"; the GLAuth row also fills out v2.4.0 actual install details (direct-bind cn={user},dc=lmxopcua,dc=local; users readonly/writeop/writetune/writeconfig/alarmack/admin/serviceaccount; running under NSSM service GLAuth; current GroupToRole mapping ReadOnly→ConfigViewer / WriteOperate→ConfigEditor / AlarmAck→FleetAdmin) and notes the v2-rebrand to dc=otopcua,dc=local is a future cosmetic change. 
phase-2-partial-exit-evidence.md status header gains "runtime now in place"; an Update 2026-04-17 callout enumerates the same service inventory and concludes "no environmental blocker remains"; the next-session checklist's first step changes from "stand up dev Galaxy" to "verify the local AVEVA stack is still green (Get-Service aaGR, aaBootstrap, slssvc → Running) and the Galaxy ZB repository is reachable" with a new step 9 calling out that the AppServer-via-OI-Gateway smoke test should now be folded in opportunistically. plan.md §"4. Galaxy/MXAccess as Out-of-Process Driver" gains a "Dev environment for the LmxOpcUa breakout" paragraph documenting which physical machine has the runtime so the planning doc no longer reads as if AVEVA capability were a future logistical concern. No source / test changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 22:42:15 -04:00
Joseph Doherty
18f93d72bb Phase 1 LDAP auth + SignalR real-time — closes the last two open Admin UI TODOs. LDAP: Admin/Security/ gets SecurityOptions (bound from appsettings.json Authentication:Ldap), LdapAuthResult record, ILdapAuthService + LdapAuthService ported from scadalink-design's LdapAuthService (TLS guard, search-then-bind when a service account is configured, direct-bind fallback, service-account re-bind after user bind so attribute lookup uses the service principal's read rights, LdapException-to-friendly-message translation, OperationCanceledException pass-through), RoleMapper (pure function: case-insensitive group-name match against LdapOptions.GroupToRole, returns the distinct set of mapped Admin roles). EscapeLdapFilter escapes the five LDAP filter control chars (\, *, (, ), \0); ExtractFirstRdnValue pulls the value portion of a DN's leading RDN for memberOf parsing; ExtractOuSegment added as a GLAuth-specific fallback when the directory doesn't populate memberOf but does embed ou=PrimaryGroup into user DNs (actual GLAuth config in C:\publish\glauth\glauth.cfg uses nameformat=cn, groupformat=ou — direct bind is enough). Login page rewritten: EditForm → ILdapAuthService.AuthenticateAsync → cookie sign-in with claims (Name = displayName, NameIdentifier = username, Role for each mapped role, ldap_group for each raw group); failed bind shows the service's error; empty-role-map returns an explicit "no Admin role mapped" message rather than silently succeeding. appsettings.json gains an Authentication:Ldap section with dev-GLAuth defaults (localhost:3893, UseTls=false, AllowInsecureLdap=true for dev, GroupToRole maps GLAuth's ReadOnly/WriteOperate/AlarmAck → ConfigViewer/ConfigEditor/FleetAdmin). SignalR: two hubs + a BackgroundService poller. FleetStatusHub routes per-cluster NodeStateChanged pushes (SubscribeCluster/UnsubscribeCluster on connection; FleetGroup for dashboard-wide) with a typed NodeStateChangedMessage payload. 
AlertHub auto-subscribes every connection to the AllAlertsGroup and exposes AcknowledgeAsync (ack persistence deferred to v2.1). FleetStatusPoller (IHostedService, 5s default cadence) scans ClusterNodeGenerationState joined with ClusterNode, caches the prior snapshot per NodeId, pushes NodeStateChanged on any delta, and raises AlertMessage("apply-failed") on transition INTO Failed (sticky; the hub client acks later).
Program.cs registers HttpContextAccessor (sign-in needs it), SignalR, LdapOptions + ILdapAuthService, the poller as hosted service, and maps /hubs/fleet + /hubs/alerts endpoints. ClusterDetail adds @rendermode RenderMode.InteractiveServer, @implements IAsyncDisposable, and a HubConnectionBuilder subscription that calls LoadAsync() on each NodeStateChanged for its cluster so the "current published" card refreshes without a page reload; a dismissible "Live update" info banner surfaces the most recent event. Microsoft.AspNetCore.SignalR.Client 10.0.0 + Novell.Directory.Ldap.NETStandard 3.6.0 added.
Tests: 13 new. RoleMapperTests (single group, case-insensitive match, multi-group distinct-roles, unknown-group ignored, empty-map); LdapAuthServiceTests (EscapeLdapFilter with 4 inputs, ExtractFirstRdnValue with 4 inputs, all via reflection against internals); LdapLiveBindTests (skip when localhost:3893 unreachable; valid-credentials-bind-succeeds; wrong-password-fails-with-recognizable-error; empty-username-rejected-before-hitting-directory); FleetStatusPollerTests (throwaway DB, seeds cluster+node+generation+apply-state, runs PollOnceAsync, asserts NodeStateChanged hit the recorder; second test seeds a Failed state and asserts AlertRaised fired). The poller tests are backed by RecordingHubContext/RecordingHubClients/RecordingClientProxy, which capture SendCoreAsync invocations while throwing NotImplementedException for the IHubClients methods the poller doesn't call (fail-fast if evolution adds new dependencies).
InternalsVisibleTo added so the test project can call FleetStatusPoller.PollOnceAsync directly. Full solution 946 pass / 1 pre-existing Phase 0 baseline failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 22:28:49 -04:00
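Two of the pure helpers named in this commit, EscapeLdapFilter and RoleMapper, are easy to illustrate. The Python sketch below uses the standard hex escapes from RFC 4515 for the five characters the commit lists; that the project's C# helper emits exactly these forms is an assumption. The role mapping follows the described behaviour (case-insensitive group match, distinct roles, unknown groups ignored), with hypothetical function names.

```python
# RFC 4515 escapes for the five filter control characters.
_LDAP_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}


def escape_ldap_filter(value):
    """Escape user input before embedding it in an LDAP search filter."""
    return "".join(_LDAP_ESCAPES.get(ch, ch) for ch in value)


def map_roles(groups, group_to_role):
    """Map directory groups to roles: case-insensitive, distinct, order-preserving."""
    lowered = {group.lower(): role for group, role in group_to_role.items()}
    roles = []
    for group in groups:
        role = lowered.get(group.lower())
        if role is not None and role not in roles:   # skip unknown groups and duplicates
            roles.append(role)
    return roles
```

With the dev GroupToRole mapping quoted in the commit, `map_roles(["readonly", "WriteOperate"], {"ReadOnly": "ConfigViewer", "WriteOperate": "ConfigEditor", "AlarmAck": "FleetAdmin"})` yields `["ConfigViewer", "ConfigEditor"]`.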
Joseph Doherty
7a5b535cd6 Phase 1 Stream E Admin UI — finish Blazor pages so operators can run the draft → publish → rollback workflow end-to-end without hand-executing SQL. Adds eight new scoped services that wrap the Configuration stored procs + managed validators: EquipmentService (CRUD with auto-derived EquipmentId per decision #125), UnsService (areas + lines), NamespaceService, DriverInstanceService (generic JSON DriverConfig editor per decision #94 — per-driver schema validation lands in each driver's phase), NodeAclService (grant + revoke with bundled-preset permission sets; full per-flag editor + bulk-grant + permission simulator deferred to v2.1), ReservationService (fleet-wide active + released reservation inspector + FleetAdmin-only sp_ReleaseExternalIdReservation wrapper with required-reason invariant), DraftValidationService (hydrates a DraftSnapshot from the draft's rows plus prior-cluster Equipment + active reservations, runs the managed DraftValidator to surface every rule in one pass for inline validation panel), AuditLogService (recent ConfigAuditLog reader). 
Pages: /clusters list with create-new shortcut; /clusters/new wizard that creates the cluster row + initial empty draft in one go; /clusters/{id} detail with 8 tabs (Overview / Generations / Equipment / UNS Structure / Namespaces / Drivers / ACLs / Audit) — tabs that write always target the active draft, published generations stay read-only; /clusters/{id}/draft/{gen} editor with live validation panel (errors list with stable code + message + context; publish button disabled while any error exists) and tab-embedded sub-components; /clusters/{id}/draft/{gen}/diff three-column view backed by sp_ComputeGenerationDiff with Added/Removed/Modified badges; Generations tab with per-row rollback action wired to sp_RollbackToGeneration; /reservations FleetAdmin-only page (CanPublish policy) with active + released lists and a modal release dialog that enforces non-empty reason and round-trips through sp_ReleaseExternalIdReservation; /login scaffold with stub credential accept + FleetAdmin-role cookie issuance (real LDAP bind via the ScadaLink-parity LdapAuthService is deferred until live GLAuth integration — marked in the login view and in the Phase 1 partial-exit TODO). Layout: sidebar gets Overview / Clusters / Reservations + AuthorizeView with signed-in username + roles + sign-out POST to /auth/logout; cascading authentication state registered for <AuthorizeView> to work in RenderMode.InteractiveServer. Integration testing: AdminServicesIntegrationTests creates a throwaway per-run database (same pattern as the Configuration test fixture), applies all three migrations, and exercises (1) create-cluster → add-namespace+UNS+driver+equipment → validate (expects zero errors) → publish (expects Published status) → rollback (expects one new Published + at least one Superseded); (2) cross-cluster namespace binding draft → validates to BadCrossClusterNamespaceBinding per decision #122. 
Old flat Components/Pages/Clusters.razor moved to Components/Pages/Clusters/ClustersList.razor so the Clusters folder can host tab sub-components without the razor generator creating a type-and-namespace collision. Dev appsettings.json connection string switched from Integrated Security to sa auth to match the otopcua-mssql container on port 14330 (remapped from 1433 to coexist with the native MSSQL14 Galaxy ZB instance). Browser smoke test completed: home page, clusters list, new-cluster form, cluster detail with a seeded row, reservations (redirected to login for anon user) all return 200 / 302-to-login as expected; full solution 928 pass / 1 pre-existing Phase 0 baseline failure. Phase 1 Stream E items explicitly deferred with TODOs: CSV import for Equipment, SignalR FleetStatusHub + AlertHub real-time push, bulk-grant workflow, permission-simulator trie, merge-equipment draft, AppServer-via-OI-Gateway end-to-end smoke test (decision #142), and the real LDAP bind replacing the Login page stub.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:52:42 -04:00
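The "surface every rule in one pass" validation that drives the draft editor's publish button in the commit above could look something like the sketch below. It is written in Python for illustration only and is not the project's DraftValidator: `validate_draft`, `can_publish`, and the `DuplicateEquipmentName` rule are hypothetical; the `BadCrossClusterNamespaceBinding` code and the code + message + context error shape are taken from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationError:
    code: str      # stable machine-readable code
    message: str   # human-readable description
    context: str   # which row or entity the error refers to


def validate_draft(cluster_id, namespace_bindings, equipment_names):
    """Run every rule and collect all failures instead of stopping at the first.

    `namespace_bindings` maps namespace name -> owning cluster id.
    """
    errors = []
    # Rule from the commit message: a draft may only bind namespaces of its own cluster.
    for ns_name, ns_cluster in namespace_bindings.items():
        if ns_cluster != cluster_id:
            errors.append(ValidationError(
                "BadCrossClusterNamespaceBinding",
                f"namespace belongs to cluster {ns_cluster!r}",
                ns_name,
            ))
    # Hypothetical second rule, included only to show several rules reporting together.
    seen = set()
    for name in equipment_names:
        if name in seen:
            errors.append(ValidationError(
                "DuplicateEquipmentName", "equipment name is repeated", name))
        seen.add(name)
    return errors


def can_publish(errors):
    """The publish action stays disabled while any error exists."""
    return not errors
```

Collecting into a list rather than raising on the first failure is what lets a validation panel show every problem at once.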
Joseph Doherty
01fd90c178 Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. 
Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). 
Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:35:25 -04:00
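For readers unfamiliar with the length-prefixed framing mentioned for Driver.Galaxy.Shared in the commit above, a minimal Python sketch follows. It is not the project's FrameWriter/FrameReader: the 4-byte big-endian prefix is an assumption (the commit message gives only the 16 MiB cap), and `write_frame`/`read_frame` are hypothetical names.

```python
import io
import struct

MAX_FRAME_BYTES = 16 * 1024 * 1024   # 16 MiB cap, as stated in the commit message
_PREFIX = struct.Struct(">I")        # ASSUMPTION: 4-byte big-endian length prefix


def write_frame(stream, payload):
    """Write one frame: length prefix followed by the payload bytes."""
    if len(payload) > MAX_FRAME_BYTES:
        raise ValueError("payload exceeds the frame size cap")
    stream.write(_PREFIX.pack(len(payload)))
    stream.write(payload)


def _read_exact(stream, count):
    """Read exactly `count` bytes, tolerating short reads from the stream."""
    chunks = []
    remaining = count
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream closed mid-frame")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_frame(stream):
    """Read one frame, rejecting an oversized length before allocating for it."""
    (length,) = _PREFIX.unpack(_read_exact(stream, _PREFIX.size))
    if length > MAX_FRAME_BYTES:
        raise ValueError("declared frame length exceeds the cap")
    return _read_exact(stream, length)
```

Checking the declared length before reading the body is the point of the cap: a corrupt or hostile prefix cannot force a large allocation.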
Joseph Doherty
fc0ce36308 Add Installed Inventory section to dev-environment.md tracking every v2 dev service, toolchain, credential, port, data location, and container volume stood up on this machine. Records what is actually running (not just planned) so future setup work and troubleshooting have a single source of truth. Four subsections: Host (machine identity, VM platform, CPU, OS features); Toolchain (.NET 10 SDK 10.0.201 + runtimes 10.0.5, WSL2 default v2 with docker-desktop distro Running, Docker Desktop 29.3.1 / engine 29.3.1, dotnet-ef CLI 10.0.6 — each row records install method and date); Services (SQL Server 2022 container otopcua-mssql at localhost:1433 with sa/OtOpcUaDev_2026! credentials and Docker named volume otopcua-mssql-data mounted at /var/opt/mssql, dev Galaxy, GLAuth at C:\publish\glauth\ on ports 3893/3894, plus rows for not-yet-standing services like OPC Foundation reference server / FOCAS stub / Modbus simulator / ab_server / Snap7 / TwinCAT XAR VM with target ports to stand up later); Connection strings for appsettings.Development.json (copy-paste-ready, flagged never-commit); Container management quick reference (start/stop/logs/shell/query/nuclear-reset); Credential rotation note.
Per decision #137 (dev env credentials documented openly in dev-environment.md; production uses Integrated Security / gMSA per decision #46 and never any value from this table). Section lives at the top of the doc immediately after Two Environment Tiers, so it's discoverable as the single source of truth for "what's actually running here right now".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 16:57:09 -04:00
Joseph Doherty
bf6741ba7f Doc — flesh out dev-environment.md inner-loop bootstrap with the explicit Windows install steps that surfaced when actually trying to stand up SQL Server on the local box: prereq winget commands per tool (.NET 10 SDK / .NET Framework 4.8 SDK + targeting pack / Git / PowerShell 7.4+); WSL2 install (UAC-elevated) as a separate sub-step before Docker Desktop; Docker Desktop install (UAC-elevated) followed by sign-out/sign-in for docker-users group membership; explicit post-install Docker Desktop config checklist (WSL 2 based engine = checked, Windows containers = NOT checked, WSL Integration enabled for Ubuntu) per decision #134; named volume otopcua-mssql-data:/var/opt/mssql on the SQL Server container so DB files survive container restart and docker rm; sqlcmd verification command using the new mssql-tools18 path that the 2022 image ships with; EF Core CLI install for use starting in Phase 1 Stream B; bumped step count from 8 → 10. Also adds a Troubleshooting subsection covering the seven most common Windows install snags (WSL distro not auto-installed needs -d Ubuntu; Docker PATH not refreshed needs new shell or sign-in; docker-users group membership needs sign-out/in; WSL 2 kernel update needs manual install on legacy systems; SA password complexity rules; Linux vs Windows containers mode mismatch; Hyper-V coexistence with Docker requires WSL 2 backend not Hyper-V backend per decision #134). Step 1 acceptance criteria gain "docker ps shows otopcua-mssql Up" and explicit note that steps 4a/4b need admin elevation (no silent admin-free path exists on Windows).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:54:52 -04:00
Joseph Doherty
980ea5190c Phase 1 Stream A — Core.Abstractions project + 11 capability interfaces + DriverTypeRegistry + interface-independence tests
New project src/ZB.MOM.WW.OtOpcUa.Core.Abstractions (.NET 10, BCL-only dependencies, GenerateDocumentationFile=true, TreatWarningsAsErrors=true) defining the contract surface every driver implements. Per docs/v2/plan.md decisions #4 (composable capability interfaces), #52 (streaming IAddressSpaceBuilder), #53 (capability discovery via `is` checks, no flag enum), #54 (optional IRediscoverable sub-interface), #59 (Core.Abstractions internal-only for now, but designed as if public).

Eleven capability interfaces:
- IDriver — required lifecycle / health / config-apply / memory-footprint accounting (per driver-stability.md Tier A/B allocation tracking)
- ITagDiscovery — discovers tags, streaming them to IAddressSpaceBuilder
- IReadable — on-demand reads; idempotent, so safe for Polly retry
- IWritable — writes NOT auto-retried by default per decisions #44 + #45
- ISubscribable — data-change subscriptions covering both native (Galaxy MXAccess advisory, OPC UA monitored items, TwinCAT ADS) and driver-internal polled (Modbus, AB CIP, S7, FOCAS) mechanisms; OnDataChange callback regardless of source
- IAlarmSource — alarm events + acknowledge + AlarmSeverity enum mirroring acl-design.md NodePermissions alarm-severity values
- IHistoryProvider — HistoryReadRaw + HistoryReadProcessed with continuation points
- IRediscoverable — opt-in change-detection signal; static drivers don't implement
- IHostConnectivityProbe — generalized from Galaxy's GalaxyRuntimeProbeManager per plan §5a
- IDriverConfigEditor — Admin UI plug-point for per-driver custom config editors; deferred to each driver's phase per decision #27
- IAddressSpaceBuilder — streaming builder API for driver-driven address-space construction
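
Capability discovery by type check rather than by flag enum (decision #53) can be illustrated with a small Python analogue of the C# `is` pattern. Everything here is a hypothetical sketch: `Driver`, `Readable`, and `Writable` stand in for IDriver, IReadable, and IWritable, and the two concrete driver classes are invented for the example.

```python
from abc import ABC, abstractmethod


class Driver(ABC):
    """Required base contract (stand-in for IDriver)."""

    @abstractmethod
    def driver_type(self): ...


class Readable(ABC):
    """Optional capability (stand-in for IReadable)."""

    @abstractmethod
    def read(self, tag): ...


class Writable(ABC):
    """Optional capability (stand-in for IWritable)."""

    @abstractmethod
    def write(self, tag, value): ...


class ReadOnlyDriver(Driver, Readable):
    """Hypothetical driver that composes only the read capability."""

    def driver_type(self):
        return "read-only-demo"

    def read(self, tag):
        return 0.0


class ReadWriteDriver(ReadOnlyDriver, Writable):
    """Hypothetical driver that adds the write capability."""

    def driver_type(self):
        return "read-write-demo"

    def write(self, tag, value):
        return None


def capabilities(driver):
    """Discover what a driver supports by checking which interfaces it implements."""
    found = []
    if isinstance(driver, Readable):
        found.append("read")
    if isinstance(driver, Writable):
        found.append("write")
    return found
```

With this shape, adding a new capability means adding a new interface; no central enum has to change.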

Plus DTOs: DriverDataType, SecurityClassification (mirroring v1 Galaxy model), DriverAttributeInfo (replaces Galaxy-specific GalaxyAttributeInfo per plan §5a), DriverHealth + DriverState, DataValueSnapshot (universal OPC UA quality + timestamp carrier per decision #13), HostConnectivityStatus + HostState + HostStatusChangedEventArgs, RediscoveryEventArgs, DataChangeEventArgs, AlarmEventArgs + AlarmAcknowledgeRequest + AlarmSeverity, WriteRequest + WriteResult, HistoryReadResult + HistoryAggregateType, ISubscriptionHandle + IAlarmSubscriptionHandle + IVariableHandle.

DriverTypeRegistry singleton with Register / Get / TryGet / All; thread-safe via Interlocked.Exchange snapshot replacement on registration; case-insensitive lookups; rejects duplicate registrations; rejects empty type names. DriverTypeMetadata record carries TypeName + AllowedNamespaceKinds (NamespaceKindCompatibility flags enum per decision #111) + per-config-tier JSON Schemas the validator checks at draft-publish time (decision #91).
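
The registry semantics listed above (snapshot replacement on registration, case-insensitive lookups, duplicate and empty-name rejection) can be sketched in Python as follows. This is an approximation, not the C# implementation: Python has no `Interlocked.Exchange`, so the sketch swaps in a writer lock plus an immutable snapshot that readers access without locking, and it maps the .NET exceptions to their nearest Python equivalents (`KeyError`, `RuntimeError`, `ValueError`).

```python
import threading
from types import MappingProxyType


class DriverTypeRegistry:
    """Illustrative copy-on-write registry; readers never take the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshot = MappingProxyType({})

    def register(self, type_name, metadata):
        if not type_name or not type_name.strip():
            raise ValueError("type name must not be empty")
        key = type_name.casefold()  # case-insensitive key
        with self._lock:
            if key in self._snapshot:
                raise RuntimeError(f"driver type {type_name!r} is already registered")
            # Build a new mapping and publish it in one reference assignment,
            # so concurrent readers see either the old or the new snapshot.
            updated = dict(self._snapshot)
            updated[key] = metadata
            self._snapshot = MappingProxyType(updated)

    def get(self, type_name):
        """Raises KeyError for an unknown type."""
        return self._snapshot[type_name.casefold()]

    def try_get(self, type_name):
        """Returns None for an unknown type."""
        return self._snapshot.get(type_name.casefold())

    def all(self):
        return list(self._snapshot.values())
```

Under these assumptions, registration is rare and lookups are frequent, which is the case copy-on-write snapshots favour.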

Tests project tests/ZB.MOM.WW.OtOpcUa.Core.Abstractions.Tests (xUnit v3 1.1.0 matching existing test projects). 24 tests covering: 1) interface independence reflection check (no references outside BCL/System; all public types in root namespace; every capability interface is public); 2) DriverTypeRegistry round-trip, case-insensitive lookups, KeyNotFoundException on unknown, null on TryGet of unknown, InvalidOperationException on duplicate registration (case-insensitive too), All() enumeration, NamespaceKindCompatibility bitmask combinations, ArgumentException on empty type names.

Build: 0 errors, 4 warnings (only pre-existing transitive package vulnerability + analyzer hints). Full test suite: 845 passing / 1 failing — strict improvement over Phase 0 baseline (821/1) by the 24 new Core.Abstractions tests; no regressions in any other test project.

Phase 1 entry-gate record (docs/v2/implementation/entry-gate-phase-1.md) documents the deviation: only Stream A executed in this continuation since Streams B-E need SQL Server / GLAuth / Galaxy infrastructure standup per dev-environment.md Step 1, which is currently TODO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:15:55 -04:00