Commit Graph

9 Commits

Author SHA1 Message Date
Joseph Doherty
1bd8a1875b PR 7.3 tail — doc + memory housekeeping for retired Galaxy.Host
Closes the v2-mxgw migration's housekeeping debt now that PR 7.2 has
retired the legacy projects + service.

Repo docs:
- CLAUDE.md: rewrote the Galaxy section + reference-impl + MXAccess
  documentation pointers; replaced .NET 4.8 x86 / COM apartment
  constraints with .NET 10 AnyCPU + a pointer to the gateway. Dropped
  the "Service hosting (Galaxy.Host)" library-preferences row.
- docs/ServiceHosting.md: rewrote (was 156 lines of Galaxy.Host pipe
  IPC details). Now reflects the v2 process shape: OtOpcUa.Server +
  OtOpcUa.Admin + optional OtOpcUaWonderwareHistorian, with Galaxy
  access via the in-process driver → mxaccessgw.
- docs/v2/dev-environment.md: scrubbed four Galaxy.Host references
  (TwinCAT/Galaxy.Host shared-host note; .NET 4.8 SDK row; install
  step #2; risks table). The .NET 4.8 SDK is now correctly framed as
  "optional, only needed when building the mxaccessgw worker".
- mxaccess_documentation.md: deleted from the repo root (obsolete; the
  gateway repo is the canonical MxAccess API doc).

Memory housekeeping (under ~/.claude/projects/.../memory/):
- Retired: project_galaxy_host_service.md,
  project_galaxy_host_installed.md, reference_impl.md (the LmxProxy
  Host MXAccess reference is no longer the design pattern this repo
  uses).
- Revised: project_overview.md (now describes the .NET 10 + mxaccessgw
  shape), project_aveva_platform_installed.md (AVEVA still required
  on the dev box but consumed by the gateway worker, not by anything
  here), project_galaxy_via_mxgateway.md (post-7.2 state — flagged as
  the only Galaxy backend), project_server_history_alarm_subsystems.md
  (per-driver fallbacks retired in PR 7.2).
- MEMORY.md index updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:14:22 -04:00
Joseph Doherty
d11dd0520b Galaxy IPC unblock — live dev-box E2E path
Three root-cause fixes to get an elevated dev-box shell past session open
through to real MXAccess reads:

1. PipeAcl — drop BUILTIN\Administrators deny ACE. UAC's filtered token
   carries the Admins SID as deny-only, so the deny fired even from
   non-elevated admin-account shells. The per-connection SID check in
   PipeServer.VerifyCaller remains the real authorization boundary.

2. PipeServer — swap the Hello-read / VerifyCaller order. ImpersonateNamedPipeClient
   returns ERROR_CANNOT_IMPERSONATE until at least one frame has been read
   from the pipe; reading Hello first satisfies that rule. Previously the
   ACL deny-first path masked this race — removing the deny ACE exposed it.

3. GalaxyIpcClient — add a background reader + single pending-response
   slot. A RuntimeStatusChange event between OpenSessionRequest and
   OpenSessionResponse used to satisfy the caller's single ReadFrameAsync
   and fail CallAsync with "Expected OpenSessionResponse, got
   RuntimeStatusChange". The reader now routes response kinds (and
   ErrorResponse) to the pending TCS and everything else to a handler the
   driver registers in InitializeAsync. The Proxy was already set up to
   raise managed events from RaiseDataChange / RaiseAlarmEvent /
   OnHostConnectivityUpdate — those helpers had no caller until now.

4. RedundancyPublisherHostedService — swallow BadServerHalted while
   polling host.Server.CurrentInstance. StandardServer throws that code
   during startup rather than returning null, so the first poll attempt
   crashed the BackgroundService (and the host) before OnServerStarted
   ran. This race was latent behind the Galaxy init failure above.

Updates docs that described the Admins deny ACE + mandatory non-elevated
shells, and drops the admin-skip guards from every Galaxy integration +
E2E fixture that had them (IpcHandshakeIntegrationTests, EndToEndIpcTests,
ParityFixture, LiveStackFixture, HostSubprocessParityTests).

Adds GalaxyIpcClientRoutingTests covering the router's
request/response match, ErrorResponse, event-between-call, idle event,
and peer-close paths.

Verified live on the dev box against the p7-smoke cluster (gen 6):
driver registered=1 failedInit=0, Phase 7 bridge subscribed, OPC UA
server up on 4840, MXAccess read round-trip returns real data with
Status=0x00000000.

Task #112 — partial: Galaxy live stack is functional end-to-end. The
supplied test-galaxy.ps1 script still fails because the UNS walker
encodes TagConfig JSON as the tag's NodeId instead of the seeded TagId
(pre-existing; separate issue from this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 16:30:16 -04:00
Joseph Doherty
5506b43ddc Doc refresh (task #204) — operational docs for multi-process multi-driver OtOpcUa
Five operational docs rewritten for v2 (multi-process, multi-driver, Config-DB authoritative):

- docs/Configuration.md — replaced appsettings-only story with the two-layer model.
  appsettings.json is bootstrap only (Node identity, Config DB connection string,
  transport security, LDAP bind, logging). Authoritative config (clusters, namespaces,
  UNS, equipment, tags, driver instances, ACLs, role grants, poll groups) lives in
  the Config DB accessed via OtOpcUaConfigDbContext and edited through the Admin UI
  draft/publish workflow. Added v1-to-v2 migration index so operators can locate where
  each old section moved. Cross-links to docs/v2/config-db-schema.md + docs/v2/admin-ui.md.

- docs/Redundancy.md — Phase 6.3 rewrite. Named every class under
  src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/: RedundancyCoordinator, RedundancyTopology,
  ApplyLeaseRegistry (publish fencing), PeerReachabilityTracker, RecoveryStateManager,
  ServiceLevelCalculator (pure function), RedundancyStatePublisher. Documented the
  full 11-band ServiceLevel matrix (Maintenance=0 through AuthoritativePrimary=255)
  from ServiceLevelCalculator.cs and the per-ClusterNode fields (RedundancyRole,
  ServiceLevelBase, ApplicationUri). Covered metrics
  (otopcua.redundancy.role_transition counter + primary/secondary/stale_count gauges
  on meter ZB.MOM.WW.OtOpcUa.Redundancy) and SignalR RoleChanged push from
  FleetStatusPoller to RedundancyTab.razor.

- docs/security.md — preserved the transport-security section (still accurate) and
  added Phase 6.2 authorization. Four concerns now documented in one place:
  (1) transport security profiles, (2) OPC UA auth via LdapUserAuthenticator
  (note: task spec called this LdapAuthenticationProvider — actual class name is
  LdapUserAuthenticator in Server/Security/), (3) data-plane authorization via
  NodeAcl + PermissionTrie + AuthorizationGate — additive-only model per decision
  #129, ClusterId → Namespace → UnsArea → UnsLine → Equipment → Tag hierarchy,
  NodePermissions bundle, PermissionProbeService in Admin for "probe this permission",
  (4) control-plane authorization via LdapGroupRoleMapping + AdminRole
  (ConfigViewer / ConfigEditor / FleetAdmin, CanEdit / CanPublish policies) —
  deliberately independent of data-plane ACLs per decision #150. Documented the
  OTOPCUA0001 Roslyn analyzer (UnwrappedCapabilityCallAnalyzer) as the compile-time
  guard ensuring every driver-capability async call is wrapped by CapabilityInvoker.

- docs/ServiceHosting.md — three-process rewrite: OtOpcUa Server (net10 x64,
  BackgroundService + AddWindowsService, hosts OPC UA endpoint + all non-Galaxy
  drivers), OtOpcUa Admin (net10 x64, Blazor Server + SignalR + /metrics via
  OpenTelemetry Prometheus exporter), OtOpcUa Galaxy.Host (.NET Framework 4.8 x86,
  NSSM-wrapped, env-variable driven, STA thread + MXAccess COM). Pipe ACL
  denies-Admins detail + non-elevated shell requirement captured from feedback memory.
  Divergence from CLAUDE.md: task spec said "TopShelf is still the service-installer
  wrapper per CLAUDE.md note" but no csproj in the repo references TopShelf — decision
  #30 replaced it with the generic host's AddWindowsService wrapper (per the doc
  comment on OpcUaServerService). Reflected the actual state + flagged this divergence
  here so someone can update CLAUDE.md separately.

- docs/StatusDashboard.md — replaced the full v1 reference (dashboard endpoints,
  health check rules, StatusData DTO, etc.) with a short "superseded by Admin UI"
  pointer that preserves git-blame continuity + avoids broken links from other docs
  that reference it.

Class references verified by reading:
  src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/{RedundancyCoordinator, ServiceLevelCalculator,
      ApplyLeaseRegistry, RedundancyStatePublisher}.cs
  src/ZB.MOM.WW.OtOpcUa.Core/Authorization/{PermissionTrie, PermissionTrieBuilder,
      PermissionTrieCache, TriePermissionEvaluator, AuthorizationGate}.cs
  src/ZB.MOM.WW.OtOpcUa.Server/Security/{AuthorizationGate, LdapUserAuthenticator}.cs
  src/ZB.MOM.WW.OtOpcUa.Admin/{Program.cs, Services/AdminRoles.cs,
      Services/RedundancyMetrics.cs, Hubs/FleetStatusPoller.cs}
  src/ZB.MOM.WW.OtOpcUa.Server/Program.cs + appsettings.json
  src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/{Program.cs, Ipc/PipeServer.cs}
  src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/{ClusterNode, NodeAcl,
      LdapGroupRoleMapping}.cs
  src/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:34:25 -04:00
Joseph Doherty
3b2defd94f Phase 0 — mechanical rename ZB.MOM.WW.LmxOpcUa.* → ZB.MOM.WW.OtOpcUa.*
Renames all 11 projects (5 src + 6 tests), the .slnx solution file, all source-file namespaces, all axaml namespace references, and all v1 documentation references in CLAUDE.md and docs/*.md (excluding docs/v2/ which is already in OtOpcUa form). Also updates the TopShelf service registration name from "LmxOpcUa" to "OtOpcUa" per Phase 0 Task 0.6.

Preserves runtime identifiers per Phase 0 Out-of-Scope rules to avoid breaking v1/v2 client trust during coexistence: OPC UA `ApplicationUri` defaults (`urn:{GalaxyName}:LmxOpcUa`), server `EndpointPath` (`/LmxOpcUa`), `ServerName` default (feeds cert subject CN), `MxAccessConfiguration.ClientName` default (defensive — stays "LmxOpcUa" for MxAccess audit-trail consistency), client OPC UA identifiers (`ApplicationName = "LmxOpcUaClient"`, `ApplicationUri = "urn:localhost:LmxOpcUaClient"`, cert directory `%LocalAppData%\LmxOpcUaClient\pki\`), and the `LmxOpcUaServer` class name (class rename out of Phase 0 scope per Task 0.5 sed pattern; happens in Phase 1 alongside `LmxNodeManager → GenericDriverNodeManager` Core extraction). 23 LmxOpcUa references retained, all enumerated and justified in `docs/v2/implementation/exit-gate-phase-0.md`.

Build clean: 0 errors, 30 warnings (lower than baseline 167). Tests at strict improvement over baseline: 821 passing / 1 failing vs baseline 820 / 2 (one flaky pre-existing failure passed this run; the other still fails — both pre-existing and unrelated to the rename). `Client.UI.Tests`, `Historian.Aveva.Tests`, `Client.Shared.Tests`, `IntegrationTests` all match baseline exactly. Exit gate compliance results recorded in `docs/v2/implementation/exit-gate-phase-0.md` with all 7 checks PASS or DEFERRED-to-PR-review (#7 service install verification needs Windows service permissions on the reviewer's box).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 13:57:47 -04:00
Joseph Doherty
c76ab8fdee Close all four stability-review 2026-04-13 findings so a failed runtime probe subscription can no longer leave a phantom entry that Tick() flips to Stopped and fans out false BadOutOfService quality across a host's subtree, a silently-failed dashboard bind no longer lets the service advertise a successful start while an operator-visible endpoint is dead, the seven sync-over-async sites in LmxNodeManager (rebuild probe sync, Read, Write, four HistoryRead overrides) can no longer park the OPC UA stack thread indefinitely on a hung backend, and alarm auto-subscribe + transferred-subscription restore no longer race shutdown as untracked fire-and-forget tasks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 00:48:07 -04:00
Joseph Doherty
9b42b61eb6 Extract historian into a runtime-loaded plugin so hosts without the Wonderware SDK can run with Historian.Enabled=false
The aahClientManaged SDK is now isolated in ZB.MOM.WW.LmxOpcUa.Historian.Aveva and loaded via HistorianPluginLoader from a Historian/ subfolder only when enabled, removing the SDK from Host's compile-time and deploy-time surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 15:16:07 -04:00
Joseph Doherty
95ad9c6866 Resolve 6 of 7 stability review findings and close test coverage gaps
Fixes P1 StaComThread hang (crash-path faulting via WorkItem queue), P1 subscription
fire-and-forget (block+log or ContinueWith on 5 call sites), P2 continuation point
leak (PurgeExpired on Retrieve/Release), P2 dashboard bind failure (localhost prefix,
bool Start), P3 background loop double-start (task handles + join on stop in 3 files),
and P3 config logging exposure (SqlConnectionStringBuilder password masking). Adds
FakeMxAccessClient fault injection and 12 new tests. Documents required runtime
assemblies in ServiceHosting.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 15:37:27 -04:00
Joseph Doherty
a55153d7d5 Add configurable non-transparent OPC UA server redundancy
Separates ApplicationUri from namespace identity so each instance in a
redundant pair has a unique server URI while sharing the same Galaxy
namespace. Exposes RedundancySupport, ServerUriArray, and dynamic
ServiceLevel through the standard OPC UA server object. ServiceLevel
is computed from role (Primary/Secondary) and runtime health (MXAccess
and DB connectivity). Adds CLI redundancy command, second deployed
service instance, and 31 new tests including paired-server integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 13:32:17 -04:00
Joseph Doherty
965e430f48 Add component-level documentation for all 14 server subsystems
Provides technical documentation covering OPC UA server, address space,
Galaxy repository, MXAccess bridge, data types, read/write, subscriptions,
alarms, historian, incremental sync, configuration, dashboard, service
hosting, and CLI tool. Updates README with component documentation table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:47:59 -04:00