lmxopcua/docs/ServiceHosting.md
Joseph Doherty d11dd0520b Galaxy IPC unblock — live dev-box E2E path
Three root-cause fixes to get an elevated dev-box shell past session open
through to real MXAccess reads:

1. PipeAcl — drop BUILTIN\Administrators deny ACE. UAC's filtered token
   carries the Admins SID as deny-only, so the deny fired even from
   non-elevated admin-account shells. The per-connection SID check in
   PipeServer.VerifyCaller remains the real authorization boundary.

2. PipeServer — swap the Hello-read / VerifyCaller order. ImpersonateNamedPipeClient
   returns ERROR_CANNOT_IMPERSONATE until at least one frame has been read
   from the pipe; reading Hello first satisfies that rule. Previously the
   ACL deny-first path masked this race — removing the deny ACE exposed it.

3. GalaxyIpcClient — add a background reader + single pending-response
   slot. A RuntimeStatusChange event between OpenSessionRequest and
   OpenSessionResponse used to satisfy the caller's single ReadFrameAsync
   and fail CallAsync with "Expected OpenSessionResponse, got
   RuntimeStatusChange". The reader now routes response kinds (and
   ErrorResponse) to the pending TCS and everything else to a handler the
   driver registers in InitializeAsync. The Proxy was already set up to
   raise managed events from RaiseDataChange / RaiseAlarmEvent /
   OnHostConnectivityUpdate — those helpers had no caller until now.

4. RedundancyPublisherHostedService — swallow BadServerHalted while
   polling host.Server.CurrentInstance. StandardServer throws that code
   during startup rather than returning null, so the first poll attempt
   crashed the BackgroundService (and the host) before OnServerStarted
   ran. This race was latent behind the Galaxy init failure above.

Updates docs that described the Admins deny ACE + mandatory non-elevated
shells, and drops the admin-skip guards from every Galaxy integration +
E2E fixture that had them (IpcHandshakeIntegrationTests, EndToEndIpcTests,
ParityFixture, LiveStackFixture, HostSubprocessParityTests).

Adds GalaxyIpcClientRoutingTests covering the router's
request/response match, ErrorResponse, event-between-call, idle event,
and peer-close paths.

Verified live on the dev box against the p7-smoke cluster (gen 6):
driver registered=1 failedInit=0, Phase 7 bridge subscribed, OPC UA
server up on 4840, MXAccess read round-trip returns real data with
Status=0x00000000.

Task #112 — partial: Galaxy live stack is functional end-to-end. The
supplied test-galaxy.ps1 script still fails because the UNS walker
encodes TagConfig JSON as the tag's NodeId instead of the seeded TagId
(pre-existing; separate issue from this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 16:30:16 -04:00


# Service Hosting
## Overview
A production OtOpcUa deployment runs **three processes**, each with a distinct runtime, platform target, and install surface:
| Process | Project | Runtime | Platform | Responsibility |
|---|---|---|---|---|
| **OtOpcUa Server** | `src/ZB.MOM.WW.OtOpcUa.Server` | .NET 10 | x64 | Hosts the OPC UA endpoint; loads every non-Galaxy driver in-process; exposes `/healthz` and `/readyz`. |
| **OtOpcUa Admin** | `src/ZB.MOM.WW.OtOpcUa.Admin` | .NET 10 (ASP.NET Core / Blazor Server) | x64 | Operator UI for Config DB editing + fleet status, SignalR hubs (`FleetStatusHub`, `AlertHub`), Prometheus `/metrics`. |
| **OtOpcUa Galaxy.Host** | `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host` | .NET Framework 4.8 | x86 (32-bit) | Hosts MXAccess COM on a dedicated STA thread with a Win32 message pump; exposes a named-pipe IPC surface consumed by `Driver.Galaxy.Proxy` inside the Server process. |
The x86 / .NET Framework 4.8 constraint applies **only** to Galaxy.Host because the MXAccess toolkit DLLs (`Program Files (x86)\ArchestrA\Framework\bin`) are 32-bit-only COM components, which a 64-bit process cannot load in-process. Every other driver (Modbus, S7, OpcUaClient, AbCip, AbLegacy, TwinCAT, FOCAS) runs in-process in the 64-bit Server.
## Server process
`src/ZB.MOM.WW.OtOpcUa.Server/Program.cs` uses the generic host:
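```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Serilog;

var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddSerilog();                                       // Serilog as the logging provider
builder.Services.AddWindowsService(o => o.ServiceName = "OtOpcUa");  // only takes effect when started as a Windows service
builder.Services.AddHostedService<OpcUaServerService>();             // OPC UA server lifecycle
builder.Services.AddHostedService<HostStatusPublisher>();            // DriverHostStatus heartbeat
builder.Build().Run();
```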
`OpcUaServerService` is a `BackgroundService` (decision #30 — TopShelf from v1 was replaced by the generic-host `AddWindowsService` wrapper; no TopShelf dependency remains in any csproj). It owns:
1. Config bootstrap — reads `Node:NodeId`, `Node:ClusterId`, `Node:ConfigDbConnectionString`, `Node:LocalCachePath` from `appsettings.json`.
2. `NodeBootstrap` — pulls the latest published generation from the Config DB into the LiteDB local cache (`LiteDbConfigCache`) so the node starts even if the central DB is briefly unreachable.
3. `DriverHost` — instantiates configured driver instances from the generation, wires each through `CapabilityInvoker` resilience pipelines.
4. `OpcUaApplicationHost` — builds the OPC UA endpoint, applies `OpcUaServerOptions` + `LdapOptions`, registers `AuthorizationGate` at dispatch.
5. `HostStatusPublisher` — a second hosted service that heartbeats `DriverHostStatus` rows so the Admin UI Fleet view sees the node.
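
The cache fallback in step 2 can be pictured with a minimal sketch. It is not the actual `NodeBootstrap` code: the three delegates and the `BootstrapSketch` name are hypothetical stand-ins for the Config DB query, the LiteDB write, and the LiteDB read, and catching every exception type is a simplification.

```csharp
using System;
using System.Threading.Tasks;

public static class BootstrapSketch
{
    // Hypothetical helper: prefer the central Config DB, fall back to the local cache.
    public static async Task<T> LoadGenerationAsync<T>(
        Func<Task<T>> fetchFromConfigDb,
        Func<T, Task> writeToLocalCache,
        Func<Task<T>> readFromLocalCache) where T : class
    {
        T latest;
        try
        {
            latest = await fetchFromConfigDb();
        }
        catch (Exception ex) // assumption: real code would catch narrower, DB-specific exceptions
        {
            // Central DB unreachable: start from the last generation that was cached.
            T cached = await readFromLocalCache();
            if (cached == null)
            {
                throw new InvalidOperationException(
                    "Config DB unreachable and no cached generation is available.", ex);
            }
            return cached;
        }

        // Central DB answered: refresh the cache so the next offline start can use it.
        await writeToLocalCache(latest);
        return latest;
    }
}
```

With this shape, a node that has booted successfully at least once can restart while the central DB is down, and a node with an empty cache fails fast with the original exception attached.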
### Installation
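The same executable runs either as a console app or as a Windows service; the generic-host `AddWindowsService` wrapper only takes effect when the process is started by the Service Control Manager. Service registration and lifecycle use `sc.exe` (note the space after `binPath=` and `start=`, which `sc.exe` documents as required):
| Action | Command |
|---|---|
| Run in console | `ZB.MOM.WW.OtOpcUa.Server.exe` |
| Install as Windows service | `sc.exe create OtOpcUa binPath= "C:\Program Files\OtOpcUa\Server\ZB.MOM.WW.OtOpcUa.Server.exe" start= auto` |
| Start | `sc.exe start OtOpcUa` |
| Stop | `sc.exe stop OtOpcUa` |
| Uninstall | `sc.exe delete OtOpcUa` |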
### Health endpoints
The Server exposes `/healthz` and `/readyz`. They are consumed by (a) the Admin `FleetStatusPoller`, as input to Fleet status, and (b) `PeerReachabilityTracker` in a peer Server process, as the HTTP side of the peer-reachability probe.
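
A minimal sketch of a probe against these two endpoints follows. It is not the `FleetStatusPoller` or `PeerReachabilityTracker` implementation: the `HealthProbe` name, the base address, and the rule "any 2xx counts as healthy" are assumptions, since this document does not state the health port or the response contract.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class HealthProbe
{
    // Returns true when GET {baseUri}{path} answers with a 2xx status within the timeout.
    public static async Task<bool> IsOkAsync(HttpClient http, Uri baseUri, string path, TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                using (var response = await http.GetAsync(new Uri(baseUri, path), cts.Token))
                {
                    return response.IsSuccessStatusCode;
                }
            }
            catch (HttpRequestException) { return false; }       // connection refused, DNS failure, ...
            catch (OperationCanceledException) { return false; } // probe timed out
        }
    }

    public static async Task Main()
    {
        // Hypothetical address: substitute the URL the Server actually listens on.
        var baseUri = new Uri("http://localhost:5000/");
        using (var http = new HttpClient())
        {
            bool healthz = await IsOkAsync(http, baseUri, "healthz", TimeSpan.FromSeconds(2));
            bool readyz = await IsOkAsync(http, baseUri, "readyz", TimeSpan.FromSeconds(2));
            Console.WriteLine($"healthz={healthz} readyz={readyz}");
        }
    }
}
```

Treating timeouts and connection failures as `false` rather than throwing keeps a poller loop simple: an unreachable node simply shows up as unhealthy.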
## Admin process
`src/ZB.MOM.WW.OtOpcUa.Admin/Program.cs` is a stock `WebApplication`. Highlights:
- Cookie auth (`CookieAuthenticationDefaults`, scheme name `OtOpcUa.Admin`) + Blazor Server (`AddInteractiveServerComponents`) + SignalR.
- Authorization policies gated by `AdminRoles`: `ConfigViewer`, `ConfigEditor`, `FleetAdmin` (see `Services/AdminRoles.cs`). `CanEdit` policy requires `ConfigEditor` or `FleetAdmin`; `CanPublish` requires `FleetAdmin`.
- `OtOpcUaConfigDbContext` registered against `ConnectionStrings:ConfigDb`.
- Scoped services: `ClusterService`, `GenerationService`, `EquipmentService`, `UnsService`, `NamespaceService`, `DriverInstanceService`, `NodeAclService`, `PermissionProbeService`, `AclChangeNotifier`, `ReservationService`, `DraftValidationService`, `AuditLogService`, `HostStatusService`, `ClusterNodeService`, `EquipmentImportBatchService`, `ILdapGroupRoleMappingService`.
- Singleton `RedundancyMetrics` (meter name `ZB.MOM.WW.OtOpcUa.Redundancy`) + `CertTrustService` (promotes rejected client certs in the Server's PKI store to trusted via the Admin Certificates page).
- `LdapAuthService` bound to `Authentication:Ldap` — same LDAP flow as ScadaLink CentralUI for visual parity.
- SignalR hubs mapped at `/hubs/fleet` and `/hubs/alerts`; `FleetStatusPoller` runs as a hosted service and pushes `RoleChanged`, host status, and alert events.
- OpenTelemetry → Prometheus exporter at `/metrics` when `Metrics:Prometheus:Enabled=true` (default). Because the exporter is pull-based, the common K8s deployment needs no OpenTelemetry Collector.
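
The `CanEdit` / `CanPublish` policies in the list above could be registered roughly as follows. This is a sketch, not the Admin project's code: the role names are written as string literals taken from the list, whereas the real code presumably references the constants in `Services/AdminRoles.cs`, and the cookie authentication setup is omitted.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddAuthorization(options =>
{
    // RequireRole with several roles is satisfied by ANY one of them,
    // so ConfigEditor or FleetAdmin may edit.
    options.AddPolicy("CanEdit", policy => policy.RequireRole("ConfigEditor", "FleetAdmin"));

    // Publishing is restricted to FleetAdmin only.
    options.AddPolicy("CanPublish", policy => policy.RequireRole("FleetAdmin"));
});
```

A page or endpoint would then opt in with `[Authorize(Policy = "CanPublish")]`; `ConfigViewer` needs no policy of its own in this sketch because it grants neither edit nor publish rights.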
### Installation
Deployed as an ASP.NET Core service, hosted either as a Windows service through the generic-host `AddWindowsService` wrapper or behind an IIS reverse proxy for multi-node fleets. It listens on whatever `ASPNETCORE_URLS` specifies.
## Galaxy.Host process
`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Program.cs` is a .NET Framework 4.8 x86 console executable. Configuration comes from environment variables supplied by the supervisor (`Driver.Galaxy.Proxy.Supervisor`):
| Env var | Purpose |
|---|---|
| `OTOPCUA_GALAXY_PIPE` | Pipe name the host listens on (default `OtOpcUaGalaxy`). |
| `OTOPCUA_ALLOWED_SID` | SID of the Server process's principal; anyone else is refused during the handshake. |
| `OTOPCUA_GALAXY_SECRET` | Per-spawn shared secret the client must present in the Hello frame. |
| `OTOPCUA_GALAXY_BACKEND` | `mxaccess` (default), `db` (ZB-only, no COM), `stub` (in-memory; for tests). |
| `OTOPCUA_GALAXY_ZB_CONN` | SQL connection string to the ZB Galaxy repository. |
| `OTOPCUA_HISTORIAN_*` | Optional Wonderware Historian SDK config if Historian is enabled for this node. |
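
A minimal sketch of reading the variables in the table above, written against C# 7.3 so it compiles for .NET Framework 4.8. The `GalaxyHostEnv` type, its property names, and the decision to treat the SID and secret as mandatory are assumptions; only the variable names, defaults, and backend values come from the table.

```csharp
using System;

// Hypothetical settings holder; the real Galaxy.Host type is not shown in this document.
public sealed class GalaxyHostEnv
{
    private static readonly string[] KnownBackends = { "mxaccess", "db", "stub" };

    public string PipeName { get; private set; }
    public string AllowedSid { get; private set; }
    public string Secret { get; private set; }
    public string Backend { get; private set; }
    public string ZbConnectionString { get; private set; }

    // getEnv is injectable so the parsing can be exercised without touching the process environment.
    public static GalaxyHostEnv Load(Func<string, string> getEnv)
    {
        var env = new GalaxyHostEnv
        {
            PipeName = OrDefault(getEnv("OTOPCUA_GALAXY_PIPE"), "OtOpcUaGalaxy"),
            Backend = OrDefault(getEnv("OTOPCUA_GALAXY_BACKEND"), "mxaccess"),
            AllowedSid = getEnv("OTOPCUA_ALLOWED_SID"),
            Secret = getEnv("OTOPCUA_GALAXY_SECRET"),
            ZbConnectionString = getEnv("OTOPCUA_GALAXY_ZB_CONN"),
        };

        if (Array.IndexOf(KnownBackends, env.Backend) < 0)
            throw new InvalidOperationException("Unknown OTOPCUA_GALAXY_BACKEND: " + env.Backend);

        // Assumption: the handshake cannot work without both values, so fail at startup.
        if (string.IsNullOrWhiteSpace(env.AllowedSid) || string.IsNullOrWhiteSpace(env.Secret))
            throw new InvalidOperationException("OTOPCUA_ALLOWED_SID and OTOPCUA_GALAXY_SECRET are required.");

        return env;
    }

    private static string OrDefault(string value, string fallback)
    {
        return string.IsNullOrWhiteSpace(value) ? fallback : value;
    }
}

public static class Program
{
    public static void Main()
    {
        var env = GalaxyHostEnv.Load(Environment.GetEnvironmentVariable);
        Console.WriteLine("pipe=" + env.PipeName + " backend=" + env.Backend);
    }
}
```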
The host spins up `StaPump` (the STA thread with message pump), creates the MXAccess `LMXProxyServer` COM object on that thread, and handles all COM calls there; the IPC layer marshals work items via `PostThreadMessage`.
### Pipe security
`PipeServer` builds a `PipeAcl` from the provided `SecurityIdentifier` and creates a `NamedPipeServerStream` with `maxNumberOfServerInstances: 1`. The DACL grants `ReadWrite | Synchronize` only to the allowed SID and denies `LocalSystem`. On connect, the caller's SID is obtained through `NamedPipeServerStream.RunAsClient` and compared against the configured allow list; callers whose SID doesn't match `OTOPCUA_ALLOWED_SID` are rejected before any frame is processed. The handshake then requires a matching shared secret in the first Hello frame. The installed dev host (`OtOpcUaGalaxyHost`) runs as `dohertj2`, with the secret stored at `.local/galaxy-host-secret.txt`.
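
The DACL and SID check described above can be sketched with the .NET Framework 4.8 pipe APIs. This is an illustration under assumptions, not the real `PipeAcl` / `PipeServer` code: the `PipeSecuritySketch` class and its method names are hypothetical, and denying `FullControl` to `LocalSystem` is a guess at which rights the deny ACE covers.

```csharp
using System;
using System.IO.Pipes;
using System.Security.AccessControl;
using System.Security.Principal;

public static class PipeSecuritySketch
{
    // Grant ReadWrite | Synchronize to the allowed SID only, and explicitly deny LocalSystem.
    public static PipeSecurity BuildAcl(SecurityIdentifier allowedSid)
    {
        var security = new PipeSecurity();
        security.AddAccessRule(new PipeAccessRule(
            allowedSid,
            PipeAccessRights.ReadWrite | PipeAccessRights.Synchronize,
            AccessControlType.Allow));
        security.AddAccessRule(new PipeAccessRule(
            new SecurityIdentifier(WellKnownSidType.LocalSystemSid, null),
            PipeAccessRights.FullControl, // assumption: the denied rights are not stated in this document
            AccessControlType.Deny));
        return security;
    }

    // Uses the .NET Framework constructor overload that accepts a PipeSecurity.
    public static NamedPipeServerStream CreateServer(string pipeName, SecurityIdentifier allowedSid)
    {
        return new NamedPipeServerStream(
            pipeName,
            PipeDirection.InOut,
            1,                          // maxNumberOfServerInstances
            PipeTransmissionMode.Byte,
            PipeOptions.Asynchronous,
            0,                          // inBufferSize: system default
            0,                          // outBufferSize: system default
            BuildAcl(allowedSid));
    }

    // Impersonates the connected client just long enough to capture its SID, then compares.
    public static bool CallerMatches(NamedPipeServerStream server, SecurityIdentifier allowedSid)
    {
        SecurityIdentifier caller = null;
        server.RunAsClient(() =>
        {
            using (var identity = WindowsIdentity.GetCurrent())
            {
                caller = identity.User;
            }
        });
        return caller != null && caller.Equals(allowedSid);
    }
}
```

The DACL keeps most principals from opening the pipe at all, while `CallerMatches` is the per-connection check. The commit notes at the top of this page report that impersonation fails until at least one frame has been read from the pipe, so a check like `CallerMatches` would have to run after the first read.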
### Installation
Galaxy.Host is wrapped with NSSM (the Non-Sucking Service Manager) because the executable itself is a plain console app, not a `ServiceBase` Windows service. After install, the supervisor adopts the child process over the pipe. Install/uninstall commands follow the NSSM pattern; the install sequence is:
```bash
nssm install OtOpcUaGalaxyHost "C:\Program Files (x86)\OtOpcUa\Galaxy.Host\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe"
nssm set OtOpcUaGalaxyHost ObjectName .\dohertj2 <password>
nssm set OtOpcUaGalaxyHost AppEnvironmentExtra OTOPCUA_GALAXY_BACKEND=mxaccess OTOPCUA_GALAXY_SECRET=<secret> OTOPCUA_ALLOWED_SID=<sid>
nssm start OtOpcUaGalaxyHost
```
(Exact values for the environment block are generated by the Admin UI and committed alongside `.local/galaxy-host-secret.txt` on the dev box.)
## Inter-process communication
```
┌──────────────────────────┐        LDAP bind (Authentication:Ldap)        ┌──────────────────────────┐
│ OtOpcUa Admin (x64)      │ ─────────────────────────────────────────────▶│ LDAP / AD                │
│ Blazor Server + SignalR  │                                               └──────────────────────────┘
│ /metrics (Prometheus)    │        FleetStatusPoller → ClusterNode poll
│                          │ ─────────────────────────────────────────────▶┌──────────────────────────┐
│                          │        Cluster/Generation/ACL writes          │ Config DB (SQL Server)   │
└──────────────────────────┘ ─────────────────────────────────────────────▶│ OtOpcUaConfigDbContext   │
          ▲                                                                └──────────────────────────┘
          │ SignalR                                                                    ▲
          │ (role change,                                                              │ sp_GetCurrentGenerationForCluster
          │  host status,                                                              │ sp_PublishGeneration
          │  alerts)                                                                   │
┌──────────────────────────┐                                                           │
│ OtOpcUa Server (x64)     │ ──────────────────────────────────────────────────────────┘
│ OPC UA endpoint          │
│ Non-Galaxy drivers       │        Named pipe (OtOpcUaGalaxy)             ┌──────────────────────────┐
│ Driver.Galaxy.Proxy      │ ─────────────────────────────────────────────▶│ Galaxy.Host (x86 .NFx)   │
│                          │        SID + shared-secret handshake          │ STA + message pump       │
│ /healthz /readyz         │                                               │ MXAccess COM             │
└──────────────────────────┘                                               │ Historian SDK (opt)      │
                                                                           └──────────────────────────┘
```
## appsettings.json boundary
Each process reads its own `appsettings.json` for **bootstrap only** — connection strings, LDAP bind config, transport security profile, redundancy node id, logging. The authoritative configuration tree (drivers, UNS, tags, ACLs) lives in the Config DB and is edited through the Admin UI. See [`Configuration.md`](Configuration.md) for the split.
## Development bootstrap
For the Windows install steps (SQL Server in Docker, .NET 10 SDK, .NET Framework 4.8 SDK, Docker Desktop WSL 2 backend, EF Core CLI, first-run migration), see [`docs/v2/dev-environment.md`](v2/dev-environment.md).