Doc refresh (task #205) — requirements updated for multi-driver OtOpcUa three-process deploy

Per-file summary:

- docs/reqs/OpcUaServerReqs.md — rewritten driver-agnostic. OPC-001..OPC-013 re-scoped to multi-driver address-space composition + capability dispatch; OPC-014 AuthorizationGate + permission trie; OPC-015 dynamic ServiceLevel via RedundancyCoordinator; OPC-017 surgical generation-apply rebuild; OPC-012 capability dispatch via CapabilityInvoker (decision #143 idempotence-aware retry); OPC-013 per-host Polly isolation (decision #144); OPC-019 OpenTelemetry metrics. Transport-security profile matrix (OPC-010) + UserName/LDAP (OPC-011) preserved.

- docs/reqs/GalaxyRepositoryReqs.md — scope clarified as Galaxy-driver-only (not platform). GR-001..GR-004 tied to ITagDiscovery.DiscoverAsync + IRediscoverable; all SQL runs inside OtOpcUa.Galaxy.Host and streams to Proxy via named pipe. GR-008 capability wrapping via CapabilityInvoker added. Cross-links to docs/v2/driver-specs.md + docs/GalaxyRepository.md.

- docs/reqs/MxAccessClientReqs.md — scope clarified as Galaxy-Host-only. MXA-001..MXA-009 preserved (STA pump, register/unregister, subscription refcount, auto-reconnect, probe, COM cleanup, operation metrics, error translation). MXA-010 Proxy-side capability wrapping + MXA-011 pipe ACL + per-process shared secret (OTOPCUA_ALLOWED_SID / OTOPCUA_GALAXY_SECRET) added.

- docs/reqs/ServiceHostReqs.md — rewritten for three-process deployment. Shared section (SVC-SHARED-001/002) for Serilog + bootstrap-only appsettings. SRV-* for OtOpcUa.Server (net10 x64, Microsoft.Extensions.Hosting + AddWindowsService, in-process driver hosting, redundancy-node bootstrap). ADM-* for OtOpcUa.Admin (Blazor Server, cookie+LDAP auth, CanEdit/CanPublish policies, sole DB writer, Prometheus /metrics, audit logging). GHX-* for OtOpcUa.Galaxy.Host (TopShelf, net48 x86, named-pipe IPC bootstrap, STA backend lifecycle, crash handling tied to supervisor).

- docs/reqs/ClientRequirements.md — restructured as numbered, verifiable requirements. SHR-* for Client.Shared (single IOpcUaClientService, ConnectionSettings, failover, cross-platform certs, type-coercing write, UI-thread neutrality). CLI-001..CLI-011 cover connect/read/write/browse/subscribe/historyread/alarms/redundancy. UI-001..UI-008 cover connection panel, tree browser, each tab, connection-state reflection, cross-platform build. Reference design content (IOpcUaClientService shape, models, view-model map, mock layout) preserved.

- docs/reqs/StatusDashboardReqs.md — retired cleanly. Replaced with a pointer to docs/v2/admin-ui.md + HLR-015 / HLR-016 / HLR-017 / ADM-*. Mapping table shows each retired DASH-001..DASH-009 requirement's replacement (live cluster-node view via SignalR, Prometheus metrics, driver-instance detail views, etc.). Note that a formal AdminUiReqs.md can be written later if needed for cert compliance.

HighLevelReqs.md was already at the target shape (HLR-001..HLR-018 with Revision header noting retired HLR-009) as of commit f217636; verified identical and no additional edit required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-20 01:31:58 -04:00
parent f217636467
commit 48970af416
6 changed files with 739 additions and 644 deletions


@@ -1,117 +1,265 @@
# Service Host — Component Requirements
Parent: [HLR-006](HighLevelReqs.md#hlr-006-windows-service-hosting), [HLR-007](HighLevelReqs.md#hlr-007-logging)
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). v1 was a single Windows service; v2 ships **three cooperating Windows services** and the service-host requirements are rewritten per-process. SVC-001…SVC-006 from v1 are preserved in spirit (TopShelf, Serilog, config loading, graceful shutdown, startup sequence, unhandled-exception handling) but are now scoped to the process they apply to. SRV-* prefixes the Server process, ADM-* the Admin process, GHX-* the Galaxy Host process. A shared-requirements section at the top covers cross-process concerns (Serilog, logging rotation, bootstrap config scope).
## SVC-001: TopShelf Hosting
Parent: [HLR-007](HighLevelReqs.md#hlr-007-service-hosting), [HLR-008](HighLevelReqs.md#hlr-008-logging), [HLR-011](HighLevelReqs.md#hlr-011-config-db-and-draft-publish)
The application shall use TopShelf for Windows service lifecycle (install, uninstall, start, stop) and interactive console mode for development.
## Shared Requirements (all three processes)
### Acceptance Criteria
### SVC-SHARED-001: Serilog Logging
- TopShelf HostFactory configures the service with name `LmxOpcUa`, display name `LMX OPC UA Server`.
- Service installs via command line: `ZB.MOM.WW.OtOpcUa.Host.exe install`.
- Service uninstalls via: `ZB.MOM.WW.OtOpcUa.Host.exe uninstall`.
- Service runs as LocalSystem account (needed for MXAccess COM access and Windows Auth to SQL Server).
- Interactive console mode (exe with no args) works for development/debugging.
- `StartAutomatically` is set for Windows service registration.
Every process shall use Serilog with a rolling daily file sink at Information level minimum, a console sink, and an opt-in `CompactJsonFormatter` file sink.
### Details
#### Acceptance Criteria
- Platform target: x86 (32-bit) — required for MXAccess COM interop.
- Service description: "OPC UA server exposing System Platform Galaxy tags via MXAccess."
- Console sink active on every process (for interactive / debug mode).
- Rolling daily file sink:
- Server: `logs/otopcua-YYYYMMDD.log`
- Admin: `logs/otopcua-admin-YYYYMMDD.log`
- Galaxy Host: `%ProgramData%\OtOpcUa\galaxy-host-YYYYMMDD.log`
- Retention count and min level configurable via `Serilog:*` in each process's `appsettings.json`.
- JSON sink opt-in via `Serilog:WriteJson = true` (emits `*.json.log` alongside the plain-text file) for SIEM ingestion.
- `Log.CloseAndFlush()` invoked in a `finally` block on shutdown.
- Structured logging (Serilog message templates) — no `string.Format`.
---
## SVC-002: Serilog Logging
### SVC-SHARED-002: Bootstrap Configuration Scope
The application shall configure Serilog with a rolling daily file sink and console sink, with log files retained for a configurable number of days (default 31).
`appsettings.json` is bootstrap-only per HLR-011. Operational configuration (clusters, drivers, namespaces, tags, ACLs, poll groups) lives in the Config DB.
### Acceptance Criteria
#### Acceptance Criteria
- Console sink active (for interactive/debug mode).
- Rolling daily file sink writing to `logs/lmxopcua-YYYYMMDD.log`.
- Retained file count: configurable, default 31 days.
- Minimum log level: configurable, default Information.
- Log file path: configurable, default `logs/lmxopcua-.log`.
- Serilog is initialized before any other component (first thing in Main).
- `Log.CloseAndFlush()` called in finally block on exit.
### Details
- Structured logging with Serilog message templates (not string.Format).
- Log output includes timestamp, level, source context, message, and exception.
- Fatal exceptions are caught at the top level and logged before exit.
- `appsettings.json` may contain only: Config DB connection string, `Node:NodeId`, `Node:ClusterId`, `Node:LocalCachePath`, `OpcUa:*` security bootstrap fields, `Ldap:*` bootstrap fields, `Serilog:*`, `Redundancy:*` role id.
- Any attempt to configure driver instances, tags, or equipment through `appsettings.json` shall be rejected at startup with a descriptive error.
- Invalid or missing required bootstrap fields are detected at startup with a clear error (`"Node:NodeId not configured"` style).
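
The bootstrap-only rule above can be illustrated with a small, language-neutral sketch. It is not the Server's actual validator: the function name, the exception type, and the `ConnectionStrings:ConfigDb` key are assumptions made for illustration, while the remaining section names and the `"Node:NodeId not configured"` message style come from the criteria above.

```python
# Hypothetical sketch of SVC-SHARED-002 validation; not the real implementation.
# "ConnectionStrings" / "ConfigDb" are assumed names for the Config DB connection string.
ALLOWED_SECTIONS = {"ConnectionStrings", "Node", "OpcUa", "Ldap", "Serilog", "Redundancy"}
REQUIRED_KEYS = ("ConnectionStrings:ConfigDb", "Node:NodeId", "Node:ClusterId")


class BootstrapConfigError(Exception):
    """Raised when appsettings.json violates the bootstrap-only contract."""


def validate_bootstrap(config: dict) -> None:
    # Reject any top-level section that belongs in the Config DB (drivers, tags, ...).
    unexpected = sorted(set(config) - ALLOWED_SECTIONS)
    if unexpected:
        raise BootstrapConfigError(
            "appsettings.json is bootstrap-only; move these sections to the Config DB: "
            + ", ".join(unexpected)
        )
    # Report the first missing or empty required field by its full key path.
    for key in REQUIRED_KEYS:
        section, _, field = key.partition(":")
        if not config.get(section, {}).get(field):
            raise BootstrapConfigError(f"{key} not configured")
```

Rejecting unknown sections outright, rather than ignoring them, keeps a misplaced driver or tag definition from being silently dropped at startup.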
---
## SVC-003: Configuration
## OtOpcUa.Server — Service Host Requirements (SRV-*)
The application shall load configuration from appsettings.json with support for environment-specific overrides (appsettings.*.json) and environment variables.
### SRV-001: Microsoft.Extensions.Hosting + AddWindowsService
### Acceptance Criteria
The Server shall use `Host.CreateApplicationBuilder(args)` with `AddWindowsService(o => o.ServiceName = "OtOpcUa")` to run as a Windows service.
- `appsettings.json` is the primary configuration file.
- Environment-specific overrides via `appsettings.{environment}.json`.
- Configuration sections: `OpcUa`, `MxAccess`, `GalaxyRepository`, `Dashboard`.
- Missing optional configuration keys use documented defaults (service does not crash).
- Invalid configuration (e.g., port = -1) is detected at startup with a clear error message.
#### Acceptance Criteria
### Details
- Config is loaded once at startup. No hot-reload (service restart required for config changes). This is appropriate for an industrial service.
- All configurable values and their defaults are documented in `appsettings.json`.
- Service name `OtOpcUa`.
- Installs via standard `sc.exe` tooling or the build-provided installer.
- Runs as a configured service account (typically a domain service account with Config DB read access; Windows Auth to SQL Server).
- Console mode (running `ZB.MOM.WW.OtOpcUa.Server.exe` with no Windows service context) works for development and debugging.
- Platform target: .NET 10 x64 (default per decision in `plan.md` §3).
---
## SVC-004: Graceful Shutdown
### SRV-002: Startup Sequence
On service stop, the application shall gracefully shut down all components and flush logs before exiting.
The Server shall start components in a defined order, with failure handling at each step.
### Acceptance Criteria
- TopShelf WhenStopped triggers orderly shutdown.
- Shutdown sequence: (1) stop change detection polling, (2) stop OPC UA server (stop accepting new sessions, complete pending operations), (3) disconnect MXAccess (cleanup all COM objects), (4) stop status dashboard HTTP listener, (5) flush Serilog.
- Shutdown completes within 30 seconds (Windows SCM timeout).
- All IDisposable components are disposed in reverse-creation order.
### Details
- `CancellationTokenSource` signals all background loops (monitor, change detection, HTTP listener) to stop.
- Log "Service shutdown complete" at Information level as the final log entry before flush.
---
## SVC-005: Startup Sequence
The service shall start components in a defined order, with failure handling at each step.
### Acceptance Criteria
#### Acceptance Criteria
- Startup sequence:
1. Load configuration
2. Initialize Serilog
3. Start STA thread
4. Connect to MXAccess
5. Query Galaxy Repository for initial build
6. Build OPC UA address space
7. Start OPC UA server listener
8. Start change detection polling
9. Start status dashboard HTTP listener
- Failure in steps 1-4 prevents startup (service fails to start).
- Failure in steps 5-9 logs Error but allows the service to run in degraded mode.
### Details
- Degraded mode means the service is running but may have an empty address space (waiting for Galaxy DB) or no dashboard (port conflict). MXAccess connection is the minimum required for the service to be useful.
1. Load `appsettings.json` bootstrap configuration + initialize Serilog.
2. Validate bootstrap fields (NodeId, ClusterId, Config DB connection).
3. Initialize `OpcUaApplicationHost` (server-certificate resolution via `SecurityProfileResolver`).
4. Connect to Config DB; request current published generation for `ClusterId`.
5. If unreachable, fall back to `LiteDbConfigCache` (latest applied generation).
6. Apply generation: register driver instances, build namespaces, wire capability pipelines.
7. Start `OpcUaServerService` hosted service (opens endpoint listener).
8. Start `HostStatusPublisher` (pushes `ClusterNodeGenerationState` to Config DB for Admin UI SignalR consumers).
9. Start `RedundancyCoordinator` + `ServiceLevelCalculator`.
- Failure in steps 1-3 prevents startup.
- Failure in steps 4-6 logs Error and enters degraded mode (empty namespaces, `DriverHealth.Unavailable` on every driver, `ServiceLevel = 0`).
- Failure in steps 7-9 logs Error and shuts down (endpoint is non-optional).
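
As a rough illustration of the failure policy above, the mapping from a failed step to the Server's reaction can be written as a single lookup. The enum and function names are hypothetical; only the step ranges and their outcomes are taken from the criteria.

```python
from enum import Enum


class FailureOutcome(Enum):
    ABORT_STARTUP = "abort startup"    # steps 1-3: the service fails to start
    DEGRADED_MODE = "degraded mode"    # steps 4-6: empty namespaces, ServiceLevel = 0
    SHUT_DOWN = "log error, shut down"  # steps 7-9: the endpoint is non-optional


def outcome_for_failed_step(step: int) -> FailureOutcome:
    """Map a failed SRV-002 startup step (1-9) to the required reaction."""
    if 1 <= step <= 3:
        return FailureOutcome.ABORT_STARTUP
    if 4 <= step <= 6:
        return FailureOutcome.DEGRADED_MODE
    if 7 <= step <= 9:
        return FailureOutcome.SHUT_DOWN
    raise ValueError(f"unknown startup step: {step}")
```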
---
## SVC-006: Unhandled Exception Handling
### SRV-003: Graceful Shutdown
The service shall handle unexpected crashes gracefully.
On service stop, the Server shall gracefully shut down all driver instances, the OPC UA listener, and flush logs before exiting.
### Acceptance Criteria
#### Acceptance Criteria
- Register `AppDomain.CurrentDomain.UnhandledException` handler that logs Fatal before the process terminates.
- TopShelf service recovery is configured: restart on failure with 60-second delay.
- Fatal-level log entry includes the full exception details.
- `IHostApplicationLifetime.ApplicationStopping` triggers orderly shutdown.
- Shutdown sequence: stop `HostStatusPublisher` → stop driver instances (disconnect each via `IDriver.DisposeAsync`, which for Galaxy tears down the named pipe) → stop OPC UA server (stop accepting new sessions, complete pending reads/writes) → flush Serilog.
- Shutdown completes within 30 seconds (Windows SCM timeout).
- All `IDisposable` / `IAsyncDisposable` components disposed in reverse-creation order.
- Final log entry: `"OtOpcUa.Server shutdown complete"` at Information level.
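
The reverse-creation-order disposal rule can be sketched with a LIFO cleanup stack. This is an analogy in Python, not the .NET host's mechanism; the function and the component names passed to it are placeholders, and only the final log message is taken from the criteria.

```python
from contextlib import ExitStack


def run_and_shutdown(component_names, log):
    """Start components in order, then dispose them in reverse-creation order."""
    with ExitStack() as stack:
        for name in component_names:
            log.append(f"start {name}")
            # Callbacks registered on an ExitStack run last-in, first-out on exit.
            stack.callback(log.append, f"dispose {name}")
    # Leaving the with-block has already run every dispose callback.
    log.append("OtOpcUa.Server shutdown complete")
    return log
```

Calling `run_and_shutdown(["first", "second", "third"], [])` records the three starts, then the disposals in the order third, second, first, then the final shutdown message.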
---
### SRV-004: Unhandled Exception Handling
The Server shall handle unexpected crashes gracefully.
#### Acceptance Criteria
- Registers `AppDomain.CurrentDomain.UnhandledException` handler that logs Fatal before the process terminates.
- Windows service recovery configured: restart on failure with 60-second delay.
- Fatal log entry includes full exception details.
---
### SRV-005: Drivers Hosted In-Process
All drivers except Galaxy run in-process within the Server.
#### Acceptance Criteria
- Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client drivers are resolved from the DI container and managed by `DriverHost`.
- Galaxy driver in-process component is `Driver.Galaxy.Proxy`, which forwards to `OtOpcUa.Galaxy.Host` over the named pipe (see GHX-*).
- Each driver instance's lifecycle (connect, discover, subscribe, dispose) is orchestrated by `DriverHost`.
---
### SRV-006: Redundancy-Node Bootstrap
The Server shall bootstrap its redundancy identity from `appsettings.json` and the Config DB.
#### Acceptance Criteria
- `Node:NodeId` + `Node:ClusterId` identify this node uniquely; the `Redundancy` coordinator looks up `ClusterNode.RedundancyRole` (Primary / Secondary) from the Config DB.
- Two nodes of the same cluster connect to the same Config DB and the same ClusterId but have different NodeIds and different `ApplicationUri` values.
- Missing or ambiguous `(ClusterId, NodeId)` causes startup failure.
---
## OtOpcUa.Admin — Service Host Requirements (ADM-*)
### ADM-001: ASP.NET Core Blazor Server
The Admin app shall use `WebApplication.CreateBuilder` with Razor Components (`AddRazorComponents().AddInteractiveServerComponents()`), SignalR, and cookie authentication.
#### Acceptance Criteria
- Blazor Server (not WebAssembly) per `plan.md` §Tech Stack.
- Hosts SignalR hubs for live cluster state (used by `ClusterNodeGenerationState` views, crash-loop alerts, etc.).
- Runs as a Windows service via `AddWindowsService` OR as a standard ASP.NET Core process behind IIS / reverse proxy (site decides).
- Platform target: .NET 10 x64.
---
### ADM-002: Authentication and Authorization
Admin users authenticate via LDAP bind with cookie auth; three admin roles gate operations.
#### Acceptance Criteria
- Cookie auth scheme: `OtOpcUa.Admin`, 8-hour expiry, path `/login` for challenge.
- LDAP bind via `LdapAuthService`; user group memberships map to admin roles (`ConfigViewer`, `ConfigEditor`, `FleetAdmin`).
- Authorization policies:
- `CanEdit` requires `ConfigEditor` or `FleetAdmin`.
- `CanPublish` requires `FleetAdmin`.
- View-only access requires `ConfigViewer` (or higher).
- Unauthenticated requests to any Admin page redirect to `/login`.
- Per-cluster role grants layer on top: a `ConfigEditor` with no grant for cluster X can view it but not edit.
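
The policy rules above reduce to set membership checks. The sketch below is illustrative only: the function names are hypothetical, and the source states the per-cluster grant rule only for `ConfigEditor`, so applying the same grant check to `FleetAdmin` is an assumption of this sketch.

```python
EDIT_ROLES = {"ConfigEditor", "FleetAdmin"}                   # CanEdit
PUBLISH_ROLES = {"FleetAdmin"}                                # CanPublish
VIEW_ROLES = {"ConfigViewer", "ConfigEditor", "FleetAdmin"}   # ConfigViewer or higher


def can_view(roles) -> bool:
    return bool(set(roles) & VIEW_ROLES)


def can_publish(roles) -> bool:
    return bool(set(roles) & PUBLISH_ROLES)


def can_edit_cluster(roles, granted_clusters, cluster_id) -> bool:
    # The global CanEdit policy must pass first; the per-cluster grant layers on top.
    # Assumption: the grant check applies to every editing role, FleetAdmin included.
    return bool(set(roles) & EDIT_ROLES) and cluster_id in granted_clusters
```

For example, a `ConfigEditor` whose grants cover only cluster `A` passes `can_view` everywhere but fails `can_edit_cluster` for cluster `X`.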
---
### ADM-003: Config DB as Sole Write Path
The Admin service shall be the only process with write access to the Config DB.
#### Acceptance Criteria
- EF Core `OtOpcUaConfigDbContext` configured with the SQL login / connection string that has read+write permission on config tables.
- Server nodes connect with a read-only principal (`grant SELECT` only).
- Admin writes produce draft-generation rows; publish writes are atomic and transactional.
- Every write is audited via `AuditLogService` per ADM-006.
---
### ADM-004: Prometheus /metrics Endpoint
The Admin service shall expose an OpenTelemetry → Prometheus metrics endpoint at `/metrics`.
#### Acceptance Criteria
- `OpenTelemetry.Metrics` registered with Prometheus exporter.
- `/metrics` scrapeable without authentication (standard Prometheus pattern) OR gated behind an infrastructure allow-list (site-configurable).
- Exports metrics from Server nodes of managed clusters (aggregated via Config DB heartbeat telemetry) plus Admin-local metrics (login attempts, publish duration, active sessions).
---
### ADM-005: Graceful Shutdown
On shutdown, the Admin service shall disconnect SignalR clients cleanly, finish in-flight DB writes, and flush Serilog.
#### Acceptance Criteria
- `IHostApplicationLifetime.ApplicationStopping` closes SignalR hub connections gracefully.
- In-flight publish transactions are allowed to complete up to 30 seconds.
- Final log entry: `"OtOpcUa.Admin shutdown complete"`.
---
### ADM-006: Audit Logging
Every publish and every ACL / role-grant change shall produce an immutable audit row via `AuditLogService`.
#### Acceptance Criteria
- Audit rows include: timestamp (UTC), acting principal (LDAP DN + display name), action, entity kind + id, before/after generation number where applicable, session id, source IP.
- Audit rows are never mutated or deleted by application code.
- Audit table schema enforces immutability via DB permissions (no UPDATE / DELETE granted to the Admin app's principal).
---
## OtOpcUa.Galaxy.Host — Service Host Requirements (GHX-*)
### GHX-001: TopShelf Windows Service Hosting
The Galaxy Host shall use TopShelf for Windows service lifecycle (install, uninstall, start, stop) and interactive console mode.
#### Acceptance Criteria
- Service name `OtOpcUaGalaxyHost`, display name `OtOpcUa Galaxy Host`.
- Installs via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe install`.
- Uninstalls via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe uninstall`.
- Runs as a configured user account (typically the same account as the Server, or a dedicated Galaxy service account with ArchestrA platform access).
- Interactive console mode (no args) for development / debugging.
- Platform target: **.NET Framework 4.8 x86** — required for MXAccess COM 32-bit interop.
- Development deployments may use NSSM in place of TopShelf (memory: `project_galaxy_host_installed`).
### Details
- Service description: "OtOpcUa Galaxy Host — MXAccess + Galaxy Repository backend for the Galaxy driver, named-pipe IPC to OtOpcUa.Server."
---
### GHX-002: Named-Pipe IPC Bootstrap
The Host shall open a named pipe on startup whose name, ACL, and shared secret come from environment variables supplied by the supervisor at spawn time.
#### Acceptance Criteria
- `OTOPCUA_GALAXY_PIPE` → pipe name (default `OtOpcUaGalaxy`).
- `OTOPCUA_ALLOWED_SID` → SID of the principal allowed to connect; any other principal is denied at the ACL layer.
- `OTOPCUA_GALAXY_SECRET` → per-process shared secret; `Driver.Galaxy.Proxy` must present it on handshake.
- `OTOPCUA_GALAXY_BACKEND` → `stub` / `db` / `mxaccess` (default `mxaccess`) — selects which backend implementation is loaded.
- Missing `OTOPCUA_ALLOWED_SID` or `OTOPCUA_GALAXY_SECRET` at startup throws with a descriptive error.
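
A minimal sketch of how the environment variables above might be resolved follows. The variable names and defaults are from the criteria; the `PipeBootstrap` type, the function name, and the rejection of an unrecognized backend value are assumptions, since the source does not say how an invalid backend is handled.

```python
from dataclasses import dataclass

VALID_BACKENDS = ("stub", "db", "mxaccess")


@dataclass(frozen=True)
class PipeBootstrap:
    pipe_name: str
    allowed_sid: str
    secret: str
    backend: str


def resolve_pipe_bootstrap(env) -> PipeBootstrap:
    """Resolve GHX-002 settings from a mapping such as os.environ."""
    # The SID and the shared secret have no defaults; fail fast with a clear message.
    missing = [k for k in ("OTOPCUA_ALLOWED_SID", "OTOPCUA_GALAXY_SECRET") if not env.get(k)]
    if missing:
        raise RuntimeError("Missing required environment variable(s): " + ", ".join(missing))
    backend = env.get("OTOPCUA_GALAXY_BACKEND", "mxaccess")
    if backend not in VALID_BACKENDS:  # assumption: unknown values are rejected
        raise RuntimeError(f"Unknown OTOPCUA_GALAXY_BACKEND: {backend}")
    return PipeBootstrap(
        pipe_name=env.get("OTOPCUA_GALAXY_PIPE", "OtOpcUaGalaxy"),
        allowed_sid=env["OTOPCUA_ALLOWED_SID"],
        secret=env["OTOPCUA_GALAXY_SECRET"],
        backend=backend,
    )
```

Taking the environment as a parameter rather than reading `os.environ` directly keeps the resolution logic testable without spawning the Host.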
---
### GHX-003: Backend Lifecycle
The Host shall instantiate the STA pump + MXAccess backend + Galaxy Repository + optional Historian plugin in a defined order and tear them down cleanly on shutdown.
#### Acceptance Criteria
- Startup (mxaccess backend): initialize Serilog → resolve env vars → create `PipeServer` → start `StaPump` → create `MxAccessClient` on STA thread → initialize `GalaxyRepository` → optionally initialize Historian plugin → begin pipe request handling.
- Shutdown: stop pipe → dispose MxAccessClient (MXA-007 COM cleanup) → dispose STA pump → flush Serilog.
- Shutdown must complete within 30 seconds (Windows SCM timeout).
- `Console.CancelKeyPress` triggers the same sequence in console mode.
---
### GHX-004: Unhandled Exception Handling
The Host shall log Fatal on crash and let the supervisor restart it.
#### Acceptance Criteria
- `AppDomain.CurrentDomain.UnhandledException` handler logs Fatal with full exception details before termination.
- The supervisor's driver-stability policy (`docs/v2/driver-stability.md`) governs restart behavior — backoff, crash-loop detection, and alerting live there, not in the Host.
- Server-side: `Driver.Galaxy.Proxy` detects pipe disconnect, opens its capability circuit, reports Bad quality on Galaxy nodes; reconnects automatically when the Host is back.