lmxopcua/docs/reqs/ServiceHostReqs.md
Joseph Doherty 48970af416 Doc refresh (task #205) — requirements updated for multi-driver OtOpcUa three-process deploy
Per-file summary:

- docs/reqs/OpcUaServerReqs.md — rewritten driver-agnostic. OPC-001..OPC-013 re-scoped to multi-driver address-space composition + capability dispatch; OPC-014 AuthorizationGate + permission trie; OPC-015 dynamic ServiceLevel via RedundancyCoordinator; OPC-017 surgical generation-apply rebuild; OPC-012 capability dispatch via CapabilityInvoker (decision #143 idempotence-aware retry); OPC-013 per-host Polly isolation (decision #144); OPC-019 OpenTelemetry metrics. Transport-security profile matrix (OPC-010) + UserName/LDAP (OPC-011) preserved.

- docs/reqs/GalaxyRepositoryReqs.md — scope clarified as Galaxy-driver-only (not platform). GR-001..GR-004 tied to ITagDiscovery.DiscoverAsync + IRediscoverable; all SQL runs inside OtOpcUa.Galaxy.Host and streams to Proxy via named pipe. GR-008 capability wrapping via CapabilityInvoker added. Cross-links to docs/v2/driver-specs.md + docs/GalaxyRepository.md.

- docs/reqs/MxAccessClientReqs.md — scope clarified as Galaxy-Host-only. MXA-001..MXA-009 preserved (STA pump, register/unregister, subscription refcount, auto-reconnect, probe, COM cleanup, operation metrics, error translation). MXA-010 Proxy-side capability wrapping + MXA-011 pipe ACL + per-process shared secret (OTOPCUA_ALLOWED_SID / OTOPCUA_GALAXY_SECRET) added.

- docs/reqs/ServiceHostReqs.md — rewritten for three-process deployment. Shared section (SVC-SHARED-001/002) for Serilog + bootstrap-only appsettings. SRV-* for OtOpcUa.Server (net10 x64, Microsoft.Extensions.Hosting + AddWindowsService, in-process driver hosting, redundancy-node bootstrap). ADM-* for OtOpcUa.Admin (Blazor Server, cookie+LDAP auth, CanEdit/CanPublish policies, sole DB writer, Prometheus /metrics, audit logging). GHX-* for OtOpcUa.Galaxy.Host (TopShelf, net48 x86, named-pipe IPC bootstrap, STA backend lifecycle, crash handling tied to supervisor).

- docs/reqs/ClientRequirements.md — restructured as numbered, verifiable requirements. SHR-* for Client.Shared (single IOpcUaClientService, ConnectionSettings, failover, cross-platform certs, type-coercing write, UI-thread neutrality). CLI-001..CLI-011 cover connect/read/write/browse/subscribe/historyread/alarms/redundancy. UI-001..UI-008 cover connection panel, tree browser, each tab, connection-state reflection, cross-platform build. Reference design content (IOpcUaClientService shape, models, view-model map, mock layout) preserved.

- docs/reqs/StatusDashboardReqs.md — retired cleanly. Replaced with a pointer to docs/v2/admin-ui.md + HLR-015 / HLR-016 / HLR-017 / ADM-*. Mapping table shows each retired DASH-001..DASH-009 requirement's replacement (live cluster-node view via SignalR, Prometheus metrics, driver-instance detail views, etc.). Note that a formal AdminUiReqs.md can be written later if needed for cert compliance.

HighLevelReqs.md was already at the target shape (HLR-001..HLR-018 with Revision header noting retired HLR-009) as of commit f217636; verified identical and no additional edit required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:31:58 -04:00

Service Host — Component Requirements

Revision — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). v1 was a single Windows service; v2 ships three cooperating Windows services and the service-host requirements are rewritten per-process. SVC-001…SVC-006 from v1 are preserved in spirit (TopShelf, Serilog, config loading, graceful shutdown, startup sequence, unhandled-exception handling) but are now scoped to the process they apply to. SRV-* prefixes the Server process, ADM-* the Admin process, GHX-* the Galaxy Host process. A shared-requirements section at the top covers cross-process concerns (Serilog, logging rotation, bootstrap config scope).

Parent: HLR-007, HLR-008, HLR-011

Shared Requirements (all three processes)

SVC-SHARED-001: Serilog Logging

Every process shall use Serilog with a rolling daily file sink at Information level minimum, a console sink, and an opt-in CompactJsonFormatter file sink.

Acceptance Criteria

  • Console sink active on every process (for interactive / debug mode).
  • Rolling daily file sink:
    • Server: logs/otopcua-YYYYMMDD.log
    • Admin: logs/otopcua-admin-YYYYMMDD.log
    • Galaxy Host: %ProgramData%\OtOpcUa\galaxy-host-YYYYMMDD.log
  • Retention count and min level configurable via Serilog:* in each process's appsettings.json.
  • JSON sink opt-in via Serilog:WriteJson = true (emits *.json.log alongside the plain-text file) for SIEM ingestion.
  • Log.CloseAndFlush() invoked in a finally block on shutdown.
  • Structured logging (Serilog message templates) — no string.Format.

SVC-SHARED-002: Bootstrap Configuration Scope

appsettings.json is bootstrap-only per HLR-011. Operational configuration (clusters, drivers, namespaces, tags, ACLs, poll groups) lives in the Config DB.

Acceptance Criteria

  • appsettings.json may contain only: Config DB connection string, Node:NodeId, Node:ClusterId, Node:LocalCachePath, OpcUa:* security bootstrap fields, Ldap:* bootstrap fields, Serilog:*, Redundancy:* role id.
  • Any attempt to configure driver instances, tags, or equipment through appsettings.json shall be rejected at startup with a descriptive error.
  • Invalid or missing required bootstrap fields are detected at startup with a clear error ("Node:NodeId not configured" style).
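
The bootstrap-scope rule can be sketched as a small validator. This is an illustration in Python, not the production .NET code: settings are assumed to be flattened into colon-separated keys, and the `ConfigDb:ConnectionString` key name is hypothetical because this document does not name the key that holds the Config DB connection string.

```python
# Sketch only. Key names marked "assumed" do not come from this document.
ALLOWED_PREFIXES = (
    "ConfigDb:",  # assumed key prefix for the Config DB connection string
    "Node:NodeId", "Node:ClusterId", "Node:LocalCachePath",
    "OpcUa:", "Ldap:", "Serilog:", "Redundancy:",
)
REQUIRED_KEYS = ("ConfigDb:ConnectionString", "Node:NodeId", "Node:ClusterId")


def validate_bootstrap(settings: dict[str, str]) -> list[str]:
    """Return startup errors; an empty list means the bootstrap config is acceptable."""
    errors = []
    # Missing or blank required fields produce a "<key> not configured" message.
    for key in REQUIRED_KEYS:
        if not settings.get(key, "").strip():
            errors.append(f"{key} not configured")
    # Anything outside the bootstrap allow-list is operational configuration.
    for key in sorted(settings):
        if not key.startswith(ALLOWED_PREFIXES):
            errors.append(f"{key} is operational configuration and belongs in the Config DB")
    return errors
```

A host following this sketch would refuse to start when the returned list is non-empty and would log each entry at Error level.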

OtOpcUa.Server — Service Host Requirements (SRV-*)

SRV-001: Microsoft.Extensions.Hosting + AddWindowsService

The Server shall use Host.CreateApplicationBuilder(args) with AddWindowsService(o => o.ServiceName = "OtOpcUa") to run as a Windows service.

Acceptance Criteria

  • Service name OtOpcUa.
  • Installs via standard sc.exe tooling or the build-provided installer.
  • Runs as a configured service account (typically a domain service account with Config DB read access; Windows Auth to SQL Server).
  • Console mode (running ZB.MOM.WW.OtOpcUa.Server.exe with no Windows service context) works for development and debugging.
  • Platform target: .NET 10 x64 (default per decision in plan.md §3).

SRV-002: Startup Sequence

The Server shall start components in a defined order, with failure handling at each step.

Acceptance Criteria

  • Startup sequence:
    1. Load appsettings.json bootstrap configuration + initialize Serilog.
    2. Validate bootstrap fields (NodeId, ClusterId, Config DB connection).
    3. Initialize OpcUaApplicationHost (server-certificate resolution via SecurityProfileResolver).
    4. Connect to Config DB; request current published generation for ClusterId.
    5. If the Config DB is unreachable, fall back to LiteDbConfigCache (latest applied generation).
    6. Apply generation: register driver instances, build namespaces, wire capability pipelines.
    7. Start OpcUaServerService hosted service (opens endpoint listener).
    8. Start HostStatusPublisher (pushes ClusterNodeGenerationState to Config DB for Admin UI SignalR consumers).
    9. Start RedundancyCoordinator + ServiceLevelCalculator.
  • Failure in steps 1-3 prevents startup.
  • Failure in steps 4-6 logs Error and enters degraded mode (empty namespaces, DriverHealth.Unavailable on every driver, ServiceLevel = 0).
  • Failure in steps 7-9 logs Error and shuts down (endpoint is non-optional).
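
The failure policy above can be sketched as a mapping from step number to outcome. This is a Python illustration, not the Server's .NET code, and it rests on two assumptions that this document does not state: a step raises only when it has failed for good (an unreachable Config DB in step 4 is not a failure by itself, because step 5 covers it), and once the Server is degraded the remaining configuration steps are skipped while steps 7-9 still run.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    RUNNING = "running"
    ABORTED = "aborted"      # steps 1-3: startup is prevented
    DEGRADED = "degraded"    # steps 4-6: empty namespaces, ServiceLevel = 0
    SHUT_DOWN = "shut_down"  # steps 7-9: the endpoint is non-optional


def on_failure(step: int) -> Outcome:
    """Map a failing step number (1-9) to the outcome required by SRV-002."""
    if 1 <= step <= 3:
        return Outcome.ABORTED
    if 4 <= step <= 6:
        return Outcome.DEGRADED
    if 7 <= step <= 9:
        return Outcome.SHUT_DOWN
    raise ValueError(f"unknown startup step {step}")


def run_startup(steps: list[Callable[[], None]]) -> Outcome:
    """Run the steps in order; steps[i] is step i+1 and raises on failure."""
    outcome = Outcome.RUNNING
    for number, step in enumerate(steps, start=1):
        if outcome is Outcome.DEGRADED and 4 <= number <= 6:
            continue  # assumption: skip remaining config steps once degraded
        try:
            step()
        except Exception:
            outcome = on_failure(number)
            if outcome is not Outcome.DEGRADED:
                return outcome  # aborted or shut down: stop immediately
    return outcome
```

Keeping the policy in one function makes the three failure bands easy to test without starting any real component.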

SRV-003: Graceful Shutdown

On service stop, the Server shall gracefully shut down all driver instances, the OPC UA listener, and flush logs before exiting.

Acceptance Criteria

  • IHostApplicationLifetime.ApplicationStopping triggers orderly shutdown.
  • Shutdown sequence: stop HostStatusPublisher → stop driver instances (disconnect each via IDriver.DisposeAsync, which for Galaxy tears down the named pipe) → stop OPC UA server (stop accepting new sessions, complete pending reads/writes) → flush Serilog.
  • Shutdown completes within 30 seconds (Windows SCM timeout).
  • All IDisposable / IAsyncDisposable components disposed in reverse-creation order.
  • Final log entry: "OtOpcUa.Server shutdown complete" at Information level.

SRV-004: Unhandled Exception Handling

The Server shall log unhandled exceptions at Fatal level before the process terminates and rely on Windows service recovery to restart it.

Acceptance Criteria

  • Registers AppDomain.CurrentDomain.UnhandledException handler that logs Fatal before the process terminates.
  • Windows service recovery configured: restart on failure with 60-second delay.
  • Fatal log entry includes full exception details.

SRV-005: Drivers Hosted In-Process

All drivers except Galaxy run in-process within the Server.

Acceptance Criteria

  • Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client drivers are resolved from the DI container and managed by DriverHost.
  • The Galaxy driver's in-process component is Driver.Galaxy.Proxy, which forwards to OtOpcUa.Galaxy.Host over the named pipe (see GHX-*).
  • Each driver instance's lifecycle (connect, discover, subscribe, dispose) is orchestrated by DriverHost.

SRV-006: Redundancy-Node Bootstrap

The Server shall bootstrap its redundancy identity from appsettings.json and the Config DB.

Acceptance Criteria

  • Node:NodeId + Node:ClusterId identify this node uniquely; the Redundancy coordinator looks up ClusterNode.RedundancyRole (Primary / Secondary) from the Config DB.
  • Two nodes of the same cluster connect to the same Config DB and the same ClusterId but have different NodeIds and different ApplicationUri values.
  • Missing or ambiguous (ClusterId, NodeId) causes startup failure.
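
The identity lookup can be sketched as follows. This is a Python illustration rather than the Redundancy coordinator's actual code; representing ClusterNode rows as dictionaries is an assumption made for the example, while the column names come from the criteria above.

```python
def resolve_redundancy_role(cluster_nodes: list[dict], cluster_id: str, node_id: str) -> str:
    """Return the RedundancyRole for (cluster_id, node_id) or raise to fail startup."""
    matches = [
        n for n in cluster_nodes
        if n["ClusterId"] == cluster_id and n["NodeId"] == node_id
    ]
    if not matches:
        raise LookupError(f"No ClusterNode row for ({cluster_id}, {node_id})")
    if len(matches) > 1:
        raise LookupError(f"Ambiguous ClusterNode rows for ({cluster_id}, {node_id})")
    role = matches[0]["RedundancyRole"]
    if role not in ("Primary", "Secondary"):
        raise ValueError(f"Unexpected RedundancyRole {role!r}")
    return role
```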

OtOpcUa.Admin — Service Host Requirements (ADM-*)

ADM-001: ASP.NET Core Blazor Server

The Admin app shall use WebApplication.CreateBuilder with Razor Components (AddRazorComponents().AddInteractiveServerComponents()), SignalR, and cookie authentication.

Acceptance Criteria

  • Blazor Server (not WebAssembly) per plan.md §Tech Stack.
  • Hosts SignalR hubs for live cluster state (used by ClusterNodeGenerationState views, crash-loop alerts, etc.).
  • Runs as a Windows service via AddWindowsService OR as a standard ASP.NET Core process behind IIS / reverse proxy (site decides).
  • Platform target: .NET 10 x64.

ADM-002: Authentication and Authorization

Admin users authenticate via LDAP bind with cookie auth; three admin roles gate operations.

Acceptance Criteria

  • Cookie auth scheme: OtOpcUa.Admin, 8-hour expiry, path /login for challenge.
  • LDAP bind via LdapAuthService; user group memberships map to admin roles (ConfigViewer, ConfigEditor, FleetAdmin).
  • Authorization policies:
    • CanEdit requires ConfigEditor or FleetAdmin.
    • CanPublish requires FleetAdmin.
    • View-only access requires ConfigViewer (or higher).
  • Unauthenticated requests to any Admin page redirect to /login.
  • Per-cluster role grants layer on top: a ConfigEditor with no grant for cluster X can view it but not edit it.
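
The role-to-policy mapping can be sketched as set checks. This is a Python illustration, not the Admin app's ASP.NET Core policy code. One detail is an assumption: this document does not say whether FleetAdmin bypasses per-cluster grants, so the sketch requires a grant for every role.

```python
VIEW_ROLES = {"ConfigViewer", "ConfigEditor", "FleetAdmin"}
EDIT_ROLES = {"ConfigEditor", "FleetAdmin"}


def can_view(roles: set[str]) -> bool:
    """View-only access requires ConfigViewer or higher."""
    return bool(roles & VIEW_ROLES)


def can_edit(roles: set[str]) -> bool:
    """CanEdit requires ConfigEditor or FleetAdmin."""
    return bool(roles & EDIT_ROLES)


def can_publish(roles: set[str]) -> bool:
    """CanPublish requires FleetAdmin."""
    return "FleetAdmin" in roles


def can_edit_cluster(roles: set[str], granted_clusters: set[str], cluster_id: str) -> bool:
    """Per-cluster grants layer on top of the global CanEdit policy.

    Assumption: the grant is required regardless of role.
    """
    return can_edit(roles) and cluster_id in granted_clusters
```

With these checks, a ConfigEditor holding no grant for cluster X passes `can_view` but fails `can_edit_cluster`, which matches the last criterion above.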

ADM-003: Config DB as Sole Write Path

The Admin service shall be the only process with write access to the Config DB.

Acceptance Criteria

  • EF Core OtOpcUaConfigDbContext configured with the SQL login / connection string that has read+write permission on config tables.
  • Server nodes connect with a read-only principal (grant SELECT only).
  • Admin writes produce draft-generation rows; publish writes are atomic and transactional.
  • Every write is audited via AuditLogService per ADM-006.

ADM-004: Prometheus /metrics Endpoint

The Admin service shall expose an OpenTelemetry → Prometheus metrics endpoint at /metrics.

Acceptance Criteria

  • OpenTelemetry.Metrics registered with Prometheus exporter.
  • /metrics scrapeable without authentication (standard Prometheus pattern) OR gated behind an infrastructure allow-list (site-configurable).
  • Exports metrics from Server nodes of managed clusters (aggregated via Config DB heartbeat telemetry) plus Admin-local metrics (login attempts, publish duration, active sessions).

ADM-005: Graceful Shutdown

On shutdown, the Admin service shall disconnect SignalR clients cleanly, finish in-flight DB writes, and flush Serilog.

Acceptance Criteria

  • IHostApplicationLifetime.ApplicationStopping closes SignalR hub connections gracefully.
  • In-flight publish transactions are allowed up to 30 seconds to complete.
  • Final log entry: "OtOpcUa.Admin shutdown complete".

ADM-006: Audit Logging

Every publish and every ACL / role-grant change shall produce an immutable audit row via AuditLogService.

Acceptance Criteria

  • Audit rows include: timestamp (UTC), acting principal (LDAP DN + display name), action, entity kind + id, before/after generation number where applicable, session id, source IP.
  • Audit rows are never mutated or deleted by application code.
  • Audit table schema enforces immutability via DB permissions (no UPDATE / DELETE granted to the Admin app's principal).
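
The audit row's field set can be sketched as an immutable record. This is a Python illustration of the shape only; the field names are hypothetical renderings of the list above, and the real immutability guarantee comes from the DB permissions, not from application types.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class AuditRow:
    timestamp_utc: datetime
    principal_dn: str            # LDAP DN of the acting principal
    principal_display_name: str
    action: str
    entity_kind: str
    entity_id: str
    session_id: str
    source_ip: str
    generation_before: Optional[int] = None  # only where applicable
    generation_after: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject naive or non-UTC timestamps so every row is comparable.
        if self.timestamp_utc.utcoffset() != timedelta(0):
            raise ValueError("timestamp_utc must be timezone-aware UTC")
```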

OtOpcUa.Galaxy.Host — Service Host Requirements (GHX-*)

GHX-001: TopShelf Windows Service Hosting

The Galaxy Host shall use TopShelf for Windows service lifecycle (install, uninstall, start, stop) and interactive console mode.

Acceptance Criteria

  • Service name OtOpcUaGalaxyHost, display name OtOpcUa Galaxy Host.
  • Installs via ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe install.
  • Uninstalls via ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe uninstall.
  • Runs as a configured user account (typically the same account as the Server, or a dedicated Galaxy service account with ArchestrA platform access).
  • Interactive console mode (no args) for development / debugging.
  • Platform target: .NET Framework 4.8 x86 — required for MXAccess COM 32-bit interop.
  • Development deployments may use NSSM in place of TopShelf (memory: project_galaxy_host_installed).

Details

  • Service description: "OtOpcUa Galaxy Host — MXAccess + Galaxy Repository backend for the Galaxy driver, named-pipe IPC to OtOpcUa.Server."

GHX-002: Named-Pipe IPC Bootstrap

The Host shall open a named pipe on startup whose name, ACL, and shared secret come from environment variables supplied by the supervisor at spawn time.

Acceptance Criteria

  • OTOPCUA_GALAXY_PIPE → pipe name (default OtOpcUaGalaxy).
  • OTOPCUA_ALLOWED_SID → SID of the principal allowed to connect; any other principal is denied at the ACL layer.
  • OTOPCUA_GALAXY_SECRET → per-process shared secret; Driver.Galaxy.Proxy must present it on handshake.
  • OTOPCUA_GALAXY_BACKEND → stub / db / mxaccess (default mxaccess); selects which backend implementation is loaded.
  • Missing OTOPCUA_ALLOWED_SID or OTOPCUA_GALAXY_SECRET at startup throws with a descriptive error.
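
The environment-variable contract can be sketched as a single resolver. This is a Python illustration, not the Host's .NET Framework code. The variable names and defaults come from the criteria above; the `HostSettings` type, the rejection of unknown backend names, and the constant-time secret comparison are assumptions added for the example.

```python
import hmac
from dataclasses import dataclass
from typing import Mapping

BACKENDS = ("stub", "db", "mxaccess")


@dataclass(frozen=True)
class HostSettings:  # hypothetical container type
    pipe_name: str
    allowed_sid: str
    secret: str
    backend: str


def resolve_host_settings(env: Mapping[str, str]) -> HostSettings:
    """Read the supervisor-supplied variables, applying defaults and required checks."""
    required = ("OTOPCUA_ALLOWED_SID", "OTOPCUA_GALAXY_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing required environment variable(s): " + ", ".join(missing))
    backend = env.get("OTOPCUA_GALAXY_BACKEND") or "mxaccess"
    if backend not in BACKENDS:
        # Assumption: an unknown backend name is a startup error.
        raise ValueError(f"Unknown OTOPCUA_GALAXY_BACKEND {backend!r}")
    return HostSettings(
        pipe_name=env.get("OTOPCUA_GALAXY_PIPE") or "OtOpcUaGalaxy",
        allowed_sid=env["OTOPCUA_ALLOWED_SID"],
        secret=env["OTOPCUA_GALAXY_SECRET"],
        backend=backend,
    )


def secret_matches(presented: str, expected: str) -> bool:
    """Compare handshake secrets in constant time (an assumed hardening step)."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

In a real process the mapping passed in would be the process environment, for example `resolve_host_settings(os.environ)`.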

GHX-003: Backend Lifecycle

The Host shall instantiate the STA pump + MXAccess backend + Galaxy Repository + optional Historian plugin in a defined order and tear them down cleanly on shutdown.

Acceptance Criteria

  • Startup (mxaccess backend): initialize Serilog → resolve env vars → create PipeServer → start StaPump → create MxAccessClient on STA thread → initialize GalaxyRepository → optionally initialize Historian plugin → begin pipe request handling.
  • Shutdown: stop pipe → dispose MxAccessClient (MXA-007 COM cleanup) → dispose STA pump → flush Serilog.
  • Shutdown must complete within 30 seconds (Windows SCM timeout).
  • Console.CancelKeyPress triggers the same sequence in console mode.

GHX-004: Unhandled Exception Handling

The Host shall log Fatal on crash and let the supervisor restart it.

Acceptance Criteria

  • AppDomain.CurrentDomain.UnhandledException handler logs Fatal with full exception details before termination.
  • The supervisor's driver-stability policy (docs/v2/driver-stability.md) governs restart behavior — backoff, crash-loop detection, and alerting live there, not in the Host.
  • Server-side: Driver.Galaxy.Proxy detects the pipe disconnect, opens its capability circuit, reports Bad quality on Galaxy nodes, and reconnects automatically once the Host is back.