# Service Host — Component Requirements

> **Revision:** Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). v1 was a single Windows service; v2 ships **three cooperating Windows services** and the service-host requirements are rewritten per-process. SVC-001…SVC-006 from v1 are preserved in spirit (TopShelf, Serilog, config loading, graceful shutdown, startup sequence, unhandled-exception handling) but are now scoped to the process they apply to. SRV-* prefixes the Server process, ADM-* the Admin process, GHX-* the Galaxy Host process. A shared-requirements section at the top covers cross-process concerns (Serilog, log rotation, bootstrap config scope).

Parent: [HLR-007](HighLevelReqs.md#hlr-007-service-hosting), [HLR-008](HighLevelReqs.md#hlr-008-logging), [HLR-011](HighLevelReqs.md#hlr-011-config-db-and-draft-publish)

## Shared Requirements (all three processes)

### SVC-SHARED-001: Serilog Logging

Every process shall use Serilog with a rolling daily file sink at Information level minimum, plus a console sink, plus an opt-in `CompactJsonFormatter` file sink.

#### Acceptance Criteria

- Console sink active on every process (for interactive / debug mode).
- Rolling daily file sink:
  - Server: `logs/otopcua-YYYYMMDD.log`
  - Admin: `logs/otopcua-admin-YYYYMMDD.log`
  - Galaxy Host: `%ProgramData%\OtOpcUa\galaxy-host-YYYYMMDD.log`
- Retention count and min level configurable via `Serilog:*` in each process's `appsettings.json`.
- JSON sink opt-in via `Serilog:WriteJson = true` (emits `*.json.log` alongside the plain-text file) for SIEM ingestion.
- `Log.CloseAndFlush()` invoked in a `finally` block on shutdown.
- Structured logging (Serilog message templates); no `string.Format`.

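As a non-normative illustration of the file-naming scheme above, the following Python sketch builds the expected daily log path for each process. The helper names, the process keys, and the `C:\ProgramData` default are assumptions made for the example; the actual sinks are configured through Serilog inside each .NET process.

```python
from datetime import date
from pathlib import PureWindowsPath

# File-name stems taken from the acceptance criteria; the dictionary keys are
# hypothetical labels for the three processes.
LOG_STEMS = {
    "server": "otopcua",
    "admin": "otopcua-admin",
    "galaxy-host": "galaxy-host",
}


def log_file_name(process: str, day: date) -> str:
    """Return the rolling daily file name, e.g. otopcua-20260419.log."""
    return f"{LOG_STEMS[process]}-{day:%Y%m%d}.log"


def log_file_path(process: str, day: date,
                  program_data: str = r"C:\ProgramData") -> str:
    """Return the expected log path for one process on one day."""
    name = log_file_name(process, day)
    if process == "galaxy-host":
        # Galaxy Host logs under %ProgramData%\OtOpcUa; the default root
        # used here is an assumption for illustration.
        return str(PureWindowsPath(program_data) / "OtOpcUa" / name)
    # Server and Admin write to a relative logs/ directory.
    return f"logs/{name}"
```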
---

### SVC-SHARED-002: Bootstrap Configuration Scope

`appsettings.json` is bootstrap-only per HLR-011. Operational configuration (clusters, drivers, namespaces, tags, ACLs, poll groups) lives in the Config DB.

#### Acceptance Criteria

- `appsettings.json` may contain only: Config DB connection string, `Node:NodeId`, `Node:ClusterId`, `Node:LocalCachePath`, `OpcUa:*` security bootstrap fields, `Ldap:*` bootstrap fields, `Serilog:*`, `Redundancy:*` role id.
- Any attempt to configure driver instances, tags, or equipment through `appsettings.json` shall be rejected at startup with a descriptive error.
- Invalid or missing required bootstrap fields are detected at startup with a clear error (`"Node:NodeId not configured"` style).

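The allow-list rule above can be sketched as a small validator. This is a language-neutral Python illustration, not the Server's implementation: the key name `ConfigDb:ConnectionString` is hypothetical (the requirement only says "Config DB connection string"), and the choice of required keys follows the fields named in SRV-002 step 2. Nested JSON is flattened into colon-separated keys, which is how .NET configuration addresses nested sections.

```python
# Keys allowed in appsettings.json. "ConfigDb:ConnectionString" is a
# hypothetical name chosen for this sketch.
ALLOWED_KEYS = {
    "ConfigDb:ConnectionString",
    "Node:NodeId",
    "Node:ClusterId",
    "Node:LocalCachePath",
}
ALLOWED_PREFIXES = ("OpcUa:", "Ldap:", "Serilog:", "Redundancy:")
REQUIRED_KEYS = ("ConfigDb:ConnectionString", "Node:NodeId", "Node:ClusterId")


def flatten(section: dict, prefix: str = "") -> dict:
    """Flatten nested settings into colon-separated keys."""
    flat = {}
    for key, value in section.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + ":"))
        else:
            flat[path] = value
    return flat


def validate_bootstrap(settings: dict) -> list[str]:
    """Return startup errors; an empty list means the file is acceptable."""
    flat = flatten(settings)
    errors = []
    for key in REQUIRED_KEYS:
        if not flat.get(key):
            errors.append(f"{key} not configured")
    for key in sorted(flat):
        if key not in ALLOWED_KEYS and not key.startswith(ALLOWED_PREFIXES):
            errors.append(
                f"{key} is operational configuration and belongs in the Config DB"
            )
    return errors
```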
---

## OtOpcUa.Server — Service Host Requirements (SRV-*)

### SRV-001: Microsoft.Extensions.Hosting + AddWindowsService

The Server shall use `Host.CreateApplicationBuilder(args)` with `AddWindowsService(o => o.ServiceName = "OtOpcUa")` to run as a Windows service.

#### Acceptance Criteria

- Service name `OtOpcUa`.
- Installs via standard `sc.exe` tooling or the build-provided installer.
- Runs as a configured service account (typically a domain service account with Config DB read access; Windows Auth to SQL Server).
- Console mode (running `ZB.MOM.WW.OtOpcUa.Server.exe` with no Windows service context) works for development and debugging.
- Platform target: .NET 10 x64 (default per decision in `plan.md` §3).

---

### SRV-002: Startup Sequence

The Server shall start components in a defined order, with failure handling at each step.

#### Acceptance Criteria

- Startup sequence:
  1. Load `appsettings.json` bootstrap configuration + initialize Serilog.
  2. Validate bootstrap fields (NodeId, ClusterId, Config DB connection).
  3. Initialize `OpcUaApplicationHost` (server-certificate resolution via `SecurityProfileResolver`).
  4. Connect to Config DB; request current published generation for `ClusterId`.
  5. If unreachable, fall back to `LiteDbConfigCache` (latest applied generation).
  6. Apply generation: register driver instances, build namespaces, wire capability pipelines.
  7. Start `OpcUaServerService` hosted service (opens endpoint listener).
  8. Start `HostStatusPublisher` (pushes `ClusterNodeGenerationState` to Config DB for Admin UI SignalR consumers).
  9. Start `RedundancyCoordinator` + `ServiceLevelCalculator`.
- Failure in steps 1-3 prevents startup.
- Failure in steps 4-6 logs Error and enters degraded mode (empty namespaces, `DriverHealth.Unavailable` on every driver, `ServiceLevel = 0`).
- Failure in steps 7-9 logs Error and shuts down (endpoint is non-optional).

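The three failure bands above amount to a lookup from the failing step number to an outcome. The Python sketch below only illustrates that mapping; the enum and function names are hypothetical, and it makes no claim about how the .NET host sequences or retries the steps.

```python
from enum import Enum


class StartupOutcome(Enum):
    PREVENT_STARTUP = "prevent startup"  # steps 1-3
    DEGRADED_MODE = "degraded mode"      # steps 4-6: empty namespaces, ServiceLevel = 0
    SHUT_DOWN = "shut down"              # steps 7-9: endpoint is non-optional


def outcome_for_failed_step(step: int) -> StartupOutcome:
    """Map the number of the failing startup step to the required reaction."""
    if 1 <= step <= 3:
        return StartupOutcome.PREVENT_STARTUP
    if 4 <= step <= 6:
        return StartupOutcome.DEGRADED_MODE
    if 7 <= step <= 9:
        return StartupOutcome.SHUT_DOWN
    raise ValueError(f"unknown startup step: {step}")
```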
---

### SRV-003: Graceful Shutdown

On service stop, the Server shall gracefully shut down all driver instances, the OPC UA listener, and flush logs before exiting.

#### Acceptance Criteria

- `IHostApplicationLifetime.ApplicationStopping` triggers orderly shutdown.
- Shutdown sequence: stop `HostStatusPublisher` → stop driver instances (disconnect each via `IDriver.DisposeAsync`, which for Galaxy tears down the named pipe) → stop OPC UA server (stop accepting new sessions, complete pending reads/writes) → flush Serilog.
- Shutdown completes within 30 seconds (Windows SCM timeout).
- All `IDisposable` / `IAsyncDisposable` components disposed in reverse-creation order.
- Final log entry: `"OtOpcUa.Server shutdown complete"` at Information level.

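One way to picture the ordered, time-boxed shutdown is a loop over named stop actions with a deadline, followed by a log flush that always runs. The Python sketch below is an assumption-laden illustration: `shut_down`, its parameters, and the injectable `clock` are hypothetical, and skipping the remaining steps once the 30-second budget is spent is one possible policy rather than documented Server behavior.

```python
import time

SHUTDOWN_BUDGET_SECONDS = 30.0  # Windows SCM timeout named in the requirement


def shut_down(stop_steps, flush_logs,
              budget=SHUTDOWN_BUDGET_SECONDS, clock=time.monotonic):
    """Run (name, stop) pairs in order and return the names that completed.

    Steps are skipped once the budget is exhausted (an assumption of this
    sketch); flush_logs always runs last, mirroring the finally-block rule
    in SVC-SHARED-001.
    """
    deadline = clock() + budget
    completed = []
    try:
        for name, stop in stop_steps:
            if clock() >= deadline:
                break  # out of time: leave the remaining steps undone
            stop()
            completed.append(name)
    finally:
        flush_logs()
    return completed
```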
---

### SRV-004: Unhandled Exception Handling

The Server shall handle unexpected crashes gracefully.

#### Acceptance Criteria

- Registers `AppDomain.CurrentDomain.UnhandledException` handler that logs Fatal before the process terminates.
- Windows service recovery configured: restart on failure with 60-second delay.
- Fatal log entry includes full exception details.

---

### SRV-005: Drivers Hosted In-Process

All drivers except Galaxy run in-process within the Server.

#### Acceptance Criteria

- Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client drivers are resolved from the DI container and managed by `DriverHost`.
- Galaxy driver in-process component is `Driver.Galaxy.Proxy`, which forwards to `OtOpcUa.Galaxy.Host` over the named pipe (see GHX-*).
- Each driver instance's lifecycle (connect, discover, subscribe, dispose) is orchestrated by `DriverHost`.

---

### SRV-006: Redundancy-Node Bootstrap

The Server shall bootstrap its redundancy identity from `appsettings.json` and the Config DB.

#### Acceptance Criteria

- `Node:NodeId` + `Node:ClusterId` identify this node uniquely; the `RedundancyCoordinator` looks up `ClusterNode.RedundancyRole` (Primary / Secondary) from the Config DB.
- Two nodes of the same cluster connect to the same Config DB and the same ClusterId but have different NodeIds and different `ApplicationUri` values.
- Missing or ambiguous `(ClusterId, NodeId)` causes startup failure.

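The "missing or ambiguous" rule can be read as: exactly one `ClusterNode` row must match the configured pair. A minimal Python sketch under that reading follows; the dataclass layout, the field names other than the role and `ApplicationUri`, and the `resolve_node` helper are hypothetical and do not describe the Config DB schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterNode:
    # Hypothetical in-memory view of a Config DB ClusterNode row.
    cluster_id: str
    node_id: str
    redundancy_role: str  # "Primary" or "Secondary"
    application_uri: str


def resolve_node(rows, cluster_id: str, node_id: str) -> ClusterNode:
    """Return the single row for (cluster_id, node_id) or fail startup."""
    matches = [
        row for row in rows
        if row.cluster_id == cluster_id and row.node_id == node_id
    ]
    if not matches:
        raise LookupError(f"no ClusterNode row for ({cluster_id}, {node_id})")
    if len(matches) > 1:
        raise LookupError(f"ambiguous ClusterNode rows for ({cluster_id}, {node_id})")
    return matches[0]
```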
---

## OtOpcUa.Admin — Service Host Requirements (ADM-*)

### ADM-001: ASP.NET Core Blazor Server

The Admin app shall use `WebApplication.CreateBuilder` with Razor Components (`AddRazorComponents().AddInteractiveServerComponents()`), SignalR, and cookie authentication.

#### Acceptance Criteria

- Blazor Server (not WebAssembly) per `plan.md` §Tech Stack.
- Hosts SignalR hubs for live cluster state (used by `ClusterNodeGenerationState` views, crash-loop alerts, etc.).
- Runs as a Windows service via `AddWindowsService` OR as a standard ASP.NET Core process behind IIS / reverse proxy (site decides).
- Platform target: .NET 10 x64.

---

### ADM-002: Authentication and Authorization

Admin users authenticate via LDAP bind with cookie auth; three admin roles gate operations.

#### Acceptance Criteria

- Cookie auth scheme: `OtOpcUa.Admin`, 8-hour expiry, path `/login` for challenge.
- LDAP bind via `LdapAuthService`; user group memberships map to admin roles (`ConfigViewer`, `ConfigEditor`, `FleetAdmin`).
- Authorization policies:
  - `CanEdit` requires `ConfigEditor` or `FleetAdmin`.
  - `CanPublish` requires `FleetAdmin`.
  - View-only access requires `ConfigViewer` (or higher).
- Unauthenticated requests to any Admin page redirect to `/login`.
- Per-cluster role grants layer on top: a `ConfigEditor` with no grant for cluster X can view it but not edit.

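The role and grant rules above reduce to a few set checks. The Python sketch below illustrates them with hypothetical function names. One point is an explicit assumption: the requirement does not say whether `FleetAdmin` is exempt from per-cluster grants, so this sketch applies the grant check to every editing role.

```python
ADMIN_ROLES = {"ConfigViewer", "ConfigEditor", "FleetAdmin"}
EDIT_ROLES = {"ConfigEditor", "FleetAdmin"}


def can_view(roles) -> bool:
    """View-only access: ConfigViewer or any higher admin role."""
    return bool(ADMIN_ROLES & set(roles))


def can_edit(roles) -> bool:
    """CanEdit policy: ConfigEditor or FleetAdmin."""
    return bool(EDIT_ROLES & set(roles))


def can_publish(roles) -> bool:
    """CanPublish policy: FleetAdmin only."""
    return "FleetAdmin" in set(roles)


def can_edit_cluster(roles, granted_clusters, cluster_id: str) -> bool:
    """Per-cluster grants layer on top of CanEdit.

    Assumption of this sketch: the grant check applies to every role,
    including FleetAdmin.
    """
    return can_edit(roles) and cluster_id in set(granted_clusters)
```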
---

### ADM-003: Config DB as Sole Write Path

The Admin service shall be the only process with write access to the Config DB.

#### Acceptance Criteria

- EF Core `OtOpcUaConfigDbContext` configured with the SQL login / connection string that has read+write permission on config tables.
- Server nodes connect with a read-only principal (`grant SELECT` only).
- Admin writes produce draft-generation rows; publish writes are atomic and transactional.
- Every write is audited via `AuditLogService` per ADM-006.

---

### ADM-004: Prometheus /metrics Endpoint

The Admin service shall expose an OpenTelemetry → Prometheus metrics endpoint at `/metrics`.

#### Acceptance Criteria

- `OpenTelemetry.Metrics` registered with Prometheus exporter.
- `/metrics` scrapeable without authentication (standard Prometheus pattern) OR gated behind an infrastructure allow-list (site-configurable).
- Exports metrics from Server nodes of managed clusters (aggregated via Config DB heartbeat telemetry) plus Admin-local metrics (login attempts, publish duration, active sessions).

---

### ADM-005: Graceful Shutdown

On shutdown, the Admin service shall disconnect SignalR clients cleanly, finish in-flight DB writes, and flush Serilog.

#### Acceptance Criteria

- `IHostApplicationLifetime.ApplicationStopping` closes SignalR hub connections gracefully.
- In-flight publish transactions are allowed up to 30 seconds to complete.
- Final log entry: `"OtOpcUa.Admin shutdown complete"`.

---

### ADM-006: Audit Logging

Every publish and every ACL / role-grant change shall produce an immutable audit row via `AuditLogService`.

#### Acceptance Criteria

- Audit rows include: timestamp (UTC), acting principal (LDAP DN + display name), action, entity kind + id, before/after generation number where applicable, session id, source IP.
- Audit rows are never mutated or deleted by application code.
- Audit table schema enforces immutability via DB permissions (no UPDATE / DELETE granted to the Admin app's principal).

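The field list above maps naturally onto a write-once record. The Python sketch below models it as a frozen dataclass purely for illustration: all field and function names are hypothetical, and the in-memory immutability shown here only mirrors the "never mutated by application code" rule. The binding enforcement remains the DB permission set described in the last criterion.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditRow:
    # Hypothetical field names covering the items listed in the criteria.
    timestamp_utc: datetime
    principal_dn: str
    principal_display_name: str
    action: str
    entity_kind: str
    entity_id: str
    session_id: str
    source_ip: str
    generation_before: Optional[int] = None  # only where applicable
    generation_after: Optional[int] = None


def new_audit_row(**fields) -> AuditRow:
    """Stamp the row with the current UTC time at creation."""
    return AuditRow(timestamp_utc=datetime.now(timezone.utc), **fields)
```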
---

## OtOpcUa.Galaxy.Host — Service Host Requirements (GHX-*)

### GHX-001: TopShelf Windows Service Hosting

The Galaxy Host shall use TopShelf for Windows service lifecycle (install, uninstall, start, stop) and interactive console mode.

#### Acceptance Criteria

- Service name `OtOpcUaGalaxyHost`, display name `OtOpcUa Galaxy Host`.
- Installs via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe install`.
- Uninstalls via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe uninstall`.
- Runs as a configured user account (typically the same account as the Server, or a dedicated Galaxy service account with ArchestrA platform access).
- Interactive console mode (no args) for development / debugging.
- Platform target: **.NET Framework 4.8 x86**, required for MXAccess COM 32-bit interop.
- Development deployments may use NSSM in place of TopShelf (memory: `project_galaxy_host_installed`).

#### Details

- Service description: "OtOpcUa Galaxy Host — MXAccess + Galaxy Repository backend for the Galaxy driver, named-pipe IPC to OtOpcUa.Server."

---

### GHX-002: Named-Pipe IPC Bootstrap

The Host shall open a named pipe on startup whose name, ACL, and shared secret come from environment variables supplied by the supervisor at spawn time.

#### Acceptance Criteria

- `OTOPCUA_GALAXY_PIPE` → pipe name (default `OtOpcUaGalaxy`).
- `OTOPCUA_ALLOWED_SID` → SID of the principal allowed to connect; any other principal is denied at the ACL layer.
- `OTOPCUA_GALAXY_SECRET` → per-process shared secret; `Driver.Galaxy.Proxy` must present it on handshake.
- `OTOPCUA_GALAXY_BACKEND` → `stub` / `db` / `mxaccess` (default `mxaccess`) — selects which backend implementation is loaded.
- Missing `OTOPCUA_ALLOWED_SID` or `OTOPCUA_GALAXY_SECRET` at startup throws with a descriptive error.

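The environment-variable contract above can be sketched as a small loader with defaults and required-value checks. This Python version is illustrative only: the `PipeSettings` type and helper names are hypothetical, rejecting an unrecognized backend value is an assumption the requirement does not state, and the constant-time comparison is one reasonable way to check a shared secret, not a description of the Host's handshake code.

```python
import hmac
from dataclasses import dataclass

BACKENDS = ("stub", "db", "mxaccess")
REQUIRED = ("OTOPCUA_ALLOWED_SID", "OTOPCUA_GALAXY_SECRET")


@dataclass(frozen=True)
class PipeSettings:
    pipe_name: str
    allowed_sid: str
    secret: str
    backend: str


def load_pipe_settings(env) -> PipeSettings:
    """Read the bootstrap variables from a mapping such as os.environ."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variable(s): " + ", ".join(missing)
        )
    backend = env.get("OTOPCUA_GALAXY_BACKEND", "mxaccess")
    if backend not in BACKENDS:
        # Assumption: unknown backend names are treated as a startup error.
        raise RuntimeError(f"unknown OTOPCUA_GALAXY_BACKEND: {backend}")
    return PipeSettings(
        pipe_name=env.get("OTOPCUA_GALAXY_PIPE", "OtOpcUaGalaxy"),
        allowed_sid=env["OTOPCUA_ALLOWED_SID"],
        secret=env["OTOPCUA_GALAXY_SECRET"],
        backend=backend,
    )


def secret_matches(settings: PipeSettings, presented: str) -> bool:
    """Compare the presented handshake secret in constant time."""
    return hmac.compare_digest(
        settings.secret.encode("utf-8"), presented.encode("utf-8")
    )
```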
---

### GHX-003: Backend Lifecycle

The Host shall instantiate the STA pump + MXAccess backend + Galaxy Repository + optional Historian plugin in a defined order and tear them down cleanly on shutdown.

#### Acceptance Criteria

- Startup (mxaccess backend): initialize Serilog → resolve env vars → create `PipeServer` → start `StaPump` → create `MxAccessClient` on STA thread → initialize `GalaxyRepository` → optionally initialize Historian plugin → begin pipe request handling.
- Shutdown: stop pipe → dispose `MxAccessClient` (MXA-007 COM cleanup) → dispose STA pump → flush Serilog.
- Shutdown must complete within 30 seconds (Windows SCM timeout).
- `Console.CancelKeyPress` triggers the same sequence in console mode.

---

### GHX-004: Unhandled Exception Handling

The Host shall log Fatal on crash and let the supervisor restart it.

#### Acceptance Criteria

- `AppDomain.CurrentDomain.UnhandledException` handler logs Fatal with full exception details before termination.
- The supervisor's driver-stability policy (`docs/v2/driver-stability.md`) governs restart behavior; backoff, crash-loop detection, and alerting live there, not in the Host.
- Server-side: `Driver.Galaxy.Proxy` detects pipe disconnect, opens its capability circuit, reports Bad quality on Galaxy nodes; reconnects automatically when the Host is back.