docs: post-PR-7.2 cleanup — audit + three-track scrub

Audit (three parallel agent passes) found 43 markdown files carrying
stale references to the deleted Galaxy.Host/Proxy/Shared projects
after the v2-mxgw merge. This commit lands the prioritized fixes.

Track 1 — high-traffic in-place rewrites (3 files, ~454 lines deleted)
- README.md (202 → 91 lines): drops .NET 4.8 / x86 / TopShelf install
  text; leads with the multi-driver .NET 10 server identity and points
  at scripts/install/Install-Services.ps1 and the parity rig.
- docs/v2/driver-specs.md §1 Galaxy (~289 → ~66 lines): replaces the
  Tier-C out-of-process spec with a Tier-A in-process description
  matching the current GalaxyDriver code, with the four-section
  GalaxyDriverOptions JSON shape pulled verbatim from
  Config/GalaxyDriverOptions.cs.
- docs/drivers/Galaxy.md (211 → 92 lines): full rewrite around the
  current Browse/Runtime/Health/Config sub-folders.

Track 2 — historical banners (5 files)
- lmx_mxgw.md, lmx_mxgw_impl.md, lmx_backend.md,
  docs/v2/Galaxy.ParityMatrix.md,
  docs/v2/implementation/phase-2-galaxy-out-of-process.md each get a
  " Completed 2026-04-30 — historical record" banner block. lmx_mxgw.md
  also fixes two dead links (`docs/Galaxy.Driver.md` and
  `docs/v2/Galaxy.Driver.md`) → `docs/drivers/Galaxy.md`.

Track 3 — v1 archive sweep (10 git mv + 1 new index + 2 in-place scrubs)
- Moved 10 v1 docs under docs/v1/ preserving subpath structure:
  AlarmTracking, Configuration, DataTypeMapping, HistoricalDataAccess,
  Subscriptions (top-level); drivers/Galaxy-Repository,
  drivers/Galaxy-Test-Fixture; reqs/GalaxyRepositoryReqs,
  reqs/MxAccessClientReqs, reqs/ServiceHostReqs.
- New docs/v1/README.md is the shared archive banner + per-file table.
- docs/README.md repointed to the v1 paths and updated to reflect the
  v2 two-process deploy shape (Server + Admin + optional
  OtOpcUaWonderwareHistorian).
- docs/v2/Galaxy.ParityRig.md got a historical banner + four inline
  scrubs marking the OtOpcUaGalaxyHost service / Driver.Galaxy.Host
  EXE / Driver.Galaxy.ParityTests project as deleted-in-PR-7.2.

The repo's live-reading surface (README + CLAUDE.md + docs/v2/) now
describes only the post-PR-7.2 architecture. v1 docs are preserved as
a labelled archive under docs/v1/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-04-30 08:59:59 -04:00
parent ae7106dfce
commit 006af51768
21 changed files with 322 additions and 671 deletions

View File

@@ -0,0 +1,141 @@
# Galaxy Driver — Galaxy Repository Requirements
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). Scope clarified: this document is **Galaxy-driver-specific**. Galaxy is one of seven drivers in the OtOpcUa platform; the requirements below describe the SQL-side of the Galaxy driver (hierarchy/attribute/change-detection queries against the ZB database) that backs the Galaxy driver's `ITagDiscovery.DiscoverAsync` and `IRediscoverable` implementations. All Galaxy-specific SQL runs inside `OtOpcUa.Galaxy.Host` (.NET 4.8 x86 Windows service); the in-server `Driver.Galaxy.Proxy` calls it over a named pipe. For platform-wide tag discovery requirements see `OpcUaServerReqs.md` OPC-002. For deeper spec see `docs/GalaxyRepository.md` and `docs/v2/driver-specs.md`.
Parent: [HLR-002](HighLevelReqs.md#hlr-002-multi-driver-plug-in-model), [HLR-003](HighLevelReqs.md#hlr-003-address-space-composition-per-namespace), [HLR-006](HighLevelReqs.md#hlr-006-change-detection-and-rediscovery)
Driver scope: Galaxy only. Namespace kind: `SystemPlatform`.
## GR-001: Hierarchy Extraction
The Galaxy driver's `ITagDiscovery.DiscoverAsync` implementation shall query the ZB Galaxy Repository database to extract all deployed objects with their parent-child containment relationships, contained names, and tag names.
### Acceptance Criteria
- Executes `queries/hierarchy.sql` against the ZB database from within `OtOpcUa.Galaxy.Host`.
- Returns a list of objects with: `gobject_id`, `tag_name`, `contained_name`, `browse_name`, `parent_gobject_id`, `is_area`.
- Objects with `parent_gobject_id = 0` become children of the root ZB node inside the `SystemPlatform` namespace.
- Only deployed, non-template objects matching the category filter (areas, engines, user-defined objects, etc.) are returned.
- Query completes within 10 seconds on a typical Galaxy (hundreds of objects). Log Warning if it takes longer.
### Details
- Results are ordered by `parent_gobject_id, tag_name` for deterministic tree building.
- Empty result → Warning logged (Galaxy may have no deployed objects, or the DB connection may be misconfigured).
- Orphan detection: a row referencing a non-existent `parent_gobject_id` (and not 0) is skipped with a Warning.
- Streamed to the core via `IAddressSpaceBuilder.AddFolder` / `AddObject` calls over the Galaxy named pipe; no in-memory full-tree buffering on the Host side.
---
## GR-002: Attribute Extraction
The Galaxy driver shall query user-defined (dynamic) attributes for deployed objects, including data type, array flag, and array dimensions.
### Acceptance Criteria
- Executes `queries/attributes.sql` using the template chain CTE to resolve inherited attributes.
- Returns: `gobject_id`, `tag_name`, `attribute_name`, `full_tag_reference`, `mx_data_type`, `is_array`, `array_dimension`, `security_classification`.
- Attributes starting with `_` are filtered out by the query.
- `array_dimension` is extracted from the `mx_value` hex bytes (positions 13-16, little-endian uint16).
### Details
- CTE recursion depth is limited to 10 levels.
- `mx_data_type` not in the known set (1-8, 13-16) defaults to String.
- `gobject_id` that doesn't match a hierarchy object is skipped (object may not be deployed).
- Each emitted attribute is reported via `DriverAttributeInfo` to the core through `IAddressSpaceBuilder.AddVariable`.
---
## GR-003: Change Detection and IRediscoverable
The Galaxy driver shall implement `IRediscoverable` by polling `galaxy.time_of_last_deploy` on a configurable interval to detect when a new deployment has occurred.
### Acceptance Criteria
- Polls `SELECT time_of_last_deploy FROM galaxy` at a configurable interval (`Galaxy:ChangeDetectionIntervalSeconds`, default 30 seconds).
- Compares the returned timestamp to the last known value stored in memory.
- If different, raises the `IRediscoverable.RediscoveryNeeded` signal so the core re-runs `ITagDiscovery.DiscoverAsync` and surgically rebuilds the Galaxy namespace subtree (per OPC-017).
- First poll after startup always triggers an initial discovery.
- Query failure → Warning logged; no rediscovery triggered; retry at next interval.
### Details
- Polling runs on a background `Task` inside `OtOpcUa.Galaxy.Host`, not on the STA message-pump thread.
- `time_of_last_deploy` is a `datetime` column; compared using exact equality (not a range).
- Signal delivery to the Proxy happens via a server-push message on the Galaxy named pipe.
---
## GR-004: Rediscovery Data Flow
On a deployment change, the Galaxy driver shall re-query hierarchy + attributes and stream the updated structure to the core for surgical namespace rebuild.
### Acceptance Criteria
- On change signal, re-run `GR-001` (hierarchy) and `GR-002` (attributes) queries.
- Stream the new tree to the core via `IAddressSpaceBuilder` over the named pipe.
- Log at Information level: `"Galaxy deployment change detected. Rebuilding. ({ObjectCount} objects, {AttributeCount} attributes)"`.
- Log total rediscovery duration at Information level.
- On re-query failure: Error logged; existing Galaxy subtree is retained.
### Details
- Rediscovery is not atomic from the DB perspective — hierarchy and attributes are two separate queries. Acceptable; Galaxy deployment is an infrequent operation.
- The core owns the diff/surgical apply per OPC-017; the Galaxy driver only streams the new authoritative tree.
---
## GR-005: Connection Configuration
Galaxy DB connection parameters shall be configurable via environment variables passed from the `OtOpcUa.Galaxy.Host` supervisor at spawn time.
### Acceptance Criteria
- Connection string via `OTOPCUA_GALAXY_ZB_CONN` environment variable.
- Default: `Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;` (Windows Auth).
- ADO.NET `SqlConnection` used for queries (.NET Framework 4.8).
- Connection is opened per-query (not kept open). Connection pooling handles efficiency.
- If the initial connection test at startup fails, log Error with the connection string sanitized and continue attempting (change-detection polls keep retrying).
### Details
- Command timeout: `Galaxy:CommandTimeoutSeconds` in Config DB driver JSON (default 30 seconds).
- No ORM. Raw ADO.NET with `SqlCommand` and `SqlDataReader`. SQL text embedded as constants.
---
## GR-006: Query Safety
All Galaxy SQL queries shall be static read-only SELECT statements. No writes to the Galaxy Repository database.
### Acceptance Criteria
- All queries are hardcoded SQL strings with no string concatenation or user-supplied parameters.
- No INSERT, UPDATE, DELETE, or DDL statements are ever executed against the Galaxy database.
- Queries use only SELECT with read-only intent.
---
## GR-007: Startup Validation
On startup, the Galaxy driver's DB component inside `OtOpcUa.Galaxy.Host` shall validate database connectivity.
### Acceptance Criteria
- Execute a simple test query (`SELECT 1`) against the configured Galaxy DB.
- If the database is unreachable, log Error but do not prevent Host startup.
- The Galaxy driver runs in degraded mode (empty SystemPlatform namespace) until the database becomes available and the next change-detection poll succeeds.
- In degraded mode the Galaxy driver instance reports `DriverHealth.Unavailable`, causing its Polly circuit state to be open until the first successful discovery.
---
## GR-008: Capability Wrapping
All calls into the Galaxy DB component from the Proxy side shall route through `CapabilityInvoker.InvokeAsync(DriverCapability.Discover, …)`.
### Acceptance Criteria
- `Driver.Galaxy.Proxy.DiscoverAsync` is a thin capability-invoker call that sends a MessagePack request over the named pipe to the Host's DB component.
- Roslyn analyzer **OTOPCUA0001** validates there are no direct discovery calls bypassing the invoker.
- Polly pipeline for `DriverCapability.Discover` on the Galaxy driver instance carries Timeout + Retry + CircuitBreaker.

View File

@@ -0,0 +1,205 @@
# Galaxy Driver — MXAccess Client Requirements
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). Scope narrowed: this document covers the MXAccess surface **inside `OtOpcUa.Galaxy.Host`** (.NET Framework 4.8 x86 Windows service). The in-server `Driver.Galaxy.Proxy` implements the `IReadable` / `IWritable` / `ISubscribable` / `IAlarmSource` / `IHistoryProvider` capability interfaces and routes every wire call through the named pipe to this Host process. The STA thread + reconnect playback + subscription refcount requirements from v1 are preserved; what changed is where they live (Host service, not the Server process). MXA-010 (proxy-side wrapping) and MXA-011 (pipe ACL / shared secret) are new.
Parent: [HLR-002](HighLevelReqs.md#hlr-002-multi-driver-plug-in-model), [HLR-005](HighLevelReqs.md#hlr-005-live-data-access), [HLR-007](HighLevelReqs.md#hlr-007-service-hosting)
Driver scope: Galaxy only. Process scope: `OtOpcUa.Galaxy.Host` (Host side) and `Driver.Galaxy.Proxy` (server-side forwarder).
## MXA-001: STA Thread with Message Pump
All MXAccess COM objects shall be created and called on a dedicated STA thread running a Win32 message pump to ensure COM callbacks are delivered.
### Acceptance Criteria
- A dedicated thread is created with `ApartmentState.STA` before any MXAccess COM object is instantiated; implementation lives in `StaPump` inside `OtOpcUa.Galaxy.Host`.
- The thread runs a Win32 message pump using `GetMessage` / `TranslateMessage` / `DispatchMessage`.
- Work items are marshalled to the STA thread via `PostThreadMessage(WM_APP)` and a concurrent queue.
- All COM object creation (`LMXProxyServer`), method calls, and event callbacks happen on this thread.
- Thread name `Galaxy.Sta` (for diagnostics).
### Details
- If the STA thread dies unexpectedly, log Fatal and trigger Host service shutdown. The supervisor restarts the Host under its driver-stability policy (`docs/v2/driver-stability.md`). COM objects on the dead thread are unrecoverable; no in-process recovery is attempted.
- `RunAsync(Action)` returns a `Task` that completes when the action executes on the STA thread. Callers can `await` it.
---
## MXA-002: Connection Lifecycle
The Host shall support Register/Unregister lifecycle with the `LMXProxyServer` COM object, tracking the connection handle.
### Acceptance Criteria
- `Register(clientName)` is called on the STA thread and returns a positive connection handle on success.
- Handle ≤ 0 → descriptive error thrown; Host reports `DriverHealth.Unavailable` via the pipe so the Proxy reports Bad quality to the core.
- `Unregister(handle)` is called during disconnect after all subscriptions are removed.
- Client name comes from `OTOPCUA_GALAXY_CLIENT_NAME` environment variable; default `OtOpcUa-Galaxy.Host`. Must be unique per MXAccess registration (a cluster's Primary and Secondary each get their own client-name suffix via node override).
- Connection state transitions: Disconnected → Connecting → Connected → Disconnecting → Disconnected (and Error from any state).
### Details
- `ConnectedSince` (UTC) recorded after successful Register.
- `ReconnectCount` tracked for diagnostics and `/metrics`.
- State changes are emitted over the pipe as `DriverHealth` updates.
---
## MXA-003: Tag Subscription
The Host shall support subscribing to tags via AddItem + AdviseSupervisory, receiving value updates through OnDataChange callbacks.
### Acceptance Criteria
- Subscribe sequence: `AddItem(handle, address)` returns item handle, then `AdviseSupervisory(handle, itemHandle)` starts the subscription.
- `OnDataChange` callback delivers value, quality, timestamp, and MXSTATUS_PROXY array.
- Item address format: `tag_name.AttributeName` for scalars, `tag_name.AttributeName[]` for whole arrays.
- AddItem failure → Warning logged, failure propagated over the pipe to the Proxy.
- Bidirectional maps of `address ↔ itemHandle` maintained for callback resolution.
- Multi-client refcounting: two Proxy-side subscribe calls for the same address produce one MXAccess subscription; refcount decrement on the last unsubscribe triggers `UnAdvise` / `RemoveItem`.
### Details
- `AdviseSupervisory` (not `Advise`) is used because this is a background service without an interactive user session.
- Stored subscriptions dictionary maps address → callback for reconnect replay.
- On reconnect, every entry in stored subscriptions is re-subscribed (AddItem + AdviseSupervisory with new handles).
---
## MXA-004: Tag Read/Write
The Host shall support synchronous-style read and write operations, marshalled to the STA thread, with configurable timeouts.
### Acceptance Criteria
- Read pattern: prefer cached subscription value; fall back to subscribe-get-first-value-unsubscribe (AddItem → AdviseSupervisory → wait for OnDataChange → UnAdvise → RemoveItem).
- Write: AddItem → AdviseSupervisory → `Write()` → await `OnWriteComplete` callback → cleanup.
- Read timeout: `Galaxy:ReadTimeoutSeconds` in driver config (default 5 seconds) — enforced on the Host side in addition to the Proxy-side Polly `Timeout` leg.
- Write timeout: `Galaxy:WriteTimeoutSeconds` (default 5 seconds) — enforced similarly.
- Concurrent operation limit: configurable semaphore (`Galaxy:MaxConcurrentOperations`, default 10).
- All operations marshalled to the STA thread.
### Details
- Write uses security classification `-1` (no security). Galaxy runtime enforces security; OtOpcUa authorization is enforced server-side before the call ever reaches the pipe (per OPC-014 `AuthorizationGate`).
- `OnWriteComplete`: check `MXSTATUS_PROXY.success`. If 0, extract detail code and propagate as an error over the pipe.
- COM exceptions translated to meaningful error messages.
---
## MXA-005: Auto-Reconnect
The Host shall monitor connection health and automatically reconnect on failure, replaying all stored subscriptions after reconnect.
### Acceptance Criteria
- Monitor loop runs on a background thread at `Galaxy:MonitorIntervalSeconds` (default 5 seconds).
- On disconnect, attempt reconnect. On success, replay all stored subscriptions.
- On reconnect failure, log Warning and retry at next interval (no exponential backoff inside the Host; the Proxy-side Polly pipeline handles cross-process backoff against pipe failures).
- Reconnect count is incremented on each successful reconnect.
- Monitor loop is cancellable for clean Host shutdown.
### Details
- Reconnect cleans up old COM objects before creating new ones.
- After reconnect, probe subscription (MXA-006) is re-established first, then stored subscriptions.
- No max retry limit — keep trying indefinitely until the Host service is stopped.
---
## MXA-006: Probe-Based Health Monitoring
The Host shall optionally subscribe to a configurable probe tag and use OnDataChange callback staleness to detect silent connection failures.
### Acceptance Criteria
- Probe tag address configured via `Galaxy:ProbeTag`. If unset, probe monitoring is disabled.
- Track `_lastProbeValueTime` (UTC) updated on each OnDataChange for the probe tag.
- If `DateTime.UtcNow - _lastProbeValueTime > staleThreshold`, force disconnect and reconnect.
- Stale threshold: `Galaxy:ProbeStaleThresholdSeconds` (default 60 seconds).
- Implements `IHostConnectivityProbe` on the Proxy side so the core's `CapabilityInvoker` records probe outcomes with `DriverCapability.Probe` telemetry.
### Details
- The probe tag should be an attribute the Galaxy runtime updates regularly (platform heartbeat, area timestamp). Specific tag is site-dependent.
- After forced reconnect, reset `_lastProbeValueTime` to `DateTime.UtcNow`.
---
## MXA-007: COM Cleanup
On disconnect or disposal, the Host shall unwire event handlers, unadvise/remove all items, unregister, and release COM objects via `Marshal.ReleaseComObject`.
### Acceptance Criteria
- Cleanup order: UnAdvise all active subscriptions → RemoveItem all items → unwire OnDataChange and OnWriteComplete handlers → Unregister → `Marshal.ReleaseComObject`.
- On dispose: run disconnect if still connected, then dispose STA thread.
- Each cleanup step wrapped in try/catch (cleanup must not throw).
- After cleanup: handle maps cleared, pending write TCS entries abandoned, COM reference set to null.
### Details
- Stored subscriptions are NOT cleared on disconnect (preserved for reconnect replay). Only cleared on Dispose.
- Event handlers unwired BEFORE Unregister (else callbacks may fire on a dead object).
- `Marshal.ReleaseComObject` in a `finally` block, always.
---
## MXA-008: Operation Metrics
The MXAccess Host shall record timing and success/failure for Read, Write, and Subscribe operations.
### Acceptance Criteria
- Each operation records duration (ms) + success/failure.
- Metrics exposed over the pipe to the Proxy, which re-publishes them via OpenTelemetry → Prometheus under `DriverInstanceId = "galaxy-*"`, `HostName = "galaxy.host"`.
- Rolling 1000-entry buffer for percentile calculation.
- Uses an `ITimingScope` pattern: `using (var scope = metrics.BeginOperation("read")) { ... }`.
---
## MXA-009: Error Code Translation
The Host shall translate known MXAccess error codes from `MXSTATUS_PROXY.detail` into human-readable messages for logging and OPC UA status propagation.
### Acceptance Criteria
- Error 1008 → "User lacks security permission"
- Error 1012 → "Secured write required (one signature)"
- Error 1013 → "Verified write required (two signatures)"
- Unknown error codes logged with their numeric value.
- Translated messages flow back through the pipe and surface in OPC UA `StatusCode` descriptions and Server logs.
- Errors 1008 / 1012 / 1013 on write operations map to `Bad_UserAccessDenied` at the OPC UA surface.
---
## MXA-010: Proxy-Side Capability Wrapping
`Driver.Galaxy.Proxy` shall implement the capability interfaces as thin forwarders that serialize every call through the named pipe and route every call through `CapabilityInvoker`.
### Acceptance Criteria
- `Driver.Galaxy.Proxy` implements `IDriver` + `IReadable` + `IWritable` + `ISubscribable` + `ITagDiscovery` + `IRediscoverable` + `IAlarmSource` + `IHistoryProvider` + `IHostConnectivityProbe`.
- Each implementation uses `CapabilityInvoker.InvokeAsync(DriverCapability.<...>, …)` — direct pipe calls bypassing the invoker are caught by Roslyn **OTOPCUA0001**.
- Each method serializes a MessagePack request frame, sends over the pipe, awaits the response frame, deserializes, returns.
- Pipe disconnect mid-call → `CapabilityInvoker`'s circuit breaker counts the failure; sustained disconnect opens the circuit and Galaxy nodes surface Bad quality until the pipe reconnects.
- Proxy tolerates Host service restarts — it automatically reconnects and replays subscription setup (parallel to MXA-005 but across the IPC boundary).
---
## MXA-011: Pipe Security
The named pipe between Proxy and Host shall be restricted to the Server's runtime principal via SID-based ACL and authenticated with a per-process shared secret.
### Acceptance Criteria
- Pipe name from `OTOPCUA_GALAXY_PIPE` environment variable; default `OtOpcUaGalaxy`.
- Allowed SID passed as `OTOPCUA_ALLOWED_SID` — only the declared principal (typically the Server service account) can open the pipe; `Administrators` is explicitly NOT granted (per the `project_galaxy_host_installed` memory note).
- Shared secret passed via `OTOPCUA_GALAXY_SECRET` at spawn time; the Proxy must present the matching secret on the opening handshake.
- Secret is process-scoped (regenerated per Host restart) and never persisted to disk or Config DB.
- Pipe ACL denials are logged as Warning with the rejected principal SID.
### Details
- Environment variables are passed by the supervisor launching the Host (`docs/v2/driver-stability.md`).
- Dev-box secret is stored at `.local/galaxy-host-secret.txt` for NSSM-wrapped development runs (memory note: `project_galaxy_host_installed`).

View File

@@ -0,0 +1,265 @@
# Service Host — Component Requirements
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). v1 was a single Windows service; v2 ships **three cooperating Windows services** and the service-host requirements are rewritten per-process. SVC-001…SVC-006 from v1 are preserved in spirit (TopShelf, Serilog, config loading, graceful shutdown, startup sequence, unhandled-exception handling) but are now scoped to the process they apply to. SRV-* prefixes the Server process, ADM-* the Admin process, GHX-* the Galaxy Host process. A shared-requirements section at the top covers cross-process concerns (Serilog, logging rotation, bootstrap config scope).
Parent: [HLR-007](HighLevelReqs.md#hlr-007-service-hosting), [HLR-008](HighLevelReqs.md#hlr-008-logging), [HLR-011](HighLevelReqs.md#hlr-011-config-db-and-draft-publish)
## Shared Requirements (all three processes)
### SVC-SHARED-001: Serilog Logging
Every process shall use Serilog with a rolling daily file sink at Information level minimum, plus a console sink, plus opt-in CompactJsonFormatter file sink.
#### Acceptance Criteria
- Console sink active on every process (for interactive / debug mode).
- Rolling daily file sink:
- Server: `logs/otopcua-YYYYMMDD.log`
- Admin: `logs/otopcua-admin-YYYYMMDD.log`
- Galaxy Host: `%ProgramData%\OtOpcUa\galaxy-host-YYYYMMDD.log`
- Retention count and min level configurable via `Serilog:*` in each process's `appsettings.json`.
- JSON sink opt-in via `Serilog:WriteJson = true` (emits `*.json.log` alongside the plain-text file) for SIEM ingestion.
- `Log.CloseAndFlush()` invoked in a `finally` block on shutdown.
- Structured logging (Serilog message templates) — no `string.Format`.
---
### SVC-SHARED-002: Bootstrap Configuration Scope
`appsettings.json` is bootstrap-only per HLR-011. Operational configuration (clusters, drivers, namespaces, tags, ACLs, poll groups) lives in the Config DB.
#### Acceptance Criteria
- `appsettings.json` may contain only: Config DB connection string, `Node:NodeId`, `Node:ClusterId`, `Node:LocalCachePath`, `OpcUa:*` security bootstrap fields, `Ldap:*` bootstrap fields, `Serilog:*`, `Redundancy:*` role id.
- Any attempt to configure driver instances, tags, or equipment through `appsettings.json` shall be rejected at startup with a descriptive error.
- Invalid or missing required bootstrap fields are detected at startup with a clear error (`"Node:NodeId not configured"` style).
---
## OtOpcUa.Server — Service Host Requirements (SRV-*)
### SRV-001: Microsoft.Extensions.Hosting + AddWindowsService
The Server shall use `Host.CreateApplicationBuilder(args)` with `AddWindowsService(o => o.ServiceName = "OtOpcUa")` to run as a Windows service.
#### Acceptance Criteria
- Service name `OtOpcUa`.
- Installs via standard `sc.exe` tooling or the build-provided installer.
- Runs as a configured service account (typically a domain service account with Config DB read access; Windows Auth to SQL Server).
- Console mode (running `ZB.MOM.WW.OtOpcUa.Server.exe` with no Windows service context) works for development and debugging.
- Platform target: .NET 10 x64 (default per decision in `plan.md` §3).
---
### SRV-002: Startup Sequence
The Server shall start components in a defined order, with failure handling at each step.
#### Acceptance Criteria
- Startup sequence:
1. Load `appsettings.json` bootstrap configuration + initialize Serilog.
2. Validate bootstrap fields (NodeId, ClusterId, Config DB connection).
3. Initialize `OpcUaApplicationHost` (server-certificate resolution via `SecurityProfileResolver`).
4. Connect to Config DB; request current published generation for `ClusterId`.
5. If unreachable, fall back to `LiteDbConfigCache` (latest applied generation).
6. Apply generation: register driver instances, build namespaces, wire capability pipelines.
7. Start `OpcUaServerService` hosted service (opens endpoint listener).
8. Start `HostStatusPublisher` (pushes `ClusterNodeGenerationState` to Config DB for Admin UI SignalR consumers).
9. Start `RedundancyCoordinator` + `ServiceLevelCalculator`.
- Failure in steps 1-3 prevents startup.
- Failure in steps 4-6 logs Error and enters degraded mode (empty namespaces, `DriverHealth.Unavailable` on every driver, `ServiceLevel = 0`).
- Failure in steps 7-9 logs Error and shuts down (endpoint is non-optional).
---
### SRV-003: Graceful Shutdown
On service stop, the Server shall gracefully shut down all driver instances, the OPC UA listener, and flush logs before exiting.
#### Acceptance Criteria
- `IHostApplicationLifetime.ApplicationStopping` triggers orderly shutdown.
- Shutdown sequence: stop `HostStatusPublisher` → stop driver instances (disconnect each via `IDriver.DisposeAsync`, which for Galaxy tears down the named pipe) → stop OPC UA server (stop accepting new sessions, complete pending reads/writes) → flush Serilog.
- Shutdown completes within 30 seconds (Windows SCM timeout).
- All `IDisposable` / `IAsyncDisposable` components disposed in reverse-creation order.
- Final log entry: `"OtOpcUa.Server shutdown complete"` at Information level.
---
### SRV-004: Unhandled Exception Handling
The Server shall handle unexpected crashes gracefully.
#### Acceptance Criteria
- Registers `AppDomain.CurrentDomain.UnhandledException` handler that logs Fatal before the process terminates.
- Windows service recovery configured: restart on failure with 60-second delay.
- Fatal log entry includes full exception details.
---
### SRV-005: Drivers Hosted In-Process
All drivers except Galaxy run in-process within the Server.
#### Acceptance Criteria
- Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client drivers are resolved from the DI container and managed by `DriverHost`.
- Galaxy driver in-process component is `Driver.Galaxy.Proxy`, which forwards to `OtOpcUa.Galaxy.Host` over the named pipe (see GHX-*).
- Each driver instance's lifecycle (connect, discover, subscribe, dispose) is orchestrated by `DriverHost`.
---
### SRV-006: Redundancy-Node Bootstrap
The Server shall bootstrap its redundancy identity from `appsettings.json` and the Config DB.
#### Acceptance Criteria
- `Node:NodeId` + `Node:ClusterId` identify this node uniquely; the `Redundancy` coordinator looks up `ClusterNode.RedundancyRole` (Primary / Secondary) from the Config DB.
- Two nodes of the same cluster connect to the same Config DB and the same ClusterId but have different NodeIds and different `ApplicationUri` values.
- Missing or ambiguous `(ClusterId, NodeId)` causes startup failure.
---
## OtOpcUa.Admin — Service Host Requirements (ADM-*)
### ADM-001: ASP.NET Core Blazor Server
The Admin app shall use `WebApplication.CreateBuilder` with Razor Components (`AddRazorComponents().AddInteractiveServerComponents()`), SignalR, and cookie authentication.
#### Acceptance Criteria
- Blazor Server (not WebAssembly) per `plan.md` §Tech Stack.
- Hosts SignalR hubs for live cluster state (used by `ClusterNodeGenerationState` views, crash-loop alerts, etc.).
- Runs as a Windows service via `AddWindowsService` OR as a standard ASP.NET Core process behind IIS / reverse proxy (site decides).
- Platform target: .NET 10 x64.
---
### ADM-002: Authentication and Authorization
Admin users authenticate via LDAP bind with cookie auth; three admin roles gate operations.
#### Acceptance Criteria
- Cookie auth scheme: `OtOpcUa.Admin`, 8-hour expiry, path `/login` for challenge.
- LDAP bind via `LdapAuthService`; user group memberships map to admin roles (`ConfigViewer`, `ConfigEditor`, `FleetAdmin`).
- Authorization policies:
- `CanEdit` requires `ConfigEditor` or `FleetAdmin`.
- `CanPublish` requires `FleetAdmin`.
- View-only access requires `ConfigViewer` (or higher).
- Unauthenticated requests to any Admin page redirect to `/login`.
- Per-cluster role grants layer on top: a `ConfigEditor` with no grant for cluster X can view it but not edit.
---
### ADM-003: Config DB as Sole Write Path
The Admin service shall be the only process with write access to the Config DB.
#### Acceptance Criteria
- EF Core `OtOpcUaConfigDbContext` configured with the SQL login / connection string that has read+write permission on config tables.
- Server nodes connect with a read-only principal (`grant SELECT` only).
- Admin writes produce draft-generation rows; publish writes are atomic and transactional.
- Every write is audited via `AuditLogService` per ADM-006.
---
### ADM-004: Prometheus /metrics Endpoint
The Admin service shall expose an OpenTelemetry → Prometheus metrics endpoint at `/metrics`.
#### Acceptance Criteria
- `OpenTelemetry.Metrics` registered with Prometheus exporter.
- `/metrics` scrapeable without authentication (standard Prometheus pattern) OR gated behind an infrastructure allow-list (site-configurable).
- Exports metrics from Server nodes of managed clusters (aggregated via Config DB heartbeat telemetry) plus Admin-local metrics (login attempts, publish duration, active sessions).
---
### ADM-005: Graceful Shutdown
On shutdown, the Admin service shall disconnect SignalR clients cleanly, finish in-flight DB writes, and flush Serilog.
#### Acceptance Criteria
- `IHostApplicationLifetime.ApplicationStopping` closes SignalR hub connections gracefully.
- In-flight publish transactions are allowed to complete up to 30 seconds.
- Final log entry: `"OtOpcUa.Admin shutdown complete"`.
---
### ADM-006: Audit Logging
Every publish and every ACL / role-grant change shall produce an immutable audit row via `AuditLogService`.
#### Acceptance Criteria
- Audit rows include: timestamp (UTC), acting principal (LDAP DN + display name), action, entity kind + id, before/after generation number where applicable, session id, source IP.
- Audit rows are never mutated or deleted by application code.
- Audit table schema enforces immutability via DB permissions (no UPDATE / DELETE granted to the Admin app's principal).
---
## OtOpcUa.Galaxy.Host — Service Host Requirements (GHX-*)
### GHX-001: TopShelf Windows Service Hosting
The Galaxy Host shall use TopShelf for Windows service lifecycle (install, uninstall, start, stop) and interactive console mode.
#### Acceptance Criteria
- Service name `OtOpcUaGalaxyHost`, display name `OtOpcUa Galaxy Host`.
- Installs via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe install`.
- Uninstalls via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe uninstall`.
- Runs as a configured user account (typically the same account as the Server, or a dedicated Galaxy service account with ArchestrA platform access).
- Interactive console mode (no args) for development / debugging.
- Platform target: **.NET Framework 4.8 x86** — required for MXAccess COM 32-bit interop.
- Development deployments may use NSSM in place of TopShelf (memory: `project_galaxy_host_installed`).
### Details
- Service description: "OtOpcUa Galaxy Host — MXAccess + Galaxy Repository backend for the Galaxy driver, named-pipe IPC to OtOpcUa.Server."
---
### GHX-002: Named-Pipe IPC Bootstrap
The Host shall open a named pipe on startup whose name, ACL, and shared secret come from environment variables supplied by the supervisor at spawn time.
#### Acceptance Criteria
- `OTOPCUA_GALAXY_PIPE` → pipe name (default `OtOpcUaGalaxy`).
- `OTOPCUA_ALLOWED_SID` → SID of the principal allowed to connect; any other principal is denied at the ACL layer.
- `OTOPCUA_GALAXY_SECRET` → per-process shared secret; `Driver.Galaxy.Proxy` must present it on handshake.
- `OTOPCUA_GALAXY_BACKEND``stub` / `db` / `mxaccess` (default `mxaccess`) — selects which backend implementation is loaded.
- Missing `OTOPCUA_ALLOWED_SID` or `OTOPCUA_GALAXY_SECRET` at startup throws with a descriptive error.
---
### GHX-003: Backend Lifecycle
The Host shall instantiate the STA pump + MXAccess backend + Galaxy Repository + optional Historian plugin in a defined order and tear them down cleanly on shutdown.
#### Acceptance Criteria
- Startup (mxaccess backend): initialize Serilog → resolve env vars → create `PipeServer` → start `StaPump` → create `MxAccessClient` on STA thread → initialize `GalaxyRepository` → optionally initialize Historian plugin → begin pipe request handling.
- Shutdown: stop pipe → dispose MxAccessClient (MXA-007 COM cleanup) → dispose STA pump → flush Serilog.
- Shutdown must complete within 30 seconds (Windows SCM timeout).
- `Console.CancelKeyPress` triggers the same sequence in console mode.
---
### GHX-004: Unhandled Exception Handling
The Host shall log Fatal on crash and let the supervisor restart it.
#### Acceptance Criteria
- `AppDomain.CurrentDomain.UnhandledException` handler logs Fatal with full exception details before termination.
- The supervisor's driver-stability policy (`docs/v2/driver-stability.md`) governs restart behavior — backoff, crash-loop detection, and alerting live there, not in the Host.
- Server-side: `Driver.Galaxy.Proxy` detects pipe disconnect, opens its capability circuit, reports Bad quality on Galaxy nodes; reconnects automatically when the Host is back.