docs: post-PR-7.2 cleanup — audit + three-track scrub

Audit (three parallel agent passes) found 43 markdown files carrying stale references to the deleted Galaxy.Host/Proxy/Shared projects after the v2-mxgw merge. This commit lands the prioritized fixes.

Track 1 — high-traffic in-place rewrites (3 files, ~454 lines deleted)
- README.md (202 → 91 lines): drops .NET 4.8 / x86 / TopShelf install text; leads with the multi-driver .NET 10 server identity and points at scripts/install/Install-Services.ps1 and the parity rig.
- docs/v2/driver-specs.md §1 Galaxy (~289 → ~66 lines): replaces the Tier-C out-of-process spec with a Tier-A in-process description matching the current GalaxyDriver code, with the four-section GalaxyDriverOptions JSON shape pulled verbatim from Config/GalaxyDriverOptions.cs.
- docs/drivers/Galaxy.md (211 → 92 lines): full rewrite around the current Browse/Runtime/Health/Config sub-folders.

Track 2 — historical banners (5 files)
- lmx_mxgw.md, lmx_mxgw_impl.md, lmx_backend.md, docs/v2/Galaxy.ParityMatrix.md, docs/v2/implementation/phase-2-galaxy-out-of-process.md each get a "✅ Completed 2026-04-30 — historical record" banner block. lmx_mxgw.md also fixes two dead links (`docs/Galaxy.Driver.md` and `docs/v2/Galaxy.Driver.md`) → `docs/drivers/Galaxy.md`.

Track 3 — v1 archive sweep (10 git mv + 1 new index + 2 in-place scrubs)
- Moved 10 v1 docs under docs/v1/ preserving subpath structure: AlarmTracking, Configuration, DataTypeMapping, HistoricalDataAccess, Subscriptions (top-level); drivers/Galaxy-Repository, drivers/Galaxy-Test-Fixture; reqs/GalaxyRepositoryReqs, reqs/MxAccessClientReqs, reqs/ServiceHostReqs.
- New docs/v1/README.md is the shared archive banner + per-file table.
- docs/README.md repointed to the v1 paths and updated to reflect the v2 two-process deploy shape (Server + Admin + optional OtOpcUaWonderwareHistorian).
- docs/v2/Galaxy.ParityRig.md got a historical banner + four inline scrubs marking the OtOpcUaGalaxyHost service / Driver.Galaxy.Host EXE / Driver.Galaxy.ParityTests project as deleted-in-PR-7.2.

The repo's live-reading surface (README + CLAUDE.md + docs/v2/) now describes only the post-PR-7.2 architecture. v1 docs are preserved as a labelled archive under docs/v1/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
141
docs/v1/reqs/GalaxyRepositoryReqs.md
Normal file
@@ -0,0 +1,141 @@
# Galaxy Driver — Galaxy Repository Requirements

> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). Scope clarified: this document is **Galaxy-driver-specific**. Galaxy is one of seven drivers in the OtOpcUa platform; the requirements below describe the SQL-side of the Galaxy driver (hierarchy/attribute/change-detection queries against the ZB database) that backs the Galaxy driver's `ITagDiscovery.DiscoverAsync` and `IRediscoverable` implementations. All Galaxy-specific SQL runs inside `OtOpcUa.Galaxy.Host` (.NET 4.8 x86 Windows service); the in-server `Driver.Galaxy.Proxy` calls it over a named pipe. For platform-wide tag discovery requirements see `OpcUaServerReqs.md` OPC-002. For deeper spec see `docs/GalaxyRepository.md` and `docs/v2/driver-specs.md`.

Parent: [HLR-002](HighLevelReqs.md#hlr-002-multi-driver-plug-in-model), [HLR-003](HighLevelReqs.md#hlr-003-address-space-composition-per-namespace), [HLR-006](HighLevelReqs.md#hlr-006-change-detection-and-rediscovery)

Driver scope: Galaxy only. Namespace kind: `SystemPlatform`.

## GR-001: Hierarchy Extraction

The Galaxy driver's `ITagDiscovery.DiscoverAsync` implementation shall query the ZB Galaxy Repository database to extract all deployed objects with their parent-child containment relationships, contained names, and tag names.

### Acceptance Criteria

- Executes `queries/hierarchy.sql` against the ZB database from within `OtOpcUa.Galaxy.Host`.
- Returns a list of objects with: `gobject_id`, `tag_name`, `contained_name`, `browse_name`, `parent_gobject_id`, `is_area`.
- Objects with `parent_gobject_id = 0` become children of the root ZB node inside the `SystemPlatform` namespace.
- Only deployed, non-template objects matching the category filter (areas, engines, user-defined objects, etc.) are returned.
- Query completes within 10 seconds on a typical Galaxy (hundreds of objects). Log Warning if it takes longer.

### Details

- Results are ordered by `parent_gobject_id, tag_name` for deterministic tree building.
- Empty result → Warning logged (Galaxy may have no deployed objects, or the DB connection may be misconfigured).
- Orphan detection: a row referencing a non-existent `parent_gobject_id` (and not 0) is skipped with a Warning.
- Streamed to the core via `IAddressSpaceBuilder.AddFolder` / `AddObject` calls over the Galaxy named pipe; no in-memory full-tree buffering on the Host side.

---

## GR-002: Attribute Extraction

The Galaxy driver shall query user-defined (dynamic) attributes for deployed objects, including data type, array flag, and array dimensions.

### Acceptance Criteria

- Executes `queries/attributes.sql` using the template chain CTE to resolve inherited attributes.
- Returns: `gobject_id`, `tag_name`, `attribute_name`, `full_tag_reference`, `mx_data_type`, `is_array`, `array_dimension`, `security_classification`.
- Attributes starting with `_` are filtered out by the query.
- `array_dimension` is extracted from the `mx_value` hex bytes (positions 13-16, little-endian uint16).

### Details

- CTE recursion depth is limited to 10 levels.
- `mx_data_type` not in the known set (1-8, 13-16) defaults to String.
- `gobject_id` that doesn't match a hierarchy object is skipped (object may not be deployed).
- Each emitted attribute is reported via `DriverAttributeInfo` to the core through `IAddressSpaceBuilder.AddVariable`.
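
The `array_dimension` and `mx_data_type` rules above can be sketched in C# as follows. This is a minimal illustration under stated assumptions, not the driver's actual code: the class and method names are hypothetical, `mx_value` is assumed to arrive as a hex string, and "positions 13-16" is read as 1-based hex-character positions (two bytes).

```csharp
using System;
using System.Globalization;

// Hypothetical helper; names are illustrative and not taken from the driver source.
public static class MxValueDecoder
{
    // Assumption: "positions 13-16" are 1-based character positions in the hex
    // string, i.e. two bytes stored low byte first (little-endian uint16).
    public static ushort ReadArrayDimension(string mxValueHex)
    {
        if (mxValueHex is null || mxValueHex.Length < 16)
            throw new ArgumentException("mx_value hex string is too short.", nameof(mxValueHex));

        byte low = byte.Parse(mxValueHex.Substring(12, 2), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        byte high = byte.Parse(mxValueHex.Substring(14, 2), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return (ushort)(low | (high << 8));
    }

    // Types outside the known set (1-8, 13-16) fall back to String per the Details above.
    public static bool IsKnownMxDataType(int mxDataType) =>
        (mxDataType >= 1 && mxDataType <= 8) || (mxDataType >= 13 && mxDataType <= 16);
}
```

If the positions turn out to be byte offsets rather than character offsets, only the two `Substring` indices change.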

---

## GR-003: Change Detection and IRediscoverable

The Galaxy driver shall implement `IRediscoverable` by polling `galaxy.time_of_last_deploy` on a configurable interval to detect when a new deployment has occurred.

### Acceptance Criteria

- Polls `SELECT time_of_last_deploy FROM galaxy` at a configurable interval (`Galaxy:ChangeDetectionIntervalSeconds`, default 30 seconds).
- Compares the returned timestamp to the last known value stored in memory.
- If different, raises the `IRediscoverable.RediscoveryNeeded` signal so the core re-runs `ITagDiscovery.DiscoverAsync` and surgically rebuilds the Galaxy namespace subtree (per OPC-017).
- First poll after startup always triggers an initial discovery.
- Query failure → Warning logged; no rediscovery triggered; retry at next interval.

### Details

- Polling runs on a background `Task` inside `OtOpcUa.Galaxy.Host`, not on the STA message-pump thread.
- `time_of_last_deploy` is a `datetime` column; compared using exact equality (not a range).
- Signal delivery to the Proxy happens via a server-push message on the Galaxy named pipe.
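
The comparison at the heart of the poll loop can be sketched as a small, I/O-free class. The type name and the convention of passing `null` for a failed query are assumptions made for illustration; the SQL call, the timer, and the pipe message are left out.

```csharp
using System;

// Hypothetical, I/O-free core of the poll comparison; a real poller would feed
// it the result of `SELECT time_of_last_deploy FROM galaxy`.
public sealed class DeployChangeDetector
{
    private DateTime? _lastKnown; // null until the first successful poll

    // Returns true when rediscovery should be signalled.
    // Pass null for a failed query: no signal, state unchanged, retry next interval.
    public bool Observe(DateTime? timeOfLastDeploy)
    {
        if (timeOfLastDeploy is null)
            return false;

        // Exact equality, matching the datetime comparison rule above.
        // A null _lastKnown never equals a value, so the first successful poll triggers.
        bool changed = _lastKnown != timeOfLastDeploy;
        _lastKnown = timeOfLastDeploy;
        return changed;
    }
}
```

Because the last known value lives only in memory, a Host restart naturally produces a fresh initial discovery.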

---

## GR-004: Rediscovery Data Flow

On a deployment change, the Galaxy driver shall re-query hierarchy + attributes and stream the updated structure to the core for surgical namespace rebuild.

### Acceptance Criteria

- On change signal, re-run `GR-001` (hierarchy) and `GR-002` (attributes) queries.
- Stream the new tree to the core via `IAddressSpaceBuilder` over the named pipe.
- Log at Information level: `"Galaxy deployment change detected. Rebuilding. ({ObjectCount} objects, {AttributeCount} attributes)"`.
- Log total rediscovery duration at Information level.
- On re-query failure: Error logged; existing Galaxy subtree is retained.

### Details

- Rediscovery is not atomic from the DB perspective — hierarchy and attributes are two separate queries. Acceptable; Galaxy deployment is an infrequent operation.
- The core owns the diff/surgical apply per OPC-017; the Galaxy driver only streams the new authoritative tree.

---

## GR-005: Connection Configuration

Galaxy DB connection parameters shall be configurable via environment variables passed from the `OtOpcUa.Galaxy.Host` supervisor at spawn time.

### Acceptance Criteria

- Connection string via `OTOPCUA_GALAXY_ZB_CONN` environment variable.
- Default: `Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;` (Windows Auth).
- ADO.NET `SqlConnection` used for queries (.NET Framework 4.8).
- Connection is opened per-query (not kept open). Connection pooling handles efficiency.
- If the initial connection test at startup fails, log Error with the connection string sanitized and continue attempting (change-detection polls keep retrying).

### Details

- Command timeout: `Galaxy:CommandTimeoutSeconds` in Config DB driver JSON (default 30 seconds).
- No ORM. Raw ADO.NET with `SqlCommand` and `SqlDataReader`. SQL text embedded as constants.

---

## GR-006: Query Safety

All Galaxy SQL queries shall be static read-only SELECT statements. No writes to the Galaxy Repository database.

### Acceptance Criteria

- All queries are hardcoded SQL strings with no string concatenation or user-supplied parameters.
- No INSERT, UPDATE, DELETE, or DDL statements are ever executed against the Galaxy database.
- Queries use only SELECT with read-only intent.

---

## GR-007: Startup Validation

On startup, the Galaxy driver's DB component inside `OtOpcUa.Galaxy.Host` shall validate database connectivity.

### Acceptance Criteria

- Execute a simple test query (`SELECT 1`) against the configured Galaxy DB.
- If the database is unreachable, log Error but do not prevent Host startup.
- The Galaxy driver runs in degraded mode (empty SystemPlatform namespace) until the database becomes available and the next change-detection poll succeeds.
- In degraded mode the Galaxy driver instance reports `DriverHealth.Unavailable`, causing its Polly circuit state to be open until the first successful discovery.

---

## GR-008: Capability Wrapping

All calls into the Galaxy DB component from the Proxy side shall route through `CapabilityInvoker.InvokeAsync(DriverCapability.Discover, …)`.

### Acceptance Criteria

- `Driver.Galaxy.Proxy.DiscoverAsync` is a thin capability-invoker call that sends a MessagePack request over the named pipe to the Host's DB component.
- Roslyn analyzer **OTOPCUA0001** validates there are no direct discovery calls bypassing the invoker.
- Polly pipeline for `DriverCapability.Discover` on the Galaxy driver instance carries Timeout + Retry + CircuitBreaker.
205
docs/v1/reqs/MxAccessClientReqs.md
Normal file
@@ -0,0 +1,205 @@
# Galaxy Driver — MXAccess Client Requirements

> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). Scope narrowed: this document covers the MXAccess surface **inside `OtOpcUa.Galaxy.Host`** (.NET Framework 4.8 x86 Windows service). The in-server `Driver.Galaxy.Proxy` implements the `IReadable` / `IWritable` / `ISubscribable` / `IAlarmSource` / `IHistoryProvider` capability interfaces and routes every wire call through the named pipe to this Host process. The STA thread + reconnect playback + subscription refcount requirements from v1 are preserved; what changed is where they live (Host service, not the Server process). MXA-010 (proxy-side wrapping) and MXA-011 (pipe ACL / shared secret) are new.

Parent: [HLR-002](HighLevelReqs.md#hlr-002-multi-driver-plug-in-model), [HLR-005](HighLevelReqs.md#hlr-005-live-data-access), [HLR-007](HighLevelReqs.md#hlr-007-service-hosting)

Driver scope: Galaxy only. Process scope: `OtOpcUa.Galaxy.Host` (Host side) and `Driver.Galaxy.Proxy` (server-side forwarder).

## MXA-001: STA Thread with Message Pump

All MXAccess COM objects shall be created and called on a dedicated STA thread running a Win32 message pump to ensure COM callbacks are delivered.

### Acceptance Criteria

- A dedicated thread is created with `ApartmentState.STA` before any MXAccess COM object is instantiated; implementation lives in `StaPump` inside `OtOpcUa.Galaxy.Host`.
- The thread runs a Win32 message pump using `GetMessage` / `TranslateMessage` / `DispatchMessage`.
- Work items are marshalled to the STA thread via `PostThreadMessage(WM_APP)` and a concurrent queue.
- All COM object creation (`LMXProxyServer`), method calls, and event callbacks happen on this thread.
- Thread name `Galaxy.Sta` (for diagnostics).

### Details

- If the STA thread dies unexpectedly, log Fatal and trigger Host service shutdown. The supervisor restarts the Host under its driver-stability policy (`docs/v2/driver-stability.md`). COM objects on the dead thread are unrecoverable; no in-process recovery is attempted.
- `RunAsync(Action)` returns a `Task` that completes when the action executes on the STA thread. Callers can `await` it.

---

## MXA-002: Connection Lifecycle

The Host shall support Register/Unregister lifecycle with the `LMXProxyServer` COM object, tracking the connection handle.

### Acceptance Criteria

- `Register(clientName)` is called on the STA thread and returns a positive connection handle on success.
- Handle ≤ 0 → descriptive error thrown; Host reports `DriverHealth.Unavailable` via the pipe so the Proxy reports Bad quality to the core.
- `Unregister(handle)` is called during disconnect after all subscriptions are removed.
- Client name comes from `OTOPCUA_GALAXY_CLIENT_NAME` environment variable; default `OtOpcUa-Galaxy.Host`. Must be unique per MXAccess registration (a cluster's Primary and Secondary each get their own client-name suffix via node override).
- Connection state transitions: Disconnected → Connecting → Connected → Disconnecting → Disconnected (and Error from any state).

### Details

- `ConnectedSince` (UTC) recorded after successful Register.
- `ReconnectCount` tracked for diagnostics and `/metrics`.
- State changes are emitted over the pipe as `DriverHealth` updates.

---

## MXA-003: Tag Subscription

The Host shall support subscribing to tags via AddItem + AdviseSupervisory, receiving value updates through OnDataChange callbacks.

### Acceptance Criteria

- Subscribe sequence: `AddItem(handle, address)` returns item handle, then `AdviseSupervisory(handle, itemHandle)` starts the subscription.
- `OnDataChange` callback delivers value, quality, timestamp, and MXSTATUS_PROXY array.
- Item address format: `tag_name.AttributeName` for scalars, `tag_name.AttributeName[]` for whole arrays.
- AddItem failure → Warning logged, failure propagated over the pipe to the Proxy.
- Bidirectional maps of `address ↔ itemHandle` maintained for callback resolution.
- Multi-client refcounting: two Proxy-side subscribe calls for the same address produce one MXAccess subscription; refcount decrement on the last unsubscribe triggers `UnAdvise` / `RemoveItem`.

### Details

- `AdviseSupervisory` (not `Advise`) is used because this is a background service without an interactive user session.
- Stored subscriptions dictionary maps address → callback for reconnect replay.
- On reconnect, every entry in stored subscriptions is re-subscribed (AddItem + AdviseSupervisory with new handles).
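
The multi-client refcounting rule can be illustrated with a small bookkeeping class. This is a sketch with hypothetical names; it only decides when the MXAccess calls are due and assumes every call is already marshalled to the STA thread, so it does no locking.

```csharp
using System.Collections.Generic;

// Hypothetical bookkeeping class; the MXAccess calls themselves are left to the caller.
public sealed class SubscriptionRefCounter
{
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    // True when this is the first subscriber, i.e. the caller should run
    // AddItem + AdviseSupervisory for the address.
    public bool Acquire(string address)
    {
        _counts.TryGetValue(address, out int count);
        _counts[address] = count + 1;
        return count == 0;
    }

    // True when the last subscriber left, i.e. the caller should run
    // UnAdvise + RemoveItem. Unknown addresses are ignored.
    public bool Release(string address)
    {
        if (!_counts.TryGetValue(address, out int count))
            return false;
        if (count > 1)
        {
            _counts[address] = count - 1;
            return false;
        }
        _counts.Remove(address);
        return true;
    }
}
```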

---

## MXA-004: Tag Read/Write

The Host shall support synchronous-style read and write operations, marshalled to the STA thread, with configurable timeouts.

### Acceptance Criteria

- Read pattern: prefer cached subscription value; fall back to subscribe-get-first-value-unsubscribe (AddItem → AdviseSupervisory → wait for OnDataChange → UnAdvise → RemoveItem).
- Write: AddItem → AdviseSupervisory → `Write()` → await `OnWriteComplete` callback → cleanup.
- Read timeout: `Galaxy:ReadTimeoutSeconds` in driver config (default 5 seconds) — enforced on the Host side in addition to the Proxy-side Polly `Timeout` leg.
- Write timeout: `Galaxy:WriteTimeoutSeconds` (default 5 seconds) — enforced similarly.
- Concurrent operation limit: configurable semaphore (`Galaxy:MaxConcurrentOperations`, default 10).
- All operations marshalled to the STA thread.

### Details

- Write uses security classification `-1` (no security). Galaxy runtime enforces security; OtOpcUa authorization is enforced server-side before the call ever reaches the pipe (per OPC-014 `AuthorizationGate`).
- `OnWriteComplete`: check `MXSTATUS_PROXY.success`. If 0, extract detail code and propagate as an error over the pipe.
- COM exceptions translated to meaningful error messages.

---

## MXA-005: Auto-Reconnect

The Host shall monitor connection health and automatically reconnect on failure, replaying all stored subscriptions after reconnect.

### Acceptance Criteria

- Monitor loop runs on a background thread at `Galaxy:MonitorIntervalSeconds` (default 5 seconds).
- On disconnect, attempt reconnect. On success, replay all stored subscriptions.
- On reconnect failure, log Warning and retry at next interval (no exponential backoff inside the Host; the Proxy-side Polly pipeline handles cross-process backoff against pipe failures).
- Reconnect count is incremented on each successful reconnect.
- Monitor loop is cancellable for clean Host shutdown.

### Details

- Reconnect cleans up old COM objects before creating new ones.
- After reconnect, probe subscription (MXA-006) is re-established first, then stored subscriptions.
- No max retry limit — keep trying indefinitely until the Host service is stopped.

---

## MXA-006: Probe-Based Health Monitoring

The Host shall optionally subscribe to a configurable probe tag and use OnDataChange callback staleness to detect silent connection failures.

### Acceptance Criteria

- Probe tag address configured via `Galaxy:ProbeTag`. If unset, probe monitoring is disabled.
- Track `_lastProbeValueTime` (UTC) updated on each OnDataChange for the probe tag.
- If `DateTime.UtcNow - _lastProbeValueTime > staleThreshold`, force disconnect and reconnect.
- Stale threshold: `Galaxy:ProbeStaleThresholdSeconds` (default 60 seconds).
- Implements `IHostConnectivityProbe` on the Proxy side so the core's `CapabilityInvoker` records probe outcomes with `DriverCapability.Probe` telemetry.

### Details

- The probe tag should be an attribute the Galaxy runtime updates regularly (platform heartbeat, area timestamp). Specific tag is site-dependent.
- After forced reconnect, reset `_lastProbeValueTime` to `DateTime.UtcNow`.
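
The staleness rule can be sketched as follows. The class name is hypothetical, and passing the current time in as a parameter (rather than reading `DateTime.UtcNow` inside) is an assumption made so the rule is easy to exercise in isolation.

```csharp
using System;

// Hypothetical sketch of the staleness rule; the field name follows the text above.
public sealed class ProbeStalenessMonitor
{
    private readonly TimeSpan _staleThreshold;
    private DateTime _lastProbeValueTime;

    public ProbeStalenessMonitor(TimeSpan staleThreshold, DateTime startedUtc)
    {
        _staleThreshold = staleThreshold;
        _lastProbeValueTime = startedUtc;
    }

    // Call from the probe tag's OnDataChange handler.
    public void OnProbeValue(DateTime utcNow) => _lastProbeValueTime = utcNow;

    // Call from the monitor loop; true means force disconnect and reconnect.
    public bool IsStale(DateTime utcNow) => utcNow - _lastProbeValueTime > _staleThreshold;

    // Call after a forced reconnect so the new connection gets a full threshold window.
    public void Reset(DateTime utcNow) => _lastProbeValueTime = utcNow;
}
```

The comparison is strict, so a probe value that is exactly one threshold old does not yet count as stale.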

---

## MXA-007: COM Cleanup

On disconnect or disposal, the Host shall unwire event handlers, unadvise/remove all items, unregister, and release COM objects via `Marshal.ReleaseComObject`.

### Acceptance Criteria

- Cleanup order: UnAdvise all active subscriptions → RemoveItem all items → unwire OnDataChange and OnWriteComplete handlers → Unregister → `Marshal.ReleaseComObject`.
- On dispose: run disconnect if still connected, then dispose STA thread.
- Each cleanup step wrapped in try/catch (cleanup must not throw).
- After cleanup: handle maps cleared, pending write TCS entries abandoned, COM reference set to null.

### Details

- Stored subscriptions are NOT cleared on disconnect (preserved for reconnect replay). Only cleared on Dispose.
- Event handlers unwired BEFORE Unregister (else callbacks may fire on a dead object).
- `Marshal.ReleaseComObject` in a `finally` block, always.
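
The "each step wrapped in try/catch" rule can be shown without a real COM object by modelling the steps as named delegates. This is only a sketch of the ordering and fault-isolation idea with a hypothetical `CleanupRunner` type; it does not call MXAccess or `Marshal.ReleaseComObject`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical runner: each cleanup step is a named delegate so the ordering
// rule can be shown without a real MXAccess COM object.
public static class CleanupRunner
{
    // Runs every step in order; a throwing step is recorded and does not stop
    // the remaining steps. Returns the names of the steps that failed.
    public static IReadOnlyList<string> Run(IEnumerable<KeyValuePair<string, Action>> steps)
    {
        var failed = new List<string>();
        foreach (var step in steps)
        {
            try { step.Value(); }
            catch (Exception) { failed.Add(step.Key); } // cleanup must not throw
        }
        return failed;
    }
}
```

A caller would pass the steps in the order given above (UnAdvise, RemoveItem, unwire handlers, Unregister, release) and log the returned names at Warning level.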

---

## MXA-008: Operation Metrics

The MXAccess Host shall record timing and success/failure for Read, Write, and Subscribe operations.

### Acceptance Criteria

- Each operation records duration (ms) + success/failure.
- Metrics exposed over the pipe to the Proxy, which re-publishes them via OpenTelemetry → Prometheus under `DriverInstanceId = "galaxy-*"`, `HostName = "galaxy.host"`.
- Rolling 1000-entry buffer for percentile calculation.
- Uses an `ITimingScope` pattern: `using (var scope = metrics.BeginOperation("read")) { ... }`.

---

## MXA-009: Error Code Translation

The Host shall translate known MXAccess error codes from `MXSTATUS_PROXY.detail` into human-readable messages for logging and OPC UA status propagation.

### Acceptance Criteria

- Error 1008 → "User lacks security permission"
- Error 1012 → "Secured write required (one signature)"
- Error 1013 → "Verified write required (two signatures)"
- Unknown error codes logged with their numeric value.
- Translated messages flow back through the pipe and surface in OPC UA `StatusCode` descriptions and Server logs.
- Errors 1008 / 1012 / 1013 on write operations map to `Bad_UserAccessDenied` at the OPC UA surface.
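
The translation table can be sketched as a simple lookup. The type name and the wording of the unknown-code message are illustrative assumptions; only the three codes and their messages come from the criteria above.

```csharp
using System.Collections.Generic;

// Hypothetical translator; only the three codes listed above are mapped.
public static class MxErrorTranslator
{
    private static readonly Dictionary<int, string> Known = new Dictionary<int, string>
    {
        [1008] = "User lacks security permission",
        [1012] = "Secured write required (one signature)",
        [1013] = "Verified write required (two signatures)",
    };

    // Unknown codes keep their numeric value in the message so they can be logged.
    // The exact wording of that fallback message is an assumption.
    public static string Describe(int detail) =>
        Known.TryGetValue(detail, out string message) ? message : $"Unknown MXAccess error {detail}";

    // Write failures with these codes surface as Bad_UserAccessDenied at the OPC UA layer.
    public static bool MapsToUserAccessDenied(int detail) => Known.ContainsKey(detail);
}
```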

---

## MXA-010: Proxy-Side Capability Wrapping

`Driver.Galaxy.Proxy` shall implement the capability interfaces as thin forwarders that serialize every call through the named pipe and route every call through `CapabilityInvoker`.

### Acceptance Criteria

- `Driver.Galaxy.Proxy` implements `IDriver` + `IReadable` + `IWritable` + `ISubscribable` + `ITagDiscovery` + `IRediscoverable` + `IAlarmSource` + `IHistoryProvider` + `IHostConnectivityProbe`.
- Each implementation uses `CapabilityInvoker.InvokeAsync(DriverCapability.<...>, …)` — direct pipe calls bypassing the invoker are caught by Roslyn **OTOPCUA0001**.
- Each method serializes a MessagePack request frame, sends over the pipe, awaits the response frame, deserializes, returns.
- Pipe disconnect mid-call → `CapabilityInvoker`'s circuit breaker counts the failure; sustained disconnect opens the circuit and Galaxy nodes surface Bad quality until the pipe reconnects.
- Proxy tolerates Host service restarts — it automatically reconnects and replays subscription setup (parallel to MXA-005 but across the IPC boundary).

---

## MXA-011: Pipe Security

The named pipe between Proxy and Host shall be restricted to the Server's runtime principal via SID-based ACL and authenticated with a per-process shared secret.

### Acceptance Criteria

- Pipe name from `OTOPCUA_GALAXY_PIPE` environment variable; default `OtOpcUaGalaxy`.
- Allowed SID passed as `OTOPCUA_ALLOWED_SID` — only the declared principal (typically the Server service account) can open the pipe; `Administrators` is explicitly NOT granted (per the `project_galaxy_host_installed` memory note).
- Shared secret passed via `OTOPCUA_GALAXY_SECRET` at spawn time; the Proxy must present the matching secret on the opening handshake.
- Secret is process-scoped (regenerated per Host restart) and never persisted to disk or Config DB.
- Pipe ACL denials are logged as Warning with the rejected principal SID.

### Details

- Environment variables are passed by the supervisor launching the Host (`docs/v2/driver-stability.md`).
- Dev-box secret is stored at `.local/galaxy-host-secret.txt` for NSSM-wrapped development runs (memory note: `project_galaxy_host_installed`).
265
docs/v1/reqs/ServiceHostReqs.md
Normal file
@@ -0,0 +1,265 @@
# Service Host — Component Requirements

> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). v1 was a single Windows service; v2 ships **three cooperating Windows services** and the service-host requirements are rewritten per-process. SVC-001…SVC-006 from v1 are preserved in spirit (TopShelf, Serilog, config loading, graceful shutdown, startup sequence, unhandled-exception handling) but are now scoped to the process they apply to. SRV-* prefixes the Server process, ADM-* the Admin process, GHX-* the Galaxy Host process. A shared-requirements section at the top covers cross-process concerns (Serilog, logging rotation, bootstrap config scope).

Parent: [HLR-007](HighLevelReqs.md#hlr-007-service-hosting), [HLR-008](HighLevelReqs.md#hlr-008-logging), [HLR-011](HighLevelReqs.md#hlr-011-config-db-and-draft-publish)

## Shared Requirements (all three processes)

### SVC-SHARED-001: Serilog Logging

Every process shall use Serilog with a rolling daily file sink at Information level minimum, plus a console sink, plus opt-in CompactJsonFormatter file sink.

#### Acceptance Criteria

- Console sink active on every process (for interactive / debug mode).
- Rolling daily file sink:
  - Server: `logs/otopcua-YYYYMMDD.log`
  - Admin: `logs/otopcua-admin-YYYYMMDD.log`
  - Galaxy Host: `%ProgramData%\OtOpcUa\galaxy-host-YYYYMMDD.log`
- Retention count and min level configurable via `Serilog:*` in each process's `appsettings.json`.
- JSON sink opt-in via `Serilog:WriteJson = true` (emits `*.json.log` alongside the plain-text file) for SIEM ingestion.
- `Log.CloseAndFlush()` invoked in a `finally` block on shutdown.
- Structured logging (Serilog message templates) — no `string.Format`.

---

### SVC-SHARED-002: Bootstrap Configuration Scope

`appsettings.json` is bootstrap-only per HLR-011. Operational configuration (clusters, drivers, namespaces, tags, ACLs, poll groups) lives in the Config DB.

#### Acceptance Criteria

- `appsettings.json` may contain only: Config DB connection string, `Node:NodeId`, `Node:ClusterId`, `Node:LocalCachePath`, `OpcUa:*` security bootstrap fields, `Ldap:*` bootstrap fields, `Serilog:*`, `Redundancy:*` role id.
- Any attempt to configure driver instances, tags, or equipment through `appsettings.json` shall be rejected at startup with a descriptive error.
- Invalid or missing required bootstrap fields are detected at startup with a clear error (`"Node:NodeId not configured"` style).

---

## OtOpcUa.Server — Service Host Requirements (SRV-*)

### SRV-001: Microsoft.Extensions.Hosting + AddWindowsService

The Server shall use `Host.CreateApplicationBuilder(args)` with `AddWindowsService(o => o.ServiceName = "OtOpcUa")` to run as a Windows service.

#### Acceptance Criteria

- Service name `OtOpcUa`.
- Installs via standard `sc.exe` tooling or the build-provided installer.
- Runs as a configured service account (typically a domain service account with Config DB read access; Windows Auth to SQL Server).
- Console mode (running `ZB.MOM.WW.OtOpcUa.Server.exe` with no Windows service context) works for development and debugging.
- Platform target: .NET 10 x64 (default per decision in `plan.md` §3).

---

### SRV-002: Startup Sequence

The Server shall start components in a defined order, with failure handling at each step.

#### Acceptance Criteria

- Startup sequence:
  1. Load `appsettings.json` bootstrap configuration + initialize Serilog.
  2. Validate bootstrap fields (NodeId, ClusterId, Config DB connection).
  3. Initialize `OpcUaApplicationHost` (server-certificate resolution via `SecurityProfileResolver`).
  4. Connect to Config DB; request current published generation for `ClusterId`.
  5. If unreachable, fall back to `LiteDbConfigCache` (latest applied generation).
  6. Apply generation: register driver instances, build namespaces, wire capability pipelines.
  7. Start `OpcUaServerService` hosted service (opens endpoint listener).
  8. Start `HostStatusPublisher` (pushes `ClusterNodeGenerationState` to Config DB for Admin UI SignalR consumers).
  9. Start `RedundancyCoordinator` + `ServiceLevelCalculator`.
- Failure in steps 1-3 prevents startup.
- Failure in steps 4-6 logs Error and enters degraded mode (empty namespaces, `DriverHealth.Unavailable` on every driver, `ServiceLevel = 0`).
- Failure in steps 7-9 logs Error and shuts down (endpoint is non-optional).
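
The three failure classes can be summarised as a mapping from step number to outcome. The enum and class names here are hypothetical and exist only to make the rule concrete; the step numbers refer to the list above.

```csharp
using System;

// Hypothetical names; the step numbers refer to the startup sequence listed above.
public enum StartupFailureAction { AbortStartup, EnterDegradedMode, ShutDown }

public static class StartupPolicy
{
    public static StartupFailureAction OnFailure(int step)
    {
        if (step >= 1 && step <= 3) return StartupFailureAction.AbortStartup;      // bootstrap, validation, certificates
        if (step >= 4 && step <= 6) return StartupFailureAction.EnterDegradedMode; // Config DB, cache fallback, apply generation
        if (step >= 7 && step <= 9) return StartupFailureAction.ShutDown;          // endpoint, status publisher, redundancy
        throw new ArgumentOutOfRangeException(nameof(step), "Startup has steps 1 through 9.");
    }
}
```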
|
||||
|
||||
---
|
||||
|
||||
### SRV-003: Graceful Shutdown
|
||||
|
||||
On service stop, the Server shall gracefully shut down all driver instances, the OPC UA listener, and flush logs before exiting.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
- `IHostApplicationLifetime.ApplicationStopping` triggers orderly shutdown.
|
||||
- Shutdown sequence: stop `HostStatusPublisher` → stop driver instances (disconnect each via `IDriver.DisposeAsync`, which for Galaxy tears down the named pipe) → stop OPC UA server (stop accepting new sessions, complete pending reads/writes) → flush Serilog.
|
||||
- Shutdown completes within 30 seconds (Windows SCM timeout).
|
||||
- All `IDisposable` / `IAsyncDisposable` components disposed in reverse-creation order.
|
||||
- Final log entry: `"OtOpcUa.Server shutdown complete"` at Information level.
|
||||
|
||||
---
|
||||
|
||||
### SRV-004: Unhandled Exception Handling
|
||||
|
||||
The Server shall handle unexpected crashes gracefully.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
- Registers `AppDomain.CurrentDomain.UnhandledException` handler that logs Fatal before the process terminates.
|
||||
- Windows service recovery configured: restart on failure with 60-second delay.
|
||||
- Fatal log entry includes full exception details.
|
||||
|
||||
---
|
||||
|
||||
### SRV-005: Drivers Hosted In-Process
|
||||
|
||||
All drivers except Galaxy run in-process within the Server.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
- Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client drivers are resolved from the DI container and managed by `DriverHost`.
|
||||
- Galaxy driver in-process component is `Driver.Galaxy.Proxy`, which forwards to `OtOpcUa.Galaxy.Host` over the named pipe (see GHX-*).
|
||||
- Each driver instance's lifecycle (connect, discover, subscribe, dispose) is orchestrated by `DriverHost`.

---

### SRV-006: Redundancy-Node Bootstrap

The Server shall bootstrap its redundancy identity from `appsettings.json` and the Config DB.

#### Acceptance Criteria

- `Node:NodeId` + `Node:ClusterId` identify this node uniquely; the `Redundancy` coordinator looks up `ClusterNode.RedundancyRole` (Primary / Secondary) from the Config DB.
- Two nodes of the same cluster connect to the same Config DB and share the same `ClusterId`, but have different `NodeId` values and different `ApplicationUri` values.
- Missing or ambiguous `(ClusterId, NodeId)` causes startup failure.
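
The lookup and fail-fast rule above can be sketched as a pure function over the rows read from the Config DB. The `ClusterNode` record shape, the `RedundancyBootstrap` class, and the sample identifiers are assumptions made for illustration; the real entity and coordinator may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum RedundancyRole { Primary, Secondary }

// Hypothetical shape of a Config DB row; the real ClusterNode entity may differ.
public sealed record ClusterNode(string ClusterId, string NodeId, RedundancyRole RedundancyRole);

public static class RedundancyBootstrap
{
    // Returns the role when exactly one row matches; throws otherwise so
    // that startup fails on a missing or ambiguous identity.
    public static RedundancyRole ResolveRole(
        IEnumerable<ClusterNode> rows, string clusterId, string nodeId)
    {
        if (string.IsNullOrWhiteSpace(clusterId) || string.IsNullOrWhiteSpace(nodeId))
            throw new InvalidOperationException("Node:ClusterId and Node:NodeId must both be set.");

        var matches = rows
            .Where(n => n.ClusterId == clusterId && n.NodeId == nodeId)
            .ToList();

        return matches.Count switch
        {
            1 => matches[0].RedundancyRole,
            0 => throw new InvalidOperationException(
                $"No ClusterNode row for ({clusterId}, {nodeId})."),
            _ => throw new InvalidOperationException(
                $"Ambiguous ClusterNode rows for ({clusterId}, {nodeId}).")
        };
    }
}

public static class Program
{
    public static void Main()
    {
        var rows = new[]
        {
            new ClusterNode("cluster-a", "node-1", RedundancyRole.Primary),
            new ClusterNode("cluster-a", "node-2", RedundancyRole.Secondary),
        };
        // Prints: Secondary
        Console.WriteLine(RedundancyBootstrap.ResolveRole(rows, "cluster-a", "node-2"));
    }
}
```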

---

## OtOpcUa.Admin — Service Host Requirements (ADM-*)

### ADM-001: ASP.NET Core Blazor Server

The Admin app shall use `WebApplication.CreateBuilder` with Razor Components (`AddRazorComponents().AddInteractiveServerComponents()`), SignalR, and cookie authentication.

#### Acceptance Criteria

- Blazor Server (not WebAssembly) per `plan.md` §Tech Stack.
- Hosts SignalR hubs for live cluster state (used by `ClusterNodeGenerationState` views, crash-loop alerts, etc.).
- Runs as a Windows service via `AddWindowsService` OR as a standard ASP.NET Core process behind IIS / reverse proxy (site decides).
- Platform target: .NET 10 x64.

---

### ADM-002: Authentication and Authorization

Admin users authenticate via LDAP bind with cookie auth; three admin roles gate operations.

#### Acceptance Criteria

- Cookie auth scheme: `OtOpcUa.Admin`, 8-hour expiry, path `/login` for challenge.
- LDAP bind via `LdapAuthService`; user group memberships map to admin roles (`ConfigViewer`, `ConfigEditor`, `FleetAdmin`).
- Authorization policies:
  - `CanEdit` requires `ConfigEditor` or `FleetAdmin`.
  - `CanPublish` requires `FleetAdmin`.
  - View-only access requires `ConfigViewer` (or higher).
- Unauthenticated requests to any Admin page redirect to `/login`.
- Per-cluster role grants layer on top: a `ConfigEditor` with no grant for cluster X can view it but not edit it.
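
The cookie scheme and role-based policies above could be wired up roughly as follows in an ASP.NET Core `Program.cs`. This is a sketch under stated assumptions, not the Admin app's actual startup code: the `CanView` policy name and the `/` endpoint are hypothetical, and the LDAP bind, the `/login` page itself, and per-cluster grants are omitted.

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

const string Scheme = "OtOpcUa.Admin";

builder.Services
    .AddAuthentication(Scheme)
    .AddCookie(Scheme, options =>
    {
        options.LoginPath = "/login";                    // challenge redirects here
        options.ExpireTimeSpan = TimeSpan.FromHours(8);  // 8-hour expiry
    });

builder.Services.AddAuthorization(options =>
{
    options.AddPolicy("CanEdit", p => p.RequireRole("ConfigEditor", "FleetAdmin"));
    options.AddPolicy("CanPublish", p => p.RequireRole("FleetAdmin"));
    // Hypothetical policy name: the requirement only says view-only access
    // needs ConfigViewer or higher.
    options.AddPolicy("CanView", p => p.RequireRole("ConfigViewer", "ConfigEditor", "FleetAdmin"));
});

var app = builder.Build();

app.UseAuthentication();
app.UseAuthorization();

// Hypothetical endpoint: an unauthenticated request is redirected to /login.
app.MapGet("/", () => "ok").RequireAuthorization("CanView");

app.Run();
```

Because the cookie scheme is the default scheme, an unauthenticated request to a protected endpoint is challenged by the cookie handler, which issues the redirect to `LoginPath`.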

---

### ADM-003: Config DB as Sole Write Path

The Admin service shall be the only process with write access to the Config DB.

#### Acceptance Criteria

- EF Core `OtOpcUaConfigDbContext` configured with the SQL login / connection string that has read+write permission on config tables.
- Server nodes connect with a read-only principal (`grant SELECT` only).
- Admin writes produce draft-generation rows; publish writes are atomic and transactional.
- Every write is audited via `AuditLogService` per ADM-006.

---

### ADM-004: Prometheus /metrics Endpoint

The Admin service shall expose an OpenTelemetry → Prometheus metrics endpoint at `/metrics`.

#### Acceptance Criteria

- `OpenTelemetry.Metrics` registered with Prometheus exporter.
- `/metrics` scrapeable without authentication (standard Prometheus pattern) OR gated behind an infrastructure allow-list (site-configurable).
- Exports metrics from Server nodes of managed clusters (aggregated via Config DB heartbeat telemetry) plus Admin-local metrics (login attempts, publish duration, active sessions).

---

### ADM-005: Graceful Shutdown

On shutdown, the Admin service shall disconnect SignalR clients cleanly, finish in-flight DB writes, and flush Serilog.

#### Acceptance Criteria

- `IHostApplicationLifetime.ApplicationStopping` closes SignalR hub connections gracefully.
- In-flight publish transactions are given up to 30 seconds to complete.
- Final log entry: `"OtOpcUa.Admin shutdown complete"`.

---

### ADM-006: Audit Logging

Every publish and every ACL / role-grant change shall produce an immutable audit row via `AuditLogService`.

#### Acceptance Criteria

- Audit rows include: timestamp (UTC), acting principal (LDAP DN + display name), action, entity kind + id, before/after generation number where applicable, session id, source IP.
- Audit rows are never mutated or deleted by application code.
- Audit table schema enforces immutability via DB permissions (no UPDATE / DELETE granted to the Admin app's principal).

---

## OtOpcUa.Galaxy.Host — Service Host Requirements (GHX-*)

### GHX-001: TopShelf Windows Service Hosting

The Galaxy Host shall use TopShelf for Windows service lifecycle (install, uninstall, start, stop) and interactive console mode.

#### Acceptance Criteria

- Service name `OtOpcUaGalaxyHost`, display name `OtOpcUa Galaxy Host`.
- Installs via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe install`.
- Uninstalls via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe uninstall`.
- Runs as a configured user account (typically the same account as the Server, or a dedicated Galaxy service account with ArchestrA platform access).
- Interactive console mode (no args) for development / debugging.
- Platform target: **.NET Framework 4.8 x86** — required for MXAccess COM 32-bit interop.
- Development deployments may use NSSM in place of TopShelf (memory: `project_galaxy_host_installed`).

### Details

- Service description: "OtOpcUa Galaxy Host — MXAccess + Galaxy Repository backend for the Galaxy driver, named-pipe IPC to OtOpcUa.Server."

---

### GHX-002: Named-Pipe IPC Bootstrap

The Host shall open a named pipe on startup whose name, ACL, and shared secret come from environment variables supplied by the supervisor at spawn time.

#### Acceptance Criteria

- `OTOPCUA_GALAXY_PIPE` → pipe name (default `OtOpcUaGalaxy`).
- `OTOPCUA_ALLOWED_SID` → SID of the principal allowed to connect; any other principal is denied at the ACL layer.
- `OTOPCUA_GALAXY_SECRET` → per-process shared secret; `Driver.Galaxy.Proxy` must present it on handshake.
- `OTOPCUA_GALAXY_BACKEND` → `stub` / `db` / `mxaccess` (default `mxaccess`) — selects which backend implementation is loaded.
- Missing `OTOPCUA_ALLOWED_SID` or `OTOPCUA_GALAXY_SECRET` at startup throws with a descriptive error.
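
The defaults and required-variable checks above can be sketched as a small resolver. `PipeBootstrapSettings` and its members are hypothetical names, the sample SID and secret are example values only, and rejecting an unknown backend value is an assumption, since the criteria do not say how that case is handled. The code avoids newer language features so that it would also compile for a .NET Framework 4.8 target.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical settings holder; the real Host's types may differ.
public sealed class PipeBootstrapSettings
{
    private static readonly HashSet<string> KnownBackends =
        new HashSet<string> { "stub", "db", "mxaccess" };

    public string PipeName { get; private set; }
    public string AllowedSid { get; private set; }
    public string Secret { get; private set; }
    public string Backend { get; private set; }

    // getEnv is injected so the logic can be exercised without touching
    // the real process environment.
    public static PipeBootstrapSettings Resolve(Func<string, string> getEnv)
    {
        string backend = Optional(getEnv, "OTOPCUA_GALAXY_BACKEND", "mxaccess");
        if (!KnownBackends.Contains(backend))
        {
            // Assumption: the source does not say how an unknown value is handled.
            throw new InvalidOperationException(
                "OTOPCUA_GALAXY_BACKEND must be stub, db, or mxaccess; got '" + backend + "'.");
        }

        return new PipeBootstrapSettings
        {
            PipeName = Optional(getEnv, "OTOPCUA_GALAXY_PIPE", "OtOpcUaGalaxy"),
            AllowedSid = Required(getEnv, "OTOPCUA_ALLOWED_SID"),
            Secret = Required(getEnv, "OTOPCUA_GALAXY_SECRET"),
            Backend = backend
        };
    }

    private static string Optional(Func<string, string> getEnv, string name, string fallback)
    {
        string value = getEnv(name);
        return string.IsNullOrEmpty(value) ? fallback : value;
    }

    private static string Required(Func<string, string> getEnv, string name)
    {
        string value = getEnv(name);
        if (string.IsNullOrEmpty(value))
        {
            throw new InvalidOperationException(
                "Required environment variable " + name + " is not set.");
        }
        return value;
    }
}

public static class Program
{
    public static void Main()
    {
        // Example values only. In the Host, the lookup would be
        // Environment.GetEnvironmentVariable.
        var env = new Dictionary<string, string>
        {
            { "OTOPCUA_ALLOWED_SID", "S-1-5-18" },
            { "OTOPCUA_GALAXY_SECRET", "example-secret" }
        };
        var settings = PipeBootstrapSettings.Resolve(
            name => env.TryGetValue(name, out var value) ? value : null);
        // Prints: OtOpcUaGalaxy / mxaccess
        Console.WriteLine(settings.PipeName + " / " + settings.Backend);
    }
}
```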

---

### GHX-003: Backend Lifecycle

The Host shall instantiate the STA pump + MXAccess backend + Galaxy Repository + optional Historian plugin in a defined order and tear them down cleanly on shutdown.

#### Acceptance Criteria

- Startup (mxaccess backend): initialize Serilog → resolve env vars → create `PipeServer` → start `StaPump` → create `MxAccessClient` on STA thread → initialize `GalaxyRepository` → optionally initialize Historian plugin → begin pipe request handling.
- Shutdown: stop pipe → dispose MxAccessClient (MXA-007 COM cleanup) → dispose STA pump → flush Serilog.
- Shutdown must complete within 30 seconds (Windows SCM timeout).
- `Console.CancelKeyPress` triggers the same sequence in console mode.

---

### GHX-004: Unhandled Exception Handling

The Host shall log Fatal on crash and let the supervisor restart it.

#### Acceptance Criteria

- `AppDomain.CurrentDomain.UnhandledException` handler logs Fatal with full exception details before termination.
- The supervisor's driver-stability policy (`docs/v2/driver-stability.md`) governs restart behavior — backoff, crash-loop detection, and alerting live there, not in the Host.
- Server-side: `Driver.Galaxy.Proxy` detects pipe disconnect, opens its capability circuit, reports Bad quality on Galaxy nodes; reconnects automatically when the Host is back.