Extract historian into a runtime-loaded plugin so hosts without the Wonderware SDK can run with Historian.Enabled=false

The aahClientManaged SDK is now isolated in ZB.MOM.WW.LmxOpcUa.Historian.Aveva and loaded via HistorianPluginLoader from a Historian/ subfolder only when enabled, removing the SDK from Host's compile-time and deploy-time surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-12 15:16:07 -04:00
parent 9e1a180ce3
commit 9b42b61eb6
40 changed files with 739 additions and 15855 deletions


@@ -232,7 +232,7 @@ Example — two-instance redundant pair (Primary):
Three boolean properties act as feature flags that control optional subsystems:
- **`OpcUa.AlarmTrackingEnabled`** -- When `true`, the node manager creates `AlarmConditionState` nodes for alarm attributes and monitors `InAlarm` transitions. Disabled by default because alarm tracking adds per-attribute overhead.
- **`Historian.Enabled`** -- When `true`, the service calls `HistorianPluginLoader.TryLoad(config)` to load the `ZB.MOM.WW.LmxOpcUa.Historian.Aveva` plugin from the `Historian/` subfolder next to the host exe and registers the resulting `IHistorianDataSource` with the OPC UA server host. Disabled by default because not all deployments have a Historian instance; when disabled, the plugin is not probed and the Wonderware SDK DLLs are not required on the host. If the flag is `true` but the plugin or its SDK dependencies cannot be loaded, the server still starts and every history read returns `BadHistoryOperationUnsupported` with a warning in the log.
- **`GalaxyRepository.ExtendedAttributes`** -- When `true`, the repository loads additional Galaxy attribute metadata beyond the core set needed for the address space. Disabled by default to minimize startup query time.
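For reference, a minimal sketch of how the three flags might appear in the host's `appsettings.json` -- the section nesting shown here is an assumption inferred from the dotted property names above:

```json
{
  "OpcUa": {
    "AlarmTrackingEnabled": false
  },
  "Historian": {
    "Enabled": false
  },
  "GalaxyRepository": {
    "ExtendedAttributes": false
  }
}
```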
## Configuration Validation


@@ -75,7 +75,7 @@ Galaxy attributes carry a `security_classification` value that controls write pe
Most attributes default to Operate (1). The mapper treats SecuredWrite, VerifiedWrite, and ViewOnly as read-only because the OPC UA server does not implement the Galaxy's multi-level authentication model. Allowing writes to SecuredWrite or VerifiedWrite attributes without proper verification would bypass Galaxy security.
For historized attributes, `AccessLevels.HistoryRead` is added to the access level via bitwise OR, enabling OPC UA history read requests when an `IHistorianDataSource` is configured via the runtime-loaded historian plugin.
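The composition can be sketched as follows. The byte values match the OPC Foundation `AccessLevels` constants; the `isWritable`/`isHistorized` locals stand in for the mapper's real per-attribute inputs:

```csharp
using System;

// Sketch of the access-level composition for a historized attribute.
// Values mirror Opc.Ua.AccessLevels: CurrentRead = 0x01, CurrentWrite = 0x02,
// HistoryRead = 0x04.
const byte CurrentRead  = 0x01;
const byte CurrentWrite = 0x02;
const byte HistoryRead  = 0x04;

bool isWritable   = false;   // e.g. SecuredWrite/VerifiedWrite/ViewOnly map to read-only
bool isHistorized = true;    // attribute is historized in the Galaxy

byte accessLevel = CurrentRead;
if (isWritable)   accessLevel |= CurrentWrite;
if (isHistorized) accessLevel |= HistoryRead;   // bitwise OR, as described above

Console.WriteLine($"0x{accessLevel:X2}");       // 0x05 = CurrentRead | HistoryRead
```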
## Key source files


@@ -1,15 +1,41 @@
# Historical Data Access
`LmxNodeManager` exposes OPC UA historical data access (HDA) through an abstract `IHistorianDataSource` interface (`Historian/IHistorianDataSource.cs`). The Wonderware Historian implementation lives in a separate assembly, `ZB.MOM.WW.LmxOpcUa.Historian.Aveva`, which is loaded at runtime only when `Historian.Enabled=true`. This keeps the `aahClientManaged` SDK out of the core Host so deployments that do not need history do not need the SDK installed.
## Plugin Architecture
The historian surface is split across two assemblies:
- **`ZB.MOM.WW.LmxOpcUa.Host`** (core) owns only OPC UA / BCL types:
- `IHistorianDataSource` -- the interface `LmxNodeManager` depends on
- `HistorianEventDto` -- SDK-free representation of a historian event record
- `HistorianAggregateMap` -- maps OPC UA aggregate NodeIds to AnalogSummary column names
- `HistorianPluginLoader` -- loads the plugin via `Assembly.LoadFrom` at startup
- `HistoryContinuationPointManager` -- paginates HistoryRead results
- **`ZB.MOM.WW.LmxOpcUa.Historian.Aveva`** (plugin) owns everything SDK-bound:
- `HistorianDataSource` -- implements `IHistorianDataSource`, wraps `aahClientManaged`
- `IHistorianConnectionFactory` / `SdkHistorianConnectionFactory` -- opens and polls `ArchestrA.HistorianAccess` connections
- `AvevaHistorianPluginEntry.Create(HistorianConfiguration)` -- the static factory invoked by the loader
The plugin assembly and its SDK dependencies (`aahClientManaged.dll`, `aahClient.dll`, `aahClientCommon.dll`, `Historian.CBE.dll`, `Historian.DPAPI.dll`, `ArchestrA.CloudHistorian.Contract.dll`) deploy to a `Historian/` subfolder next to `ZB.MOM.WW.LmxOpcUa.Host.exe`. See [Service Hosting](ServiceHosting.md#required-runtime-assemblies) for the full layout and deployment matrix.
## Plugin Loading
When the service starts with `Historian.Enabled=true`, `OpcUaService` calls `HistorianPluginLoader.TryLoad(config)`. The loader:
1. Probes `AppDomain.CurrentDomain.BaseDirectory\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll`.
2. Installs a one-shot `AppDomain.AssemblyResolve` handler that redirects any `aahClientManaged`/`aahClientCommon`/`Historian.*` lookups to the same subfolder, so the CLR can resolve SDK dependencies when the plugin first JITs.
3. Calls the plugin's `AvevaHistorianPluginEntry.Create(HistorianConfiguration)` via reflection and returns the resulting `IHistorianDataSource`.
4. On any failure (plugin missing, entry type not found, SDK assembly unresolvable, bad image), logs a warning with the expected plugin path and returns `null`. The server starts normally and `LmxNodeManager` returns `BadHistoryOperationUnsupported` for every history call.
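The loader's shape can be sketched as below. The plugin path, entry type, and `Create` method come from the steps above; the entry type's namespace, the untyped `config` parameter, and the reduced error handling are assumptions, and where the real handler is one-shot this sketch leaves it installed for simplicity:

```csharp
using System;
using System.IO;
using System.Reflection;

// Hedged sketch of HistorianPluginLoader.TryLoad: probe the plugin, install a
// resolve handler for the Historian/ subfolder, then invoke the entry point via
// reflection. Returns null on any failure so the server starts without history.
static object TryLoad(object config)
{
    string dir = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Historian");
    string pluginPath = Path.Combine(dir, "ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll");
    if (!File.Exists(pluginPath)) return null;                   // plugin missing

    // Redirect SDK lookups (aahClientManaged, aahClientCommon, Historian.*)
    // to the same subfolder so the CLR can resolve them when the plugin JITs.
    AppDomain.CurrentDomain.AssemblyResolve += (s, e) =>
    {
        string name = new AssemblyName(e.Name).Name;
        string candidate = Path.Combine(dir, name + ".dll");
        return File.Exists(candidate) ? Assembly.LoadFrom(candidate) : null;
    };

    try
    {
        Assembly plugin = Assembly.LoadFrom(pluginPath);
        // Assumed namespace -- the document names only the type and method.
        Type entry = plugin.GetType("ZB.MOM.WW.LmxOpcUa.Historian.Aveva.AvevaHistorianPluginEntry");
        MethodInfo create = entry?.GetMethod("Create", BindingFlags.Public | BindingFlags.Static);
        return create?.Invoke(null, new[] { config });           // IHistorianDataSource
    }
    catch (Exception)                                            // bad image, missing entry, ...
    {
        return null;                                             // caller logs and continues
    }
}
```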
## Wonderware Historian SDK
The plugin uses the AVEVA Historian managed SDK (`aahClientManaged.dll`) to query historical data. The SDK provides a cursor-based query API through `ArchestrA.HistorianAccess`, replacing direct SQL queries against the Historian Runtime database. Two query types are used:
- **`HistoryQuery`** -- Raw historical samples with timestamp, value (numeric or string), and OPC quality.
- **`AnalogSummaryQuery`** -- Pre-computed aggregates with properties for Average, Minimum, Maximum, ValueCount, First, Last, StdDev, and more.
The SDK DLLs are located in `lib/` and originate from `C:\Program Files (x86)\Wonderware\Historian\`. Only the plugin project (`src/ZB.MOM.WW.LmxOpcUa.Historian.Aveva/`) references them at build time; the core Host project does not.
## Configuration
@@ -29,7 +55,7 @@ public class HistorianConfiguration
}
```
When `Enabled` is `false`, `HistorianPluginLoader.TryLoad` is not called, no plugin is loaded, and the node manager returns `BadHistoryOperationUnsupported` for history read requests. When `Enabled` is `true` but the plugin cannot be loaded (missing `Historian/` subfolder, SDK assembly resolve failure, etc.), the server still starts and returns the same `BadHistoryOperationUnsupported` status with a warning in the log.
### Connection Properties
@@ -45,7 +71,7 @@ When `Enabled` is `false`, the `HistorianDataSource` is not instantiated and the
## Connection Lifecycle
`HistorianDataSource` (in the plugin assembly) maintains a persistent connection to the Historian server via `ArchestrA.HistorianAccess`:
1. **Lazy connect** -- The connection is established on the first query via `EnsureConnected()`.
2. **Connection reuse** -- Subsequent queries reuse the same connection.
@@ -56,7 +82,7 @@ The connection is opened with `ReadOnly = true` and `ConnectionType = Process`.
## Raw Reads
`IHistorianDataSource.ReadRawAsync` (plugin implementation) uses a `HistoryQuery` to retrieve individual samples within a time range:
1. Create a `HistoryQuery` via `_connection.CreateHistoryQuery()`
2. Configure `HistoryQueryArgs` with `TagNames`, `StartDateTime`, `EndDateTime`, and `RetrievalMode = Full`
@@ -70,7 +96,7 @@ Each result row is converted to an OPC UA `DataValue`:
## Aggregate Reads
`IHistorianDataSource.ReadAggregateAsync` (plugin implementation) uses an `AnalogSummaryQuery` to retrieve pre-computed aggregates:
1. Create an `AnalogSummaryQuery` via `_connection.CreateAnalogSummaryQuery()`
2. Configure `AnalogSummaryQueryArgs` with `TagNames`, `StartDateTime`, `EndDateTime`, and `Resolution` (milliseconds)
@@ -93,7 +119,7 @@ See `Domain/QualityMapper.cs` and `Domain/Quality.cs` for the full mapping table
## Aggregate Function Mapping
`HistorianAggregateMap.MapAggregateToColumn` (in the core Host assembly, so the node manager can validate aggregate support without requiring the plugin to be loaded) translates OPC UA aggregate NodeIds to `AnalogSummaryQueryResult` property names:
| OPC UA Aggregate | Result Property |
|---|---|
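A few representative pairs can be sketched as a lookup table. The pairs below are illustrative only, inferred from the `AnalogSummary` properties named earlier in this document; the authoritative mapping lives in `HistorianAggregateMap.MapAggregateToColumn`:

```csharp
using System.Collections.Generic;

// Illustrative sketch only -- real table is in HistorianAggregateMap.
// Keys are OPC UA aggregate names; values are AnalogSummaryQueryResult
// property names cited in this document (Average, Minimum, Maximum, ValueCount).
var map = new Dictionary<string, string>
{
    ["Average"] = "Average",
    ["Minimum"] = "Minimum",
    ["Maximum"] = "Maximum",
    ["Count"]   = "ValueCount",
};

string column = map.TryGetValue("Count", out var c) ? c : null;  // "ValueCount"
```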


@@ -94,7 +94,7 @@ On startup, `OpcUaServerHost.StartAsync` calls `CheckApplicationInstanceCertific
`LmxOpcUaServer` inherits from the OPC Foundation `StandardServer` base class and overrides two methods:
- **`CreateMasterNodeManager`** -- Instantiates `LmxNodeManager` with the Galaxy namespace URI, the `IMxAccessClient` for runtime I/O, performance metrics, and an optional `IHistorianDataSource` (supplied by the runtime-loaded historian plugin, see [Historical Data Access](HistoricalDataAccess.md)). The node manager is wrapped in a `MasterNodeManager` with no additional core node managers.
- **`OnServerStarted`** -- Configures redundancy, history capabilities, and server capabilities at startup. Called after the server is fully initialized.
- **`LoadServerProperties`** -- Returns server metadata: manufacturer `ZB MOM`, product `LmxOpcUa Server`, and the assembly version as the software version.


@@ -61,7 +61,7 @@ This is necessary because Windows services default their working directory to `S
5. **Create and connect MXAccess client** -- Starts the STA COM thread, creates the `MxAccessClient`, and attempts an initial connection. If the connection fails, the service logs a warning and continues -- the monitor loop will retry in the background.
6. **Start MXAccess monitor** -- Starts the connectivity monitor loop that probes the runtime connection at the configured interval and handles auto-reconnect.
7. **Test Galaxy repository connection** -- Calls `TestConnectionAsync()` on the Galaxy repository to verify the SQL Server database is reachable. If it fails, the service continues without initial address-space data.
8. **Create OPC UA server host** -- Creates `OpcUaServerHost` with the effective MXAccess client (real, override, or null fallback), performance metrics, and an optional `IHistorianDataSource` obtained from `HistorianPluginLoader.TryLoad` when `Historian.Enabled=true` (returns `null` if the plugin is absent or fails to load).
9. **Query Galaxy hierarchy** -- Fetches the object hierarchy and attribute definitions from the Galaxy repository database, recording object and attribute counts.
10. **Start server and build address space** -- Starts the OPC UA server, retrieves the `LmxNodeManager`, and calls `BuildAddressSpace()` with the queried hierarchy and attributes. If the query or build fails, the server still starts with an empty address space.
11. **Start change detection** -- Creates and starts `ChangeDetectionService`, which polls `galaxy.time_of_last_deploy` at the configured interval. When a change is detected, it triggers an address-space rebuild via the `OnGalaxyChanged` event.
@@ -153,21 +153,37 @@ See [Redundancy Guide](Redundancy.md) for full deployment details.
## Required Runtime Assemblies
The build uses Costura.Fody to embed all NuGet dependencies into the single `ZB.MOM.WW.LmxOpcUa.Host.exe`. The only native dependency that must sit alongside the executable in every deployment is the MXAccess COM toolkit:
| Assembly | Purpose |
|----------|---------|
| `ArchestrA.MxAccess.dll` | MXAccess COM interop — runtime data access to Galaxy tags |
The Wonderware Historian SDK is packaged as a **runtime-loaded plugin** so hosts that will not use historical data access do not need the SDK installed. The plugin lives in a `Historian/` subfolder next to `ZB.MOM.WW.LmxOpcUa.Host.exe`:
```
ZB.MOM.WW.LmxOpcUa.Host.exe
ArchestrA.MxAccess.dll
Historian/
ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
aahClientManaged.dll
aahClientCommon.dll
aahClient.dll
Historian.CBE.dll
Historian.DPAPI.dll
ArchestrA.CloudHistorian.Contract.dll
```
At startup, if `Historian.Enabled=true` in `appsettings.json`, `HistorianPluginLoader` probes `Historian/ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll` via `Assembly.LoadFrom` and instantiates the plugin's entry point. An `AppDomain.AssemblyResolve` handler redirects the SDK assembly lookups (`aahClientManaged`, `aahClientCommon`, …) to the same subfolder so the CLR can resolve them when the plugin first JITs. If the plugin directory is absent or any SDK dependency fails to load, the loader logs a warning and the server continues to run with history support disabled — `LmxNodeManager` returns `BadHistoryOperationUnsupported` for every history call.
Deployment matrix:
| Scenario | Host exe | `ArchestrA.MxAccess.dll` | `Historian/` subfolder |
|----------|----------|--------------------------|------------------------|
| `Historian.Enabled=false` | required | required | **omit** |
| `Historian.Enabled=true` | required | required | required |
`ArchestrA.MxAccess.dll` and the historian SDK DLLs are not redistributable — they are provided by the AVEVA System Platform and Historian installations on the target machine. The copies in `lib/` are taken from `Program Files (x86)\ArchestrA\Framework\bin` on a machine with the platform installed.
## Platform Target


@@ -1,246 +0,0 @@
# Stability Review
Date: 2026-04-07
Scope:
- Service startup/shutdown lifecycle
- MXAccess threading and reconnect behavior
- OPC UA node manager request paths
- Historian history-read paths
- Status dashboard hosting
- Test coverage around the above
## Findings
### P1: `StaComThread` can leave callers blocked forever after pump failure or shutdown races
Evidence:
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:106`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:132`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:178`
Details:
- `RunAsync` enqueues work and calls `PostThreadMessage(...)`, but it ignores the returned `bool`.
- If the STA thread has already exited, or if posting fails for any other reason, the queued `TaskCompletionSource` is never completed or faulted.
- `ThreadEntry` logs pump crashes, but it does not drain/fault queued work, does not reset `_nativeThreadId`, and does not prevent later calls from queueing more work unless `Dispose()` happened first.
Note:
- Clean shutdown via `Dispose()` posts `WM_APP+1`, which calls `DrainQueue()` before `PostQuitMessage`, so queued work is drained on the normal shutdown path. The gap is crash and unexpected-exit paths only.
- `ThreadEntry` catches crashes and calls `_ready.TrySetException(ex)`, but `_ready` is only awaited during `Start()`. A crash *after* startup completes does not fault any pending or future callers.
Impact:
- Any caller waiting synchronously on these tasks can hang indefinitely after a pump crash (not a clean shutdown).
- This is especially dangerous because higher layers regularly use `.GetAwaiter().GetResult()` during connect, disconnect, rebuild, and request processing.
Recommendation:
- Check the return value of `PostThreadMessage`.
- If post fails, remove/fault the queued work item immediately.
- Mark the worker unusable when the pump exits unexpectedly and fault all remaining queued items.
- Add a shutdown/crash-path test that verifies queued callers fail fast instead of hanging.
**Status: Resolved (2026-04-07)**
Fix: Refactored queue to `WorkItem` type with separate `Execute`/`Fault` actions. Added `_pumpExited` flag set in `ThreadEntry` finally block. `DrainAndFaultQueue()` faults all pending TCS instances without executing user actions. `RunAsync` checks `_pumpExited` before enqueueing. `PostThreadMessage` return value is checked — false triggers drain-and-fault. Added crash-path test via `PostQuitMessage`.
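The drain-and-fault pattern described in the fix can be sketched as follows. `WorkItem`, `_pumpExited`, and `DrainAndFaultQueue` are named in the fix; everything else (the class shape, the `PostToPump` stand-in for `PostThreadMessage`) is illustrative:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch: each queued item carries both an Execute action and a Fault action,
// so a dead pump can fail waiting callers fast instead of leaving them hung.
sealed class WorkItem
{
    public Action Execute;                      // runs on the STA thread
    public Action<Exception> Fault;             // completes the caller's TCS exceptionally
}

class PumpSketch
{
    readonly ConcurrentQueue<WorkItem> _queue = new ConcurrentQueue<WorkItem>();
    volatile bool _pumpExited;

    public Task<T> RunAsync<T>(Func<T> func)
    {
        var tcs = new TaskCompletionSource<T>();
        if (_pumpExited)                        // reject before enqueueing
        {
            tcs.SetException(new InvalidOperationException("STA pump has exited."));
            return tcs.Task;
        }
        _queue.Enqueue(new WorkItem
        {
            Execute = () => { try { tcs.SetResult(func()); } catch (Exception ex) { tcs.SetException(ex); } },
            Fault = ex => tcs.TrySetException(ex),
        });
        if (!PostToPump())                      // PostThreadMessage returned false
            DrainAndFaultQueue(new InvalidOperationException("Message post failed."));
        return tcs.Task;
    }

    void DrainAndFaultQueue(Exception reason)
    {
        _pumpExited = true;
        while (_queue.TryDequeue(out var item)) item.Fault(reason);  // no user code runs
    }

    bool PostToPump() => !_pumpExited;          // stand-in for the real Win32 post
}
```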
### P1: `LmxNodeManager` discards subscription tasks, so failures can be silent
Evidence:
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:396`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1906`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1934`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1960`
- `tests/ZB.MOM.WW.LmxOpcUa.Tests/Helpers/FakeMxAccessClient.cs:83`
Details:
- Several subscription and unsubscription calls are fire-and-forget.
- `SubscribeAlarmTags()` even wraps `SubscribeAsync(...)` in `try/catch`, but because the returned task is not awaited, asynchronous failures bypass that catch.
- The test suite mostly uses `FakeMxAccessClient`, whose subscribe/unsubscribe methods complete immediately, so these failure paths are not exercised.
Impact:
- A failed runtime subscribe/unsubscribe can silently leave monitored OPC UA items stale or orphaned.
- The service can appear healthy while live updates quietly stop flowing for part of the address space.
Note:
- `LmxNodeManager` also runs a separate `_dataChangeDispatchThread` that batches MXAccess callbacks into OPC UA value updates. Subscription failures upstream mean this thread will simply never receive data for the affected tags, with no indication of the gap. Failures should be cross-referenced with dispatch-thread health to surface silent data loss.
Recommendation:
- Stop discarding these tasks.
- If the boundary must remain synchronous, centralize the wait and log/fail deterministically.
- Add tests that inject asynchronously failing subscribe/unsubscribe operations.
**Status: Resolved (2026-04-07)**
Fix: `SubscribeTag` and `UnsubscribeTag` (critical monitored-item paths) now use `.GetAwaiter().GetResult()` with try/catch logging. `SubscribeAlarmTags`, `BuildSubtree` alarm subscribes, and `RestoreTransferredSubscriptions` (batch paths) now use `.ContinueWith(OnlyOnFaulted)` to log failures instead of silently discarding tasks.
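The two patterns named in the fix can be sketched side by side -- a synchronous wait with try/catch for the critical per-item path, and `ContinueWith(OnlyOnFaulted)` for the batch paths. The `Log` helper and the always-failing `SubscribeAsync` stub are stand-ins for illustration:

```csharp
using System;
using System.Threading.Tasks;

// Stand-ins for the real logger and MXAccess subscribe call.
static void Log(string msg, Exception ex) => Console.Error.WriteLine($"{msg}: {ex.Message}");
static Task SubscribeAsync(string tag) =>
    Task.FromException(new InvalidOperationException("runtime down"));

// Critical path (SubscribeTag/UnsubscribeTag): block so the failure surfaces
// at the call site and is logged deterministically.
static void SubscribeTag(string tag)
{
    try { SubscribeAsync(tag).GetAwaiter().GetResult(); }
    catch (Exception ex) { Log($"Subscribe failed for {tag}", ex); }
}

// Batch paths (SubscribeAlarmTags, RestoreTransferredSubscriptions):
// fire-and-observe -- log faults instead of discarding the task.
static void SubscribeAlarmTags(string[] tags)
{
    foreach (var tag in tags)
        SubscribeAsync(tag).ContinueWith(
            t => Log($"Alarm subscribe failed for {tag}", t.Exception.GetBaseException()),
            TaskContinuationOptions.OnlyOnFaulted);
}

SubscribeTag("Tank1.PV");                // logs the fault instead of throwing
SubscribeAlarmTags(new[] { "Tank1.HiAlarm" });
```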
### P2: history continuation points can leak memory after expiry
Evidence:
- `src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistoryContinuationPoint.cs:23`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistoryContinuationPoint.cs:66`
- `tests/ZB.MOM.WW.LmxOpcUa.Tests/Historian/HistoryContinuationPointTests.cs:25`
Details:
- Expired continuation points are purged only from `Store()`.
- If a client requests continuation points and then never resumes or releases them, the stored `List<DataValue>` instances remain in memory until another `Store()` happens.
- Existing tests cover store/retrieve/release but do not cover expiration or reclamation.
Impact:
- A burst of abandoned history reads can retain large result sets in memory until the next `Store()` call triggers `PurgeExpired()`. On an otherwise idle system with no new history reads, this retention is indefinite.
Recommendation:
- Purge expired entries on `Retrieve()` and `Release()`.
- Consider a periodic sweep or a hard cap on stored continuation payloads.
- Add an expiry-focused test.
**Status: Resolved (2026-04-07)**
Fix: `PurgeExpired()` now called at the start of both `Retrieve()` and `Release()`. Added internal constructor accepting `TimeSpan timeout` for testability. Added two expiry-focused tests.
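The fix can be sketched as a small store that purges on every access and takes its timeout via the constructor for testability. Key type, value type, and member shapes here are illustrative, not the real `HistoryContinuationPoint` API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: expired continuation points are purged on Retrieve/Release as well
// as Store, so abandoned history reads cannot retain results indefinitely.
class ContinuationStoreSketch
{
    readonly Dictionary<Guid, (DateTime Expires, IReadOnlyList<double> Values)> _points = new();
    readonly TimeSpan _timeout;

    public ContinuationStoreSketch(TimeSpan timeout) => _timeout = timeout;  // injectable for expiry tests

    public Guid Store(IReadOnlyList<double> values)
    {
        PurgeExpired();
        var id = Guid.NewGuid();
        _points[id] = (DateTime.UtcNow + _timeout, values);
        return id;
    }

    public IReadOnlyList<double> Retrieve(Guid id)
    {
        PurgeExpired();                          // reclaim abandoned reads here too
        if (_points.TryGetValue(id, out var e)) { _points.Remove(id); return e.Values; }
        return null;
    }

    public void Release(Guid id) { PurgeExpired(); _points.Remove(id); }

    void PurgeExpired()
    {
        var now = DateTime.UtcNow;
        foreach (var id in _points.Where(p => p.Value.Expires <= now).Select(p => p.Key).ToList())
            _points.Remove(id);
    }
}
```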
### P2: the status dashboard can fail to bind and disable itself silently
Evidence:
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:53`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:64`
- `tests/ZB.MOM.WW.LmxOpcUa.Tests/Status/StatusWebServerTests.cs:30`
- `tests/ZB.MOM.WW.LmxOpcUa.Tests/Status/StatusWebServerTests.cs:149`
Observed test result:
- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StatusWebServerTests`
- Result: 9 failed, 0 passed
- Failures were all consistent with the listener not starting (`IsRunning == false`, connection refused).
Details:
- `Start()` swallows startup exceptions, logs a warning, and leaves `_listener = null`.
- The code binds `http://+:{port}/`, which is more permission-sensitive than a narrower host-specific prefix.
- Callers get no explicit failure signal, so the dashboard can simply vanish at runtime.
Impact:
- Operators and external checks can assume a dashboard exists when it does not.
- Health visibility degrades exactly when the service most needs diagnosability.
Note:
- The `http://+:{port}/` wildcard prefix requires either administrator privileges or a pre-configured URL ACL (`netsh http add urlacl`). This is also the likely cause of the 9/9 test failures — tests run without elevation will always fail to bind.
Recommendation:
- Fail fast, or at least return an explicit startup status.
- Default to `http://localhost:{port}/` unless wildcard binding is explicitly configured — this avoids the ACL requirement for single-machine deployments and fixes the test suite without special privileges.
- Add a startup test that asserts the service reports bind failures clearly.
**Status: Resolved (2026-04-07)**
Fix: Changed prefix from `http://+:{port}/` to `http://localhost:{port}/`. `Start()` now returns `bool`. Bind failure logged at Error level. Test suite now passes 9/9.
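The fixed bind behavior can be sketched with `HttpListener` directly -- a `localhost` prefix (no URL ACL needed) and an explicit `bool` result instead of a swallowed exception. The method name and log wording are illustrative:

```csharp
using System;
using System.Net;

// Sketch of the fix: bind to localhost and report failure explicitly.
static bool StartListener(int port, out HttpListener listener)
{
    listener = new HttpListener();
    listener.Prefixes.Add($"http://localhost:{port}/");   // was http://+:{port}/
    try
    {
        listener.Start();
        return true;                                      // dashboard is live
    }
    catch (HttpListenerException ex)
    {
        Console.Error.WriteLine($"Status dashboard failed to bind port {port}: {ex.Message}");
        listener = null;
        return false;                                     // caller sees the failure
    }
}
```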
### P2: blocking remote I/O is performed directly in request and rebuild paths
Evidence:
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:586`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:617`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:641`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1228`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1289`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1386`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1526`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1601`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1655`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1718`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistorianDataSource.cs:175`
Details:
- OPC UA read/write/history handlers synchronously block on MXAccess and Historian calls.
- Incremental sync also performs blocking subscribe/unsubscribe operations while holding the node-manager lock.
- Historian connection establishment uses polling plus `Thread.Sleep(250)`, so slow connects directly occupy request threads.
Impact:
- Slow runtime dependencies can starve OPC UA worker threads and make rebuilds stall the namespace lock.
- This is not just a latency issue; it turns transient backend slowness into whole-service responsiveness problems.
Recommendation:
- Move I/O out of locked sections.
- Propagate cancellation/timeouts explicitly through the request path.
- Add load/fault tests against the real async MXAccess client behavior, not only synchronous fakes.
**Status: Resolved (2026-04-07)**
Fix: Moved subscribe/unsubscribe I/O outside `lock(Lock)` in `SyncAddressSpace` and `TearDownGobjects` — bookkeeping is done under lock, actual MXAccess calls happen after the lock is released. Replaced blocking `ReadAsync` calls for alarm priority/description in the dispatch loop with cached values populated from subscription data changes via new `_alarmPriorityTags`/`_alarmDescTags` reverse lookup dictionaries. Refactored Historian `EnsureConnected`/`EnsureEventConnected` with double-check locking so `WaitForConnection` polling runs outside `_connectionLock`. OPC UA Read/Write/HistoryRead handlers remain synchronously blocking (framework constraint: `CustomNodeManager2` overrides are `void`) but `MxAccessClient.ReadAsync`/`WriteAsync` already enforce configurable timeouts (default 5s).
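The "bookkeeping under lock, I/O outside the lock" part of this fix can be sketched as below. The class, its members, and the `subscribe` delegate are illustrative stand-ins for `SyncAddressSpace` and the MXAccess calls:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch: decide what to do under the lock, then release it before touching
// the slow backend, so rebuilds cannot stall the namespace lock on I/O.
class SyncSketch
{
    readonly object Lock = new object();
    readonly HashSet<string> _subscribed = new HashSet<string>();

    public async Task SyncAsync(IEnumerable<string> desiredTags, Func<string, Task> subscribe)
    {
        var toSubscribe = new List<string>();
        lock (Lock)                               // fast: in-memory bookkeeping only
        {
            foreach (var tag in desiredTags)
                if (_subscribed.Add(tag)) toSubscribe.Add(tag);
        }
        foreach (var tag in toSubscribe)          // slow: MXAccess I/O, lock released
            await subscribe(tag);
    }
}
```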
### P3: several background loops can be started multiple times and are not joined on shutdown
Evidence:
- `src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs:58`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs:15`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:57`
Details:
- `Start()` methods overwrite cancellation tokens and launch `Task.Run(...)` without keeping the returned `Task`.
- Calling `Start()` twice leaks the earlier loop and its CTS.
- `Stop()` only cancels; it does not wait for loop completion.
Impact:
- Duplicate starts or restart paths become nondeterministic.
- Shutdown can race active loops that are still touching shared state.
Recommendation:
- Guard against duplicate starts.
- Keep the background task handle and wait for orderly exit during stop/dispose.
**Status: Resolved (2026-04-07)**
Fix: All three services (`ChangeDetectionService`, `MxAccessClient.Monitor`, `StatusWebServer`) now store the `Task` returned by `Task.Run`. `Start()` cancels+joins any previous loop before launching a new one. `Stop()` cancels the token and waits on the task with a 5-second timeout.
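The guarded start / join-on-stop pattern applied to all three services can be sketched as follows (the loop body and names are illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: keep the Task returned by Task.Run, cancel+join any previous loop
// before starting a new one, and wait (with timeout) on stop.
class LoopSketch
{
    CancellationTokenSource _cts;
    Task _loop;

    public void Start()
    {
        Stop();                                   // guard against duplicate starts
        _cts = new CancellationTokenSource();
        var token = _cts.Token;
        _loop = Task.Run(async () =>
        {
            while (!token.IsCancellationRequested)
                await Task.Delay(50, token).ContinueWith(_ => { });  // poll body goes here
        });
    }

    public void Stop()
    {
        _cts?.Cancel();
        _loop?.Wait(TimeSpan.FromSeconds(5));     // join, not fire-and-forget
        _cts?.Dispose();
        _cts = null;
        _loop = null;
    }
}
```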
### P3: startup logging exposes sensitive configuration
Evidence:
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs:71`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs:118`
Details:
- The validator logs the full Galaxy repository connection string and detailed authentication-related settings.
- In many deployments, the connection string will contain credentials.
Impact:
- Credential exposure in logs increases operational risk and complicates incident handling.
Details on scope:
- The primary exposure is `GalaxyRepository.ConnectionString` logged verbatim at `ConfigurationValidator.cs:72`. When using SQL authentication, this contains the password in the connection string.
- Historian credentials (`UserName`/`Password`) are checked for emptiness but not logged as values — this section is safe.
- LDAP `ServiceAccountDn` is checked for emptiness but not logged as a value — also safe.
Recommendation:
- Redact secrets before logging. Parse the connection string and mask or omit password segments.
- Log connection targets (server, database) and non-sensitive settings only.
**Status: Resolved (2026-04-07)**
Fix: Added `SanitizeConnectionString` helper using `SqlConnectionStringBuilder` to mask passwords with `********`. Falls back to `(unparseable)` if the string can't be parsed.
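The helper can be sketched with `DbConnectionStringBuilder` (the real fix uses `SqlConnectionStringBuilder`; the base class is used here so the sketch has no SqlClient package dependency):

```csharp
using System.Data.Common;

// Sketch of SanitizeConnectionString: mask password segments before logging,
// with the documented "(unparseable)" fallback.
static string Sanitize(string connectionString)
{
    try
    {
        var builder = new DbConnectionStringBuilder { ConnectionString = connectionString };
        foreach (var key in new[] { "Password", "Pwd" })       // keys are case-insensitive
            if (builder.ContainsKey(key)) builder[key] = "********";
        return builder.ConnectionString;
    }
    catch
    {
        return "(unparseable)";
    }
}
```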
## Test Coverage Gaps
### ~~Real async failure modes are under-tested~~ (Resolved)
`FakeMxAccessClient` now supports fault injection via `SubscribeException`, `UnsubscribeException`, `ReadException`, and `WriteException` properties. When set, the corresponding async methods return `Task.FromException`. Three tests in `LmxNodeManagerSubscriptionFaultTests` verify that subscribe/unsubscribe faults are caught and logged instead of silently discarded, and that ref-count bookkeeping survives a transient fault.
### ~~Historian lifecycle coverage is minimal~~ (Resolved)
Extracted `IHistorianConnectionFactory` abstraction from `HistorianDataSource`, with `SdkHistorianConnectionFactory` as the production implementation and `FakeHistorianConnectionFactory` for tests. Eleven lifecycle tests in `HistorianDataSourceLifecycleTests` now cover: post-dispose rejection for all four read methods, double-dispose idempotency, aggregate column mapping, connection failure (returns empty results), connection timeout (returns empty results), reconnect-after-error (factory called twice), connection failure state resilience, and dispose-after-failure safety.
### ~~Continuation-point expiry is not tested~~ (Resolved)
Two expiry tests added: `Retrieve_ExpiredContinuationPoint_ReturnsNull` and `Release_PurgesExpiredEntries`.
## Commands Run
Successful:
- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter HistoryContinuationPointTests`
- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter ChangeDetectionServiceTests`
- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StaComThreadTests`
Failed:
- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StatusWebServerTests`
Timed out:
- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore`
## Bottom Line
All findings have been resolved:
- StaComThread crash-path faulting prevents callers from hanging forever.
- Subscription tasks are no longer silently discarded — failures are caught and logged.
- Subscribe/unsubscribe I/O moved outside `lock(Lock)` in rebuild paths; alarm metadata cached from subscriptions instead of blocking reads; Historian connection polling no longer holds the connection lock.
- Dashboard binds to localhost and reports startup failures explicitly.
- Background loops guard against double-start and join on stop.
- Connection strings are sanitized before logging.
Remaining architectural note: OPC UA Read/Write/HistoryRead handlers still use `.GetAwaiter().GetResult()` because `CustomNodeManager2` overrides are synchronous. This is mitigated by the existing configurable timeouts in `MxAccessClient` (default 5s).