diff --git a/docs/HistoricalDataAccess.md b/docs/HistoricalDataAccess.md index 57987a1..6b531b6 100644 --- a/docs/HistoricalDataAccess.md +++ b/docs/HistoricalDataAccess.md @@ -1,228 +1,109 @@ # Historical Data Access -`LmxNodeManager` exposes OPC UA historical data access (HDA) through an abstract `IHistorianDataSource` interface (`Historian/IHistorianDataSource.cs`). The Wonderware Historian implementation lives in a separate assembly, `ZB.MOM.WW.OtOpcUa.Historian.Aveva`, which is loaded at runtime only when `Historian.Enabled=true`. This keeps the `aahClientManaged` SDK out of the core Host so deployments that do not need history do not need the SDK installed. +OPC UA HistoryRead is a **per-driver optional capability** in OtOpcUa. The Core dispatches HistoryRead service calls to the owning driver through the `IHistoryProvider` capability interface (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs`). Drivers that don't implement the interface return `BadHistoryOperationUnsupported` for every history call on their nodes; that is the expected behavior for protocol drivers (Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS) whose wire protocols carry no time-series data. -## Plugin Architecture +Historian integration is no longer a separate bolt-on assembly, as it was in v1 (the `ZB.MOM.WW.LmxOpcUa.Historian.Aveva` plugin). It is now an optional capability that any driver can implement. Two drivers implement it today: Galaxy, which integrates the Wonderware Historian, and OPC UA Client, which forwards HistoryRead to the upstream server. Every other driver leaves the capability unimplemented, and the Core short-circuits history calls on nodes that belong to those drivers.
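The capability-and-fallback behavior described above can be condensed into a short sketch. This is a hypothetical, language-agnostic Python model, not the actual C# `IHistoryProvider` code; all names in it are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional

BAD_HISTORY_OPERATION_UNSUPPORTED = "BadHistoryOperationUnsupported"

@dataclass
class Driver:
    """Illustrative driver record: `history` stands in for IHistoryProvider."""
    name: str
    history: Optional[Callable] = None  # protocol drivers leave this None

def history_read_raw(driver: Driver, reference: str, start: float, end: float):
    # The Core short-circuits history calls on drivers without the capability.
    if driver.history is None:
        return BAD_HISTORY_OPERATION_UNSUPPORTED, []
    return "Good", driver.history(reference, start, end)
```

A Modbus-style driver simply never sets `history`, so every HistoryRead against its nodes returns the unsupported status without touching the wire.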
-The historian surface is split across two assemblies: +## `IHistoryProvider` -- **`ZB.MOM.WW.OtOpcUa.Host`** (core) owns only OPC UA / BCL types: - - `IHistorianDataSource` -- the interface `LmxNodeManager` depends on - - `HistorianEventDto` -- SDK-free representation of a historian event record - - `HistorianAggregateMap` -- maps OPC UA aggregate NodeIds to AnalogSummary column names - - `HistorianPluginLoader` -- loads the plugin via `Assembly.LoadFrom` at startup - - `HistoryContinuationPointManager` -- paginates HistoryRead results -- **`ZB.MOM.WW.OtOpcUa.Historian.Aveva`** (plugin) owns everything SDK-bound: - - `HistorianDataSource` -- implements `IHistorianDataSource`, wraps `aahClientManaged` - - `IHistorianConnectionFactory` / `SdkHistorianConnectionFactory` -- opens and polls `ArchestrA.HistorianAccess` connections - - `AvevaHistorianPluginEntry.Create(HistorianConfiguration)` -- the static factory invoked by the loader +Four methods, mapping onto the four OPC UA HistoryRead service variants: -The plugin assembly and its SDK dependencies (`aahClientManaged.dll`, `aahClient.dll`, `aahClientCommon.dll`, `Historian.CBE.dll`, `Historian.DPAPI.dll`, `ArchestrA.CloudHistorian.Contract.dll`) deploy to a `Historian/` subfolder next to `ZB.MOM.WW.OtOpcUa.Host.exe`. See [Service Hosting](ServiceHosting.md#required-runtime-assemblies) for the full layout and deployment matrix. +| Method | OPC UA service | Notes | +|--------|----------------|-------| +| `ReadRawAsync` | HistoryReadRawModified (raw subset) | Returns `HistoryReadResult { Samples, ContinuationPoint? }`. The Core handles `ContinuationPoint` pagination. | +| `ReadProcessedAsync` | HistoryReadProcessed | Takes a `HistoryAggregateType` (Average / Minimum / Maximum / Total / Count) and a bucket `interval`. Drivers that can't express an aggregate throw `NotSupportedException`; the Core translates that into `BadAggregateNotSupported`. 
| +| `ReadAtTimeAsync` | HistoryReadAtTime | Default implementation throws `NotSupportedException` — drivers without interpolation / prior-boundary support leave the default. | +| `ReadEventsAsync` | HistoryReadEvents | Historical alarm/event rows, distinct from the live `IAlarmSource` stream. Default throws; only drivers with an event historian (Galaxy's A&E log) override. | -## Plugin Loading +Supporting DTOs live alongside the interface in `Core.Abstractions`: -When the service starts with `Historian.Enabled=true`, `OpcUaService` calls `HistorianPluginLoader.TryLoad(config)`. The loader: +- `HistoryReadResult(IReadOnlyList Samples, byte[]? ContinuationPoint)` +- `HistoryAggregateType` — enum `{ Average, Minimum, Maximum, Total, Count }` +- `HistoricalEvent(EventId, SourceName?, EventTimeUtc, ReceivedTimeUtc, Message?, Severity)` +- `HistoricalEventsResult(IReadOnlyList Events, byte[]? ContinuationPoint)` -1. Probes `AppDomain.CurrentDomain.BaseDirectory\Historian\ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll`. -2. Installs a one-shot `AppDomain.AssemblyResolve` handler that redirects any `aahClientManaged`/`aahClientCommon`/`Historian.*` lookups to the same subfolder, so the CLR can resolve SDK dependencies when the plugin first JITs. -3. Calls the plugin's `AvevaHistorianPluginEntry.Create(HistorianConfiguration)` via reflection and returns the resulting `IHistorianDataSource`. -4. On any failure (plugin missing, entry type not found, SDK assembly unresolvable, bad image), logs a warning with the expected plugin path and returns `null`. The server starts normally and `LmxNodeManager` returns `BadHistoryOperationUnsupported` for every history call. +## Dispatch through `CapabilityInvoker` -## Wonderware Historian SDK +All four HistoryRead surfaces are wrapped by `CapabilityInvoker` (`Core/Resilience/CapabilityInvoker.cs`) with `DriverCapability.HistoryRead`. 
The Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability.HistoryRead)` provides timeout, circuit-breaker, and bulkhead defaults per the driver's stability tier (see [docs/v2/driver-stability.md](v2/driver-stability.md)). -The plugin uses the AVEVA Historian managed SDK (`aahClientManaged.dll`) to query historical data. The SDK provides a cursor-based query API through `ArchestrA.HistorianAccess`, replacing direct SQL queries against the Historian Runtime database. Two query types are used: +The dispatch point is `DriverNodeManager` in `ZB.MOM.WW.OtOpcUa.Server`. When the OPC UA stack calls `HistoryRead`, the node manager: -- **`HistoryQuery`** -- Raw historical samples with timestamp, value (numeric or string), and OPC quality. -- **`AnalogSummaryQuery`** -- Pre-computed aggregates with properties for Average, Minimum, Maximum, ValueCount, First, Last, StdDev, and more. +1. Resolves the target `NodeHandle` to a `(DriverInstanceId, fullReference)` pair. +2. Checks the owning driver's `DriverTypeMetadata` to see if the type may advertise history at all (fast reject for types that never implement `IHistoryProvider`). +3. If the driver instance implements `IHistoryProvider`, wraps the `ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` call in `CapabilityInvoker.InvokeAsync(... DriverCapability.HistoryRead ...)`. +4. Translates the `HistoryReadResult` into an OPC UA `HistoryData` + `ExtensionObject`. +5. Manages the continuation point via `HistoryContinuationPointManager` so clients can page through large result sets. -The SDK DLLs are located in `lib/` and originate from `C:\Program Files (x86)\Wonderware\Historian\`. Only the plugin project (`src/ZB.MOM.WW.OtOpcUa.Historian.Aveva/`) references them at build time; the core Host project does not. +Driver-level history code never sees the continuation-point protocol or the OPC UA stack types — those stay in the Core. 
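The five dispatch steps above can be sketched as a small Python model. Every name here is hypothetical; the real dispatcher is the C# `DriverNodeManager`, and the policy wrapping, `HistoryData` packing, and continuation-point handling are elided.

```python
class FakeDriver:
    """Stand-in for a registered driver instance (illustrative only)."""
    def __init__(self, may_have_history, provider=None):
        self.may_have_history = may_have_history  # from DriverTypeMetadata
        self.history = provider                   # IHistoryProvider stand-in

def history_read(handles, drivers, node_id, start, end):
    if node_id not in handles:
        return ("BadNodeIdUnknown", None)
    driver_id, reference = handles[node_id]           # step 1: resolve NodeHandle
    driver = drivers[driver_id]
    if not driver.may_have_history:                   # step 2: type-metadata fast reject
        return ("BadHistoryOperationUnsupported", None)
    if driver.history is None:                        # step 3: capability check
        return ("BadHistoryOperationUnsupported", None)
    samples = driver.history(reference, start, end)   # step 3: policy-wrapped in the real Core
    return ("Good", samples)                          # steps 4-5: packing/paging elided

drivers = {
    "galaxy": FakeDriver(True, lambda r, s, e: [(s, 1.0), (e, 2.0)]),
    "modbus": FakeDriver(False),
}
handles = {"ns=2;s=Tank1.Level": ("galaxy", "Tank1.Level"),
           "ns=3;s=HR1": ("modbus", "40001")}
```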
-## Configuration +## Driver coverage -`HistorianConfiguration` controls the SDK connection: +| Driver | Implements `IHistoryProvider`? | Source | +|--------|:------------------------------:|--------| +| Galaxy | Yes — raw, processed, at-time, events | `aahClientManaged` SDK (Wonderware Historian) on the Host side, forwarded through the Proxy's IPC | +| OPC UA Client | Yes — raw, processed, at-time, events (forwarded to upstream) | `Opc.Ua.Client.Session.HistoryRead` against the remote server | +| Modbus | No | Wire protocol has no time-series concept | +| Siemens S7 | No | S7comm has no time-series concept | +| AB CIP | No | CIP has no time-series concept | +| AB Legacy | No | PCCC has no time-series concept | +| TwinCAT | No | ADS symbol reads are point-in-time; archiving is an external concern | +| FOCAS | No | Default — FOCAS has no general-purpose historian API | -```csharp -public class HistorianConfiguration -{ - public bool Enabled { get; set; } = false; - public string ServerName { get; set; } = "localhost"; - public List ServerNames { get; set; } = new(); - public int FailureCooldownSeconds { get; set; } = 60; - public bool IntegratedSecurity { get; set; } = true; - public string? UserName { get; set; } - public string? Password { get; set; } - public int Port { get; set; } = 32568; - public int CommandTimeoutSeconds { get; set; } = 30; - public int MaxValuesPerRead { get; set; } = 10000; - public int RequestTimeoutSeconds { get; set; } = 60; -} -``` +## Galaxy — Wonderware Historian (`aahClientManaged`) -When `Enabled` is `false`, `HistorianPluginLoader.TryLoad` is not called, no plugin is loaded, and the node manager returns `BadHistoryOperationUnsupported` for history read requests. When `Enabled` is `true` but the plugin cannot be loaded (missing `Historian/` subfolder, SDK assembly resolve failure, etc.), the server still starts and returns the same `BadHistoryOperationUnsupported` status with a warning in the log. 
+The Galaxy driver's `IHistoryProvider` implementation lives on the Host side (`.NET 4.8 x86`) in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Historian/`. The Proxy's `GalaxyProxyDriver.ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` each serializes a `HistoryRead*Request` and awaits the matching `HistoryRead*Response` over the named pipe (see [drivers/Galaxy.md](drivers/Galaxy.md#ipc-transport)). -### Connection Properties +Host-side, `HistorianDataSource` uses the AVEVA Historian managed SDK (`aahClientManaged.dll`) to query historical data via a cursor-based API through `ArchestrA.HistorianAccess`: -| Property | Default | Description | -|---|---|---| -| `ServerName` | `localhost` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments | -| `ServerNames` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover (see [Cluster Failover](#read-only-cluster-failover)) | -| `FailureCooldownSeconds` | `60` | How long a failed cluster node is skipped before being re-tried. Zero means no cooldown (retry on every request) | -| `IntegratedSecurity` | `true` | Use Windows authentication | -| `UserName` | `null` | Username when `IntegratedSecurity` is false | -| `Password` | `null` | Password when `IntegratedSecurity` is false | -| `Port` | `32568` | Historian TCP port | -| `CommandTimeoutSeconds` | `30` | SDK packet timeout in seconds (inner async bound) | -| `RequestTimeoutSeconds` | `60` | Outer safety timeout applied to sync-over-async history reads on the OPC UA stack thread. Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Should be greater than `CommandTimeoutSeconds`. 
Stability review 2026-04-13 Finding 3 | -| `MaxValuesPerRead` | `10000` | Maximum values per history read request | +- **`HistoryQuery`** — raw historical samples (timestamp, value, OPC quality) +- **`AnalogSummaryQuery`** — pre-computed aggregates (Average, Minimum, Maximum, ValueCount, First, Last, StdDev) -## Connection Lifecycle +The SDK DLLs are pulled into the Galaxy.Host project at build time; the Server and every other driver project remain SDK-free. -`HistorianDataSource` (in the plugin assembly) maintains a persistent connection to the Historian server via `ArchestrA.HistorianAccess`: +> **Gap / status note.** The raw SDK wrapper (`HistorianDataSource`, `HistorianClusterEndpointPicker`, `HistorianHealthSnapshot`, etc.) has been ported from the v1 `ZB.MOM.WW.LmxOpcUa.Historian.Aveva` plugin into `Driver.Galaxy.Host/Backend/Historian/`. The **IPC wire-up** — `HistoryReadRequest` / `HistoryReadResponse` message kinds, Proxy-side `ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` forwarding — is in place on `GalaxyProxyDriver`. Two pieces remain: the Host-side **mapping of `HistoryAggregateType` onto the `AnalogSummaryQuery` column names** (the Proxy-side equivalent already exists in `GalaxyProxyDriver.MapAggregateToColumn`; the Host side must mirror it), and the **end-to-end integration test** that the v1 plugin suite used to provide. Until both land, history calls against Galaxy may surface `GalaxyIpcException { Code = "not-implemented" }` or backend-specific errors rather than populated `HistoryReadResult`s. Track the remaining work against the Phase 2 Galaxy out-of-process gate in `docs/v2/plan.md`. -1. **Lazy connect** -- The connection is established on the first query via `EnsureConnected()`. When a cluster is configured, the data source iterates `HistorianClusterEndpointPicker.GetHealthyNodes()` in order and returns the first node that successfully connects. -2.
**Connection reuse** -- Subsequent queries reuse the same connection. The active node is tracked in `_activeProcessNode` / `_activeEventNode` and surfaced on the dashboard. -3. **Auto-reconnect** -- On connection failure, the connection is disposed, the active node is marked failed in the picker, and the next query re-enters the picker loop to try the next eligible candidate. -4. **Clean shutdown** -- `Dispose()` closes the connection when the service stops. +### Aggregate function mapping -The connection is opened with `ReadOnly = true` and `ConnectionType = Process`. The event (alarm history) path uses a separate connection with `ConnectionType = Event`, but both silos share the same cluster picker so a node that fails on one silo is immediately skipped on the other. +`GalaxyProxyDriver.MapAggregateToColumn` (Proxy-side) translates the OPC UA Part 13 standard aggregate enum onto `AnalogSummaryQuery` column names consumed by `HistorianDataSource.ReadAggregateAsync`: -## Read-Only Cluster Failover +| `HistoryAggregateType` | Result Property | +|------------------------|-----------------| +| `Average` | `Average` | +| `Minimum` | `Minimum` | +| `Maximum` | `Maximum` | +| `Count` | `ValueCount` | -When `HistorianConfiguration.ServerNames` is non-empty, the plugin picks from an ordered list of cluster nodes instead of a single `ServerName`. Each connection attempt tries candidates in configuration order until one succeeds. Failed nodes are placed into a timed cooldown and re-admitted when the cooldown elapses. +`HistoryAggregateType.Total` is **not supported** by Wonderware `AnalogSummary` and raises `NotSupportedException`, which the Core translates to `BadAggregateNotSupported`. Additional OPC UA aggregates (`Start`, `End`, `StandardDeviationPopulation`) sit on the Historian columns `First`, `Last`, `StdDev` and can be exposed by extending the enum + mapping together. 
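The aggregate-to-column translation described above reduces to a lookup with an explicit unsupported path. This is a Python stand-in for the C# `MapAggregateToColumn`, with `NotImplementedError` playing the role of `NotSupportedException`; the mapping entries follow the table above.

```python
ANALOG_SUMMARY_COLUMNS = {
    "Average": "Average",
    "Minimum": "Minimum",
    "Maximum": "Maximum",
    "Count": "ValueCount",
    # Candidate future aggregates, per the Historian columns named above:
    # "Start": "First", "End": "Last", "StandardDeviationPopulation": "StdDev"
}

def map_aggregate_to_column(aggregate: str) -> str:
    column = ANALOG_SUMMARY_COLUMNS.get(aggregate)
    if column is None:
        # e.g. Total: AnalogSummary has no matching column; the Core turns
        # this exception into BadAggregateNotSupported.
        raise NotImplementedError(f"aggregate {aggregate!r} is not supported")
    return column
```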
-### HistorianClusterEndpointPicker +### Read-only cluster failover -The picker (in the plugin assembly, internal) is pure logic with no SDK dependency — all cluster behavior is unit-testable with a fake clock and scripted factory. Key characteristics: +`HistorianConfiguration.ServerNames` accepts an ordered list of cluster nodes. `HistorianClusterEndpointPicker` iterates the list in configuration order, marks failed nodes with a `FailureCooldownSeconds` window, and re-admits them when the cooldown elapses. One picker instance is shared by the process-values connection and the event-history connection (two SDK silos), so a node failure on one silo immediately benches it for the other. `FailureCooldownSeconds = 0` disables the cooldown — the SDK's own retry semantics are the sole gate. -- **Ordered iteration**: nodes are tried in the exact order they appear in `ServerNames`. Operators can express a preference ("primary first, fallback second") by ordering the list. -- **Per-node cooldown**: `MarkFailed(node, error)` starts a `FailureCooldownSeconds` window during which the node is skipped from `GetHealthyNodes()`. `MarkHealthy(node)` clears the window immediately (used on successful connect). -- **Automatic re-admission**: when a node's cooldown elapses, the next call to `GetHealthyNodes()` includes it automatically — no background probe, no manual reset. The cumulative `FailureCount` and `LastError` are retained for operator diagnostics. -- **Thread-safe**: a single lock guards the per-node state. Operations are microsecond-scale so contention is a non-issue. -- **Shared across silos**: one picker instance is shared by the process-values connection and the event-history connection, so a node failure on one path immediately benches it for the other. -- **Zero cooldown mode**: `FailureCooldownSeconds = 0` disables the cooldown entirely — the node is never benched. Useful for tests or for operators who want the SDK's own retry semantics to be the sole gate. 
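The cooldown behavior described above can be modeled with an injectable clock, which is what makes it unit-testable. This is an illustrative sketch, not the actual `HistorianClusterEndpointPicker` code.

```python
import time

class ClusterEndpointPicker:
    """Ordered-iteration picker with per-node failure cooldown (sketch)."""
    def __init__(self, nodes, cooldown_seconds, clock=time.monotonic):
        self._nodes = list(nodes)      # candidates tried in configuration order
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._benched_until = {}       # node -> clock time of re-admission

    def get_healthy_nodes(self):
        now = self._clock()
        return [n for n in self._nodes if self._benched_until.get(n, 0.0) <= now]

    def mark_failed(self, node):
        if self._cooldown > 0:         # zero cooldown: node is never benched
            self._benched_until[node] = self._clock() + self._cooldown

    def mark_healthy(self, node):
        self._benched_until.pop(node, None)  # immediate re-admission on success
```

Sharing one picker instance between the process and event silos gives the "fail on one silo, benched for both" behavior for free, since both silos consult the same `_benched_until` state.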
+Host-side cluster health is surfaced via `HistorianHealthSnapshot { NodeCount, HealthyNodeCount, ActiveProcessNode, ActiveEventNode, Nodes }` and forwarded to the Proxy so the Admin UI Historian panel can render a per-node table. `HealthCheckService` flips overall service health to `Degraded` when `HealthyNodeCount < NodeCount`. -### Connection attempt flow +### Runtime health counters -`HistorianDataSource.ConnectToAnyHealthyNode(HistorianConnectionType)` performs the actual iteration: +`HistorianDataSource` maintains per-read counters — `TotalQueries`, `TotalSuccesses`, `TotalFailures`, `ConsecutiveFailures`, `LastSuccessTime`, `LastFailureTime`, `LastError`, `ProcessConnectionOpen`, `EventConnectionOpen` — so the dashboard can distinguish "backend loaded but never queried" from "backend loaded and queries are failing". `LastError` is prefixed with the read path (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which silo is broken. `HealthCheckService` degrades at `ConsecutiveFailures >= 3`. -1. Snapshot healthy nodes from the picker. If empty, throw `InvalidOperationException` with either "No historian nodes configured" or "All N historian nodes are in cooldown". -2. For each candidate, clone `HistorianConfiguration` with the candidate as `ServerName` and pass it to the factory. On success: `MarkHealthy(node)` and return the `(Connection, Node)` tuple. On exception: `MarkFailed(node, ex.Message)`, log a warning, continue. -3. If all candidates fail, wrap the last inner exception in an `InvalidOperationException` with the cumulative failure count so the existing read-method catch blocks surface a meaningful error through the health counters. +### Quality mapping -The wrapping exception intentionally includes the last inner error message in the outer `Message` so the health snapshot's `LastError` field is still human-readable when the cluster exhausts every candidate. 
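The runtime health-counter discipline described above amounts to a small state machine. The field names below mirror the prose; this is an assumed sketch, not the C# source.

```python
class HistorianHealthCounters:
    """Per-read success/failure bookkeeping (illustrative sketch)."""
    DEGRADE_AT = 3  # HealthCheckService degrades at ConsecutiveFailures >= 3

    def __init__(self):
        self.total_queries = 0
        self.total_successes = 0
        self.total_failures = 0
        self.consecutive_failures = 0
        self.last_error = None

    def record_success(self):
        self.total_queries += 1
        self.total_successes += 1
        self.consecutive_failures = 0   # first success resets the latch
        self.last_error = None

    def record_failure(self, read_path, message):
        self.total_queries += 1
        self.total_failures += 1
        self.consecutive_failures += 1
        self.last_error = f"{read_path}: {message}"  # e.g. "raw: SDK timeout"

    @property
    def degraded(self):
        return self.consecutive_failures >= self.DEGRADE_AT
```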
- -### Single-node backward compatibility - -When `ServerNames` is empty, the picker is seeded with a single entry from `ServerName` and the iteration loop still runs — it just has one candidate. Legacy deployments see no behavior change: the picker marks the single node healthy on success, runs the same cooldown logic on failure, and the dashboard renders a compact `Node: ` line instead of the cluster table. - -### Cluster health surface - -Runtime cluster state is exposed on `HistorianHealthSnapshot`: - -- `NodeCount` / `HealthyNodeCount` -- size of the configured cluster and how many are currently eligible. -- `ActiveProcessNode` / `ActiveEventNode` -- which nodes are currently serving the two connection silos, or `null` when a silo has no open connection. -- `Nodes: List` -- per-node state with `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime`. - -The dashboard renders this as a cluster table when `NodeCount > 1`. See [Status Dashboard](StatusDashboard.md#historian). `HealthCheckService` flips the overall service health to `Degraded` when `HealthyNodeCount < NodeCount` so operators can alert on a partially-failed cluster even while queries are still succeeding via the remaining nodes. - -## Runtime Health Counters - -`HistorianDataSource` maintains runtime query counters updated on every read method exit — success or failure — so the dashboard can distinguish "plugin loaded but never queried" from "plugin loaded and queries are failing". The load-time `HistorianPluginLoader.LastOutcome` only reports whether the assembly resolved at startup; it cannot catch a connection that succeeds at boot and degrades later. - -### Counters - -- `TotalQueries` / `TotalSuccesses` / `TotalFailures` — cumulative since startup. Every call to `RecordSuccess` or `RecordFailure` in the read methods updates these under `_healthLock`. 
Empty result sets count as successes — the counter reflects "the SDK call returned" rather than "the SDK call returned data". -- `ConsecutiveFailures` — latches while queries are failing; reset to zero by the first success. Drives `HealthCheckService` degradation at threshold 3. -- `LastSuccessTime` / `LastFailureTime` — UTC timestamps of the most recent success or failure, or `null` when no query of that outcome has occurred yet. -- `LastError` — exception message from the most recent failure, prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call is broken. Cleared on the next success. -- `ProcessConnectionOpen` / `EventConnectionOpen` — whether the plugin currently holds an open SDK connection on each silo. Read from the data source's `_connection` / `_eventConnection` fields via a `Volatile.Read`. - -These fields are read once per dashboard refresh via `IHistorianDataSource.GetHealthSnapshot()` and serialized into `HistorianStatusInfo`. See [Status Dashboard](StatusDashboard.md#historian) for the HTML/JSON surface. - -### Two SDK connection silos - -The plugin maintains two independent `ArchestrA.HistorianAccess` connections, one per `HistorianConnectionType`: - -- **Process connection** (`ConnectionType = Process`) — serves historical *value* queries: `ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`. This is the SDK's query channel for tags stored in the Historian runtime. -- **Event connection** (`ConnectionType = Event`) — serves historical *event/alarm* queries: `ReadEventsAsync`. The SDK requires a separately opened connection for its event store because the query API and wire schema are distinct from value queries. - -Both connections are lazy: they open on the first query that needs them. Either can be open, closed, or open against a different cluster node than the other. 
The dashboard renders both independently in the Historian panel (`Process Conn: open (host-a) | Event Conn: closed`) so operators can tell which silos are active and which node is serving each. When cluster support is configured, both silos share the same `HistorianClusterEndpointPicker`, so a failure on one silo marks the node unhealthy for the other as well. - -## Raw Reads - -`IHistorianDataSource.ReadRawAsync` (plugin implementation) uses a `HistoryQuery` to retrieve individual samples within a time range: - -1. Create a `HistoryQuery` via `_connection.CreateHistoryQuery()` -2. Configure `HistoryQueryArgs` with `TagNames`, `StartDateTime`, `EndDateTime`, and `RetrievalMode = Full` -3. Iterate: `StartQuery` -> `MoveNext` loop -> `EndQuery` - -Each result row is converted to an OPC UA `DataValue`: - -- `QueryResult.Value` (double) takes priority; `QueryResult.StringValue` is used as fallback for string-typed tags. -- `SourceTimestamp` and `ServerTimestamp` are both set to `QueryResult.StartDateTime`. -- `StatusCode` is mapped from the `QueryResult.OpcQuality` (UInt16) via `QualityMapper` (the same OPC DA quality byte mapping used for live MXAccess data). - -## Aggregate Reads - -`IHistorianDataSource.ReadAggregateAsync` (plugin implementation) uses an `AnalogSummaryQuery` to retrieve pre-computed aggregates: - -1. Create an `AnalogSummaryQuery` via `_connection.CreateAnalogSummaryQuery()` -2. Configure `AnalogSummaryQueryArgs` with `TagNames`, `StartDateTime`, `EndDateTime`, and `Resolution` (milliseconds) -3. Iterate the same `StartQuery` -> `MoveNext` -> `EndQuery` pattern -4. Extract the requested aggregate from named properties on `AnalogSummaryQueryResult` - -Null aggregate values return `BadNoData` status rather than `Good` with a null variant. - -## Quality Mapping - -The Historian SDK returns standard OPC DA quality values in `QueryResult.OpcQuality` (UInt16). 
The low byte is passed through the shared `QualityMapper` pipeline (`MapFromMxAccessQuality` -> `MapToOpcUaStatusCode`), which maps the OPC DA quality families to OPC UA status codes: +The Historian SDK returns standard OPC DA quality values in `QueryResult.OpcQuality` (UInt16). The low byte flows through the shared `QualityMapper` pipeline (`MapFromMxAccessQuality` → `MapToOpcUaStatusCode`): | OPC Quality Byte | OPC DA Family | OPC UA StatusCode | -|---|---|---| +|------------------|---------------|-------------------| | 0-63 | Bad | `Bad` (with sub-code when an exact enum match exists) | | 64-191 | Uncertain | `Uncertain` (with sub-code when an exact enum match exists) | | 192+ | Good | `Good` (with sub-code when an exact enum match exists) | -See `Domain/QualityMapper.cs` and `Domain/Quality.cs` for the full mapping table and sub-code definitions. +See `Domain/QualityMapper.cs` and `Domain/Quality.cs` in `Driver.Galaxy.Host` for the full table. -## Aggregate Function Mapping +## OPC UA Client — upstream forwarding -`HistorianAggregateMap.MapAggregateToColumn` (in the core Host assembly, so the node manager can validate aggregate support without requiring the plugin to be loaded) translates OPC UA aggregate NodeIds to `AnalogSummaryQueryResult` property names: +The OPC UA Client driver (`Driver.OpcUaClient`) implements `IHistoryProvider` by forwarding each call to the upstream server via `Session.HistoryRead`. Raw / processed / at-time / events map onto the stack's native HistoryRead details types. Continuation points are passed through — the Core's `HistoryContinuationPointManager` treats the driver as an opaque pager. 
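"Opaque pager" here means the Core only stores and replays a token it does not interpret. A minimal sketch of that contract (an assumed design, not the actual `HistoryContinuationPointManager`):

```python
import secrets

class ContinuationPointManager:
    """Pages a result list and hands clients an opaque resume token (sketch)."""
    def __init__(self):
        self._resume_offsets = {}

    def page(self, samples, max_per_read, token=None):
        offset = self._resume_offsets.pop(token, 0) if token else 0
        chunk = samples[offset:offset + max_per_read]
        next_offset = offset + len(chunk)
        if next_offset < len(samples):
            new_token = secrets.token_bytes(8)        # opaque to the client
            self._resume_offsets[new_token] = next_offset
            return chunk, new_token
        return chunk, None                            # None: no more data
```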
-| OPC UA Aggregate | Result Property | -|---|---| -| `AggregateFunction_Average` | `Average` | -| `AggregateFunction_Minimum` | `Minimum` | -| `AggregateFunction_Maximum` | `Maximum` | -| `AggregateFunction_Count` | `ValueCount` | -| `AggregateFunction_Start` | `First` | -| `AggregateFunction_End` | `Last` | -| `AggregateFunction_StandardDeviationPopulation` | `StdDev` | +## Historizing flag and AccessLevel -Unsupported aggregates return `null`, which causes the node manager to return `BadAggregateNotSupported`. - -## HistoryReadRawModified Override - -`LmxNodeManager` overrides `HistoryReadRawModified` to handle raw history read requests: - -1. Resolve the `NodeHandle` to a tag reference via `_nodeIdToTagReference`. Return `BadNodeIdUnknown` if not found. -2. Check that `_historianDataSource` is not null. Return `BadHistoryOperationUnsupported` if historian is disabled. -3. Call `ReadRawAsync` with the time range and `NumValuesPerNode` from the `ReadRawModifiedDetails`. -4. Pack the resulting `DataValue` list into a `HistoryData` object and wrap it in an `ExtensionObject` for the `HistoryReadResult`. - -## HistoryReadProcessed Override - -`HistoryReadProcessed` handles aggregate history requests with additional validation: - -1. Resolve the node and check historian availability (same as raw). -2. Validate that `AggregateType` is present in the `ReadProcessedDetails`. Return `BadAggregateListMismatch` if empty. -3. Map the requested aggregate to a result property via `MapAggregateToColumn`. Return `BadAggregateNotSupported` if unmapped. -4. Call `ReadAggregateAsync` with the time range, `ProcessingInterval`, and property name. -5. Return results in the same `HistoryData` / `ExtensionObject` format. 
- -## Historizing Flag and AccessLevel - -During variable node creation in `CreateAttributeVariable`, attributes with `IsHistorized == true` receive two additional settings: +During variable node creation, drivers that advertise history set: ```csharp if (attr.IsHistorized) @@ -230,7 +111,13 @@ if (attr.IsHistorized) variable.Historizing = attr.IsHistorized; ``` -- **`Historizing = true`** -- Tells OPC UA clients that this node has historical data available. -- **`AccessLevels.HistoryRead`** -- Enables the `HistoryRead` access bit on the node, which the OPC UA stack checks before routing history requests to the node manager override. Nodes without this bit set will be rejected by the framework before reaching `HistoryReadRawModified` or `HistoryReadProcessed`. +- **`Historizing = true`** — tells OPC UA clients that the node has historical data available. +- **`AccessLevels.HistoryRead`** — enables the `HistoryRead` access bit. The OPC UA stack checks this bit before routing history requests to the Core dispatcher; nodes without it are rejected before reaching `IHistoryProvider`. -The `IsHistorized` flag originates from the Galaxy repository database query, which checks whether the attribute has Historian logging configured. +The `IsHistorized` flag originates in the driver's discovery output. For Galaxy it comes from the repository query detecting a `HistoryExtension` primitive (see [drivers/Galaxy-Repository.md](drivers/Galaxy-Repository.md)). For OPC UA Client it is copied from the upstream server's `Historizing` property. + +## Configuration + +Driver-specific historian config lives in each driver's `DriverConfig` JSON blob, validated against the driver type's `DriverConfigJsonSchema` in `DriverTypeRegistry`. 
The Galaxy driver's historian section carries the fields exercised by `HistorianConfiguration` — `ServerName` / `ServerNames`, `FailureCooldownSeconds`, `IntegratedSecurity` / `UserName` / `Password`, `Port` (default `32568`), `CommandTimeoutSeconds`, `RequestTimeoutSeconds`, `MaxValuesPerRead`. The OPC UA Client driver inherits its timeouts from the upstream session. + +See [Configuration.md](Configuration.md) for the schema shape and validation path. diff --git a/docs/MxAccessBridge.md b/docs/MxAccessBridge.md deleted file mode 100644 index f81ae66..0000000 --- a/docs/MxAccessBridge.md +++ /dev/null @@ -1,166 +0,0 @@ -# MXAccess Bridge - -The MXAccess bridge connects the OPC UA server to the AVEVA System Platform runtime through the `ArchestrA.MxAccess` COM API. It handles all COM threading requirements, translates between OPC UA read/write requests and MXAccess operations, and manages connection health. - -## STA Thread Requirement - -MXAccess is a COM-based API that requires a Single-Threaded Apartment (STA). All COM objects -- `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls -- must execute on the same STA thread. Calling COM objects from the wrong thread causes marshalling failures or silent data corruption. - -`StaComThread` provides a dedicated STA thread with the apartment state set before the thread starts: - -```csharp -_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true }; -_thread.SetApartmentState(ApartmentState.STA); -``` - -Work items are queued via `RunAsync(Action)` or `RunAsync(Func)`, which enqueue the work to a `ConcurrentQueue` and post a `WM_APP` message to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread. - -## Win32 Message Pump - -COM callbacks (like `OnDataChange`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump using P/Invoke: - -1. 
`PeekMessage` primes the message queue (required before `PostThreadMessage` works) -2. `GetMessage` blocks until a message arrives -3. `WM_APP` messages drain the work queue -4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop -5. All other messages are passed through `TranslateMessage`/`DispatchMessage` for COM callback delivery - -Without this message pump, MXAccess COM callbacks would never fire and the server would receive no live data. - -## LMXProxyServer COM Object - -`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface. This abstraction allows unit tests to substitute a fake proxy without requiring the ArchestrA runtime. - -The COM object lifecycle: - -1. **`Register(clientName)`** -- Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, and calls `Register` to obtain a connection handle -2. **`Unregister(handle)`** -- Unwires event handlers, calls `Unregister`, and releases the COM object via `Marshal.ReleaseComObject` - -## Register/AddItem/AdviseSupervisory Pattern - -Every MXAccess data operation follows a three-step pattern, all executed on the STA thread: - -1. **`AddItem(handle, address)`** -- Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle -2. **`AdviseSupervisory(handle, itemHandle)`** -- Subscribes the item for supervisory data change callbacks -3. The runtime begins delivering `OnDataChange` events for the item - -For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value to the runtime. The `OnWriteComplete` callback confirms or rejects the write. - -Cleanup reverses the pattern: `UnAdviseSupervisory` then `RemoveItem`. - -## OnDataChange and OnWriteComplete Callbacks - -### OnDataChange - -Fired by the COM runtime on the STA thread when a subscribed tag value changes. 
The handler in `MxAccessClient.EventHandlers.cs`: - -1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress` -2. Maps the MXAccess quality code to the internal `Quality` enum -3. Checks `MXSTATUS_PROXY` for error details and adjusts quality accordingly -4. Converts the timestamp to UTC -5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to: - - The stored per-tag subscription callback - - Any pending one-shot read completions - - The global `OnTagValueChanged` event (consumed by `LmxNodeManager`) - -### OnWriteComplete - -Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource` for the item handle. If `MXSTATUS_PROXY.success == 0`, the write is considered failed and the error detail is logged. - -## Reconnection Logic - -`MxAccessClient` implements automatic reconnection through two mechanisms: - -### Monitor loop - -`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle: - -- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync` -- If connected and a probe tag is configured, it checks the probe staleness threshold - -### Reconnect sequence - -`ReconnectAsync` performs a full disconnect-then-connect cycle: - -1. Increment the reconnect counter -2. `DisconnectAsync` -- Tears down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detaches COM event handlers, calls `Unregister`, and clears all handle mappings -3. `ConnectAsync` -- Creates a fresh `LMXProxyServer`, registers, replays all stored subscriptions, and re-subscribes the probe tag - -Stored subscriptions (`_storedSubscriptions`) persist across reconnects. When `ConnectAsync` succeeds, `ReplayStoredSubscriptionsAsync` iterates all stored entries and calls `AddItem` + `AdviseSupervisory` for each. 
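The replay step can be sketched as follows. This is a minimal illustration, not the real implementation: `SubscriptionReplayer` and the two delegate parameters are hypothetical names standing in for `_storedSubscriptions` and the `AddItem`/`AdviseSupervisory` calls, which in the real client run on the STA thread via `RunAsync`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative sketch: stored subscriptions survive a disconnect and are
// replayed (AddItem + AdviseSupervisory) after a fresh Register.
public sealed class SubscriptionReplayer
{
    // address -> per-tag data-change callback
    private readonly ConcurrentDictionary<string, Action<object>> _stored = new();

    public void Store(string address, Action<object> callback) => _stored[address] = callback;

    // addItem resolves a tag address to an item handle; advise subscribes it.
    public async Task<int> ReplayAsync(Func<string, Task<int>> addItem, Func<int, Task> advise)
    {
        var replayed = 0;
        foreach (var entry in _stored)
        {
            var itemHandle = await addItem(entry.Key);
            await advise(itemHandle);
            replayed++;
        }
        return replayed;
    }
}
```

The point of the shape is that the callback map is the durable state; the item handles are disposable and re-minted on every reconnect.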
- -## Probe Tag Health Monitoring - -A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange` callback. - -The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`. If the probe value has not updated within the threshold, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data. - -## Per-Host Runtime Status Probes (`.ScanState`) - -Separate from the connection-level probe above, the bridge advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the dashboard can report "this specific Platform / AppEngine is off scan" and the bridge can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MxAccess from serving stale Good-quality cached values to clients who read those tags while the host is down. - -Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](Configuration.md#mxaccess) for the two config fields. - -### How it works - -`GalaxyRuntimeProbeManager` is owned by `LmxNodeManager` and operates on a simple three-state machine per host (Unknown / Running / Stopped): - -1. **Discovery** — After `BuildAddressSpace` completes, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `.ScanState` on each one. Probes are bridge-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff. -2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. 
Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means **Stopped**. -3. **On-change-only delivery** — `ScanState` is delivered **only when the value actually changes**. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts. -4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped." -5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Without this rollback, a failed subscribe would leave the entry in `Unknown` forever, and `Tick()` would later transition it to `Stopped` after the unknown-resolution timeout, fanning out a **false-negative** host-down signal that invalidates the subtree of a host that was never actually advised. Stability review 2026-04-13 Finding 1. - -### Subtree quality invalidation on transition - -When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. The reverse happens on **Stopped → Running**: `ClearHostVariablesBadQuality` resets each to `Good` and lets subsequent on-change MxAccess updates repopulate the values. 
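The Mark/Clear walk can be sketched with a plain dictionary and a stand-in variable type. This is illustrative only: `VarNode` and `HostedVariablesIndex` are hypothetical names, and the real code mutates `StatusCode` on OPC UA variable nodes under the node manager's lock.

```csharp
using System.Collections.Generic;

// Stand-in for an OPC UA variable node.
public sealed class VarNode
{
    public string StatusCode = "Good";
}

public sealed class HostedVariablesIndex
{
    // gobjectId -> every variable transitively hosted by that Platform/Engine
    private readonly Dictionary<int, List<VarNode>> _hostedVariables = new();

    public void Add(int gobjectId, VarNode v)
    {
        if (!_hostedVariables.TryGetValue(gobjectId, out var list))
            _hostedVariables[gobjectId] = list = new List<VarNode>();
        list.Add(v);
    }

    // Running -> Stopped: every hosted variable reads BadOutOfService.
    public void MarkHostVariablesBadQuality(int gobjectId)
    {
        if (_hostedVariables.TryGetValue(gobjectId, out var list))
            foreach (var v in list) v.StatusCode = "BadOutOfService";
    }

    // Stopped -> Running: reset to Good; live updates repopulate values.
    public void ClearHostVariablesBadQuality(int gobjectId)
    {
        if (_hostedVariables.TryGetValue(gobjectId, out var list))
            foreach (var v in list) v.StatusCode = "Good";
    }
}
```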
- -The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform ends up in **both** the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables. - -### Read-path short-circuit (`IsTagUnderStoppedHost`) - -`LmxNodeManager.Read` override is called by the OPC UA SDK for both direct Read requests and monitored-item sampling. It previously called `_mxAccessClient.ReadAsync(tagRef)` unconditionally and returned whatever VTQ the runtime reported. That created a gap: MxAccess happily serves the last cached value as Good on a tag whose hosting Engine has gone off scan. - -The Read override now checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MxAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MxAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing. - -### Deferred dispatch: the STA deadlock - -**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the `LmxNodeManager.Lock`, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern. - -The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. 
The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — **outside** any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural `Lock` acquisition. No circular wait, no STA dispatch involvement. - -See the `runtimestatus.md` plan file and the `service_info.md` entry for the in-flight debugging that led to this pattern. - -### Dashboard + health surface - -- Dashboard **Galaxy Runtime** panel between Galaxy Info and Historian shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MxAccess transport disconnected). -- Subscriptions panel gains a `Probes: N (bridge-owned runtime status)` line when at least one probe is active, so operators can distinguish bridge-owned probe count from client-driven subscriptions. -- `HealthCheckService.CheckHealth` Rule 2e rolls overall health to `Degraded` when any host is Stopped, ordered after the MxAccess-transport check (Rule 1) so a transport outage stays `Unhealthy` without double-messaging. - -See [Status Dashboard](StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](Configuration.md#mxaccess) for the two new config fields. - -## Request Timeout Safety Backstop - -Every sync-over-async site on the OPC UA stack thread that calls into MxAccess (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). This is a backstop: `MxAccessClient.Read/Write` already enforce inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path. The outer wrapper exists so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely. 
- -On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because MxAccess clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation. - -`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3. - -## Why Marshal.ReleaseComObject Is Needed - -The .NET runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately release the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance. 
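The bounded-wait backstop described above can be sketched like this. `BoundedWait.TryWaitSync` is an illustrative shape, not the actual helper signature; the real helper is `SyncOverAsync.WaitSync` driven by `MxAccess.RequestTimeoutSeconds`.

```csharp
using System;
using System.Threading.Tasks;

public static class BoundedWait
{
    // Blocks the calling (stack) thread for at most `timeout`. On timeout the
    // task is NOT cancelled: it runs to completion on the thread pool and is
    // abandoned, and the caller maps the failure to StatusCodes.BadTimeout.
    public static bool TryWaitSync<T>(Task<T> task, TimeSpan timeout, out T result)
    {
        if (task.Wait(timeout)) // rethrows task faults wrapped in AggregateException
        {
            result = task.Result;
            return true;
        }
        result = default!;
        return false;
    }
}
```

The abandonment is what makes the "shared singleton, no request-scoped capture" caveat in the text load-bearing: nothing in the abandoned continuation may reference per-request state.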
- -## Key source files - -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/StaComThread.cs` -- STA thread and Win32 message pump -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.cs` -- Core client class (partial) -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Connection.cs` -- Connect, disconnect, reconnect -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Subscription.cs` -- Subscribe, unsubscribe, replay -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs` -- Read and write operations -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.EventHandlers.cs` -- OnDataChange and OnWriteComplete handlers -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs` -- Background health monitor -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxProxyAdapter.cs` -- COM object wrapper -- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` -- Per-host `ScanState` probes, state machine, `IsHostStopped` lookup -- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/GalaxyRuntimeStatus.cs` -- Per-host DTO -- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/GalaxyRuntimeState.cs` -- `Unknown` / `Running` / `Stopped` enum -- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/IMxAccessClient.cs` -- Client interface diff --git a/docs/GalaxyRepository.md b/docs/drivers/Galaxy-Repository.md similarity index 51% rename from docs/GalaxyRepository.md rename to docs/drivers/Galaxy-Repository.md index 30be885..381ca85 100644 --- a/docs/GalaxyRepository.md +++ b/docs/drivers/Galaxy-Repository.md @@ -1,6 +1,19 @@ -# Galaxy Repository +# Galaxy Repository — Tag Discovery for the Galaxy Driver -`GalaxyRepositoryService` reads the Galaxy object hierarchy and attribute metadata from the System Platform Galaxy Repository SQL Server database. This data drives the construction of the OPC UA address space. +`GalaxyRepositoryService` reads the Galaxy object hierarchy and attribute metadata from the System Platform Galaxy Repository SQL Server database. 
It is the Galaxy driver's implementation of **`ITagDiscovery.DiscoverAsync`** — every driver has its own discovery source, and the Galaxy driver's is a direct SQL query against the Galaxy Repository (the `ZB` database). Other drivers use completely different mechanisms: + +| Driver | `ITagDiscovery` source | +|--------|------------------------| +| Galaxy | ZB SQL hierarchy + attribute queries (this doc) | +| AB CIP | `@tags` walker against the PLC controller | +| AB Legacy | Data-table scan via PCCC `LogicalRead` on the PLC | +| TwinCAT | Beckhoff `SymbolLoaderFactory` — uploads the full symbol tree from the ADS runtime | +| S7 | Config-DB enumeration (no native symbol upload for S7comm) | +| Modbus | Config-DB enumeration (flat register map, user-authored) | +| FOCAS | CNC queries (`cnc_rdaxisname`, `cnc_rdmacroinfo`, …) + optional Config-DB overlays | +| OPC UA Client | `Session.Browse` against the remote server | + +`GalaxyRepositoryService` lives in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/` — Host-side, .NET Framework 4.8 x86, same process that owns the MXAccess COM objects. The Proxy forwards discovery over IPC the same way it forwards reads and writes. ## Connection Configuration @@ -19,7 +32,7 @@ The connection uses Windows Authentication because the Galaxy Repository databas ## SQL Queries -All queries are embedded as `const string` fields in `GalaxyRepositoryService`. No dynamic SQL is used. +All queries are embedded as `const string` fields in `GalaxyRepositoryService`. No dynamic SQL is used. Project convention `GR-006` requires `const string` SQL queries; any new query must be added as a named constant rather than built at runtime. 
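The GR-006 shape looks like this in practice. Class and field names below are hypothetical; the query texts are paraphrased from this doc (the real `PlatformLookupSql` also restricts to deployed platforms).

```csharp
// GR-006 convention sketch: every query is a named const string; nothing is
// concatenated or built at runtime.
public static class GalaxyQueriesSketch
{
    public const string PlatformLookupSql =
        "SELECT platform_gobject_id, node_name FROM platform";

    public const string LastDeployTimeSql =
        "SELECT time_of_last_deploy FROM galaxy";
}
```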
### Hierarchy query @@ -31,9 +44,9 @@ Returns deployed Galaxy objects with their parent relationships, browse names, a - Marks objects with `category_id = 13` as areas - Filters to `is_template = 0` (instances only, not templates) - Filters to `deployed_package_id <> 0` (deployed objects only) -- Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter). -- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `.ScanState` probe advised. Also used by `LmxNodeManager.BuildHostedVariablesMap` to identify Platform/Engine ancestors during the hosted-variables walk. -- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different — an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. 
The node manager walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [MXAccess Bridge — Per-Host Runtime Status Probes](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate). +- Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](../AlarmTracking.md#template-based-alarm-object-filter). +- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `.ScanState` probe advised. Also used during the hosted-variables walk to identify Platform/Engine ancestors. +- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different — an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. 
The driver walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [Galaxy driver — Per-Host Runtime Status Probes](Galaxy.md#per-host-runtime-status-probes-hostscanstate). ### Attributes query (standard) @@ -53,8 +66,8 @@ Returns user-defined dynamic attributes for deployed objects: When `ExtendedAttributes = true`, a more comprehensive query runs that unions two sources: -1. **Primitive attributes** -- Joins through `primitive_instance` and `attribute_definition` to include system-level attributes from primitive components. Each attribute carries its `primitive_name` so the address space can group them under their parent variable. -2. **Dynamic attributes** -- The same CTE-based query as the standard path, with an empty `primitive_name`. +1. **Primitive attributes** — Joins through `primitive_instance` and `attribute_definition` to include system-level attributes from primitive components. Each attribute carries its `primitive_name` so the address space can group them under their parent variable. +2. **Dynamic attributes** — The same CTE-based query as the standard path, with an empty `primitive_name`. The `full_tag_reference` for primitive attributes follows the pattern `tag_name.primitive_name.attribute_name` (e.g., `TestMachine_001.AlarmAttr.InAlarm`). @@ -66,10 +79,10 @@ A single-column query: `SELECT time_of_last_deploy FROM galaxy`. 
The `galaxy` ta The Galaxy maintains two package references for each object: -- `checked_in_package_id` -- The latest saved version, which may include undeployed configuration changes -- `deployed_package_id` -- The version currently running on the target platform +- `checked_in_package_id` — the latest saved version, which may include undeployed configuration changes +- `deployed_package_id` — the version currently running on the target platform -The queries filter on `deployed_package_id <> 0` because the OPC UA server must mirror what is actually running in the Galaxy runtime. Using `checked_in_package_id` would expose attributes and objects that exist in the IDE but have not been deployed, causing mismatches between the OPC UA address space and the MXAccess runtime. +The queries filter on `deployed_package_id <> 0` because the OPC UA address space must mirror what is actually running in the Galaxy runtime. Using `checked_in_package_id` would expose attributes and objects that exist in the IDE but have not been deployed, causing mismatches between the OPC UA address space and the MXAccess runtime. ## Platform Scope Filter @@ -77,21 +90,16 @@ When `Scope` is set to `LocalPlatform`, the repository applies a post-query C# f ### How it works -1. **Platform lookup** -- A separate `const string` SQL query (`PlatformLookupSql`) reads `platform_gobject_id` and `node_name` from the `platform` table for all deployed platforms. This runs once per hierarchy load. - -2. **Platform matching** -- The configured `PlatformName` (or `Environment.MachineName` when null) is matched case-insensitively against the `node_name` column. If no match is found, a warning is logged listing the available platforms, and the address space is empty. - -3. **Host chain collection** -- The filter collects the matching platform's `gobject_id`, then iterates the hierarchy to find all `$AppEngine` (category 3) objects whose `HostedByGobjectId` equals the platform. 
This produces the full set of host gobject_ids under the local platform. - -4. **Object inclusion** -- All non-area objects whose `HostedByGobjectId` is in the host set are included, along with the hosts themselves. - -5. **Area retention** -- `ParentGobjectId` chains are walked upward from included objects to pull in ancestor areas, keeping the browse tree connected. Areas that contain no local descendants are excluded. - -6. **Attribute filtering** -- The set of included `gobject_id` values is cached after `GetHierarchyAsync` and reused by `GetAttributesAsync` to filter attributes to the same scope. +1. **Platform lookup** — A separate `const string` SQL query (`PlatformLookupSql`) reads `platform_gobject_id` and `node_name` from the `platform` table for all deployed platforms. This runs once per hierarchy load. +2. **Platform matching** — The configured `PlatformName` (or `Environment.MachineName` when null) is matched case-insensitively against the `node_name` column. If no match is found, a warning is logged listing the available platforms and the address space is empty. +3. **Host chain collection** — The filter collects the matching platform's `gobject_id`, then iterates the hierarchy to find all `$AppEngine` (category 3) objects whose `HostedByGobjectId` equals the platform. This produces the full set of host gobject_ids under the local platform. +4. **Object inclusion** — All non-area objects whose `HostedByGobjectId` is in the host set are included, along with the hosts themselves. +5. **Area retention** — `ParentGobjectId` chains are walked upward from included objects to pull in ancestor areas, keeping the browse tree connected. Areas that contain no local descendants are excluded. +6. **Attribute filtering** — The set of included `gobject_id` values is cached after `GetHierarchyAsync` and reused by `GetAttributesAsync` to filter attributes to the same scope. 
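The host-set portion of the filter (steps 3 and 4) condenses to a sketch like the following. `GalaxyObj` and `ScopeFilterSketch` are illustrative stand-ins for `GalaxyObjectInfo` and the real filter; area retention (step 5) and attribute caching (step 6) are omitted.

```csharp
using System.Collections.Generic;

// Illustrative stand-in for GalaxyObjectInfo.
public sealed record GalaxyObj(int GobjectId, int CategoryId, int HostedByGobjectId);

public static class ScopeFilterSketch
{
    // Step 3: collect the platform plus its hosted $AppEngine objects (category 3).
    // Step 4: include everything hosted by that set, plus the hosts themselves.
    public static HashSet<int> IncludedIds(IReadOnlyList<GalaxyObj> hierarchy, int platformGobjectId)
    {
        var hostIds = new HashSet<int> { platformGobjectId };
        foreach (var o in hierarchy)
            if (o.CategoryId == 3 && o.HostedByGobjectId == platformGobjectId)
                hostIds.Add(o.GobjectId);

        var included = new HashSet<int>(hostIds);
        foreach (var o in hierarchy)
            if (hostIds.Contains(o.HostedByGobjectId))
                included.Add(o.GobjectId);
        return included;
    }
}
```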
### Design rationale -The filter is applied in C# rather than SQL because the project convention `GR-006` requires `const string` SQL queries with no dynamic SQL. The hierarchy query already returns `HostedByGobjectId` and `CategoryId` on every row, so all information needed for filtering is already in memory after the query runs. The only new SQL is the lightweight platform lookup query. +The filter is applied in C# rather than SQL because project convention `GR-006` requires `const string` SQL queries with no dynamic SQL. The hierarchy query already returns `HostedByGobjectId` and `CategoryId` on every row, so all information needed for filtering is already in memory after the query runs. The only new SQL is the lightweight platform lookup query. ### Configuration @@ -102,7 +110,7 @@ The filter is applied in C# rather than SQL because the project convention `GR-0 } ``` -- Set `Scope` to `"LocalPlatform"` to enable filtering. Default is `"Galaxy"` (load everything, backward compatible). +- Set `Scope` to `"LocalPlatform"` to enable filtering. Default is `"Galaxy"` (load everything). - Set `PlatformName` to an explicit hostname to target a specific platform, or leave null to use the local machine name. ### Startup log @@ -119,25 +127,26 @@ GetAttributesAsync returned 4206 attributes (extended=true) Scope filter retained 2100 of 4206 attributes ``` -## Change Detection Polling +## Change Detection Polling and IRediscoverable -`ChangeDetectionService` runs a background polling loop that calls `GetLastDeployTimeAsync` at the configured interval. It compares the returned timestamp against the last known value: +`ChangeDetectionService` runs a background polling loop in the Host process that calls `GetLastDeployTimeAsync` at the configured interval. 
It compares the returned timestamp against the last known value: - On the first poll (no previous state), the timestamp is recorded and `OnGalaxyChanged` fires unconditionally - On subsequent polls, `OnGalaxyChanged` fires only when `time_of_last_deploy` differs from the cached value -When the event fires, the host service queries fresh hierarchy and attribute data from the repository and calls `LmxNodeManager.RebuildAddressSpace` (which delegates to incremental `SyncAddressSpace`). +When the event fires, the Host re-runs the hierarchy and attribute queries and pushes the result back to the Server via an IPC `RediscoveryNeeded` message. That surfaces on `GalaxyProxyDriver` as the **`IRediscoverable.OnRediscoveryNeeded`** event; the Server's `DriverNodeManager` consumes it and calls `SyncAddressSpace` to compute the diff against the live address space. The polling approach is used because the Galaxy Repository database does not provide change notifications. The `galaxy.time_of_last_deploy` column updates only on completed deployments, so the polling interval controls how quickly the OPC UA address space reflects Galaxy changes. ## TestConnection -`TestConnectionAsync` runs `SELECT 1` against the configured database. This is used at service startup to verify connectivity before attempting the full hierarchy query. +`TestConnectionAsync` runs `SELECT 1` against the configured database. This is used at Host startup to verify connectivity before attempting the full hierarchy query. 
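The first-poll/subsequent-poll compare in `ChangeDetectionService` reduces to a small piece of state. `DeployChangeDetector` is an illustrative name; the real service also owns the polling timer, the repository call, and the IPC push.

```csharp
using System;

// Compare-and-fire core: fire on the first observation, then only when
// time_of_last_deploy moves.
public sealed class DeployChangeDetector
{
    private DateTime? _lastSeen;

    public bool Observe(DateTime timeOfLastDeploy)
    {
        if (_lastSeen.HasValue && _lastSeen.Value == timeOfLastDeploy)
            return false; // unchanged: no event

        _lastSeen = timeOfLastDeploy;
        return true; // first poll, or a new completed deployment
    }
}
```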
## Key source files

-- `src/ZB.MOM.WW.OtOpcUa.Host/GalaxyRepository/GalaxyRepositoryService.cs` -- SQL queries and data access
-- `src/ZB.MOM.WW.OtOpcUa.Host/GalaxyRepository/PlatformScopeFilter.cs` -- Platform-based hierarchy and attribute filtering
-- `src/ZB.MOM.WW.OtOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs` -- Deploy timestamp polling loop
-- `src/ZB.MOM.WW.OtOpcUa.Host/Configuration/GalaxyRepositoryConfiguration.cs` -- Connection, polling, and scope settings
-- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/PlatformInfo.cs` -- Platform-to-hostname DTO
+- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/GalaxyRepositoryService.cs` — SQL queries and data access
+- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/PlatformScopeFilter.cs` — Platform-based hierarchy and attribute filtering
+- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/ChangeDetectionService.cs` — Deploy timestamp polling loop
+- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Configuration/GalaxyRepositoryConfiguration.cs` — Connection, polling, and scope settings
+- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Domain/PlatformInfo.cs` — Platform-to-hostname DTO
+- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/Contracts/DiscoveryResponse.cs` — IPC DTO the Host uses to return hierarchy + attribute results across the pipe
diff --git a/docs/drivers/Galaxy.md b/docs/drivers/Galaxy.md
new file mode 100644
index 0000000..91d3039
--- /dev/null
+++ b/docs/drivers/Galaxy.md
@@ -0,0 +1,211 @@
+# Galaxy Driver
+
+The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the `ArchestrA.MxAccess` COM API plus the Galaxy Repository SQL database. It is one of eight drivers in the OtOpcUa platform (see [drivers/README.md](README.md) for the full list); most of the others run in-process in the main Server (.NET 10 x64). Galaxy is an exception — it runs as its own Windows service and talks to the Server over a local named pipe.
+ +For the decision record on why Galaxy is out-of-process and how the refactor was staged, see [docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver](../v2/plan.md). For the full driver spec (addressing, data-type map, config shape), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md). + +## Project Split + +Galaxy ships as three projects: + +| Project | Target | Role | +|---------|--------|------| +| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` | .NET Standard 2.0 | IPC contracts (MessagePack records + `MessageKind` enum) referenced by both sides | +| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` | .NET Framework 4.8 **x86** | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager | +| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` | .NET 10 (matches Server) | `GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe` — loaded in-process by the Server; every call forwards over the pipe to the Host | + +The Shared assembly is the **only** contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process. + +## Why Out-of-Process + +Two reasons drive the split, per `docs/v2/plan.md`: + +1. **Bitness constraint.** MXAccess is 32-bit COM only — `ArchestrA.MxAccess.dll` in `Program Files (x86)\ArchestrA\Framework\bin` has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit. +2. 
**Tier-C stability isolation.** Galaxy is classified Tier C in [docs/v2/driver-stability.md](../v2/driver-stability.md) — the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host with crash-loop circuit-breaker. + +The same Tier-C isolation story applies to FOCAS (decision record in `docs/v2/plan.md` §7), which is the second out-of-process driver. + +## IPC Transport + +`GalaxyProxyDriver` → `GalaxyIpcClient` → named pipe → `Galaxy.Host` pipe server. + +- Pipe name: `otopcua-galaxy-{DriverInstanceId}` (localhost-only, no TCP surface) +- Wire format: MessagePack-CSharp, length-prefixed frames +- ACL: pipe is created with a DACL that grants only the Server's service identity; the Admins group is explicitly denied so a live-smoke test running from an elevated shell fails fast rather than silently bypassing the handshake +- Handshake: Proxy presents a shared secret at `OpenSessionRequest`; Host rejects anything else with `MessageKind.OpenSessionResponse{Success=false}` +- Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host + +Every capability call on `GalaxyProxyDriver` (Read, Write, Subscribe, HistoryRead*, etc.) serializes a `*Request`, awaits the matching `*Response` via a `CallAsync` helper, and rehydrates the result into the `Core.Abstractions` shape the Server expects. + +## STA Thread Requirement (Host-side) + +MXAccess COM objects — `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls — must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption. 
+ +`StaComThread` in the Host provides that thread with the apartment state set before the thread starts: + +```csharp +_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true }; +_thread.SetApartmentState(ApartmentState.STA); +``` + +Work items queue via `RunAsync(Action)` or `RunAsync(Func)` into a `ConcurrentQueue` and post `WM_APP` to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread — including the IPC handler thread that receives the inbound pipe request. + +## Win32 Message Pump (Host-side) + +COM callbacks (`OnDataChange`, `OnWriteComplete`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump via P/Invoke: + +1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works) +2. `GetMessage` blocks until a message arrives +3. `WM_APP` drains the work queue +4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop +5. All other messages go through `TranslateMessage` / `DispatchMessage` for COM callback delivery + +Without this pump MXAccess callbacks never fire and the driver delivers no live data. + +## LMXProxyServer COM Object + +`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle: + +1. **`Register(clientName)`** — Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, calls `Register` to obtain a connection handle +2. **`Unregister(handle)`** — Unwires event handlers, calls `Unregister`, releases the COM object via `Marshal.ReleaseComObject` + +## Register / AddItem / AdviseSupervisory Pattern + +Every MXAccess data operation follows a three-step pattern, all executed on the STA thread: + +1. 
**`AddItem(handle, address)`** — Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle +2. **`AdviseSupervisory(handle, itemHandle)`** — Subscribes the item for supervisory data-change callbacks +3. The runtime begins delivering `OnDataChange` events + +For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value; `OnWriteComplete` confirms or rejects. Cleanup reverses: `UnAdviseSupervisory` then `RemoveItem`. + +## OnDataChange and OnWriteComplete Callbacks + +### OnDataChange + +Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in `MxAccessClient.EventHandlers.cs`: + +1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress` +2. Maps the MXAccess quality code to the internal `Quality` enum +3. Checks `MXSTATUS_PROXY` for error details and adjusts quality +4. Converts the timestamp to UTC +5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to: + - The stored per-tag subscription callback + - Any pending one-shot read completions + - The global `OnTagValueChanged` event (consumed by the Host's subscription dispatcher, which packages changes into `DataChangeEventArgs` and forwards them over the pipe to `GalaxyProxyDriver.OnDataChange`) + +### OnWriteComplete + +Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource` for the item handle. If `MXSTATUS_PROXY.success == 0` the write is considered failed and the error detail is logged. + +## Reconnection Logic + +`MxAccessClient` implements automatic reconnection through two mechanisms. + +### Monitor loop + +`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. 
On each cycle: + +- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync` +- If connected and a probe tag is configured, it checks the probe staleness threshold + +### Reconnect sequence + +`ReconnectAsync` performs a full disconnect-then-connect cycle: + +1. Increment the reconnect counter +2. `DisconnectAsync` — tear down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detach COM event handlers, call `Unregister`, clear all handle mappings +3. `ConnectAsync` — create a fresh `LMXProxyServer`, register, replay all stored subscriptions, re-subscribe the probe tag + +Stored subscriptions (`_storedSubscriptions`) persist across reconnects. `ReplayStoredSubscriptionsAsync` iterates the stored entries and calls `AddItem` + `AdviseSupervisory` for each. + +## Probe Tag Health Monitoring + +A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange`. The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data. + +## Per-Host Runtime Status Probes (`.ScanState`) + +Separate from the connection-level probe, the driver advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down. 
+ +Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](../Configuration.md#mxaccess) for the two config fields. + +### How it works + +`GalaxyRuntimeProbeManager` lives in `Driver.Galaxy.Host` alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped): + +1. **Discovery** — After the Host completes `BuildAddressSpace`, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `.ScanState` on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff. +2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors) means **Stopped**. +3. **On-change-only delivery** — `ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. `Tick()` does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts. +4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown`. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped". +5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. 
Stability review 2026-04-13 Finding 1. + +### Subtree quality invalidation on transition + +When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. **Stopped → Running** calls `ClearHostVariablesBadQuality` to reset each to `Good` so the next on-change MXAccess update repopulates the value. + +The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables. + +### Read-path short-circuit (`IsTagUnderStoppedHost`) + +The Host's Read handler checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MXAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing. + +### Deferred dispatch — the STA deadlock + +**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the subscription dispatcher lock, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern. 
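
One way to break this circular wait is to make the STA-side callback do nothing but enqueue and signal, leaving the lock-taking work to another thread. A minimal self-contained sketch of that shape (type and member names here are illustrative, not the actual Host classes):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Illustrative deferred-dispatch queue. The STA-side probe callback only
// enqueues and signals; a separate dispatch thread takes the locks.
// (Names are illustrative, not the actual Host classes.)
sealed class DeferredTransitionQueue
{
    private readonly ConcurrentQueue<(int GobjectId, bool Stopped)> _queue = new();
    private readonly AutoResetEvent _signal = new(false);

    // Runs on the STA thread: never blocks, never takes the dispatcher lock.
    public void Enqueue(int gobjectId, bool stopped)
    {
        _queue.Enqueue((gobjectId, stopped));
        _signal.Set();
    }

    // Runs on the dispatch thread inside its wait loop: drains everything
    // queued since the last wake-up and applies each transition.
    public int Drain(Action<int, bool> applyTransition)
    {
        int applied = 0;
        while (_queue.TryDequeue(out var item))
        {
            applyTransition(item.GobjectId, item.Stopped);
            applied++;
        }
        return applied;
    }

    public bool Wait(TimeSpan timeout) => _signal.WaitOne(timeout);
}
```

The enqueue path never blocks and never touches the dispatcher lock; only the draining thread, which acquires its locks in the usual order, applies the transitions.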
+ +The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto a `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — outside any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural lock acquisition. No circular wait, no STA involvement. + +### Dashboard and health surface + +- Admin UI **Galaxy Runtime** panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected) +- `HealthCheckService.CheckHealth` rolls overall driver health to `Degraded` when any host is Stopped + +See [Status Dashboard](../StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](../Configuration.md#mxaccess) for the config fields. + +## Request Timeout Safety Backstop + +Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). Inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely. + +On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation. 
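
A bounded wait of this shape can be sketched as follows. This is a simplified stand-in, not the actual `SyncOverAsync.WaitSync` implementation; in particular, the real helper presumably maps the timeout to `StatusCodes.BadTimeout` rather than taking a fallback value:

```csharp
using System;
using System.Threading.Tasks;

// Simplified stand-in for a bounded sync-over-async backstop. On timeout
// the task is abandoned, not cancelled: it finishes on the thread pool
// and its result is never observed.
static class SyncOverAsyncSketch
{
    public static T WaitSync<T>(Func<Task<T>> operation, TimeSpan timeout, T timeoutResult)
    {
        // Start on the pool so the calling (stack) thread never inlines the
        // continuation and cannot be parked by it.
        Task<T> task = Task.Run(operation);
        return task.Wait(timeout) ? task.Result : timeoutResult;
    }
}
```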
+ +`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3. + +All capability calls at the Server dispatch layer are additionally wrapped by `CapabilityInvoker` (Core/Resilience/) which runs them through a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. `OTOPCUA0001` analyzer enforces the wrap at build time. + +## Why Marshal.ReleaseComObject Is Needed + +The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately drive the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance. + +## Tag Discovery and Historical Data + +Tag discovery (the Galaxy Repository SQL reader + `LocalPlatform` scope filter) is covered in [Galaxy-Repository.md](Galaxy-Repository.md). The Galaxy driver is `ITagDiscovery` for the Server's bootstrap path and `IRediscoverable` for the on-change-redeploy path. + +Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the `aahClientManaged` SDK and is exposed through the Galaxy driver's `IHistoryProvider` implementation. See [HistoricalDataAccess.md](../HistoricalDataAccess.md). 
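
As a companion illustration, the resumption contract behind a paged raw-history read can be sketched with stand-in types, modelled on the documented `HistoryReadResult { Samples, ContinuationPoint? }` shape (not the actual `Core.Abstractions` definitions):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Stand-in for the documented HistoryReadResult { Samples, ContinuationPoint? }
// shape; not the actual Core.Abstractions definition.
sealed record HistoryReadResult(IReadOnlyList<double> Samples, string? ContinuationPoint);

static class HistoryPager
{
    // Drains a paged raw-history source by re-issuing the read with the
    // returned continuation point until the source reports no more data.
    public static async Task<List<double>> ReadAllAsync(
        Func<string?, Task<HistoryReadResult>> readPage)
    {
        var all = new List<double>();
        string? continuationPoint = null;
        do
        {
            HistoryReadResult page = await readPage(continuationPoint);
            all.AddRange(page.Samples);
            continuationPoint = page.ContinuationPoint;
        } while (continuationPoint != null);
        return all;
    }
}
```

In the server itself the Core hands the continuation point back to the OPC UA client and resumes on its next HistoryRead call rather than draining eagerly; the loop above just demonstrates the contract.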
+ +## Key source files + +Host-side (`.NET 4.8 x86`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/`): + +- `Backend/MxAccess/StaComThread.cs` — STA thread and Win32 message pump +- `Backend/MxAccess/MxAccessClient.cs` — Core client (partial) +- `Backend/MxAccess/MxAccessClient.Connection.cs` — Connect / disconnect / reconnect +- `Backend/MxAccess/MxAccessClient.Subscription.cs` — Subscribe / unsubscribe / replay +- `Backend/MxAccess/MxAccessClient.ReadWrite.cs` — Read and write operations +- `Backend/MxAccess/MxAccessClient.EventHandlers.cs` — `OnDataChange` / `OnWriteComplete` handlers +- `Backend/MxAccess/MxAccessClient.Monitor.cs` — Background health monitor +- `Backend/MxAccess/MxProxyAdapter.cs` — COM object wrapper +- `Backend/MxAccess/GalaxyRuntimeProbeManager.cs` — Per-host `ScanState` probes, state machine, `IsHostStopped` lookup +- `Backend/Historian/HistorianDataSource.cs` — `aahClientManaged` SDK wrapper (see [HistoricalDataAccess.md](../HistoricalDataAccess.md)) +- `Ipc/GalaxyIpcServer.cs` — Named-pipe server, message dispatch +- `Domain/IMxAccessClient.cs` — Client interface + +Shared (`.NET Standard 2.0`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/`): + +- `Contracts/MessageKind.cs` — IPC message kinds (`ReadRequest`, `HistoryReadRequest`, `OpenSessionResponse`, …) +- `Contracts/*.cs` — MessagePack DTOs for every request/response pair + +Proxy-side (`.NET 10`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/`): + +- `GalaxyProxyDriver.cs` — `IDriver`/`ITagDiscovery`/`IReadable`/`IWritable`/`ISubscribable`/`IAlarmSource`/`IHistoryProvider`/`IRediscoverable`/`IHostConnectivityProbe` implementation; every method forwards via `GalaxyIpcClient` +- `Ipc/GalaxyIpcClient.cs` — Named-pipe client, `CallAsync`, reconnect on broken pipe +- `GalaxyProxySupervisor.cs` — Host-process monitor, crash-loop circuit-breaker, Host relaunch diff --git a/docs/drivers/README.md b/docs/drivers/README.md new file mode 100644 index 0000000..164ac03 --- /dev/null +++ 
b/docs/drivers/README.md @@ -0,0 +1,46 @@ +# Drivers + +OtOpcUa is a multi-driver OPC UA server. The Core (`ZB.MOM.WW.OtOpcUa.Core` + `Core.Abstractions` + `Server`) owns the OPC UA stack, address space, session/security/subscription machinery, resilience pipeline, and namespace kinds (Equipment + SystemPlatform). Drivers plug in through **capability interfaces** defined in `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/`: + +- `IDriver` — lifecycle (`InitializeAsync`, `ReinitializeAsync`, `ShutdownAsync`, `GetHealth`) +- `IReadable` / `IWritable` — one-shot reads and writes +- `ITagDiscovery` — address-space enumeration +- `ISubscribable` — driver-pushed data-change streams +- `IHostConnectivityProbe` — per-host reachability events +- `IPerCallHostResolver` — multi-host drivers that route each call to a target endpoint at dispatch time +- `IAlarmSource` — driver-emitted OPC UA A&C events +- `IHistoryProvider` — raw / processed / at-time / events HistoryRead (see [HistoricalDataAccess.md](../HistoricalDataAccess.md)) +- `IRediscoverable` — driver-initiated address-space rebuild notifications + +Each driver opts into only the capabilities it supports. Every async capability call at the Server dispatch layer goes through `CapabilityInvoker` (`Core/Resilience/CapabilityInvoker.cs`), which wraps it in a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. The `OTOPCUA0001` analyzer enforces the wrap at build time. Drivers themselves never depend on Polly; they just implement the capability interface and let the Core wrap it. + +Driver type metadata is registered at startup in `DriverTypeRegistry` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverTypeRegistry.cs`). The registry records each type's allowed namespace kinds (`Equipment` / `SystemPlatform` / `Simulated`), its JSON Schema for `DriverConfig` / `DeviceConfig` / `TagConfig` columns, and its stability tier per [docs/v2/driver-stability.md](../v2/driver-stability.md). 
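
The opt-in mechanics can be illustrated with a minimal sketch; the interfaces below are stand-ins (the real `Core.Abstractions` contracts are async and carry richer result types):

```csharp
// Stand-in capability interfaces; the real Core.Abstractions contracts are
// async and carry richer result types, but the opt-in mechanics are the same.
interface IDriver { string Name { get; } }
interface IHistoryProvider { string ReadRaw(string tag); }

sealed class ModbusDriver : IDriver { public string Name => "Modbus"; }

sealed class GalaxyDriver : IDriver, IHistoryProvider
{
    public string Name => "Galaxy";
    public string ReadRaw(string tag) => $"samples:{tag}";
}

static class Dispatch
{
    // Mirrors the Core's short-circuit: a driver that does not implement the
    // capability answers BadHistoryOperationUnsupported for history calls.
    public static string HistoryRead(IDriver driver, string tag) =>
        driver is IHistoryProvider h ? h.ReadRaw(tag) : "BadHistoryOperationUnsupported";
}
```

A driver that never implements a capability costs nothing at dispatch time beyond the type check; the Core short-circuits to the appropriate `Bad*` status before any driver code runs.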
+ +## Ground-truth driver list + +| Driver | Project path | Tier | Wire / library | Capabilities | Notable quirk | +|--------|--------------|:----:|----------------|--------------|---------------| +| [Galaxy](Galaxy.md) | `Driver.Galaxy.{Shared, Host, Proxy}` | C | MXAccess COM + `aahClientManaged` + SqlClient | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe | Out-of-process — Host is its own Windows service (.NET 4.8 x86 for the COM bitness constraint); Proxy talks to Host over a named pipe | +| Modbus TCP | `Driver.Modbus` | A | NModbus-derived in-house client | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe | Polled subscriptions via the shared `PollGroupEngine`. DL205 PLCs are covered by `AddressFormat=DL205` (octal V/X/Y/C/T/CT translation) — no separate driver | +| Siemens S7 | `Driver.S7` | A | [S7netplus](https://github.com/S7NetPlus/s7netplus) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe | Single S7netplus `Plc` instance per PLC serialized with `SemaphoreSlim` — the S7 CPU's comm mailbox is scanned at most once per cycle, so parallel reads don't help | +| AB CIP | `Driver.AbCip` | A | libplctag CIP | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | ControlLogix / CompactLogix. Tag discovery uses the `@tags` walker to enumerate controller-scoped + program-scoped symbols; UDT member resolution via the UDT template reader | +| AB Legacy | `Driver.AbLegacy` | A | libplctag PCCC | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | SLC 500 / MicroLogix. 
File-based addressing (`N7:0`, `F8:0`) — no symbol table, tag list is user-authored in the config DB | +| TwinCAT | `Driver.TwinCAT` | B | Beckhoff `TwinCAT.Ads` (`TcAdsClient`) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | The only native-notification driver outside Galaxy — ADS delivers `ValueChangedCallback` events the driver forwards straight to `ISubscribable.OnDataChange` without polling. Symbol tree uploaded via `SymbolLoaderFactory` | +| FOCAS | `Driver.FOCAS` | C | FANUC FOCAS2 (`Fwlib32.dll` P/Invoke) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | Tier C — FOCAS DLL has crash modes that warrant process isolation. CNC-shaped data model (axes, spindle, PMC, macros, alarms) not a flat tag map | +| OPC UA Client | `Driver.OpcUaClient` | B | OPCFoundation `Opc.Ua.Client` | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe | Gateway/aggregation driver. Opens a single `Session` against a remote OPC UA server and re-exposes its address space. Owns its own `ApplicationConfiguration` (distinct from `Client.Shared`) because it's always-on with keep-alive + `TransferSubscriptions` across SDK reconnect, not an interactive CLI | + +## Per-driver documentation + +- **Galaxy** has its own docs in this folder because the out-of-process architecture + MXAccess COM rules + Galaxy Repository SQL + Historian + runtime probe manager don't fit a single table row: + - [Galaxy.md](Galaxy.md) — COM bridge, STA pump, IPC, runtime probes + - [Galaxy-Repository.md](Galaxy-Repository.md) — ZB SQL reader, `LocalPlatform` scope filter, change detection + +- **All other drivers** share a single per-driver specification in [docs/v2/driver-specs.md](../v2/driver-specs.md) — addressing, data-type maps, connection settings, and quirks live there. 
That file is the authoritative per-driver reference; this index points at it rather than duplicating it. + +## Related cross-driver docs + +- [HistoricalDataAccess.md](../HistoricalDataAccess.md) — `IHistoryProvider` dispatch, aggregate mapping, continuation points. The Galaxy driver's Aveva Historian implementation is the first; OPC UA Client forwards to the upstream server; other drivers do not implement the interface and return `BadHistoryOperationUnsupported`. +- [AlarmTracking.md](../AlarmTracking.md) — `IAlarmSource` event model and filtering. +- [Subscriptions.md](../Subscriptions.md) — how the Server multiplexes subscriptions onto `ISubscribable.OnDataChange`. +- [docs/v2/driver-stability.md](../v2/driver-stability.md) — tier system (A / B / C), shared `CapabilityPolicy` defaults per tier × capability, `MemoryTracking` hybrid formula, and process-level recycle rules. +- [docs/v2/plan.md](../v2/plan.md) — authoritative vision, architecture decisions, migration strategy.