Doc refresh (task #203) — driver docs split + drivers index + IHistoryProvider-aware HistoricalDataAccess

Restructure the driver-facing docs to match the OtOpcUa v2 multi-driver
reality (Galaxy, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, OPC UA Client
— 8 drivers total; Galaxy ships as three projects) and the capability-interface
architecture where every driver opts into IDriver + whichever of IReadable /
IWritable / ITagDiscovery / ISubscribable / IHostConnectivityProbe /
IPerCallHostResolver / IAlarmSource / IHistoryProvider / IRediscoverable it
supports. Doc scope follows the code: one-driver-specific docs scoped to that
driver, cross-driver concerns live once at the top level, per-driver specs
cross-link to docs/v2/driver-specs.md rather than duplicate.

What changed per file:

- docs/MxAccessBridge.md -> docs/drivers/Galaxy.md (git mv + rewrite): retitled
  "Galaxy Driver", reframed as one of seven drivers. Added Project Split table
  (Shared .NET Standard 2.0 / Host .NET 4.8 x86 / Proxy .NET 10) and Why
  Out-of-Process section citing both the MXAccess bitness constraint and Tier C
  stability isolation per docs/v2/plan.md section 4. Added IPC Transport
  section covering pipe naming, MessagePack framing, DACL that denies Admins,
  shared-secret handshake, heartbeat, and CallAsync<TReq,TResp> dispatch.
  Moved file paths from src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/* to
  src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/MxAccess/* and added the
  Shared + Proxy key-file tables. Added CapabilityInvoker + OTOPCUA0001
  analyzer callout. Cross-linked to drivers/README.md, Galaxy-Repository.md,
  HistoricalDataAccess.md.

- docs/GalaxyRepository.md -> docs/drivers/Galaxy-Repository.md (git mv +
  rewrite): retitled "Galaxy Repository — Tag Discovery for the Galaxy
  Driver", opened with a comparison table showing how every driver's
  ITagDiscovery source is different (AB CIP @tags walker, TwinCAT
  SymbolLoaderFactory, FOCAS CNC queries, OPC UA Client Session.Browse, etc).
  Repositioned GalaxyRepositoryService as the Galaxy driver's
  ITagDiscovery.DiscoverAsync implementation. Updated paths to
  Driver.Galaxy.Host/Backend/GalaxyRepository/*. Added IRediscoverable section
  covering the on-change-redeploy IPC path.

- docs/drivers/README.md (new): index with ground-truth driver table —
  project path, stability tier, wire library, capability-interface list, and
  one notable quirk per driver. Verified against the driver csproj files and
  class declarations on focas-pr3-remaining-capabilities (the most recent
  branch containing every driver). Galaxy gets its own dedicated docs; the
  other seven drivers cross-link to docs/v2/driver-specs.md. Lists the full
  Core.Abstractions capability surface, DriverTypeRegistry, CapabilityInvoker,
  and OTOPCUA0001 analyzer.

- docs/HistoricalDataAccess.md (rewrite): reframed around IHistoryProvider as
  a per-driver optional capability interface. Replaced v1 HistorianPluginLoader
  / AvevaHistorianPluginEntry plugin architecture with the v2 story —
  Historian.Aveva was merged into Driver.Galaxy.Host/Backend/Historian/ and
  IPC-forwarded through GalaxyProxyDriver. Documented all four IHistoryProvider
  methods (ReadRawAsync / ReadProcessedAsync / ReadAtTimeAsync /
  ReadEventsAsync), CapabilityInvoker wrapping with DriverCapability.HistoryRead,
  and the per-driver coverage matrix (Galaxy + OPC UA Client implement; the
  six protocol drivers don't and return BadHistoryOperationUnsupported). Kept
  the cluster-failover + health-counter + quality-mapping detail for the
  Galaxy Historian implementation. Flagged one gap: Proxy forwards all four
  history message kinds but the Host-side HistoryAggregateType -> AnalogSummary
  column mapping may surface GalaxyIpcException{Code="not-implemented"} on a
  given branch until the Phase 2 Galaxy out-of-process gate lands.

Driver list built against ground truth (src on focas-pr3-remaining-capabilities):
  Driver.Galaxy.{Shared,Host,Proxy}, Driver.Modbus, Driver.S7, Driver.AbCip,
  Driver.AbLegacy, Driver.TwinCAT, Driver.FOCAS, Driver.OpcUaClient.
Capability interface lists verified against each *Driver.cs class declaration.
Aveva Historian ported to Driver.Galaxy.Host/Backend/Historian/; no separate
Historian.Aveva assembly on v2 branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-04-20 01:30:47 -04:00
parent 985b7aba26
commit 71339307fa
5 changed files with 372 additions and 385 deletions

View File

@@ -0,0 +1,152 @@
# Galaxy Repository — Tag Discovery for the Galaxy Driver
`GalaxyRepositoryService` reads the Galaxy object hierarchy and attribute metadata from the System Platform Galaxy Repository SQL Server database. It is the Galaxy driver's implementation of **`ITagDiscovery.DiscoverAsync`** — every driver has its own discovery source, and the Galaxy driver's is a direct SQL query against the Galaxy Repository (the `ZB` database). Other drivers use completely different mechanisms:
| Driver | `ITagDiscovery` source |
|--------|------------------------|
| Galaxy | ZB SQL hierarchy + attribute queries (this doc) |
| AB CIP | `@tags` walker against the PLC controller |
| AB Legacy | Data-table scan via PCCC `LogicalRead` on the PLC |
| TwinCAT | Beckhoff `SymbolLoaderFactory` — uploads the full symbol tree from the ADS runtime |
| S7 | Config-DB enumeration (no native symbol upload for S7comm) |
| Modbus | Config-DB enumeration (flat register map, user-authored) |
| FOCAS | CNC queries (`cnc_rdaxisname`, `cnc_rdmacroinfo`, …) + optional Config-DB overlays |
| OPC UA Client | `Session.Browse` against the remote server |
`GalaxyRepositoryService` lives in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/` — Host-side, .NET Framework 4.8 x86, same process that owns the MXAccess COM objects. The Proxy forwards discovery over IPC the same way it forwards reads and writes.
## Connection Configuration
`GalaxyRepositoryConfiguration` controls database access:
| Property | Default | Description |
|----------|---------|-------------|
| `ConnectionString` | `Server=localhost;Database=ZB;Integrated Security=true;` | SQL Server connection using Windows Authentication |
| `ChangeDetectionIntervalSeconds` | `30` | Polling frequency for deploy change detection |
| `CommandTimeoutSeconds` | `30` | SQL command timeout for all queries |
| `ExtendedAttributes` | `false` | When true, loads primitive-level attributes in addition to dynamic attributes |
| `Scope` | `Galaxy` | `Galaxy` loads all deployed objects. `LocalPlatform` filters to the local platform's subtree only |
| `PlatformName` | `null` | Explicit platform hostname for `LocalPlatform` filtering. When null, uses `Environment.MachineName` |
The connection uses Windows Authentication because the Galaxy Repository database is local to the System Platform node and secured through domain credentials.
## SQL Queries
All queries are embedded as `const string` fields in `GalaxyRepositoryService`. No dynamic SQL is used. Project convention `GR-006` requires `const string` SQL queries; any new query must be added as a named constant rather than built at runtime.
### Hierarchy query
Returns deployed Galaxy objects with their parent relationships, browse names, and template derivation chains:
- Joins `gobject` to `template_definition` to filter by relevant `category_id` values (1, 3, 4, 10, 11, 13, 17, 24, 26)
- Uses `contained_name` as the browse name, falling back to `tag_name` when `contained_name` is null or empty
- Resolves the parent using `contained_by_gobject_id` when non-zero, otherwise falls back to `area_gobject_id`
- Marks objects with `category_id = 13` as areas
- Filters to `is_template = 0` (instances only, not templates)
- Filters to `deployed_package_id <> 0` (deployed objects only)
- Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](../AlarmTracking.md#template-based-alarm-object-filter).
- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `<Host>.ScanState` probe advised. Also used during the hosted-variables walk to identify Platform/Engine ancestors.
- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different — an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. The driver walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [Galaxy driver — Per-Host Runtime Status Probes](Galaxy.md#per-host-runtime-status-probes-hostscanstate).
### Attributes query (standard)
Returns user-defined dynamic attributes for deployed objects:
- Uses a recursive CTE (`deployed_package_chain`) to walk the package inheritance chain from `deployed_package_id` through `derived_from_package_id`, limited to 10 levels
- Joins `dynamic_attribute` on each package in the chain to collect inherited attributes
- Uses `ROW_NUMBER() OVER (PARTITION BY gobject_id, attribute_name ORDER BY depth)` to pick the most-derived definition when an attribute is overridden at multiple levels
- Builds `full_tag_reference` as `tag_name.attribute_name` with `[]` appended for arrays
- Extracts `array_dimension` from the binary `mx_value` column (bytes 13-16, little-endian int32)
- Detects historized attributes by checking for a `HistoryExtension` primitive instance
- Detects alarm attributes by checking for an `AlarmExtension` primitive instance
- Excludes internal attributes (names starting with `_`) and `.Description` suffixes
- Filters by `mx_attribute_category` to include only user-relevant categories
### Attributes query (extended)
When `ExtendedAttributes = true`, a more comprehensive query runs that unions two sources:
1. **Primitive attributes** — Joins through `primitive_instance` and `attribute_definition` to include system-level attributes from primitive components. Each attribute carries its `primitive_name` so the address space can group them under their parent variable.
2. **Dynamic attributes** — The same CTE-based query as the standard path, with an empty `primitive_name`.
The `full_tag_reference` for primitive attributes follows the pattern `tag_name.primitive_name.attribute_name` (e.g., `TestMachine_001.AlarmAttr.InAlarm`).
### Change detection query
A single-column query: `SELECT time_of_last_deploy FROM galaxy`. The `galaxy` table contains one row with the timestamp of the most recent deployment.
## Why deployed_package_id Instead of checked_in_package_id
The Galaxy maintains two package references for each object:
- `checked_in_package_id` — the latest saved version, which may include undeployed configuration changes
- `deployed_package_id` — the version currently running on the target platform
The queries filter on `deployed_package_id <> 0` because the OPC UA address space must mirror what is actually running in the Galaxy runtime. Using `checked_in_package_id` would expose attributes and objects that exist in the IDE but have not been deployed, causing mismatches between the OPC UA address space and the MXAccess runtime.
## Platform Scope Filter
When `Scope` is set to `LocalPlatform`, the repository applies a post-query C# filter to restrict the address space to objects hosted by the local platform. This reduces memory footprint, MXAccess subscription count, and address space size on multi-node Galaxy deployments where each OPC UA server instance only needs to serve its own platform's objects.
### How it works
1. **Platform lookup** — A separate `const string` SQL query (`PlatformLookupSql`) reads `platform_gobject_id` and `node_name` from the `platform` table for all deployed platforms. This runs once per hierarchy load.
2. **Platform matching** — The configured `PlatformName` (or `Environment.MachineName` when null) is matched case-insensitively against the `node_name` column. If no match is found, a warning is logged listing the available platforms and the address space is empty.
3. **Host chain collection** — The filter collects the matching platform's `gobject_id`, then iterates the hierarchy to find all `$AppEngine` (category 3) objects whose `HostedByGobjectId` equals the platform. This produces the full set of host gobject_ids under the local platform.
4. **Object inclusion** — All non-area objects whose `HostedByGobjectId` is in the host set are included, along with the hosts themselves.
5. **Area retention**`ParentGobjectId` chains are walked upward from included objects to pull in ancestor areas, keeping the browse tree connected. Areas that contain no local descendants are excluded.
6. **Attribute filtering** — The set of included `gobject_id` values is cached after `GetHierarchyAsync` and reused by `GetAttributesAsync` to filter attributes to the same scope.
### Design rationale
The filter is applied in C# rather than SQL because project convention `GR-006` requires `const string` SQL queries with no dynamic SQL. The hierarchy query already returns `HostedByGobjectId` and `CategoryId` on every row, so all information needed for filtering is already in memory after the query runs. The only new SQL is the lightweight platform lookup query.
### Configuration
```json
"GalaxyRepository": {
"Scope": "LocalPlatform",
"PlatformName": null
}
```
- Set `Scope` to `"LocalPlatform"` to enable filtering. Default is `"Galaxy"` (load everything).
- Set `PlatformName` to an explicit hostname to target a specific platform, or leave null to use the local machine name.
### Startup log
When `LocalPlatform` is active, the startup log shows the filtering result:
```
GalaxyRepository.Scope="LocalPlatform", PlatformName=MYNODE
GetHierarchyAsync returned 49 objects
GetPlatformsAsync returned 2 platform(s)
Scope filter targeting platform 'MYNODE' (gobject_id=1042)
Scope filter retained 25 of 49 objects for platform 'MYNODE'
GetAttributesAsync returned 4206 attributes (extended=true)
Scope filter retained 2100 of 4206 attributes
```
## Change Detection Polling and IRediscoverable
`ChangeDetectionService` runs a background polling loop in the Host process that calls `GetLastDeployTimeAsync` at the configured interval. It compares the returned timestamp against the last known value:
- On the first poll (no previous state), the timestamp is recorded and `OnGalaxyChanged` fires unconditionally
- On subsequent polls, `OnGalaxyChanged` fires only when `time_of_last_deploy` differs from the cached value
When the event fires, the Host re-runs the hierarchy and attribute queries and pushes the result back to the Server via an IPC `RediscoveryNeeded` message. That surfaces on `GalaxyProxyDriver` as the **`IRediscoverable.OnRediscoveryNeeded`** event; the Server's `DriverNodeManager` consumes it and calls `SyncAddressSpace` to compute the diff against the live address space.
The polling approach is used because the Galaxy Repository database does not provide change notifications. The `galaxy.time_of_last_deploy` column updates only on completed deployments, so the polling interval controls how quickly the OPC UA address space reflects Galaxy changes.
## TestConnection
`TestConnectionAsync` runs `SELECT 1` against the configured database. This is used at Host startup to verify connectivity before attempting the full hierarchy query.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/GalaxyRepositoryService.cs` — SQL queries and data access
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/PlatformScopeFilter.cs` — Platform-based hierarchy and attribute filtering
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/ChangeDetectionService.cs` — Deploy timestamp polling loop
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Configuration/GalaxyRepositoryConfiguration.cs` — Connection, polling, and scope settings
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Domain/PlatformInfo.cs` — Platform-to-hostname DTO
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/Contracts/DiscoveryResponse.cs` — IPC DTO the Host uses to return hierarchy + attribute results across the pipe

211
docs/drivers/Galaxy.md Normal file
View File

@@ -0,0 +1,211 @@
# Galaxy Driver
The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the `ArchestrA.MxAccess` COM API plus the Galaxy Repository SQL database. It is one driver of seven in the OtOpcUa platform (see [drivers/README.md](README.md) for the full list); all other drivers run in-process in the main Server (.NET 10 x64). Galaxy is the exception — it runs as its own Windows service and talks to the Server over a local named pipe.
For the decision record on why Galaxy is out-of-process and how the refactor was staged, see [docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver](../v2/plan.md). For the full driver spec (addressing, data-type map, config shape), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md).
## Project Split
Galaxy ships as three projects:
| Project | Target | Role |
|---------|--------|------|
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` | .NET Standard 2.0 | IPC contracts (MessagePack records + `MessageKind` enum) referenced by both sides |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` | .NET Framework 4.8 **x86** | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` | .NET 10 (matches Server) | `GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe` — loaded in-process by the Server; every call forwards over the pipe to the Host |
The Shared assembly is the **only** contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process.
## Why Out-of-Process
Two reasons drive the split, per `docs/v2/plan.md`:
1. **Bitness constraint.** MXAccess is 32-bit COM only — `ArchestrA.MxAccess.dll` in `Program Files (x86)\ArchestrA\Framework\bin` has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit.
2. **Tier-C stability isolation.** Galaxy is classified Tier C in [docs/v2/driver-stability.md](../v2/driver-stability.md) — the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host with crash-loop circuit-breaker.
The same Tier-C isolation story applies to FOCAS (decision record in `docs/v2/plan.md` §7), which is the second out-of-process driver.
## IPC Transport
`GalaxyProxyDriver``GalaxyIpcClient` → named pipe → `Galaxy.Host` pipe server.
- Pipe name: `otopcua-galaxy-{DriverInstanceId}` (localhost-only, no TCP surface)
- Wire format: MessagePack-CSharp, length-prefixed frames
- ACL: pipe is created with a DACL that grants only the Server's service identity; the Admins group is explicitly denied so a live-smoke test running from an elevated shell fails fast rather than silently bypassing the handshake
- Handshake: Proxy presents a shared secret at `OpenSessionRequest`; Host rejects anything else with `MessageKind.OpenSessionResponse{Success=false}`
- Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host
Every capability call on `GalaxyProxyDriver` (Read, Write, Subscribe, HistoryRead*, etc.) serializes a `*Request`, awaits the matching `*Response` via a `CallAsync<TReq, TResp>` helper, and rehydrates the result into the `Core.Abstractions` shape the Server expects.
## STA Thread Requirement (Host-side)
MXAccess COM objects — `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls — must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption.
`StaComThread` in the Host provides that thread with the apartment state set before the thread starts:
```csharp
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
```
Work items queue via `RunAsync(Action)` or `RunAsync<T>(Func<T>)` into a `ConcurrentQueue<Action>` and post `WM_APP` to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread — including the IPC handler thread that receives the inbound pipe request.
## Win32 Message Pump (Host-side)
COM callbacks (`OnDataChange`, `OnWriteComplete`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump via P/Invoke:
1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
2. `GetMessage` blocks until a message arrives
3. `WM_APP` drains the work queue
4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
5. All other messages go through `TranslateMessage` / `DispatchMessage` for COM callback delivery
Without this pump MXAccess callbacks never fire and the driver delivers no live data.
## LMXProxyServer COM Object
`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle:
1. **`Register(clientName)`** — Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, calls `Register` to obtain a connection handle
2. **`Unregister(handle)`** — Unwires event handlers, calls `Unregister`, releases the COM object via `Marshal.ReleaseComObject`
## Register / AddItem / AdviseSupervisory Pattern
Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:
1. **`AddItem(handle, address)`** — Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
2. **`AdviseSupervisory(handle, itemHandle)`** — Subscribes the item for supervisory data-change callbacks
3. The runtime begins delivering `OnDataChange` events
For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value; `OnWriteComplete` confirms or rejects. Cleanup reverses: `UnAdviseSupervisory` then `RemoveItem`.
## OnDataChange and OnWriteComplete Callbacks
### OnDataChange
Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in `MxAccessClient.EventHandlers.cs`:
1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
2. Maps the MXAccess quality code to the internal `Quality` enum
3. Checks `MXSTATUS_PROXY` for error details and adjusts quality
4. Converts the timestamp to UTC
5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
- The stored per-tag subscription callback
- Any pending one-shot read completions
- The global `OnTagValueChanged` event (consumed by the Host's subscription dispatcher, which packages changes into `DataChangeEventArgs` and forwards them over the pipe to `GalaxyProxyDriver.OnDataChange`)
### OnWriteComplete
Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource<bool>` for the item handle. If `MXSTATUS_PROXY.success == 0` the write is considered failed and the error detail is logged.
## Reconnection Logic
`MxAccessClient` implements automatic reconnection through two mechanisms.
### Monitor loop
`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle:
- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
- If connected and a probe tag is configured, it checks the probe staleness threshold
### Reconnect sequence
`ReconnectAsync` performs a full disconnect-then-connect cycle:
1. Increment the reconnect counter
2. `DisconnectAsync` — tear down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detach COM event handlers, call `Unregister`, clear all handle mappings
3. `ConnectAsync` — create a fresh `LMXProxyServer`, register, replay all stored subscriptions, re-subscribe the probe tag
Stored subscriptions (`_storedSubscriptions`) persist across reconnects. `ReplayStoredSubscriptionsAsync` iterates the stored entries and calls `AddItem` + `AdviseSupervisory` for each.
## Probe Tag Health Monitoring
A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange`. The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
## Per-Host Runtime Status Probes (`<Host>.ScanState`)
Separate from the connection-level probe, the driver advises `<HostName>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](../Configuration.md#mxaccess) for the two config fields.
### How it works
`GalaxyRuntimeProbeManager` lives in `Driver.Galaxy.Host` alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped):
1. **Discovery** — After the Host completes `BuildAddressSpace`, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors) means **Stopped**.
3. **On-change-only delivery**`ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. `Tick()` does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown`. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped".
5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Stability review 2026-04-13 Finding 1.
### Subtree quality invalidation on transition
When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. **Stopped → Running** calls `ClearHostVariablesBadQuality` to reset each to `Good` so the next on-change MXAccess update repopulates the value.
The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
### Read-path short-circuit (`IsTagUnderStoppedHost`)
The Host's Read handler checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]``GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MXAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
### Deferred dispatch — the STA deadlock
**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the subscription dispatcher lock, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern.
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — outside any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural lock acquisition. No circular wait, no STA involvement.
### Dashboard and health surface
- Admin UI **Galaxy Runtime** panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected)
- `HealthCheckService.CheckHealth` rolls overall driver health to `Degraded` when any host is Stopped
See [Status Dashboard](../StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](../Configuration.md#mxaccess) for the config fields.
## Request Timeout Safety Backstop
Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). Inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.
`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
All capability calls at the Server dispatch layer are additionally wrapped by `CapabilityInvoker` (Core/Resilience/) which runs them through a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. `OTOPCUA0001` analyzer enforces the wrap at build time.
## Why Marshal.ReleaseComObject Is Needed
The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately drive the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
## Tag Discovery and Historical Data
Tag discovery (the Galaxy Repository SQL reader + `LocalPlatform` scope filter) is covered in [Galaxy-Repository.md](Galaxy-Repository.md). The Galaxy driver is `ITagDiscovery` for the Server's bootstrap path and `IRediscoverable` for the on-change-redeploy path.
Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the `aahClientManaged` SDK and is exposed through the Galaxy driver's `IHistoryProvider` implementation. See [HistoricalDataAccess.md](../HistoricalDataAccess.md).
## Key source files
Host-side (`.NET 4.8 x86`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/`):
- `Backend/MxAccess/StaComThread.cs` — STA thread and Win32 message pump
- `Backend/MxAccess/MxAccessClient.cs` — Core client (partial)
- `Backend/MxAccess/MxAccessClient.Connection.cs` — Connect / disconnect / reconnect
- `Backend/MxAccess/MxAccessClient.Subscription.cs` — Subscribe / unsubscribe / replay
- `Backend/MxAccess/MxAccessClient.ReadWrite.cs` — Read and write operations
- `Backend/MxAccess/MxAccessClient.EventHandlers.cs``OnDataChange` / `OnWriteComplete` handlers
- `Backend/MxAccess/MxAccessClient.Monitor.cs` — Background health monitor
- `Backend/MxAccess/MxProxyAdapter.cs` — COM object wrapper
- `Backend/MxAccess/GalaxyRuntimeProbeManager.cs` — Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `Backend/Historian/HistorianDataSource.cs``aahClientManaged` SDK wrapper (see [HistoricalDataAccess.md](../HistoricalDataAccess.md))
- `Ipc/GalaxyIpcServer.cs` — Named-pipe server, message dispatch
- `Domain/IMxAccessClient.cs` — Client interface
Shared (`.NET Standard 2.0`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/`):
- `Contracts/MessageKind.cs` — IPC message kinds (`ReadRequest`, `HistoryReadRequest`, `OpenSessionResponse`, …)
- `Contracts/*.cs` — MessagePack DTOs for every request/response pair
Proxy-side (`.NET 10`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/`):
- `GalaxyProxyDriver.cs``IDriver`/`ITagDiscovery`/`IReadable`/`IWritable`/`ISubscribable`/`IAlarmSource`/`IHistoryProvider`/`IRediscoverable`/`IHostConnectivityProbe` implementation; every method forwards via `GalaxyIpcClient`
- `Ipc/GalaxyIpcClient.cs` — Named-pipe client, `CallAsync<TReq, TResp>`, reconnect on broken pipe
- `GalaxyProxySupervisor.cs` — Host-process monitor, crash-loop circuit-breaker, Host relaunch

46
docs/drivers/README.md Normal file
View File

@@ -0,0 +1,46 @@
# Drivers
OtOpcUa is a multi-driver OPC UA server. The Core (`ZB.MOM.WW.OtOpcUa.Core` + `Core.Abstractions` + `Server`) owns the OPC UA stack, address space, session/security/subscription machinery, resilience pipeline, and namespace kinds (Equipment + SystemPlatform). Drivers plug in through **capability interfaces** defined in `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/`:
- `IDriver` — lifecycle (`InitializeAsync`, `ReinitializeAsync`, `ShutdownAsync`, `GetHealth`)
- `IReadable` / `IWritable` — one-shot reads and writes
- `ITagDiscovery` — address-space enumeration
- `ISubscribable` — driver-pushed data-change streams
- `IHostConnectivityProbe` — per-host reachability events
- `IPerCallHostResolver` — multi-host drivers that route each call to a target endpoint at dispatch time
- `IAlarmSource` — driver-emitted OPC UA A&C events
- `IHistoryProvider` — raw / processed / at-time / events HistoryRead (see [HistoricalDataAccess.md](../HistoricalDataAccess.md))
- `IRediscoverable` — driver-initiated address-space rebuild notifications
Each driver opts into only the capabilities it supports. Every async capability call at the Server dispatch layer goes through `CapabilityInvoker` (`Core/Resilience/CapabilityInvoker.cs`), which wraps it in a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. The `OTOPCUA0001` analyzer enforces the wrap at build time. Drivers themselves never depend on Polly; they just implement the capability interface and let the Core wrap it.
Driver type metadata is registered at startup in `DriverTypeRegistry` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverTypeRegistry.cs`). The registry records each type's allowed namespace kinds (`Equipment` / `SystemPlatform` / `Simulated`), its JSON Schema for `DriverConfig` / `DeviceConfig` / `TagConfig` columns, and its stability tier per [docs/v2/driver-stability.md](../v2/driver-stability.md).
## Ground-truth driver list
| Driver | Project path | Tier | Wire / library | Capabilities | Notable quirk |
|--------|--------------|:----:|----------------|--------------|---------------|
| [Galaxy](Galaxy.md) | `Driver.Galaxy.{Shared, Host, Proxy}` | C | MXAccess COM + `aahClientManaged` + SqlClient | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe | Out-of-process — Host is its own Windows service (.NET 4.8 x86 for the COM bitness constraint); Proxy talks to Host over a named pipe |
| Modbus TCP | `Driver.Modbus` | A | NModbus-derived in-house client | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe | Polled subscriptions via the shared `PollGroupEngine`. DL205 PLCs are covered by `AddressFormat=DL205` (octal V/X/Y/C/T/CT translation) — no separate driver |
| Siemens S7 | `Driver.S7` | A | [S7netplus](https://github.com/S7NetPlus/s7netplus) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe | Single S7netplus `Plc` instance per PLC serialized with `SemaphoreSlim` — the S7 CPU's comm mailbox is scanned at most once per cycle, so parallel reads don't help |
| AB CIP | `Driver.AbCip` | A | libplctag CIP | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | ControlLogix / CompactLogix. Tag discovery uses the `@tags` walker to enumerate controller-scoped + program-scoped symbols; UDT member resolution via the UDT template reader |
| AB Legacy | `Driver.AbLegacy` | A | libplctag PCCC | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | SLC 500 / MicroLogix. File-based addressing (`N7:0`, `F8:0`) — no symbol table, tag list is user-authored in the config DB |
| TwinCAT | `Driver.TwinCAT` | B | Beckhoff `TwinCAT.Ads` (`TcAdsClient`) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | The only native-notification driver outside Galaxy — ADS delivers `ValueChangedCallback` events the driver forwards straight to `ISubscribable.OnDataChange` without polling. Symbol tree uploaded via `SymbolLoaderFactory` |
| FOCAS | `Driver.FOCAS` | C | FANUC FOCAS2 (`Fwlib32.dll` P/Invoke) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | Tier C — FOCAS DLL has crash modes that warrant process isolation. CNC-shaped data model (axes, spindle, PMC, macros, alarms) not a flat tag map |
| OPC UA Client | `Driver.OpcUaClient` | B | OPCFoundation `Opc.Ua.Client` | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe | Gateway/aggregation driver. Opens a single `Session` against a remote OPC UA server and re-exposes its address space. Owns its own `ApplicationConfiguration` (distinct from `Client.Shared`) because it's always-on with keep-alive + `TransferSubscriptions` across SDK reconnect, not an interactive CLI |
## Per-driver documentation
- **Galaxy** has its own docs in this folder because the out-of-process architecture + MXAccess COM rules + Galaxy Repository SQL + Historian + runtime probe manager don't fit a single table row:
- [Galaxy.md](Galaxy.md) — COM bridge, STA pump, IPC, runtime probes
- [Galaxy-Repository.md](Galaxy-Repository.md) — ZB SQL reader, `LocalPlatform` scope filter, change detection
- **All other drivers** share a single per-driver specification in [docs/v2/driver-specs.md](../v2/driver-specs.md) — addressing, data-type maps, connection settings, and quirks live there. That file is the authoritative per-driver reference; this index points at it rather than duplicating.
## Related cross-driver docs
- [HistoricalDataAccess.md](../HistoricalDataAccess.md) — `IHistoryProvider` dispatch, aggregate mapping, continuation points. The Galaxy driver's Aveva Historian implementation is the first; OPC UA Client forwards to the upstream server; other drivers do not implement the interface and return `BadHistoryOperationUnsupported`.
- [AlarmTracking.md](../AlarmTracking.md) — `IAlarmSource` event model and filtering.
- [Subscriptions.md](../Subscriptions.md) — how the Server multiplexes subscriptions onto `ISubscribable.OnDataChange`.
- [docs/v2/driver-stability.md](../v2/driver-stability.md) — tier system (A / B / C), shared `CapabilityPolicy` defaults per tier × capability, `MemoryTracking` hybrid formula, and process-level recycle rules.
- [docs/v2/plan.md](../v2/plan.md) — authoritative vision, architecture decisions, migration strategy.