docs: post-PR-7.2 cleanup — audit + three-track scrub

Audit (three parallel agent passes) found 43 markdown files carrying stale references to the deleted Galaxy.Host/Proxy/Shared projects after the v2-mxgw merge. This commit lands the prioritized fixes.

Track 1: high-traffic in-place rewrites (3 files, ~454 lines deleted)
- README.md (202 → 91 lines): drops .NET 4.8 / x86 / TopShelf install text; leads with the multi-driver .NET 10 server identity and points at scripts/install/Install-Services.ps1 and the parity rig.
- docs/v2/driver-specs.md §1 Galaxy (~289 → ~66 lines): replaces the Tier-C out-of-process spec with a Tier-A in-process description matching the current GalaxyDriver code, with the four-section GalaxyDriverOptions JSON shape pulled verbatim from Config/GalaxyDriverOptions.cs.
- docs/drivers/Galaxy.md (211 → 92 lines): full rewrite around the current Browse/Runtime/Health/Config sub-folders.

Track 2: historical banners (5 files)
- lmx_mxgw.md, lmx_mxgw_impl.md, lmx_backend.md, docs/v2/Galaxy.ParityMatrix.md, docs/v2/implementation/phase-2-galaxy-out-of-process.md each get a "✅ Completed 2026-04-30 — historical record" banner block. lmx_mxgw.md also fixes two dead links (`docs/Galaxy.Driver.md` and `docs/v2/Galaxy.Driver.md`) → `docs/drivers/Galaxy.md`.

Track 3: v1 archive sweep (10 git mv + 1 new index + 2 in-place scrubs)
- Moved 10 v1 docs under docs/v1/ preserving subpath structure: AlarmTracking, Configuration, DataTypeMapping, HistoricalDataAccess, Subscriptions (top-level); drivers/Galaxy-Repository, drivers/Galaxy-Test-Fixture; reqs/GalaxyRepositoryReqs, reqs/MxAccessClientReqs, reqs/ServiceHostReqs.
- New docs/v1/README.md is the shared archive banner + per-file table.
- docs/README.md repointed to the v1 paths and updated to reflect the v2 two-process deploy shape (Server + Admin + optional OtOpcUaWonderwareHistorian).
- docs/v2/Galaxy.ParityRig.md got a historical banner + four inline scrubs marking the OtOpcUaGalaxyHost service / Driver.Galaxy.Host EXE / Driver.Galaxy.ParityTests project as deleted-in-PR-7.2.

The repo's live-reading surface (README + CLAUDE.md + docs/v2/) now describes only the post-PR-7.2 architecture. v1 docs are preserved as a labelled archive under docs/v1/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@@ -1,152 +0,0 @@
# Galaxy Repository — Tag Discovery for the Galaxy Driver

`GalaxyRepositoryService` reads the Galaxy object hierarchy and attribute metadata from the System Platform Galaxy Repository SQL Server database. It is the Galaxy driver's implementation of **`ITagDiscovery.DiscoverAsync`**. Every driver has its own discovery source, and the Galaxy driver's is a direct SQL query against the Galaxy Repository (the `ZB` database). Other drivers use completely different mechanisms:

| Driver | `ITagDiscovery` source |
|--------|------------------------|
| Galaxy | ZB SQL hierarchy + attribute queries (this doc) |
| AB CIP | `@tags` walker against the PLC controller |
| AB Legacy | Data-table scan via PCCC `LogicalRead` on the PLC |
| TwinCAT | Beckhoff `SymbolLoaderFactory`: uploads the full symbol tree from the ADS runtime |
| S7 | Config-DB enumeration (no native symbol upload for S7comm) |
| Modbus | Config-DB enumeration (flat register map, user-authored) |
| FOCAS | CNC queries (`cnc_rdaxisname`, `cnc_rdmacroinfo`, …) + optional Config-DB overlays |
| OPC UA Client | `Session.Browse` against the remote server |

`GalaxyRepositoryService` lives in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/`. It is Host-side (.NET Framework 4.8 x86), in the same process that owns the MXAccess COM objects. The Proxy forwards discovery over IPC the same way it forwards reads and writes.

## Connection Configuration

`GalaxyRepositoryConfiguration` controls database access:

| Property | Default | Description |
|----------|---------|-------------|
| `ConnectionString` | `Server=localhost;Database=ZB;Integrated Security=true;` | SQL Server connection using Windows Authentication |
| `ChangeDetectionIntervalSeconds` | `30` | Polling interval, in seconds, for deploy change detection |
| `CommandTimeoutSeconds` | `30` | SQL command timeout, in seconds, for all queries |
| `ExtendedAttributes` | `false` | When true, loads primitive-level attributes in addition to dynamic attributes |
| `Scope` | `Galaxy` | `Galaxy` loads all deployed objects. `LocalPlatform` filters to the local platform's subtree only |
| `PlatformName` | `null` | Explicit platform hostname for `LocalPlatform` filtering. When null, uses `Environment.MachineName` |

The connection uses Windows Authentication because the Galaxy Repository database is local to the System Platform node and secured through domain credentials.

## SQL Queries

All queries are embedded as `const string` fields in `GalaxyRepositoryService`. No dynamic SQL is used. Project convention `GR-006` requires `const string` SQL queries; any new query must be added as a named constant rather than built at runtime.

### Hierarchy query

Returns deployed Galaxy objects with their parent relationships, browse names, and template derivation chains:

- Joins `gobject` to `template_definition` to filter by relevant `category_id` values (1, 3, 4, 10, 11, 13, 17, 24, 26)
- Uses `contained_name` as the browse name, falling back to `tag_name` when `contained_name` is null or empty
- Resolves the parent using `contained_by_gobject_id` when non-zero, otherwise falls back to `area_gobject_id`
- Marks objects with `category_id = 13` as areas
- Filters to `is_template = 0` (instances only, not templates)
- Filters to `deployed_package_id <> 0` (deployed objects only)
- Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](../AlarmTracking.md#template-based-alarm-object-filter).
- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `<Host>.ScanState` probe advised. Also used during the hosted-variables walk to identify Platform/Engine ancestors.
- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different: an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. The driver walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [Galaxy driver: Per-Host Runtime Status Probes](Galaxy.md#per-host-runtime-status-probes-hostscanstate).
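
The row-level rules in the list above can be restated as a small mapping function. This is a minimal Python sketch, not the C# reader: in the real service the browse-name and parent fallbacks are described as part of the SQL itself, and only the `template_chain` split happens in C#. The `GalaxyObject` class, `map_hierarchy_row`, and the dictionary-shaped row are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field

AREA_CATEGORY_ID = 13  # category_id the hierarchy query marks as an area


@dataclass
class GalaxyObject:
    # Hypothetical stand-in for GalaxyObjectInfo; only the fields used here.
    gobject_id: int
    browse_name: str
    parent_gobject_id: int
    is_area: bool
    template_chain: list = field(default_factory=list)


def map_hierarchy_row(row: dict) -> GalaxyObject:
    """Apply the browse-name, parent, area and template-chain rules to one row."""
    # contained_name is the browse name; fall back to tag_name when null or empty.
    browse_name = row.get("contained_name") or row["tag_name"]
    # Prefer the containment parent; a zero id means "use the area instead".
    contained_by = row.get("contained_by_gobject_id") or 0
    parent = contained_by if contained_by != 0 else row["area_gobject_id"]
    # template_chain arrives as one '|'-joined string; split and trim it.
    raw_chain = row.get("template_chain") or ""
    chain = [part.strip() for part in raw_chain.split("|") if part.strip()]
    return GalaxyObject(
        gobject_id=row["gobject_id"],
        browse_name=browse_name,
        parent_gobject_id=parent,
        is_area=row["category_id"] == AREA_CATEGORY_ID,
        template_chain=chain,
    )
```

Feeding it the `TestMachine_001` example from the list yields a four-element template chain, with the browse name falling back to the tag name when `contained_name` is empty.
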

### Attributes query (standard)

Returns user-defined dynamic attributes for deployed objects:

- Uses a recursive CTE (`deployed_package_chain`) to walk the package inheritance chain from `deployed_package_id` through `derived_from_package_id`, limited to 10 levels
- Joins `dynamic_attribute` on each package in the chain to collect inherited attributes
- Uses `ROW_NUMBER() OVER (PARTITION BY gobject_id, attribute_name ORDER BY depth)` to pick the most-derived definition when an attribute is overridden at multiple levels
- Builds `full_tag_reference` as `tag_name.attribute_name` with `[]` appended for arrays
- Extracts `array_dimension` from the binary `mx_value` column (bytes 13-16, little-endian int32)
- Detects historized attributes by checking for a `HistoryExtension` primitive instance
- Detects alarm attributes by checking for an `AlarmExtension` primitive instance
- Excludes internal attributes (names starting with `_`) and `.Description` suffixes
- Filters by `mx_attribute_category` to include only user-relevant categories
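
Three of the rules above are easy to misread, so here is a rough Python sketch of them. It is not the SQL the service runs. It assumes that depth 0 is the deployed package itself (so the lowest depth is the most-derived definition) and that "bytes 13-16" is 1-based, as in T-SQL `SUBSTRING`, which maps to the Python slice `[12:16]`. The function names and row dictionaries are hypothetical.

```python
import struct


def pick_most_derived(rows):
    """Keep one row per (gobject_id, attribute_name): the one with the lowest depth.

    Mirrors selecting ROW_NUMBER() = 1 within each partition ordered by depth.
    """
    best = {}
    for row in rows:
        key = (row["gobject_id"], row["attribute_name"])
        if key not in best or row["depth"] < best[key]["depth"]:
            best[key] = row
    return list(best.values())


def full_tag_reference(tag_name: str, attribute_name: str, is_array: bool) -> str:
    """Build tag_name.attribute_name, with [] appended for arrays."""
    reference = f"{tag_name}.{attribute_name}"
    return reference + "[]" if is_array else reference


def array_dimension(mx_value: bytes) -> int:
    """Decode the little-endian int32 stored at bytes 13-16 (assumed 1-based)."""
    return struct.unpack("<i", mx_value[12:16])[0]
```

If the byte offsets turn out to be 0-based in the real query, only the slice in `array_dimension` changes.
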

### Attributes query (extended)

When `ExtendedAttributes = true`, a more comprehensive query runs that unions two sources:

1. **Primitive attributes**: Joins through `primitive_instance` and `attribute_definition` to include system-level attributes from primitive components. Each attribute carries its `primitive_name` so the address space can group them under their parent variable.
2. **Dynamic attributes**: The same CTE-based query as the standard path, with an empty `primitive_name`.

The `full_tag_reference` for primitive attributes follows the pattern `tag_name.primitive_name.attribute_name` (e.g., `TestMachine_001.AlarmAttr.InAlarm`).

### Change detection query

A single-column query: `SELECT time_of_last_deploy FROM galaxy`. The `galaxy` table contains one row with the timestamp of the most recent deployment.

## Why deployed_package_id Instead of checked_in_package_id

The Galaxy maintains two package references for each object:

- `checked_in_package_id`: the latest saved version, which may include undeployed configuration changes
- `deployed_package_id`: the version currently running on the target platform

The queries filter on `deployed_package_id <> 0` because the OPC UA address space must mirror what is actually running in the Galaxy runtime. Using `checked_in_package_id` would expose attributes and objects that exist in the IDE but have not been deployed, causing mismatches between the OPC UA address space and the MXAccess runtime.

## Platform Scope Filter

When `Scope` is set to `LocalPlatform`, the repository applies a post-query C# filter to restrict the address space to objects hosted by the local platform. This reduces memory footprint, MXAccess subscription count, and address space size on multi-node Galaxy deployments where each OPC UA server instance only needs to serve its own platform's objects.

### How it works

1. **Platform lookup**: A separate `const string` SQL query (`PlatformLookupSql`) reads `platform_gobject_id` and `node_name` from the `platform` table for all deployed platforms. This runs once per hierarchy load.
2. **Platform matching**: The configured `PlatformName` (or `Environment.MachineName` when null) is matched case-insensitively against the `node_name` column. If no match is found, a warning is logged listing the available platforms and the address space is empty.
3. **Host chain collection**: The filter collects the matching platform's `gobject_id`, then iterates the hierarchy to find all `$AppEngine` (category 3) objects whose `HostedByGobjectId` equals the platform. This produces the full set of host gobject_ids under the local platform.
4. **Object inclusion**: All non-area objects whose `HostedByGobjectId` is in the host set are included, along with the hosts themselves.
5. **Area retention**: `ParentGobjectId` chains are walked upward from included objects to pull in ancestor areas, keeping the browse tree connected. Areas that contain no local descendants are excluded.
6. **Attribute filtering**: The set of included `gobject_id` values is cached after `GetHierarchyAsync` and reused by `GetAttributesAsync` to filter attributes to the same scope.
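
Steps 2 through 5 above amount to a set computation over the already-loaded hierarchy. The following is a minimal Python sketch of that computation under stated assumptions, not the `PlatformScopeFilter` implementation: objects and platforms are plain dictionaries, the key names (`parent_gobject_id`, `hosted_by_gobject_id`, `is_area`) are hypothetical, and the warning log on a missed match is omitted.

```python
APP_ENGINE_CATEGORY_ID = 3  # $AppEngine


def filter_to_platform(objects, platforms, platform_name):
    """Return the subset of `objects` that belongs to the named platform."""
    # Step 2: case-insensitive match of the platform name against node_name.
    match = next(
        (p for p in platforms if p["node_name"].lower() == platform_name.lower()),
        None,
    )
    if match is None:
        return []  # no matching platform: the address space stays empty

    # Step 3: the platform plus every AppEngine hosted directly by it.
    platform_id = match["platform_gobject_id"]
    hosts = {platform_id}
    hosts |= {
        o["gobject_id"]
        for o in objects
        if o["category_id"] == APP_ENGINE_CATEGORY_ID
        and o["hosted_by_gobject_id"] == platform_id
    }

    # Step 4: non-area objects hosted by a host, plus the hosts themselves.
    included = {
        o["gobject_id"]
        for o in objects
        if not o["is_area"]
        and (o["hosted_by_gobject_id"] in hosts or o["gobject_id"] in hosts)
    }

    # Step 5: walk parent chains upward so ancestor areas stay in the tree.
    by_id = {o["gobject_id"]: o for o in objects}
    for gobject_id in list(included):
        parent = by_id[gobject_id]["parent_gobject_id"]
        while parent in by_id and parent not in included:
            included.add(parent)
            parent = by_id[parent]["parent_gobject_id"]

    return [o for o in objects if o["gobject_id"] in included]
```

The returned ids are what step 6 would cache and reuse to filter attributes to the same scope.
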

### Design rationale

The filter is applied in C# rather than SQL because project convention `GR-006` requires `const string` SQL queries with no dynamic SQL. The hierarchy query already returns `HostedByGobjectId` and `CategoryId` on every row, so all information needed for filtering is already in memory after the query runs. The only new SQL is the lightweight platform lookup query.

### Configuration

```json
"GalaxyRepository": {
  "Scope": "LocalPlatform",
  "PlatformName": null
}
```

- Set `Scope` to `"LocalPlatform"` to enable filtering. Default is `"Galaxy"` (load everything).
- Set `PlatformName` to an explicit hostname to target a specific platform, or leave null to use the local machine name.

### Startup log

When `LocalPlatform` is active, the startup log shows the filtering result:

```
GalaxyRepository.Scope="LocalPlatform", PlatformName=MYNODE
GetHierarchyAsync returned 49 objects
GetPlatformsAsync returned 2 platform(s)
Scope filter targeting platform 'MYNODE' (gobject_id=1042)
Scope filter retained 25 of 49 objects for platform 'MYNODE'
GetAttributesAsync returned 4206 attributes (extended=true)
Scope filter retained 2100 of 4206 attributes
```

## Change Detection Polling and IRediscoverable

`ChangeDetectionService` runs a background polling loop in the Host process that calls `GetLastDeployTimeAsync` at the configured interval. It compares the returned timestamp against the last known value:

- On the first poll (no previous state), the timestamp is recorded and `OnGalaxyChanged` fires unconditionally
- On subsequent polls, `OnGalaxyChanged` fires only when `time_of_last_deploy` differs from the cached value

When the event fires, the Host re-runs the hierarchy and attribute queries and pushes the result back to the Server via an IPC `RediscoveryNeeded` message. That surfaces on `GalaxyProxyDriver` as the **`IRediscoverable.OnRediscoveryNeeded`** event; the Server's `DriverNodeManager` consumes it and calls `SyncAddressSpace` to compute the diff against the live address space.

The polling approach is used because the Galaxy Repository database does not provide change notifications. The `galaxy.time_of_last_deploy` column updates only on completed deployments, so the polling interval controls how quickly the OPC UA address space reflects Galaxy changes.
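
The compare-and-fire rule described in this section can be sketched as a small state machine. This is an illustrative Python sketch only: `DeployChangeDetector`, `poll_once`, and the injected callables are hypothetical names, and the real `ChangeDetectionService` runs the poll on a timer in a background loop rather than on demand.

```python
from datetime import datetime


class DeployChangeDetector:
    """Fires a callback on the first poll and whenever the deploy time changes."""

    def __init__(self, read_last_deploy_time, on_changed):
        self._read = read_last_deploy_time  # stands in for GetLastDeployTimeAsync
        self._on_changed = on_changed       # stands in for OnGalaxyChanged
        self._last = None
        self._primed = False                # False until the first poll completes

    def poll_once(self) -> bool:
        current = self._read()
        # First poll fires unconditionally; later polls fire only on a change.
        if not self._primed or current != self._last:
            self._primed = True
            self._last = current
            self._on_changed(current)
            return True
        return False
```

Calling `poll_once` at the interval given by `ChangeDetectionIntervalSeconds` would reproduce the behavior in the two bullets above: one event at startup, then one per completed deployment.
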

## TestConnection

`TestConnectionAsync` runs `SELECT 1` against the configured database. This is used at Host startup to verify connectivity before attempting the full hierarchy query.

## Key source files

- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/GalaxyRepositoryService.cs`: SQL queries and data access
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/PlatformScopeFilter.cs`: Platform-based hierarchy and attribute filtering
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/ChangeDetectionService.cs`: Deploy timestamp polling loop
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Configuration/GalaxyRepositoryConfiguration.cs`: Connection, polling, and scope settings
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Domain/PlatformInfo.cs`: Platform-to-hostname DTO
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/Contracts/DiscoveryResponse.cs`: IPC DTO the Host uses to return hierarchy + attribute results across the pipe

@@ -1,165 +0,0 @@
# Galaxy test fixture

Coverage map + gap inventory for the Galaxy driver: out-of-process Host
(net48 x86 MXAccess COM) + Proxy (net10) + Shared protocol.

**TL;DR: Galaxy has the richest test harness in the fleet**: real Host
subprocess spawn, real ZB SQL queries, IPC parity checks against the v1
LmxProxy reference, + live-smoke tests when MXAccess runtime is actually
installed. Gaps are live-plant + failover-shaped: the E2E suite covers the
representative ~50-tag deployment but not large-site discovery stress, real
Rockwell/Siemens PLC enumeration through MXAccess, or ZB SQL Always-On
replica failover.

## What the fixture is

Multi-project test topology:

- **E2E parity**:
  `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/ParityFixture.cs` spawns the
  production `OtOpcUa.Driver.Galaxy.Host.exe` as a subprocess, opens the
  named-pipe IPC, connects `GalaxyProxyDriver` + runs hierarchy / stability
  parity tests against both.
- **Host.Tests**:
  `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Tests/`, direct Host process
  testing (18+ test classes covering alarm discovery, AVEVA prerequisite
  checks, IPC dispatcher, alarm tracker, probe manager, historian
  cluster/quality/wiring, history read, OPC UA attribute mapping,
  subscription lifecycle, reconnect, multi-host proxy, ADS address routing,
  expression evaluation) + `GalaxyRepositoryLiveSmokeTests` that hit real
  ZB SQL.
- **Proxy.Tests**: `GalaxyProxyDriver` client contract tests.
- **Shared.Tests**: shared protocol + address model.
- **TestSupport**: test helpers reused across the above.

## How tests skip

- **E2E parity**: `ParityFixture.SkipIfUnavailable()` runs at class init and
  checks Windows-only, ZB SQL reachable on `localhost:1433`, Host EXE built
  in the expected `bin/` folder. Any miss → tests skip.
- **Live-smoke** (`GalaxyRepositoryLiveSmokeTests`): `Assert.Skip` when ZB
  unreachable. A `per project_galaxy_host_installed` memory on this repo's
  dev box notes the MXAccess runtime is installed. The pipe ACL allows the
  configured SID outright; elevation of the caller doesn't matter because
  the per-connection SID check in `PipeServer.VerifyCaller` only compares
  user SIDs (not group membership or integrity level).
- **Unit** tests (Shared, Proxy contract, most Host.Tests) have no skip;
  they run anywhere.
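
The three-part E2E gate described above (platform, SQL reachability, built Host EXE) can be sketched as follows. This is a loose Python illustration of the idea, not `ParityFixture.SkipIfUnavailable()`: the function names, the reason strings, and the one-second TCP timeout are assumptions, and the Host EXE path is left as a caller-supplied parameter because the source does not give it.

```python
import os
import socket
import sys


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True when a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def skip_reason(is_windows: bool, sql_reachable: bool, host_exe_built: bool):
    """Return the first unmet prerequisite, or None when the suite can run."""
    if not is_windows:
        return "Windows-only"
    if not sql_reachable:
        return "ZB SQL not reachable on localhost:1433"
    if not host_exe_built:
        return "Host EXE not built"
    return None


def probe_environment(host_exe_path: str):
    """Gather the three facts from the current machine and evaluate the gate."""
    return skip_reason(
        is_windows=sys.platform == "win32",
        sql_reachable=can_connect("localhost", 1433),
        host_exe_built=os.path.isfile(host_exe_path),
    )
```

Keeping the decision (`skip_reason`) separate from the probing (`probe_environment`) makes the gate itself testable on any machine, which matches the split between skip-gated and run-anywhere tests in the list above.
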

## What it actually covers

### E2E parity suite

- `HierarchyParityTests`: Host address-space hierarchy vs v1 LmxProxy
  reference (same ZB, same Galaxy, same shape)
- `StabilityFindingsRegressionTests`: probe subscription failure
  handling + host-status mutation guard from the v1 stability findings
  backlog

### Host.Tests (representative)

- Alarm discovery → subsystem setup
- AVEVA prerequisite checks (runtime installed, platform deployed, etc.)
- IPC dispatcher: request/response routing over the named pipe
- Alarm tracker state machine
- Probe manager: per-runtime probe subscription + reconnect
- Historian cluster / quality / wiring: Aveva Historian integration
- OPC UA attribute mapping
- Subscription lifecycle + reconnect
- Multi-host proxy routing
- ADS address routing + expression evaluation (Galaxy's legacy expression
  language)

### Live-smoke

- `GalaxyRepositoryLiveSmokeTests`: real SQL against ZB database, verifies
  the ZB schema + `LocalPlatform` scope filter + change-detection query
  shape match production.

### Capability surfaces hit

All of them: `IDriver`, `IReadable`, `IWritable`, `ITagDiscovery`,
`ISubscribable`, `IHostConnectivityProbe`, `IPerCallHostResolver`,
`IAlarmSource`, `IHistoryProvider`. Galaxy is the only driver where every
interface sees both contract + real-integration coverage.

## What it does NOT cover

### 1. MXAccess COM by default

The E2E parity suite backs subscriptions via the DB-only path; MXAccess COM
integration opts in via a separate live-smoke. So "does the MXAccess STA
pump correctly handle real Wonderware runtime events" is exercised only
when the operator runs live smoke on a machine with MXAccess installed.

### 2. Real Rockwell / Siemens PLC enumeration

Galaxy runtime talks to PLCs through MXAccess (Device Integration Objects).
The CI parity suite uses a representative ~50-tag deployment; large sites
(1000+ tag hierarchies, multi-Galaxy replication, deeply-nested templates)
are not stressed.

### 3. ZB SQL Always-On failover

Live-smoke hits a single SQL instance. Real production ZB often runs on
Always-On availability groups; replica failover behavior is not tested.

### 4. Galaxy replication / backup-restore

Galaxy supports backup + partial replication across platforms; these
rewrite the ZB schema in ways that change the contained_name vs tag_name
mapping. Not exercised.

### 5. Historian failover

Aveva Historian can be clustered. `historian cluster / quality` tests
verify the cluster-config query; they don't exercise actual failover
(primary dies → secondary takes over mid-HistoryRead).

### 6. AVEVA runtime version matrix

MXAccess COM contract varies subtly across System Platform 2017 / 2020 /
2023. The live-smoke runs against whatever version is installed on the dev
box; CI has no AVEVA installed at all (licensing + footprint).

## When to trust the Galaxy suite, when to reach for a live plant

| Question | E2E parity | Live-smoke | Real plant |
| --- | --- | --- | --- |
| "Does Host spawn + IPC round-trip work?" | yes | yes | yes |
| "Does the ZB schema query match production shape?" | partial | yes | yes |
| "Does MXAccess COM handle runtime reconnect correctly?" | no | yes | yes |
| "Does the driver scale to 1000+ tags on one Galaxy?" | no | partial | yes (required) |
| "Does historian failover mid-read return a clean error?" | no | no | yes (required) |
| "Does System Platform 2023's MXAccess differ from 2020?" | no | partial | yes (required) |
| "Does ZB Always-On replica failover preserve generation?" | no | no | yes (required) |

## Follow-up candidates

1. **System Platform 2023 live-smoke matrix**: set up a second dev box
   running SP2023; run the same live-smoke against both to catch COM-contract
   drift early.
2. **Synthetic large-site fixture**: script a ZB populator that creates a
   1000-Equipment / 20000-tag hierarchy, run the parity suite against it.
   Catches O(N) → O(N²) discovery regressions.
3. **Historian failover scripted test**: with a two-node AVEVA Historian
   cluster, tear down primary mid-HistoryRead + verify the driver's failover
   behavior + error surface.
4. **ZB Always-On CI**: SQL Server 2022 on Linux supports Always-On;
   could stand up a two-replica group for replica-failover coverage.

This is already the best-tested driver; the remaining work is site-scale +
production-topology coverage, not capability coverage.

## Key fixture / config files

- `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/ParityFixture.cs`: E2E fixture
  that spawns Host + connects Proxy
- `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Tests/GalaxyRepositoryLiveSmokeTests.cs`:
  live ZB smoke with `Assert.Skip` gate
- `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.TestSupport/`: shared helpers
- `docs/drivers/Galaxy.md`: COM bridge + STA pump + IPC architecture
- `docs/drivers/Galaxy-Repository.md`: ZB SQL reader + `LocalPlatform`
  scope filter + change detection
- `docs/v2/aveva-system-platform-io-research.md`: MXAccess + Wonderware
  background

@@ -1,211 +1,92 @@
# Galaxy Driver

The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the `ArchestrA.MxAccess` COM API plus the Galaxy Repository SQL database. It is one driver of seven in the OtOpcUa platform (see [drivers/README.md](README.md) for the full list); all other drivers run in-process in the main Server (.NET 10 x64). Galaxy is the exception: it runs as its own Windows service and talks to the Server over a local named pipe.

The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies. It is a **Tier-A in-process driver** that runs in the OtOpcUa server's .NET 10 AnyCPU process and speaks gRPC to a separately installed `mxaccessgw` server (sibling repo at `c:\Users\dohertj2\Desktop\mxaccessgw\`). The gateway owns the MXAccess COM apartment, the STA + Win32 message pump, the Galaxy Repository SQL reader, and the Historian SDK: all the bits that need x86 / .NET Framework 4.8 / COM interop. The driver itself is platform-agnostic and contains no COM, no STA thread, and no x86 bitness constraint.

For the decision record on why Galaxy is out-of-process and how the refactor was staged, see [docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver](../v2/plan.md). For the full driver spec (addressing, data-type map, config shape), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md).

For the driver spec (capability surface, config shape, addressing), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md). For the gateway setup recipe, see [docs/v2/Galaxy.ParityRig.md](../v2/Galaxy.ParityRig.md). For tracing, metrics, and soak profile, see [docs/v2/Galaxy.Performance.md](../v2/Galaxy.Performance.md).

## Project Split

> **Note**: the related driver docs `Galaxy-Repository.md` and `Galaxy-Test-Fixture.md` describe the previous v1 / out-of-process topology and are being moved to `docs/v1/` by a parallel cleanup track. Use `Galaxy.ParityRig.md` and the `mxaccessgw` repo for current testing.

Galaxy ships as three projects:

## Architecture

| Project | Target | Role |
|---------|--------|------|
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` | .NET Standard 2.0 | IPC contracts (MessagePack records + `MessageKind` enum) referenced by both sides |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` | .NET Framework 4.8 **x86** | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` | .NET 10 (matches Server) | `GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe`, loaded in-process by the Server; every call forwards over the pipe to the Host |

The Shared assembly is the **only** contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process.

## Why Out-of-Process

Two reasons drive the split, per `docs/v2/plan.md`:

1. **Bitness constraint.** MXAccess is 32-bit COM only: `ArchestrA.MxAccess.dll` in `Program Files (x86)\ArchestrA\Framework\bin` has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit.
2. **Tier-C stability isolation.** Galaxy is classified Tier C in [docs/v2/driver-stability.md](../v2/driver-stability.md): the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, an AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host with a crash-loop circuit breaker.

The same Tier-C isolation story applies to FOCAS (decision record in `docs/v2/plan.md` §7), which is the second out-of-process driver.

## IPC Transport

`GalaxyProxyDriver` → `GalaxyIpcClient` → named pipe → `Galaxy.Host` pipe server.

- Pipe name: `otopcua-galaxy-{DriverInstanceId}` (localhost-only, no TCP surface)
- Wire format: MessagePack-CSharp, length-prefixed frames
- ACL: pipe is created with a DACL that grants `ReadWrite | Synchronize` only to the configured Server service-principal SID + denies `LocalSystem`. The per-connection SID check in `PipeServer.VerifyCaller` is the real authorization boundary: any caller whose impersonated token SID doesn't match the allowed SID is dropped before the first frame is read.
- Handshake: Proxy presents a shared secret at `OpenSessionRequest`; Host rejects anything else with `MessageKind.OpenSessionResponse{Success=false}`
- Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host

Every capability call on `GalaxyProxyDriver` (Read, Write, Subscribe, HistoryRead*, etc.) serializes a `*Request`, awaits the matching `*Response` via a `CallAsync<TReq, TResp>` helper, and rehydrates the result into the `Core.Abstractions` shape the Server expects.
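
"Length-prefixed frames" means each message is preceded by its byte count so the reader knows where one message ends and the next begins. The sketch below illustrates the general technique in Python over any byte stream. The source does not state the prefix width or byte order, so the 4-byte big-endian length used here is purely an assumption, and the payload is treated as opaque bytes rather than real MessagePack.

```python
import io
import struct

LENGTH_PREFIX = struct.Struct(">I")  # assumed: 4-byte big-endian unsigned length


def write_frame(stream, payload: bytes) -> None:
    """Write one frame: the payload length, then the payload itself."""
    stream.write(LENGTH_PREFIX.pack(len(payload)))
    stream.write(payload)


def read_exact(stream, count: int) -> bytes:
    """Read exactly `count` bytes, looping because streams may return short reads."""
    buffer = bytearray()
    while len(buffer) < count:
        chunk = stream.read(count - len(buffer))
        if not chunk:
            raise EOFError("stream closed mid-frame")
        buffer.extend(chunk)
    return bytes(buffer)


def read_frame(stream) -> bytes:
    """Read one frame written by write_frame and return its payload."""
    (length,) = LENGTH_PREFIX.unpack(read_exact(stream, LENGTH_PREFIX.size))
    return read_exact(stream, length)
```

A request/response helper in the spirit of `CallAsync<TReq, TResp>` would serialize the request, call `write_frame`, then `read_frame` and deserialize the reply.
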

## STA Thread Requirement (Host-side)

MXAccess COM objects (`LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls) must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption.

`StaComThread` in the Host provides that thread with the apartment state set before the thread starts:

```csharp
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
```

```
+---------------------------------------+
|  OtOpcUa.Server (.NET 10 AnyCPU)      |
|    GalaxyDriver (in-process)          |
|      ITagDiscovery / IReadable /      |
|      IWritable / ISubscribable /      |
|      IRediscoverable /                |
|      IHostConnectivityProbe           |
+-------------------+-------------------+
                    |
      gRPC (default http://localhost:5120)
                    |
                    v
+---------------------------------------+
|  mxaccessgw (sibling repo)            |
|  +-------------------------------+    |
|  |  MxGateway.Worker (x86 net48) |    |
|  |  STA + WM_APP pump            |    |
|  |  ArchestrA.MxAccess COM       |    |
|  |  Galaxy Repository SQL        |    |
|  |  Wonderware Historian SDK     |    |
|  +-------------------------------+    |
+---------------------------------------+
```

Work items queue via `RunAsync(Action)` or `RunAsync<T>(Func<T>)` into a `ConcurrentQueue<Action>` and post `WM_APP` to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread, including the IPC handler thread that receives the inbound pipe request.
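
The queue-plus-completion pattern described here (marshal every call onto one dedicated thread, hand the caller something awaitable) can be sketched in Python. This is an analogy under stated assumptions, not `StaComThread`: there is no COM apartment and no Win32 message pump, a blocking `queue.Queue` stands in for the `WM_APP` wake-up, and `concurrent.futures.Future` stands in for `TaskCompletionSource`. The class and method names are hypothetical.

```python
import queue
import threading
from concurrent.futures import Future


class SingleThreadDispatcher:
    """Runs every submitted callable on one dedicated worker thread."""

    def __init__(self, name: str = "worker"):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, name=name, daemon=True)
        self._thread.start()

    def run_async(self, func) -> Future:
        """Queue `func` and return a Future the caller can wait on from any thread."""
        future = Future()
        self._queue.put((func, future))
        return future

    def _run(self) -> None:
        while True:
            item = self._queue.get()  # blocks until work arrives
            if item is None:          # sentinel posted by stop()
                return
            func, future = item
            try:
                future.set_result(func())
            except Exception as exc:  # surface failures to the waiting caller
                future.set_exception(exc)

    def stop(self) -> None:
        """Drain already-queued work, then end the worker thread."""
        self._queue.put(None)
        self._thread.join()
```

Because all callables execute on the same thread, objects that are only safe to touch from that thread never see a cross-thread call, which is the property the STA requirement is protecting.
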
|
||||
History reads + alarm-condition tracking moved server-side in PR 7.2 (`IHistoryRouter`, `AlarmConditionService`). Galaxy no longer implements `IHistoryProvider` or `IAlarmSource` of its own.
|
||||
|
||||
## Win32 Message Pump (Host-side)
|
||||
## Project Layout
|
||||
|
||||
COM callbacks (`OnDataChange`, `OnWriteComplete`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump via P/Invoke:
|
||||
The driver ships as a single project: `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/` (.NET 10, AnyCPU). Sub-folders:
|
||||
|
||||
1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
2. `GetMessage` blocks until a message arrives
3. `WM_APP` drains the work queue
4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
5. All other messages go through `TranslateMessage` / `DispatchMessage` for COM callback delivery
|
||||
| Folder | Role |
|--------|------|
| `Browse/` | Static-side discovery: `GalaxyDiscoverer` walks the gateway's hierarchy + attribute-set RPCs, `DataTypeMap` and `SecurityMap` translate Galaxy types and security classifications into OPC UA equivalents, `AlarmRefBuilder` extracts alarm-bearing attribute references for the server-layer alarm engine. `IGalaxyHierarchySource` + `GatewayGalaxyHierarchySource` + `TracedGalaxyHierarchySource` decorate the gateway browse RPC; `IGalaxyDeployWatchSource` + `GatewayGalaxyDeployWatchSource` + `DeployWatcher` drive `IRediscoverable`. |
| `Runtime/` | Live data path: `EventPump` runs the gateway's `StreamEvents` RPC and fans out to subscribers via a bounded channel; `GalaxyMxSession` is the read-side handle; `GatewayGalaxySubscriber` + `GatewayGalaxyDataWriter` (each with a `Traced*` decorator) implement `ISubscribable` / `IWritable`; `SubscriptionRegistry` tracks subscription state for replay; `ReconnectSupervisor` owns the backoff loop and triggers `ReplaySubscriptions` on session loss; `StatusCodeMap` translates gateway StatusCodes to OPC UA; `MxValueDecoder` / `MxValueEncoder` handle scalar + array marshalling; `GalaxyTelemetry` + `GalaxySubscriptionHandle` round out the surface. |
| `Health/` | `HostStatusAggregator` rolls per-platform probe state into the driver's `IHostConnectivityProbe` view; `PerPlatformProbeWatcher` listens on the gateway's per-host status stream; `HostConnectivityForwarder` pushes transitions out to the server's connectivity bus. |
| `Config/` | `GalaxyDriverOptions` and the four nested option records (`GalaxyGatewayOptions`, `GalaxyMxAccessOptions`, `GalaxyRepositoryOptions`, `GalaxyReconnectOptions`). |
|
||||
|
||||
Without this pump MXAccess callbacks never fire and the driver delivers no live data.
|
||||
Project root files:
|
||||
|
||||
## LMXProxyServer COM Object
|
||||
- `GalaxyDriver.cs` — `IDriver` + capability-interface implementation; composes the Browse / Runtime / Health collaborators.
|
||||
- `GalaxyDriverFactoryExtensions.cs` — DI registration helper used by the server's driver bootstrap.
|
||||
|
||||
`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle:
|
||||
## Capability Surface
|
||||
|
||||
1. **`Register(clientName)`** — Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, calls `Register` to obtain a connection handle
|
||||
2. **`Unregister(handle)`** — Unwires event handlers, calls `Unregister`, releases the COM object via `Marshal.ReleaseComObject`
|
||||
`GalaxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable, IHostConnectivityProbe, IDisposable`.
|
||||
|
||||
## Register / AddItem / AdviseSupervisory Pattern
|
||||
| Capability | Implementation entry point |
|------------|---------------------------|
| `ITagDiscovery` | `Browse/GalaxyDiscoverer.cs` |
| `IRediscoverable` | `Browse/DeployWatcher.cs` |
| `IReadable` | `Runtime/GalaxyMxSession.cs` |
| `IWritable` | `Runtime/GatewayGalaxyDataWriter.cs` |
| `ISubscribable` | `Runtime/GatewayGalaxySubscriber.cs` (driven by `EventPump`) |
| `IHostConnectivityProbe` | `Health/HostStatusAggregator.cs` |
|
||||
|
||||
Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:
|
||||
## Configuration
|
||||
|
||||
1. **`AddItem(handle, address)`** — Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
|
||||
2. **`AdviseSupervisory(handle, itemHandle)`** — Subscribes the item for supervisory data-change callbacks
|
||||
3. The runtime begins delivering `OnDataChange` events
|
||||
`DriverConfig` JSON binds to `Config/GalaxyDriverOptions.cs`. The four sections are:
|
||||
|
||||
For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value; `OnWriteComplete` confirms or rejects. Cleanup reverses: `UnAdviseSupervisory` then `RemoveItem`.
|
||||
- **`Gateway`** — endpoint, API key secret ref, TLS knobs, connect/call/stream timeouts. `StreamTimeoutSeconds = 0` keeps the long-lived `StreamEvents` RPC open for the driver's lifetime.
|
||||
- **`MxAccess`** — `ClientName` (must be unique per OtOpcUa instance — redundancy pairs enforce uniqueness at install time), `PublishingIntervalMs` (forwarded as `buffered_update_interval_ms` on subscribe), `WriteUserId` for ArchestrA secured-write, `EventPumpChannelCapacity` (default 50_000 — one second of headroom at 50k tags / 1Hz; tune via the `galaxy.events.dropped` metric).
|
||||
- **`Repository`** — `DiscoverPageSize`, `WatchDeployEvents`.
|
||||
- **`Reconnect`** — `InitialBackoffMs`, `MaxBackoffMs`, `ReplayOnSessionLost` (calls the gateway's `ReplaySubscriptions` RPC after reconnect rather than re-issuing subscribe-bulk for every tag).
|
||||
|
||||
## OnDataChange and OnWriteComplete Callbacks
|
||||
Full per-field descriptions live in `Config/GalaxyDriverOptions.cs`. The full JSON skeleton is reproduced in [docs/v2/driver-specs.md §1](../v2/driver-specs.md).
|
||||
|
||||
### OnDataChange
|
||||
## Reconnect + Replay
|
||||
|
||||
Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in `MxAccessClient.EventHandlers.cs`:
|
||||
`ReconnectSupervisor` owns an exponential-backoff loop bounded by `Reconnect.InitialBackoffMs` / `MaxBackoffMs`. On session loss it tears down the gRPC channel, redials, and — when `ReplayOnSessionLost = true` — calls the gateway's `ReplaySubscriptions` RPC with the cached subscription set from `SubscriptionRegistry` instead of re-subscribing tag-by-tag. The gateway's worker then re-issues `AdviseSupervisory` server-side under the apartment lock.
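As a rough illustration of a backoff bounded by `InitialBackoffMs` and `MaxBackoffMs`, the sketch below computes the delay for each retry. The doubling factor, the absence of jitter, and the names `BackoffSchedule` and `DelayMs` are assumptions made for the example; the text above only states that the loop is exponential and bounded.

```csharp
using System;

// Hypothetical helper; not taken from ReconnectSupervisor.
public static class BackoffSchedule
{
    // Delay before the given zero-based attempt: initial * 2^attempt, capped at maxBackoffMs.
    public static int DelayMs(int attempt, int initialBackoffMs, int maxBackoffMs)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));
        long delay = initialBackoffMs;
        // Stop doubling once the cap is reached so large attempt counts cannot overflow.
        for (int i = 0; i < attempt && delay < maxBackoffMs; i++)
            delay *= 2;
        return (int)Math.Min(delay, maxBackoffMs);
    }
}
```

With an initial value of 500 ms and a cap of 30000 ms (both illustrative), the first attempts wait 500, 1000, 2000 ms and so on until the cap is hit.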
|
||||
|
||||
1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
2. Maps the MXAccess quality code to the internal `Quality` enum
3. Checks `MXSTATUS_PROXY` for error details and adjusts quality
4. Converts the timestamp to UTC
5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
   - The stored per-tag subscription callback
   - Any pending one-shot read completions
   - The global `OnTagValueChanged` event (consumed by the Host's subscription dispatcher, which packages changes into `DataChangeEventArgs` and forwards them over the pipe to `GalaxyProxyDriver.OnDataChange`)
|
||||
## Testing
|
||||
|
||||
### OnWriteComplete
|
||||
- **Unit tests**: `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/` — fakes the gateway gRPC surface; covers Browse, Runtime, Health, and Config in isolation.
|
||||
- **Parity rig + dev-rig walkthrough**: see [docs/v2/Galaxy.ParityRig.md](../v2/Galaxy.ParityRig.md). The rig stands up a real `mxaccessgw` against a live Galaxy and exercises the full read / write / subscribe / rediscover path.
|
||||
- **Performance + soak**: see [docs/v2/Galaxy.Performance.md](../v2/Galaxy.Performance.md).
|
||||
|
||||
Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource<bool>` for the item handle. If `MXSTATUS_PROXY.success == 0` the write is considered failed and the error detail is logged.
|
||||
## Operational Notes
|
||||
|
||||
## Reconnection Logic
|
||||
|
||||
`MxAccessClient` implements automatic reconnection through two mechanisms.
|
||||
|
||||
### Monitor loop
|
||||
|
||||
`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle:
|
||||
|
||||
- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
|
||||
- If connected and a probe tag is configured, it checks the probe staleness threshold
|
||||
|
||||
### Reconnect sequence
|
||||
|
||||
`ReconnectAsync` performs a full disconnect-then-connect cycle:
|
||||
|
||||
1. Increment the reconnect counter
|
||||
2. `DisconnectAsync` — tear down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detach COM event handlers, call `Unregister`, clear all handle mappings
|
||||
3. `ConnectAsync` — create a fresh `LMXProxyServer`, register, replay all stored subscriptions, re-subscribe the probe tag
|
||||
|
||||
Stored subscriptions (`_storedSubscriptions`) persist across reconnects. `ReplayStoredSubscriptionsAsync` iterates the stored entries and calls `AddItem` + `AdviseSupervisory` for each.
|
||||
|
||||
## Probe Tag Health Monitoring
|
||||
|
||||
A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange`. The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
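The staleness comparison described above reduces to a single time check. The sketch below is illustrative only: `ProbeStaleness` and `IsStale` are hypothetical names, and whether the real comparison is strict or inclusive is an assumption.

```csharp
using System;

// Hypothetical helper mirroring the monitor loop's staleness test.
public static class ProbeStaleness
{
    // True when the probe has not delivered a value within the configured window.
    public static bool IsStale(DateTime lastProbeValueTimeUtc, DateTime nowUtc, int probeStaleThresholdSeconds)
        => nowUtc - lastProbeValueTimeUtc > TimeSpan.FromSeconds(probeStaleThresholdSeconds);
}
```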
|
||||
|
||||
## Per-Host Runtime Status Probes (`<Host>.ScanState`)
|
||||
|
||||
Separate from the connection-level probe, the driver advises `<HostName>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
|
||||
|
||||
Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](../Configuration.md#mxaccess) for the two config fields.
|
||||
|
||||
### How it works
|
||||
|
||||
`GalaxyRuntimeProbeManager` lives in `Driver.Galaxy.Host` alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped):
|
||||
|
||||
1. **Discovery** — After the Host completes `BuildAddressSpace`, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
|
||||
2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors) means **Stopped**.
|
||||
3. **On-change-only delivery** — `ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. `Tick()` does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
|
||||
4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown`. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped".
|
||||
5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Stability review 2026-04-13 Finding 1.
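Steps 2 and 3 above can be condensed into a small per-host state machine. The sketch below is an assumption-laden illustration rather than the `GalaxyRuntimeProbeManager` code: `HostProbe`, `OnProbeCallback`, and the constructor parameters are hypothetical, and the quality check is reduced to a boolean.

```csharp
using System;

public enum HostState { Unknown, Running, Stopped }

// Hypothetical per-host probe entry.
public sealed class HostProbe
{
    private readonly DateTime _advisedAtUtc;
    private readonly TimeSpan _unknownTimeout;

    public HostState State { get; private set; } = HostState.Unknown;

    public HostProbe(DateTime advisedAtUtc, TimeSpan unknownTimeout)
    {
        _advisedAtUtc = advisedAtUtc;
        _unknownTimeout = unknownTimeout;
    }

    // Transition predicate: Running only for a good-quality boolean true; anything else is Stopped.
    public void OnProbeCallback(bool qualityIsGood, object value)
    {
        bool isRunning = qualityIsGood && value is bool b && b;
        State = isRunning ? HostState.Running : HostState.Stopped;
    }

    // Only time-based transition: Unknown -> Stopped when no callback arrived in time.
    // Running entries are never aged out, because ScanState is delivered on change only.
    public void Tick(DateTime nowUtc)
    {
        if (State == HostState.Unknown && nowUtc - _advisedAtUtc >= _unknownTimeout)
            State = HostState.Stopped;
    }
}
```

The important property is that `Tick` never touches a `Running` entry, so a healthy host that stays silent for hours is left alone.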
|
||||
|
||||
### Subtree quality invalidation on transition
|
||||
|
||||
When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. **Stopped → Running** calls `ClearHostVariablesBadQuality` to reset each to `Good` so the next on-change MXAccess update repopulates the value.
|
||||
|
||||
The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
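One way to picture the map construction is the sketch below. It is a guess at the shape, not the Host's code: `GalaxyObject`, `IsHost`, and `HostedVariablesMap` are hypothetical, object ids stand in for the variables they own, and the walk continues past the nearest host so that a Platform also collects everything under its Engines, as the example in the paragraph above implies.

```csharp
using System.Collections.Generic;

// Hypothetical row shape; IsHost marks Platforms and Engines.
public sealed record GalaxyObject(int GobjectId, int HostedByGobjectId, bool IsHost);

public static class HostedVariablesMap
{
    // Maps each Platform/Engine id to the ids of every object hosted beneath it.
    public static Dictionary<int, HashSet<int>> Build(IEnumerable<GalaxyObject> objects)
    {
        var byId = new Dictionary<int, GalaxyObject>();
        foreach (var o in objects) byId[o.GobjectId] = o;

        var map = new Dictionary<int, HashSet<int>>();
        foreach (var obj in byId.Values)
        {
            var visited = new HashSet<int> { obj.GobjectId }; // guards against cycles
            int parentId = obj.HostedByGobjectId;
            while (byId.TryGetValue(parentId, out var parent) && visited.Add(parentId))
            {
                if (parent.IsHost)
                {
                    if (!map.TryGetValue(parent.GobjectId, out var set))
                        map[parent.GobjectId] = set = new HashSet<int>();
                    set.Add(obj.GobjectId);
                }
                parentId = parent.HostedByGobjectId;
            }
        }
        return map;
    }
}
```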
|
||||
|
||||
### Read-path short-circuit (`IsTagUnderStoppedHost`)
|
||||
|
||||
The Host's Read handler checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MXAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
|
||||
|
||||
### Deferred dispatch — the STA deadlock
|
||||
|
||||
**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the subscription dispatcher lock, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. This is a classic circular wait: the first real deploy of this feature hung within 30 seconds from exactly this pattern.
|
||||
|
||||
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — outside any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural lock acquisition. No circular wait, no STA involvement.
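The deferred-dispatch idea can be sketched as follows. This is a minimal illustration under assumptions: `DeferredTransitionQueue`, `WaitAndDrain`, and the `AutoResetEvent` signal are hypothetical stand-ins for the Host's existing dispatch signal and loop.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical queue that decouples the callback thread from the dispatch thread.
public sealed class DeferredTransitionQueue
{
    private readonly ConcurrentQueue<(int GobjectId, bool Stopped)> _pending =
        new ConcurrentQueue<(int GobjectId, bool Stopped)>();
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);

    // Called from the callback thread: record the transition and return at once, taking no locks.
    public void Enqueue(int gobjectId, bool stopped)
    {
        _pending.Enqueue((gobjectId, stopped));
        _signal.Set();
    }

    // Called from the dispatch thread: wait up to the timeout, then apply everything queued so far.
    public int WaitAndDrain(TimeSpan timeout, Action<int, bool> apply)
    {
        _signal.WaitOne(timeout);
        int drained = 0;
        while (_pending.TryDequeue(out var transition))
        {
            apply(transition.GobjectId, transition.Stopped);
            drained++;
        }
        return drained;
    }
}
```

Because `Enqueue` never blocks, the thread that delivers the callback cannot end up waiting on a lock held by a reader that is itself waiting on that thread.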
|
||||
|
||||
### Dashboard and health surface
|
||||
|
||||
- Admin UI **Galaxy Runtime** panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected)
|
||||
- `HealthCheckService.CheckHealth` rolls overall driver health to `Degraded` when any host is Stopped
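The panel color rule in the first bullet can be written as a small rollup function. The sketch is illustrative: the type names are hypothetical, and both the precedence of gray over the other colors and the green result for an empty host list are assumptions.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum RuntimeState { Unknown, Running, Stopped }
public enum PanelColor { Green, Yellow, Red, Gray }

// Hypothetical rollup of per-host states into one panel color.
public static class RuntimePanel
{
    public static PanelColor Rollup(bool transportConnected, IReadOnlyCollection<RuntimeState> hosts)
    {
        if (!transportConnected) return PanelColor.Gray;                   // transport down
        if (hosts.Contains(RuntimeState.Stopped)) return PanelColor.Red;   // any Stopped
        if (hosts.Contains(RuntimeState.Unknown)) return PanelColor.Yellow; // any Unknown, none Stopped
        return PanelColor.Green;                                            // all Running
    }
}
```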
|
||||
|
||||
See [Status Dashboard](../StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](../Configuration.md#mxaccess) for the config fields.
|
||||
|
||||
## Request Timeout Safety Backstop
|
||||
|
||||
Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). Inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
|
||||
|
||||
On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.
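A bounded wait of this kind might look like the sketch below. The signature of the real `SyncOverAsync.WaitSync` helper is not shown in this document, so `SyncOverAsyncSketch.TryWaitSync` is a hypothetical shape; the caller is assumed to translate a `false` return into `StatusCodes.BadTimeout`.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical bounded sync-over-async wait.
public static class SyncOverAsyncSketch
{
    // Returns true with the result if the task finishes in time; false if the wait timed out.
    // The task is not cancelled on timeout; it keeps running and its result is discarded.
    // A faulted task surfaces its exception from Wait as an AggregateException.
    public static bool TryWaitSync<T>(Task<T> task, TimeSpan timeout, out T result)
    {
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }
        result = default!;
        return false;
    }
}
```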
|
||||
|
||||
`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
|
||||
|
||||
All capability calls at the Server dispatch layer are additionally wrapped by `CapabilityInvoker` (Core/Resilience/) which runs them through a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. `OTOPCUA0001` analyzer enforces the wrap at build time.
|
||||
|
||||
## Why Marshal.ReleaseComObject Is Needed
|
||||
|
||||
The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately drive the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
|
||||
|
||||
## Tag Discovery and Historical Data
|
||||
|
||||
Tag discovery (the Galaxy Repository SQL reader + `LocalPlatform` scope filter) is covered in [Galaxy-Repository.md](Galaxy-Repository.md). The Galaxy driver is `ITagDiscovery` for the Server's bootstrap path and `IRediscoverable` for the on-change-redeploy path.
|
||||
|
||||
Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the `aahClientManaged` SDK and is exposed through the Galaxy driver's `IHistoryProvider` implementation. See [HistoricalDataAccess.md](../HistoricalDataAccess.md).
|
||||
|
||||
## Key source files
|
||||
|
||||
Host-side (`.NET 4.8 x86`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/`):
|
||||
|
||||
- `Backend/MxAccess/StaComThread.cs` — STA thread and Win32 message pump
|
||||
- `Backend/MxAccess/MxAccessClient.cs` — Core client (partial)
|
||||
- `Backend/MxAccess/MxAccessClient.Connection.cs` — Connect / disconnect / reconnect
|
||||
- `Backend/MxAccess/MxAccessClient.Subscription.cs` — Subscribe / unsubscribe / replay
|
||||
- `Backend/MxAccess/MxAccessClient.ReadWrite.cs` — Read and write operations
|
||||
- `Backend/MxAccess/MxAccessClient.EventHandlers.cs` — `OnDataChange` / `OnWriteComplete` handlers
|
||||
- `Backend/MxAccess/MxAccessClient.Monitor.cs` — Background health monitor
|
||||
- `Backend/MxAccess/MxProxyAdapter.cs` — COM object wrapper
|
||||
- `Backend/MxAccess/GalaxyRuntimeProbeManager.cs` — Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
|
||||
- `Backend/Historian/HistorianDataSource.cs` — `aahClientManaged` SDK wrapper (see [HistoricalDataAccess.md](../HistoricalDataAccess.md))
|
||||
- `Ipc/GalaxyIpcServer.cs` — Named-pipe server, message dispatch
|
||||
- `Domain/IMxAccessClient.cs` — Client interface
|
||||
|
||||
Shared (`.NET Standard 2.0`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/`):
|
||||
|
||||
- `Contracts/MessageKind.cs` — IPC message kinds (`ReadRequest`, `HistoryReadRequest`, `OpenSessionResponse`, …)
|
||||
- `Contracts/*.cs` — MessagePack DTOs for every request/response pair
|
||||
|
||||
Proxy-side (`.NET 10`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/`):
|
||||
|
||||
- `GalaxyProxyDriver.cs` — `IDriver`/`ITagDiscovery`/`IReadable`/`IWritable`/`ISubscribable`/`IAlarmSource`/`IHistoryProvider`/`IRediscoverable`/`IHostConnectivityProbe` implementation; every method forwards via `GalaxyIpcClient`
|
||||
- `Ipc/GalaxyIpcClient.cs` — Named-pipe client, `CallAsync<TReq, TResp>`, reconnect on broken pipe
|
||||
- `GalaxyProxySupervisor.cs` — Host-process monitor, crash-loop circuit-breaker, Host relaunch
|
||||
- **MXAccess `ClientName` collisions**: two OtOpcUa instances sharing a `ClientName` cause the older Wonderware session to lose subscription state. Redundancy pairs (decision #149) enforce uniqueness via install scripts.
|
||||
- **Channel saturation**: `galaxy.events.dropped > 0` indicates `EventPump` is back-pressured. Raise `EventPumpChannelCapacity` or investigate downstream slowness in the server-side fan-out.
|
||||
- **Connectivity surface**: per-platform probe state is exposed through `IHostConnectivityProbe` and aggregated by the server's connectivity bus — there is no driver-private dashboard surface anymore. The Admin UI's Host Status panel is the consumer.
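To make the channel-saturation note above concrete, the sketch below shows one way a bounded channel can count dropped events. It is an assumption about the mechanism, not the `EventPump` implementation: `BoundedEventBuffer` is a hypothetical name, and dropping the newest event when the channel is full is a choice made for the example.

```csharp
using System.Threading;
using System.Threading.Channels;

// Hypothetical bounded buffer with a drop counter.
public sealed class BoundedEventBuffer<T>
{
    private readonly Channel<T> _channel;
    private long _dropped;

    public BoundedEventBuffer(int capacity)
    {
        _channel = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait, // TryWrite returns false when full
            SingleReader = true
        });
    }

    public long Dropped => Interlocked.Read(ref _dropped);
    public ChannelReader<T> Reader => _channel.Reader;

    // Non-blocking publish: a full channel counts the event as dropped instead of stalling the producer.
    public bool Publish(T item)
    {
        if (_channel.Writer.TryWrite(item)) return true;
        Interlocked.Increment(ref _dropped);
        return false;
    }
}
```

A counter like `Dropped` is the kind of value a metric such as `galaxy.events.dropped` could report; a non-zero reading means consumers are not keeping up with the configured capacity.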
|
||||
|
||||