Add Galaxy platform scope filter so multi-node deployments can restrict the OPC UA address space to only objects hosted by the local platform, reducing memory footprint and MXAccess subscription count from the full Galaxy (49 objects / 4206 attributes) down to the local subtree (3 objects / 386 attributes on the dev Galaxy).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-16 00:39:11 -04:00
parent c76ab8fdee
commit bc282b6788
13 changed files with 610 additions and 107 deletions


@@ -64,6 +64,12 @@ For array attributes, the `[]` suffix present in `full_tag_reference` is strippe
The hierarchy query returns objects ordered by `parent_gobject_id, tag_name`, but this does not guarantee that a parent appears before all of its children in all cases. `LmxNodeManager.TopologicalSort` performs a depth-first traversal to produce a list where every parent is guaranteed to precede its children. This allows the build loop to look up parent nodes from `_nodeMap` without forward references.
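The parent-before-child ordering can be illustrated with a minimal sketch. This is not the actual `LmxNodeManager.TopologicalSort` code: the `GObjectRow` type, its property names, and the `ParentsFirst` helper are assumptions made for illustration. It shows one way to get the guarantee, by visiting each row's parent chain before emitting the row itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; the real hierarchy DTO is not shown in this document.
public sealed class GObjectRow
{
    public int GobjectId { get; set; }
    public int ParentGobjectId { get; set; }
}

public static class HierarchySortSketch
{
    // Returns the rows reordered so that every parent precedes its children.
    public static List<GObjectRow> ParentsFirst(IReadOnlyList<GObjectRow> rows)
    {
        var byId = rows.ToDictionary(r => r.GobjectId);
        var visited = new HashSet<int>();
        var ordered = new List<GObjectRow>(rows.Count);

        void Visit(GObjectRow row)
        {
            if (!visited.Add(row.GobjectId)) return; // already visited (also stops cycles)
            // Recurse into the parent first when it is part of the input set.
            if (byId.TryGetValue(row.ParentGobjectId, out var parent)) Visit(parent);
            ordered.Add(row);
        }

        foreach (var row in rows) Visit(row);
        return ordered;
    }

    public static void Main()
    {
        // Input deliberately lists the deepest child first.
        var rows = new List<GObjectRow>
        {
            new GObjectRow { GobjectId = 3, ParentGobjectId = 2 },
            new GObjectRow { GobjectId = 2, ParentGobjectId = 1 },
            new GObjectRow { GobjectId = 1, ParentGobjectId = 0 },
        };
        Console.WriteLine(string.Join(", ", ParentsFirst(rows).Select(r => r.GobjectId)));
        // prints "1, 2, 3"
    }
}
```

With this ordering, a build loop can resolve each row's parent from a map of already-created nodes, because the parent has always been processed earlier.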
## Platform Scope Filtering
When `GalaxyRepository.Scope` is set to `LocalPlatform`, the hierarchy and attributes passed to `BuildAddressSpace` are pre-filtered by `PlatformScopeFilter` inside `GalaxyRepositoryService`. The node manager receives only the local platform's objects and their ancestor areas, so the resulting browse tree is a subset of the full Galaxy. The filtering is transparent to `LmxNodeManager` — it builds nodes from whatever data it receives.
Clients browsing a `LocalPlatform`-scoped server will see only the areas and objects hosted by that platform. Areas that exist in the Galaxy but contain no local descendants are excluded. See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter) for the filtering algorithm and configuration.
## Incremental Sync
On address space rebuild (triggered by a Galaxy deploy change), `SyncAddressSpace` uses `AddressSpaceDiff` to identify which `gobject_id` values have changed between the old and new snapshots. Only the affected subtrees are torn down and rebuilt, preserving unchanged nodes and their active subscriptions. Subscriptions on affected nodes are snapshotted before teardown and replayed after the rebuild.
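The snapshot comparison can be sketched as follows. This is an illustration under stated assumptions, not the real `AddressSpaceDiff`: each snapshot is assumed to be a map from `gobject_id` to a string fingerprint of that object's definition, and `ChangedIds` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDiffSketch
{
    // Returns the ids that were added, removed, or whose fingerprint differs.
    public static HashSet<int> ChangedIds(
        IReadOnlyDictionary<int, string> oldSnapshot,
        IReadOnlyDictionary<int, string> newSnapshot)
    {
        var changed = new HashSet<int>();

        // Added or modified objects.
        foreach (var entry in newSnapshot)
        {
            if (!oldSnapshot.TryGetValue(entry.Key, out var previous) || previous != entry.Value)
                changed.Add(entry.Key);
        }

        // Removed objects.
        foreach (var id in oldSnapshot.Keys)
        {
            if (!newSnapshot.ContainsKey(id)) changed.Add(id);
        }

        return changed;
    }

    public static void Main()
    {
        var before = new Dictionary<int, string> { [1] = "a", [2] = "b", [3] = "c" };
        var after = new Dictionary<int, string> { [1] = "a", [2] = "B", [4] = "d" };
        Console.WriteLine(string.Join(", ", ChangedIds(before, after).OrderBy(id => id)));
        // prints "2, 3, 4"
    }
}
```

Only the ids in the returned set would need their subtrees rebuilt; id 1 is unchanged and keeps its nodes and subscriptions.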


@@ -88,6 +88,8 @@ Controls the Galaxy repository database connection used to build the OPC UA addr
| `ChangeDetectionIntervalSeconds` | `int` | `30` | How often the service polls for Galaxy deploy changes |
| `CommandTimeoutSeconds` | `int` | `30` | SQL command timeout for repository queries |
| `ExtendedAttributes` | `bool` | `false` | Load extended Galaxy attribute metadata into the OPC UA model |
| `Scope` | `GalaxyScope` | `"Galaxy"` | Controls how much of the Galaxy hierarchy is loaded. `Galaxy` loads all deployed objects (default). `LocalPlatform` loads only objects hosted by the platform deployed on this machine. See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter) |
| `PlatformName` | `string?` | `null` | Explicit platform hostname for `LocalPlatform` filtering. When null, uses `Environment.MachineName`. Only used when `Scope` is `LocalPlatform` |
### Dashboard
@@ -242,6 +244,7 @@ Three boolean properties act as feature flags that control optional subsystems:
- **`OpcUa.AlarmFilter.ObjectFilters`** -- List of wildcard template-name patterns that scope alarm tracking to matching objects and their descendants. An empty list preserves the current unfiltered behavior; a non-empty list includes an object only when any name in its template derivation chain matches any pattern, then propagates the inclusion to every descendant in the containment hierarchy. `*` is the only wildcard, matching is case-insensitive, and the Galaxy `$` prefix on template names is normalized so operators can write `TestMachine*` instead of `$TestMachine*`. Each list entry may itself contain comma-separated patterns (`"TestMachine*, Pump_*"`) for convenience. When the list is non-empty but `AlarmTrackingEnabled` is `false`, the validator emits a warning because the filter has no effect. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) for the full matching algorithm and telemetry.
- **`Historian.Enabled`** -- When `true`, the service calls `HistorianPluginLoader.TryLoad(config)` to load the `ZB.MOM.WW.LmxOpcUa.Historian.Aveva` plugin from the `Historian/` subfolder next to the host exe and registers the resulting `IHistorianDataSource` with the OPC UA server host. Disabled by default because not all deployments have a Historian instance -- when disabled the plugin is not probed and the Wonderware SDK DLLs are not required on the host. If the flag is `true` but the plugin or its SDK dependencies cannot be loaded, the server still starts and every history read returns `BadHistoryOperationUnsupported` with a warning in the log.
- **`GalaxyRepository.ExtendedAttributes`** -- When `true`, the repository loads additional Galaxy attribute metadata beyond the core set needed for the address space. Disabled by default to minimize startup query time.
- **`GalaxyRepository.Scope`** -- When set to `LocalPlatform`, the repository filters the hierarchy and attributes to only include objects hosted by the platform whose `node_name` matches this machine (or the explicit `PlatformName` override). Ancestor areas are retained to keep the browse tree connected. Default is `Galaxy` (load everything). See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter).
## Configuration Validation
@@ -319,7 +322,9 @@ Integration tests use this constructor to inject substitute implementations of `
"ConnectionString": "Server=localhost;Database=ZB;Integrated Security=true;",
"ChangeDetectionIntervalSeconds": 30,
"CommandTimeoutSeconds": 30,
"ExtendedAttributes": false
"ExtendedAttributes": false,
"Scope": "Galaxy",
"PlatformName": null
},
"Dashboard": {
"Enabled": true,


@@ -12,6 +12,8 @@
| `ChangeDetectionIntervalSeconds` | `30` | Polling frequency for deploy change detection |
| `CommandTimeoutSeconds` | `30` | SQL command timeout for all queries |
| `ExtendedAttributes` | `false` | When true, loads primitive-level attributes in addition to dynamic attributes |
| `Scope` | `Galaxy` | `Galaxy` loads all deployed objects. `LocalPlatform` filters to the local platform's subtree only |
| `PlatformName` | `null` | Explicit platform hostname for `LocalPlatform` filtering. When null, uses `Environment.MachineName` |
The connection uses Windows Authentication because the Galaxy Repository database is local to the System Platform node and secured through domain credentials.
@@ -69,6 +71,54 @@ The Galaxy maintains two package references for each object:
The queries filter on `deployed_package_id <> 0` because the OPC UA server must mirror what is actually running in the Galaxy runtime. Using `checked_in_package_id` would expose attributes and objects that exist in the IDE but have not been deployed, causing mismatches between the OPC UA address space and the MXAccess runtime.
## Platform Scope Filter
When `Scope` is set to `LocalPlatform`, the repository applies a post-query C# filter to restrict the address space to objects hosted by the local platform. This reduces memory footprint, MXAccess subscription count, and address space size on multi-node Galaxy deployments where each OPC UA server instance only needs to serve its own platform's objects.
### How it works
1. **Platform lookup** -- A separate `const string` SQL query (`PlatformLookupSql`) reads `platform_gobject_id` and `node_name` from the `platform` table for all deployed platforms. This runs once per hierarchy load.
2. **Platform matching** -- The configured `PlatformName` (or `Environment.MachineName` when null) is matched case-insensitively against the `node_name` column. If no match is found, a warning listing the available platforms is logged and the filter retains nothing, so the address space is empty.
3. **Host chain collection** -- The filter takes the matching platform's `gobject_id`, then iterates the hierarchy to find all `$AppEngine` (category 3) objects whose `HostedByGobjectId` equals that `gobject_id`. The platform and these engines together form the set of host `gobject_id` values for the local platform.
4. **Object inclusion** -- All non-area objects whose `HostedByGobjectId` is in the host set are included, along with the hosts themselves.
5. **Area retention** -- `ParentGobjectId` chains are walked upward from included objects to pull in ancestor areas, keeping the browse tree connected. Areas that contain no local descendants are excluded.
6. **Attribute filtering** -- The set of included `gobject_id` values is cached after `GetHierarchyAsync` and reused by `GetAttributesAsync` to filter attributes to the same scope.
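Steps 2 through 5 can be sketched as a single function that returns the included `gobject_id` set. This is a minimal sketch and not the actual `PlatformScopeFilter` source: the `PlatformRow` and `HierarchyRow` types, the `IsArea` flag, and the `IncludedIds` name are assumptions. The sketch also retains every ancestor found by the upward walk, whereas the real filter may restrict that walk to areas.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical DTOs; member names follow the prose above, the real types are not shown here.
public sealed class PlatformRow { public int GobjectId; public string NodeName = ""; }
public sealed class HierarchyRow
{
    public int GobjectId, ParentGobjectId, HostedByGobjectId, CategoryId;
    public bool IsArea;
}

public static class ScopeFilterSketch
{
    private const int AppEngineCategory = 3; // "$AppEngine (category 3)" per the steps above

    public static HashSet<int> IncludedIds(
        IReadOnlyList<HierarchyRow> rows, IReadOnlyList<PlatformRow> platforms, string platformName)
    {
        var included = new HashSet<int>();

        // Step 2: case-insensitive match on node_name; no match yields an empty scope.
        var platform = platforms.FirstOrDefault(p =>
            string.Equals(p.NodeName, platformName, StringComparison.OrdinalIgnoreCase));
        if (platform == null) return included;

        // Step 3: hosts are the platform plus the engines it hosts.
        var hosts = new HashSet<int> { platform.GobjectId };
        foreach (var r in rows)
            if (r.CategoryId == AppEngineCategory && r.HostedByGobjectId == platform.GobjectId)
                hosts.Add(r.GobjectId);

        // Step 4: the hosts themselves plus the non-area objects they host.
        foreach (var r in rows)
            if (hosts.Contains(r.GobjectId) || (!r.IsArea && hosts.Contains(r.HostedByGobjectId)))
                included.Add(r.GobjectId);

        // Step 5: walk parent chains upward so the browse tree stays connected.
        var byId = rows.ToDictionary(r => r.GobjectId);
        foreach (var id in included.ToList())
        {
            var current = byId[id];
            while (byId.TryGetValue(current.ParentGobjectId, out var parent) && included.Add(parent.GobjectId))
                current = parent;
        }
        return included;
    }

    public static void Main()
    {
        var platforms = new List<PlatformRow>
        {
            new PlatformRow { GobjectId = 10, NodeName = "NODE-A" },
            new PlatformRow { GobjectId = 30, NodeName = "NODE-B" },
        };
        var rows = new List<HierarchyRow>
        {
            new HierarchyRow { GobjectId = 1, IsArea = true },                          // area with a local object
            new HierarchyRow { GobjectId = 2, IsArea = true },                          // area with only remote objects
            new HierarchyRow { GobjectId = 10, CategoryId = 1 },                        // local platform
            new HierarchyRow { GobjectId = 11, CategoryId = 3, HostedByGobjectId = 10 }, // local engine
            new HierarchyRow { GobjectId = 20, ParentGobjectId = 1, HostedByGobjectId = 11, CategoryId = 5 },
            new HierarchyRow { GobjectId = 30, CategoryId = 1 },                        // remote platform
            new HierarchyRow { GobjectId = 31, CategoryId = 3, HostedByGobjectId = 30 }, // remote engine
            new HierarchyRow { GobjectId = 40, ParentGobjectId = 2, HostedByGobjectId = 31, CategoryId = 5 },
        };
        var ids = IncludedIds(rows, platforms, "node-a");
        Console.WriteLine(string.Join(", ", ids.OrderBy(i => i))); // prints "1, 10, 11, 20"
    }
}
```

Step 6 then amounts to keeping only the attributes whose `gobject_id` is in the returned set, which is why caching the set after the hierarchy load is sufficient.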
### Design rationale
The filter is applied in C# rather than SQL because the project convention `GR-006` requires `const string` SQL queries with no dynamic SQL. The hierarchy query already returns `HostedByGobjectId` and `CategoryId` on every row, so all information needed for filtering is already in memory after the query runs. The only new SQL is the lightweight platform lookup query.
### Configuration
```json
"GalaxyRepository": {
"Scope": "LocalPlatform",
"PlatformName": null
}
```
- Set `Scope` to `"LocalPlatform"` to enable filtering. Default is `"Galaxy"` (load everything, backward compatible).
- Set `PlatformName` to an explicit hostname to target a specific platform, or leave null to use the local machine name.
### Startup log
When `LocalPlatform` is active, the startup log shows the filtering result:
```
GalaxyRepository.Scope="LocalPlatform", PlatformName=MYNODE
GetHierarchyAsync returned 49 objects
GetPlatformsAsync returned 2 platform(s)
Scope filter targeting platform 'MYNODE' (gobject_id=1042)
Scope filter retained 25 of 49 objects for platform 'MYNODE'
GetAttributesAsync returned 4206 attributes (extended=true)
Scope filter retained 2100 of 4206 attributes
```
## Change Detection Polling
`ChangeDetectionService` runs a background polling loop that calls `GetLastDeployTimeAsync` at the configured interval. It compares the returned timestamp against the last known value:
@@ -87,5 +137,7 @@ The polling approach is used because the Galaxy Repository database does not pro
## Key source files
- `src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/GalaxyRepositoryService.cs` -- SQL queries and data access
- `src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/PlatformScopeFilter.cs` -- Platform-based hierarchy and attribute filtering
- `src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs` -- Deploy timestamp polling loop
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/GalaxyRepositoryConfiguration.cs` -- Connection and polling settings
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/GalaxyRepositoryConfiguration.cs` -- Connection, polling, and scope settings
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/PlatformInfo.cs` -- Platform-to-hostname DTO


@@ -1,104 +0,0 @@
# Stability Review - 2026-04-13
## Scope
Re-review of the updated `lmxopcua` codebase with emphasis on stability, shutdown behavior, async usage, latent deadlock patterns, and silent failure modes.
Validation run for this review:
```powershell
dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore
```
Result: `471/471` tests passed in approximately `3m18s`.
## Confirmed Findings
### 1. Probe state is published before the subscription succeeds
Severity: High
File references:
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:193`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:201`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:222`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:225`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:343`
`SyncAsync` adds entries to `_byProbe` and `_probeByGobjectId` before `SubscribeAsync` completes. If the advise call fails, the catch block logs the failure but leaves the probe registered internally. `Tick()` later treats that entry as a real advised probe that never produced an initial callback and transitions it from `Unknown` to `Stopped`.
That creates a false-negative health signal: a host can be marked stopped even though the real problem was "subscription never established". In this codebase that distinction matters because runtime-host state is later used to suppress or degrade published node quality.
Recommendation: only commit the new probe entry after a successful subscribe, or roll the dictionaries back in the catch path. Add a regression test for subscribe failure in `GalaxyRuntimeProbeManagerTests`.
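The "commit only after a successful subscribe" option could look like the sketch below. It is a simplified illustration, not the real `GalaxyRuntimeProbeManager`: the `ProbeRegistrySketch` class, the single dictionary, and the injected subscribe delegate are all hypothetical, and the locking a real implementation needs is omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the probe manager's bookkeeping; names are illustrative only.
public sealed class ProbeRegistrySketch
{
    private readonly Dictionary<string, int> _byProbe = new Dictionary<string, int>();
    private readonly Func<string, Task> _subscribeAsync;

    public ProbeRegistrySketch(Func<string, Task> subscribeAsync)
    {
        _subscribeAsync = subscribeAsync;
    }

    public bool IsRegistered(string probeTag) => _byProbe.ContainsKey(probeTag);

    // Registers the probe only after the subscribe call has completed successfully.
    public async Task<bool> TryAddAsync(string probeTag, int gobjectId)
    {
        try
        {
            await _subscribeAsync(probeTag);
        }
        catch (Exception ex)
        {
            // Nothing was committed yet, so there is no state to roll back.
            Console.WriteLine($"Subscribe failed for {probeTag}: {ex.Message}");
            return false;
        }

        _byProbe[probeTag] = gobjectId;
        return true;
    }

    public static async Task Main()
    {
        // The delegate simulates an advise call that fails for one tag.
        var registry = new ProbeRegistrySketch(tag =>
            tag == "Bad.ScanState"
                ? Task.FromException(new InvalidOperationException("advise rejected"))
                : Task.CompletedTask);

        await registry.TryAddAsync("Good.ScanState", 1);
        await registry.TryAddAsync("Bad.ScanState", 2);
        Console.WriteLine(registry.IsRegistered("Good.ScanState")); // True
        Console.WriteLine(registry.IsRegistered("Bad.ScanState"));  // False
    }
}
```

Because a failed probe never enters the registry, a periodic tick has no entry to time out, so a subscribe failure cannot be misreported as a stopped host.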
### 2. Service startup still ignores dashboard bind failure
Severity: Medium
File references:
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:50`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:307`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:308`
`StatusWebServer.Start()` now correctly returns `bool`, but `OpcUaService.Start` still ignores that result. The service can therefore continue through startup and report success even when the dashboard failed to bind.
This is not a process-crash bug, but it is still an operational stability issue because the service advertises a successful start while one of its enabled endpoints is unavailable.
Recommendation: decide whether dashboard startup failure is fatal or degraded mode, then implement that policy explicitly. At minimum, surface the failure in service startup state instead of dropping the return value.
### 3. Sync-over-async remains on critical request and rebuild paths
Severity: Medium
File references:
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:572`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1708`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1782`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2022`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2100`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2154`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2220`
The updated code removed some blocking work from lock scopes, but several service-critical paths still call async MXAccess operations synchronously with `.GetAwaiter().GetResult()`. That pattern appears in address-space rebuild, direct read/write handling, and historian reads.
I did not reproduce a deadlock in tests, but the pattern is still a stability risk because request threads now inherit backend latency directly and can stall hard if the underlying async path hangs, blocks on its own scheduler, or experiences slow reconnect behavior.
Recommendation: keep the short synchronous boundary only where the external API forces it, and isolate backend calls behind bounded timeouts or dedicated worker threads. Rebuild-time probe synchronization is the highest-value place to reduce blocking first.
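One possible shape for a bounded synchronous boundary is sketched below. `BoundedWaitSketch.TryWait` is a hypothetical helper, not code from the repository, and it uses only standard `Task` APIs. It assumes the caller can report a timeout as a bad status; it does not cancel the underlying operation, which would need a `CancellationToken` supported by the backend call.

```csharp
using System;
using System.Threading.Tasks;

public static class BoundedWaitSketch
{
    // Blocks the caller for at most `timeout` and returns false if the operation
    // has not finished by then. A faulted operation surfaces as an
    // AggregateException, which is the normal behavior of Task.Wait.
    public static bool TryWait<T>(Func<Task<T>> operation, TimeSpan timeout, out T result)
    {
        // Task.Run starts the operation on the thread pool, so its continuations
        // do not depend on the blocked caller's synchronization context.
        Task<T> task = Task.Run<T>(operation);
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }

        result = default(T);
        return false; // the operation keeps running; the caller decides how to report the timeout
    }

    public static void Main()
    {
        bool fast = TryWait(async () => { await Task.Delay(10); return 42; },
            TimeSpan.FromSeconds(5), out int value);
        bool slow = TryWait(async () => { await Task.Delay(5000); return 0; },
            TimeSpan.FromMilliseconds(50), out int _);
        Console.WriteLine($"{fast} {value} {slow}"); // prints "True 42 False"
    }
}
```

A helper like this caps how long a request thread can be held by a slow backend, but each timed-out call still leaves work in flight, so it complements rather than replaces moving the heaviest paths onto dedicated workers.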
### 4. Several background subscribe paths are still fire-and-forget
Severity: Low
File references:
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:858`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1362`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2481`
Alarm auto-subscribe and transferred-subscription restore still dispatch `SubscribeAsync(...)` and attach a fault-only continuation. That is better than dropping exceptions completely, but these operations are still not lifecycle-coordinated. A rebuild or shutdown can move on while subscription work is still in flight.
The practical outcome is transient mismatch rather than memory corruption: expected subscriptions can arrive late, and shutdown/rebuild sequencing is harder to reason about under backend slowness.
Recommendation: track these tasks when ordering matters, or centralize them behind a subscription queue with explicit cancellation and shutdown semantics.
## Verified Improvements Since The Previous Review
The following areas that were previously risky now look materially better in the current code:
- `StaComThread` now checks `PostThreadMessage` failures and faults pending work instead of leaving callers parked indefinitely.
- `HistoryContinuationPointManager` now purges expired continuation points on retrieve and release, not only on store.
- `ChangeDetectionService`, MX monitor, and the status web server now retain background task handles and wait briefly during stop.
- `StatusWebServer` no longer swallows startup failure silently; it returns a success flag and logs the failure.
- Connection string validation now redacts credentials before logging.
## Overall Assessment
The updated code is in better shape than the previous pass. The most serious prior shutdown and leak hazards have been addressed, and the full automated test suite is currently green.
The remaining stability work is concentrated in two areas:
1. Correctness around failed runtime-probe subscription.
2. Reducing synchronous waits and untracked background subscription work in the OPC UA node manager.