fix(core): resolve Medium code-review finding (Core-007)

SubscribeAsync now wraps each driver handle in a private HostBoundHandle
that carries the resolved host name.  UnsubscribeAsync unwraps it and
routes through the recorded host's resilience pipeline, correctly
charging the subscription's originating host's circuit breaker/bulkhead
instead of always using the default host.  Falls back to the default
host for handles not created by this invoker.  Two regression tests
added; update findings.md Open count from 10 to 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-22 08:24:17 -04:00
parent 6cec98caef
commit ce86deca62
3 changed files with 89 additions and 16 deletions

View File

@@ -7,7 +7,7 @@
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 10 |
| Open findings | 6 |
## Checklist coverage
@@ -63,13 +63,13 @@
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrie.cs:80-98` |
| Status | Open |
| Status | Resolved |
**Description:** `WalkSystemPlatform` records every Galaxy folder-segment grant with `NodeAclScopeKind.Equipment` (see the comment at lines 82-86) because `NodeAclScopeKind` has no `FolderSegment` member. The functional union of permission flags is unaffected, but the `MatchedGrant.Scope` carried in `AuthorizationDecision.Provenance` is wrong for Galaxy nodes: a grant anchored at a namespace-root folder and a grant anchored at a deep sub-folder both report `Equipment`, and a namespace-level grant is indistinguishable from a folder-level grant in the audit trail and the Admin UI "Probe this permission" diagnostic. The Phase 6.2 design (adversarial-review item #6) calls for a dedicated `FolderSegment` scope level. The current code is a known shortcut but references only an untracked "TODO" with no issue ID.
**Recommendation:** Add a `FolderSegment` member to `NodeAclScopeKind` and use it in `WalkSystemPlatform` and `PermissionTrieBuilder` so Galaxy folder grants report their true scope. If the enum change is deferred, file a tracked issue and reference its ID in the code comment.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — added `FolderSegment` to `NodeAclScopeKind`; `WalkSystemPlatform` now reports `FolderSegment` instead of `Equipment` for each visited Galaxy folder level; added three regression tests asserting the correct scope is reported in `MatchedGrant.Scope`.
### Core-004
@@ -93,13 +93,13 @@
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs:59-70` |
| Status | Open |
| Status | Resolved |
**Description:** `Prune` mutates the `ConcurrentDictionary` with a plain indexer assignment (`_byCluster[clusterId] = new ClusterEntry(...)`) after a separate `TryGetValue` read. If `Install` runs concurrently for the same cluster, the `AddOrUpdate` in `Install` and the indexer write in `Prune` race: `Prune` can read an entry, `Install` then adds a newer generation via `AddOrUpdate`, and `Prune`'s unconditional indexer write then overwrites the entry — silently dropping the just-installed newest generation and its `Current` pointer. The class is documented as a process-singleton accessed on the hot path while publishes install new tries, so the race is reachable.
**Recommendation:** Make `Prune` use an atomic compare-and-swap loop — `_byCluster.TryUpdate(clusterId, prunedEntry, observedEntry)` retried until it succeeds or the key is gone — or perform the prune inside an `AddOrUpdate` update factory.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — changed `ClusterEntry` from `sealed record` to `sealed class` (enabling reference-equality CAS via `TryUpdate`); `Prune` now uses a read-compute-`TryUpdate` retry loop that restarts if another thread updates the entry between the read and the write; added regression tests asserting the current generation is preserved after a concurrent install + prune sequence.
### Core-006
@@ -108,13 +108,13 @@
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64` |
| Status | Open |
| Status | Resolved |
**Description:** `BuildAddressSpaceAsync` is not guarded against being called more than once. A second call subscribes a second `_alarmForwarder` to `IAlarmSource.OnAlarmEvent` and overwrites the `_alarmForwarder` field, so the first delegate is leaked (still subscribed, never unsubscribed because `Dispose` only removes the field's current value). Every alarm transition would then be delivered to its sink twice. The address-space rebuild path on Galaxy redeploy (`DeployWatcher``IRediscoverable.OnRediscoveryNeeded` → server rebuilds the address space) is exactly the scenario where a node manager could legitimately be re-walked. There is also no check of the `_disposed` flag at the top of the method.
**Recommendation:** Either guard `BuildAddressSpaceAsync` so a second call throws `InvalidOperationException` (document it single-shot), or unsubscribe the previous `_alarmForwarder` and clear `_alarmSinks` before re-walking. Also check `_disposed` and throw `ObjectDisposedException` if already disposed.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — `BuildAddressSpaceAsync` now checks `_disposed` (throws `ObjectDisposedException`) and tears down the previous alarm forwarder + clears the sink registry before re-walking so a Galaxy-redeploy rebuild does not double-subscribe the forwarder; added three regression tests covering double-build, sink-count after rebuild, and post-dispose throw.
### Core-007
@@ -123,13 +123,13 @@
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs:75-83` |
| Status | Open |
| Status | Resolved |
**Description:** `UnsubscribeAsync` always routes through `_defaultHost`, even when an `IPerCallHostResolver` is wired and the original `SubscribeAsync` fanned the subscription out to a non-default host. The `IAlarmSubscriptionHandle` is opaque here and carries no host association, so an unsubscribe for a subscription created against host B runs through host A's resilience pipeline. In a multi-host driver this charges the wrong host's circuit breaker / bulkhead and, if host A is open while host B is healthy, can spuriously block a valid unsubscribe. The XML doc claims it routes "for parity" with `SubscribeAsync` but subscribe is per-host and unsubscribe is not.
**Recommendation:** Carry the resolved host on the `IAlarmSubscriptionHandle` (or in a handle→host map kept by `AlarmSurfaceInvoker`) so `UnsubscribeAsync` routes through the same host's pipeline the subscription was created on.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — `SubscribeAsync` now wraps each driver handle in a `HostBoundHandle` (private `IAlarmSubscriptionHandle` implementation) that carries the resolved host name; `UnsubscribeAsync` unwraps it and routes through the recorded host's pipeline, falling back to the default host for handles not created by this invoker; added two regression tests verifying per-host routing and single-host fallback.
### Core-008