lmxopcua/code-reviews/Core/findings.md

# Code Review — Core

| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 0 |

## Checklist coverage

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core-001, Core-002, Core-003 |
| 2 | OtOpcUa conventions | Core-004 |
| 3 | Concurrency & thread safety | Core-005, Core-006 |
| 4 | Error handling & resilience | Core-007, Core-008 |
| 5 | Security | Core-002 |
| 6 | Performance & resource management | Core-009 |
| 7 | Design-document adherence | Core-002, Core-003 |
| 8 | Code organization & conventions | Core-010 |
| 9 | Testing coverage | Core-011 |
| 10 | Documentation & comments | Core-012 |

## Findings

### Core-001

| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/UserAuthorizationState.cs:50-68` |
| Status | Resolved |

**Description:** `NeedsRefresh` can never return `true` with the default field values. `AuthCacheMaxStaleness` defaults to 5 minutes and `MembershipFreshnessInterval` defaults to 15 minutes. `NeedsRefresh(utcNow)` is defined as `!IsStale(utcNow) && elapsed > MembershipFreshnessInterval`, i.e. it needs `elapsed > 15 min` AND `elapsed <= 5 min` simultaneously — an empty set. The session crosses the staleness ceiling (5 min) and fails closed long before it ever reaches the 15-minute freshness boundary that is supposed to signal "kick off an async re-resolution while still serving cached memberships." Decision #151 / #152 in `docs/v2/implementation/phase-6-2-authorization-runtime.md` intends the freshness window (15 min, re-resolve) to be the inner trigger and the staleness ceiling to be the outer hard limit; with these defaults the ordering is inverted, so the "refresh while warm" path is dead code and every long-lived session hard-fails authorization after 5 minutes.

**Recommendation:** Either swap the defaults so `MembershipFreshnessInterval` (e.g. 5 min) is strictly less than `AuthCacheMaxStaleness` (e.g. 15 min) — matching the doc's stated intent — or, if the 5/15 values are correct, redefine which window is the refresh trigger and which is the fail-closed ceiling. Add a unit test asserting `NeedsRefresh` returns `true` for at least one point in time with the production defaults.

**Resolution:** Resolved 2026-05-22 — swapped the defaults so `MembershipFreshnessInterval` is 5 min and `AuthCacheMaxStaleness` is 15 min (freshness = inner re-resolve trigger, staleness = outer fail-closed ceiling); added a `NeedsRefresh_FiresWithin_ProductionDefault_Windows` regression test.

### Core-002

| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs:24-50` |
| Status | Resolved |

**Description:** `TriePermissionEvaluator.Authorize` never compares the session's `AuthGenerationId` against the generation of the trie it evaluates against. It calls `_cache.GetTrie(scope.ClusterId)` — the current-generation shortcut — and authorizes against whatever generation the cache happens to hold. `UserAuthorizationState` carries `AuthGenerationId` precisely so a stale session can be detected, and the Phase 6.2 design (`phase-6-2-authorization-runtime.md` adversarial-review item #3 "Redundancy-safe invalidation", plus the §Scope `PermissionTrieCache + freshness` row) requires the hot-path call to look up `CurrentGenerationId` and force a re-evaluation on mismatch. As written, a session bound at generation N silently evaluates against generation N+1 the instant another node publishes — grants added or removed in N+1 take effect for that session without the intended generation-stamp re-check, and the provenance returned in `AuthorizationDecision` misreports which generation produced the verdict.

**Recommendation:** In `Authorize`, after resolving the trie, compare `trie.GenerationId` to `session.AuthGenerationId`. On mismatch either fetch the session's bound generation via `_cache.GetTrie(clusterId, session.AuthGenerationId)` and evaluate against it, or signal the caller to re-resolve the session's auth state before retrying. Add a test for the publish-during-session scenario.

**Resolution:** Resolved 2026-05-22 — `Authorize` now compares `trie.GenerationId` to `session.AuthGenerationId` and, on mismatch, re-fetches the session's bound generation via `_cache.GetTrie(clusterId, session.AuthGenerationId)`, failing closed when that generation has been pruned; added publish-during-session and pruned-generation regression tests.

### Core-003

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrie.cs:80-98` |
| Status | Resolved |

**Description:** `WalkSystemPlatform` records every Galaxy folder-segment grant with `NodeAclScopeKind.Equipment` (see the comment at lines 82-86) because `NodeAclScopeKind` has no `FolderSegment` member. The functional union of permission flags is unaffected, but the `MatchedGrant.Scope` carried in `AuthorizationDecision.Provenance` is wrong for Galaxy nodes: a grant anchored at a namespace-root folder and a grant anchored at a deep sub-folder both report `Equipment`, and a namespace-level grant is indistinguishable from a folder-level grant in the audit trail and the Admin UI "Probe this permission" diagnostic. The Phase 6.2 design (adversarial-review item #6) calls for a dedicated `FolderSegment` scope level. The current code is a known shortcut but references only an untracked "TODO" with no issue ID.

**Recommendation:** Add a `FolderSegment` member to `NodeAclScopeKind` and use it in `WalkSystemPlatform` and `PermissionTrieBuilder` so Galaxy folder grants report their true scope. If the enum change is deferred, file a tracked issue and reference its ID in the code comment.

**Resolution:** Resolved 2026-05-22 — added `FolderSegment` to `NodeAclScopeKind`; `WalkSystemPlatform` now reports `FolderSegment` instead of `Equipment` for each visited Galaxy folder level; added three regression tests asserting the correct scope is reported in `MatchedGrant.Scope`.

### Core-004

| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs:55,72,87` |
| Status | Resolved |

**Description:** `DriverHost` is a library type whose async calls (`driver.InitializeAsync`, `driver.ShutdownAsync`) do not use `ConfigureAwait(false)`, whereas the sibling `CapabilityInvoker` and `AlarmSurfaceInvoker` in the same module consistently do. The server host has no synchronization context so behaviour is currently correct, but the inconsistency is a maintenance hazard and a deviation from the established convention in `Core.Resilience`.

**Recommendation:** Add `.ConfigureAwait(false)` to the three awaited calls in `DriverHost.RegisterAsync`, `UnregisterAsync`, and `DisposeAsync`.

**Resolution:** Resolved 2026-05-23 — added `.ConfigureAwait(false)` to the three awaited driver calls in `RegisterAsync`, `UnregisterAsync`, and `DisposeAsync`; added three `RegisterAsync/UnregisterAsync/DisposeAsync_Does_Not_Capture_SynchronizationContext` regression tests that install a tracking `SynchronizationContext` on a dedicated thread and assert zero captured posts.

### Core-005

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs:59-70` |
| Status | Resolved |

**Description:** `Prune` mutates the `ConcurrentDictionary` with a plain indexer assignment (`_byCluster[clusterId] = new ClusterEntry(...)`) after a separate `TryGetValue` read. If `Install` runs concurrently for the same cluster, the `AddOrUpdate` in `Install` and the indexer write in `Prune` race: `Prune` can read an entry, `Install` then adds a newer generation via `AddOrUpdate`, and `Prune`'s unconditional indexer write then overwrites the entry — silently dropping the just-installed newest generation and its `Current` pointer. The class is documented as a process-singleton accessed on the hot path while publishes install new tries, so the race is reachable.

**Recommendation:** Make `Prune` use an atomic compare-and-swap loop — `_byCluster.TryUpdate(clusterId, prunedEntry, observedEntry)` retried until it succeeds or the key is gone — or perform the prune inside an `AddOrUpdate` update factory.

**Resolution:** Resolved 2026-05-22 — changed `ClusterEntry` from `sealed record` to `sealed class` (enabling reference-equality CAS via `TryUpdate`); `Prune` now uses a read-compute-`TryUpdate` retry loop that restarts if another thread updates the entry between the read and the write; added regression tests asserting the current generation is preserved after a concurrent install + prune sequence.

### Core-006

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64` |
| Status | Resolved |

**Description:** `BuildAddressSpaceAsync` is not guarded against being called more than once. A second call subscribes a second `_alarmForwarder` to `IAlarmSource.OnAlarmEvent` and overwrites the `_alarmForwarder` field, so the first delegate is leaked (still subscribed, never unsubscribed because `Dispose` only removes the field's current value). Every alarm transition would then be delivered to its sink twice. The address-space rebuild path on Galaxy redeploy (`DeployWatcher` → `IRediscoverable.OnRediscoveryNeeded` → server rebuilds the address space) is exactly the scenario where a node manager could legitimately be re-walked. There is also no check of the `_disposed` flag at the top of the method.

**Recommendation:** Either guard `BuildAddressSpaceAsync` so a second call throws `InvalidOperationException` (document it single-shot), or unsubscribe the previous `_alarmForwarder` and clear `_alarmSinks` before re-walking. Also check `_disposed` and throw `ObjectDisposedException` if already disposed.

**Resolution:** Resolved 2026-05-22 — `BuildAddressSpaceAsync` now checks `_disposed` (throws `ObjectDisposedException`) and tears down the previous alarm forwarder + clears the sink registry before re-walking so a Galaxy-redeploy rebuild does not double-subscribe the forwarder; added three regression tests covering double-build, sink-count after rebuild, and post-dispose throw.

### Core-007

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs:75-83` |
| Status | Resolved |

**Description:** `UnsubscribeAsync` always routes through `_defaultHost`, even when an `IPerCallHostResolver` is wired and the original `SubscribeAsync` fanned the subscription out to a non-default host. The `IAlarmSubscriptionHandle` is opaque here and carries no host association, so an unsubscribe for a subscription created against host B runs through host A's resilience pipeline. In a multi-host driver this charges the wrong host's circuit breaker / bulkhead and, if host A is open while host B is healthy, can spuriously block a valid unsubscribe. The XML doc claims it routes "for parity" with `SubscribeAsync` but subscribe is per-host and unsubscribe is not.

**Recommendation:** Carry the resolved host on the `IAlarmSubscriptionHandle` (or in a handle→host map kept by `AlarmSurfaceInvoker`) so `UnsubscribeAsync` routes through the same host's pipeline the subscription was created on.

**Resolution:** Resolved 2026-05-22 — `SubscribeAsync` now wraps each driver handle in a `HostBoundHandle` (private `IAlarmSubscriptionHandle` implementation) that carries the resolved host name; `UnsubscribeAsync` unwraps it and routes through the recorded host's pipeline, falling back to the default host for handles not created by this invoker; added two regression tests verifying per-host routing and single-host fallback.

### Core-008

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64` |
| Status | Resolved |

**Description:** The XML summary of `BuildAddressSpaceAsync` states "Driver exceptions are isolated per decision #12 — the driver's subtree is marked Faulted, but other drivers remain available." The method body contains no such isolation: an exception from `discovery.DiscoverAsync` propagates straight out unhandled, and nothing here marks a subtree Faulted. The isolation is presumably done by the server-layer caller, but the comment asserts behaviour this class does not implement.

**Recommendation:** Either implement the documented isolation in `GenericDriverNodeManager`, or correct the XML doc to state that exception isolation is the caller's responsibility and name the type that performs it.

**Resolution:** Resolved 2026-05-23 — corrected the `BuildAddressSpaceAsync` XML doc to (a) explicitly state exception isolation is the caller's responsibility, and (b) name the type that performs it (`Server.OpcUa.OpcUaApplicationHost.PopulateAddressSpaces`); added `BuildAddressSpaceAsync_Propagates_Discovery_Exceptions_To_Caller` regression test verifying the documented propagation behaviour.

### Core-009

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs:121-128` |
| Status | Resolved |

**Description:** `ExecuteWriteAsync` calls `_optionsAccessor()` three times for a single non-idempotent write (once for the `with` expression, once inside the dictionary initializer for `.Resolve(...)`, plus the discarded base). On the per-write hot path it rebuilds a fresh `DriverResilienceOptions` and a one-entry dictionary on every non-idempotent write, and the redundant accessor calls could observe two different snapshots if an Admin edit lands between them. Phase 6.1 budgets a 1% pipeline overhead; this is unnecessary allocation plus a minor consistency hazard.

**Recommendation:** Capture `var options = _optionsAccessor();` once at the top of the non-idempotent branch and derive both the `with` and the `Resolve` call from that snapshot. Consider caching the no-retry pipeline keyed on `(hostName, non-idempotent)`.

**Resolution:** Resolved 2026-05-23 — `ExecuteWriteAsync` now captures `_optionsAccessor()` into a single `snapshot` local at the top of the non-idempotent branch; the `with` expression and the `Resolve(Write)` call both derive from that snapshot so the two values are guaranteed coherent and only one accessor invocation occurs per call. Added `ExecuteWriteAsync_NonIdempotent_Snapshots_Options_Once_Per_Call` (counts invocations) and `ExecuteWriteAsync_NonIdempotent_Uses_Consistent_Options_Snapshot` (alternating-accessor) regression tests.

### Core-010

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/DriverResilienceOptions.cs:45-52` |
| Status | Resolved |

**Description:** `DriverResilienceOptions.Resolve` indexes the tier-default dictionary directly (`defaults[capability]`) with no fallback. Any future addition to `DriverCapability` that is not also added to all three tier tables in `GetTierDefaults` will make `Resolve` throw `KeyNotFoundException` at runtime on the capability hot path rather than failing at build time. The two are coupled by convention only.

**Recommendation:** Either add a `default` arm to `Resolve` returning a conservative policy (and logging), or add a unit-test invariant asserting every `DriverCapability` value is present in each tier's default table.

**Resolution:** Resolved 2026-05-23 — `Resolve` now uses `TryGetValue` and throws a diagnostic `KeyNotFoundException` whose message names the missing capability + tier and points to `GetTierDefaults` when a capability is missing from both the override map and the tier table; the existing `TierDefaults_Cover_EveryCapability` test invariant prevents this in shipped code, and added `Resolve_Returns_NonNull_Policy_For_Every_Capability` (per-tier exhaustive) + `Resolve_Throws_Diagnostic_When_Capability_Missing_From_Tier_Defaults` regression tests.

### Core-011

| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieBuilder.cs:58-75` |
| Status | Resolved |

**Description:** `PermissionTrieBuilder.Descend` has a two-branch behaviour: with a `scopePaths` lookup it descends the real hierarchy; without one it falls back to placing every non-cluster row directly under the root keyed by `ScopeId` ("works for deterministic tests, not for production"). The fallback silently produces a structurally incorrect trie when `scopePaths` is null or a row's `ScopeId` is missing — a UnsLine-scoped grant ends up as a direct child of the root, so `WalkEquipment` / `WalkSystemPlatform` never reach it and the grant is effectively dropped, with no diagnostic. There is no test asserting the production multi-level descent versus the fallback.

**Recommendation:** Add unit tests covering `Build` with `scopePaths` producing the correct multi-level trie and the missing-`ScopeId` fallback. Have `Descend` surface a diagnostic (or throw outside test configuration) when a sub-cluster row cannot be located in `scopePaths`.

**Resolution:** Resolved 2026-05-23 — added optional `Action<PermissionTrieBuildDiagnostic>? diagnostic` parameter to `PermissionTrieBuilder.Build`; `Descend` now invokes the callback with a `MissingScopePath` diagnostic when a sub-cluster row's `ScopeId` is absent from a supplied (non-null) `scopePaths` lookup so production callers can log + surface orphan grants instead of silently dropping them. New `PermissionTrieBuilderTests` covers (a) production multi-level descent with sibling-line non-leakage, (b) the deterministic-test fallback, (c) the diagnostic firing on a missing scope-path entry, (d) no diagnostic when all rows resolve, and (e) no diagnostic when `scopePaths` is null (explicit test mode).

### Core-012

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Stability/WedgeDetector.cs:26`, `src/Core/ZB.MOM.WW.OtOpcUa.Core/Observability/DriverHealthReport.cs:11-22` |
| Status | Resolved |

**Description:** Two stale doc comments. (1) `WedgeDetector` — the `<summary>` above the constructor reads "Whether the driver reported itself `DriverState.Healthy` at construction." The constructor takes only a `TimeSpan threshold` and the detector is documented as stateless; the comment describes nothing the constructor does. (2) `DriverHealthReport` — the `<remarks>` state matrix lists Unknown, Initializing, Healthy, Degraded, Faulted but `Aggregate` (lines 42-44) also folds `DriverState.Reconnecting` into the Degraded verdict. `Reconnecting` is a real `DriverState` member absent from the documented matrix.

**Recommendation:** Replace the `WedgeDetector` constructor `<summary>` with an accurate description (e.g. "Construct with the wedge-detection threshold; values below 60 s clamp to 60 s"). Add `Reconnecting` to the `DriverHealthReport` `<remarks>` state matrix and state it maps to Degraded.

**Resolution:** Resolved 2026-05-23 — replaced the `WedgeDetector(.ctor)` `<summary>` with an accurate "Construct with the wedge-detection threshold; values below 60 s clamp to 60 s" description plus a `<param>` block; added the `Reconnecting` row to the `DriverHealthReport` `<remarks>` state matrix and updated the verdict-rule prose. Added `WedgeDetectorTests.Doc_Constructor_Summary_Describes_Threshold_Clamp` and `DriverHealthReportTests.Doc_State_Matrix_Includes_Reconnecting` regression tests that parse the generated `.xml` doc to assert the strings, plus `Any_Reconnecting_WithoutFaultedOrNotReady_IsDegraded` confirming the documented Reconnecting → Degraded behaviour.