Files
lmxopcua/code-reviews/Core/findings.md
Joseph Doherty 8be6afbda4 fix(core): resolve Low code-review findings (Core-004,008,009,010,011,012)
- Core-004: add ConfigureAwait(false) to DriverHost.RegisterAsync /
  UnregisterAsync / DisposeAsync.
- Core-008: rewrite the BuildAddressSpaceAsync XML doc to correctly name
  the caller (OpcUaApplicationHost.PopulateAddressSpaces) that owns the
  per-driver isolation.
- Core-009: snapshot DriverResilienceOptions once per non-idempotent write
  in CapabilityInvoker.ExecuteWriteAsync.
- Core-010: switch DriverResilienceOptions.Resolve to TryGetValue with a
  diagnostic error message when a tier table is missing a capability.
- Core-011: add an optional diagnostic callback to PermissionTrieBuilder
  so production callers can surface scope-path mismatches.
- Core-012: correct the stale WedgeDetector ctor summary and add the
  Reconnecting row to DriverHealthReport's state matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 05:38:09 -04:00

20 KiB

Code Review — Core

Field Value
Module src/Core/ZB.MOM.WW.OtOpcUa.Core
Reviewer Claude Code
Review date 2026-05-22
Commit reviewed 76d35d1
Status Reviewed
Open findings 0

Checklist coverage

# Category Result
1 Correctness & logic bugs Core-001, Core-002, Core-003
2 OtOpcUa conventions Core-004
3 Concurrency & thread safety Core-005, Core-006
4 Error handling & resilience Core-007, Core-008
5 Security Core-002
6 Performance & resource management Core-009
7 Design-document adherence Core-002, Core-003
8 Code organization & conventions Core-010
9 Testing coverage Core-011
10 Documentation & comments Core-012

Findings

Core-001

Field Value
Severity High
Category Correctness & logic bugs
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/UserAuthorizationState.cs:50-68
Status Resolved

Description: NeedsRefresh can never return true with the default field values. AuthCacheMaxStaleness defaults to 5 minutes and MembershipFreshnessInterval defaults to 15 minutes. NeedsRefresh(utcNow) is defined as !IsStale(utcNow) && elapsed > MembershipFreshnessInterval, i.e. it needs elapsed > 15 min AND elapsed <= 5 min simultaneously — an empty set. The session crosses the staleness ceiling (5 min) and fails closed long before it ever reaches the 15-minute freshness boundary that is supposed to signal "kick off an async re-resolution while still serving cached memberships." Decision #151 / #152 in docs/v2/implementation/phase-6-2-authorization-runtime.md intends the freshness window (15 min, re-resolve) to be the inner trigger and the staleness ceiling to be the outer hard limit; with these defaults the ordering is inverted, so the "refresh while warm" path is dead code and every long-lived session hard-fails authorization after 5 minutes.

Recommendation: Either swap the defaults so MembershipFreshnessInterval (e.g. 5 min) is strictly less than AuthCacheMaxStaleness (e.g. 15 min) — matching the doc's stated intent — or, if the 5/15 values are correct, redefine which window is the refresh trigger and which is the fail-closed ceiling. Add a unit test asserting NeedsRefresh returns true for at least one point in time with the production defaults.

Resolution: Resolved 2026-05-22 — swapped the defaults so MembershipFreshnessInterval is 5 min and AuthCacheMaxStaleness is 15 min (freshness = inner re-resolve trigger, staleness = outer fail-closed ceiling); added a NeedsRefresh_FiresWithin_ProductionDefault_Windows regression test.

Core-002

Field Value
Severity High
Category Security
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs:24-50
Status Resolved

Description: TriePermissionEvaluator.Authorize never compares the session's AuthGenerationId against the generation of the trie it evaluates against. It calls _cache.GetTrie(scope.ClusterId) — the current-generation shortcut — and authorizes against whatever generation the cache happens to hold. UserAuthorizationState carries AuthGenerationId precisely so a stale session can be detected, and the Phase 6.2 design (phase-6-2-authorization-runtime.md adversarial-review item #3 "Redundancy-safe invalidation", plus the §Scope PermissionTrieCache + freshness row) requires the hot-path call to look up CurrentGenerationId and force a re-evaluation on mismatch. As written, a session bound at generation N silently evaluates against generation N+1 the instant another node publishes — grants added or removed in N+1 take effect for that session without the intended generation-stamp re-check, and the provenance returned in AuthorizationDecision misreports which generation produced the verdict.

Recommendation: In Authorize, after resolving the trie, compare trie.GenerationId to session.AuthGenerationId. On mismatch either fetch the session's bound generation via _cache.GetTrie(clusterId, session.AuthGenerationId) and evaluate against it, or signal the caller to re-resolve the session's auth state before retrying. Add a test for the publish-during-session scenario.

Resolution: Resolved 2026-05-22 — Authorize now compares trie.GenerationId to session.AuthGenerationId and, on mismatch, re-fetches the session's bound generation via _cache.GetTrie(clusterId, session.AuthGenerationId), failing closed when that generation has been pruned; added publish-during-session and pruned-generation regression tests.

Core-003

Field Value
Severity Medium
Category Correctness & logic bugs
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrie.cs:80-98
Status Resolved

Description: WalkSystemPlatform records every Galaxy folder-segment grant with NodeAclScopeKind.Equipment (see the comment at lines 82-86) because NodeAclScopeKind has no FolderSegment member. The functional union of permission flags is unaffected, but the MatchedGrant.Scope carried in AuthorizationDecision.Provenance is wrong for Galaxy nodes: a grant anchored at a namespace-root folder and a grant anchored at a deep sub-folder both report Equipment, and a namespace-level grant is indistinguishable from a folder-level grant in the audit trail and the Admin UI "Probe this permission" diagnostic. The Phase 6.2 design (adversarial-review item #6) calls for a dedicated FolderSegment scope level. The current code is a known shortcut but references only an untracked "TODO" with no issue ID.

Recommendation: Add a FolderSegment member to NodeAclScopeKind and use it in WalkSystemPlatform and PermissionTrieBuilder so Galaxy folder grants report their true scope. If the enum change is deferred, file a tracked issue and reference its ID in the code comment.

Resolution: Resolved 2026-05-22 — added FolderSegment to NodeAclScopeKind; WalkSystemPlatform now reports FolderSegment instead of Equipment for each visited Galaxy folder level; added three regression tests asserting the correct scope is reported in MatchedGrant.Scope.

Core-004

Field Value
Severity Low
Category OtOpcUa conventions
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs:55,72,87
Status Resolved

Description: DriverHost is a library type whose async calls (driver.InitializeAsync, driver.ShutdownAsync) do not use ConfigureAwait(false), whereas the sibling CapabilityInvoker and AlarmSurfaceInvoker in the same module consistently do. The server host has no synchronization context so behaviour is currently correct, but the inconsistency is a maintenance hazard and a deviation from the established convention in Core.Resilience.

Recommendation: Add .ConfigureAwait(false) to the three awaited calls in DriverHost.RegisterAsync, UnregisterAsync, and DisposeAsync.

Resolution: Resolved 2026-05-23 — added .ConfigureAwait(false) to the three awaited driver calls in RegisterAsync, UnregisterAsync, and DisposeAsync; added three RegisterAsync/UnregisterAsync/DisposeAsync_Does_Not_Capture_SynchronizationContext regression tests that install a tracking SynchronizationContext on a dedicated thread and assert zero captured posts.

Core-005

Field Value
Severity Medium
Category Concurrency & thread safety
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs:59-70
Status Resolved

Description: Prune mutates the ConcurrentDictionary with a plain indexer assignment (_byCluster[clusterId] = new ClusterEntry(...)) after a separate TryGetValue read. If Install runs concurrently for the same cluster, the AddOrUpdate in Install and the indexer write in Prune race: Prune can read an entry, Install then adds a newer generation via AddOrUpdate, and Prune's unconditional indexer write then overwrites the entry — silently dropping the just-installed newest generation and its Current pointer. The class is documented as a process-singleton accessed on the hot path while publishes install new tries, so the race is reachable.

Recommendation: Make Prune use an atomic compare-and-swap loop — _byCluster.TryUpdate(clusterId, prunedEntry, observedEntry) retried until it succeeds or the key is gone — or perform the prune inside an AddOrUpdate update factory.

Resolution: Resolved 2026-05-22 — changed ClusterEntry from sealed record to sealed class (enabling reference-equality CAS via TryUpdate); Prune now uses a read-compute-TryUpdate retry loop that restarts if another thread updates the entry between the read and the write; added regression tests asserting the current generation is preserved after a concurrent install + prune sequence.

Core-006

Field Value
Severity Medium
Category Concurrency & thread safety
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64
Status Resolved

Description: BuildAddressSpaceAsync is not guarded against being called more than once. A second call subscribes a second _alarmForwarder to IAlarmSource.OnAlarmEvent and overwrites the _alarmForwarder field, so the first delegate is leaked (still subscribed, never unsubscribed because Dispose only removes the field's current value). Every alarm transition would then be delivered to its sink twice. The address-space rebuild path on Galaxy redeploy (DeployWatcherIRediscoverable.OnRediscoveryNeeded → server rebuilds the address space) is exactly the scenario where a node manager could legitimately be re-walked. There is also no check of the _disposed flag at the top of the method.

Recommendation: Either guard BuildAddressSpaceAsync so a second call throws InvalidOperationException (document it single-shot), or unsubscribe the previous _alarmForwarder and clear _alarmSinks before re-walking. Also check _disposed and throw ObjectDisposedException if already disposed.

Resolution: Resolved 2026-05-22 — BuildAddressSpaceAsync now checks _disposed (throws ObjectDisposedException) and tears down the previous alarm forwarder + clears the sink registry before re-walking so a Galaxy-redeploy rebuild does not double-subscribe the forwarder; added three regression tests covering double-build, sink-count after rebuild, and post-dispose throw.

Core-007

Field Value
Severity Medium
Category Error handling & resilience
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs:75-83
Status Resolved

Description: UnsubscribeAsync always routes through _defaultHost, even when an IPerCallHostResolver is wired and the original SubscribeAsync fanned the subscription out to a non-default host. The IAlarmSubscriptionHandle is opaque here and carries no host association, so an unsubscribe for a subscription created against host B runs through host A's resilience pipeline. In a multi-host driver this charges the wrong host's circuit breaker / bulkhead and, if host A is open while host B is healthy, can spuriously block a valid unsubscribe. The XML doc claims it routes "for parity" with SubscribeAsync but subscribe is per-host and unsubscribe is not.

Recommendation: Carry the resolved host on the IAlarmSubscriptionHandle (or in a handle→host map kept by AlarmSurfaceInvoker) so UnsubscribeAsync routes through the same host's pipeline the subscription was created on.

Resolution: Resolved 2026-05-22 — SubscribeAsync now wraps each driver handle in a HostBoundHandle (private IAlarmSubscriptionHandle implementation) that carries the resolved host name; UnsubscribeAsync unwraps it and routes through the recorded host's pipeline, falling back to the default host for handles not created by this invoker; added two regression tests verifying per-host routing and single-host fallback.

Core-008

Field Value
Severity Low
Category Error handling & resilience
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64
Status Resolved

Description: The XML summary of BuildAddressSpaceAsync states "Driver exceptions are isolated per decision #12 — the driver's subtree is marked Faulted, but other drivers remain available." The method body contains no such isolation: an exception from discovery.DiscoverAsync propagates straight out unhandled, and nothing here marks a subtree Faulted. The isolation is presumably done by the server-layer caller, but the comment asserts behaviour this class does not implement.

Recommendation: Either implement the documented isolation in GenericDriverNodeManager, or correct the XML doc to state that exception isolation is the caller's responsibility and name the type that performs it.

Resolution: Resolved 2026-05-23 — corrected the BuildAddressSpaceAsync XML doc to (a) explicitly state exception isolation is the caller's responsibility, and (b) name the type that performs it (Server.OpcUa.OpcUaApplicationHost.PopulateAddressSpaces); added BuildAddressSpaceAsync_Propagates_Discovery_Exceptions_To_Caller regression test verifying the documented propagation behaviour.

Core-009

Field Value
Severity Low
Category Performance & resource management
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs:121-128
Status Resolved

Description: ExecuteWriteAsync calls _optionsAccessor() three times for a single non-idempotent write (once for the with expression, once inside the dictionary initializer for .Resolve(...), plus the discarded base). On the per-write hot path it rebuilds a fresh DriverResilienceOptions and a one-entry dictionary on every non-idempotent write, and the redundant accessor calls could observe two different snapshots if an Admin edit lands between them. Phase 6.1 budgets a 1% pipeline overhead; this is unnecessary allocation plus a minor consistency hazard.

Recommendation: Capture var options = _optionsAccessor(); once at the top of the non-idempotent branch and derive both the with and the Resolve call from that snapshot. Consider caching the no-retry pipeline keyed on (hostName, non-idempotent).

Resolution: Resolved 2026-05-23 — ExecuteWriteAsync now captures _optionsAccessor() into a single snapshot local at the top of the non-idempotent branch; the with expression and the Resolve(Write) call both derive from that snapshot so the two values are guaranteed coherent and only one accessor invocation occurs per call. Added ExecuteWriteAsync_NonIdempotent_Snapshots_Options_Once_Per_Call (counts invocations) and ExecuteWriteAsync_NonIdempotent_Uses_Consistent_Options_Snapshot (alternating-accessor) regression tests.

Core-010

Field Value
Severity Low
Category Code organization & conventions
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/DriverResilienceOptions.cs:45-52
Status Resolved

Description: DriverResilienceOptions.Resolve indexes the tier-default dictionary directly (defaults[capability]) with no fallback. Any future addition to DriverCapability that is not also added to all three tier tables in GetTierDefaults will make Resolve throw KeyNotFoundException at runtime on the capability hot path rather than failing at build time. The two are coupled by convention only.

Recommendation: Either add a default arm to Resolve returning a conservative policy (and logging), or add a unit-test invariant asserting every DriverCapability value is present in each tier's default table.

Resolution: Resolved 2026-05-23 — Resolve now uses TryGetValue and throws a diagnostic KeyNotFoundException whose message names the missing capability + tier and points to GetTierDefaults when a capability is missing from both the override map and the tier table; the existing TierDefaults_Cover_EveryCapability test invariant prevents this in shipped code, and added Resolve_Returns_NonNull_Policy_For_Every_Capability (per-tier exhaustive) + Resolve_Throws_Diagnostic_When_Capability_Missing_From_Tier_Defaults regression tests.

Core-011

Field Value
Severity Low
Category Testing coverage
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieBuilder.cs:58-75
Status Resolved

Description: PermissionTrieBuilder.Descend has a two-branch behaviour: with a scopePaths lookup it descends the real hierarchy; without one it falls back to placing every non-cluster row directly under the root keyed by ScopeId ("works for deterministic tests, not for production"). The fallback silently produces a structurally incorrect trie when scopePaths is null or a row's ScopeId is missing — a UnsLine-scoped grant ends up as a direct child of the root, so WalkEquipment / WalkSystemPlatform never reach it and the grant is effectively dropped, with no diagnostic. There is no test asserting the production multi-level descent versus the fallback.

Recommendation: Add unit tests covering Build with scopePaths producing the correct multi-level trie and the missing-ScopeId fallback. Have Descend surface a diagnostic (or throw outside test configuration) when a sub-cluster row cannot be located in scopePaths.

Resolution: Resolved 2026-05-23 — added optional Action<PermissionTrieBuildDiagnostic>? diagnostic parameter to PermissionTrieBuilder.Build; Descend now invokes the callback with a MissingScopePath diagnostic when a sub-cluster row's ScopeId is absent from a supplied (non-null) scopePaths lookup so production callers can log + surface orphan grants instead of silently dropping them. New PermissionTrieBuilderTests covers (a) production multi-level descent with sibling-line non-leakage, (b) the deterministic-test fallback, (c) the diagnostic firing on a missing scope-path entry, (d) no diagnostic when all rows resolve, and (e) no diagnostic when scopePaths is null (explicit test mode).

Core-012

Field Value
Severity Low
Category Documentation & comments
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Stability/WedgeDetector.cs:26, src/Core/ZB.MOM.WW.OtOpcUa.Core/Observability/DriverHealthReport.cs:11-22
Status Resolved

Description: Two stale doc comments. (1) WedgeDetector — the <summary> above the constructor reads "Whether the driver reported itself DriverState.Healthy at construction." The constructor takes only a TimeSpan threshold and the detector is documented as stateless; the comment describes nothing the constructor does. (2) DriverHealthReport — the <remarks> state matrix lists Unknown, Initializing, Healthy, Degraded, Faulted but Aggregate (lines 42-44) also folds DriverState.Reconnecting into the Degraded verdict. Reconnecting is a real DriverState member absent from the documented matrix.

Recommendation: Replace the WedgeDetector constructor <summary> with an accurate description (e.g. "Construct with the wedge-detection threshold; values below 60 s clamp to 60 s"). Add Reconnecting to the DriverHealthReport <remarks> state matrix and state it maps to Degraded.

Resolution: Resolved 2026-05-23 — replaced the WedgeDetector(.ctor) <summary> with an accurate "Construct with the wedge-detection threshold; values below 60 s clamp to 60 s" description plus a <param> block; added the Reconnecting row to the DriverHealthReport <remarks> state matrix and updated the verdict-rule prose. Added WedgeDetectorTests.Doc_Constructor_Summary_Describes_Threshold_Clamp and DriverHealthReportTests.Doc_State_Matrix_Includes_Reconnecting regression tests that parse the generated .xml doc to assert the strings, plus Any_Reconnecting_WithoutFaultedOrNotReady_IsDegraded confirming the documented Reconnecting → Degraded behaviour.