Files
lmxopcua/code-reviews/Core/findings.md
Joseph Doherty abbf49141c fix(core): resolve High code-review findings (Core-001, Core-002)
Core-001: swap the authorization-cache defaults so
MembershipFreshnessInterval (5 min, inner re-resolve trigger) is
strictly less than AuthCacheMaxStaleness (15 min, fail-closed
ceiling), so NeedsRefresh's warm-refresh path is reachable.

Core-002: TriePermissionEvaluator.Authorize now compares the trie's
GenerationId against the session's AuthGenerationId and re-fetches the
session's bound generation on mismatch, failing closed when that
generation has been pruned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:13:01 -04:00

15 KiB

Code Review — Core

Field Value
Module src/Core/ZB.MOM.WW.OtOpcUa.Core
Reviewer Claude Code
Review date 2026-05-22
Commit reviewed 76d35d1
Status Reviewed
Open findings 10

Checklist coverage

# Category Result
1 Correctness & logic bugs Core-001, Core-002, Core-003
2 OtOpcUa conventions Core-004
3 Concurrency & thread safety Core-005, Core-006
4 Error handling & resilience Core-007, Core-008
5 Security Core-002
6 Performance & resource management Core-009
7 Design-document adherence Core-002, Core-003
8 Code organization & conventions Core-010
9 Testing coverage Core-011
10 Documentation & comments Core-012

Findings

Core-001

Field Value
Severity High
Category Correctness & logic bugs
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/UserAuthorizationState.cs:50-68
Status Resolved

Description: NeedsRefresh can never return true with the default field values. AuthCacheMaxStaleness defaults to 5 minutes and MembershipFreshnessInterval defaults to 15 minutes. NeedsRefresh(utcNow) is defined as !IsStale(utcNow) && elapsed > MembershipFreshnessInterval, i.e. it needs elapsed > 15 min AND elapsed <= 5 min simultaneously — an empty set. The session crosses the staleness ceiling (5 min) and fails closed long before it ever reaches the 15-minute freshness boundary that is supposed to signal "kick off an async re-resolution while still serving cached memberships." Decision #151 / #152 in docs/v2/implementation/phase-6-2-authorization-runtime.md intends the freshness window (15 min, re-resolve) to be the inner trigger and the staleness ceiling to be the outer hard limit; with these defaults the ordering is inverted, so the "refresh while warm" path is dead code and every long-lived session hard-fails authorization after 5 minutes.

Recommendation: Either swap the defaults so MembershipFreshnessInterval (e.g. 5 min) is strictly less than AuthCacheMaxStaleness (e.g. 15 min) — matching the doc's stated intent — or, if the 5/15 values are correct, redefine which window is the refresh trigger and which is the fail-closed ceiling. Add a unit test asserting NeedsRefresh returns true for at least one point in time with the production defaults.

Resolution: Resolved 2026-05-22 — swapped the defaults so MembershipFreshnessInterval is 5 min and AuthCacheMaxStaleness is 15 min (freshness = inner re-resolve trigger, staleness = outer fail-closed ceiling); added a NeedsRefresh_FiresWithin_ProductionDefault_Windows regression test.

Core-002

Field Value
Severity High
Category Security
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs:24-50
Status Resolved

Description: TriePermissionEvaluator.Authorize never compares the session's AuthGenerationId against the generation of the trie it evaluates against. It calls _cache.GetTrie(scope.ClusterId) — the current-generation shortcut — and authorizes against whatever generation the cache happens to hold. UserAuthorizationState carries AuthGenerationId precisely so a stale session can be detected, and the Phase 6.2 design (phase-6-2-authorization-runtime.md adversarial-review item #3 "Redundancy-safe invalidation", plus the §Scope PermissionTrieCache + freshness row) requires the hot-path call to look up CurrentGenerationId and force a re-evaluation on mismatch. As written, a session bound at generation N silently evaluates against generation N+1 the instant another node publishes — grants added or removed in N+1 take effect for that session without the intended generation-stamp re-check, and the provenance returned in AuthorizationDecision misreports which generation produced the verdict.

Recommendation: In Authorize, after resolving the trie, compare trie.GenerationId to session.AuthGenerationId. On mismatch either fetch the session's bound generation via _cache.GetTrie(clusterId, session.AuthGenerationId) and evaluate against it, or signal the caller to re-resolve the session's auth state before retrying. Add a test for the publish-during-session scenario.

Resolution: Resolved 2026-05-22 — Authorize now compares trie.GenerationId to session.AuthGenerationId and, on mismatch, re-fetches the session's bound generation via _cache.GetTrie(clusterId, session.AuthGenerationId), failing closed when that generation has been pruned; added publish-during-session and pruned-generation regression tests.

Core-003

Field Value
Severity Medium
Category Correctness & logic bugs
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrie.cs:80-98
Status Open

Description: WalkSystemPlatform records every Galaxy folder-segment grant with NodeAclScopeKind.Equipment (see the comment at lines 82-86) because NodeAclScopeKind has no FolderSegment member. The functional union of permission flags is unaffected, but the MatchedGrant.Scope carried in AuthorizationDecision.Provenance is wrong for Galaxy nodes: a grant anchored at a namespace-root folder and a grant anchored at a deep sub-folder both report Equipment, and a namespace-level grant is indistinguishable from a folder-level grant in the audit trail and the Admin UI "Probe this permission" diagnostic. The Phase 6.2 design (adversarial-review item #6) calls for a dedicated FolderSegment scope level. The current code is a known shortcut but references only an untracked "TODO" with no issue ID.

Recommendation: Add a FolderSegment member to NodeAclScopeKind and use it in WalkSystemPlatform and PermissionTrieBuilder so Galaxy folder grants report their true scope. If the enum change is deferred, file a tracked issue and reference its ID in the code comment.

Resolution: (open)

Core-004

Field Value
Severity Low
Category OtOpcUa conventions
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs:55,72,87
Status Open

Description: DriverHost is a library type whose async calls (driver.InitializeAsync, driver.ShutdownAsync) do not use ConfigureAwait(false), whereas the sibling CapabilityInvoker and AlarmSurfaceInvoker in the same module consistently do. The server host has no synchronization context so behaviour is currently correct, but the inconsistency is a maintenance hazard and a deviation from the established convention in Core.Resilience.

Recommendation: Add .ConfigureAwait(false) to the three awaited calls in DriverHost.RegisterAsync, UnregisterAsync, and DisposeAsync.

Resolution: (open)

Core-005

Field Value
Severity Medium
Category Concurrency & thread safety
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs:59-70
Status Open

Description: Prune mutates the ConcurrentDictionary with a plain indexer assignment (_byCluster[clusterId] = new ClusterEntry(...)) after a separate TryGetValue read. If Install runs concurrently for the same cluster, the AddOrUpdate in Install and the indexer write in Prune race: Prune can read an entry, Install then adds a newer generation via AddOrUpdate, and Prune's unconditional indexer write then overwrites the entry — silently dropping the just-installed newest generation and its Current pointer. The class is documented as a process-singleton accessed on the hot path while publishes install new tries, so the race is reachable.

Recommendation: Make Prune use an atomic compare-and-swap loop — _byCluster.TryUpdate(clusterId, prunedEntry, observedEntry) retried until it succeeds or the key is gone — or perform the prune inside an AddOrUpdate update factory.

Resolution: (open)

Core-006

Field Value
Severity Medium
Category Concurrency & thread safety
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64
Status Open

Description: BuildAddressSpaceAsync is not guarded against being called more than once. A second call subscribes a second _alarmForwarder to IAlarmSource.OnAlarmEvent and overwrites the _alarmForwarder field, so the first delegate is leaked (still subscribed, never unsubscribed because Dispose only removes the field's current value). Every alarm transition would then be delivered to its sink twice. The address-space rebuild path on Galaxy redeploy (DeployWatcherIRediscoverable.OnRediscoveryNeeded → server rebuilds the address space) is exactly the scenario where a node manager could legitimately be re-walked. There is also no check of the _disposed flag at the top of the method.

Recommendation: Either guard BuildAddressSpaceAsync so a second call throws InvalidOperationException (document it single-shot), or unsubscribe the previous _alarmForwarder and clear _alarmSinks before re-walking. Also check _disposed and throw ObjectDisposedException if already disposed.

Resolution: (open)

Core-007

Field Value
Severity Medium
Category Error handling & resilience
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs:75-83
Status Open

Description: UnsubscribeAsync always routes through _defaultHost, even when an IPerCallHostResolver is wired and the original SubscribeAsync fanned the subscription out to a non-default host. The IAlarmSubscriptionHandle is opaque here and carries no host association, so an unsubscribe for a subscription created against host B runs through host A's resilience pipeline. In a multi-host driver this charges the wrong host's circuit breaker / bulkhead and, if host A is open while host B is healthy, can spuriously block a valid unsubscribe. The XML doc claims it routes "for parity" with SubscribeAsync but subscribe is per-host and unsubscribe is not.

Recommendation: Carry the resolved host on the IAlarmSubscriptionHandle (or in a handle→host map kept by AlarmSurfaceInvoker) so UnsubscribeAsync routes through the same host's pipeline the subscription was created on.

Resolution: (open)

Core-008

Field Value
Severity Low
Category Error handling & resilience
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64
Status Open

Description: The XML summary of BuildAddressSpaceAsync states "Driver exceptions are isolated per decision #12 — the driver's subtree is marked Faulted, but other drivers remain available." The method body contains no such isolation: an exception from discovery.DiscoverAsync propagates straight out unhandled, and nothing here marks a subtree Faulted. The isolation is presumably done by the server-layer caller, but the comment asserts behaviour this class does not implement.

Recommendation: Either implement the documented isolation in GenericDriverNodeManager, or correct the XML doc to state that exception isolation is the caller's responsibility and name the type that performs it.

Resolution: (open)

Core-009

Field Value
Severity Low
Category Performance & resource management
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs:121-128
Status Open

Description: ExecuteWriteAsync calls _optionsAccessor() three times for a single non-idempotent write (once for the with expression, once inside the dictionary initializer for .Resolve(...), plus the discarded base). On the per-write hot path it rebuilds a fresh DriverResilienceOptions and a one-entry dictionary on every non-idempotent write, and the redundant accessor calls could observe two different snapshots if an Admin edit lands between them. Phase 6.1 budgets a 1% pipeline overhead; this is unnecessary allocation plus a minor consistency hazard.

Recommendation: Capture var options = _optionsAccessor(); once at the top of the non-idempotent branch and derive both the with and the Resolve call from that snapshot. Consider caching the no-retry pipeline keyed on (hostName, non-idempotent).

Resolution: (open)

Core-010

Field Value
Severity Low
Category Code organization & conventions
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/DriverResilienceOptions.cs:45-52
Status Open

Description: DriverResilienceOptions.Resolve indexes the tier-default dictionary directly (defaults[capability]) with no fallback. Any future addition to DriverCapability that is not also added to all three tier tables in GetTierDefaults will make Resolve throw KeyNotFoundException at runtime on the capability hot path rather than failing at build time. The two are coupled by convention only.

Recommendation: Either add a default arm to Resolve returning a conservative policy (and logging), or add a unit-test invariant asserting every DriverCapability value is present in each tier's default table.

Resolution: (open)

Core-011

Field Value
Severity Low
Category Testing coverage
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieBuilder.cs:58-75
Status Open

Description: PermissionTrieBuilder.Descend has a two-branch behaviour: with a scopePaths lookup it descends the real hierarchy; without one it falls back to placing every non-cluster row directly under the root keyed by ScopeId ("works for deterministic tests, not for production"). The fallback silently produces a structurally incorrect trie when scopePaths is null or a row's ScopeId is missing — a UnsLine-scoped grant ends up as a direct child of the root, so WalkEquipment / WalkSystemPlatform never reach it and the grant is effectively dropped, with no diagnostic. There is no test asserting the production multi-level descent versus the fallback.

Recommendation: Add unit tests covering Build with scopePaths producing the correct multi-level trie and the missing-ScopeId fallback. Have Descend surface a diagnostic (or throw outside test configuration) when a sub-cluster row cannot be located in scopePaths.

Resolution: (open)

Core-012

Field Value
Severity Low
Category Documentation & comments
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Stability/WedgeDetector.cs:26, src/Core/ZB.MOM.WW.OtOpcUa.Core/Observability/DriverHealthReport.cs:11-22
Status Open

Description: Two stale doc comments. (1) WedgeDetector — the <summary> above the constructor reads "Whether the driver reported itself DriverState.Healthy at construction." The constructor takes only a TimeSpan threshold and the detector is documented as stateless; the comment describes nothing the constructor does. (2) DriverHealthReport — the <remarks> state matrix lists Unknown, Initializing, Healthy, Degraded, Faulted but Aggregate (lines 42-44) also folds DriverState.Reconnecting into the Degraded verdict. Reconnecting is a real DriverState member absent from the documented matrix.

Recommendation: Replace the WedgeDetector constructor <summary> with an accurate description (e.g. "Construct with the wedge-detection threshold; values below 60 s clamp to 60 s"). Add Reconnecting to the DriverHealthReport <remarks> state matrix and state it maps to Degraded.

Resolution: (open)