Core-001: swap the authorization-cache defaults so MembershipFreshnessInterval (5 min, inner re-resolve trigger) is strictly less than AuthCacheMaxStaleness (15 min, fail-closed ceiling), so NeedsRefresh's warm-refresh path is reachable. Core-002: TriePermissionEvaluator.Authorize now compares the trie's GenerationId against the session's AuthGenerationId and re-fetches the session's bound generation on mismatch, failing closed when that generation has been pruned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
15 KiB
Code Review — Core
| Field | Value |
|---|---|
| Module | src/Core/ZB.MOM.WW.OtOpcUa.Core |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | 76d35d1 |
| Status | Reviewed |
| Open findings | 10 |
Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core-001, Core-002, Core-003 |
| 2 | OtOpcUa conventions | Core-004 |
| 3 | Concurrency & thread safety | Core-005, Core-006 |
| 4 | Error handling & resilience | Core-007, Core-008 |
| 5 | Security | Core-002 |
| 6 | Performance & resource management | Core-009 |
| 7 | Design-document adherence | Core-002, Core-003 |
| 8 | Code organization & conventions | Core-010 |
| 9 | Testing coverage | Core-011 |
| 10 | Documentation & comments | Core-012 |
Findings
Core-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/UserAuthorizationState.cs:50-68 |
| Status | Resolved |
Description: NeedsRefresh can never return true with the default field values. AuthCacheMaxStaleness defaults to 5 minutes and MembershipFreshnessInterval defaults to 15 minutes. NeedsRefresh(utcNow) is defined as !IsStale(utcNow) && elapsed > MembershipFreshnessInterval, i.e. it needs elapsed > 15 min AND elapsed <= 5 min simultaneously — an empty set. The session crosses the staleness ceiling (5 min) and fails closed long before it ever reaches the 15-minute freshness boundary that is supposed to signal "kick off an async re-resolution while still serving cached memberships." Decision #151 / #152 in docs/v2/implementation/phase-6-2-authorization-runtime.md intends the freshness window (15 min, re-resolve) to be the inner trigger and the staleness ceiling to be the outer hard limit; with these defaults the ordering is inverted, so the "refresh while warm" path is dead code and every long-lived session hard-fails authorization after 5 minutes.
Recommendation: Either swap the defaults so MembershipFreshnessInterval (e.g. 5 min) is strictly less than AuthCacheMaxStaleness (e.g. 15 min) — matching the doc's stated intent — or, if the 5/15 values are correct, redefine which window is the refresh trigger and which is the fail-closed ceiling. Add a unit test asserting NeedsRefresh returns true for at least one point in time with the production defaults.
Resolution: Resolved 2026-05-22 — swapped the defaults so MembershipFreshnessInterval is 5 min and AuthCacheMaxStaleness is 15 min (freshness = inner re-resolve trigger, staleness = outer fail-closed ceiling); added a NeedsRefresh_FiresWithin_ProductionDefault_Windows regression test.
Core-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs:24-50 |
| Status | Resolved |
Description: TriePermissionEvaluator.Authorize never compares the session's AuthGenerationId against the generation of the trie it evaluates against. It calls _cache.GetTrie(scope.ClusterId) — the current-generation shortcut — and authorizes against whatever generation the cache happens to hold. UserAuthorizationState carries AuthGenerationId precisely so a stale session can be detected, and the Phase 6.2 design (phase-6-2-authorization-runtime.md adversarial-review item #3 "Redundancy-safe invalidation", plus the §Scope PermissionTrieCache + freshness row) requires the hot-path call to look up CurrentGenerationId and force a re-evaluation on mismatch. As written, a session bound at generation N silently evaluates against generation N+1 the instant another node publishes — grants added or removed in N+1 take effect for that session without the intended generation-stamp re-check, and the provenance returned in AuthorizationDecision misreports which generation produced the verdict.
Recommendation: In Authorize, after resolving the trie, compare trie.GenerationId to session.AuthGenerationId. On mismatch either fetch the session's bound generation via _cache.GetTrie(clusterId, session.AuthGenerationId) and evaluate against it, or signal the caller to re-resolve the session's auth state before retrying. Add a test for the publish-during-session scenario.
Resolution: Resolved 2026-05-22 — Authorize now compares trie.GenerationId to session.AuthGenerationId and, on mismatch, re-fetches the session's bound generation via _cache.GetTrie(clusterId, session.AuthGenerationId), failing closed when that generation has been pruned; added publish-during-session and pruned-generation regression tests.
Core-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrie.cs:80-98 |
| Status | Open |
Description: WalkSystemPlatform records every Galaxy folder-segment grant with NodeAclScopeKind.Equipment (see the comment at lines 82-86) because NodeAclScopeKind has no FolderSegment member. The functional union of permission flags is unaffected, but the MatchedGrant.Scope carried in AuthorizationDecision.Provenance is wrong for Galaxy nodes: a grant anchored at a namespace-root folder and a grant anchored at a deep sub-folder both report Equipment, and a namespace-level grant is indistinguishable from a folder-level grant in the audit trail and the Admin UI "Probe this permission" diagnostic. The Phase 6.2 design (adversarial-review item #6) calls for a dedicated FolderSegment scope level. The current code is a known shortcut but references only an untracked "TODO" with no issue ID.
Recommendation: Add a FolderSegment member to NodeAclScopeKind and use it in WalkSystemPlatform and PermissionTrieBuilder so Galaxy folder grants report their true scope. If the enum change is deferred, file a tracked issue and reference its ID in the code comment.
Resolution: (open)
Core-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs:55,72,87 |
| Status | Open |
Description: DriverHost is a library type whose async calls (driver.InitializeAsync, driver.ShutdownAsync) do not use ConfigureAwait(false), whereas the sibling CapabilityInvoker and AlarmSurfaceInvoker in the same module consistently do. The server host has no synchronization context so behaviour is currently correct, but the inconsistency is a maintenance hazard and a deviation from the established convention in Core.Resilience.
Recommendation: Add .ConfigureAwait(false) to the three awaited calls in DriverHost.RegisterAsync, UnregisterAsync, and DisposeAsync.
Resolution: (open)
Core-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs:59-70 |
| Status | Open |
Description: Prune mutates the ConcurrentDictionary with a plain indexer assignment (_byCluster[clusterId] = new ClusterEntry(...)) after a separate TryGetValue read. If Install runs concurrently for the same cluster, the AddOrUpdate in Install and the indexer write in Prune race: Prune can read an entry, Install then adds a newer generation via AddOrUpdate, and Prune's unconditional indexer write then overwrites the entry — silently dropping the just-installed newest generation and its Current pointer. The class is documented as a process-singleton accessed on the hot path while publishes install new tries, so the race is reachable.
Recommendation: Make Prune use an atomic compare-and-swap loop — _byCluster.TryUpdate(clusterId, prunedEntry, observedEntry) retried until it succeeds or the key is gone — or perform the prune inside an AddOrUpdate update factory.
Resolution: (open)
Core-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64 |
| Status | Open |
Description: BuildAddressSpaceAsync is not guarded against being called more than once. A second call subscribes a second _alarmForwarder to IAlarmSource.OnAlarmEvent and overwrites the _alarmForwarder field, so the first delegate is leaked (still subscribed, never unsubscribed because Dispose only removes the field's current value). Every alarm transition would then be delivered to its sink twice. The address-space rebuild path on Galaxy redeploy (DeployWatcher → IRediscoverable.OnRediscoveryNeeded → server rebuilds the address space) is exactly the scenario where a node manager could legitimately be re-walked. There is also no check of the _disposed flag at the top of the method.
Recommendation: Either guard BuildAddressSpaceAsync so a second call throws InvalidOperationException (document it single-shot), or unsubscribe the previous _alarmForwarder and clear _alarmSinks before re-walking. Also check _disposed and throw ObjectDisposedException if already disposed.
Resolution: (open)
Core-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs:75-83 |
| Status | Open |
Description: UnsubscribeAsync always routes through _defaultHost, even when an IPerCallHostResolver is wired and the original SubscribeAsync fanned the subscription out to a non-default host. The IAlarmSubscriptionHandle is opaque here and carries no host association, so an unsubscribe for a subscription created against host B runs through host A's resilience pipeline. In a multi-host driver this charges the wrong host's circuit breaker / bulkhead and, if host A is open while host B is healthy, can spuriously block a valid unsubscribe. The XML doc claims it routes "for parity" with SubscribeAsync but subscribe is per-host and unsubscribe is not.
Recommendation: Carry the resolved host on the IAlarmSubscriptionHandle (or in a handle→host map kept by AlarmSurfaceInvoker) so UnsubscribeAsync routes through the same host's pipeline the subscription was created on.
Resolution: (open)
Core-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64 |
| Status | Open |
Description: The XML summary of BuildAddressSpaceAsync states "Driver exceptions are isolated per decision #12 — the driver's subtree is marked Faulted, but other drivers remain available." The method body contains no such isolation: an exception from discovery.DiscoverAsync propagates straight out unhandled, and nothing here marks a subtree Faulted. The isolation is presumably done by the server-layer caller, but the comment asserts behaviour this class does not implement.
Recommendation: Either implement the documented isolation in GenericDriverNodeManager, or correct the XML doc to state that exception isolation is the caller's responsibility and name the type that performs it.
Resolution: (open)
Core-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs:121-128 |
| Status | Open |
Description: ExecuteWriteAsync calls _optionsAccessor() three times for a single non-idempotent write (once for the with expression, once inside the dictionary initializer for .Resolve(...), plus the discarded base). On the per-write hot path it rebuilds a fresh DriverResilienceOptions and a one-entry dictionary on every non-idempotent write, and the redundant accessor calls could observe two different snapshots if an Admin edit lands between them. Phase 6.1 budgets a 1% pipeline overhead; this is unnecessary allocation plus a minor consistency hazard.
Recommendation: Capture var options = _optionsAccessor(); once at the top of the non-idempotent branch and derive both the with and the Resolve call from that snapshot. Consider caching the no-retry pipeline keyed on (hostName, non-idempotent).
Resolution: (open)
Core-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/DriverResilienceOptions.cs:45-52 |
| Status | Open |
Description: DriverResilienceOptions.Resolve indexes the tier-default dictionary directly (defaults[capability]) with no fallback. Any future addition to DriverCapability that is not also added to all three tier tables in GetTierDefaults will make Resolve throw KeyNotFoundException at runtime on the capability hot path rather than failing at build time. The two are coupled by convention only.
Recommendation: Either add a default arm to Resolve returning a conservative policy (and logging), or add a unit-test invariant asserting every DriverCapability value is present in each tier's default table.
Resolution: (open)
Core-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieBuilder.cs:58-75 |
| Status | Open |
Description: PermissionTrieBuilder.Descend has a two-branch behaviour: with a scopePaths lookup it descends the real hierarchy; without one it falls back to placing every non-cluster row directly under the root keyed by ScopeId ("works for deterministic tests, not for production"). The fallback silently produces a structurally incorrect trie when scopePaths is null or a row's ScopeId is missing — a UnsLine-scoped grant ends up as a direct child of the root, so WalkEquipment / WalkSystemPlatform never reach it and the grant is effectively dropped, with no diagnostic. There is no test asserting the production multi-level descent versus the fallback.
Recommendation: Add unit tests covering Build with scopePaths producing the correct multi-level trie and the missing-ScopeId fallback. Have Descend surface a diagnostic (or throw outside test configuration) when a sub-cluster row cannot be located in scopePaths.
Resolution: (open)
Core-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | src/Core/ZB.MOM.WW.OtOpcUa.Core/Stability/WedgeDetector.cs:26, src/Core/ZB.MOM.WW.OtOpcUa.Core/Observability/DriverHealthReport.cs:11-22 |
| Status | Open |
Description: Two stale doc comments. (1) WedgeDetector — the <summary> above the constructor reads "Whether the driver reported itself DriverState.Healthy at construction." The constructor takes only a TimeSpan threshold and the detector is documented as stateless; the comment describes nothing the constructor does. (2) DriverHealthReport — the <remarks> state matrix lists Unknown, Initializing, Healthy, Degraded, Faulted but Aggregate (lines 42-44) also folds DriverState.Reconnecting into the Degraded verdict. Reconnecting is a real DriverState member absent from the documented matrix.
Recommendation: Replace the WedgeDetector constructor <summary> with an accurate description (e.g. "Construct with the wedge-detection threshold; values below 60 s clamp to 60 s"). Add Reconnecting to the DriverHealthReport <remarks> state matrix and state it maps to Degraded.
Resolution: (open)