Files
lmxopcua/code-reviews/Core/findings.md
T
Joseph Doherty 354b0e7613 review(Core): re-review at HEAD; clean up duplicate/unexplained doc comments
Re-review at 7286d320. Core-013 (duplicate <summary> on HostBoundHandle), Core-014
(clarify EquipmentNodeWalker test-only hardcoded attrs). Both Low, doc-only. Prior
authz/Galaxy churn verified correct.
2026-06-19 11:06:56 -04:00

24 KiB

Code Review — Core

Field Value
Module src/Core/ZB.MOM.WW.OtOpcUa.Core
Reviewer Claude Code
Review date 2026-06-19
Commit reviewed 7286d320
Status Reviewed
Open findings 0

Checklist coverage

# Category Result
1 Correctness & logic bugs Core-001, Core-002, Core-003
2 OtOpcUa conventions Core-004
3 Concurrency & thread safety Core-005, Core-006
4 Error handling & resilience Core-007, Core-008
5 Security Core-002
6 Performance & resource management Core-009
7 Design-document adherence Core-002, Core-003
8 Code organization & conventions Core-010
9 Testing coverage Core-011
10 Documentation & comments Core-012

Findings

Core-001

Field Value
Severity High
Category Correctness & logic bugs
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/UserAuthorizationState.cs:50-68
Status Resolved

Description: NeedsRefresh can never return true with the default field values. AuthCacheMaxStaleness defaults to 5 minutes and MembershipFreshnessInterval defaults to 15 minutes. NeedsRefresh(utcNow) is defined as !IsStale(utcNow) && elapsed > MembershipFreshnessInterval, i.e. it needs elapsed > 15 min AND elapsed <= 5 min simultaneously — an empty set. The session crosses the staleness ceiling (5 min) and fails closed long before it ever reaches the 15-minute freshness boundary that is supposed to signal "kick off an async re-resolution while still serving cached memberships." Decision #151 / #152 in docs/v2/implementation/phase-6-2-authorization-runtime.md intends the freshness window (15 min, re-resolve) to be the inner trigger and the staleness ceiling to be the outer hard limit; with these defaults the ordering is inverted, so the "refresh while warm" path is dead code and every long-lived session hard-fails authorization after 5 minutes.

Recommendation: Either swap the defaults so MembershipFreshnessInterval (e.g. 5 min) is strictly less than AuthCacheMaxStaleness (e.g. 15 min) — matching the doc's stated intent — or, if the 5/15 values are correct, redefine which window is the refresh trigger and which is the fail-closed ceiling. Add a unit test asserting NeedsRefresh returns true for at least one point in time with the production defaults.

Resolution: Resolved 2026-05-22 — swapped the defaults so MembershipFreshnessInterval is 5 min and AuthCacheMaxStaleness is 15 min (freshness = inner re-resolve trigger, staleness = outer fail-closed ceiling); added a NeedsRefresh_FiresWithin_ProductionDefault_Windows regression test.

Core-002

Field Value
Severity High
Category Security
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs:24-50
Status Resolved

Description: TriePermissionEvaluator.Authorize never compares the session's AuthGenerationId against the generation of the trie it evaluates against. It calls _cache.GetTrie(scope.ClusterId) — the current-generation shortcut — and authorizes against whatever generation the cache happens to hold. UserAuthorizationState carries AuthGenerationId precisely so a stale session can be detected, and the Phase 6.2 design (phase-6-2-authorization-runtime.md adversarial-review item #3 "Redundancy-safe invalidation", plus the §Scope PermissionTrieCache + freshness row) requires the hot-path call to look up CurrentGenerationId and force a re-evaluation on mismatch. As written, a session bound at generation N silently evaluates against generation N+1 the instant another node publishes — grants added or removed in N+1 take effect for that session without the intended generation-stamp re-check, and the provenance returned in AuthorizationDecision misreports which generation produced the verdict.

Recommendation: In Authorize, after resolving the trie, compare trie.GenerationId to session.AuthGenerationId. On mismatch either fetch the session's bound generation via _cache.GetTrie(clusterId, session.AuthGenerationId) and evaluate against it, or signal the caller to re-resolve the session's auth state before retrying. Add a test for the publish-during-session scenario.

Resolution: Resolved 2026-05-22 — Authorize now compares trie.GenerationId to session.AuthGenerationId and, on mismatch, re-fetches the session's bound generation via _cache.GetTrie(clusterId, session.AuthGenerationId), failing closed when that generation has been pruned; added publish-during-session and pruned-generation regression tests.

Core-003

Field Value
Severity Medium
Category Correctness & logic bugs
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrie.cs:80-98
Status Resolved

Description: WalkSystemPlatform records every Galaxy folder-segment grant with NodeAclScopeKind.Equipment (see the comment at lines 82-86) because NodeAclScopeKind has no FolderSegment member. The functional union of permission flags is unaffected, but the MatchedGrant.Scope carried in AuthorizationDecision.Provenance is wrong for Galaxy nodes: a grant anchored at a namespace-root folder and a grant anchored at a deep sub-folder both report Equipment, and a namespace-level grant is indistinguishable from a folder-level grant in the audit trail and the Admin UI "Probe this permission" diagnostic. The Phase 6.2 design (adversarial-review item #6) calls for a dedicated FolderSegment scope level. The current code is a known shortcut but references only an untracked "TODO" with no issue ID.

Recommendation: Add a FolderSegment member to NodeAclScopeKind and use it in WalkSystemPlatform and PermissionTrieBuilder so Galaxy folder grants report their true scope. If the enum change is deferred, file a tracked issue and reference its ID in the code comment.

Resolution: Resolved 2026-05-22 — added FolderSegment to NodeAclScopeKind; WalkSystemPlatform now reports FolderSegment instead of Equipment for each visited Galaxy folder level; added three regression tests asserting the correct scope is reported in MatchedGrant.Scope.

Core-004

Field Value
Severity Low
Category OtOpcUa conventions
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs:55,72,87
Status Resolved

Description: DriverHost is a library type whose async calls (driver.InitializeAsync, driver.ShutdownAsync) do not use ConfigureAwait(false), whereas the sibling CapabilityInvoker and AlarmSurfaceInvoker in the same module consistently do. The server host has no synchronization context so behaviour is currently correct, but the inconsistency is a maintenance hazard and a deviation from the established convention in Core.Resilience.

Recommendation: Add .ConfigureAwait(false) to the three awaited calls in DriverHost.RegisterAsync, UnregisterAsync, and DisposeAsync.

Resolution: Resolved 2026-05-23 — added .ConfigureAwait(false) to the three awaited driver calls in RegisterAsync, UnregisterAsync, and DisposeAsync; added three RegisterAsync/UnregisterAsync/DisposeAsync_Does_Not_Capture_SynchronizationContext regression tests that install a tracking SynchronizationContext on a dedicated thread and assert zero captured posts.

Core-005

Field Value
Severity Medium
Category Concurrency & thread safety
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs:59-70
Status Resolved

Description: Prune mutates the ConcurrentDictionary with a plain indexer assignment (_byCluster[clusterId] = new ClusterEntry(...)) after a separate TryGetValue read. If Install runs concurrently for the same cluster, the AddOrUpdate in Install and the indexer write in Prune race: Prune can read an entry, Install then adds a newer generation via AddOrUpdate, and Prune's unconditional indexer write then overwrites the entry — silently dropping the just-installed newest generation and its Current pointer. The class is documented as a process-singleton accessed on the hot path while publishes install new tries, so the race is reachable.

Recommendation: Make Prune use an atomic compare-and-swap loop — _byCluster.TryUpdate(clusterId, prunedEntry, observedEntry) retried until it succeeds or the key is gone — or perform the prune inside an AddOrUpdate update factory.

Resolution: Resolved 2026-05-22 — changed ClusterEntry from sealed record to sealed class (enabling reference-equality CAS via TryUpdate); Prune now uses a read-compute-TryUpdate retry loop that restarts if another thread updates the entry between the read and the write; added regression tests asserting the current generation is preserved after a concurrent install + prune sequence.

Core-006

Field Value
Severity Medium
Category Concurrency & thread safety
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64
Status Resolved

Description: BuildAddressSpaceAsync is not guarded against being called more than once. A second call subscribes a second _alarmForwarder to IAlarmSource.OnAlarmEvent and overwrites the _alarmForwarder field, so the first delegate is leaked (still subscribed, never unsubscribed because Dispose only removes the field's current value). Every alarm transition would then be delivered to its sink twice. The address-space rebuild path on Galaxy redeploy (DeployWatcherIRediscoverable.OnRediscoveryNeeded → server rebuilds the address space) is exactly the scenario where a node manager could legitimately be re-walked. There is also no check of the _disposed flag at the top of the method.

Recommendation: Either guard BuildAddressSpaceAsync so a second call throws InvalidOperationException (document it single-shot), or unsubscribe the previous _alarmForwarder and clear _alarmSinks before re-walking. Also check _disposed and throw ObjectDisposedException if already disposed.

Resolution: Resolved 2026-05-22 — BuildAddressSpaceAsync now checks _disposed (throws ObjectDisposedException) and tears down the previous alarm forwarder + clears the sink registry before re-walking so a Galaxy-redeploy rebuild does not double-subscribe the forwarder; added three regression tests covering double-build, sink-count after rebuild, and post-dispose throw.

Core-007

Field Value
Severity Medium
Category Error handling & resilience
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs:75-83
Status Resolved

Description: UnsubscribeAsync always routes through _defaultHost, even when an IPerCallHostResolver is wired and the original SubscribeAsync fanned the subscription out to a non-default host. The IAlarmSubscriptionHandle is opaque here and carries no host association, so an unsubscribe for a subscription created against host B runs through host A's resilience pipeline. In a multi-host driver this charges the wrong host's circuit breaker / bulkhead and, if host A is open while host B is healthy, can spuriously block a valid unsubscribe. The XML doc claims it routes "for parity" with SubscribeAsync but subscribe is per-host and unsubscribe is not.

Recommendation: Carry the resolved host on the IAlarmSubscriptionHandle (or in a handle→host map kept by AlarmSurfaceInvoker) so UnsubscribeAsync routes through the same host's pipeline the subscription was created on.

Resolution: Resolved 2026-05-22 — SubscribeAsync now wraps each driver handle in a HostBoundHandle (private IAlarmSubscriptionHandle implementation) that carries the resolved host name; UnsubscribeAsync unwraps it and routes through the recorded host's pipeline, falling back to the default host for handles not created by this invoker; added two regression tests verifying per-host routing and single-host fallback.

Core-008

Field Value
Severity Low
Category Error handling & resilience
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64
Status Resolved

Description: The XML summary of BuildAddressSpaceAsync states "Driver exceptions are isolated per decision #12 — the driver's subtree is marked Faulted, but other drivers remain available." The method body contains no such isolation: an exception from discovery.DiscoverAsync propagates straight out unhandled, and nothing here marks a subtree Faulted. The isolation is presumably done by the server-layer caller, but the comment asserts behaviour this class does not implement.

Recommendation: Either implement the documented isolation in GenericDriverNodeManager, or correct the XML doc to state that exception isolation is the caller's responsibility and name the type that performs it.

Resolution: Resolved 2026-05-23 — corrected the BuildAddressSpaceAsync XML doc to (a) explicitly state exception isolation is the caller's responsibility, and (b) name the type that performs it (Server.OpcUa.OpcUaApplicationHost.PopulateAddressSpaces); added BuildAddressSpaceAsync_Propagates_Discovery_Exceptions_To_Caller regression test verifying the documented propagation behaviour.

Core-009

Field Value
Severity Low
Category Performance & resource management
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs:121-128
Status Resolved

Description: ExecuteWriteAsync calls _optionsAccessor() three times for a single non-idempotent write (once for the with expression, once inside the dictionary initializer for .Resolve(...), plus the discarded base). On the per-write hot path it rebuilds a fresh DriverResilienceOptions and a one-entry dictionary on every non-idempotent write, and the redundant accessor calls could observe two different snapshots if an Admin edit lands between them. Phase 6.1 budgets a 1% pipeline overhead; this is unnecessary allocation plus a minor consistency hazard.

Recommendation: Capture var options = _optionsAccessor(); once at the top of the non-idempotent branch and derive both the with and the Resolve call from that snapshot. Consider caching the no-retry pipeline keyed on (hostName, non-idempotent).

Resolution: Resolved 2026-05-23 — ExecuteWriteAsync now captures _optionsAccessor() into a single snapshot local at the top of the non-idempotent branch; the with expression and the Resolve(Write) call both derive from that snapshot so the two values are guaranteed coherent and only one accessor invocation occurs per call. Added ExecuteWriteAsync_NonIdempotent_Snapshots_Options_Once_Per_Call (counts invocations) and ExecuteWriteAsync_NonIdempotent_Uses_Consistent_Options_Snapshot (alternating-accessor) regression tests.

Core-010

Field Value
Severity Low
Category Code organization & conventions
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/DriverResilienceOptions.cs:45-52
Status Resolved

Description: DriverResilienceOptions.Resolve indexes the tier-default dictionary directly (defaults[capability]) with no fallback. Any future addition to DriverCapability that is not also added to all three tier tables in GetTierDefaults will make Resolve throw KeyNotFoundException at runtime on the capability hot path rather than failing at build time. The two are coupled by convention only.

Recommendation: Either add a default arm to Resolve returning a conservative policy (and logging), or add a unit-test invariant asserting every DriverCapability value is present in each tier's default table.

Resolution: Resolved 2026-05-23 — Resolve now uses TryGetValue and throws a diagnostic KeyNotFoundException whose message names the missing capability + tier and points to GetTierDefaults when a capability is missing from both the override map and the tier table; the existing TierDefaults_Cover_EveryCapability test invariant prevents this in shipped code, and added Resolve_Returns_NonNull_Policy_For_Every_Capability (per-tier exhaustive) + Resolve_Throws_Diagnostic_When_Capability_Missing_From_Tier_Defaults regression tests.

Core-011

Field Value
Severity Low
Category Testing coverage
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieBuilder.cs:58-75
Status Resolved

Description: PermissionTrieBuilder.Descend has a two-branch behaviour: with a scopePaths lookup it descends the real hierarchy; without one it falls back to placing every non-cluster row directly under the root keyed by ScopeId ("works for deterministic tests, not for production"). The fallback silently produces a structurally incorrect trie when scopePaths is null or a row's ScopeId is missing — a UnsLine-scoped grant ends up as a direct child of the root, so WalkEquipment / WalkSystemPlatform never reach it and the grant is effectively dropped, with no diagnostic. There is no test asserting the production multi-level descent versus the fallback.

Recommendation: Add unit tests covering Build with scopePaths producing the correct multi-level trie and the missing-ScopeId fallback. Have Descend surface a diagnostic (or throw outside test configuration) when a sub-cluster row cannot be located in scopePaths.

Resolution: Resolved 2026-05-23 — added optional Action<PermissionTrieBuildDiagnostic>? diagnostic parameter to PermissionTrieBuilder.Build; Descend now invokes the callback with a MissingScopePath diagnostic when a sub-cluster row's ScopeId is absent from a supplied (non-null) scopePaths lookup so production callers can log + surface orphan grants instead of silently dropping them. New PermissionTrieBuilderTests covers (a) production multi-level descent with sibling-line non-leakage, (b) the deterministic-test fallback, (c) the diagnostic firing on a missing scope-path entry, (d) no diagnostic when all rows resolve, and (e) no diagnostic when scopePaths is null (explicit test mode).

Core-012

Field Value
Severity Low
Category Documentation & comments
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Stability/WedgeDetector.cs:26, src/Core/ZB.MOM.WW.OtOpcUa.Core/Observability/DriverHealthReport.cs:11-22
Status Resolved

Description: Two stale doc comments. (1) WedgeDetector — the <summary> above the constructor reads "Whether the driver reported itself DriverState.Healthy at construction." The constructor takes only a TimeSpan threshold and the detector is documented as stateless; the comment describes nothing the constructor does. (2) DriverHealthReport — the <remarks> state matrix lists Unknown, Initializing, Healthy, Degraded, Faulted but Aggregate (lines 42-44) also folds DriverState.Reconnecting into the Degraded verdict. Reconnecting is a real DriverState member absent from the documented matrix.

Recommendation: Replace the WedgeDetector constructor <summary> with an accurate description (e.g. "Construct with the wedge-detection threshold; values below 60 s clamp to 60 s"). Add Reconnecting to the DriverHealthReport <remarks> state matrix and state it maps to Degraded.

Resolution: Resolved 2026-05-23 — replaced the WedgeDetector(.ctor) <summary> with an accurate "Construct with the wedge-detection threshold; values below 60 s clamp to 60 s" description plus a <param> block; added the Reconnecting row to the DriverHealthReport <remarks> state matrix and updated the verdict-rule prose. Added WedgeDetectorTests.Doc_Constructor_Summary_Describes_Threshold_Clamp and DriverHealthReportTests.Doc_State_Matrix_Includes_Reconnecting regression tests that parse the generated .xml doc to assert the strings, plus Any_Reconnecting_WithoutFaultedOrNotReady_IsDegraded confirming the documented Reconnecting → Degraded behaviour.


Re-review 2026-06-19 (commit 7286d320)

What changed since 76d35d1

  • Galaxy standard-driver migration (Phase A): NodeHierarchyKind.SystemPlatform and NodeScope.FolderSegments removed; PermissionTrie.WalkSystemPlatform deleted; CollectMatches now unconditionally delegates to WalkEquipment. Galaxy points resolve as ordinary Equipment-kind tags through the same UNS walk.
  • HistoryUpdate operation mapping fixed: TriePermissionEvaluator.MapOperationToPermission previously mapped OpcUaOperation.HistoryUpdate → NodePermissions.HistoryRead; now correctly maps to NodePermissions.HistoryUpdate. The prior TODO marker is gone; NodePermissions.HistoryUpdate is a real bit (1 << 12) with tests asserting no collision with MethodCall.
  • Generation-stamp guard added to TriePermissionEvaluator.Authorize (Core-002 full fix): compares trie.GenerationId to session.AuthGenerationId and re-fetches the bound generation, failing closed when pruned.
  • HostBoundHandle per-host routing in AlarmSurfaceInvoker (Core-007 full fix): SubscribeAsync wraps each driver handle with resolved host; UnsubscribeAsync unwraps it.
  • CapabilityInvoker.ExecuteWriteAsync now snapshots _optionsAccessor() once per non-idempotent write (Core-009 fix).
  • XML doc coverage added throughout (constructors, params, returns, inner classes).
  • EquipmentNodeWalker updated to reflect Galaxy-standard-driver retirement of SystemPlatform path; EquipmentNamespaceContent doc trimmed to remove stale generation-scoping language.
  • DriverResilienceOptions.Resolve throws a diagnostic KeyNotFoundException instead of crashing with KeyNotFoundException from dict indexer (Core-010 fix).

Checklist coverage (re-review)

# Category Result
1 Correctness & logic bugs No new issues
2 OtOpcUa conventions No new issues
3 Concurrency & thread safety No new issues
4 Error handling & resilience No new issues
5 Security No new issues — HistoryUpdate mapping fix verified correct; generation-stamp guard correct
6 Performance & resource management No new issues
7 Design-document adherence No new issues
8 Code organization & conventions Core-013
9 Testing coverage No new gaps
10 Documentation & comments Core-013, Core-014

Core-013

Field Value
Severity Low
Category Documentation & comments / Code organization
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs:153-162
Status Resolved

Description: The Core-007 fix introduced a HostBoundHandle private class with two consecutive identical <summary> XML doc blocks (lines 153-157 and 158-162). The C# compiler tolerates this but the generated XML doc contains a duplicate <member> entry for the type. Consumers of the XML (IDEs, dotnet-xml-doc) surface the duplicate as a warning or silently drop the second entry.

Recommendation: Remove the duplicate <summary> block, keeping exactly one.

Resolution: Resolved 2026-06-19 — removed the second duplicate <summary> block from HostBoundHandle; single-block doc retained. No regression test needed: the build verifies zero XML-doc warnings and the AlarmSurfaceInvokerTests suite (8 tests) covers the class behaviour.

Core-014

Field Value
Severity Low
Category Documentation & comments
Location src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/EquipmentNodeWalker.cs:158-168
Status Resolved

Description: EquipmentNodeWalker.AddTagVariable hardcodes SecurityClass: SecurityClassification.FreeAccess and IsHistorized: false without any comment explaining why these fields are not read from TagConfig. A maintainer unfamiliar with the retirement history (the production server path moved to AddressSpaceComposer → AddressSpaceApplier → Sink in ADR-001) could incorrectly assume the walker is the live composition path and spend time debugging missing HistoryRead bits or wrong security classes on nodes. In production, DeploymentArtifact.ExtractTagHistorize reads isHistorized/historianTagname from TagConfig JSON; EquipmentNodeWalker.Walk is only called by unit tests.

Recommendation: Add an inline comment above the DriverAttributeInfo constructor call explaining that the walker is test-support only in production; real nodes use AddressSpaceComposer/DeploymentArtifact.

Resolution: Resolved 2026-06-19 — added an 8-line inline comment above the DriverAttributeInfo constructor in AddTagVariable documenting the deliberate omission and naming the production code path. No behaviour change; no regression test needed (comment only).