review(Cluster): record findings + fix snapshot consistency, dispose, stale docs

Code review at HEAD 7286d320. Cluster-001 (SeedFromCurrentState reads from one
snapshot), Cluster-003 (HoconLoader double-dispose), Cluster-004 (stale akka.conf
header), Cluster-005 (ServiceLevelCalculator tests added to Cluster.Tests). Cluster-002
deferred (no production caller).
This commit is contained in:
Joseph Doherty
2026-06-19 10:22:59 -04:00
parent 6dc74289ce
commit b1946194d6
5 changed files with 242 additions and 8 deletions
+144
View File
@@ -0,0 +1,144 @@
# Code Review — Cluster
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Cluster` |
| Reviewer | Claude Code |
| Review date | 2026-06-19 |
| Commit reviewed | `7286d320` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Cluster-001 (double-snapshot in seed — resolved), Cluster-002 (stale-member lag — deferred) |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | No issues found — all mutable state protected by `_lock`; event raised outside lock |
| 4 | Error handling & resilience | No issues found |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Cluster-003 (redundant double-dispose in `HoconLoader` — resolved) |
| 7 | Design-document adherence | Cluster-004 (stale `akka.conf` header comments — resolved) |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Cluster-005 (`ServiceLevelCalculatorTests` in wrong project — resolved) |
| 10 | Documentation & comments | Cluster-004 (stale `akka.conf` header comments — resolved) |
## Findings
### Cluster-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Cluster/ClusterRoleInfo.cs:100-105` |
| Status | Resolved |
**Description:** `SeedFromCurrentState` takes a snapshot of `_cluster.State` into `snapshot` at
line 89, then iterates `snapshot.Members` for roles but calls `_cluster.State.RoleLeader(role)` at
line 102 — a second, live call to the cluster state. Between these two calls the cluster state
could advance. This yields a `_roleLeaders` map seeded from two inconsistent views: members from
one snapshot, leaders from another. In the worst case the leader address is not in `snapshot.Members`,
so `snapshot.Members.FirstOrDefault(m => m.Address == leaderAddr)` returns `null` and the role is
seeded as leaderless. The first `RoleLeaderChanged` event self-corrects.
**Recommendation:** Replace `_cluster.State.RoleLeader(role)` with `snapshot.RoleLeader(role)`.
`CurrentClusterState` exposes `RoleLeader(string)` for exactly this purpose.
**Resolution:** Fixed 2026-06-19. Changed line 102 to call `snapshot.RoleLeader(role)` so the
entire seed reads from one consistent state object. A unit-level regression test requires a live
Akka cluster (TestKit); the fix is verified by code inspection + the module building clean.
Cross-cluster regression coverage lives in `Host.IntegrationTests`.
---
### Cluster-002
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Cluster/ClusterRoleInfo.cs:110-133` |
| Status | Deferred |
**Description:** `HandleMemberEvent` only adds members on `MemberUp` and removes on
`MemberRemoved`. `MemberLeft`, `MemberExited`, `MemberWeaklyUp`, and `MemberDowned` (all
`IMemberEvent` implementors) are silently dropped by the `switch` fall-through default. A member
that is gracefully leaving stays in `_membersByRole` until the `Removed` transition (several
seconds). Currently `MembersWithRole` has no production callers so this causes no observable
misbehaviour, but if a future caller needs leaving-aware membership it will silently receive stale
data.
**Recommendation:** Add cases for `MemberLeft`, `MemberExited`, and `MemberDowned` that remove the
member from `_membersByRole` immediately.
**Resolution:** Deferred — `MembersWithRole` has no production callers today; stale window is
bounded and self-corrects on `MemberRemoved`. Revisit when a caller is added.
---
### Cluster-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Cluster/HoconLoader.cs:12-16` |
| Status | Resolved |
**Description:** `LoadBaseConfig` has two nested `using` scopes — one for `stream` and one for
`reader`. `StreamReader.Dispose` closes its underlying stream; the outer `using var stream` then
disposes the same `UnmanagedMemoryStream` a second time. Double-disposal is safe for managed
streams, but the pattern is confusing and misrepresents ownership: `StreamReader` should
exclusively own the stream.
**Recommendation:** Remove the outer `using var stream`; inline the null-guard into the
`StreamReader` constructor call.
**Resolution:** Fixed 2026-06-19. Removed the redundant outer `using var stream`; the
`StreamReader` now exclusively owns the stream lifetime. Existing `HoconLoaderTests` remain green.
---
### Cluster-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Cluster/Resources/akka.conf:4,7` |
| Status | Resolved |
**Description:** The `akka.conf` header contains two stale references:
1. Line 4 references `AkkaHostedService.cs` which was deleted in commit `d6fac2d` when the
bootstrap migrated to `ServiceCollectionExtensions.WithOtOpcUaClusterBootstrap` + Akka.Hosting.
2. Line 7 attributes the tuning to `ScadaLink`, a different project unrelated to this codebase.
**Recommendation:** Update the header to reference `WithOtOpcUaClusterBootstrap` and remove the
`ScadaLink` attribution.
**Resolution:** Fixed 2026-06-19. Updated `akka.conf` header to reference
`ServiceCollectionExtensions.WithOtOpcUaClusterBootstrap`; removed stale `ScadaLink` attribution.
---
### Cluster-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests/ServiceLevelCalculatorTests.cs` |
| Status | Resolved |
**Description:** `ServiceLevelCalculator` was moved from `ControlPlane` to
`Cluster.Redundancy` (per `docs/Redundancy.md`), but `ServiceLevelCalculatorTests` remained in
`ControlPlane.Tests`. The `Cluster.Tests` project has no direct coverage of the module's most
complex logic.
**Recommendation:** Move the tests to `Cluster.Tests`. As the `ControlPlane.Tests` file is outside
this review scope, add equivalent tests directly in `Cluster.Tests`.
**Resolution:** Fixed 2026-06-19. Added `ServiceLevelCalculatorTests` to `Cluster.Tests` covering
all tiers. The `ControlPlane.Tests` copy is left in place (out of scope).