Files
scadalink-design/code-reviews/ClusterInfrastructure/findings.md

367 lines
18 KiB
Markdown

# Code Review — ClusterInfrastructure
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.ClusterInfrastructure` |
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 7 |
## Summary
The ClusterInfrastructure module is currently a **Phase 0 skeleton**. It contains
only two source files: `ClusterOptions.cs`, a plain options POCO, and
`ServiceCollectionExtensions.cs`, whose two registration methods are explicit no-ops.
None of the responsibilities described in `Component-ClusterInfrastructure.md`
Akka.NET cluster bootstrap, leader election, failover detection, split-brain
resolution, cluster singleton hosting, Windows service lifecycle — are implemented.
There are therefore no correctness, concurrency, or Akka-convention defects to find
in *behaviour*, because there is no behaviour. The findings below instead concern
(a) the large gap between the design doc and the code, (b) the options class missing
the validation, configuration-binding affordances, and coverage of documented
settings that peer modules provide, and (c) the no-op DI extensions silently
returning success, which is a latent reliability hazard once the Host wires this
module in. The dominant theme is **incompleteness**: this module is the foundation
every other component runs on, yet it presently delivers nothing the design requires.
The single options class is clean and its test covers defaults and setters
adequately for what exists.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). |
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. |
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. |
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. |
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. |
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. |
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). |
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. |
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). |
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). |
## Findings
### ClusterInfrastructure-001 — Module implements none of its documented responsibilities
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9`, `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16` |
**Description**
`Component-ClusterInfrastructure.md` assigns this module seven concrete
responsibilities: bootstrap the Akka.NET `ActorSystem`, form the two-node cluster,
manage leader election / active-standby role assignment, detect node failures and
trigger failover, provide remoting, host the cluster singleton, and manage the
Windows service lifecycle. The entire module is two files: a `ClusterOptions` POCO
and a `ServiceCollectionExtensions` whose methods are explicitly commented
`// Phase 0: skeleton only` and `// Phase 0: placeholder for Akka actor registration`
and simply return the unmodified `IServiceCollection`. There is no `Akka.Cluster`,
`Akka.Cluster.Tools`, `Akka.Remote`, or split-brain-resolver dependency in the
`.csproj` at all (it references only `Microsoft.Extensions.DependencyInjection.Abstractions`,
`Microsoft.Extensions.Options`, and `ScadaLink.Commons`). Because every other
ScadaLink component runs inside the actor system this module is responsible for
creating, the absence of any implementation blocks the foundational layer of the
system.
**Recommendation**
Track the gap explicitly (a milestone/issue) and implement the documented behaviour:
add the Akka cluster/remote/cluster-tools and split-brain-resolver package
references, build the cluster bootstrap (HOCON generation from `ClusterOptions`),
the split-brain resolver configuration, cluster-singleton hosting support, and
`CoordinatedShutdown` wiring. Until then, the module's `Status` and the design doc
should clearly state it is unimplemented so callers do not assume otherwise.
**Resolution**
_Re-triaged 2026-05-16 — remains Open, needs a design decision from the user._
Verified against the source at the reviewed commit: the finding's factual claims hold.
`src/ScadaLink.ClusterInfrastructure` still contains only `ClusterOptions.cs` and a
no-op `ServiceCollectionExtensions.cs`, and the `.csproj` references no Akka packages.
However, the documented cluster behaviour is **not actually absent from the system**
it has been implemented in the **Host** project rather than in this module:
- `src/ScadaLink.Host/Actors/AkkaHostedService.cs` bootstraps the `ActorSystem`,
generates the HOCON from `ClusterOptions` (it imports `ScadaLink.ClusterInfrastructure`
and injects `IOptions<ClusterOptions>`), and configures the `keep-oldest` split-brain
resolver with `down-if-alone = on` (see `AkkaHostedService.cs:95-96`).
- `src/ScadaLink.Host/Health/AkkaClusterHealthCheck.cs`, `AkkaClusterNodeProvider.cs`,
and `Health/ActiveNodeHealthCheck.cs` cover cluster membership / active-node detection.
- Akka cluster/remote package references live in `ScadaLink.Host.csproj` and the
per-component projects (`SiteRuntime`, `Communication`, etc.).
So the real situation is an **ownership / design-doc drift**, not missing behaviour:
`Component-ClusterInfrastructure.md` assigns the Akka bootstrap, HOCON generation,
split-brain config and `CoordinatedShutdown` wiring to this module, but the
implementation deliberately lives in the Host. `ClusterOptions` is the one piece this
module legitimately owns and it is consumed correctly by the Host.
Resolving CI-001 as literally written is **not a small, well-scoped fix** — it is one
of two substantial decisions, both requiring the user:
1. **Move the bootstrap into this module** — relocate the HOCON generation, split-brain
config, cluster-singleton helpers and `CoordinatedShutdown` wiring out of
`ScadaLink.Host` into `ScadaLink.ClusterInfrastructure`, add the Akka package
references, and re-wire the Host to call into it. This is a cross-module refactor
touching `src/ScadaLink.Host/*` and several other projects — outside the edit scope
permitted for this finding (only `src/ScadaLink.ClusterInfrastructure/`,
`tests/ScadaLink.ClusterInfrastructure.Tests/`, and this file may be edited).
2. **Accept the current placement** — keep the bootstrap in the Host and update
`Component-ClusterInfrastructure.md` (and the README component table) to record that
the Host owns the actor-system/cluster bootstrap and that this module's role is the
shared `ClusterOptions` contract. That fix is a design-doc edit, also outside this
module's permitted edit scope.
Either path is a deliberate architecture decision, not a bug fix. The decision was
surfaced to the user, who chose **option 2 — accept the current placement**: the Akka
bootstrap stays in the Host (the single deployable binary that performs all actor-system
bring-up), and the design docs are corrected to record the true ownership.
**Resolved** — fixing commit `<pending>`, date 2026-05-16. The finding was a design-doc
drift, not missing behaviour. `docs/requirements/Component-ClusterInfrastructure.md` now
carries an "Implementation Note — Code Placement" section stating that the
`ScadaLink.ClusterInfrastructure` project owns the `ClusterOptions` configuration model
while `ScadaLink.Host` owns the Akka bootstrap, HOCON generation, split-brain-resolver
wiring, `CoordinatedShutdown` integration, and active-node health checks. The README
component table (row 13) was updated to match. No code change was required — the
documented cluster behaviour already exists and is exercised; only the doc's
module-ownership claim was wrong. Module test suite green (3 passed).
### ClusterInfrastructure-002 — No-op DI extension methods report success while doing nothing
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:7-17` |
**Description**
`AddClusterInfrastructure` and `AddClusterInfrastructureActors` both accept an
`IServiceCollection` and return it unchanged. A caller (e.g. the Host) that invokes
`services.AddClusterInfrastructure()` receives a fluent, success-looking result but
no actor system, no cluster, and no singleton support is actually registered. This
is a silent failure: the system will appear to start, then fail later and far from
the cause (e.g. when a component resolves an `ActorSystem` that was never added, or
when the cluster singleton never forms). A no-op that masquerades as a completed
registration is worse than an unimplemented method that throws.
**Recommendation**
Until the real implementation exists, make the placeholder loud rather than silent —
either throw `NotImplementedException` from the methods, or have them log a
prominent warning, so an integrating caller fails fast with a clear cause. Replace
with the genuine registration when CI-001 is addressed.
**Resolution**
_Unresolved._
### ClusterInfrastructure-003 — ClusterOptions omits several documented node-configuration settings
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
**Description**
The "Node Configuration", "Split-Brain Resolution", and "Failure Detection Timing"
sections of the design doc enumerate the settings each node needs. `ClusterOptions`
exposes `SeedNodes`, `SplitBrainResolverStrategy`, `StableAfter`,
`HeartbeatInterval`, `FailureDetectionThreshold`, and `MinNrOfMembers`, but is
missing: the Akka remoting hostname/port (default 8081 central, 8082 site), the
cluster role (Central vs. Site) and the site identifier, the `down-if-alone` flag
(the design explicitly requires `down-if-alone = on` for the keep-oldest resolver),
and — for site nodes — the gRPC port (default 8083) and local SQLite storage paths.
Without these, the options class cannot drive a correct HOCON configuration when
CI-001 is implemented. (Some settings such as remoting host/port may instead belong
in `Host/NodeOptions.cs`; the split of ownership should be decided deliberately, but
at minimum `down-if-alone` belongs with the split-brain settings here.)
**Recommendation**
Add the missing settings — at minimum a `DownIfAlone` boolean (default `true`) and
the cluster role / site identifier — or document explicitly which settings are
owned by `Host/NodeOptions.cs` instead, so the design doc and the options classes
agree on where each value lives.
**Resolution**
_Unresolved._
### ClusterInfrastructure-004 — ClusterOptions has no validation despite safety-critical values
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
**Description**
`ClusterOptions` carries values whose misconfiguration has cluster-wide
consequences. The design doc is emphatic that `min-nr-of-members` must be `1` (a
value of `2` blocks the singleton and therefore all data collection indefinitely
after failover), that `SplitBrainResolverStrategy` must be `keep-oldest` for a
two-node cluster (quorum strategies cause total shutdown), and that the timing
values are interdependent (`HeartbeatInterval` must be well below
`FailureDetectionThreshold`). The class has no data annotations, no
`IValidateOptions<ClusterOptions>`, and no guard logic, so an `appsettings.json`
setting `MinNrOfMembers: 2` or `SplitBrainResolverStrategy: "keep-majority"` (the
exact value the test at `ClusterOptionsTests.cs:35` shows is settable) would be
accepted silently and produce the catastrophic outcomes the design doc warns
against.
**Recommendation**
Add validation — data annotations (`[Range]` for `MinNrOfMembers`, etc.) plus an
`IValidateOptions<ClusterOptions>` implementation that enforces
`MinNrOfMembers == 1`, restricts `SplitBrainResolverStrategy` to a known set,
requires `SeedNodes` non-empty, and asserts `HeartbeatInterval <
FailureDetectionThreshold` and positive `StableAfter`. Register it with
`ValidateOnStart()` so misconfiguration fails fast at boot.
**Resolution**
_Unresolved._
### ClusterInfrastructure-005 — No configuration section name constant for the Options pattern binding
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3` |
**Description**
CLAUDE.md specifies per-component configuration via `appsettings.json` sections
bound with the Options pattern. `ClusterOptions` provides no `public const string
SectionName` (or equivalent) for the binding site to reference, so whichever code
binds the section must hard-code the magic string, and there is no single source of
truth for the section name. Because `AddClusterInfrastructure` is itself a no-op
(CI-002), the options class is currently bound nowhere at all, making the missing
constant easy to overlook.
**Recommendation**
Add a `public const string SectionName = "Cluster";` (or the agreed name) to
`ClusterOptions` and have the eventual `AddClusterInfrastructure` bind
`configuration.GetSection(ClusterOptions.SectionName)` against it.
**Resolution**
_Unresolved._
### ClusterInfrastructure-006 — No tests for any cluster behaviour; only the options POCO is covered
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsTests.cs:1-51` |
**Description**
The test project contains only `ClusterOptionsTests`, exercising default values and
property setters of `ClusterOptions`. There are no tests for cluster formation,
leader election, failover detection, split-brain resolution, singleton handover, or
the `ServiceCollectionExtensions` registration methods — none can exist because the
behaviour itself is absent (CI-001). This is recorded so the testing gap is tracked
alongside the implementation gap: the most safety-critical paths of the entire
system (failover, split-brain, dual-node recovery) are completely untested. The
test at line 30-50 also asserts that `SplitBrainResolverStrategy` can be set to
`"keep-majority"`, implicitly endorsing a value the design doc forbids for a
two-node cluster — see CI-004.
**Recommendation**
When CI-001 is implemented, add multi-node `Akka.Cluster.TestKit` /
`MultiNodeTestKit` tests covering cluster formation, failover promotion,
split-brain downing, and singleton handover, plus unit tests for HOCON generation
from `ClusterOptions` and for the options validation from CI-004.
**Resolution**
_Unresolved._
### ClusterInfrastructure-007 — ClusterOptions lacks XML documentation comments
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
**Description**
`ClusterOptions` and each of its six properties have no XML doc comments. Peer
options classes such as `StoreAndForward/StoreAndForwardOptions.cs` document the
class and every property (including units and design-doc references). For a class
whose values carry the cluster-wide consequences described in the design doc
(notably `MinNrOfMembers` and `SplitBrainResolverStrategy`), the absence of inline
documentation is a maintainability and safety gap — a future editor has no in-code
warning that `MinNrOfMembers` must stay `1`.
**Recommendation**
Add `<summary>` comments to the class and each property, stating units and the
documented constraints (e.g. that `MinNrOfMembers` must be `1`, that
`HeartbeatInterval` must be well below `FailureDetectionThreshold`), referencing
the relevant design-doc sections as peer modules do.
**Resolution**
_Unresolved._
### ClusterInfrastructure-008 — "Phase 0 skeleton" status is undocumented at the module level
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9`, `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16` |
**Description**
The only indication that this foundational module is unimplemented is two inline
comments inside private method bodies (`// Phase 0: skeleton only` /
`// Phase 0: placeholder for Akka actor registration`). There is no module README,
no `<!-- TODO -->` in the design doc, and no tracking marker visible to anyone
reading the project structure or the component table. Given that the design doc
(`Component-ClusterInfrastructure.md`) describes a fully featured component with no
caveat, a reader will reasonably assume the module is built. The mismatch between a
complete-looking design doc and an empty implementation is itself a documentation
defect.
**Recommendation**
Add a short note to the design doc (or a module-level `README.md`) stating the
current implementation status and what "Phase 0" delivers, and reference a tracked
issue for the remaining work (CI-001). Keep the README component table accurate
about which components are skeletons versus implemented.
**Resolution**
_Unresolved._