Code Review — ClusterInfrastructure

| Field | Value |
|---|---|
| Module | src/ScadaLink.ClusterInfrastructure |
| Design doc | docs/requirements/Component-ClusterInfrastructure.md |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | 9c60592 |
| Open findings | 3 |

Summary

The ClusterInfrastructure module is currently a Phase 0 skeleton. It contains only two source files: ClusterOptions.cs, a plain options POCO, and ServiceCollectionExtensions.cs, whose two registration methods are explicit no-ops. None of the responsibilities described in Component-ClusterInfrastructure.md — Akka.NET cluster bootstrap, leader election, failover detection, split-brain resolution, cluster singleton hosting, Windows service lifecycle — are implemented. There are therefore no correctness, concurrency, or Akka-convention defects to find in behaviour, because there is no behaviour. The findings below instead concern (a) the large gap between the design doc and the code, (b) the options class missing the validation, configuration-binding affordances, and coverage of documented settings that peer modules provide, and (c) the no-op DI extensions silently returning success, which is a latent reliability hazard once the Host wires this module in. The dominant theme is incompleteness: this module is the foundation every other component runs on, yet it presently delivers nothing the design requires. The single options class is clean and its test covers defaults and setters adequately for what exists.

Checklist coverage

| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | | No executable logic exists beyond an options POCO; no logic bugs, but ServiceCollectionExtensions returns success while doing nothing (CI-002). |
| 2 | Akka.NET conventions | | No actors, no ActorSystem bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against Tell/Ask, immutability, or PipeTo. |
| 3 | Concurrency & thread safety | | No shared mutable state, no actors, no async code. No issues found in current code. |
| 4 | Error handling & resilience | | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. |
| 5 | Security | | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. |
| 6 | Performance & resource management | | No streams, connections, timers, or IDisposable resources exist yet. No issues found in current code. |
| 7 | Design-document adherence | | Severe drift: the module implements none of its documented responsibilities (CI-001). ClusterOptions also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and down-if-alone (CI-003). |
| 8 | Code organization & conventions | | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing IValidateOptions/data-annotation validation (CI-004) versus the Options pattern intent. |
| 9 | Testing coverage | | ClusterOptionsTests covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). |
| 10 | Documentation & comments | | ClusterOptions has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level: no README or tracking note (CI-008). |

Findings

ClusterInfrastructure-001 — Module implements none of its documented responsibilities

Severity High
Category Design-document adherence
Status Resolved
Location src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9, src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16

Description

Component-ClusterInfrastructure.md assigns this module seven concrete responsibilities: bootstrap the Akka.NET ActorSystem, form the two-node cluster, manage leader election / active-standby role assignment, detect node failures and trigger failover, provide remoting, host the cluster singleton, and manage the Windows service lifecycle. The entire module is two files: a ClusterOptions POCO and a ServiceCollectionExtensions whose methods are explicitly commented // Phase 0: skeleton only and // Phase 0: placeholder for Akka actor registration and simply return the unmodified IServiceCollection. There is no Akka.Cluster, Akka.Cluster.Tools, Akka.Remote, or split-brain-resolver dependency in the .csproj at all (it references only Microsoft.Extensions.DependencyInjection.Abstractions, Microsoft.Extensions.Options, and ScadaLink.Commons). Because every other ScadaLink component runs inside the actor system this module is responsible for creating, the absence of any implementation blocks the foundational layer of the system.

Recommendation

Track the gap explicitly (a milestone/issue) and implement the documented behaviour: add the Akka cluster/remote/cluster-tools and split-brain-resolver package references, build the cluster bootstrap (HOCON generation from ClusterOptions), the split-brain resolver configuration, cluster-singleton hosting support, and CoordinatedShutdown wiring. Until then, the module's Status and the design doc should clearly state it is unimplemented so callers do not assume otherwise.

Resolution

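Re-triaged 2026-05-16. The finding initially remained Open pending a design decision from the user; that decision and the final resolution are recorded at the end of this section.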

Verified against the source at the reviewed commit: the finding's factual claims hold. src/ScadaLink.ClusterInfrastructure still contains only ClusterOptions.cs and a no-op ServiceCollectionExtensions.cs, and the .csproj references no Akka packages.

However, the documented cluster behaviour is not actually absent from the system — it has been implemented in the Host project rather than in this module:

  • src/ScadaLink.Host/Actors/AkkaHostedService.cs bootstraps the ActorSystem, generates the HOCON from ClusterOptions (it imports ScadaLink.ClusterInfrastructure and injects IOptions<ClusterOptions>), and configures the keep-oldest split-brain resolver with down-if-alone = on (see AkkaHostedService.cs:95-96).
  • src/ScadaLink.Host/Health/AkkaClusterHealthCheck.cs, AkkaClusterNodeProvider.cs, and Health/ActiveNodeHealthCheck.cs cover cluster membership / active-node detection.
  • Akka cluster/remote package references live in ScadaLink.Host.csproj and the per-component projects (SiteRuntime, Communication, etc.).

So the real situation is an ownership / design-doc drift, not missing behaviour: Component-ClusterInfrastructure.md assigns the Akka bootstrap, HOCON generation, split-brain config and CoordinatedShutdown wiring to this module, but the implementation deliberately lives in the Host. ClusterOptions is the one piece this module legitimately owns and it is consumed correctly by the Host.

Resolving CI-001 as literally written is not a small, well-scoped fix — it is one of two substantial decisions, both requiring the user:

  1. Move the bootstrap into this module — relocate the HOCON generation, split-brain config, cluster-singleton helpers and CoordinatedShutdown wiring out of ScadaLink.Host into ScadaLink.ClusterInfrastructure, add the Akka package references, and re-wire the Host to call into it. This is a cross-module refactor touching src/ScadaLink.Host/* and several other projects — outside the edit scope permitted for this finding (only src/ScadaLink.ClusterInfrastructure/, tests/ScadaLink.ClusterInfrastructure.Tests/, and this file may be edited).
  2. Accept the current placement — keep the bootstrap in the Host and update Component-ClusterInfrastructure.md (and the README component table) to record that the Host owns the actor-system/cluster bootstrap and that this module's role is the shared ClusterOptions contract. That fix is a design-doc edit, also outside this module's permitted edit scope.

Either path is a deliberate architecture decision, not a bug fix. The decision was surfaced to the user, who chose option 2 — accept the current placement: the Akka bootstrap stays in the Host (the single deployable binary that performs all actor-system bring-up), and the design docs are corrected to record the true ownership.

Resolved — fixing commit <pending>, date 2026-05-16. The finding was a design-doc drift, not missing behaviour. docs/requirements/Component-ClusterInfrastructure.md now carries an "Implementation Note — Code Placement" section stating that the ScadaLink.ClusterInfrastructure project owns the ClusterOptions configuration model while ScadaLink.Host owns the Akka bootstrap, HOCON generation, split-brain-resolver wiring, CoordinatedShutdown integration, and active-node health checks. The README component table (row 13) was updated to match. No code change was required — the documented cluster behaviour already exists and is exercised; only the doc's module-ownership claim was wrong. Module test suite green (3 passed).

ClusterInfrastructure-002 — No-op DI extension methods report success while doing nothing

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:7-17

Description

AddClusterInfrastructure and AddClusterInfrastructureActors both accept an IServiceCollection and return it unchanged. A caller (e.g. the Host) that invokes services.AddClusterInfrastructure() receives a fluent, success-looking result but no actor system, no cluster, and no singleton support is actually registered. This is a silent failure: the system will appear to start, then fail later and far from the cause (e.g. when a component resolves an ActorSystem that was never added, or when the cluster singleton never forms). A no-op that masquerades as a completed registration is worse than an unimplemented method that throws.

Recommendation

Until the real implementation exists, make the placeholder loud rather than silent — either throw NotImplementedException from the methods, or have them log a prominent warning, so an integrating caller fails fast with a clear cause. Replace with the genuine registration when CI-001 is addressed.

Resolution

Confirmed against the source: both methods returned the IServiceCollection unchanged. Verified the consumers — ScadaLink.Host calls AddClusterInfrastructure() (Program.cs:68, SiteServiceRegistration.cs:24); AddClusterInfrastructureActors is dead — it is called nowhere in the solution.

Resolved: fixing commit <pending>, date 2026-05-16. AddClusterInfrastructure now does real work: it registers the ClusterOptionsValidator (CI-004) via TryAddEnumerable, so the method is no longer a no-op and a misconfigured ScadaLink:Cluster section fails fast on the first IOptions<ClusterOptions> resolution. AddClusterInfrastructureActors now throws NotImplementedException with a message pointing the caller to the Host, rather than masquerading as a completed registration; this component never had any actors to register, because (as CI-001 established) the Akka bootstrap lives in ScadaLink.Host. Covered by ServiceCollectionExtensionsTests (AddClusterInfrastructure_RegistersOptionsValidator, AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution, AddClusterInfrastructureActors_ThrowsRatherThanSilentlySucceeding).
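The resolved shape of the two extension methods can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed code: the trimmed ClusterOptions, the stub validator body, and the exception message are hypothetical; only the method names, the TryAddEnumerable registration, and the NotImplementedException behaviour come from the resolution above.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions;
using Microsoft.Extensions.Options;

// Trimmed stand-ins so the sketch compiles on its own (hypothetical shapes).
public sealed class ClusterOptions
{
    public int MinNrOfMembers { get; set; } = 1;
}

public sealed class ClusterOptionsValidator : IValidateOptions<ClusterOptions>
{
    public ValidateOptionsResult Validate(string? name, ClusterOptions options) =>
        options.MinNrOfMembers == 1
            ? ValidateOptionsResult.Success
            : ValidateOptionsResult.Fail("MinNrOfMembers must be 1.");
}

public static class ServiceCollectionExtensions
{
    public static IServiceCollection AddClusterInfrastructure(this IServiceCollection services)
    {
        // TryAddEnumerable skips the registration if the same
        // service/implementation pair is already present, so calling
        // this method twice does not register the validator twice.
        services.TryAddEnumerable(
            ServiceDescriptor.Singleton<IValidateOptions<ClusterOptions>, ClusterOptionsValidator>());
        return services;
    }

    public static IServiceCollection AddClusterInfrastructureActors(this IServiceCollection services)
    {
        // Fail loudly instead of returning a success-looking no-op.
        // The message text is an assumption.
        throw new NotImplementedException(
            "ClusterInfrastructure registers no actors; the Akka bootstrap lives in ScadaLink.Host.");
    }
}
```

Because the validator is registered as an IValidateOptions<ClusterOptions>, the options infrastructure runs it whenever the options instance is first created, which is what turns a bad configuration section into an early, attributable failure.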

ClusterInfrastructure-003 — ClusterOptions omits several documented node-configuration settings

Severity Medium
Category Design-document adherence
Status Resolved
Location src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11

Description

The "Node Configuration", "Split-Brain Resolution", and "Failure Detection Timing" sections of the design doc enumerate the settings each node needs. ClusterOptions exposes SeedNodes, SplitBrainResolverStrategy, StableAfter, HeartbeatInterval, FailureDetectionThreshold, and MinNrOfMembers, but is missing: the Akka remoting hostname/port (default 8081 central, 8082 site), the cluster role (Central vs. Site) and the site identifier, the down-if-alone flag (the design explicitly requires down-if-alone = on for the keep-oldest resolver), and — for site nodes — the gRPC port (default 8083) and local SQLite storage paths. Without these, the options class cannot drive a correct HOCON configuration when CI-001 is implemented. (Some settings such as remoting host/port may instead belong in Host/NodeOptions.cs; the split of ownership should be decided deliberately, but at minimum down-if-alone belongs with the split-brain settings here.)

Recommendation

Add the missing settings — at minimum a DownIfAlone boolean (default true) and the cluster role / site identifier — or document explicitly which settings are owned by Host/NodeOptions.cs instead, so the design doc and the options classes agree on where each value lives.

Resolution

Partially re-triaged. Verified against the source: most of the "missing" settings are deliberately owned by ScadaLink.Host.NodeOptions. NodeOptions already carries Role, NodeHostname, SiteId, RemotingPort and GrpcPort, and AkkaHostedService builds the HOCON from NodeOptions for exactly those values. Local SQLite storage paths live in the database / store-and-forward options. This is the ownership split CI-001 established (the Host owns node identity and bootstrap; this project owns the cluster-formation contract), so those settings do not belong in ClusterOptions.

The one genuine gap the finding identifies is down-if-alone, which the design doc puts with the split-brain settings.

Resolved: fixing commit <pending>, date 2026-05-16. Added the DownIfAlone boolean (default true) to ClusterOptions so the split-brain configuration contract is complete, and added a class-level XML doc that records the deliberate ownership split (node identity/remoting/gRPC in Host.NodeOptions, storage paths in the database options, cluster-formation settings here), so the design doc and the options classes now agree on where each value lives. (AkkaHostedService currently hard-codes down-if-alone = on in HOCON; wiring it to read DownIfAlone is a one-line ScadaLink.Host change, outside this module's permitted edit scope, and is noted for the Host's review.) Covered by ClusterOptionsTests.DefaultValues_AreCorrect and ClusterOptionsTests.DownIfAlone_CanBeSet.
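The deferred Host-side change could look something like the sketch below. It assumes the standard Akka.NET HOCON key akka.cluster.split-brain-resolver.keep-oldest.down-if-alone; the SplitBrainHocon class and RenderDownIfAlone method are hypothetical names, and the real HOCON generation in AkkaHostedService is not shown in this review.

```csharp
// Trimmed stand-in: only the property relevant to this finding.
public sealed class ClusterOptions
{
    // Default true, as stated in the resolution above.
    public bool DownIfAlone { get; set; } = true;
}

public static class SplitBrainHocon
{
    // Hypothetical helper: renders the flag as a HOCON on/off value
    // instead of hard-coding "on".
    public static string RenderDownIfAlone(ClusterOptions options)
    {
        var flag = options.DownIfAlone ? "on" : "off";
        return $"akka.cluster.split-brain-resolver.keep-oldest.down-if-alone = {flag}";
    }
}
```

With the default options this renders the same value the Host hard-codes today, so the change would be behaviour-preserving unless a deployment explicitly sets DownIfAlone to false.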

ClusterInfrastructure-004 — ClusterOptions has no validation despite safety-critical values

Severity Medium
Category Code organization & conventions
Status Resolved
Location src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11

Description

ClusterOptions carries values whose misconfiguration has cluster-wide consequences. The design doc is emphatic that min-nr-of-members must be 1 (a value of 2 blocks the singleton and therefore all data collection indefinitely after failover), that SplitBrainResolverStrategy must be keep-oldest for a two-node cluster (quorum strategies cause total shutdown), and that the timing values are interdependent (HeartbeatInterval must be well below FailureDetectionThreshold). The class has no data annotations, no IValidateOptions<ClusterOptions>, and no guard logic, so an appsettings.json setting MinNrOfMembers: 2 or SplitBrainResolverStrategy: "keep-majority" (the exact value the test at ClusterOptionsTests.cs:35 shows is settable) would be accepted silently and produce the catastrophic outcomes the design doc warns against.

Recommendation

Add validation — data annotations ([Range] for MinNrOfMembers, etc.) plus an IValidateOptions<ClusterOptions> implementation that enforces MinNrOfMembers == 1, restricts SplitBrainResolverStrategy to a known set, requires SeedNodes non-empty, and asserts HeartbeatInterval < FailureDetectionThreshold and positive StableAfter. Register it with ValidateOnStart() so misconfiguration fails fast at boot.

Resolution

Confirmed: ClusterOptions had no validation of any kind, and the design doc's catastrophic-misconfiguration values (MinNrOfMembers: 2, a quorum split-brain strategy) would have been bound silently.

Resolved: fixing commit <pending>, date 2026-05-16. Added ClusterOptionsValidator : IValidateOptions<ClusterOptions>, which enforces MinNrOfMembers == 1, restricts SplitBrainResolverStrategy to the keep-oldest-only allowed set, requires a non-empty SeedNodes, requires positive StableAfter / HeartbeatInterval / FailureDetectionThreshold, and asserts HeartbeatInterval < FailureDetectionThreshold. It accumulates every failure into one result. It is registered by AddClusterInfrastructure() (CI-002) as a singleton IValidateOptions<ClusterOptions>, so a misconfigured section throws OptionsValidationException on the first IOptions<ClusterOptions>.Value resolution. AkkaHostedService performs that resolution during startup, which gives the fail-fast-at-boot behaviour the recommendation asks for without the src project taking a dependency on the full Microsoft.Extensions.DependencyInjection package needed for the ValidateOnStart() overload. Data annotations were not used: a single IValidateOptions implementation expresses the interdependent timing rules that attributes cannot. Covered by ClusterOptionsValidatorTests (8 cases) and ServiceCollectionExtensionsTests.AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution.
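A minimal sketch of a validator enforcing the rules listed above follows. It is not the committed ClusterOptionsValidator: the property types (TimeSpan, List<string>, int) and all failure messages are assumptions, and the timing properties are left at their zero defaults rather than guessing the design doc's values.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Options;

// Property names come from the findings; the types are assumed.
public sealed class ClusterOptions
{
    public List<string> SeedNodes { get; set; } = new();
    public string SplitBrainResolverStrategy { get; set; } = "keep-oldest";
    public TimeSpan StableAfter { get; set; }
    public TimeSpan HeartbeatInterval { get; set; }
    public TimeSpan FailureDetectionThreshold { get; set; }
    public int MinNrOfMembers { get; set; } = 1;
}

public sealed class ClusterOptionsValidator : IValidateOptions<ClusterOptions>
{
    public ValidateOptionsResult Validate(string? name, ClusterOptions options)
    {
        // Collect every failure so one run reports all problems at once.
        var failures = new List<string>();

        if (options.MinNrOfMembers != 1)
            failures.Add("MinNrOfMembers must be 1.");
        if (!string.Equals(options.SplitBrainResolverStrategy, "keep-oldest", StringComparison.Ordinal))
            failures.Add("SplitBrainResolverStrategy must be keep-oldest.");
        if (options.SeedNodes is null || options.SeedNodes.Count == 0)
            failures.Add("SeedNodes must not be empty.");
        if (options.StableAfter <= TimeSpan.Zero)
            failures.Add("StableAfter must be positive.");
        if (options.HeartbeatInterval <= TimeSpan.Zero)
            failures.Add("HeartbeatInterval must be positive.");
        if (options.FailureDetectionThreshold <= TimeSpan.Zero)
            failures.Add("FailureDetectionThreshold must be positive.");
        // Cross-property rule that per-property attributes cannot express.
        if (options.HeartbeatInterval >= options.FailureDetectionThreshold)
            failures.Add("HeartbeatInterval must be below FailureDetectionThreshold.");

        return failures.Count == 0
            ? ValidateOptionsResult.Success
            : ValidateOptionsResult.Fail(failures);
    }
}
```

The last check is the reason the resolution prefers a single IValidateOptions implementation: it compares two properties against each other, which a per-property [Range] attribute has no way to do.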

ClusterInfrastructure-005 — No configuration section name constant for the Options pattern binding

Severity Low
Category Code organization & conventions
Status Open
Location src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3

Description

CLAUDE.md specifies per-component configuration via appsettings.json sections bound with the Options pattern. ClusterOptions provides no public const string SectionName (or equivalent) for the binding site to reference, so whichever code binds the section must hard-code the magic string, and there is no single source of truth for the section name. Because AddClusterInfrastructure is itself a no-op (CI-002), the options class is currently bound nowhere at all, making the missing constant easy to overlook.

Recommendation

Add a public const string SectionName = "Cluster"; (or the agreed name) to ClusterOptions and have the eventual AddClusterInfrastructure bind configuration.GetSection(ClusterOptions.SectionName) against it.
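The recommendation could be sketched as below. The constant value "Cluster" is the placeholder from the recommendation itself (the CI-002 resolution refers to a ScadaLink:Cluster section, so the agreed name may differ), and the ClusterOptionsBinding class and BindClusterOptions method are hypothetical names used only for illustration. The sketch assumes the Microsoft.Extensions.Options.ConfigurationExtensions package is referenced.

```csharp
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

public sealed class ClusterOptions
{
    // Single source of truth for the appsettings.json section name.
    // "Cluster" is a placeholder pending the agreed name.
    public const string SectionName = "Cluster";

    // Trimmed stand-in: one property is enough to show binding.
    public int MinNrOfMembers { get; set; } = 1;
}

public static class ClusterOptionsBinding
{
    // Hypothetical helper: binds the named section to ClusterOptions
    // so no caller has to repeat the section name as a string literal.
    public static IServiceCollection BindClusterOptions(
        this IServiceCollection services, IConfiguration configuration)
    {
        services.Configure<ClusterOptions>(
            configuration.GetSection(ClusterOptions.SectionName));
        return services;
    }
}
```

Whichever project ends up owning the binding call, referencing ClusterOptions.SectionName keeps the section name defined in exactly one place.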

Resolution

Unresolved.

ClusterInfrastructure-006 — No tests for any cluster behaviour; only the options POCO is covered

Severity Medium
Category Testing coverage
Status Resolved
Location tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsTests.cs:1-51

Description

The test project contains only ClusterOptionsTests, exercising default values and property setters of ClusterOptions. There are no tests for cluster formation, leader election, failover detection, split-brain resolution, singleton handover, or the ServiceCollectionExtensions registration methods; none can exist because the behaviour itself is absent (CI-001). This is recorded so the testing gap is tracked alongside the implementation gap: the most safety-critical paths of the entire system (failover, split-brain, dual-node recovery) are completely untested. The test at lines 30-50 also asserts that SplitBrainResolverStrategy can be set to "keep-majority", implicitly endorsing a value the design doc forbids for a two-node cluster (see CI-004).

Recommendation

When CI-001 is implemented, add multi-node Akka.Cluster.TestKit / MultiNodeTestKit tests covering cluster formation, failover promotion, split-brain downing, and singleton handover, plus unit tests for HOCON generation from ClusterOptions and for the options validation from CI-004.

Resolution

Re-triaged in light of CI-001's resolution. The Akka bootstrap, HOCON generation, cluster formation, failover and singleton handover are owned by ScadaLink.Host, not this project — multi-node Akka.Cluster.TestKit tests for that behaviour belong in the Host's test suite, outside this module's scope. What this module legitimately owns is ClusterOptions, its validator, and the DI registration, and the testing gap there is now closed.

Resolved: fixing commit <pending>, date 2026-05-16. Added two test classes to tests/ScadaLink.ClusterInfrastructure.Tests: ClusterOptionsValidatorTests (8 cases: valid defaults pass; MinNrOfMembers != 1, unsupported split-brain strategies, empty seed nodes, heartbeat not below the failure threshold, and non-positive StableAfter all fail; plus a multi-failure accumulation case) and ServiceCollectionExtensionsTests (3 cases: AddClusterInfrastructure registers the validator, the validator rejects bad options at IOptions resolution, and AddClusterInfrastructureActors throws). The pre-existing ClusterOptionsTests was extended with DownIfAlone coverage. The test project gained references to Microsoft.Extensions.DependencyInjection and Microsoft.Extensions.Options. Module test suite green: 16 passed (was 3). Note: the keep-majority value used in the pre-existing ClusterOptionsTests.Properties_CanBeSetToCustomValues is intentionally left in place. That test exercises the POCO's property setter (the POCO accepts any string by design); ClusterOptionsValidator is the layer that now rejects keep-majority, and UnsupportedSplitBrainStrategy_FailsValidation proves it.

ClusterInfrastructure-007 — ClusterOptions lacks XML documentation comments

Severity Low
Category Documentation & comments
Status Open
Location src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11

Description

ClusterOptions and each of its six properties have no XML doc comments. Peer options classes such as StoreAndForward/StoreAndForwardOptions.cs document the class and every property (including units and design-doc references). For a class whose values carry the cluster-wide consequences described in the design doc (notably MinNrOfMembers and SplitBrainResolverStrategy), the absence of inline documentation is a maintainability and safety gap — a future editor has no in-code warning that MinNrOfMembers must stay 1.

Recommendation

Add <summary> comments to the class and each property, stating units and the documented constraints (e.g. that MinNrOfMembers must be 1, that HeartbeatInterval must be well below FailureDetectionThreshold), referencing the relevant design-doc sections as peer modules do.
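As a rough illustration of the recommended comment style, two of the properties might be documented as follows. The wording, the property types, and the MinNrOfMembers default shown here are assumptions; the constraints themselves are the ones this review attributes to the design doc.

```csharp
using System;

/// <summary>
/// Cluster-formation settings for the two-node cluster.
/// See Component-ClusterInfrastructure.md for the rationale behind each constraint.
/// </summary>
public sealed class ClusterOptions
{
    /// <summary>
    /// Minimum number of members required before the cluster is considered formed.
    /// Must stay 1: a value of 2 blocks the cluster singleton, and therefore
    /// all data collection, indefinitely after a failover.
    /// </summary>
    public int MinNrOfMembers { get; set; } = 1;

    /// <summary>
    /// Interval between cluster heartbeats.
    /// Must be well below <see cref="FailureDetectionThreshold"/>.
    /// </summary>
    public TimeSpan HeartbeatInterval { get; set; }

    /// <summary>
    /// Time without heartbeats after which a node is treated as failed.
    /// </summary>
    public TimeSpan FailureDetectionThreshold { get; set; }
}
```

Putting the "must stay 1" warning directly on the property means it appears in IDE tooltips at the point where a future editor would be tempted to change the value.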

Resolution

Unresolved.

ClusterInfrastructure-008 — "Phase 0 skeleton" status is undocumented at the module level

Severity Low
Category Documentation & comments
Status Open
Location src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9, src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16

Description

The only indication that this foundational module is unimplemented is two inline comments inside the method bodies (// Phase 0: skeleton only / // Phase 0: placeholder for Akka actor registration). There is no module README, no <!-- TODO --> in the design doc, and no tracking marker visible to anyone reading the project structure or the component table. Given that the design doc (Component-ClusterInfrastructure.md) describes a fully featured component with no caveat, a reader will reasonably assume the module is built. The mismatch between a complete-looking design doc and an empty implementation is itself a documentation defect.

Recommendation

Add a short note to the design doc (or a module-level README.md) stating the current implementation status and what "Phase 0" delivers, and reference a tracked issue for the remaining work (CI-001). Keep the README component table accurate about which components are skeletons versus implemented.

Resolution

Unresolved.