367 lines
18 KiB
Markdown
367 lines
18 KiB
Markdown
# Code Review — ClusterInfrastructure
|
|
|
|
| Field | Value |
|
|
|-------|-------|
|
|
| Module | `src/ScadaLink.ClusterInfrastructure` |
|
|
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
|
|
| Status | Reviewed |
|
|
| Last reviewed | 2026-05-16 |
|
|
| Reviewer | claude-agent |
|
|
| Commit reviewed | `9c60592` |
|
|
| Open findings | 7 |
|
|
|
|
## Summary
|
|
|
|
The ClusterInfrastructure module is currently a **Phase 0 skeleton**. It contains
|
|
only two source files: `ClusterOptions.cs`, a plain options POCO, and
|
|
`ServiceCollectionExtensions.cs`, whose two registration methods are explicit no-ops.
|
|
None of the responsibilities described in `Component-ClusterInfrastructure.md` —
|
|
Akka.NET cluster bootstrap, leader election, failover detection, split-brain
|
|
resolution, cluster singleton hosting, Windows service lifecycle — are implemented.
|
|
There are therefore no correctness, concurrency, or Akka-convention defects to find
|
|
in *behaviour*, because there is no behaviour. The findings below instead concern
|
|
(a) the large gap between the design doc and the code, (b) the options class missing
|
|
the validation, configuration-binding affordances, and coverage of documented
|
|
settings that peer modules provide, and (c) the no-op DI extensions silently
|
|
returning success, which is a latent reliability hazard once the Host wires this
|
|
module in. The dominant theme is **incompleteness**: this module is the foundation
|
|
every other component runs on, yet it presently delivers nothing the design requires.
|
|
The single options class is clean and its test covers defaults and setters
|
|
adequately for what exists.
|
|
|
|
## Checklist coverage
|
|
|
|
| # | Category | Examined | Notes |
|
|
|---|----------|----------|-------|
|
|
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). |
|
|
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. |
|
|
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. |
|
|
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. |
|
|
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. |
|
|
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. |
|
|
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). |
|
|
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. |
|
|
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). |
|
|
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). |
|
|
|
|
## Findings
|
|
|
|
### ClusterInfrastructure-001 — Module implements none of its documented responsibilities
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9`, `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16` |
|
|
|
|
**Description**
|
|
|
|
`Component-ClusterInfrastructure.md` assigns this module seven concrete
|
|
responsibilities: bootstrap the Akka.NET `ActorSystem`, form the two-node cluster,
|
|
manage leader election / active-standby role assignment, detect node failures and
|
|
trigger failover, provide remoting, host the cluster singleton, and manage the
|
|
Windows service lifecycle. The entire module is two files: a `ClusterOptions` POCO
|
|
and a `ServiceCollectionExtensions` whose methods are explicitly commented
|
|
`// Phase 0: skeleton only` and `// Phase 0: placeholder for Akka actor registration`
|
|
and simply return the unmodified `IServiceCollection`. There is no `Akka.Cluster`,
|
|
`Akka.Cluster.Tools`, `Akka.Remote`, or split-brain-resolver dependency in the
|
|
`.csproj` at all (it references only `Microsoft.Extensions.DependencyInjection.Abstractions`,
|
|
`Microsoft.Extensions.Options`, and `ScadaLink.Commons`). Because every other
|
|
ScadaLink component runs inside the actor system this module is responsible for
|
|
creating, the absence of any implementation blocks the foundational layer of the
|
|
system.
|
|
|
|
**Recommendation**
|
|
|
|
Track the gap explicitly (a milestone/issue) and implement the documented behaviour:
|
|
add the Akka cluster/remote/cluster-tools and split-brain-resolver package
|
|
references, build the cluster bootstrap (HOCON generation from `ClusterOptions`),
|
|
the split-brain resolver configuration, cluster-singleton hosting support, and
|
|
`CoordinatedShutdown` wiring. Until then, the module's `Status` and the design doc
|
|
should clearly state it is unimplemented so callers do not assume otherwise.
|
|
|
|
**Resolution**
|
|
|
|
_Re-triaged 2026-05-16 — remains Open, needs a design decision from the user._
|
|
|
|
Verified against the source at the reviewed commit: the finding's factual claims hold.
|
|
`src/ScadaLink.ClusterInfrastructure` still contains only `ClusterOptions.cs` and a
|
|
no-op `ServiceCollectionExtensions.cs`, and the `.csproj` references no Akka packages.
|
|
|
|
However, the documented cluster behaviour is **not actually absent from the system** —
|
|
it has been implemented in the **Host** project rather than in this module:
|
|
|
|
- `src/ScadaLink.Host/Actors/AkkaHostedService.cs` bootstraps the `ActorSystem`,
|
|
generates the HOCON from `ClusterOptions` (it imports `ScadaLink.ClusterInfrastructure`
|
|
and injects `IOptions<ClusterOptions>`), and configures the `keep-oldest` split-brain
|
|
resolver with `down-if-alone = on` (see `AkkaHostedService.cs:95-96`).
|
|
- `src/ScadaLink.Host/Health/AkkaClusterHealthCheck.cs`, `AkkaClusterNodeProvider.cs`,
|
|
and `Health/ActiveNodeHealthCheck.cs` cover cluster membership / active-node detection.
|
|
- Akka cluster/remote package references live in `ScadaLink.Host.csproj` and the
|
|
per-component projects (`SiteRuntime`, `Communication`, etc.).
|
|
|
|
So the real situation is an **ownership / design-doc drift**, not missing behaviour:
|
|
`Component-ClusterInfrastructure.md` assigns the Akka bootstrap, HOCON generation,
|
|
split-brain config and `CoordinatedShutdown` wiring to this module, but the
|
|
implementation deliberately lives in the Host. `ClusterOptions` is the one piece this
|
|
module legitimately owns and it is consumed correctly by the Host.
|
|
|
|
Resolving CI-001 as literally written is **not a small, well-scoped fix** — it is one
|
|
of two substantial decisions, both requiring the user:
|
|
|
|
1. **Move the bootstrap into this module** — relocate the HOCON generation, split-brain
|
|
config, cluster-singleton helpers and `CoordinatedShutdown` wiring out of
|
|
`ScadaLink.Host` into `ScadaLink.ClusterInfrastructure`, add the Akka package
|
|
references, and re-wire the Host to call into it. This is a cross-module refactor
|
|
touching `src/ScadaLink.Host/*` and several other projects — outside the edit scope
|
|
permitted for this finding (only `src/ScadaLink.ClusterInfrastructure/`,
|
|
`tests/ScadaLink.ClusterInfrastructure.Tests/`, and this file may be edited).
|
|
2. **Accept the current placement** — keep the bootstrap in the Host and update
|
|
`Component-ClusterInfrastructure.md` (and the README component table) to record that
|
|
the Host owns the actor-system/cluster bootstrap and that this module's role is the
|
|
shared `ClusterOptions` contract. That fix is a design-doc edit, also outside this
|
|
module's permitted edit scope.
|
|
|
|
Either path is a deliberate architecture decision, not a bug fix. The decision was
|
|
surfaced to the user, who chose **option 2 — accept the current placement**: the Akka
|
|
bootstrap stays in the Host (the single deployable binary that performs all actor-system
|
|
bring-up), and the design docs are corrected to record the true ownership.
|
|
|
|
**Resolved** — fixing commit `<pending>`, date 2026-05-16. The finding was a design-doc
|
|
drift, not missing behaviour. `docs/requirements/Component-ClusterInfrastructure.md` now
|
|
carries an "Implementation Note — Code Placement" section stating that the
|
|
`ScadaLink.ClusterInfrastructure` project owns the `ClusterOptions` configuration model
|
|
while `ScadaLink.Host` owns the Akka bootstrap, HOCON generation, split-brain-resolver
|
|
wiring, `CoordinatedShutdown` integration, and active-node health checks. The README
|
|
component table (row 13) was updated to match. No code change was required — the
|
|
documented cluster behaviour already exists and is exercised; only the doc's
|
|
module-ownership claim was wrong. Module test suite green (3 passed).
|
|
|
|
### ClusterInfrastructure-002 — No-op DI extension methods report success while doing nothing
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:7-17` |
|
|
|
|
**Description**
|
|
|
|
`AddClusterInfrastructure` and `AddClusterInfrastructureActors` both accept an
|
|
`IServiceCollection` and return it unchanged. A caller (e.g. the Host) that invokes
|
|
`services.AddClusterInfrastructure()` receives a fluent, success-looking result but
|
|
no actor system, no cluster, and no singleton support is actually registered. This
|
|
is a silent failure: the system will appear to start, then fail later and far from
|
|
the cause (e.g. when a component resolves an `ActorSystem` that was never added, or
|
|
when the cluster singleton never forms). A no-op that masquerades as a completed
|
|
registration is worse than an unimplemented method that throws.
|
|
|
|
**Recommendation**
|
|
|
|
Until the real implementation exists, make the placeholder loud rather than silent —
|
|
either throw `NotImplementedException` from the methods, or have them log a
|
|
prominent warning, so an integrating caller fails fast with a clear cause. Replace
|
|
with the genuine registration when CI-001 is addressed.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|
|
|
|
### ClusterInfrastructure-003 — ClusterOptions omits several documented node-configuration settings
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Design-document adherence |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
|
|
|
|
**Description**
|
|
|
|
The "Node Configuration", "Split-Brain Resolution", and "Failure Detection Timing"
|
|
sections of the design doc enumerate the settings each node needs. `ClusterOptions`
|
|
exposes `SeedNodes`, `SplitBrainResolverStrategy`, `StableAfter`,
|
|
`HeartbeatInterval`, `FailureDetectionThreshold`, and `MinNrOfMembers`, but is
|
|
missing: the Akka remoting hostname/port (default 8081 central, 8082 site), the
|
|
cluster role (Central vs. Site) and the site identifier, the `down-if-alone` flag
|
|
(the design explicitly requires `down-if-alone = on` for the keep-oldest resolver),
|
|
and — for site nodes — the gRPC port (default 8083) and local SQLite storage paths.
|
|
Without these, the options class cannot drive a correct HOCON configuration when
|
|
CI-001 is implemented. (Some settings such as remoting host/port may instead belong
|
|
in `Host/NodeOptions.cs`; the split of ownership should be decided deliberately, but
|
|
at minimum `down-if-alone` belongs with the split-brain settings here.)
|
|
|
|
**Recommendation**
|
|
|
|
Add the missing settings — at minimum a `DownIfAlone` boolean (default `true`) and
|
|
the cluster role / site identifier — or document explicitly which settings are
|
|
owned by `Host/NodeOptions.cs` instead, so the design doc and the options classes
|
|
agree on where each value lives.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|
|
|
|
### ClusterInfrastructure-004 — ClusterOptions has no validation despite safety-critical values
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Code organization & conventions |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
|
|
|
|
**Description**
|
|
|
|
`ClusterOptions` carries values whose misconfiguration has cluster-wide
|
|
consequences. The design doc is emphatic that `min-nr-of-members` must be `1` (a
|
|
value of `2` blocks the singleton and therefore all data collection indefinitely
|
|
after failover), that `SplitBrainResolverStrategy` must be `keep-oldest` for a
|
|
two-node cluster (quorum strategies cause total shutdown), and that the timing
|
|
values are interdependent (`HeartbeatInterval` must be well below
|
|
`FailureDetectionThreshold`). The class has no data annotations, no
|
|
`IValidateOptions<ClusterOptions>`, and no guard logic, so an `appsettings.json`
|
|
setting `MinNrOfMembers: 2` or `SplitBrainResolverStrategy: "keep-majority"` (the
|
|
exact value the test at `ClusterOptionsTests.cs:35` shows is settable) would be
|
|
accepted silently and produce the catastrophic outcomes the design doc warns
|
|
against.
|
|
|
|
**Recommendation**
|
|
|
|
Add validation — data annotations (`[Range]` for `MinNrOfMembers`, etc.) plus an
|
|
`IValidateOptions<ClusterOptions>` implementation that enforces
|
|
`MinNrOfMembers == 1`, restricts `SplitBrainResolverStrategy` to a known set,
|
|
requires `SeedNodes` non-empty, and asserts `HeartbeatInterval <
|
|
FailureDetectionThreshold` and positive `StableAfter`. Register it with
|
|
`ValidateOnStart()` so misconfiguration fails fast at boot.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|
|
|
|
### ClusterInfrastructure-005 — No configuration section name constant for the Options pattern binding
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Code organization & conventions |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3` |
|
|
|
|
**Description**
|
|
|
|
CLAUDE.md specifies per-component configuration via `appsettings.json` sections
|
|
bound with the Options pattern. `ClusterOptions` provides no `public const string
|
|
SectionName` (or equivalent) for the binding site to reference, so whichever code
|
|
binds the section must hard-code the magic string, and there is no single source of
|
|
truth for the section name. Because `AddClusterInfrastructure` is itself a no-op
|
|
(CI-002), the options class is currently bound nowhere at all, making the missing
|
|
constant easy to overlook.
|
|
|
|
**Recommendation**
|
|
|
|
Add a `public const string SectionName = "Cluster";` (or the agreed name) to
|
|
`ClusterOptions` and have the eventual `AddClusterInfrastructure` bind
|
|
`configuration.GetSection(ClusterOptions.SectionName)` against it.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|
|
|
|
### ClusterInfrastructure-006 — No tests for any cluster behaviour; only the options POCO is covered
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Testing coverage |
|
|
| Status | Open |
|
|
| Location | `tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsTests.cs:1-51` |
|
|
|
|
**Description**
|
|
|
|
The test project contains only `ClusterOptionsTests`, exercising default values and
|
|
property setters of `ClusterOptions`. There are no tests for cluster formation,
|
|
leader election, failover detection, split-brain resolution, singleton handover, or
|
|
the `ServiceCollectionExtensions` registration methods — none can exist because the
|
|
behaviour itself is absent (CI-001). This is recorded so the testing gap is tracked
|
|
alongside the implementation gap: the most safety-critical paths of the entire
|
|
system (failover, split-brain, dual-node recovery) are completely untested. The
|
|
test at line 30-50 also asserts that `SplitBrainResolverStrategy` can be set to
|
|
`"keep-majority"`, implicitly endorsing a value the design doc forbids for a
|
|
two-node cluster — see CI-004.
|
|
|
|
**Recommendation**
|
|
|
|
When CI-001 is implemented, add multi-node `Akka.Cluster.TestKit` /
|
|
`MultiNodeTestKit` tests covering cluster formation, failover promotion,
|
|
split-brain downing, and singleton handover, plus unit tests for HOCON generation
|
|
from `ClusterOptions` and for the options validation from CI-004.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|
|
|
|
### ClusterInfrastructure-007 — ClusterOptions lacks XML documentation comments
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
|
|
|
|
**Description**
|
|
|
|
`ClusterOptions` and each of its six properties have no XML doc comments. Peer
|
|
options classes such as `StoreAndForward/StoreAndForwardOptions.cs` document the
|
|
class and every property (including units and design-doc references). For a class
|
|
whose values carry the cluster-wide consequences described in the design doc
|
|
(notably `MinNrOfMembers` and `SplitBrainResolverStrategy`), the absence of inline
|
|
documentation is a maintainability and safety gap — a future editor has no in-code
|
|
warning that `MinNrOfMembers` must stay `1`.
|
|
|
|
**Recommendation**
|
|
|
|
Add `<summary>` comments to the class and each property, stating units and the
|
|
documented constraints (e.g. that `MinNrOfMembers` must be `1`, that
|
|
`HeartbeatInterval` must be well below `FailureDetectionThreshold`), referencing
|
|
the relevant design-doc sections as peer modules do.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|
|
|
|
### ClusterInfrastructure-008 — "Phase 0 skeleton" status is undocumented at the module level
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9`, `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16` |
|
|
|
|
**Description**
|
|
|
|
The only indication that this foundational module is unimplemented is two inline
|
|
comments inside private method bodies (`// Phase 0: skeleton only` /
|
|
`// Phase 0: placeholder for Akka actor registration`). There is no module README,
|
|
no `<!-- TODO -->` in the design doc, and no tracking marker visible to anyone
|
|
reading the project structure or the component table. Given that the design doc
|
|
(`Component-ClusterInfrastructure.md`) describes a fully featured component with no
|
|
caveat, a reader will reasonably assume the module is built. The mismatch between a
|
|
complete-looking design doc and an empty implementation is itself a documentation
|
|
defect.
|
|
|
|
**Recommendation**
|
|
|
|
Add a short note to the design doc (or a module-level `README.md`) stating the
|
|
current implementation status and what "Phase 0" delivers, and reference a tracked
|
|
issue for the remaining work (CI-001). Keep the README component table accurate
|
|
about which components are skeletons versus implemented.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|