17 new findings: CentralUI-020..025, CLI-014..016, ClusterInfrastructure-009..010, Commons-013..014, Communication-012..015.
606 lines
35 KiB
Markdown
606 lines
35 KiB
Markdown
# Code Review — ClusterInfrastructure
|
|
|
|
| Field | Value |
|
|
|-------|-------|
|
|
| Module | `src/ScadaLink.ClusterInfrastructure` |
|
|
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
|
|
| Status | Reviewed |
|
|
| Last reviewed | 2026-05-17 |
|
|
| Reviewer | claude-agent |
|
|
| Commit reviewed | `39d737e` |
|
|
| Open findings | 2 |
|
|
|
|
## Summary
|
|
|
|
The ClusterInfrastructure module is currently a **Phase 0 skeleton**. It contains
|
|
only two source files: `ClusterOptions.cs`, a plain options POCO, and
|
|
`ServiceCollectionExtensions.cs`, whose two registration methods are explicit no-ops.
|
|
None of the responsibilities described in `Component-ClusterInfrastructure.md` —
|
|
Akka.NET cluster bootstrap, leader election, failover detection, split-brain
|
|
resolution, cluster singleton hosting, Windows service lifecycle — are implemented.
|
|
There are therefore no correctness, concurrency, or Akka-convention defects to find
|
|
in *behaviour*, because there is no behaviour. The findings below instead concern
|
|
(a) the large gap between the design doc and the code, (b) the options class missing
|
|
the validation, configuration-binding affordances, and coverage of documented
|
|
settings that peer modules provide, and (c) the no-op DI extensions silently
|
|
returning success, which is a latent reliability hazard once the Host wires this
|
|
module in. The dominant theme is **incompleteness**: this module is the foundation
|
|
every other component runs on, yet it presently delivers nothing the design requires.
|
|
The single options class is clean and its test covers defaults and setters
|
|
adequately for what exists.
|
|
|
|
**Re-review 2026-05-17 (commit `39d737e`).** All eight prior findings (CI-001..008)
|
|
were resolved by the batch of work between `9c60592` and `39d737e`: `ClusterOptions`
|
|
gained XML docs, the `SectionName` constant, and the `DownIfAlone` property;
|
|
`ClusterOptionsValidator` was added; `ServiceCollectionExtensions` now registers the
|
|
validator and throws from the dead actor-registration method; and the test project
|
|
grew to 16 cases across three test classes. The module is in good shape — the
|
|
`ClusterOptions` contract, its validator, and the DI registration are all sound,
|
|
well-documented, and well-tested. This re-review examined all three source files and
|
|
all three test files against the full 10-category checklist and found **two new
|
|
issues**, both stemming from work the prior review explicitly deferred to a "Host
|
|
review" that has not happened: the `DownIfAlone` property is exposed and validated as
|
|
part of the configuration contract but is never consumed — `ScadaLink.Host`'s
|
|
`BuildHocon` still hard-codes `down-if-alone = on` (CI-009, Medium) — and the validator
|
|
does not enforce the design doc's requirement that `down-if-alone` be `on` for the
|
|
keep-oldest resolver, so `DownIfAlone = false` is silently accepted (CI-010, Low).
|
|
|
|
## Checklist coverage
|
|
|
|
Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
|
|
`39d737e`) are appended in each row.
|
|
|
|
| # | Category | Examined | Notes |
|
|
|---|----------|----------|-------|
|
|
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). **Re-review:** CI-002 resolved. New — `DownIfAlone` is a settable property that controls nothing because the HOCON builder hard-codes the value (CI-009). |
|
|
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. **Re-review:** confirmed the Akka bootstrap legitimately lives in `ScadaLink.Host` (CI-001 resolution); still nothing actor-related in this module. No issues. |
|
|
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. **Re-review:** validator and DI extensions are stateless; no issues. |
|
|
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. **Re-review:** the validator now fails fast on misconfiguration. New — it does not enforce the design doc's `down-if-alone = on` requirement (CI-010). |
|
|
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. **Re-review:** still no authn/authz surface, no secret handling. No issues. |
|
|
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. **Re-review:** no resources held; the validator allocates a small failure list per call only. No issues. |
|
|
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). **Re-review:** CI-001/CI-003 resolved (ownership split documented; `DownIfAlone` added). New — `DownIfAlone` was added to the contract but never wired into the HOCON (CI-009). |
|
|
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. **Re-review:** CI-004/CI-005 resolved; `SectionName` constant present and options/validator placement correct. No issues. |
|
|
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). **Re-review:** CI-006 resolved — 16 tests across three classes covering options, validator, and DI registration. No `DownIfAlone`-wiring test exists, but that wiring lives in the Host (CI-009). No new issue here. |
|
|
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). **Re-review:** CI-007/CI-008 resolved — full XML docs on all members; skeleton comments gone. Note: the `DownIfAlone` XML doc calls `true` "the design-doc requirement" yet the value is inert (CI-009) and unenforced (CI-010). |
|
|
|
|
## Findings
|
|
|
|
### ClusterInfrastructure-001 — Module implements none of its documented responsibilities
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9`, `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16` |
|
|
|
|
**Description**
|
|
|
|
`Component-ClusterInfrastructure.md` assigns this module seven concrete
|
|
responsibilities: bootstrap the Akka.NET `ActorSystem`, form the two-node cluster,
|
|
manage leader election / active-standby role assignment, detect node failures and
|
|
trigger failover, provide remoting, host the cluster singleton, and manage the
|
|
Windows service lifecycle. The entire module is two files: a `ClusterOptions` POCO
|
|
and a `ServiceCollectionExtensions` whose methods are explicitly commented
|
|
`// Phase 0: skeleton only` and `// Phase 0: placeholder for Akka actor registration`
|
|
and simply return the unmodified `IServiceCollection`. There is no `Akka.Cluster`,
|
|
`Akka.Cluster.Tools`, `Akka.Remote`, or split-brain-resolver dependency in the
|
|
`.csproj` at all (it references only `Microsoft.Extensions.DependencyInjection.Abstractions`,
|
|
`Microsoft.Extensions.Options`, and `ScadaLink.Commons`). Because every other
|
|
ScadaLink component runs inside the actor system this module is responsible for
|
|
creating, the absence of any implementation blocks the foundational layer of the
|
|
system.
|
|
|
|
**Recommendation**
|
|
|
|
Track the gap explicitly (a milestone/issue) and implement the documented behaviour:
|
|
add the Akka cluster/remote/cluster-tools and split-brain-resolver package
|
|
references, build the cluster bootstrap (HOCON generation from `ClusterOptions`),
|
|
the split-brain resolver configuration, cluster-singleton hosting support, and
|
|
`CoordinatedShutdown` wiring. Until then, the module's `Status` and the design doc
|
|
should clearly state it is unimplemented so callers do not assume otherwise.
|
|
|
|
**Resolution**
|
|
|
|
_Re-triaged 2026-05-16 — remains Open, needs a design decision from the user._
|
|
|
|
Verified against the source at the reviewed commit: the finding's factual claims hold.
|
|
`src/ScadaLink.ClusterInfrastructure` still contains only `ClusterOptions.cs` and a
|
|
no-op `ServiceCollectionExtensions.cs`, and the `.csproj` references no Akka packages.
|
|
|
|
However, the documented cluster behaviour is **not actually absent from the system** —
|
|
it has been implemented in the **Host** project rather than in this module:
|
|
|
|
- `src/ScadaLink.Host/Actors/AkkaHostedService.cs` bootstraps the `ActorSystem`,
|
|
generates the HOCON from `ClusterOptions` (it imports `ScadaLink.ClusterInfrastructure`
|
|
and injects `IOptions<ClusterOptions>`), and configures the `keep-oldest` split-brain
|
|
resolver with `down-if-alone = on` (see `AkkaHostedService.cs:95-96`).
|
|
- `src/ScadaLink.Host/Health/AkkaClusterHealthCheck.cs`, `AkkaClusterNodeProvider.cs`,
|
|
and `Health/ActiveNodeHealthCheck.cs` cover cluster membership / active-node detection.
|
|
- Akka cluster/remote package references live in `ScadaLink.Host.csproj` and the
|
|
per-component projects (`SiteRuntime`, `Communication`, etc.).
|
|
|
|
So the real situation is an **ownership / design-doc drift**, not missing behaviour:
|
|
`Component-ClusterInfrastructure.md` assigns the Akka bootstrap, HOCON generation,
|
|
split-brain config and `CoordinatedShutdown` wiring to this module, but the
|
|
implementation deliberately lives in the Host. `ClusterOptions` is the one piece this
|
|
module legitimately owns and it is consumed correctly by the Host.
|
|
|
|
Resolving CI-001 as literally written is **not a small, well-scoped fix** — it is one
|
|
of two substantial decisions, both requiring the user:
|
|
|
|
1. **Move the bootstrap into this module** — relocate the HOCON generation, split-brain
|
|
config, cluster-singleton helpers and `CoordinatedShutdown` wiring out of
|
|
`ScadaLink.Host` into `ScadaLink.ClusterInfrastructure`, add the Akka package
|
|
references, and re-wire the Host to call into it. This is a cross-module refactor
|
|
touching `src/ScadaLink.Host/*` and several other projects — outside the edit scope
|
|
permitted for this finding (only `src/ScadaLink.ClusterInfrastructure/`,
|
|
`tests/ScadaLink.ClusterInfrastructure.Tests/`, and this file may be edited).
|
|
2. **Accept the current placement** — keep the bootstrap in the Host and update
|
|
`Component-ClusterInfrastructure.md` (and the README component table) to record that
|
|
the Host owns the actor-system/cluster bootstrap and that this module's role is the
|
|
shared `ClusterOptions` contract. That fix is a design-doc edit, also outside this
|
|
module's permitted edit scope.
|
|
|
|
Either path is a deliberate architecture decision, not a bug fix. The decision was
|
|
surfaced to the user, who chose **option 2 — accept the current placement**: the Akka
|
|
bootstrap stays in the Host (the single deployable binary that performs all actor-system
|
|
bring-up), and the design docs are corrected to record the true ownership.
|
|
|
|
**Resolved** — fixing commit `<pending>`, date 2026-05-16. The finding was a design-doc
|
|
drift, not missing behaviour. `docs/requirements/Component-ClusterInfrastructure.md` now
|
|
carries an "Implementation Note — Code Placement" section stating that the
|
|
`ScadaLink.ClusterInfrastructure` project owns the `ClusterOptions` configuration model
|
|
while `ScadaLink.Host` owns the Akka bootstrap, HOCON generation, split-brain-resolver
|
|
wiring, `CoordinatedShutdown` integration, and active-node health checks. The README
|
|
component table (row 13) was updated to match. No code change was required — the
|
|
documented cluster behaviour already exists and is exercised; only the doc's
|
|
module-ownership claim was wrong. Module test suite green (3 passed).
|
|
|
|
### ClusterInfrastructure-002 — No-op DI extension methods report success while doing nothing
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:7-17` |
|
|
|
|
**Description**
|
|
|
|
`AddClusterInfrastructure` and `AddClusterInfrastructureActors` both accept an
|
|
`IServiceCollection` and return it unchanged. A caller (e.g. the Host) that invokes
|
|
`services.AddClusterInfrastructure()` receives a fluent, success-looking result but
|
|
no actor system, no cluster, and no singleton support is actually registered. This
|
|
is a silent failure: the system will appear to start, then fail later and far from
|
|
the cause (e.g. when a component resolves an `ActorSystem` that was never added, or
|
|
when the cluster singleton never forms). A no-op that masquerades as a completed
|
|
registration is worse than an unimplemented method that throws.
|
|
|
|
**Recommendation**
|
|
|
|
Until the real implementation exists, make the placeholder loud rather than silent —
|
|
either throw `NotImplementedException` from the methods, or have them log a
|
|
prominent warning, so an integrating caller fails fast with a clear cause. Replace
|
|
with the genuine registration when CI-001 is addressed.
|
|
|
|
**Resolution**
|
|
|
|
Confirmed against the source: both methods returned the `IServiceCollection`
|
|
unchanged. Verified the consumers — `ScadaLink.Host` calls `AddClusterInfrastructure()`
|
|
(`Program.cs:68`, `SiteServiceRegistration.cs:24`); `AddClusterInfrastructureActors`
|
|
is dead — it is called nowhere in the solution.
|
|
|
|
**Resolved** — fixing commit `commit pending`, date 2026-05-16.
|
|
`AddClusterInfrastructure` now does real work: it registers the
|
|
`ClusterOptionsValidator` (CI-004) via `TryAddEnumerable`, so the method is no longer a
|
|
no-op and a misconfigured `ScadaLink:Cluster` section fails fast on the first
|
|
`IOptions<ClusterOptions>` resolution. `AddClusterInfrastructureActors` — which this
|
|
component never had any actors to register, as CI-001 established the Akka bootstrap
|
|
lives in `ScadaLink.Host` — now throws `NotImplementedException` with a message
|
|
pointing the caller to the Host, rather than masquerading as a completed registration.
|
|
Covered by `ServiceCollectionExtensionsTests`
|
|
(`AddClusterInfrastructure_RegistersOptionsValidator`,
|
|
`AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution`,
|
|
`AddClusterInfrastructureActors_ThrowsRatherThanSilentlySucceeding`).
|
|
|
|
### ClusterInfrastructure-003 — ClusterOptions omits several documented node-configuration settings
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
|
|
|
|
**Description**
|
|
|
|
The "Node Configuration", "Split-Brain Resolution", and "Failure Detection Timing"
|
|
sections of the design doc enumerate the settings each node needs. `ClusterOptions`
|
|
exposes `SeedNodes`, `SplitBrainResolverStrategy`, `StableAfter`,
|
|
`HeartbeatInterval`, `FailureDetectionThreshold`, and `MinNrOfMembers`, but is
|
|
missing: the Akka remoting hostname/port (default 8081 central, 8082 site), the
|
|
cluster role (Central vs. Site) and the site identifier, the `down-if-alone` flag
|
|
(the design explicitly requires `down-if-alone = on` for the keep-oldest resolver),
|
|
and — for site nodes — the gRPC port (default 8083) and local SQLite storage paths.
|
|
Without these, the options class cannot drive a correct HOCON configuration when
|
|
CI-001 is implemented. (Some settings such as remoting host/port may instead belong
|
|
in `Host/NodeOptions.cs`; the split of ownership should be decided deliberately, but
|
|
at minimum `down-if-alone` belongs with the split-brain settings here.)
|
|
|
|
**Recommendation**
|
|
|
|
Add the missing settings — at minimum a `DownIfAlone` boolean (default `true`) and
|
|
the cluster role / site identifier — or document explicitly which settings are
|
|
owned by `Host/NodeOptions.cs` instead, so the design doc and the options classes
|
|
agree on where each value lives.
|
|
|
|
**Resolution**
|
|
|
|
Partially re-triaged. Verified against the source: most of the "missing" settings are
|
|
**deliberately owned by `ScadaLink.Host.NodeOptions`** — `NodeOptions` already carries
|
|
`Role`, `NodeHostname`, `SiteId`, `RemotingPort` and `GrpcPort`, and `AkkaHostedService`
|
|
builds the HOCON from `NodeOptions` for exactly those values. Local SQLite storage paths
|
|
live in the database / store-and-forward options. This is the ownership split CI-001
|
|
established (the Host owns node identity and bootstrap; this project owns the
|
|
cluster-formation contract), so those settings do **not** belong in `ClusterOptions`.
|
|
|
|
The one genuine gap the finding identifies is `down-if-alone`, which the design doc
|
|
puts with the split-brain settings.
|
|
|
|
**Resolved** — fixing commit `commit pending`, date 2026-05-16. Added the
|
|
`DownIfAlone` boolean (default `true`) to `ClusterOptions` so the split-brain
|
|
configuration contract is complete, and added a class-level XML doc that records the
|
|
deliberate ownership split — node identity/remoting/gRPC in `Host.NodeOptions`, storage
|
|
paths in the database options, cluster-formation settings here — so the design doc and
|
|
the options classes now agree on where each value lives. (`AkkaHostedService` currently
|
|
hard-codes `down-if-alone = on` in HOCON; wiring it to read `DownIfAlone` is a one-line
|
|
`ScadaLink.Host` change, outside this module's permitted edit scope, and is noted for
|
|
the Host's review.) Covered by `ClusterOptionsTests.DefaultValues_AreCorrect` and
|
|
`ClusterOptionsTests.DownIfAlone_CanBeSet`.
|
|
|
|
### ClusterInfrastructure-004 — ClusterOptions has no validation despite safety-critical values
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Code organization & conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
|
|
|
|
**Description**
|
|
|
|
`ClusterOptions` carries values whose misconfiguration has cluster-wide
|
|
consequences. The design doc is emphatic that `min-nr-of-members` must be `1` (a
|
|
value of `2` blocks the singleton and therefore all data collection indefinitely
|
|
after failover), that `SplitBrainResolverStrategy` must be `keep-oldest` for a
|
|
two-node cluster (quorum strategies cause total shutdown), and that the timing
|
|
values are interdependent (`HeartbeatInterval` must be well below
|
|
`FailureDetectionThreshold`). The class has no data annotations, no
|
|
`IValidateOptions<ClusterOptions>`, and no guard logic, so an `appsettings.json`
|
|
setting `MinNrOfMembers: 2` or `SplitBrainResolverStrategy: "keep-majority"` (the
|
|
exact value the test at `ClusterOptionsTests.cs:35` shows is settable) would be
|
|
accepted silently and produce the catastrophic outcomes the design doc warns
|
|
against.
|
|
|
|
**Recommendation**
|
|
|
|
Add validation — data annotations (`[Range]` for `MinNrOfMembers`, etc.) plus an
|
|
`IValidateOptions<ClusterOptions>` implementation that enforces
|
|
`MinNrOfMembers == 1`, restricts `SplitBrainResolverStrategy` to a known set,
|
|
requires `SeedNodes` non-empty, and asserts `HeartbeatInterval <
|
|
FailureDetectionThreshold` and positive `StableAfter`. Register it with
|
|
`ValidateOnStart()` so misconfiguration fails fast at boot.
|
|
|
|
**Resolution**
|
|
|
|
Confirmed: `ClusterOptions` had no validation of any kind, and the design doc's
|
|
catastrophic-misconfiguration values (`MinNrOfMembers: 2`, a quorum split-brain
|
|
strategy) would have been bound silently.
|
|
|
|
**Resolved** — fixing commit `commit pending`, date 2026-05-16. Added
|
|
`ClusterOptionsValidator : IValidateOptions<ClusterOptions>`, which enforces
|
|
`MinNrOfMembers == 1`, restricts `SplitBrainResolverStrategy` to the
|
|
`keep-oldest`-only allowed set, requires a non-empty `SeedNodes`, requires positive
|
|
`StableAfter` / `HeartbeatInterval` / `FailureDetectionThreshold`, and asserts
|
|
`HeartbeatInterval < FailureDetectionThreshold`. It accumulates every failure into one
|
|
result. It is registered by `AddClusterInfrastructure()` (CI-002) as a singleton
|
|
`IValidateOptions<ClusterOptions>`, so a misconfigured section throws
|
|
`OptionsValidationException` on the first `IOptions<ClusterOptions>.Value` resolution
|
|
— which `AkkaHostedService` performs during startup, giving the fail-fast-at-boot
|
|
behaviour the recommendation asks for without the src project taking a dependency on
|
|
the full `Microsoft.Extensions.DependencyInjection` package needed for the
|
|
`ValidateOnStart()` overload. Data annotations were not used — a single
|
|
`IValidateOptions` implementation expresses the interdependent timing rules that
|
|
attributes cannot. Covered by `ClusterOptionsValidatorTests` (8 cases) and
|
|
`ServiceCollectionExtensionsTests.AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution`.
|
|
|
|
### ClusterInfrastructure-005 — No configuration section name constant for the Options pattern binding
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Code organization & conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3` |
|
|
|
|
**Description**
|
|
|
|
CLAUDE.md specifies per-component configuration via `appsettings.json` sections
|
|
bound with the Options pattern. `ClusterOptions` provides no `public const string
|
|
SectionName` (or equivalent) for the binding site to reference, so whichever code
|
|
binds the section must hard-code the magic string, and there is no single source of
|
|
truth for the section name. Because `AddClusterInfrastructure` is itself a no-op
|
|
(CI-002), the options class is currently bound nowhere at all, making the missing
|
|
constant easy to overlook.
|
|
|
|
**Recommendation**
|
|
|
|
Add a `public const string SectionName = "Cluster";` (or the agreed name) to
|
|
`ClusterOptions` and have the eventual `AddClusterInfrastructure` bind
|
|
`configuration.GetSection(ClusterOptions.SectionName)` against it.
|
|
|
|
**Resolution**
|
|
|
|
Confirmed against the source: `ClusterOptions` previously exposed no section-name
|
|
constant, leaving binding sites to hard-code the magic string.
|
|
|
|
**Resolved** — fixing commit `commit pending`, date 2026-05-16. `ClusterOptions` now
|
|
exposes `public const string SectionName = "ScadaLink:Cluster";` as the single source
|
|
of truth for the `appsettings.json` section name, with an XML doc explaining its
|
|
purpose. The chosen value matches the `ScadaLink:`-prefixed section convention used by
|
|
peer option classes and referenced by `ClusterOptionsValidator` / the design doc.
|
|
Covered by `ClusterOptionsTests.SectionName_IsTheExpectedAppSettingsSection`, which
|
|
both pins the value and — by referencing the constant — guards against its removal
|
|
(its deletion would break compilation).
|
|
|
|
### ClusterInfrastructure-006 — No tests for any cluster behaviour; only the options POCO is covered
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Testing coverage |
|
|
| Status | Resolved |
|
|
| Location | `tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsTests.cs:1-51` |
|
|
|
|
**Description**
|
|
|
|
The test project contains only `ClusterOptionsTests`, exercising default values and
|
|
property setters of `ClusterOptions`. There are no tests for cluster formation,
|
|
leader election, failover detection, split-brain resolution, singleton handover, or
|
|
the `ServiceCollectionExtensions` registration methods — none can exist because the
|
|
behaviour itself is absent (CI-001). This is recorded so the testing gap is tracked
|
|
alongside the implementation gap: the most safety-critical paths of the entire
|
|
system (failover, split-brain, dual-node recovery) are completely untested. The
|
|
test at line 30-50 also asserts that `SplitBrainResolverStrategy` can be set to
|
|
`"keep-majority"`, implicitly endorsing a value the design doc forbids for a
|
|
two-node cluster — see CI-004.
|
|
|
|
**Recommendation**
|
|
|
|
When CI-001 is implemented, add multi-node `Akka.Cluster.TestKit` /
|
|
`MultiNodeTestKit` tests covering cluster formation, failover promotion,
|
|
split-brain downing, and singleton handover, plus unit tests for HOCON generation
|
|
from `ClusterOptions` and for the options validation from CI-004.
|
|
|
|
**Resolution**
|
|
|
|
Re-triaged in light of CI-001's resolution. The Akka bootstrap, HOCON generation,
|
|
cluster formation, failover and singleton handover are owned by `ScadaLink.Host`, not
|
|
this project — multi-node `Akka.Cluster.TestKit` tests for that behaviour belong in the
|
|
Host's test suite, outside this module's scope. What this module legitimately owns is
|
|
`ClusterOptions`, its validator, and the DI registration, and the testing gap there is
|
|
now closed.
|
|
|
|
**Resolved** — fixing commit `commit pending`, date 2026-05-16. Added two test classes
|
|
to `tests/ScadaLink.ClusterInfrastructure.Tests`: `ClusterOptionsValidatorTests`
|
|
(8 cases — valid defaults pass; `MinNrOfMembers != 1`, unsupported split-brain
|
|
strategies, empty seed nodes, heartbeat not below the failure threshold, non-positive
|
|
`StableAfter` all fail; and a multi-failure accumulation case) and
|
|
`ServiceCollectionExtensionsTests` (3 cases — `AddClusterInfrastructure` registers the
|
|
validator, the validator rejects bad options at `IOptions` resolution, and
|
|
`AddClusterInfrastructureActors` throws). The pre-existing `ClusterOptionsTests` was
|
|
extended with `DownIfAlone` coverage. The test project gained references to
|
|
`Microsoft.Extensions.DependencyInjection` and `Microsoft.Extensions.Options`. Module
|
|
test suite green: 16 passed (was 3). Note: the `keep-majority` value used in the
|
|
pre-existing `ClusterOptionsTests.Properties_CanBeSetToCustomValues` is intentionally
|
|
left — that test exercises the POCO's property setter (the POCO accepts any string by
|
|
design); `ClusterOptionsValidator` is the layer that now rejects `keep-majority`, and
|
|
`UnsupportedSplitBrainStrategy_FailsValidation` proves it.
|
|
|
|
### ClusterInfrastructure-007 — ClusterOptions lacks XML documentation comments
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
|
|
|
|
**Description**
|
|
|
|
`ClusterOptions` and each of its six properties have no XML doc comments. Peer
|
|
options classes such as `StoreAndForward/StoreAndForwardOptions.cs` document the
|
|
class and every property (including units and design-doc references). For a class
|
|
whose values carry the cluster-wide consequences described in the design doc
|
|
(notably `MinNrOfMembers` and `SplitBrainResolverStrategy`), the absence of inline
|
|
documentation is a maintainability and safety gap — a future editor has no in-code
|
|
warning that `MinNrOfMembers` must stay `1`.
|
|
|
|
**Recommendation**
|
|
|
|
Add `<summary>` comments to the class and each property, stating units and the
|
|
documented constraints (e.g. that `MinNrOfMembers` must be `1`, that
|
|
`HeartbeatInterval` must be well below `FailureDetectionThreshold`), referencing
|
|
the relevant design-doc sections as peer modules do.
|
|
|
|
**Resolution**
|
|
|
|
Confirmed against the source: `ClusterOptions` previously had no XML doc comments on
|
|
the class or any property.
|
|
|
|
**Resolved** — fixing commit `commit pending`, date 2026-05-16. `ClusterOptions` now
|
|
carries a class-level `<summary>` documenting the cluster configuration contract and
|
|
the deliberate ownership split (node identity / remoting / gRPC in
|
|
`Host.NodeOptions`, storage paths in the database options, cluster-formation settings
|
|
here), plus `<summary>` comments on all seven properties — each stating units,
|
|
defaults, and the design-doc constraints. Notably `MinNrOfMembers` is documented as
|
|
required to be `1` (a value of `2` blocks the cluster singleton after failover) and
|
|
`HeartbeatInterval` as required to be well below `FailureDetectionThreshold`, giving a
|
|
future editor the in-code warnings the recommendation asks for. This is a
|
|
documentation-only change, so no runtime regression test is meaningful; verified by
|
|
inspection of `ClusterOptions.cs:3-74`. Module test suite green (17 passed).
|
|
|
|
### ClusterInfrastructure-008 — "Phase 0 skeleton" status is undocumented at the module level
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9`, `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16` |
|
|
|
|
**Description**
|
|
|
|
The only indication that this foundational module is unimplemented is two inline
|
|
comments inside private method bodies (`// Phase 0: skeleton only` /
|
|
`// Phase 0: placeholder for Akka actor registration`). There is no module README,
|
|
no `<!-- TODO -->` in the design doc, and no tracking marker visible to anyone
|
|
reading the project structure or the component table. Given that the design doc
|
|
(`Component-ClusterInfrastructure.md`) describes a fully featured component with no
|
|
caveat, a reader will reasonably assume the module is built. The mismatch between a
|
|
complete-looking design doc and an empty implementation is itself a documentation
|
|
defect.
|
|
|
|
**Recommendation**
|
|
|
|
Add a short note to the design doc (or a module-level `README.md`) stating the
|
|
current implementation status and what "Phase 0" delivers, and reference a tracked
|
|
issue for the remaining work (CI-001). Keep the README component table accurate
|
|
about which components are skeletons versus implemented.
|
|
|
|
**Resolution**
|
|
|
|
Re-triaged in light of CI-001's resolution. The original defect was the mismatch
|
|
between two misleading inline `// Phase 0: skeleton only` /
|
|
`// Phase 0: placeholder` comments buried in private method bodies and a
|
|
complete-looking design doc with no caveat. That premise has been overtaken by the
|
|
CI-001/CI-002 work:
|
|
|
|
- The "Phase 0 skeleton" comments no longer exist anywhere in
|
|
`src/ScadaLink.ClusterInfrastructure` (verified by `grep`). `ServiceCollectionExtensions`
|
|
now does real work (registers `ClusterOptionsValidator`) and `AddClusterInfrastructureActors`
|
|
throws explicitly — both with accurate XML docs explaining the ownership split.
|
|
- The module is no longer an unimplemented skeleton. CI-001 established that the Akka
|
|
bootstrap legitimately lives in `ScadaLink.Host`, and this project's true scope —
|
|
the `ClusterOptions` configuration contract, its validator, and DI registration — is
|
|
fully implemented and tested.
|
|
- The design doc `Component-ClusterInfrastructure.md` now opens with an
|
|
"Implementation Note — Code Placement" section (added by CI-001) that explicitly
|
|
states the component is a *design responsibility* realised across
|
|
`ScadaLink.ClusterInfrastructure` (configuration model) and `ScadaLink.Host`
|
|
(bootstrap/runtime wiring), and the README component table (row 13) was updated to
|
|
match. A reader of the design doc no longer assumes a single fully-built project.
|
|
|
|
**Resolved** — fixing commit `commit pending`, date 2026-05-16. No further change
|
|
required: the misleading skeleton comments are gone, the module's real status is
|
|
documented at the design-doc level via the "Implementation Note — Code Placement"
|
|
section, and the component table reflects the true placement. This is a
|
|
documentation-only finding, so no runtime regression test is meaningful; verified by
|
|
inspection of `ServiceCollectionExtensions.cs` and
|
|
`docs/requirements/Component-ClusterInfrastructure.md:21-39`.
|
|
|
|
### ClusterInfrastructure-009 — `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Design-document adherence |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74` |
|
|
|
|
**Description**
|
|
|
|
The `DownIfAlone` property was added to `ClusterOptions` by CI-003's resolution as
|
|
part of "the split-brain configuration contract". It is public, defaults to `true`,
|
|
carries an XML doc presenting it as "the design-doc requirement", and is exercised by
|
|
`ClusterOptionsTests.DownIfAlone_CanBeSet`. However, nothing in the system reads it.
|
|
The Akka.NET HOCON is generated by `ScadaLink.Host.Actors.AkkaHostedService.BuildHocon`,
|
|
which **hard-codes** the resolver setting:
|
|
|
|
```
|
|
split-brain-resolver {
|
|
active-strategy = ...
|
|
stable-after = ...
|
|
keep-oldest {
|
|
down-if-alone = on
|
|
}
|
|
}
|
|
```
|
|
|
|
`BuildHocon` receives the full `ClusterOptions` instance and consumes every other
|
|
field (`SeedNodes`, `MinNrOfMembers`, `SplitBrainResolverStrategy`, `StableAfter`,
|
|
`HeartbeatInterval`, `FailureDetectionThreshold`) but ignores `DownIfAlone` entirely.
|
|
The result is a configuration property that an operator can set in `appsettings.json`,
|
|
that passes validation, and that has **zero runtime effect** — setting
|
|
`DownIfAlone: false` does not turn the flag off. CI-003's resolution explicitly
|
|
acknowledged this gap ("wiring it to read `DownIfAlone` is a one-line `ScadaLink.Host`
|
|
change ... noted for the Host's review") but the wiring was never done and no tracked
|
|
finding carried it, so the gap has silently persisted to commit `39d737e`. An inert,
|
|
misleadingly-documented configuration knob is a correctness and design-adherence
|
|
defect: it gives operators a false sense of control over a safety-critical resolver
|
|
setting.
|
|
|
|
**Recommendation**
|
|
|
|
Either (a) wire `DownIfAlone` into `BuildHocon` — emit `down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}`
|
|
— so the property does what its XML doc claims (a Host-side change, to be tracked in
|
|
the Host module's review since `BuildHocon` lives there), or (b) if the flag is
|
|
intentionally fixed at `on` and must never be operator-configurable, remove the
|
|
`DownIfAlone` property from `ClusterOptions` and document the hard-coded `on` value as
|
|
a non-negotiable invariant. Do not leave a public, settable, validated property that
|
|
controls nothing.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|
|
|
|
### ClusterInfrastructure-010 — Validator does not enforce `DownIfAlone = true` despite the design doc requiring it
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Error handling & resilience |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:21-71` |
|
|
|
|
**Description**
|
|
|
|
`Component-ClusterInfrastructure.md` (Split-Brain Resolution) states the keep-oldest
|
|
resolver must be configured with `down-if-alone = on`, and the XML doc on
|
|
`ClusterOptions.DownIfAlone` calls `true` "the design-doc requirement" — the rationale
|
|
being that without it the oldest node can run as an isolated single-node cluster
|
|
during a partition while the younger node forms its own. `ClusterOptionsValidator`
|
|
guards every other safety-critical setting (`MinNrOfMembers == 1`, `keep-oldest`-only
|
|
strategy, positive timings, heartbeat below the failure threshold) but performs no
|
|
check on `DownIfAlone`. A configuration of `DownIfAlone: false` therefore passes
|
|
validation cleanly. This is currently latent because CI-009 shows the property is not
|
|
consumed at all, but the moment CI-009 is fixed by wiring the property into the HOCON
|
|
(option (a)), `DownIfAlone: false` would silently produce the unsafe single-node
|
|
behaviour the design doc explicitly forbids — with no fail-fast guard. The validator
|
|
is the right place to enforce the invariant, consistent with how it already rejects
|
|
quorum split-brain strategies.
|
|
|
|
**Recommendation**
|
|
|
|
If CI-009 is resolved by keeping `DownIfAlone` configurable, add a check to
|
|
`ClusterOptionsValidator.Validate` that fails when `DownIfAlone` is `false` (or, if
|
|
some future deployment legitimately needs it off, fails only in combination with the
|
|
`keep-oldest` strategy), with a message explaining the isolated-single-node-cluster
|
|
hazard. If CI-009 is resolved by removing the property, this finding is moot and
|
|
should be closed as resolved alongside it.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|