code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
Joseph Doherty
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
+335 -13
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.DeploymentManager` |
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -53,20 +53,52 @@ DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
it still describes the query-before-redeploy behaviour that actually moved into
`TryReconcileWithSiteAsync` (DeploymentManager-017).
#### Re-review 2026-05-28 (commit `1eb6e97`)
Re-reviewed at commit `1eb6e97` after the DeploymentManager-015/016/017 fixes
and a docs-only XML-comment pass. The three prior findings remain `Resolved`
and verified — `ApplyPostSuccessSideEffectsAsync` is now invoked from both the
normal success path and `TryReconcileWithSiteAsync`, the reconciled-success
branch corrects `prior.RevisionHash` to the target, and `GetDeploymentStatusAsync`'s
XML doc now describes the local-DB-read it actually performs and cross-refs the
reconciliation helper. The DiffService wiring, options binding, ref-counted
operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor
test seam are still in place. The 7 new findings here are not regressions in
the DeploymentManager-015/016 fixes — they are issues uncovered by widening
the lens to the lifecycle paths, reconciliation's interaction with
intentional `Disabled` state, audit semantics, and operational concerns
(per-site artifact-build cost, Pending→InProgress double-write).
The single notable correctness issue is DeploymentManager-018: the
reconciliation shortcut unconditionally sets `instance.State = Enabled` via
`ApplyPostSuccessSideEffectsAsync`. After a central failover that loses the
in-memory operation lock, a user can legitimately `Disable` an instance whose
prior deploy record is still `InProgress`; a subsequent redeploy then reconciles
and silently re-enables the instance against the user's explicit intent.
The remaining six findings are medium/low: lifecycle-timeout audit gap
(DeploymentManager-019), audit-user attribution in reconciliation
(DeploymentManager-020), silent fallback in `ResolveSiteIdentifierAsync`
(DeploymentManager-021), back-to-back `Pending`→`InProgress` writes
(DeploymentManager-022), per-site re-query of system-wide artifacts
(DeploymentManager-023), and shared static state across `*ProbeActor` tests
(DeploymentManager-024).
## Checklist coverage
#### Re-review 2026-05-28 (commit `1eb6e97`)
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Re-review 2026-05-17: reconciliation skips instance-state/snapshot updates (DeploymentManager-015) and keeps a stale `RevisionHash` (DeploymentManager-016). Prior: stuck `InProgress` / cancelled-token write (resolved). |
| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counts and reclaims semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. No issues at re-review. |
| 4 | Error handling & resilience | ✓ | Prior gaps DeploymentManager-001/002/003/004 resolved and verified. No new issues. |
| 5 | Security | ✓ | SMTP credential handling documented as an accepted design decision (DeploymentManager-013). No injection vectors; no authz here (enforced upstream). No new issues. |
| 6 | Performance & resource management | ✓ | Semaphore leak resolved (DeploymentManager-005). No new issues. |
| 7 | Design-document adherence | ✓ | Query-before-redeploy and Diff View implemented (DeploymentManager-006/007). Re-review: reconciliation path breaks the deployed-snapshot/instance-state invariants — see DeploymentManager-015. |
| 8 | Code organization & conventions | ✓ | Options binding resolved (DeploymentManager-008). POCO/repo placement correct. No new issues. |
| 9 | Testing coverage | ✓ | Broad coverage added (success, lifecycle, lock serialization, reconciliation, artifact matrix). Re-review: reconciled-success path's missing side effects (DeploymentManager-015) are untested. |
| 10 | Documentation & comments | ✓ | Prior comment findings resolved. Re-review: `GetDeploymentStatusAsync` XML doc is now stale — DeploymentManager-017. |
| 1 | Correctness & logic bugs | ✓ | New: reconciliation forces `Enabled` even if the user disabled the instance in between (DeploymentManager-018). |
| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting verified. Note: test probes hold static state (DeploymentManager-024) — a test concern, not production code. |
| 4 | Error handling & resilience | ✓ | New: Disable/Enable/Delete timeouts return early without writing any audit entry — deploy has `DeployFailed`, lifecycle has nothing (DeploymentManager-019). |
| 5 | Security | ✓ | No new issues. SMTP credential decision documented (DeploymentManager-013 closed). |
| 6 | Performance & resource management | ✓ | New: `BuildDeployArtifactsCommandAsync` re-queries every system-wide artifact set per site in `DeployToAllSitesAsync` (DeploymentManager-023). |
| 7 | Design-document adherence | ✓ | Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on `Disabled`-state preservation. |
| 8 | Code organization & conventions | ✓ | New: redundant `Pending`→`InProgress` back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in `ResolveSiteIdentifierAsync` (DeploymentManager-021). |
| 9 | Testing coverage | ✓ | New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024). |
| 10 | Documentation & comments | ✓ | New: `DeployReconciled` audit uses `prior.DeployedBy` instead of the current `user` parameter — misleading for forensics (DeploymentManager-020). |
## Findings
@@ -873,3 +905,293 @@ database as a pure local read, and cross-references `TryReconcileWithSiteAsync`
as where the query-the-site-before-redeploy reconciliation actually lives.
Documentation-only change; no regression test (a test asserting comment text
would be meaningless).
### DeploymentManager-018 — Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:675-682,721-748` |
**Description**
`TryReconcileWithSiteAsync` calls `ApplyPostSuccessSideEffectsAsync` whenever
the site reports it has the target revision hash, and that helper
unconditionally writes `instance.State = InstanceState.Enabled`. The
reconciliation shortcut only runs when the prior `DeploymentRecord` is
`InProgress` or timeout-`Failed` — exactly the scenarios that survive a central
failover (the in-memory `OperationLockManager` is lost on failover, by design:
*"Lost on central failover (acceptable per design — in-progress treated as
failed)"*).
After such a failover, the per-instance operation lock is gone but the
deployment record is still `InProgress` in the DB. A user can legitimately
issue `DisableInstanceAsync` for the same instance — there is nothing in
`DisableInstanceAsync` that consults the deployment record, only the
`StateTransitionValidator` over `Instance.State`. If the state is `Enabled`
(the typical case when the deploy started), the disable proceeds, the site
honours it (the design states a disabled instance retains its deployed
configuration), and central now persists `Instance.State = Disabled`. The
deployment-record row remains `InProgress` (no one transitioned it). Later the
user retries the deploy: `TryReconcileWithSiteAsync` runs, the site still has
the target revision hash (Disable doesn't change the deployed config), the
prior record is marked `Success`, and `ApplyPostSuccessSideEffectsAsync` writes
`Instance.State = Enabled` — silently overriding the user's explicit Disable.
The same trap exists for any direct DB edit / migration that flipped the state
between the timed-out deploy and the redeploy. The normal deploy path can
defensibly assume `Enabled` after a fresh successful apply, but the
reconciliation path is reconciling *prior* state with *prior* user intent; it
should preserve `Disabled` if that is the current `Instance.State` at the time
of reconciliation, mirroring the design's separation between deploy (config
apply) and disable (subscription/script lifecycle).
**Recommendation**
In the reconciliation branch, do not force `Enabled`. Either:
- Pass a flag/parameter to `ApplyPostSuccessSideEffectsAsync` telling it
whether to touch state, and skip the state write on the reconciliation path
(leaving the current `Instance.State` intact, which is already `Enabled`
for a fresh deploy that timed out and `Disabled` for the user-disabled
follow-up case); or
- Only set `Enabled` when the current `Instance.State` is `NotDeployed` (i.e.
the first-deploy timed-out case), and leave existing `Enabled`/`Disabled`
alone.
Add a regression test where an instance with `Instance.State = Disabled` and a
prior `InProgress` deployment record is reconciled — the resulting
`Instance.State` must remain `Disabled`, and the deployment record must still
be marked `Success`.
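
The second option can be sketched as a small pure function. This is a minimal illustration only: it assumes an `InstanceState` enum with `NotDeployed`, `Enabled`, and `Disabled` members, and `ReconciliationStatePolicy` / `ResolveStateAfterReconcile` are hypothetical names, not existing members of `DeploymentService`.

```csharp
// Assumed shape of the instance-state enum; the real type lives in the codebase.
public enum InstanceState { NotDeployed, Enabled, Disabled }

// Hypothetical helper, not part of the existing DeploymentService.
public static class ReconciliationStatePolicy
{
    // Returns the state to persist after a reconciled success.
    // Only the first-deploy-timed-out case (NotDeployed) is promoted to Enabled;
    // an existing Enabled or Disabled is returned unchanged, so a user's
    // explicit Disable survives the reconciliation.
    public static InstanceState ResolveStateAfterReconcile(InstanceState current) =>
        current == InstanceState.NotDeployed ? InstanceState.Enabled : current;
}
```

Keeping the decision in a pure function would let the regression test above assert the state rule without standing up the site communication layer.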
### DeploymentManager-019 — Lifecycle command timeout writes no audit entry
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458` |
**Description**
`DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync` each
wrap the `CommunicationService` call in a linked CTS with
`LifecycleCommandTimeout` (DeploymentManager-012). On timeout they log a
warning and `return Result<...>.Failure(...)` — and skip the
`_auditService.LogAsync` call entirely. As a result, an operator-initiated
disable/enable/delete that times out at the site leaves **no audit trail**:
the user, the timestamp, the command id, and the failure mode are not
recorded in the audit log. The deploy path goes out of its way to write a
`DeployFailed` audit entry on the same failure mode
(`DeploymentService.cs:274-276`), with `CancellationToken.None` so the write is
durable; the lifecycle commands do not.
The design lists audit logging as a Deployment Manager responsibility for "all
deployment actions, system-wide artifact deployments, and instance lifecycle
changes" — a timed-out lifecycle command **is** an attempted lifecycle change,
and the operator action is exactly the kind of event the audit log exists to
record.
**Recommendation**
In each of the three `catch (Exception ex) when (ex is TimeoutException or
OperationCanceledException)` blocks, write a `DisableTimeout`/`EnableTimeout`/
`DeleteTimeout` (or use the existing operation name with a failure flag)
audit entry with `CancellationToken.None` so a cancelled outer token does not
prevent the audit write, mirroring `DeployFailed`. Add a unit test asserting
that `DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait`
also produces an audit entry.
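
A rough sketch of the pattern, under stated assumptions: `IAuditSink`, `InMemoryAuditSink`, and `LifecycleTimeoutAudit` are hypothetical stand-ins (the real `_auditService.LogAsync` signature may differ), and the `"<action>Timeout"` naming simply follows the suggestion above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the module's audit service.
public interface IAuditSink
{
    Task LogAsync(string user, string action, string detail, CancellationToken ct);
}

// Test double that refuses to write when handed a cancelled token,
// which is what makes passing CancellationToken.None matter.
public sealed class InMemoryAuditSink : IAuditSink
{
    public List<(string User, string Action, string Detail)> Entries { get; } = new();

    public Task LogAsync(string user, string action, string detail, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        Entries.Add((user, action, detail));
        return Task.CompletedTask;
    }
}

public static class LifecycleTimeoutAudit
{
    // Runs a site command bounded by `timeout`. On timeout or cancellation it
    // writes an audit entry with CancellationToken.None, so a cancelled caller
    // token cannot suppress the audit write. Returns false on that path.
    public static async Task<bool> RunAsync(
        Func<CancellationToken, Task> sendCommand,
        IAuditSink audit, string user, string action,
        TimeSpan timeout, CancellationToken ct)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        cts.CancelAfter(timeout);
        try
        {
            await sendCommand(cts.Token);
            return true;
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            await audit.LogAsync(user, action + "Timeout", ex.GetType().Name, CancellationToken.None);
            return false;
        }
    }
}
```

A unit test can pass a command that never completes (for example `ct => Task.Delay(Timeout.Infinite, ct)`) with a short timeout and then assert that exactly one entry was recorded.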
### DeploymentManager-020 — `DeployReconciled` audit attributes the action to the prior deployer, not the current user
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:683-686` |
**Description**
In `TryReconcileWithSiteAsync` the audit call is:
```
await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)
```
`prior.DeployedBy` is the user who issued the original (timed-out / stuck)
deployment, not the `user` parameter passed into `DeployInstanceAsync`. The
current user — the one who triggered the redeploy that produced the
reconciliation — is dropped on the floor. For audit forensics this is
misleading: the row will read "user A reconciled their own deployment"
when in fact user B initiated the action that reconciled it.
The original deployer is interesting context, but it should be carried in the
audit-detail object (where `DeploymentId` and `RevisionHash` already live), not
substituted for the actor.
**Recommendation**
Use `user` (the parameter on `DeployInstanceAsync`, threaded through
`TryReconcileWithSiteAsync`) as the audit actor, and include
`OriginalDeployer = prior.DeployedBy` in the detail object so the original
attribution is preserved without misrepresenting who took the action.
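
As an illustration of the intended shape (not the module's actual types): the record names below are hypothetical, and the identifier types are assumed to be strings purely for the sketch.

```csharp
// Hypothetical detail payload. Per the finding, the real detail object already
// carries DeploymentId and RevisionHash; OriginalDeployer is the proposed addition.
public sealed record ReconcileAuditDetail(
    string DeploymentId, string RevisionHash, string OriginalDeployer);

// Hypothetical flattened view of one audit row.
public sealed record AuditEntry(string Actor, string Action, ReconcileAuditDetail Detail);

public static class ReconcileAudit
{
    // The actor is the user who triggered the redeploy. The original deployer
    // is kept as context inside the detail object instead of replacing the actor.
    public static AuditEntry Build(
        string currentUser, string priorDeployedBy,
        string deploymentId, string revisionHash) =>
        new(currentUser, "DeployReconciled",
            new ReconcileAuditDetail(deploymentId, revisionHash, priorDeployedBy));
}
```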
### DeploymentManager-021 — `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:107-111` |
**Description**
```
private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
{
    var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
    return site?.SiteIdentifier ?? siteId.ToString();
}
```
If the `Site` row is missing (FK was deleted, race with admin delete, DB
inconsistency), the method silently returns the numeric DB id rendered as a
string. This is then passed to `CommunicationService.{Deploy,Disable,Enable,
Delete}InstanceAsync` and `QueryDeploymentStateAsync` as if it were a real
`SiteIdentifier` (e.g. "site-a"). The communication layer will fail with an
"unknown site" or routing error, producing a confusing diagnostic that hides
the actual problem (no site row).
This is a defensive concern, but every mutating operation in the module goes
through this method, so a stale instance whose site was deleted will produce a
misleading error every time it is touched.
**Recommendation**
Treat a missing site as a hard validation failure: return a
`Result.Failure($"Site with ID {siteId} not found")` early from the calling
operations, instead of fabricating an identifier. The repository already
returns `Site?`, so the null path is type-visible; just don't paper over it.
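
One possible shape for the fail-fast version, as a sketch: `Site`, `Result<T>`, and `SiteIdentifierResolver` here are minimal stand-ins written for this example (the codebase has its own `Site` entity and `Result` type), and the `getSiteById` delegate stands in for `_siteRepository.GetSiteByIdAsync`.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal stand-ins for illustration only.
public sealed record Site(int Id, string SiteIdentifier);

public sealed record Result<T>(bool IsSuccess, T? Value, string? Error)
{
    public static Result<T> Success(T value) => new(true, value, null);
    public static Result<T> Failure(string error) => new(false, default, error);
}

public static class SiteIdentifierResolver
{
    // Surfaces a missing site row as an explicit failure instead of
    // fabricating an identifier from the numeric database id.
    public static async Task<Result<string>> ResolveAsync(
        Func<int, CancellationToken, Task<Site?>> getSiteById,
        int siteId, CancellationToken ct)
    {
        var site = await getSiteById(siteId, ct);
        return site is null
            ? Result<string>.Failure($"Site with ID {siteId} not found")
            : Result<string>.Success(site.SiteIdentifier);
    }
}
```

Callers would then return the failure before contacting the communication layer, so the diagnostic names the missing site row rather than a routing error.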
### DeploymentManager-022 — `Pending` and `InProgress` are written back-to-back with no intervening work
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:178-194` |
**Description**
`DeployInstanceAsync` does:
```
record.Status = Pending;
AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
record.Status = InProgress;
UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
```
There is no work between the two writes — flattening, validation, and
reconciliation have already completed by line 174. The deploy command is sent
immediately after the `InProgress` write. The `Pending` write therefore costs:
an extra `SaveChangesAsync` round-trip, an extra `IDeploymentStatusNotifier`
invocation (which the CentralUI-006 page renders, so the user briefly sees a
`Pending` flicker before `InProgress`), and an extra row-version bump if EF
optimistic concurrency is enabled on the table.
The design uses `Pending` to mean "queued, not yet sent" and `InProgress` to
mean "sent to site, awaiting response". The code's `Pending` slot has no
queuing — it is set and immediately overwritten — so the state buys nothing
operationally.
**Recommendation**
Either:
- Drop the `Pending` write entirely and create the record directly in
`InProgress` (one row insert, one notification, simpler UI); or
- Move the `Pending`→`InProgress` transition to bracket actual queueing/work
(e.g. set `Pending` *before* flattening + reconciliation, set `InProgress`
immediately before `DeployInstanceAsync` on the comm service) so the two
states carry distinguishable semantics worth a separate write.
### DeploymentManager-023 — `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173` |
**Description**
`DeployToAllSitesAsync` loops over sites and calls
`BuildDeployArtifactsCommandAsync(site.Id, ...)` for each one. Of the six
artifact sets the method gathers, **only** `dataConnections` is per-site:
- `_templateRepo.GetAllSharedScriptsAsync` — global.
- `_externalSystemRepo.GetAllExternalSystemsAsync` — global, plus
`GetMethodsByExternalSystemIdAsync` per external system per site.
- `_externalSystemRepo.GetAllDatabaseConnectionsAsync` — global.
- `_notificationRepo.GetAllNotificationListsAsync` — global.
- `_notificationRepo.GetAllSmtpConfigurationsAsync` — global.
- `_siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...)`: **per-site**.
With N sites this issues 5·N queries against the global sets where 5 would do
(plus M·N method queries where M would do, M being the external-system count). On a hub-and-spoke
deployment with many sites the artifact-deploy path is noticeably slower than
necessary and pins DbContext usage longer than needed. Per CLAUDE.md, the
DbContext is not thread-safe and the per-site commands are already built
sequentially (good); the redundant queries are sequential too, but the
network/round-trip cost is real.
**Recommendation**
Hoist the global queries (shared scripts, external systems + their methods,
DB connections, notification lists, SMTP configurations) out of
`BuildDeployArtifactsCommandAsync`, fetch them once in `DeployToAllSitesAsync`,
and pass them in alongside the site id (or expose a
`BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals)` overload).
`RetryForSiteAsync` (the single-site path) can keep the convenience-overload
behaviour. Add a test using NSubstitute's `.Received()` to assert
`_templateRepo.GetAllSharedScriptsAsync` is called exactly once for an
N-site deployment.
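
The hoisting can be sketched as follows. This is a simplified illustration, not the service's real signature: `GlobalArtifacts`, `SiteCommand`, and `ArtifactCommandBuilder` are hypothetical, a single shared-scripts query stands in for all five global sets, and plain delegates replace the repositories.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical container for artifact sets that are identical for every site.
public sealed record GlobalArtifacts(IReadOnlyList<string> SharedScripts);

// Hypothetical per-site command: shared globals plus that site's own data.
public sealed record SiteCommand(
    int SiteId, GlobalArtifacts Globals, IReadOnlyList<string> DataConnections);

public static class ArtifactCommandBuilder
{
    public static async Task<List<SiteCommand>> BuildForAllSitesAsync(
        IReadOnlyList<int> siteIds,
        Func<Task<IReadOnlyList<string>>> getSharedScripts,         // global query
        Func<int, Task<IReadOnlyList<string>>> getDataConnections)  // per-site query
    {
        // Global sets are fetched once, before the per-site loop.
        var globals = new GlobalArtifacts(await getSharedScripts());

        var commands = new List<SiteCommand>(siteIds.Count);
        foreach (var siteId in siteIds)
        {
            // Still sequential, since the DbContext is not thread-safe.
            var connections = await getDataConnections(siteId);
            commands.Add(new SiteCommand(siteId, globals, connections));
        }
        return commands;
    }
}
```

With this shape, a call-count assertion on the global delegate (or on the substituted repository) should observe one call regardless of the number of sites.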
### DeploymentManager-024 — Test probe actors hold mutable static state across tests
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075`, `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217` |
**Description**
`ReconcileProbeActor.QueryCount` / `DeployCount`, `SerializationProbeActor.MaxConcurrent`
/ `_current`, and `ArtifactProbeActor.Received` are all `static` fields.
Each test's actor constructor resets them, but reset-on-construction only
works as long as no two tests that use the same probe type run concurrently.
xUnit runs the tests within a single class serially by default, so today's
tests pass. That safety is incidental: reuse one of these probe actors from a
second test class (separate classes are separate collections and run in
parallel by default), or move to a runner that parallelises within a class,
and the counters race. A deploy in test A could then increment `DeployCount`
while test B is asserting on it.
Static state shared across tests is also why a flaky-test investigation here
will be unusually painful: the offending interaction is invisible from any
single test file.
**Recommendation**
Replace the static counters with instance state, hand the actor a probe
recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg`
in each test. Where the simpler counter shape is preferred, pass a
shared-state object into the actor's constructor so each test owns its own
instance — never reach for `static` mutable test state.
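
For the simpler counter shape, a per-test state object might look like the sketch below. `ProbeCounters` is a hypothetical name, and it assumes the probe actors' constructors are changed to accept it.

```csharp
using System.Threading;

// Hypothetical per-test counter object. Each test creates its own instance and
// hands it to the probe actor's constructor, so no state is shared via statics.
public sealed class ProbeCounters
{
    private int _queryCount;
    private int _deployCount;

    public int QueryCount => Volatile.Read(ref _queryCount);
    public int DeployCount => Volatile.Read(ref _deployCount);

    // Interlocked keeps the increments safe because the actor's thread and the
    // asserting test thread are generally not the same thread.
    public void RecordQuery() => Interlocked.Increment(ref _queryCount);
    public void RecordDeploy() => Interlocked.Increment(ref _deployCount);
}
```

In an Akka.NET test the object could be injected with something like `Props.Create(() => new ReconcileProbeActor(counters))`, where the constructor parameter is the assumed change, and the test then asserts on its own `counters` instance.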