code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including first-time reviews of the four newer components (AuditLog, NotificationOutbox, SiteCallAudit, Transport) — so the code-reviews/ index reflects today's codebase rather than the 2026-05-16 baseline. 172 new Open findings (0 Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules. regen-readme.py now derives each module's Last reviewed + Commit from its findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future single-module re-reviews show their own date in the Module Status table.
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.DeploymentManager` |
 | Design doc | `docs/requirements/Component-DeploymentManager.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |

 ## Summary

@@ -53,20 +53,52 @@ DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
 it still describes the query-before-redeploy behaviour that actually moved into
 `TryReconcileWithSiteAsync` (DeploymentManager-017).

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Re-reviewed at commit `1eb6e97` after the DeploymentManager-015/016/017 fixes
+and a docs-only XML-comment pass. The three prior findings remain `Resolved`
+and verified — `ApplyPostSuccessSideEffectsAsync` is now invoked from both the
+normal success path and `TryReconcileWithSiteAsync`, the reconciled-success
+branch corrects `prior.RevisionHash` to the target, and `GetDeploymentStatusAsync`'s
+XML doc now describes the local-DB-read it actually performs and cross-refs the
+reconciliation helper. The DiffService wiring, options binding, ref-counted
+operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor
+test seam are still in place. The 7 new findings here are not regressions in
+the DeploymentManager-015/016 fixes — they are issues uncovered by widening
+the lens to the lifecycle paths, reconciliation's interaction with
+intentional `Disabled` state, audit semantics, and operational concerns
+(per-site artifact-build cost, Pending→InProgress double-write).
+
+The single notable correctness issue is DeploymentManager-018: the
+reconciliation shortcut unconditionally sets `instance.State = Enabled` via
+`ApplyPostSuccessSideEffectsAsync`. After a central failover that loses the
+in-memory operation lock, a user can legitimately `Disable` an instance whose
+prior deploy record is still `InProgress`; a subsequent redeploy then reconciles
+and silently re-enables the instance against the user's explicit intent.
+The remaining six findings are medium/low: lifecycle-timeout audit gap
+(DeploymentManager-019), audit-user attribution in reconciliation
+(DeploymentManager-020), silent fallback in `ResolveSiteIdentifierAsync`
+(DeploymentManager-021), back-to-back `Pending`→`InProgress` writes
+(DeploymentManager-022), per-site re-query of system-wide artifacts
+(DeploymentManager-023), and shared static state across `*ProbeActor` tests
+(DeploymentManager-024).
+
 ## Checklist coverage

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ✓ | Re-review 2026-05-17: reconciliation skips instance-state/snapshot updates (DeploymentManager-015) and keeps a stale `RevisionHash` (DeploymentManager-016). Prior: stuck `InProgress` / cancelled-token write (resolved). |
-| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
-| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counts and reclaims semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. No issues at re-review. |
-| 4 | Error handling & resilience | ✓ | Prior gaps DeploymentManager-001/002/003/004 resolved and verified. No new issues. |
-| 5 | Security | ✓ | SMTP credential handling documented as an accepted design decision (DeploymentManager-013). No injection vectors; no authz here (enforced upstream). No new issues. |
-| 6 | Performance & resource management | ✓ | Semaphore leak resolved (DeploymentManager-005). No new issues. |
-| 7 | Design-document adherence | ✓ | Query-before-redeploy and Diff View implemented (DeploymentManager-006/007). Re-review: reconciliation path breaks the deployed-snapshot/instance-state invariants — see DeploymentManager-015. |
-| 8 | Code organization & conventions | ✓ | Options binding resolved (DeploymentManager-008). POCO/repo placement correct. No new issues. |
-| 9 | Testing coverage | ✓ | Broad coverage added (success, lifecycle, lock serialization, reconciliation, artifact matrix). Re-review: reconciled-success path's missing side effects (DeploymentManager-015) are untested. |
-| 10 | Documentation & comments | ✓ | Prior comment findings resolved. Re-review: `GetDeploymentStatusAsync` XML doc is now stale — DeploymentManager-017. |
+| 1 | Correctness & logic bugs | ✓ | New: reconciliation forces `Enabled` even if the user disabled the instance in between (DeploymentManager-018). |
+| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. No issues. |
+| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting verified. Note: test probes hold static state (DeploymentManager-024) — a test concern, not production code. |
+| 4 | Error handling & resilience | ✓ | New: Disable/Enable/Delete timeouts return early without writing any audit entry — deploy has `DeployFailed`, lifecycle has nothing (DeploymentManager-019). |
+| 5 | Security | ✓ | No new issues. SMTP credential decision documented (DeploymentManager-013 closed). |
+| 6 | Performance & resource management | ✓ | New: `BuildDeployArtifactsCommandAsync` re-queries every system-wide artifact set per site in `DeployToAllSitesAsync` (DeploymentManager-023). |
+| 7 | Design-document adherence | ✓ | Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on `Disabled`-state preservation. |
+| 8 | Code organization & conventions | ✓ | New: redundant `Pending`→`InProgress` back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in `ResolveSiteIdentifierAsync` (DeploymentManager-021). |
+| 9 | Testing coverage | ✓ | New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024). |
+| 10 | Documentation & comments | ✓ | New: `DeployReconciled` audit uses `prior.DeployedBy` instead of the current `user` parameter — misleading for forensics (DeploymentManager-020). |

 ## Findings

@@ -873,3 +905,293 @@ database as a pure local read, and cross-references `TryReconcileWithSiteAsync`
 as where the query-the-site-before-redeploy reconciliation actually lives.
 Documentation-only change; no regression test (a test asserting comment text
 would be meaningless).
+
+### DeploymentManager-018 — Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:675-682,721-748` |
+
+**Description**
+
+`TryReconcileWithSiteAsync` calls `ApplyPostSuccessSideEffectsAsync` whenever
+the site reports it has the target revision hash, and that helper
+unconditionally writes `instance.State = InstanceState.Enabled`. The
+reconciliation shortcut only runs when the prior `DeploymentRecord` is
+`InProgress` or timeout-`Failed` — exactly the scenarios that survive a central
+failover (the in-memory `OperationLockManager` is lost on failover, by design:
+*"Lost on central failover (acceptable per design — in-progress treated as
+failed)"*).
+
+After such a failover, the per-instance operation lock is gone but the
+deployment record is still `InProgress` in the DB. A user can legitimately
+issue `DisableInstanceAsync` for the same instance — there is nothing in
+`DisableInstanceAsync` that consults the deployment record, only the
+`StateTransitionValidator` over `Instance.State`. If the state is `Enabled`
+(the typical case when the deploy started), the disable proceeds, the site
+honours it (the design states a disabled instance retains its deployed
+configuration), and central now persists `Instance.State = Disabled`. The
+deployment-record row remains `InProgress` (no one transitioned it). Later the
+user retries the deploy: `TryReconcileWithSiteAsync` runs, the site still has
+the target revision hash (Disable doesn't change the deployed config), the
+prior record is marked `Success`, and `ApplyPostSuccessSideEffectsAsync` writes
+`Instance.State = Enabled` — silently overriding the user's explicit Disable.
+
+The same trap exists for any direct DB edit / migration that flipped the state
+between the timed-out deploy and the redeploy. The normal deploy path can
+defensibly assume `Enabled` after a fresh successful apply, but the
+reconciliation path is reconciling *prior* state with *prior* user intent; it
+should preserve `Disabled` if that is the current `Instance.State` at the time
+of reconciliation, mirroring the design's separation between deploy (config
+apply) and disable (subscription/script lifecycle).
+
+**Recommendation**
+
+In the reconciliation branch, do not force `Enabled`. Either:
+- Pass a flag/parameter to `ApplyPostSuccessSideEffectsAsync` telling it
+  whether to touch state, and skip the state write on the reconciliation path
+  (leaving the current `Instance.State` intact: `Enabled` for a redeploy that
+  timed out, `Disabled` for the user-disabled follow-up case, but still
+  `NotDeployed` for a first deploy that timed out); or
+- Only set `Enabled` when the current `Instance.State` is `NotDeployed` (i.e.
+  the first-deploy timed-out case), and leave existing `Enabled`/`Disabled`
+  alone.
+
+Add a regression test where an instance with `Instance.State = Disabled` and a
+prior `InProgress` deployment record is reconciled — the resulting
+`Instance.State` must remain `Disabled`, and the deployment record must still
+be marked `Success`.
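+
+A minimal sketch of the second option, assuming `InstanceState` has the three
+values named in this finding. `ReconciliationState` and
+`ResolveStateAfterReconcile` are hypothetical names used for illustration, not
+existing members of `DeploymentService`:
+
+```csharp
+// Hypothetical stand-in for the module's InstanceState enum.
+public enum InstanceState { NotDeployed, Enabled, Disabled }
+
+public static class ReconciliationState
+{
+    // Promote to Enabled only when the instance has never been deployed
+    // (the first deploy timed out); otherwise keep the current state.
+    public static InstanceState ResolveStateAfterReconcile(InstanceState current) =>
+        current == InstanceState.NotDeployed ? InstanceState.Enabled : current;
+}
+```
+
+The reconciliation branch would assign the returned value instead of a literal
+`Enabled`, so a user's `Disabled` survives while a first deploy still ends up
+`Enabled`.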
+
+### DeploymentManager-019 — Lifecycle command timeout writes no audit entry
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458` |
+
+**Description**
+
+`DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync` each
+wrap the `CommunicationService` call in a linked CTS with
+`LifecycleCommandTimeout` (DeploymentManager-012). On timeout they log a
+warning and `return Result<...>.Failure(...)` — and skip the
+`_auditService.LogAsync` call entirely. As a result, an operator-initiated
+disable/enable/delete that times out at the site leaves **no audit trail**:
+the user, the timestamp, the command id, and the failure mode are not
+recorded in the audit log. The deploy path goes out of its way to write a
+`DeployFailed` audit entry on the same failure mode
+(`DeploymentService.cs:274-276`), with `CancellationToken.None` so the write is
+durable; the lifecycle commands do not.
+
+The design lists audit logging as a Deployment Manager responsibility for "all
+deployment actions, system-wide artifact deployments, and instance lifecycle
+changes" — a timed-out lifecycle command **is** an attempted lifecycle change,
+and the operator action is exactly the kind of event the audit log exists to
+record.
+
+**Recommendation**
+
+In each of the three `catch (Exception ex) when (ex is TimeoutException or
+OperationCanceledException)` blocks, write a `DisableTimeout`/`EnableTimeout`/
+`DeleteTimeout` (or use the existing operation name with a failure flag)
+audit entry with `CancellationToken.None` so a cancelled outer token does not
+prevent the audit write, mirroring `DeployFailed`. Add a unit test asserting
+that the `DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait`
+scenario also produces an audit entry.
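+
+One possible shape for the timeout branch, as a sketch only. `IAuditSink`,
+`RecordingAuditSink`, and `LifecycleAudit.RunAsync` are hypothetical names; the
+real `_auditService.LogAsync` signature is not shown in this review and may
+differ:
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Hypothetical audit abstraction; the real _auditService signature may differ.
+public interface IAuditSink
+{
+    Task LogAsync(string user, string action, CancellationToken ct);
+}
+
+// In-memory sink so the sketch can be exercised without a database.
+public sealed class RecordingAuditSink : IAuditSink
+{
+    public List<string> Entries { get; } = new();
+
+    public Task LogAsync(string user, string action, CancellationToken ct)
+    {
+        ct.ThrowIfCancellationRequested();
+        Entries.Add($"{user}:{action}");
+        return Task.CompletedTask;
+    }
+}
+
+public static class LifecycleAudit
+{
+    // Runs a lifecycle command; on timeout or cancellation it records the
+    // attempt before reporting failure to the caller.
+    public static async Task<bool> RunAsync(
+        IAuditSink audit, string user, string operation,
+        Func<CancellationToken, Task> command, CancellationToken ct)
+    {
+        try
+        {
+            await command(ct);
+            return true;
+        }
+        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
+        {
+            // CancellationToken.None: a cancelled outer token must not suppress the audit row.
+            await audit.LogAsync(user, operation + "Timeout", CancellationToken.None);
+            return false;
+        }
+    }
+}
+```
+
+Passing the already-cancelled token to the audit call instead would make the
+audit write throw as well, which is the failure the recommendation is meant to
+avoid.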
+
+### DeploymentManager-020 — `DeployReconciled` audit attributes the action to the prior deployer, not the current user
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:683-686` |
+
+**Description**
+
+In `TryReconcileWithSiteAsync` the audit call is:
+
+```
+await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)
+```
+
+`prior.DeployedBy` is the user who issued the original (timed-out / stuck)
+deployment, not the `user` parameter passed into `DeployInstanceAsync`. The
+current user — the one who triggered the redeploy that produced the
+reconciliation — is dropped on the floor. For audit forensics this is
+misleading: the row will read "user A reconciled their own deployment"
+when in fact user B initiated the action that reconciled it.
+
+The original deployer is interesting context, but it should be carried in the
+audit-detail object (where `DeploymentId` and `RevisionHash` already live), not
+substituted for the actor.
+
+**Recommendation**
+
+Use `user` (the parameter on `DeployInstanceAsync`, threaded through
+`TryReconcileWithSiteAsync`) as the audit actor, and include
+`OriginalDeployer = prior.DeployedBy` in the detail object so the original
+attribution is preserved without misrepresenting who took the action.
+
+### DeploymentManager-021 — `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:107-111` |
+
+**Description**
+
+```
+private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
+{
+    var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
+    return site?.SiteIdentifier ?? siteId.ToString();
+}
+```
+
+If the `Site` row is missing (FK was deleted, race with admin delete, DB
+inconsistency), the method silently returns the numeric DB id rendered as a
+string. This is then passed to `CommunicationService.{Deploy,Disable,Enable,
+Delete}InstanceAsync` and `QueryDeploymentStateAsync` as if it were a real
+`SiteIdentifier` (e.g. "site-a"). The communication layer will fail with an
+"unknown site" or routing error, producing a confusing diagnostic that hides
+the actual problem (no site row).
+
+This is a defensive concern, but every mutating operation in the module goes
+through this method, so a stale instance whose site was deleted will produce a
+misleading error every time it is touched.
+
+**Recommendation**
+
+Treat a missing site as a hard validation failure: return a
+`Result.Failure($"Site with ID {siteId} not found")` early from the calling
+operations, instead of fabricating an identifier. The repository already
+returns `Site?`, so the null path is type-visible; just don't paper over it.
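+
+A sketch of the fail-fast shape, assuming nullable reference types are
+enabled. `SiteLookup` and `SiteResolution.Resolve` are hypothetical stand-ins;
+the module's own `Result` type would take their place:
+
+```csharp
+// Hypothetical minimal result type; the module's own Result<T> may differ.
+public readonly record struct SiteLookup(bool IsSuccess, string? SiteIdentifier, string? Error)
+{
+    public static SiteLookup Ok(string id) => new(true, id, null);
+    public static SiteLookup Fail(string error) => new(false, null, error);
+}
+
+public static class SiteResolution
+{
+    // A missing site row becomes an explicit failure instead of a fabricated identifier.
+    public static SiteLookup Resolve(int siteId, string? siteIdentifierFromDb) =>
+        siteIdentifierFromDb is null
+            ? SiteLookup.Fail($"Site with ID {siteId} not found")
+            : SiteLookup.Ok(siteIdentifierFromDb);
+}
+```
+
+Callers would check `IsSuccess` and return the error before contacting the
+communication layer, so the diagnostic names the missing row directly.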
+
+### DeploymentManager-022 — `Pending` and `InProgress` are written back-to-back with no intervening work
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:178-194` |
+
+**Description**
+
+`DeployInstanceAsync` does:
+
+```
+record.Status = Pending;
+AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
+record.Status = InProgress;
+UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
+```
+
+There is no work between the two writes — flattening, validation, and
+reconciliation have already completed by line 174. The deploy command is sent
+immediately after the `InProgress` write. The `Pending` write therefore costs:
+an extra `SaveChangesAsync` round-trip, an extra `IDeploymentStatusNotifier`
+invocation (which the CentralUI-006 page renders, so the user briefly sees a
+`Pending` flicker before `InProgress`), and an extra row-version bump if EF
+optimistic concurrency is enabled on the table.
+
+The design uses `Pending` to mean "queued, not yet sent" and `InProgress` to
+mean "sent to site, awaiting response". The code's `Pending` slot has no
+queuing — it is set and immediately overwritten — so the state buys nothing
+operationally.
+
+**Recommendation**
+
+Either:
+- Drop the `Pending` write entirely and create the record directly in
+  `InProgress` (one row insert, one notification, simpler UI); or
+- Move the `Pending`→`InProgress` transition to bracket actual queueing/work
+  (e.g. set `Pending` *before* flattening + reconciliation, set `InProgress`
+  immediately before `DeployInstanceAsync` on the comm service) so the two
+  states carry distinguishable semantics worth a separate write.
+
+### DeploymentManager-023 — `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173` |
+
+**Description**
+
+`DeployToAllSitesAsync` loops over sites and calls
+`BuildDeployArtifactsCommandAsync(site.Id, ...)` for each one. Of the six
+artifact sets the method gathers, **only** `dataConnections` is per-site:
+
+- `_templateRepo.GetAllSharedScriptsAsync` — global.
+- `_externalSystemRepo.GetAllExternalSystemsAsync` — global, plus
+  `GetMethodsByExternalSystemIdAsync` per external system per site.
+- `_externalSystemRepo.GetAllDatabaseConnectionsAsync` — global.
+- `_notificationRepo.GetAllNotificationListsAsync` — global.
+- `_notificationRepo.GetAllSmtpConfigurationsAsync` — global.
+- `_siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...)` — **per-site**.
+
+With N sites this issues 5·N queries on the global sets where 5 would do (plus
+M·N method queries where M would do, M being the external-system count). On a
+hub-and-spoke deployment with many sites the artifact-deploy path is slower
+than necessary and holds the DbContext longer than needed. Per CLAUDE.md, the
+DbContext is not thread-safe and the per-site commands are already built
+sequentially (good); the redundant queries are sequential too, so each one
+adds a full database round-trip to the deploy.
+
+**Recommendation**
+
+Hoist the global queries (shared scripts, external systems + their methods,
+DB connections, notification lists, SMTP configurations) out of
+`BuildDeployArtifactsCommandAsync`, fetch them once in `DeployToAllSitesAsync`,
+and pass them in alongside the site id (or expose a
+`BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals)` overload).
+`RetryForSiteAsync` (the single-site path) can keep the convenience-overload
+behaviour. Add a test using NSubstitute's `.Received(1)` to assert
+`_templateRepo.GetAllSharedScriptsAsync` is called exactly once for an
+N-site deployment.
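+
+The hoisting can be sketched with a single global set. `CountingScriptRepo`
+and `ArtifactFanOut.BuildAllAsync` are hypothetical illustrations, and shared
+scripts are reduced to strings here; the real command carries six artifact
+sets:
+
+```csharp
+using System.Collections.Generic;
+using System.Threading.Tasks;
+
+// Hypothetical stand-in for one of the global repositories; counts its calls.
+public sealed class CountingScriptRepo
+{
+    public int Calls { get; private set; }
+
+    public Task<IReadOnlyList<string>> GetAllSharedScriptsAsync()
+    {
+        Calls++;
+        return Task.FromResult<IReadOnlyList<string>>(new[] { "script-a", "script-b" });
+    }
+}
+
+public static class ArtifactFanOut
+{
+    // Fetches the global set once, then reuses it for every per-site command.
+    public static async Task<List<(int SiteId, IReadOnlyList<string> Scripts)>> BuildAllAsync(
+        CountingScriptRepo repo, IEnumerable<int> siteIds)
+    {
+        var scripts = await repo.GetAllSharedScriptsAsync(); // hoisted out of the loop
+        var commands = new List<(int SiteId, IReadOnlyList<string> Scripts)>();
+        foreach (var siteId in siteIds)
+        {
+            commands.Add((siteId, scripts));
+        }
+        return commands;
+    }
+}
+```
+
+The per-site `dataConnections` query would stay inside the loop; only the
+sets that are identical for every site move out.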
+
+### DeploymentManager-024 — Test probe actors hold mutable static state across tests
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075`, `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217` |
+
+**Description**
+
+`ReconcileProbeActor.QueryCount` / `DeployCount`, `SerializationProbeActor.MaxConcurrent`
+/ `_current`, and `ArtifactProbeActor.Received` are all `static` fields.
+Each test's actor constructor resets them, but reset-on-construction only
+works as long as no two tests that use the same probe run concurrently. xUnit
+runs the tests within a class (one test collection) sequentially by default,
+so today's tests pass; reuse a probe from a second test class (separate
+collections run in parallel by default) or adopt a runner that parallelises
+within a class, and the counters race: a deploy in test A could increment
+`DeployCount` while test B is asserting on it.
+
+Static state shared across tests is also why a flaky-test investigation here
+will be unusually painful: the offending interaction is invisible from any
+single test file.
+
+**Recommendation**
+
+Replace the static counters with instance state, hand the actor a probe
+recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg`
+in each test. Where the simpler counter shape is preferred, pass a
+shared-state object into the actor's constructor so each test owns its own
+instance — never reach for `static` mutable test state.
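+
+A sketch of the simpler counter shape, leaving Akka out so it stays
+self-contained. `ProbeCounters` and `DeployProbe` are hypothetical names; in
+the real tests the second class would be the probe actor and would receive the
+counters through its `Props`:
+
+```csharp
+using System.Threading;
+
+// Hypothetical per-test counter object, owned by the test rather than by the type.
+public sealed class ProbeCounters
+{
+    private int _deployCount;
+
+    public int DeployCount => Volatile.Read(ref _deployCount);
+
+    public void RecordDeploy() => Interlocked.Increment(ref _deployCount);
+}
+
+// Stand-in for a probe actor: it receives the counters through its constructor.
+public sealed class DeployProbe
+{
+    private readonly ProbeCounters _counters;
+
+    public DeployProbe(ProbeCounters counters) => _counters = counters;
+
+    public void OnDeployCommand() => _counters.RecordDeploy();
+}
+```
+
+Two tests that each create their own `ProbeCounters` cannot observe each
+other's increments, whatever the runner's parallelism settings.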