docs: add code review process and baseline review of all 19 modules

Establishes a per-module code review workflow under code-reviews/ and records the 2026-05-16 baseline review (commit 9c60592): 241 findings across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low). This is the clean starting point for remediation work.
2026-05-16 18:09:09 -04:00
parent 9c60592632
commit 977d7369a7
23 changed files with 8899 additions and 0 deletions
--- a/code-reviews/DeploymentManager/findings.md
+++ b/code-reviews/DeploymentManager/findings.md
@@ -0,0 +1,493 @@
+# Code Review — DeploymentManager
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.DeploymentManager` |
+| Design doc | `docs/requirements/Component-DeploymentManager.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-16 |
+| Reviewer | claude-agent |
+| Commit reviewed | `9c60592` |
+| Open findings | 14 |
+
+## Summary
+
+The DeploymentManager module is small, well-structured, and clearly maps work
+packages (WP-N) onto code. The happy paths for instance deployment, lifecycle
+commands, artifact broadcast, and staleness comparison are implemented
+sensibly, and the operation lock correctly serializes mutating operations per
+instance while allowing cross-instance parallelism. However, the review found a
+significant cluster of error-handling and resilience gaps: the deployment
+record can be left permanently stuck in `InProgress` when an exception other
+than timeout/cancellation is thrown, the catch block writes its failure status
+using a cancellation token that may already be cancelled, and the
+`OperationLockManager` leaks one `SemaphoreSlim` per instance name forever.
+There are also two notable design-document adherence gaps: the
+"query-the-site-before-redeploy" idempotency requirement is not implemented
+(`GetDeploymentStatusAsync` only reads the local DB), and the "Diff View"
+feature is reduced to a bare hash comparison with no added/removed/changed
+detail. Configuration is not bound to `appsettings.json`, leaving one option
+entirely dead. Test coverage stops at the communication boundary and never
+exercises a successful deployment or the lifecycle success paths.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | Stuck `InProgress` record on unexpected exception; cancelled-token failure write. |
+| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
+| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` is sound but leaks semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. |
+| 4 | Error handling & resilience | ✓ | Several gaps — see DeploymentManager-001/002/003/004. |
+| 5 | Security | ✓ | SMTP credentials are serialized and broadcast to sites — see DeploymentManager-013. No injection vectors; no authz here (enforced upstream). |
+| 6 | Performance & resource management | ✓ | Semaphore leak (DeploymentManager-005); artifact rebuild does N+1 method queries per external system. |
+| 7 | Design-document adherence | ✓ | Missing query-before-redeploy (DeploymentManager-006); Diff View not implemented (DeploymentManager-007). |
+| 8 | Code organization & conventions | ✓ | Options class not bound to configuration — DeploymentManager-008. POCO/repo placement correct. |
+| 9 | Testing coverage | ✓ | No successful-deploy test, no lifecycle success test — DeploymentManager-011; dead `CreateCommand` helper — DeploymentManager-014. |
+| 10 | Documentation & comments | ✓ | Misleading timeout comment — DeploymentManager-009; stale option XML doc — DeploymentManager-012. |
+
+## Findings
+
+### DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in `InProgress`
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:141-199` |
+
+**Description**
+
+`DeployInstanceAsync` sets the record to `InProgress` (lines 137-139), then the
+`try` block calls into `CommunicationService` and the repository. The only
+`catch` filter is `when (ex is TimeoutException or OperationCanceledException)`.
+Any other exception — `InvalidOperationException` (thrown by
+`CommunicationService.GetCommunicationActor()` when the actor is not set), a
+JSON serialization error, a deserialization failure of the response, a DB
+exception on `UpdateDeploymentRecordAsync`, or any transport error — escapes the
+method. The deployment record remains in `DeploymentStatus.InProgress`
+permanently. Because staleness and the UI both read current status, the
+instance is then misreported as "deploying" forever and a re-deploy may be
+blocked or misinterpreted. The design explicitly states an interrupted
+deployment must be "treated as failed".
+
+**Recommendation**
+
+Broaden the catch to a general `catch (Exception ex)` that records
+`DeploymentStatus.Failed` with the error message, audit-logs the failure, and
+re-throws or returns a failed `Result`. Keep the timeout-specific branch only
+if a distinct message is desired. Ensure the failure-status write happens for
+every exit path out of the `try`.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:186-196` |
+
+**Description**
+
+The `catch (Exception ex) when (ex is TimeoutException or
+OperationCanceledException)` block updates the record to `Failed` and calls
+`UpdateDeploymentRecordAsync`/`SaveChangesAsync`/`LogAsync` passing the same
+`cancellationToken` that was just cancelled (an `OperationCanceledException`
+caught here means the token is already in the cancelled state). Those
+repository and audit calls will themselves throw `OperationCanceledException`
+before the failure status is persisted, so the record stays `InProgress` — the
+exact bug DeploymentManager-001 describes, reached via the supposedly-handled
+path.
+
+**Recommendation**
+
+Perform the cleanup writes with a fresh, non-cancellable token (e.g.
+`CancellationToken.None`, optionally with an independent short timeout) so the
+failure status is durably recorded even when the original operation was
+cancelled or timed out.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170` |
+
+**Description**
+
+After a successful site response the code calls `UpdateDeploymentRecordAsync`
+(no `SaveChanges` yet), then `UpdateInstanceAsync`, then
+`StoreDeployedSnapshotAsync` (which itself issues `Add`/`Update` calls), then a
+single `SaveChangesAsync` at line 170. If `StoreDeployedSnapshotAsync` throws,
+the exception is not caught (see DeploymentManager-001) and the
+`SaveChangesAsync` never runs — the instance state, deployment status, and
+snapshot are all left unpersisted even though the site has actually applied the
+deployment. Central and site are now divergent: the site is running the new
+config but central still shows the old state and a non-`Success` deployment
+record.
+
+**Recommendation**
+
+Wrap the post-success persistence so that, at minimum, the deployment record's
+`Success` status is committed. Consider committing the status first, then the
+instance state and snapshot, so a later failure does not lose the fact that the
+site succeeded. Log loudly if the snapshot write fails after a confirmed site
+apply.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319` |
+
+**Description**
+
+In `DeleteInstanceAsync`, when the site responds `Success` the code calls
+`_repository.DeleteInstanceAsync` then `SaveChangesAsync`. If `SaveChangesAsync`
+throws (DB error, concurrency), the exception propagates uncaught: the site has
+already destroyed the Instance Actor and removed its config, but the central
+instance record still exists. The instance is now un-deletable through the
+normal path (the site no longer has it, so a re-issued delete may fail) and is
+permanently orphaned. The design states central must not mark the instance
+deleted until the site confirms — but it does not address the inverse failure.
+
+**Recommendation**
+
+Catch persistence failures in the post-success block and surface a distinct
+error indicating the site succeeded but the central record could not be
+removed, so an operator/retry can reconcile. Consider making the central delete
+idempotent and retryable independently of the site command.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-005 — `OperationLockManager` leaks a `SemaphoreSlim` per instance name
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33` |
+
+**Description**
+
+`AcquireAsync` does `_locks.GetOrAdd(instanceUniqueName, _ => new
+SemaphoreSlim(1, 1))` and entries are never removed. Every distinct instance
+unique name that is ever deployed/disabled/enabled/deleted permanently adds a
+`SemaphoreSlim` (an `IDisposable` holding a kernel wait handle) to the
+dictionary. Over the lifetime of a long-running central process — especially
+with the bulk "deploy all out-of-date instances" workflow and instances that
+are created and deleted over time — this is an unbounded leak of both managed
+memory and OS handles. Deleted instances' semaphores are never reclaimed.
+
+**Recommendation**
+
+Either accept the leak explicitly and document the expected bounded cardinality
+of instance names, or implement reclamation: e.g. ref-count handles and remove
+ `Dispose()` the semaphore when the count reaches zero and the lock is free.
+At minimum, remove the semaphore entry when an instance is deleted
+(`DeleteInstanceAsync`).
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:84-200,363-368` |
+
+**Description**
+
+The design ("Deployment Identity & Idempotency") requires: "After a central
+failover or timeout, the Deployment Manager queries the site for current
+deployment state before allowing a re-deploy. This prevents duplicate
+application and out-of-order config changes." The code never does this.
+`GetDeploymentStatusAsync` only reads the local `DeploymentRecord` from the DB
+(`GetDeploymentByDeploymentIdAsync`) — it does not contact the site.
+`DeployInstanceAsync` unconditionally generates a new deployment ID and sends a
+new `DeployInstanceCommand` regardless of any prior in-flight or timed-out
+deployment. After a timeout where the site actually applied the config, a
+re-deploy produces a second deployment with no reconciliation against the
+site's current revision hash. Site-side stale-rejection is the only safety
+net, and that is not verified here.
+
+**Recommendation**
+
+Add a site query (a new `CommunicationService` pattern returning the site's
+currently-applied deployment ID / revision hash) and call it before re-deploy
+when a prior record for the instance is in `InProgress`/`Failed` due to
+timeout. Reconcile: if the site already has the target revision, mark the prior
+record `Success` instead of re-sending. Either implement this or update the
+design doc to reflect that reconciliation is delegated entirely to site-side
+stale-rejection.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406` |
+
+**Description**
+
+The design ("Diff View" and "Dependencies" sections) states the Deployment
+Manager can request a diff from the Template Engine showing added/removed
+members, changed values, and connection-binding changes.
+`GetDeploymentComparisonAsync` and `DeploymentComparisonResult` only compare two
+revision hashes and return a boolean `IsStale` plus the two hashes. No
+added/removed/changed detail is produced, and the Template Engine's diff
+capability is not invoked. The UI cannot render a meaningful diff from this
+result.
+
+**Recommendation**
+
+Either implement a real diff (deserialize the stored
+`DeployedConfigSnapshot.ConfigurationJson` and the freshly flattened config and
+invoke the Template Engine's diff service, surfacing structured
+added/removed/changed entries), or revise the design doc to scope the feature
+down to staleness detection only.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-008 — `DeploymentManagerOptions` is never bound to configuration
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14` |
+
+**Description**
+
+`AddDeploymentManager` registers the services but never calls
+`services.Configure<DeploymentManagerOptions>(configuration.GetSection(...))`.
+`IOptions<DeploymentManagerOptions>` therefore always resolves to a
+default-constructed instance — the operation-lock and artifact-deployment
+timeouts cannot be tuned via `appsettings.json`, contrary to the CLAUDE.md
+convention "Per-component configuration via `appsettings.json` sections bound
+to options classes (Options pattern)." `Host/Program.cs` binds
+`SecurityOptions` and `InboundApiOptions` from configuration sections but has
+no equivalent for `DeploymentManagerOptions`.
+
+**Recommendation**
+
+Add an `IConfiguration` parameter (or a configure callback) to
+`AddDeploymentManager` and bind `DeploymentManagerOptions` to a section such as
+`ScadaLink:DeploymentManager`, consistent with the other components.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-009 — Misleading timeout comment on `DeleteInstanceAsync`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:288` |
+
+**Description**
+
+The XML doc says "Delete fails if site unreachable (30s timeout via
+CommunicationOptions)." The actual delete timeout is whatever
+`CommunicationOptions.LifecycleTimeout` is configured to (passed inside
+`CommunicationService.DeleteInstanceAsync`); the "30s" figure is hard-coded
+into the comment and not derived from any constant in this module. If
+`LifecycleTimeout` is reconfigured, the comment becomes wrong. It also wrongly
+implies the value lives in this module.
+
+**Recommendation**
+
+Reword to "Delete fails if the site is unreachable within
+`CommunicationOptions.LifecycleTimeout`" without quoting a specific number.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-010 — `SystemArtifactDeploymentRecord` does not persist the deployment ID
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:136,194-211` |
+
+**Description**
+
+`DeployToAllSitesAsync` generates a `deploymentId` (line 136) and returns it in
+the `ArtifactDeploymentSummary` and audit log, but the persisted
+`SystemArtifactDeploymentRecord` has no field for it (the entity only has `Id`,
+`ArtifactType`, `DeployedBy`, `DeployedAt`, `PerSiteStatus`). The deployment ID
+that appears in the UI summary and audit log cannot be correlated back to the
+stored record. Additionally each per-site `DeployArtifactsCommand` carries its
+own separate GUID (`BuildDeployArtifactsCommandAsync` line 114), so there are in
+fact N+1 unrelated IDs for one logical artifact deployment.
+
+**Recommendation**
+
+Add a `DeploymentId` column to `SystemArtifactDeploymentRecord` and store the
+single logical `deploymentId`; reuse that ID (or a derived per-site ID) for the
+per-site commands so the audit log, UI summary, and persisted record agree.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199` |
+
+**Description**
+
+`DeploymentServiceTests` never sets the `CommunicationService` actor, so every
+deploy/lifecycle test deliberately stops at the `InvalidOperationException`
+thrown by `GetCommunicationActor()` (see lines 118-125, 147). As a result there
+is no test covering: a successful deployment (`DeploymentStatus.Success`
+response → instance state set to `Enabled`, snapshot stored, audit logged); a
+failed-but-handled site response; the `InProgress`-stuck bug
+(DeploymentManager-001); successful Disable/Enable/Delete; or the operation
+lock actually serializing two concurrent deploys of the same instance. The
+critical post-response branch (`DeploymentService.cs:154-184`) and the entire
+delete/disable/enable success path are untested. The `AuditLogs` test
+(lines 277-289) asserts nothing.
+
+**Recommendation**
+
+Introduce a seam to inject a fake/substitute communication path (e.g. an
+interface over `CommunicationService`, or wire a TestKit actor) so success and
+handled-failure paths can be unit tested. Add tests for the stuck-`InProgress`
+scenario and for per-instance lock contention during deploy. Make the audit
+test assert on `IAuditService.LogAsync`.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-012 — `LifecycleCommandTimeout` option is dead code
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentManagerOptions.cs:8-9` |
+
+**Description**
+
+`DeploymentManagerOptions.LifecycleCommandTimeout` is declared with a 30s
+default and an XML doc, but it is never read anywhere in the codebase
+(lifecycle commands rely on `CommunicationOptions.LifecycleTimeout` inside
+`CommunicationService`). The option misleads readers into thinking it controls
+disable/enable/delete timeouts, when setting it has no effect.
+
+**Recommendation**
+
+Remove `LifecycleCommandTimeout`, or actually thread it through to the
+lifecycle command calls (e.g. by creating a linked CTS with this timeout in
+`DisableInstanceAsync`/`EnableInstanceAsync`/`DeleteInstanceAsync`, the way
+`ArtifactDeploymentTimeoutPerSite` is used).
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:108-111` |
+
+**Description**
+
+`BuildDeployArtifactsCommandAsync` maps `smtp.Credentials` directly into
+`SmtpConfigurationArtifact` and that command is sent to every site. Distributing
+SMTP credentials to sites is consistent with the design (SMTP configuration is
+a deployable artifact), but the credentials travel inside a serialized command
+across the inter-cluster transport and are stored on each site's SQLite. There
+is no indication the value is encrypted at rest on the site or scrubbed from
+logs. Worth confirming the transport is TLS-protected and the site stores the
+credential securely; at minimum this should be a conscious, documented decision.
+
+**Recommendation**
+
+Confirm inter-cluster transport encryption covers artifact commands, ensure
+`Credentials` is never written to logs, and document the at-rest protection of
+SMTP credentials on site SQLite. Consider encrypting the credential field
+within the artifact payload.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-014 — Dead `CreateCommand` helper in artifact tests
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90` |
+
+**Description**
+
+The private static `CreateCommand()` helper is never referenced by any test in
+the file. It is dead code that suggests an intended test (e.g. a successful
+multi-site artifact deployment) was never written — coverage of
+`DeployToAllSitesAsync` is limited to the no-sites failure case, and
+`RetryForSiteAsync` and `BuildDeployArtifactsCommandAsync` have no tests at all.
+
+**Recommendation**
+
+Either remove the unused helper or, preferably, write the missing tests for
+`DeployToAllSitesAsync` (per-site success/failure matrix, partial failure) and
+`RetryForSiteAsync` using it.
+
+**Resolution**
+
+_Unresolved._