# Code Review — DeploymentManager

| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.DeploymentManager` |
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 12 |

## Summary

The DeploymentManager module is small, well-structured, and clearly maps work
packages (WP-N) onto code. The happy paths for instance deployment, lifecycle
commands, artifact broadcast, and staleness comparison are implemented
sensibly, and the operation lock correctly serializes mutating operations per
instance while allowing cross-instance parallelism. However, the review found a
significant cluster of error-handling and resilience gaps: the deployment
record can be left permanently stuck in `InProgress` when an exception other
than timeout/cancellation is thrown, the catch block writes its failure status
using a cancellation token that may already be cancelled, and the
`OperationLockManager` leaks one `SemaphoreSlim` per instance name forever.
There are also two notable design-document adherence gaps: the
"query-the-site-before-redeploy" idempotency requirement is not implemented
(`GetDeploymentStatusAsync` only reads the local DB), and the "Diff View"
feature is reduced to a bare hash comparison with no added/removed/changed
detail. Configuration is not bound to `appsettings.json`, leaving one option
entirely dead. Test coverage stops at the communication boundary and never
exercises a successful deployment or the lifecycle success paths.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Stuck `InProgress` record on unexpected exception; cancelled-token failure write. |
| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` is sound but leaks semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. |
| 4 | Error handling & resilience | ✓ | Several gaps — see DeploymentManager-001/002/003/004. |
| 5 | Security | ✓ | SMTP credentials are serialized and broadcast to sites — see DeploymentManager-013. No injection vectors; no authz here (enforced upstream). |
| 6 | Performance & resource management | ✓ | Semaphore leak (DeploymentManager-005); artifact rebuild does N+1 method queries per external system. |
| 7 | Design-document adherence | ✓ | Missing query-before-redeploy (DeploymentManager-006); Diff View not implemented (DeploymentManager-007). |
| 8 | Code organization & conventions | ✓ | Options class not bound to configuration — DeploymentManager-008. POCO/repo placement correct. |
| 9 | Testing coverage | ✓ | No successful-deploy test, no lifecycle success test — DeploymentManager-011; dead `CreateCommand` helper — DeploymentManager-014. |
| 10 | Documentation & comments | ✓ | Misleading timeout comment — DeploymentManager-009; stale option XML doc — DeploymentManager-012. |

## Findings

### DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in `InProgress`

| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:141-199` |

**Description**

`DeployInstanceAsync` sets the record to `InProgress` (lines 137-139), then the
`try` block calls into `CommunicationService` and the repository. The only
`catch` filter is `when (ex is TimeoutException or OperationCanceledException)`.
Any other exception — `InvalidOperationException` (thrown by
`CommunicationService.GetCommunicationActor()` when the actor is not set), a
JSON serialization error, a deserialization failure of the response, a DB
exception on `UpdateDeploymentRecordAsync`, or any transport error — escapes the
method. The deployment record remains in `DeploymentStatus.InProgress`
permanently. Because staleness and the UI both read current status, the
instance is then misreported as "deploying" forever and a re-deploy may be
blocked or misinterpreted. The design explicitly states an interrupted
deployment must be "treated as failed".

**Recommendation**

Broaden the catch to a general `catch (Exception ex)` that records
`DeploymentStatus.Failed` with the error message, audit-logs the failure, and
re-throws or returns a failed `Result`. Keep the timeout-specific branch only
if a distinct message is desired. Ensure the failure-status write happens for
every exit path out of the `try`.

**Resolution**

Resolved 2026-05-16 (commit `<pending>`): broadened the `catch` in
`DeployInstanceAsync` to `catch (Exception ex)` so any exception (transport,
serialization, DB, `InvalidOperationException` from an uninitialized
`CommunicationService`) marks the deployment record `Failed` with the error
message and audit-logs the failure, instead of escaping and leaving the record
stuck in `InProgress`. Regression test:
`DeployInstanceAsync_CommunicationThrowsUnexpectedException_RecordMarkedFailed`.

### DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token

| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:186-196` |

**Description**

The `catch (Exception ex) when (ex is TimeoutException or
OperationCanceledException)` block updates the record to `Failed` and calls
`UpdateDeploymentRecordAsync`/`SaveChangesAsync`/`LogAsync` passing the same
`cancellationToken` that was just cancelled (an `OperationCanceledException`
caught here means the token is already in the cancelled state). Those
repository and audit calls will themselves throw `OperationCanceledException`
before the failure status is persisted, so the record stays `InProgress` — the
exact bug DeploymentManager-001 describes, reached via the supposedly-handled
path.

**Recommendation**

Perform the cleanup writes with a fresh, non-cancellable token (e.g.
`CancellationToken.None`, optionally with an independent short timeout) so the
failure status is durably recorded even when the original operation was
cancelled or timed out.

**Resolution**

Resolved 2026-05-16 (commit `<pending>`): the broadened `catch` block now
performs the failure-status write (`UpdateDeploymentRecordAsync`,
`SaveChangesAsync`) and the audit `LogAsync` with `CancellationToken.None`
instead of the operation's (possibly-cancelled) token, so the `Failed` status
is durably recorded even after a timeout/cancellation. The cleanup writes are
themselves wrapped in a `try`/`catch` that logs (without masking the original
error) if persistence still fails. Regression test:
`DeployInstanceAsync_FailureWrite_UsesNonCancellableToken`.

### DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170` |

**Description**

After a successful site response the code calls `UpdateDeploymentRecordAsync`
(no `SaveChanges` yet), then `UpdateInstanceAsync`, then
`StoreDeployedSnapshotAsync` (which itself issues `Add`/`Update` calls), then a
single `SaveChangesAsync` at line 170. If `StoreDeployedSnapshotAsync` throws,
the exception is not caught (see DeploymentManager-001) and the
`SaveChangesAsync` never runs — the instance state, deployment status, and
snapshot are all left unpersisted even though the site has actually applied the
deployment. Central and site are now divergent: the site is running the new
config but central still shows the old state and a non-`Success` deployment
record.

**Recommendation**

Wrap the post-success persistence so that, at minimum, the deployment record's
`Success` status is committed. Consider committing the status first, then the
instance state and snapshot, so a later failure does not lose the fact that the
site succeeded. Log loudly if the snapshot write fails after a confirmed site
apply.

**Resolution**

_Unresolved._

### DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319` |

**Description**

In `DeleteInstanceAsync`, when the site responds `Success` the code calls
`_repository.DeleteInstanceAsync` then `SaveChangesAsync`. If `SaveChangesAsync`
throws (DB error, concurrency), the exception propagates uncaught: the site has
already destroyed the Instance Actor and removed its config, but the central
instance record still exists. The instance is now un-deletable through the
normal path (the site no longer has it, so a re-issued delete may fail) and is
permanently orphaned. The design states central must not mark the instance
deleted until the site confirms — but it does not address the inverse failure.

**Recommendation**

Catch persistence failures in the post-success block and surface a distinct
error indicating the site succeeded but the central record could not be
removed, so an operator/retry can reconcile. Consider making the central delete
idempotent and retryable independently of the site command.

**Resolution**

_Unresolved._

### DeploymentManager-005 — `OperationLockManager` leaks a `SemaphoreSlim` per instance name

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33` |

**Description**

`AcquireAsync` does `_locks.GetOrAdd(instanceUniqueName, _ => new
SemaphoreSlim(1, 1))` and entries are never removed. Every distinct instance
unique name that is ever deployed/disabled/enabled/deleted permanently adds a
`SemaphoreSlim` (an `IDisposable` holding a kernel wait handle) to the
dictionary. Over the lifetime of a long-running central process — especially
with the bulk "deploy all out-of-date instances" workflow and instances that
are created and deleted over time — this is an unbounded leak of both managed
memory and OS handles. Deleted instances' semaphores are never reclaimed.

**Recommendation**

Either accept the leak explicitly and document the expected bounded cardinality
of instance names, or implement reclamation: e.g. ref-count handles and remove
+ `Dispose()` the semaphore when the count reaches zero and the lock is free.
At minimum, remove the semaphore entry when an instance is deleted
(`DeleteInstanceAsync`).

**Resolution**

_Unresolved._

### DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented

| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:84-200,363-368` |

**Description**

The design ("Deployment Identity & Idempotency") requires: "After a central
failover or timeout, the Deployment Manager queries the site for current
deployment state before allowing a re-deploy. This prevents duplicate
application and out-of-order config changes." The code never does this.
`GetDeploymentStatusAsync` only reads the local `DeploymentRecord` from the DB
(`GetDeploymentByDeploymentIdAsync`) — it does not contact the site.
`DeployInstanceAsync` unconditionally generates a new deployment ID and sends a
new `DeployInstanceCommand` regardless of any prior in-flight or timed-out
deployment. After a timeout where the site actually applied the config, a
re-deploy produces a second deployment with no reconciliation against the
site's current revision hash. Site-side stale-rejection is the only safety
net, and that is not verified here.

**Recommendation**

Add a site query (a new `CommunicationService` pattern returning the site's
currently-applied deployment ID / revision hash) and call it before re-deploy
when a prior record for the instance is in `InProgress`/`Failed` due to
timeout. Reconcile: if the site already has the target revision, mark the prior
record `Success` instead of re-sending. Either implement this or update the
design doc to reflect that reconciliation is delegated entirely to site-side
stale-rejection.

**Resolution**

_Unresolved._ Finding confirmed valid against the source — `GetDeploymentStatusAsync`
only reads the local `DeploymentRecord` via `GetDeploymentByDeploymentIdAsync`,
and `DeployInstanceAsync` unconditionally generates a new deployment ID with no
site reconciliation. Left Open: a proper fix is a cross-module new feature, not
a bug fix scoped to `ScadaLink.DeploymentManager`. It requires (1) a new
request/response message contract in `ScadaLink.Commons`, (2) a new
`CommunicationService` query method in `ScadaLink.Communication`, and (3)
site-side handling of the query — all outside the DeploymentManager module — plus
a design decision on the query protocol. The reconciliation logic in
`DeploymentService` cannot be implemented without those. Recommend tracking as a
dedicated cross-module feature work item (or, alternatively, amending the design
doc to delegate reconciliation entirely to site-side stale-rejection — also
outside this module's editable scope).

### DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406` |

**Description**

The design ("Diff View" and "Dependencies" sections) states the Deployment
Manager can request a diff from the Template Engine showing added/removed
members, changed values, and connection-binding changes.
`GetDeploymentComparisonAsync` and `DeploymentComparisonResult` only compare two
revision hashes and return a boolean `IsStale` plus the two hashes. No
added/removed/changed detail is produced, and the Template Engine's diff
capability is not invoked. The UI cannot render a meaningful diff from this
result.

**Recommendation**

Either implement a real diff (deserialize the stored
`DeployedConfigSnapshot.ConfigurationJson` and the freshly flattened config and
invoke the Template Engine's diff service, surfacing structured
added/removed/changed entries), or revise the design doc to scope the feature
down to staleness detection only.

**Resolution**

_Unresolved._

### DeploymentManager-008 — `DeploymentManagerOptions` is never bound to configuration

| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14` |

**Description**

`AddDeploymentManager` registers the services but never calls
`services.Configure<DeploymentManagerOptions>(configuration.GetSection(...))`.
`IOptions<DeploymentManagerOptions>` therefore always resolves to a
default-constructed instance — the operation-lock and artifact-deployment
timeouts cannot be tuned via `appsettings.json`, contrary to the CLAUDE.md
convention "Per-component configuration via `appsettings.json` sections bound
to options classes (Options pattern)." `Host/Program.cs` binds
`SecurityOptions` and `InboundApiOptions` from configuration sections but has
no equivalent for `DeploymentManagerOptions`.

**Recommendation**

Add an `IConfiguration` parameter (or a configure callback) to
`AddDeploymentManager` and bind `DeploymentManagerOptions` to a section such as
`ScadaLink:DeploymentManager`, consistent with the other components.

**Resolution**

_Unresolved._

### DeploymentManager-009 — Misleading timeout comment on `DeleteInstanceAsync`

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:288` |

**Description**

The XML doc says "Delete fails if site unreachable (30s timeout via
CommunicationOptions)." The actual delete timeout is whatever
`CommunicationOptions.LifecycleTimeout` is configured to (passed inside
`CommunicationService.DeleteInstanceAsync`); the "30s" figure is hard-coded
into the comment and not derived from any constant in this module. If
`LifecycleTimeout` is reconfigured, the comment becomes wrong. It also wrongly
implies the value lives in this module.

**Recommendation**

Reword to "Delete fails if the site is unreachable within
`CommunicationOptions.LifecycleTimeout`" without quoting a specific number.

**Resolution**

_Unresolved._

### DeploymentManager-010 — `SystemArtifactDeploymentRecord` does not persist the deployment ID

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:136,194-211` |

**Description**

`DeployToAllSitesAsync` generates a `deploymentId` (line 136) and returns it in
the `ArtifactDeploymentSummary` and audit log, but the persisted
`SystemArtifactDeploymentRecord` has no field for it (the entity only has `Id`,
`ArtifactType`, `DeployedBy`, `DeployedAt`, `PerSiteStatus`). The deployment ID
that appears in the UI summary and audit log cannot be correlated back to the
stored record. Additionally each per-site `DeployArtifactsCommand` carries its
own separate GUID (`BuildDeployArtifactsCommandAsync` line 114), so there are in
fact N+1 unrelated IDs for one logical artifact deployment.

**Recommendation**

Add a `DeploymentId` column to `SystemArtifactDeploymentRecord` and store the
single logical `deploymentId`; reuse that ID (or a derived per-site ID) for the
per-site commands so the audit log, UI summary, and persisted record agree.

**Resolution**

_Unresolved._

### DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path

| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199` |

**Description**

`DeploymentServiceTests` never sets the `CommunicationService` actor, so every
deploy/lifecycle test deliberately stops at the `InvalidOperationException`
thrown by `GetCommunicationActor()` (see lines 118-125, 147). As a result there
is no test covering: a successful deployment (`DeploymentStatus.Success`
response → instance state set to `Enabled`, snapshot stored, audit logged); a
failed-but-handled site response; the `InProgress`-stuck bug
(DeploymentManager-001); successful Disable/Enable/Delete; or the operation
lock actually serializing two concurrent deploys of the same instance. The
critical post-response branch (`DeploymentService.cs:154-184`) and the entire
delete/disable/enable success path are untested. The `AuditLogs` test
(lines 277-289) asserts nothing.

**Recommendation**

Introduce a seam to inject a fake/substitute communication path (e.g. an
interface over `CommunicationService`, or wire a TestKit actor) so success and
handled-failure paths can be unit tested. Add tests for the stuck-`InProgress`
scenario and for per-instance lock contention during deploy. Make the audit
test assert on `IAuditService.LogAsync`.

**Resolution**

_Unresolved._

### DeploymentManager-012 — `LifecycleCommandTimeout` option is dead code

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentManagerOptions.cs:8-9` |

**Description**

`DeploymentManagerOptions.LifecycleCommandTimeout` is declared with a 30s
default and an XML doc, but it is never read anywhere in the codebase
(lifecycle commands rely on `CommunicationOptions.LifecycleTimeout` inside
`CommunicationService`). The option misleads readers into thinking it controls
disable/enable/delete timeouts, when setting it has no effect.

**Recommendation**

Remove `LifecycleCommandTimeout`, or actually thread it through to the
lifecycle command calls (e.g. by creating a linked CTS with this timeout in
`DisableInstanceAsync`/`EnableInstanceAsync`/`DeleteInstanceAsync`, the way
`ArtifactDeploymentTimeoutPerSite` is used).

**Resolution**

_Unresolved._

### DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites

| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:108-111` |

**Description**

`BuildDeployArtifactsCommandAsync` maps `smtp.Credentials` directly into
`SmtpConfigurationArtifact` and that command is sent to every site. Distributing
SMTP credentials to sites is consistent with the design (SMTP configuration is
a deployable artifact), but the credentials travel inside a serialized command
across the inter-cluster transport and are stored on each site's SQLite. There
is no indication the value is encrypted at rest on the site or scrubbed from
logs. Worth confirming the transport is TLS-protected and the site stores the
credential securely; at minimum this should be a conscious, documented decision.

**Recommendation**

Confirm inter-cluster transport encryption covers artifact commands, ensure
`Credentials` is never written to logs, and document the at-rest protection of
SMTP credentials on site SQLite. Consider encrypting the credential field
within the artifact payload.

**Resolution**

_Unresolved._

### DeploymentManager-014 — Dead `CreateCommand` helper in artifact tests

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90` |

**Description**

The private static `CreateCommand()` helper is never referenced by any test in
the file. It is dead code that suggests an intended test (e.g. a successful
multi-site artifact deployment) was never written — coverage of
`DeployToAllSitesAsync` is limited to the no-sites failure case, and
`RetryForSiteAsync` and `BuildDeployArtifactsCommandAsync` have no tests at all.

**Recommendation**

Either remove the unused helper or, preferably, write the missing tests for
`DeployToAllSitesAsync` (per-site success/failure matrix, partial failure) and
`RetryForSiteAsync` using it.

**Resolution**

_Unresolved._