# Code Review — DeploymentManager

| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.DeploymentManager` |
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |

## Summary

The DeploymentManager module is small, well-structured, and clearly maps work
packages (WP-N) onto code. The happy paths for instance deployment, lifecycle
commands, artifact broadcast, and staleness comparison are implemented
sensibly, and the operation lock correctly serializes mutating operations per
instance while allowing cross-instance parallelism. However, the review found a
significant cluster of error-handling and resilience gaps: the deployment
record can be left permanently stuck in `InProgress` when an exception other
than timeout/cancellation is thrown, the catch block writes its failure status
using a cancellation token that may already be cancelled, and the
`OperationLockManager` leaks one `SemaphoreSlim` per instance name forever.
There are also two notable design-document adherence gaps: the
"query-the-site-before-redeploy" idempotency requirement is not implemented
(`GetDeploymentStatusAsync` only reads the local DB), and the "Diff View"
feature is reduced to a bare hash comparison with no added/removed/changed
detail. Configuration is not bound to `appsettings.json`, and one option
(`LifecycleCommandTimeout`) is never read. Test coverage stops at the communication boundary and never
exercises a successful deployment or the lifecycle success paths.

#### Re-review 2026-05-17 (commit `39d737e`)

Re-reviewed at commit `39d737e` after the batch of fixes for
DeploymentManager-001..014. All fourteen prior findings remain `Resolved` and
verified against source — the broadened catch, non-cancellable cleanup writes,
ref-counted `OperationLockManager`, query-before-redeploy reconciliation,
structured diff, options binding, and the expanded TestKit-actor test suite are
all present and correct. The module is in markedly better shape than at the
first review: error paths are now defensively handled and test coverage is
broad (successful deploy/lifecycle, lock serialization, reconciliation
matrix, artifact per-site matrix).

This re-review found **3 new findings**, all clustered on the
DeploymentManager-006 reconciliation path added since the last review. The
reconciliation shortcut (`TryReconcileWithSiteAsync`) marks a stale prior
record `Success` when the site already has the target revision, but it does
**not** perform the side effects the normal success path does — it never
updates the instance `State`, never refreshes the `DeployedConfigSnapshot`,
and never corrects the prior record's own `RevisionHash` (DeploymentManager-015,
DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
it still describes the query-before-redeploy behaviour that actually moved into
`TryReconcileWithSiteAsync` (DeploymentManager-017).

#### Re-review 2026-05-28 (commit `1eb6e97`)

Re-reviewed at commit `1eb6e97` after the DeploymentManager-015/016/017 fixes
and a docs-only XML-comment pass. The three prior findings remain `Resolved`
and verified — `ApplyPostSuccessSideEffectsAsync` is now invoked from both the
normal success path and `TryReconcileWithSiteAsync`, the reconciled-success
branch corrects `prior.RevisionHash` to the target, and `GetDeploymentStatusAsync`'s
XML doc now describes the local-DB-read it actually performs and cross-refs the
reconciliation helper. The DiffService wiring, options binding, ref-counted
operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor
test seam are still in place. The 7 new findings here are not regressions in
the DeploymentManager-015/016 fixes — they are issues uncovered by widening
the lens to the lifecycle paths, reconciliation's interaction with
intentional `Disabled` state, audit semantics, and operational concerns
(per-site artifact-build cost, Pending→InProgress double-write).

The single notable correctness issue is DeploymentManager-018: the
reconciliation shortcut unconditionally sets `instance.State = Enabled` via
`ApplyPostSuccessSideEffectsAsync`. After a central failover that loses the
in-memory operation lock, a user can legitimately `Disable` an instance whose
prior deploy record is still `InProgress`; a subsequent redeploy then reconciles
and silently re-enables the instance against the user's explicit intent.
The remaining six findings are medium/low: lifecycle-timeout audit gap
(DeploymentManager-019), audit-user attribution in reconciliation
(DeploymentManager-020), silent fallback in `ResolveSiteIdentifierAsync`
(DeploymentManager-021), back-to-back `Pending`→`InProgress` writes
(DeploymentManager-022), per-site re-query of system-wide artifacts
(DeploymentManager-023), and shared static state across `*ProbeActor` tests
(DeploymentManager-024).

## Checklist coverage

#### Re-review 2026-05-28 (commit `1eb6e97`)

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | New: reconciliation forces `Enabled` even if the user disabled the instance in between (DeploymentManager-018). |
| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting verified. Note: test probes hold static state (DeploymentManager-024) — a test concern, not production code. |
| 4 | Error handling & resilience | ✓ | New: Disable/Enable/Delete timeouts return early without writing any audit entry — deploy has `DeployFailed`, lifecycle has nothing (DeploymentManager-019). |
| 5 | Security | ✓ | No new issues. SMTP credential decision documented (DeploymentManager-013 closed). |
| 6 | Performance & resource management | ✓ | New: `BuildDeployArtifactsCommandAsync` re-queries every system-wide artifact set per site in `DeployToAllSitesAsync` (DeploymentManager-023). |
| 7 | Design-document adherence | ✓ | Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on `Disabled`-state preservation. |
| 8 | Code organization & conventions | ✓ | New: redundant `Pending`→`InProgress` back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in `ResolveSiteIdentifierAsync` (DeploymentManager-021). |
| 9 | Testing coverage | ✓ | New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024). |
| 10 | Documentation & comments | ✓ | New: `DeployReconciled` audit uses `prior.DeployedBy` instead of the current `user` parameter — misleading for forensics (DeploymentManager-020). |

## Findings

### DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in `InProgress`

| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:141-199` |

**Description**

`DeployInstanceAsync` sets the record to `InProgress` (lines 137-139), then the
`try` block calls into `CommunicationService` and the repository. The only
`catch` filter is `when (ex is TimeoutException or OperationCanceledException)`.
Any other exception — `InvalidOperationException` (thrown by
`CommunicationService.GetCommunicationActor()` when the actor is not set), a
JSON serialization error, a deserialization failure of the response, a DB
exception on `UpdateDeploymentRecordAsync`, or any transport error — escapes the
method. The deployment record remains in `DeploymentStatus.InProgress`
permanently. Because staleness and the UI both read current status, the
instance is then misreported as "deploying" forever and a re-deploy may be
blocked or misinterpreted. The design explicitly states an interrupted
deployment must be "treated as failed".

**Recommendation**

Broaden the catch to a general `catch (Exception ex)` that records
`DeploymentStatus.Failed` with the error message, audit-logs the failure, and
re-throws or returns a failed `Result`. Keep the timeout-specific branch only
if a distinct message is desired. Ensure the failure-status write happens for
every exit path out of the `try`.

**Resolution**

Resolved 2026-05-16 (commit pending): broadened the `catch` in
`DeployInstanceAsync` to `catch (Exception ex)` so any exception (transport,
serialization, DB, `InvalidOperationException` from an uninitialized
`CommunicationService`) marks the deployment record `Failed` with the error
message and audit-logs the failure, instead of escaping and leaving the record
stuck in `InProgress`. Regression test:
`DeployInstanceAsync_CommunicationThrowsUnexpectedException_RecordMarkedFailed`.

### DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token

| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:186-196` |

**Description**

The `catch (Exception ex) when (ex is TimeoutException or
OperationCanceledException)` block updates the record to `Failed` and calls
`UpdateDeploymentRecordAsync`/`SaveChangesAsync`/`LogAsync` with the same
`cancellationToken` the operation ran under. When the caught
`OperationCanceledException` came from that token, the token is already
cancelled, so the repository and audit calls throw `OperationCanceledException`
themselves before the failure status is persisted. The record then stays
`InProgress`: the exact bug DeploymentManager-001 describes, reached via the
supposedly-handled path.

**Recommendation**

Perform the cleanup writes with a fresh, non-cancellable token (e.g.
`CancellationToken.None`, optionally with an independent short timeout) so the
failure status is durably recorded even when the original operation was
cancelled or timed out.

**Resolution**

Resolved 2026-05-16 (commit pending): the broadened `catch` block now
performs the failure-status write (`UpdateDeploymentRecordAsync`,
`SaveChangesAsync`) and the audit `LogAsync` with `CancellationToken.None`
instead of the operation's (possibly-cancelled) token, so the `Failed` status
is durably recorded even after a timeout/cancellation. The cleanup writes are
themselves wrapped in a `try`/`catch` that logs (without masking the original
error) if persistence still fails. Regression test:
`DeployInstanceAsync_FailureWrite_UsesNonCancellableToken`.
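
Illustrative sketch of the pattern described in DeploymentManager-001/002, not the module's actual code: a broadened `catch` whose cleanup write uses `CancellationToken.None`. `DeploymentRecord`, `IDeploymentStore`, `InMemoryStore`, and `DeploySketch` are hypothetical stand-ins, and the success path is reduced to a single write.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins for the real record and repository types.
public enum DeploymentStatus { InProgress, Success, Failed }

public sealed class DeploymentRecord
{
    public DeploymentStatus Status { get; set; } = DeploymentStatus.InProgress;
    public string ErrorMessage { get; set; } = "";
}

public interface IDeploymentStore
{
    Task SaveAsync(DeploymentRecord record, CancellationToken ct);
}

// In-memory store that honours the token the way a real DB call would.
public sealed class InMemoryStore : IDeploymentStore
{
    public DeploymentStatus? LastSaved { get; private set; }

    public Task SaveAsync(DeploymentRecord record, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        LastSaved = record.Status;
        return Task.CompletedTask;
    }
}

public static class DeploySketch
{
    public static async Task<bool> DeployAsync(
        DeploymentRecord record, IDeploymentStore store,
        Func<CancellationToken, Task> sendToSite, CancellationToken cancellationToken)
    {
        try
        {
            await sendToSite(cancellationToken);
        }
        catch (Exception ex) // broadened: timeout, cancellation, transport, DB, ...
        {
            record.Status = DeploymentStatus.Failed;
            record.ErrorMessage = ex.Message;
            try
            {
                // Not the caller's token: it may already be cancelled.
                await store.SaveAsync(record, CancellationToken.None);
            }
            catch (Exception cleanupEx)
            {
                // Log only; the original failure is still what the caller sees.
                Console.Error.WriteLine($"Could not persist Failed status: {cleanupEx.Message}");
            }
            return false;
        }

        // Success-path persistence is reduced to one write in this sketch.
        record.Status = DeploymentStatus.Success;
        await store.SaveAsync(record, CancellationToken.None);
        return true;
    }
}

public static class Program
{
    public static async Task Main()
    {
        using var cts = new CancellationTokenSource();
        cts.Cancel(); // simulate a caller that has already given up

        var record = new DeploymentRecord();
        var store = new InMemoryStore();
        bool ok = await DeploySketch.DeployAsync(
            record, store, ct => Task.FromCanceled(ct), cts.Token);

        Console.WriteLine($"succeeded={ok}, persisted={store.LastSaved}");
    }
}
```

Passing `cts.Token` to the cleanup `SaveAsync` instead would make the store throw before `Failed` is recorded, which is the behaviour this finding describes.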

### DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170` |

**Description**

After a successful site response the code calls `UpdateDeploymentRecordAsync`
(no `SaveChanges` yet), then `UpdateInstanceAsync`, then
`StoreDeployedSnapshotAsync` (which itself issues `Add`/`Update` calls), then a
single `SaveChangesAsync` at line 170. If `StoreDeployedSnapshotAsync` throws,
the exception is not caught (see DeploymentManager-001) and the
`SaveChangesAsync` never runs — the instance state, deployment status, and
snapshot are all left unpersisted even though the site has actually applied the
deployment. Central and site are now divergent: the site is running the new
config but central still shows the old state and a non-`Success` deployment
record.

**Verification:** Confirmed against source. The DeploymentManager-001 fix made
this strictly worse, not better — after that fix a snapshot-store failure is
caught and the record is flipped from `Success` back to `Failed`, so central
reports a *failed* deployment while the site is running the new config.

**Recommendation**

Wrap the post-success persistence so that, at minimum, the deployment record's
`Success` status is committed. Consider committing the status first, then the
instance state and snapshot, so a later failure does not lose the fact that the
site succeeded. Log loudly if the snapshot write fails after a confirmed site
apply.

**Resolution**

Resolved 2026-05-16 (commit pending): `DeployInstanceAsync` now commits the
deployment record's terminal status (`UpdateDeploymentRecordAsync` +
`SaveChangesAsync`) immediately after the site confirms the apply, *before*
touching instance state or the deployed-config snapshot. The post-success
instance-state update and `StoreDeployedSnapshotAsync` are wrapped in a
best-effort `try`/`catch` that logs loudly for operator reconciliation but no
longer flips the already-committed `Success` record back to `Failed`.
Regression test:
`DeployInstanceAsync_SiteSucceeds_SnapshotWriteFails_RecordStillCommittedSuccess`.
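
The ordering in this resolution can be sketched as follows. This is a minimal illustration under assumptions: the three delegates are hypothetical placeholders for the real repository, snapshot, and logging calls.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PostSuccessSketch
{
    // commitSuccessStatus and applySideEffects are hypothetical delegates that
    // stand in for the status write and the instance-state/snapshot writes.
    public static async Task CompleteAsync(
        Func<Task> commitSuccessStatus,
        Func<Task> applySideEffects,
        Action<Exception> logError)
    {
        // 1. Durably record that the site applied the deployment.
        await commitSuccessStatus();

        // 2. Best effort: a failure here is logged for operator
        //    reconciliation and must not undo step 1.
        try
        {
            await applySideEffects();
        }
        catch (Exception ex)
        {
            logError(ex);
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        var events = new List<string>();

        await PostSuccessSketch.CompleteAsync(
            commitSuccessStatus: () => { events.Add("status=Success committed"); return Task.CompletedTask; },
            applySideEffects: () => throw new InvalidOperationException("snapshot store unavailable"),
            logError: ex => events.Add("logged: " + ex.Message));

        foreach (var e in events) Console.WriteLine(e);
    }
}
```

Because the status commit is awaited before the `try`, a snapshot failure can only add a log entry; it cannot turn a confirmed site apply into a `Failed` record.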

### DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319` |

**Description**

In `DeleteInstanceAsync`, when the site responds `Success` the code calls
`_repository.DeleteInstanceAsync` then `SaveChangesAsync`. If `SaveChangesAsync`
throws (DB error, concurrency), the exception propagates uncaught: the site has
already destroyed the Instance Actor and removed its config, but the central
instance record still exists. The instance is now un-deletable through the
normal path (the site no longer has it, so a re-issued delete may fail) and is
permanently orphaned. The design states central must not mark the instance
deleted until the site confirms — but it does not address the inverse failure.

**Verification:** Confirmed against source. `DeleteInstanceAsync` has no
`try`/`catch` around the post-success block, so any exception from
`_repository.DeleteInstanceAsync`/`SaveChangesAsync` escapes uncaught to the caller.

**Recommendation**

Catch persistence failures in the post-success block and surface a distinct
error indicating the site succeeded but the central record could not be
removed, so an operator/retry can reconcile. Consider making the central delete
idempotent and retryable independently of the site command.

**Resolution**

Resolved 2026-05-16 (commit pending): the post-success removal in
`DeleteInstanceAsync` (`_repository.DeleteInstanceAsync` + `SaveChangesAsync`) is
now wrapped in a `try`/`catch`. A persistence failure no longer escapes uncaught:
it is logged, recorded with a `DeleteOrphaned` audit entry, and surfaced as a
distinct `Result` failure stating the site deleted the instance but the central
record is orphaned and must be reconciled. Regression test:
`DeleteInstanceAsync_SiteSucceeds_CentralDeleteFails_ReturnsDistinctFailure`.

### DeploymentManager-005 — `OperationLockManager` leaks a `SemaphoreSlim` per instance name

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33` |

**Description**

`AcquireAsync` does `_locks.GetOrAdd(instanceUniqueName, _ => new
SemaphoreSlim(1, 1))` and entries are never removed. Every distinct instance
unique name that is ever deployed/disabled/enabled/deleted permanently adds a
`SemaphoreSlim` (an `IDisposable` that lazily allocates a kernel wait handle if
`AvailableWaitHandle` is accessed) to the dictionary. Over the lifetime of a
long-running central process, especially with the bulk "deploy all out-of-date
instances" workflow and instances that are created and deleted over time, this
is an unbounded leak of managed memory, and of OS handles as well wherever a
wait handle has been materialized. Deleted instances' semaphores are never reclaimed.

**Verification:** Confirmed against source. `_locks` is a `ConcurrentDictionary`
with no removal path anywhere in the type.

**Recommendation**

Either accept the leak explicitly and document the expected bounded cardinality
of instance names, or implement reclamation: e.g. ref-count handles, then remove
and `Dispose()` the semaphore when the count reaches zero and the lock is free.
At minimum, remove the semaphore entry when an instance is deleted
(`DeleteInstanceAsync`).

**Resolution**

Resolved 2026-05-16 (commit pending): `OperationLockManager` now ref-counts each
lock entry. A reference is reserved (creating the entry if needed) before the
`SemaphoreSlim.WaitAsync`, so concurrent waiters for the same instance share one
semaphore and the entry survives until every waiter/holder has released. When
the reference count reaches zero (on release, timeout, or cancellation) the
entry is removed from the dictionary and the semaphore is `Dispose()`d, so the
process no longer accumulates one `SemaphoreSlim` per distinct instance name.
A `TrackedLockCount` diagnostic property was added to make reclamation testable.
Regression tests: `AcquireAsync_ReleasedLock_RemovesSemaphoreEntry`,
`AcquireAsync_ManyDistinctInstances_DoesNotAccumulateSemaphores`,
`AcquireAsync_ContendedLock_KeepsSemaphoreUntilLastReleaseThenReclaims`.
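
A minimal sketch of the ref-counting scheme described in this resolution, assuming a plain `Dictionary` guarded by a lock rather than whatever structure the real `OperationLockManager` uses. `RefCountedLockManager` and its members are hypothetical names; only `TrackedLockCount` mirrors the diagnostic property mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class RefCountedLockManager
{
    private sealed class Entry
    {
        public readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);
        public int RefCount; // guarded by _gate
    }

    private readonly object _gate = new object();
    private readonly Dictionary<string, Entry> _entries = new Dictionary<string, Entry>();

    public int TrackedLockCount
    {
        get { lock (_gate) { return _entries.Count; } }
    }

    public async Task<IDisposable> AcquireAsync(
        string key, TimeSpan timeout, CancellationToken ct = default)
    {
        Entry entry;
        lock (_gate)
        {
            // Reserve a reference before waiting so the entry cannot be
            // reclaimed while this caller is queued on the semaphore.
            if (!_entries.TryGetValue(key, out entry))
            {
                entry = new Entry();
                _entries.Add(key, entry);
            }
            entry.RefCount++;
        }

        bool acquired;
        try
        {
            acquired = await entry.Semaphore.WaitAsync(timeout, ct);
        }
        catch
        {
            DropReference(key, entry); // cancelled while waiting
            throw;
        }

        if (!acquired)
        {
            DropReference(key, entry);
            throw new TimeoutException($"Timed out waiting for the lock on '{key}'.");
        }
        return new Releaser(this, key, entry);
    }

    private void DropReference(string key, Entry entry)
    {
        lock (_gate)
        {
            // Last reference gone: no holder and no waiter remains.
            if (--entry.RefCount == 0)
            {
                _entries.Remove(key);
                entry.Semaphore.Dispose();
            }
        }
    }

    private sealed class Releaser : IDisposable
    {
        private readonly RefCountedLockManager _owner;
        private readonly string _key;
        private readonly Entry _entry;
        private int _released;

        public Releaser(RefCountedLockManager owner, string key, Entry entry)
        { _owner = owner; _key = key; _entry = entry; }

        public void Dispose()
        {
            if (Interlocked.Exchange(ref _released, 1) != 0) return; // idempotent
            _entry.Semaphore.Release();
            _owner.DropReference(_key, _entry);
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        var locks = new RefCountedLockManager();
        using (await locks.AcquireAsync("instance-a", TimeSpan.FromSeconds(1)))
        {
            Console.WriteLine(locks.TrackedLockCount); // one entry while the lock is held
        }
        Console.WriteLine(locks.TrackedLockCount);     // entry reclaimed after release
    }
}
```

Incrementing the count under the same lock that removes entries is what keeps a late waiter from attaching to a semaphore that is about to be disposed.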

### DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented

| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:84-200,363-368` |

**Description**

The design ("Deployment Identity & Idempotency") requires: "After a central
failover or timeout, the Deployment Manager queries the site for current
deployment state before allowing a re-deploy. This prevents duplicate
application and out-of-order config changes." The code never does this.
`GetDeploymentStatusAsync` only reads the local `DeploymentRecord` from the DB
(`GetDeploymentByDeploymentIdAsync`) — it does not contact the site.
`DeployInstanceAsync` unconditionally generates a new deployment ID and sends a
new `DeployInstanceCommand` regardless of any prior in-flight or timed-out
deployment. After a timeout where the site actually applied the config, a
re-deploy produces a second deployment with no reconciliation against the
site's current revision hash. Site-side stale-rejection is the only safety
net, and that is not verified here.

**Recommendation**

Add a site query (a new `CommunicationService` pattern returning the site's
currently-applied deployment ID / revision hash) and call it before re-deploy
when a prior record for the instance is in `InProgress`/`Failed` due to
timeout. Reconcile: if the site already has the target revision, mark the prior
record `Success` instead of re-sending. Either implement this or update the
design doc to reflect that reconciliation is delegated entirely to site-side
stale-rejection.

**Resolution**

Resolved 2026-05-16 (commit pending): implemented the cross-module
query-the-site-before-redeploy idempotency feature across Commons, SiteRuntime,
Communication, and DeploymentManager — new `DeploymentStateQueryRequest` /
`DeploymentStateQueryResponse` contracts, a `DeploymentManagerActor` handler
answering from the site's deployed-config store, a
`CommunicationService.QueryDeploymentStateAsync` method routed over the
ClusterClient command/control transport, and reconciliation in
`DeployInstanceAsync` (`TryReconcileWithSiteAsync`) that queries the site only
when a prior record is `InProgress` or `Failed` due to a timeout, marks the
prior record `Success` without re-sending if the site already has the target
revision hash, and falls through to a normal deploy (relying on site-side
stale-rejection) when the query fails. Regression tests:
`RoundTrip_DeploymentStateQueryRequest_Succeeds`,
`RoundTrip_DeploymentStateQueryResponse_Deployed_Succeeds`,
`RoundTrip_DeploymentStateQueryResponse_NotDeployed_NullApplied`,
`DeploymentStateQuery_DeployedInstance_ReturnsAppliedIdentity`,
`DeploymentStateQuery_UnknownInstance_ReturnsNotDeployed`,
`DeploymentStateQuery_ForwardedToDeploymentManager`,
`QueryDeploymentStateAsync_BeforeInitialization_Throws`,
`QueryDeploymentStateAsync_SendsEnvelopeAndReturnsResponse`,
`DeployInstanceAsync_PriorInProgressRecord_SiteHasTargetHash_MarksSuccessWithoutRedeploy`,
`DeployInstanceAsync_PriorInProgressRecord_SiteHasDifferentHash_ProceedsWithDeploy`,
`DeployInstanceAsync_PriorFailedTimeoutRecord_QueriesSite`,
`DeployInstanceAsync_PriorSuccessRecord_SkipsSiteQuery`,
`DeployInstanceAsync_FreshFirstTimeDeploy_SkipsSiteQuery`,
`DeployInstanceAsync_PriorInProgressRecord_QueryFails_FallsThroughToDeploy`.
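
The reconciliation decision described above can be sketched as a small function. This is an illustration under assumptions, not `TryReconcileWithSiteAsync` itself: `PriorStatus`, `ReconcileOutcome`, and the `querySiteAppliedHash` delegate are hypothetical, and how the real code recognises a timeout-caused `Failed` record is not shown in this review.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical summary of the most recent deployment record for the instance.
public enum PriorStatus { None, Success, InProgress, FailedTimeout, FailedOther }

public enum ReconcileOutcome { DeployNormally, PriorMarkedSuccess }

public static class ReconcileSketch
{
    public static async Task<ReconcileOutcome> DecideAsync(
        PriorStatus prior,
        string targetRevisionHash,
        Func<Task<string>> querySiteAppliedHash)
    {
        // Only an ambiguous prior outcome justifies a round trip to the site.
        bool ambiguous = prior == PriorStatus.InProgress
                      || prior == PriorStatus.FailedTimeout;
        if (!ambiguous) return ReconcileOutcome.DeployNormally;

        string siteHash;
        try
        {
            siteHash = await querySiteAppliedHash();
        }
        catch (Exception)
        {
            // Query failed: fall through to a normal deploy and rely on
            // site-side stale-rejection.
            return ReconcileOutcome.DeployNormally;
        }

        // A null hash (instance not deployed at the site) never matches.
        return string.Equals(siteHash, targetRevisionHash, StringComparison.Ordinal)
            ? ReconcileOutcome.PriorMarkedSuccess
            : ReconcileOutcome.DeployNormally;
    }
}

public static class Program
{
    public static async Task Main()
    {
        var outcome = await ReconcileSketch.DecideAsync(
            PriorStatus.InProgress, "abc123", () => Task.FromResult("abc123"));
        Console.WriteLine(outcome);
    }
}
```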

### DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406` |

**Description**

The design ("Diff View" and "Dependencies" sections) states the Deployment
Manager can request a diff from the Template Engine showing added/removed
members, changed values, and connection-binding changes.
`GetDeploymentComparisonAsync` and `DeploymentComparisonResult` only compare two
revision hashes and return a boolean `IsStale` plus the two hashes. No
added/removed/changed detail is produced, and the Template Engine's diff
capability is not invoked. The UI cannot render a meaningful diff from this
result.

**Verification:** Confirmed against source. The Template Engine already provides
`DiffService` + `ConfigurationDiff` (structured Added/Removed/Changed entries
for attributes, alarms, and scripts, including data connection binding fields),
and `DiffService` is DI-registered — it was simply never wired into the
Deployment Manager's comparison path.

**Recommendation**

Either implement a real diff (deserialize the stored
`DeployedConfigSnapshot.ConfigurationJson` and the freshly flattened config and
invoke the Template Engine's diff service, surfacing structured
added/removed/changed entries), or revise the design doc to scope the feature
down to staleness detection only.

**Resolution**

Resolved 2026-05-16 (commit pending): `GetDeploymentComparisonAsync` now
deserializes the stored `DeployedConfigSnapshot.ConfigurationJson` and runs the
Template Engine `DiffService` against the freshly flattened current
configuration, attaching the resulting `ConfigurationDiff` (added/removed/changed
attributes, alarms, scripts) to a new optional `Diff` property on
`DeploymentComparisonResult`. `DiffService` is injected into `DeploymentService`.
A snapshot that cannot be deserialized (corrupt / older schema) still yields the
hash-based staleness result with a null diff, logged at warning level.
Regression test: `GetDeploymentComparisonAsync_ProducesStructuredDiff`.

### DeploymentManager-008 — `DeploymentManagerOptions` is never bound to configuration

| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14` |

**Description**

`AddDeploymentManager` registers the services but never calls
`services.Configure<DeploymentManagerOptions>(configuration.GetSection(...))`.
`IOptions<DeploymentManagerOptions>` therefore always resolves to a
default-constructed instance — the operation-lock and artifact-deployment
timeouts cannot be tuned via `appsettings.json`, contrary to the CLAUDE.md
convention "Per-component configuration via `appsettings.json` sections bound
to options classes (Options pattern)." `Host/Program.cs` binds
`SecurityOptions` and `InboundApiOptions` from configuration sections but has
no equivalent for `DeploymentManagerOptions`.

**Verification:** Confirmed against source. Neither `AddDeploymentManager` nor
`Host/Program.cs` binds `DeploymentManagerOptions`.

**Recommendation**

Add an `IConfiguration` parameter (or a configure callback) to
`AddDeploymentManager` and bind `DeploymentManagerOptions` to a section such as
`ScadaLink:DeploymentManager`, consistent with the other components.

**Resolution**

Resolved 2026-05-16 (commit pending): `AddDeploymentManager()` now calls
`services.AddOptions<DeploymentManagerOptions>()` so `IOptions<DeploymentManagerOptions>`
is always resolvable, and `Host/Program.cs` binds the
`ScadaLink:DeploymentManager` section (exposed as
`ServiceCollectionExtensions.OptionsSection`) via
`services.Configure<DeploymentManagerOptions>(...)` — the same pattern the Host
uses for `SecurityOptions`/`InboundApiOptions`. An earlier attempt added an
`AddDeploymentManager(IConfiguration)` overload; that was reverted because the
project convention (enforced by `Host.Tests.OptionsTests`) forbids component
`Add*` methods from depending on `IConfiguration` — the Host owns
configuration binding. Regression tests:
`AddDeploymentManager_RegistersResolvableOptions_WithDefaults`,
`AddDeploymentManager_OptionsBindToConfigurationSection_AsTheHostWires`,
`OptionsSection_MatchesTheConventionalComponentSectionPath`.
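
A sketch of the wiring this resolution describes, assuming the `Microsoft.Extensions.Configuration`, `Microsoft.Extensions.DependencyInjection`, and `Microsoft.Extensions.Options.ConfigurationExtensions` packages are referenced. The options class here is a cut-down stand-in (only `LifecycleCommandTimeout` and its 30 s default come from this review), and the in-memory configuration plays the role of `appsettings.json`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;

// Cut-down stand-in for the real options class.
public sealed class DeploymentManagerOptions
{
    public TimeSpan LifecycleCommandTimeout { get; set; } = TimeSpan.FromSeconds(30);
}

public static class ServiceCollectionExtensions
{
    public const string OptionsSection = "ScadaLink:DeploymentManager";

    // The component registers options but takes no IConfiguration dependency.
    public static IServiceCollection AddDeploymentManager(this IServiceCollection services)
    {
        services.AddOptions<DeploymentManagerOptions>();
        return services;
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for appsettings.json.
        IConfiguration configuration = new ConfigurationBuilder()
            .AddInMemoryCollection(new Dictionary<string, string?>
            {
                ["ScadaLink:DeploymentManager:LifecycleCommandTimeout"] = "00:00:05",
            })
            .Build();

        var services = new ServiceCollection();
        services.AddDeploymentManager();

        // The Host owns the binding, as it does for the other components.
        services.Configure<DeploymentManagerOptions>(
            configuration.GetSection(ServiceCollectionExtensions.OptionsSection));

        using var provider = services.BuildServiceProvider();
        var options = provider.GetRequiredService<IOptions<DeploymentManagerOptions>>().Value;
        Console.WriteLine(options.LifecycleCommandTimeout);
    }
}
```

Without the `Configure` call the same program still resolves `IOptions<DeploymentManagerOptions>`, but with the default-constructed value, which is the state the original finding described.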

### DeploymentManager-009 — Misleading timeout comment on `DeleteInstanceAsync`

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:288` |

**Description**

The XML doc says "Delete fails if site unreachable (30s timeout via
CommunicationOptions)." The actual delete timeout is whatever
`CommunicationOptions.LifecycleTimeout` is configured to (passed inside
`CommunicationService.DeleteInstanceAsync`); the "30s" figure is hard-coded
into the comment and not derived from any constant in this module. If
`LifecycleTimeout` is reconfigured, the comment becomes wrong. It also wrongly
implies the value lives in this module.

**Verification:** Confirmed against source. The `DeleteInstanceAsync` XML doc
quoted a hard-coded "30s" value.

**Recommendation**

Reword to "Delete fails if the site is unreachable within
`CommunicationOptions.LifecycleTimeout`" without quoting a specific number.

**Resolution**

Resolved 2026-05-16 (commit pending): the `DeleteInstanceAsync` XML doc no
longer quotes a hard-coded "30s" — it now states delete fails if the site is
unreachable within `CommunicationOptions.LifecycleTimeout` (and notes the
deadline is applied inside `CommunicationService.DeleteInstanceAsync`).
Documentation-only change; no regression test (a test asserting comment text
would be meaningless).

### DeploymentManager-010 — `SystemArtifactDeploymentRecord` does not persist the deployment ID

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:136,194-211` |

**Description**

`DeployToAllSitesAsync` generates a `deploymentId` (line 136) and returns it in
the `ArtifactDeploymentSummary` and audit log, but the persisted
`SystemArtifactDeploymentRecord` has no field for it (the entity only has `Id`,
`ArtifactType`, `DeployedBy`, `DeployedAt`, `PerSiteStatus`). The deployment ID
that appears in the UI summary and audit log cannot be correlated back to the
stored record. Additionally each per-site `DeployArtifactsCommand` carries its
own separate GUID (`BuildDeployArtifactsCommandAsync` line 114), so there are in
fact N+1 unrelated IDs for one logical artifact deployment.

**Verification:** Confirmed against source. Each per-site command minted its own
GUID and the persisted record had no way to reference the logical id.

**Recommendation**

Add a `DeploymentId` column to `SystemArtifactDeploymentRecord` and store the
single logical `deploymentId`; reuse that ID (or a derived per-site ID) for the
per-site commands so the audit log, UI summary, and persisted record agree.

**Resolution**

Resolved 2026-05-16 (commit pending): `BuildDeployArtifactsCommandAsync` now
accepts an optional `deploymentId`, and `DeployToAllSitesAsync` passes the one
logical `deploymentId` to every per-site command — so the per-site commands,
the audit log, and the UI summary all reference a single id instead of N+1
unrelated GUIDs (`RetryForSiteAsync`, an independent single-site retry, still
mints its own id). Adding a dedicated `DeploymentId` *column* to
`SystemArtifactDeploymentRecord` was deliberately **not** done: that entity
lives in `ScadaLink.Commons` with its EF mapping in
`ScadaLink.ConfigurationDatabase`, both outside this module's edit scope.
Instead the logical `deploymentId` is embedded in the record's free-form
`PerSiteStatus` JSON payload (`{ DeploymentId, Sites }`), which is fully within
this module's control, so the persisted record is correlatable with the
summary/audit. A follow-up to promote it to a first-class column should be
filed against Commons/ConfigurationDatabase if a queryable index is needed.
Regression tests: `DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId`,
`DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix`,
`RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits`.

### DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path

| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199` |

**Description**

`DeploymentServiceTests` never sets the `CommunicationService` actor, so every
deploy/lifecycle test deliberately stops at the `InvalidOperationException`
thrown by `GetCommunicationActor()` (see lines 118-125, 147). As a result there
is no test covering: a successful deployment (`DeploymentStatus.Success`
response → instance state set to `Enabled`, snapshot stored, audit logged); a
failed-but-handled site response; the `InProgress`-stuck bug
(DeploymentManager-001); successful Disable/Enable/Delete; or the operation
lock actually serializing two concurrent deploys of the same instance. The
critical post-response branch (`DeploymentService.cs:154-184`) and the entire
delete/disable/enable success path are untested. The `AuditLogs` test
(lines 277-289) asserts nothing.

**Verification:** Partially confirmed. By the time this finding was being
resolved, the DeploymentManager-006 fix had already introduced a TestKit-actor
seam (`CreateServiceWithCommActor` + `ReconcileProbeActor`) and successful-deploy
tests. The genuinely-still-missing coverage was: successful Disable/Enable/Delete
paths, per-instance lock serialization during deploy, and the assertionless
`AuditLogs` test — those gaps were addressed.

**Recommendation**

Introduce a seam to inject a fake/substitute communication path (e.g. an
interface over `CommunicationService`, or wire a TestKit actor) so success and
handled-failure paths can be unit tested. Add tests for the stuck-`InProgress`
scenario and for per-instance lock contention during deploy. Make the audit
test assert on `IAuditService.LogAsync`.

**Resolution**

Resolved 2026-05-16 (commit pending): extended the TestKit-actor seam
(`ReconcileProbeActor` now also answers lifecycle commands) and added the
missing coverage — successful Disable/Enable/Delete (state transition + audit
assertions), a successful-deploy audit assertion, and per-instance lock
serialization via a new deferred-reply `SerializationProbeActor` that asserts a
single instance's concurrent deploys never overlap. The assertionless `AuditLogs`
test was replaced with `DeployInstanceAsync_FlatteningFails_DoesNotReachAudit`,
which asserts on `IAuditService.LogAsync`. Regression tests:
`DisableInstanceAsync_SiteSucceeds_SetsDisabledStateAndAudits`,
`EnableInstanceAsync_SiteSucceeds_SetsEnabledStateAndAudits`,
`DeleteInstanceAsync_SiteSucceeds_RemovesRecordAndAudits`,
`DeployInstanceAsync_SiteSucceeds_WritesDeployAuditEntry`,
`DeployInstanceAsync_FlatteningFails_DoesNotReachAudit`,
`DeployInstanceAsync_SameInstance_OperationLockSerializesConcurrentDeploys`.

### DeploymentManager-012 — `LifecycleCommandTimeout` option is dead code

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentManagerOptions.cs:8-9` |

**Description**

`DeploymentManagerOptions.LifecycleCommandTimeout` is declared with a 30s
default and an XML doc, but it is never read anywhere in the codebase
(lifecycle commands rely on `CommunicationOptions.LifecycleTimeout` inside
`CommunicationService`). The option misleads readers into thinking it controls
disable/enable/delete timeouts, when setting it has no effect.

**Verification:** Confirmed against source. A repo-wide grep found exactly one
occurrence of `LifecycleCommandTimeout` — the declaration itself.

**Recommendation**

Remove `LifecycleCommandTimeout`, or actually thread it through to the
lifecycle command calls (e.g. by creating a linked CTS with this timeout in
`DisableInstanceAsync`/`EnableInstanceAsync`/`DeleteInstanceAsync`, the way
`ArtifactDeploymentTimeoutPerSite` is used).

**Resolution**

Resolved 2026-05-16 (commit pending): `LifecycleCommandTimeout` is now actually
threaded through (the option exists for tuning, so it was wired up rather than
deleted). `DisableInstanceAsync`/`EnableInstanceAsync`/`DeleteInstanceAsync`
each create a linked `CancellationTokenSource` with `CancelAfter(
_options.LifecycleCommandTimeout)` — the same pattern `ArtifactDeploymentService`
uses for `ArtifactDeploymentTimeoutPerSite` — and pass its token to the
`CommunicationService` call. Each method now catches the resulting
`TimeoutException`/`OperationCanceledException`, logs a warning, and returns a
`Result.Failure` (previously an `AskTimeoutException` from a hung site escaped
uncaught). The option's XML doc was corrected to describe the real behaviour.
Regression test:
`DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait`
(asserts a 300 ms `LifecycleCommandTimeout` bounds the wait far below the 30 s
`CommunicationOptions.LifecycleTimeout`; confirmed to fail before the fix —
the call hung the full 30 s and threw `AskTimeoutException`).
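
For reference, the linked-token pattern described in this resolution can be sketched as follows. This is a minimal illustration, not the module's actual code: `LifecycleTimeoutSketch`, `RunWithTimeoutAsync`, and the `siteCall` delegate are hypothetical stand-ins for the lifecycle methods and the `CommunicationService` call.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LifecycleTimeoutSketch
{
    // Runs siteCall under a token that is cancelled when either the caller's
    // token is cancelled or the timeout elapses, whichever happens first.
    // siteCall is a hypothetical stand-in for a CommunicationService call.
    public static async Task<(bool Ok, string? Error)> RunWithTimeoutAsync(
        Func<CancellationToken, Task<bool>> siteCall,
        TimeSpan timeout,
        CancellationToken callerToken)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(timeout);
        try
        {
            var accepted = await siteCall(cts.Token);
            if (accepted)
            {
                return (true, null);
            }
            return (false, "Site rejected the command");
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            // Convert the timeout/cancellation into a failure result instead
            // of letting the exception escape to the caller.
            return (false, $"Lifecycle command did not complete within {timeout}");
        }
    }
}
```

The point of the linked source is that the caller's own cancellation still works, while the option puts an upper bound on the wait even when the caller passes a token that never fires.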

### DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites

| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:108-111` |

**Description**

`BuildDeployArtifactsCommandAsync` maps `smtp.Credentials` directly into
`SmtpConfigurationArtifact` and that command is sent to every site. Distributing
SMTP credentials to sites is consistent with the design (SMTP configuration is
a deployable artifact), but the credentials travel inside a serialized command
across the inter-cluster transport and are stored on each site's SQLite. There
is no indication the value is encrypted at rest on the site or scrubbed from
logs. Worth confirming the transport is TLS-protected and the site stores the
credential securely; at minimum this should be a conscious, documented decision.

**Recommendation**

Confirm inter-cluster transport encryption covers artifact commands, ensure
`Credentials` is never written to logs, and document the at-rest protection of
SMTP credentials on site SQLite. Consider encrypting the credential field
within the artifact payload.

**Verification (2026-05-16):** Re-triaged against source. The DeploymentManager
side is **clean**: `ArtifactDeploymentService` maps `SmtpConfiguration.Credentials`
into the artifact (which the design explicitly mandates — SMTP configuration is
a deployable artifact) and **never logs it** — the three log statements in
`DeployToAllSitesAsync` only reference `SiteId`, `SiteName`, `DeploymentId`, and
`ex.Message`, never the credential. There is no defect to fix purely within
`src/ScadaLink.DeploymentManager`. The finding's remaining recommendations are
all cross-module and one needs a design decision:
  - inter-cluster transport TLS — `ScadaLink.Communication` /
    `ScadaLink.ClusterInfrastructure` (Akka remoting + ClusterClient config);
  - at-rest encryption of the credential on site SQLite — `ScadaLink.SiteRuntime`
    artifact store;
  - encrypting the credential field inside the artifact payload — needs the
    `SmtpConfigurationArtifact` shape in `ScadaLink.Commons` plus cooperating
    producer (DeploymentManager) and consumer (SiteRuntime) changes, and a
    **key-management design decision** (where the encryption key lives, how it
    is distributed to sites) that cannot be made unilaterally here.

**Status at triage: Open (flagged).** At that point no purely-DeploymentManager
fix existed; the work crosses Communication / SiteRuntime / Commons and requires
a key-management design decision. Severity confirmed Low: with TLS-protected
inter-cluster transport (a separate, assumed-in-place control) and no logging
leak, this is a hardening item, not an active leak. This interim status is
superseded by the Resolution below.

**Resolution**

Resolved 2026-05-16 (commit pending). Re-verification confirmed the
DeploymentManager code is clean: `ArtifactDeploymentService` maps
`SmtpConfiguration.Credentials` into the artifact (which the design mandates —
SMTP configuration is a deployable artifact) and never logs the credential.
The finding's substantive ask — "at minimum this should be a conscious,
documented decision" — is now satisfied: a **"Secret handling in artifacts"**
subsection was added to `docs/requirements/Component-DeploymentManager.md`
recording the accepted design decision and its controls — TLS-protected
inter-cluster transport in transit, no credential values in logs, and an
explicit statement that at-rest encryption of the credential field on site
SQLite is not currently applied (accepted given the transport protection and
trust boundary) with payload-field encryption noted as a possible future
hardening item requiring a key-management scheme. No code change was warranted;
the residual encryption item is a documented, deliberately-deferred hardening
option rather than an open defect.

### DeploymentManager-014 — Dead `CreateCommand` helper in artifact tests

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90` |

**Description**

The private static `CreateCommand()` helper is never referenced by any test in
the file. It is dead code that suggests an intended test (e.g. a successful
multi-site artifact deployment) was never written — coverage of
`DeployToAllSitesAsync` is limited to the no-sites failure case, and
`RetryForSiteAsync` and `BuildDeployArtifactsCommandAsync` have no tests at all.

**Verification:** Confirmed against source. The `CreateCommand()` helper had no
callers, and `DeployToAllSitesAsync`/`RetryForSiteAsync` only had the no-sites
failure case.

**Recommendation**

Either remove the unused helper or, preferably, write the missing tests for
`DeployToAllSitesAsync` (per-site success/failure matrix, partial failure) and
`RetryForSiteAsync` using it.

**Resolution**

Resolved 2026-05-16 (commit pending): took the recommendation's preferred
option — removed the dead `CreateCommand()` helper and wrote the missing
coverage instead. `ArtifactDeploymentServiceTests` now extends `TestKit` and
uses a stand-in `ArtifactProbeActor` (records the `DeployArtifactsCommand`s it
receives, replies success or, for a configured failure set, failure) so
`DeployToAllSitesAsync` and `RetryForSiteAsync` are exercised end-to-end past
the communication boundary. New tests:
`DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId` (also
covers DeploymentManager-010), `DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix`
(per-site success/failure matrix), `RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits`.

### DeploymentManager-015 — Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates

| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:631-655` |

**Description**

`TryReconcileWithSiteAsync` (the DeploymentManager-006 query-before-redeploy
path) handles the case where a prior `InProgress`/timeout-`Failed` record exists
and the site reports it already has the target revision hash. In that case it
marks the prior `DeploymentRecord` `Success`, audit-logs `DeployReconciled`, and
returns it — the caller then returns `Result.Success` and **never enters the
normal deploy body**.

The normal success path (`DeployInstanceAsync`, `DeploymentService.cs:215-223`) does three things on
a successful site response: writes the deployment record terminal status, sets
`instance.State = InstanceState.Enabled` + `UpdateInstanceAsync`, and calls
`StoreDeployedSnapshotAsync`. The reconciliation shortcut performs only the
first. Consequently, after a reconciled deployment:

- The instance `State` is left at whatever it was (e.g. `NotDeployed` for a
  first-time deploy that timed out, or `Disabled`) even though the site is
  actually running the configuration — the central state machine and the site
  diverge, and a subsequent `DisableInstanceAsync`/`EnableInstanceAsync` will be
  rejected or allowed incorrectly by `StateTransitionValidator`.
- No `DeployedConfigSnapshot` is created or refreshed. A first-time deploy that
  is resolved purely by reconciliation leaves `GetDeploymentComparisonAsync`
  permanently returning `"No deployed snapshot found for this instance."`, and a
  redeploy reconciliation leaves the stored snapshot showing the *old* config
  even though the deployment record claims `Success` for the new revision.

The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the
deployed snapshot and instance state to reflect the last successful deployment;
the reconciliation path silently breaks both invariants.

**Recommendation**

In the reconciled-success branch of `TryReconcileWithSiteAsync`, perform the
same post-success side effects as the normal path: set `instance.State =
InstanceState.Enabled` (+ `UpdateInstanceAsync`) and call
`StoreDeployedSnapshotAsync` with the target deployment ID / revision hash /
config JSON. Factor the shared post-success logic into one helper so the normal
and reconciliation paths cannot drift. Add a regression test asserting that a
reconciled deployment leaves the instance `Enabled` and a snapshot stored.

**Resolution**

Resolved 2026-05-17 (commit pending): extracted the shared post-success side
effects into `ApplyPostSuccessSideEffectsAsync` (sets instance `State =
Enabled` + `UpdateInstanceAsync`, stores/refreshes the `DeployedConfigSnapshot`)
and invoked it from both the normal deploy success path and the
`TryReconcileWithSiteAsync` reconciled-success branch, so a reconciled
deployment now performs the same instance-state and snapshot updates as a
normal one (`configJson` is now computed before the reconciliation call and
threaded into `TryReconcileWithSiteAsync`). Regression test:
`DeployInstanceAsync_Reconciled_SetsInstanceEnabledAndStoresSnapshot`.

### DeploymentManager-016 — Reconciled prior record keeps its stale `RevisionHash`

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:639-651` |

**Description**

When `TryReconcileWithSiteAsync` reconciles a prior record, it mutates
`prior.Status`, `prior.ErrorMessage`, and `prior.CompletedAt`, but **not**
`prior.RevisionHash`. The reconciliation condition only compares the *site's*
`AppliedRevisionHash` against the *freshly-flattened* `targetRevisionHash` — it
does not require `prior.RevisionHash` to equal either of them.

The prior record can legitimately carry a different revision hash than the
current target: e.g. a deploy timed out at revision `R1`, the template was then
edited so the current flatten yields `R2`, and meanwhile the site actually
applied `R2` through some other path (or `R1` and `R2` are equal-by-content but
the prior record predates a hash recompute). After reconciliation the record's
`Status` is `Success` but its `RevisionHash` still says `R1`, so staleness
checks and any UI that reads `DeploymentRecord.RevisionHash` will report the
instance as deployed at the wrong revision. The audit `DeployReconciled` entry
records `RevisionHash = targetRevisionHash`, contradicting the persisted record.

**Recommendation**

In the reconciled-success branch, also set `prior.RevisionHash =
targetRevisionHash` so the persisted record, the audit entry, and the site's
actual applied revision all agree. Alternatively, only reconcile when
`prior.RevisionHash == targetRevisionHash` and otherwise fall through to a
normal deploy.

**Resolution**

Resolved 2026-05-17 (commit pending): the reconciled-success branch of
`TryReconcileWithSiteAsync` now also sets `prior.RevisionHash =
targetRevisionHash`, so the persisted record, the `DeployReconciled` audit
entry, and the site's actually-applied revision all agree. Regression test:
`DeployInstanceAsync_Reconciled_PriorRecordRevisionHashUpdatedToTarget`.

### DeploymentManager-017 — `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:562-570` |

**Description**

The XML summary on `GetDeploymentStatusAsync` reads: *"WP-2: After
failover/timeout, query site for current deployment state before
re-deploying."* The method body does no such thing — it is a one-line
pass-through to `_repository.GetDeploymentByDeploymentIdAsync`, a pure local DB
read. The query-the-site-before-redeploy behaviour the comment describes was
implemented separately in `TryReconcileWithSiteAsync` (DeploymentManager-006).
The stale comment is a leftover of the original design intent and misleads a
reader into thinking this method contacts the site.

**Recommendation**

Reword the summary to describe what the method actually does — "returns the
current persisted `DeploymentRecord` for the given deployment ID from the
configuration database" — and, if useful, cross-reference
`TryReconcileWithSiteAsync` as the place the site-query reconciliation lives.

**Resolution**

Resolved 2026-05-17 (commit pending): the `GetDeploymentStatusAsync` XML doc
now states it returns the persisted `DeploymentRecord` from the configuration
database as a pure local read, and cross-references `TryReconcileWithSiteAsync`
as where the query-the-site-before-redeploy reconciliation actually lives.
Documentation-only change; no regression test (a test asserting comment text
would be meaningless).

### DeploymentManager-018 — Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover

| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:675-682,721-748` |

**Description**

`TryReconcileWithSiteAsync` calls `ApplyPostSuccessSideEffectsAsync` whenever
the site reports it has the target revision hash, and that helper
unconditionally writes `instance.State = InstanceState.Enabled`. The
reconciliation shortcut only runs when the prior `DeploymentRecord` is
`InProgress` or timeout-`Failed` — exactly the scenarios that survive a central
failover (the in-memory `OperationLockManager` is lost on failover, by design:
*"Lost on central failover (acceptable per design — in-progress treated as
failed)"*).

After such a failover, the per-instance operation lock is gone but the
deployment record is still `InProgress` in the DB. A user can legitimately
issue `DisableInstanceAsync` for the same instance — there is nothing in
`DisableInstanceAsync` that consults the deployment record, only the
`StateTransitionValidator` over `Instance.State`. If the state is `Enabled`
(the typical case when the deploy started), the disable proceeds, the site
honours it (the design states a disabled instance retains its deployed
configuration), and central now persists `Instance.State = Disabled`. The
deployment-record row remains `InProgress` (no one transitioned it). Later the
user retries the deploy: `TryReconcileWithSiteAsync` runs, the site still has
the target revision hash (Disable doesn't change the deployed config), the
prior record is marked `Success`, and `ApplyPostSuccessSideEffectsAsync` writes
`Instance.State = Enabled` — silently overriding the user's explicit Disable.

The same trap exists for any direct DB edit / migration that flipped the state
between the timed-out deploy and the redeploy. The normal deploy path can
defensibly assume `Enabled` after a fresh successful apply, but the
reconciliation path is reconciling *prior* state with *prior* user intent; it
should preserve `Disabled` if that is the current `Instance.State` at the time
of reconciliation, mirroring the design's separation between deploy (config
apply) and disable (subscription/script lifecycle).

**Recommendation**

In the reconciliation branch, do not force `Enabled`. Either:
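- Pass a flag/parameter to `ApplyPostSuccessSideEffectsAsync` telling it
  whether to touch state, and skip the state write on the reconciliation path
  (leaving the current `Instance.State` intact, which is already `Enabled`
  for a redeploy of a running instance that timed out and `Disabled` for the
  user-disabled follow-up case). On its own this would leave a first-time
  deploy that timed out at `NotDeployed`, reintroducing the state divergence
  described in DeploymentManager-015, so it needs a `NotDeployed` carve-out; or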
- Only set `Enabled` when the current `Instance.State` is `NotDeployed` (i.e.
  the first-deploy timed-out case), and leave existing `Enabled`/`Disabled`
  alone.

Add a regression test where an instance with `Instance.State = Disabled` and a
prior `InProgress` deployment record is reconciled — the resulting
`Instance.State` must remain `Disabled`, and the deployment record must still
be marked `Success`.
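
The second option reduces to a small state rule, sketched below. The `InstanceState` enum here is a local stand-in declared only so the snippet compiles on its own, and `ReconcileStateSketch` with its two methods is hypothetical, not a name from the module.

```csharp
// Local stand-in for the module's InstanceState enum (assumption: the three
// states named in this review are the relevant ones).
public enum InstanceState { NotDeployed, Enabled, Disabled }

public static class ReconcileStateSketch
{
    // Reconciliation: promote only an instance that was never deployed.
    // An existing Enabled or Disabled state reflects later user intent and
    // is preserved.
    public static InstanceState StateAfterReconcile(InstanceState current) =>
        current == InstanceState.NotDeployed ? InstanceState.Enabled : current;

    // Fresh successful apply: the normal deploy path keeps setting Enabled.
    public static InstanceState StateAfterFreshDeploy(InstanceState current) =>
        InstanceState.Enabled;
}
```

Keeping the two rules as separate functions makes the difference between "fresh apply" and "reconciled prior apply" explicit, which is the distinction this finding turns on.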

### DeploymentManager-019 — Lifecycle command timeout writes no audit entry

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458` |

**Description**

`DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync` each
wrap the `CommunicationService` call in a linked CTS with
`LifecycleCommandTimeout` (DeploymentManager-012). On timeout they log a
warning and `return Result<...>.Failure(...)` — and skip the
`_auditService.LogAsync` call entirely. As a result, an operator-initiated
disable/enable/delete that times out at the site leaves **no audit trail**:
the user, the timestamp, the command id, and the failure mode are not
recorded in the audit log. The deploy path goes out of its way to write a
`DeployFailed` audit entry on the same failure mode
(`DeploymentService.cs:274-276`), with `CancellationToken.None` so the write is
durable; the lifecycle commands do not.

The design lists audit logging as a Deployment Manager responsibility for "all
deployment actions, system-wide artifact deployments, and instance lifecycle
changes" — a timed-out lifecycle command **is** an attempted lifecycle change,
and the operator action is exactly the kind of event the audit log exists to
record.

**Recommendation**

In each of the three `catch (Exception ex) when (ex is TimeoutException or
OperationCanceledException)` blocks, write an audit entry: either a dedicated
`DisableTimeout`/`EnableTimeout`/`DeleteTimeout` action or the existing
operation name with a failure flag. Pass `CancellationToken.None` so a
cancelled outer token does not prevent the audit write, mirroring
`DeployFailed`. Extend
`DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait`
(or add a sibling test) to assert that the timed-out command also produces an
audit entry.
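
A minimal sketch of the recommended shape follows. The real `IAuditService.LogAsync` signature is not shown in this review, so `IAuditSink`, `InMemoryAuditSink`, and `LifecycleAuditSketch` are hypothetical stand-ins, and the `DisableTimeout` action name is just one of the options suggested above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical audit abstraction; not the module's IAuditService.
public interface IAuditSink
{
    Task LogAsync(string user, string action, string detail, CancellationToken token);
}

public sealed class InMemoryAuditSink : IAuditSink
{
    public List<string> Entries { get; } = new();

    public Task LogAsync(string user, string action, string detail, CancellationToken token)
    {
        // Models a store that refuses to write once its token is cancelled.
        token.ThrowIfCancellationRequested();
        Entries.Add($"{user}:{action}:{detail}");
        return Task.CompletedTask;
    }
}

public static class LifecycleAuditSketch
{
    // siteCall stands in for the CommunicationService disable call.
    public static async Task<bool> DisableAsync(
        string user, string instance,
        Func<CancellationToken, Task> siteCall,
        IAuditSink audit, CancellationToken token)
    {
        try
        {
            await siteCall(token);
            await audit.LogAsync(user, "Disable", instance, token);
            return true;
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            // CancellationToken.None: the failure record must be written even
            // though the token that triggered this branch is already cancelled.
            await audit.LogAsync(user, "DisableTimeout",
                $"{instance}: {ex.GetType().Name}", CancellationToken.None);
            return false;
        }
    }
}
```

Passing the original token to the failure write would make the audit call throw in exactly the case it is meant to record, which is why the deploy path uses `CancellationToken.None` for `DeployFailed`.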

### DeploymentManager-020 — `DeployReconciled` audit attributes the action to the prior deployer, not the current user

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:683-686` |

**Description**

In `TryReconcileWithSiteAsync` the audit call is:

```
await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)
```

`prior.DeployedBy` is the user who issued the original (timed-out / stuck)
deployment, not the `user` parameter passed into `DeployInstanceAsync`. The
current user — the one who triggered the redeploy that produced the
reconciliation — is dropped on the floor. For audit forensics this is
misleading: the row will read "user A reconciled their own deployment"
when in fact user B initiated the action that reconciled it.

The original deployer is interesting context, but it should be carried in the
audit-detail object (where `DeploymentId` and `RevisionHash` already live), not
substituted for the actor.

**Recommendation**

Use `user` (the parameter on `DeployInstanceAsync`, threaded through
`TryReconcileWithSiteAsync`) as the audit actor, and include
`OriginalDeployer = prior.DeployedBy` in the detail object so the original
attribution is preserved without misrepresenting who took the action.
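
As an illustration of the intended attribution, assuming an audit entry made of an actor, an action, and a detail object (the record shapes and `ReconcileAuditSketch` below are hypothetical, not the module's types):

```csharp
// Hypothetical shapes; only DeploymentId, RevisionHash, and the
// "DeployReconciled" action name come from the finding text.
public sealed record ReconcileAuditDetail(
    string DeploymentId, string RevisionHash, string OriginalDeployer);

public sealed record AuditEntry(string Actor, string Action, ReconcileAuditDetail Detail);

public static class ReconcileAuditSketch
{
    // The actor is the user who triggered the redeploy; the user who issued
    // the original deployment is carried as context in the detail object.
    public static AuditEntry Build(
        string currentUser, string priorDeployedBy,
        string deploymentId, string targetRevisionHash) =>
        new AuditEntry(
            currentUser,
            "DeployReconciled",
            new ReconcileAuditDetail(deploymentId, targetRevisionHash, priorDeployedBy));
}
```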

### DeploymentManager-021 — `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:107-111` |

**Description**

```
private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
{
    var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
    return site?.SiteIdentifier ?? siteId.ToString();
}
```

If the `Site` row is missing (FK was deleted, race with admin delete, DB
inconsistency), the method silently returns the numeric DB id rendered as a
string. This is then passed to `CommunicationService.{Deploy,Disable,Enable,
Delete}InstanceAsync` and `QueryDeploymentStateAsync` as if it were a real
`SiteIdentifier` (e.g. "site-a"). The communication layer will fail with an
"unknown site" or routing error, producing a confusing diagnostic that hides
the actual problem (no site row).

This is a defensive concern, but every mutating operation in the module goes
through this method, so a stale instance whose site was deleted will produce a
misleading error every time it is touched.

**Recommendation**

Treat a missing site as a hard validation failure: return a
`Result.Failure($"Site with ID {siteId} not found")` early from the calling
operations, instead of fabricating an identifier. The repository already
returns `Site?`, so the null path is type-visible; just don't paper over it.
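
A sketch of the fail-fast variant is below. `Site` and `Result<T>` are simplified local stand-ins for the module's types, and the `getSiteById` delegate replaces `_siteRepository.GetSiteByIdAsync` so the snippet stays self-contained.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-ins for the module's Site entity and Result type.
public sealed record Site(int Id, string SiteIdentifier);

public sealed class Result<T>
{
    private Result(bool isSuccess, T? value, string? error)
    {
        IsSuccess = isSuccess;
        Value = value;
        Error = error;
    }

    public bool IsSuccess { get; }
    public T? Value { get; }
    public string? Error { get; }

    public static Result<T> Success(T value) => new Result<T>(true, value, null);
    public static Result<T> Failure(string error) => new Result<T>(false, default, error);
}

public static class SiteResolutionSketch
{
    // Returns a failure when the site row is missing instead of fabricating
    // an identifier from the numeric database id.
    public static async Task<Result<string>> ResolveSiteIdentifierAsync(
        int siteId,
        Func<int, CancellationToken, Task<Site?>> getSiteById,
        CancellationToken cancellationToken)
    {
        var site = await getSiteById(siteId, cancellationToken);
        if (site is null)
        {
            return Result<string>.Failure($"Site with ID {siteId} not found");
        }
        return Result<string>.Success(site.SiteIdentifier);
    }
}
```

Callers would then return the failure directly, so the diagnostic names the missing site row rather than a downstream routing error.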

### DeploymentManager-022 — `Pending` and `InProgress` are written back-to-back with no intervening work

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:178-194` |

**Description**

`DeployInstanceAsync` does:

```
record.Status = Pending;
AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
record.Status = InProgress;
UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
```

There is no work between the two writes — flattening, validation, and
reconciliation have already completed by line 174. The deploy command is sent
immediately after the `InProgress` write. The `Pending` write therefore costs:
an extra `SaveChangesAsync` round-trip, an extra `IDeploymentStatusNotifier`
invocation (which the CentralUI-006 page renders, so the user briefly sees a
`Pending` flicker before `InProgress`), and an extra row-version bump if EF
optimistic concurrency is enabled on the table.

The design uses `Pending` to mean "queued, not yet sent" and `InProgress` to
mean "sent to site, awaiting response". The code's `Pending` slot has no
queuing — it is set and immediately overwritten — so the state buys nothing
operationally.

**Recommendation**

Either:
- Drop the `Pending` write entirely and create the record directly in
  `InProgress` (one row insert, one notification, simpler UI); or
- Move the `Pending`→`InProgress` transition to bracket actual queueing/work
  (e.g. set `Pending` *before* flattening + reconciliation, set `InProgress`
  immediately before `DeployInstanceAsync` on the comm service) so the two
  states carry distinguishable semantics worth a separate write.

### DeploymentManager-023 — `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173` |

**Description**

`DeployToAllSitesAsync` loops over sites and calls
`BuildDeployArtifactsCommandAsync(site.Id, ...)` for each one. Of the six
artifact sets the method gathers, **only** `dataConnections` is per-site:

- `_templateRepo.GetAllSharedScriptsAsync` — global.
- `_externalSystemRepo.GetAllExternalSystemsAsync` — global, plus
  `GetMethodsByExternalSystemIdAsync` per external system per site.
- `_externalSystemRepo.GetAllDatabaseConnectionsAsync` — global.
- `_notificationRepo.GetAllNotificationListsAsync` — global.
- `_notificationRepo.GetAllSmtpConfigurationsAsync` — global.
- `_siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...)` — **per-site**.

With N sites this issues 5·N queries against the global sets where 5 would
suffice (5·(N−1) redundant), plus M·N method queries where M would suffice
(M is the external-system count). On a hub-and-spoke deployment with many
sites the artifact-deploy path is slower than necessary and holds the
DbContext longer than needed. Per CLAUDE.md, the DbContext is not thread-safe
and the per-site commands are already built sequentially (good); the redundant
queries are sequential too, so each one adds a full database round-trip to the
deploy.

**Recommendation**

Hoist the global queries (shared scripts, external systems + their methods,
DB connections, notification lists, SMTP configurations) out of
`BuildDeployArtifactsCommandAsync`, fetch them once in `DeployToAllSitesAsync`,
and pass them in alongside the site id (or expose a
`BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals)` overload).
`RetryForSiteAsync` (the single-site path) can keep the convenience-overload
behaviour. Add a test using NSubstitute's `.Received()` to assert
`_templateRepo.GetAllSharedScriptsAsync` is called exactly once for an
N-site deployment.
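
The hoisting can be sketched with a counting stand-in for the repositories. Everything named here (`CountingRepo`, `GlobalArtifacts`, `SiteCommand`, `ArtifactBuildSketch`) is hypothetical and collapses the five global sets into one call for brevity.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical simplified shapes; the real command carries more artifact sets.
public sealed record GlobalArtifacts(
    IReadOnlyList<string> SharedScripts, IReadOnlyList<string> NotificationLists);

public sealed record SiteCommand(
    int SiteId, GlobalArtifacts Globals, IReadOnlyList<string> DataConnections);

// Stand-in repository that counts how often each kind of query runs.
public sealed class CountingRepo
{
    public int GlobalQueries { get; private set; }
    public int PerSiteQueries { get; private set; }

    public Task<GlobalArtifacts> LoadGlobalsAsync()
    {
        GlobalQueries++;
        return Task.FromResult(
            new GlobalArtifacts(new[] { "script-a" }, new[] { "ops-list" }));
    }

    public Task<IReadOnlyList<string>> LoadDataConnectionsAsync(int siteId)
    {
        PerSiteQueries++;
        return Task.FromResult<IReadOnlyList<string>>(new[] { $"conn-{siteId}" });
    }
}

public static class ArtifactBuildSketch
{
    public static async Task<List<SiteCommand>> BuildForAllSitesAsync(
        CountingRepo repo, IEnumerable<int> siteIds)
    {
        // Global sets are fetched once, outside the per-site loop.
        var globals = await repo.LoadGlobalsAsync();

        var commands = new List<SiteCommand>();
        foreach (var siteId in siteIds)
        {
            // Only the per-site data is queried inside the loop, still
            // sequentially, so a non-thread-safe context is never shared.
            var connections = await repo.LoadDataConnectionsAsync(siteId);
            commands.Add(new SiteCommand(siteId, globals, connections));
        }
        return commands;
    }
}
```

With three sites this performs one global load and three per-site loads, which is the call-count property the recommended test would assert.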

### DeploymentManager-024 — Test probe actors hold mutable static state across tests

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075`, `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217` |

**Description**

`ReconcileProbeActor.QueryCount` / `DeployCount`, `SerializationProbeActor.MaxConcurrent`
/ `_current`, and `ArtifactProbeActor.Received` are all `static` fields.
Each test's actor constructor resets them, but reset-on-construction only
works as long as no two tests that use the same actor type run concurrently.
xUnit runs the tests within one test class (by default, one test collection)
sequentially, so today's tests pass. That guarantee is incidental: reuse a
probe actor from a second test class (xUnit runs separate collections in
parallel by default), or move to a runner that parallelizes within a class,
and the counters race. A deploy in test A could then increment `DeployCount`
while test B is asserting on it.

Static state shared across tests is also why a flaky-test investigation here
will be unusually painful: the offending interaction is invisible from any
single test file.

**Recommendation**

Replace the static counters with instance state, hand the actor a probe
recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg`
in each test. Where the simpler counter shape is preferred, pass a
shared-state object into the actor's constructor so each test owns its own
instance — never reach for `static` mutable test state.
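
The shared-state-object variant can be sketched without any actor framework. `ProbeCounters` and `ProbeStandIn` are hypothetical; in the real suite the second class would be the Akka probe actor receiving the counters through its constructor.

```csharp
using System.Threading;

// Per-test counter object: each test creates its own instance.
public sealed class ProbeCounters
{
    private int _deployCount;

    public int DeployCount => Volatile.Read(ref _deployCount);

    public void RecordDeploy() => Interlocked.Increment(ref _deployCount);
}

// Hypothetical stand-in for a probe actor; it holds no static state and
// reports only to the counters it was constructed with.
public sealed class ProbeStandIn
{
    private readonly ProbeCounters _counters;

    public ProbeStandIn(ProbeCounters counters) => _counters = counters;

    public void OnDeployCommand() => _counters.RecordDeploy();
}
```

Because each test owns its `ProbeCounters` instance, two tests running at the same time cannot observe each other's increments, regardless of how the runner schedules them.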