f93b7b99bb
Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.
regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
# Code Review — DeploymentManager

| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.DeploymentManager` |
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |

## Summary

The DeploymentManager module is small, well-structured, and clearly maps work
packages (WP-N) onto code. The happy paths for instance deployment, lifecycle
commands, artifact broadcast, and staleness comparison are implemented
sensibly, and the operation lock correctly serializes mutating operations per
instance while allowing cross-instance parallelism. However, the review found a
significant cluster of error-handling and resilience gaps: the deployment
record can be left permanently stuck in `InProgress` when an exception other
than timeout/cancellation is thrown, the catch block writes its failure status
using a cancellation token that may already be cancelled, and the
`OperationLockManager` leaks one `SemaphoreSlim` per instance name forever.
There are also two notable design-document adherence gaps: the
"query-the-site-before-redeploy" idempotency requirement is not implemented
(`GetDeploymentStatusAsync` only reads the local DB), and the "Diff View"
feature is reduced to a bare hash comparison with no added/removed/changed
detail. Configuration is not bound to `appsettings.json`, leaving one option
entirely dead. Test coverage stops at the communication boundary and never
exercises a successful deployment or the lifecycle success paths.

#### Re-review 2026-05-17 (commit `39d737e`)

Re-reviewed at commit `39d737e` after the batch of fixes for
DeploymentManager-001..014. All fourteen prior findings remain `Resolved` and
verified against source: the broadened catch, non-cancellable cleanup writes,
ref-counted `OperationLockManager`, query-before-redeploy reconciliation,
structured diff, options binding, and the expanded TestKit-actor test suite are
all present and correct. The module is in markedly better shape than at the
first review: error paths are now defensively handled and test coverage is
broad (successful deploy/lifecycle, lock serialization, reconciliation
matrix, artifact per-site matrix).

This re-review found **3 new findings**, all clustered on the
DeploymentManager-006 reconciliation path added since the last review. The
reconciliation shortcut (`TryReconcileWithSiteAsync`) marks a stale prior
record `Success` when the site already has the target revision, but it does
**not** perform the side effects the normal success path does: it never
updates the instance `State`, never refreshes the `DeployedConfigSnapshot`,
and never corrects the prior record's own `RevisionHash` (DeploymentManager-015,
DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale:
it still describes the query-before-redeploy behaviour that actually moved into
`TryReconcileWithSiteAsync` (DeploymentManager-017).

#### Re-review 2026-05-28 (commit `1eb6e97`)

Re-reviewed at commit `1eb6e97` after the DeploymentManager-015/016/017 fixes
and a docs-only XML-comment pass. The three prior findings remain `Resolved`
and verified: `ApplyPostSuccessSideEffectsAsync` is now invoked from both the
normal success path and `TryReconcileWithSiteAsync`, the reconciled-success
branch corrects `prior.RevisionHash` to the target, and `GetDeploymentStatusAsync`'s
XML doc now describes the local DB read it actually performs and cross-references
the reconciliation helper. The DiffService wiring, options binding, ref-counted
operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor
test seam are still in place. The 7 new findings here are not regressions in
the DeploymentManager-015/016 fixes; they are issues uncovered by widening
the lens to the lifecycle paths, reconciliation's interaction with an
intentional `Disabled` state, audit semantics, and operational concerns
(per-site artifact-build cost, Pending→InProgress double-write).

The single notable correctness issue is DeploymentManager-018: the
reconciliation shortcut unconditionally sets `instance.State = Enabled` via
`ApplyPostSuccessSideEffectsAsync`. After a central failover that loses the
in-memory operation lock, a user can legitimately `Disable` an instance whose
prior deploy record is still `InProgress`; a subsequent redeploy then reconciles
and silently re-enables the instance against the user's explicit intent.
The remaining six findings are medium/low: lifecycle-timeout audit gap
(DeploymentManager-019), audit-user attribution in reconciliation
(DeploymentManager-020), silent fallback in `ResolveSiteIdentifierAsync`
(DeploymentManager-021), back-to-back `Pending`→`InProgress` writes
(DeploymentManager-022), per-site re-query of system-wide artifacts
(DeploymentManager-023), and shared static state across `*ProbeActor` tests
(DeploymentManager-024).

## Checklist coverage
#### Re-review 2026-05-28 (commit `1eb6e97`)

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | New: reconciliation forces `Enabled` even if the user disabled the instance in between (DeploymentManager-018). |
| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting verified. Note: test probes hold static state (DeploymentManager-024); this is a test concern, not production code. |
| 4 | Error handling & resilience | ✓ | New: Disable/Enable/Delete timeouts return early without writing any audit entry; deploy has `DeployFailed`, lifecycle has nothing (DeploymentManager-019). |
| 5 | Security | ✓ | No new issues. SMTP credential decision documented (DeploymentManager-013 closed). |
| 6 | Performance & resource management | ✓ | New: `BuildDeployArtifactsCommandAsync` re-queries every system-wide artifact set per site in `DeployToAllSitesAsync` (DeploymentManager-023). |
| 7 | Design-document adherence | ✓ | Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on `Disabled`-state preservation. |
| 8 | Code organization & conventions | ✓ | New: redundant `Pending`→`InProgress` back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in `ResolveSiteIdentifierAsync` (DeploymentManager-021). |
| 9 | Testing coverage | ✓ | New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024). |
| 10 | Documentation & comments | ✓ | New: `DeployReconciled` audit uses `prior.DeployedBy` instead of the current `user` parameter, which is misleading for forensics (DeploymentManager-020). |

## Findings
### DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in `InProgress`

| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:141-199` |

**Description**

`DeployInstanceAsync` sets the record to `InProgress` (lines 137-139), then the
`try` block calls into `CommunicationService` and the repository. The only
`catch` filter is `when (ex is TimeoutException or OperationCanceledException)`.
Any other exception escapes the method: an `InvalidOperationException` (thrown
by `CommunicationService.GetCommunicationActor()` when the actor is not set), a
JSON serialization error, a deserialization failure of the response, a DB
exception on `UpdateDeploymentRecordAsync`, or any transport error. The
deployment record then remains in `DeploymentStatus.InProgress`
permanently. Because the staleness check and the UI both read the current
status, the instance is misreported as "deploying" forever and a re-deploy may
be blocked or misinterpreted. The design explicitly states an interrupted
deployment must be "treated as failed".

**Recommendation**

Broaden the catch to a general `catch (Exception ex)` that records
`DeploymentStatus.Failed` with the error message, audit-logs the failure, and
re-throws or returns a failed `Result`. Keep the timeout-specific branch only
if a distinct message is desired. Ensure the failure-status write happens for
every exit path out of the `try`.

**Resolution**

Resolved 2026-05-16 (commit `<pending>`): broadened the `catch` in
`DeployInstanceAsync` to `catch (Exception ex)` so any exception (transport,
serialization, DB, `InvalidOperationException` from an uninitialized
`CommunicationService`) marks the deployment record `Failed` with the error
message and audit-logs the failure, instead of escaping and leaving the record
stuck in `InProgress`. Regression test:
`DeployInstanceAsync_CommunicationThrowsUnexpectedException_RecordMarkedFailed`.

### DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token

| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:186-196` |

**Description**

The `catch (Exception ex) when (ex is TimeoutException or
OperationCanceledException)` block updates the record to `Failed` and calls
`UpdateDeploymentRecordAsync`/`SaveChangesAsync`/`LogAsync` passing the same
`cancellationToken` that was just cancelled (an `OperationCanceledException`
caught here means the token is already in the cancelled state). Those
repository and audit calls will themselves throw `OperationCanceledException`
before the failure status is persisted, so the record stays `InProgress`. This
is the exact bug DeploymentManager-001 describes, reached via the
supposedly-handled path.

**Recommendation**

Perform the cleanup writes with a fresh, non-cancellable token (e.g.
`CancellationToken.None`, optionally with an independent short timeout) so the
failure status is durably recorded even when the original operation was
cancelled or timed out.

**Resolution**

Resolved 2026-05-16 (commit `<pending>`): the broadened `catch` block now
performs the failure-status write (`UpdateDeploymentRecordAsync`,
`SaveChangesAsync`) and the audit `LogAsync` with `CancellationToken.None`
instead of the operation's (possibly-cancelled) token, so the `Failed` status
is durably recorded even after a timeout/cancellation. The cleanup writes are
themselves wrapped in a `try`/`catch` that logs (without masking the original
error) if persistence still fails. Regression test:
`DeployInstanceAsync_FailureWrite_UsesNonCancellableToken`.

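The pattern behind DeploymentManager-001 and DeploymentManager-002 can be
illustrated with a minimal, self-contained sketch. It is not the module's
actual code: `IRecordStore`, `InMemoryRecordStore`, `DeploySketch`, and the
simplified `DeploymentRecord` are hypothetical stand-ins, assumed here only to
show a broadened catch that persists the `Failed` status with
`CancellationToken.None`.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public enum DeploymentStatus { InProgress, Success, Failed }

// Simplified, hypothetical record; the real entity has more fields.
public sealed class DeploymentRecord
{
    public DeploymentStatus Status { get; set; } = DeploymentStatus.InProgress;
    public string? ErrorMessage { get; set; }
}

// Hypothetical stand-in for the repository used by the service.
public interface IRecordStore
{
    Task SaveAsync(DeploymentRecord record, CancellationToken cancellationToken);
}

public sealed class InMemoryRecordStore : IRecordStore
{
    public DeploymentRecord? Saved { get; private set; }

    public Task SaveAsync(DeploymentRecord record, CancellationToken cancellationToken)
    {
        // Like real I/O, a write observed with a cancelled token is aborted.
        cancellationToken.ThrowIfCancellationRequested();
        Saved = record;
        return Task.CompletedTask;
    }
}

public static class DeploySketch
{
    public static async Task<bool> DeployAsync(
        DeploymentRecord record,
        IRecordStore store,
        Func<CancellationToken, Task> sendToSite,
        CancellationToken cancellationToken)
    {
        try
        {
            await sendToSite(cancellationToken);
            record.Status = DeploymentStatus.Success;
            await store.SaveAsync(record, cancellationToken);
            return true;
        }
        catch (Exception ex) // broadened: not limited to timeout/cancellation
        {
            record.Status = DeploymentStatus.Failed;
            record.ErrorMessage = ex.Message;
            try
            {
                // The caller's token may already be cancelled, so the cleanup
                // write deliberately uses a non-cancellable token.
                await store.SaveAsync(record, CancellationToken.None);
            }
            catch (Exception cleanupEx)
            {
                // Report the cleanup failure without masking the original error.
                Console.Error.WriteLine($"Could not persist Failed status: {cleanupEx.Message}");
            }
            return false;
        }
    }
}
```

With an already-cancelled token, the failure write in this sketch still
succeeds because it does not reuse the caller's token; passing
`cancellationToken` to the cleanup `SaveAsync` instead would reproduce the
stuck-`InProgress` behaviour described above.
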
### DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170` |

**Description**

After a successful site response the code calls `UpdateDeploymentRecordAsync`
(no `SaveChanges` yet), then `UpdateInstanceAsync`, then
`StoreDeployedSnapshotAsync` (which itself issues `Add`/`Update` calls), then a
single `SaveChangesAsync` at line 170. If `StoreDeployedSnapshotAsync` throws,
the exception is not caught (see DeploymentManager-001) and the
`SaveChangesAsync` never runs: the instance state, deployment status, and
snapshot are all left unpersisted even though the site has actually applied the
deployment. Central and site are now divergent: the site is running the new
config but central still shows the old state and a non-`Success` deployment
record.

**Verification:** Confirmed against source. The DeploymentManager-001 fix made
this strictly worse, not better: after that fix a snapshot-store failure is
caught and the record is flipped from `Success` back to `Failed`, so central
reports a *failed* deployment while the site is running the new config.

**Recommendation**

Wrap the post-success persistence so that, at minimum, the deployment record's
`Success` status is committed. Consider committing the status first, then the
instance state and snapshot, so a later failure does not lose the fact that the
site succeeded. Log loudly if the snapshot write fails after a confirmed site
apply.

**Resolution**

Resolved 2026-05-16 (commit pending): `DeployInstanceAsync` now commits the
deployment record's terminal status (`UpdateDeploymentRecordAsync` +
`SaveChangesAsync`) immediately after the site confirms the apply, *before*
touching instance state or the deployed-config snapshot. The post-success
instance-state update and `StoreDeployedSnapshotAsync` are wrapped in a
best-effort `try`/`catch` that logs loudly for operator reconciliation but no
longer flips the already-committed `Success` record back to `Failed`.
Regression test:
`DeployInstanceAsync_SiteSucceeds_SnapshotWriteFails_RecordStillCommittedSuccess`.

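The ordering described in the DeploymentManager-003 resolution (commit the
terminal status first, then treat the remaining writes as best-effort) can be
sketched as follows. `PostSuccessSketch` and its delegate parameters are
hypothetical names chosen for illustration; the real method works against the
repository rather than delegates.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PostSuccessSketch
{
    // Returns the names of the steps that completed, in order.
    public static async Task<IReadOnlyList<string>> CompleteDeploymentAsync(
        Func<Task> commitSuccessStatus,
        Func<Task> updateInstanceState,
        Func<Task> storeSnapshot)
    {
        var completed = new List<string>();

        // 1. The site has confirmed the apply, so persist Success before
        //    anything that could fail afterwards.
        await commitSuccessStatus();
        completed.Add("status");

        // 2. Best-effort side effects: a failure here is logged for operator
        //    reconciliation and does not turn the committed Success into Failed.
        try
        {
            await updateInstanceState();
            completed.Add("instance-state");
            await storeSnapshot();
            completed.Add("snapshot");
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine(
                $"Post-success side effect failed after a confirmed site apply: {ex.Message}");
        }

        return completed;
    }
}
```

If `storeSnapshot` fails in this sketch, the caller still sees `status` (and
`instance-state`) as completed, which mirrors the intent that central never
reports a failed deployment for a config the site is already running.
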
### DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319` |

**Description**

In `DeleteInstanceAsync`, when the site responds `Success` the code calls
`_repository.DeleteInstanceAsync` then `SaveChangesAsync`. If `SaveChangesAsync`
throws (DB error, concurrency), the exception propagates uncaught: the site has
already destroyed the Instance Actor and removed its config, but the central
instance record still exists. The instance is now un-deletable through the
normal path (the site no longer has it, so a re-issued delete may fail) and is
permanently orphaned. The design states central must not mark the instance
deleted until the site confirms, but it does not address the inverse failure.

**Verification:** Confirmed against source. `DeleteInstanceAsync` has no
`try`/`catch` around the post-success block, so any exception from
`DeleteInstanceAsync`/`SaveChangesAsync` escapes uncaught to the caller.

**Recommendation**

Catch persistence failures in the post-success block and surface a distinct
error indicating the site succeeded but the central record could not be
removed, so an operator/retry can reconcile. Consider making the central delete
idempotent and retryable independently of the site command.

**Resolution**

Resolved 2026-05-16 (commit pending): the post-success removal in
`DeleteInstanceAsync` (`DeleteInstanceAsync` + `SaveChangesAsync`) is now
wrapped in a `try`/`catch`. A persistence failure no longer escapes uncaught:
it is logged, recorded with a `DeleteOrphaned` audit entry, and surfaced as a
distinct `Result` failure stating the site deleted the instance but the central
record is orphaned and must be reconciled. Regression test:
`DeleteInstanceAsync_SiteSucceeds_CentralDeleteFails_ReturnsDistinctFailure`.

### DeploymentManager-005 — `OperationLockManager` leaks a `SemaphoreSlim` per instance name

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33` |

**Description**

`AcquireAsync` does `_locks.GetOrAdd(instanceUniqueName, _ => new
SemaphoreSlim(1, 1))` and entries are never removed. Every distinct instance
unique name that is ever deployed/disabled/enabled/deleted permanently adds a
`SemaphoreSlim` (an `IDisposable` holding a kernel wait handle) to the
dictionary. Over the lifetime of a long-running central process (especially
with the bulk "deploy all out-of-date instances" workflow and instances that
are created and deleted over time) this is an unbounded leak of both managed
memory and OS handles. Deleted instances' semaphores are never reclaimed.

**Verification:** Confirmed against source. `_locks` is a `ConcurrentDictionary`
with no removal path anywhere in the type.

**Recommendation**

Either accept the leak explicitly and document the expected bounded cardinality
of instance names, or implement reclamation: e.g. ref-count handles, then remove
and `Dispose()` the semaphore when the count reaches zero and the lock is free.
At minimum, remove the semaphore entry when an instance is deleted
(`DeleteInstanceAsync`).

**Resolution**

Resolved 2026-05-16 (commit pending): `OperationLockManager` now ref-counts each
lock entry. A reference is reserved (creating the entry if needed) before the
`SemaphoreSlim.WaitAsync`, so concurrent waiters for the same instance share one
semaphore and the entry survives until every waiter/holder has released. When
the reference count reaches zero (on release, timeout, or cancellation) the
entry is removed from the dictionary and the semaphore is `Dispose()`d, so the
process no longer accumulates one kernel wait handle per distinct instance name.
A `TrackedLockCount` diagnostic property was added to make reclamation testable.
Regression tests: `AcquireAsync_ReleasedLock_RemovesSemaphoreEntry`,
`AcquireAsync_ManyDistinctInstances_DoesNotAccumulateSemaphores`,
`AcquireAsync_ContendedLock_KeepsSemaphoreUntilLastReleaseThenReclaims`.

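A ref-counted per-key lock of the kind described in the DeploymentManager-005
resolution might look like the sketch below. This is an independent
illustration, not the module's `OperationLockManager`: the class name
`RefCountedLockManager`, the plain `Dictionary` guarded by a `lock`, and the
`Releaser` handle are assumptions made to keep the example short. Only
`TrackedLockCount` is a name taken from the review.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class RefCountedLockManager
{
    private sealed class Entry
    {
        public readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);
        public int RefCount;
    }

    private readonly Dictionary<string, Entry> _entries = new Dictionary<string, Entry>();
    private readonly object _gate = new object();

    // Diagnostic: how many instance names currently have a live entry.
    public int TrackedLockCount
    {
        get { lock (_gate) { return _entries.Count; } }
    }

    public async Task<IDisposable> AcquireAsync(
        string instanceName, TimeSpan timeout, CancellationToken cancellationToken = default)
    {
        Entry entry;
        lock (_gate)
        {
            // Reserve a reference BEFORE waiting so the entry cannot be
            // reclaimed while this caller is still queued on the semaphore.
            if (!_entries.TryGetValue(instanceName, out var existing))
            {
                existing = new Entry();
                _entries[instanceName] = existing;
            }
            existing.RefCount++;
            entry = existing;
        }

        try
        {
            if (!await entry.Semaphore.WaitAsync(timeout, cancellationToken))
                throw new TimeoutException($"Timed out waiting for the lock on '{instanceName}'.");
            return new Releaser(this, instanceName, entry);
        }
        catch
        {
            // Timeout or cancellation: return the reserved reference only.
            DropReference(instanceName, entry, releaseSemaphore: false);
            throw;
        }
    }

    private void DropReference(string instanceName, Entry entry, bool releaseSemaphore)
    {
        lock (_gate)
        {
            if (releaseSemaphore) entry.Semaphore.Release();
            if (--entry.RefCount == 0)
            {
                // No holder and no waiter left: reclaim the semaphore.
                _entries.Remove(instanceName);
                entry.Semaphore.Dispose();
            }
        }
    }

    private sealed class Releaser : IDisposable
    {
        private readonly RefCountedLockManager _owner;
        private readonly string _instanceName;
        private readonly Entry _entry;
        private int _disposed;

        public Releaser(RefCountedLockManager owner, string instanceName, Entry entry)
        {
            _owner = owner;
            _instanceName = instanceName;
            _entry = entry;
        }

        public void Dispose()
        {
            // Guard against double-dispose releasing the semaphore twice.
            if (Interlocked.Exchange(ref _disposed, 1) == 0)
                _owner.DropReference(_instanceName, _entry, releaseSemaphore: true);
        }
    }
}
```

The key design point is that the reference count covers waiters as well as
the holder, so disposal can only happen when nobody can still touch the
semaphore.
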
### DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented

| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:84-200,363-368` |

**Description**

The design ("Deployment Identity & Idempotency") requires: "After a central
failover or timeout, the Deployment Manager queries the site for current
deployment state before allowing a re-deploy. This prevents duplicate
application and out-of-order config changes." The code never does this.
`GetDeploymentStatusAsync` only reads the local `DeploymentRecord` from the DB
(`GetDeploymentByDeploymentIdAsync`); it does not contact the site.
`DeployInstanceAsync` unconditionally generates a new deployment ID and sends a
new `DeployInstanceCommand` regardless of any prior in-flight or timed-out
deployment. After a timeout where the site actually applied the config, a
re-deploy produces a second deployment with no reconciliation against the
site's current revision hash. Site-side stale-rejection is the only safety
net, and that is not verified here.

**Recommendation**

Add a site query (a new `CommunicationService` pattern returning the site's
currently-applied deployment ID / revision hash) and call it before re-deploy
when a prior record for the instance is in `InProgress`/`Failed` due to
timeout. Reconcile: if the site already has the target revision, mark the prior
record `Success` instead of re-sending. Either implement this or update the
design doc to reflect that reconciliation is delegated entirely to site-side
stale-rejection.

**Resolution**

Resolved 2026-05-16 (commit `<pending>`): implemented the cross-module
query-the-site-before-redeploy idempotency feature across Commons, SiteRuntime,
Communication, and DeploymentManager. The change adds new
`DeploymentStateQueryRequest` / `DeploymentStateQueryResponse` contracts, a
`DeploymentManagerActor` handler answering from the site's deployed-config
store, a `CommunicationService.QueryDeploymentStateAsync` method routed over the
ClusterClient command/control transport, and reconciliation in
`DeployInstanceAsync` (`TryReconcileWithSiteAsync`) that queries the site only
when a prior record is `InProgress` or `Failed` due to a timeout, marks the
prior record `Success` without re-sending if the site already has the target
revision hash, and falls through to a normal deploy (relying on site-side
stale-rejection) when the query fails. Regression tests:
`RoundTrip_DeploymentStateQueryRequest_Succeeds`,
`RoundTrip_DeploymentStateQueryResponse_Deployed_Succeeds`,
`RoundTrip_DeploymentStateQueryResponse_NotDeployed_NullApplied`,
`DeploymentStateQuery_DeployedInstance_ReturnsAppliedIdentity`,
`DeploymentStateQuery_UnknownInstance_ReturnsNotDeployed`,
`DeploymentStateQuery_ForwardedToDeploymentManager`,
`QueryDeploymentStateAsync_BeforeInitialization_Throws`,
`QueryDeploymentStateAsync_SendsEnvelopeAndReturnsResponse`,
`DeployInstanceAsync_PriorInProgressRecord_SiteHasTargetHash_MarksSuccessWithoutRedeploy`,
`DeployInstanceAsync_PriorInProgressRecord_SiteHasDifferentHash_ProceedsWithDeploy`,
`DeployInstanceAsync_PriorFailedTimeoutRecord_QueriesSite`,
`DeployInstanceAsync_PriorSuccessRecord_SkipsSiteQuery`,
`DeployInstanceAsync_FreshFirstTimeDeploy_SkipsSiteQuery`,
`DeployInstanceAsync_PriorInProgressRecord_QueryFails_FallsThroughToDeploy`.

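The reconciliation rules listed in the DeploymentManager-006 resolution reduce
to a small decision function, sketched here under stated assumptions. The
enums `PriorStatus` and `RedeployDecision` and the class `ReconcileSketch` are
hypothetical; in particular, how the real code distinguishes a timeout-caused
`Failed` record from other failures is not shown in this review, so the sketch
models it as a separate enum value.

```csharp
#nullable enable
using System;

// Hypothetical classification of the most recent prior record for an instance.
public enum PriorStatus { None, InProgress, FailedTimeout, FailedOther, Success }

public enum RedeployDecision { Deploy, MarkPriorSuccess }

public static class ReconcileSketch
{
    // Only an interrupted or timed-out prior deployment justifies asking the site.
    public static bool ShouldQuerySite(PriorStatus prior) =>
        prior == PriorStatus.InProgress || prior == PriorStatus.FailedTimeout;

    // siteRevisionHash is null when the site query failed or the site reports
    // that the instance is not deployed.
    public static RedeployDecision Decide(
        PriorStatus prior, string targetRevisionHash, string? siteRevisionHash)
    {
        if (!ShouldQuerySite(prior))
            return RedeployDecision.Deploy;

        // The site already runs the target revision: reconcile, do not re-send.
        return string.Equals(siteRevisionHash, targetRevisionHash, StringComparison.Ordinal)
            ? RedeployDecision.MarkPriorSuccess
            : RedeployDecision.Deploy; // falls through to a normal deploy
    }
}
```

Every branch that does not end in `MarkPriorSuccess` falls through to a normal
deploy, which matches the resolution's reliance on site-side stale-rejection
as the remaining safety net.
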
### DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406` |

**Description**

The design ("Diff View" and "Dependencies" sections) states the Deployment
Manager can request a diff from the Template Engine showing added/removed
members, changed values, and connection-binding changes.
`GetDeploymentComparisonAsync` and `DeploymentComparisonResult` only compare two
revision hashes and return a boolean `IsStale` plus the two hashes. No
added/removed/changed detail is produced, and the Template Engine's diff
capability is not invoked. The UI cannot render a meaningful diff from this
result.

**Verification:** Confirmed against source. The Template Engine already provides
`DiffService` + `ConfigurationDiff` (structured Added/Removed/Changed entries
for attributes, alarms, and scripts, including data connection binding fields),
and `DiffService` is DI-registered; it was simply never wired into the
Deployment Manager's comparison path.

**Recommendation**

Either implement a real diff (deserialize the stored
`DeployedConfigSnapshot.ConfigurationJson` and the freshly flattened config and
invoke the Template Engine's diff service, surfacing structured
added/removed/changed entries), or revise the design doc to scope the feature
down to staleness detection only.

**Resolution**

Resolved 2026-05-16 (commit pending): `GetDeploymentComparisonAsync` now
deserializes the stored `DeployedConfigSnapshot.ConfigurationJson` and runs the
Template Engine `DiffService` against the freshly flattened current
configuration, attaching the resulting `ConfigurationDiff` (added/removed/changed
attributes, alarms, scripts) to a new optional `Diff` property on
`DeploymentComparisonResult`. `DiffService` is injected into `DeploymentService`.
A snapshot that cannot be deserialized (corrupt / older schema) still yields the
hash-based staleness result with a null diff, logged at warning level.
Regression test: `GetDeploymentComparisonAsync_ProducesStructuredDiff`.

### DeploymentManager-008 — `DeploymentManagerOptions` is never bound to configuration

| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14` |

**Description**

`AddDeploymentManager` registers the services but never calls
`services.Configure<DeploymentManagerOptions>(configuration.GetSection(...))`.
`IOptions<DeploymentManagerOptions>` therefore always resolves to a
default-constructed instance, so the operation-lock and artifact-deployment
timeouts cannot be tuned via `appsettings.json`, contrary to the CLAUDE.md
convention "Per-component configuration via `appsettings.json` sections bound
to options classes (Options pattern)." `Host/Program.cs` binds
`SecurityOptions` and `InboundApiOptions` from configuration sections but has
no equivalent for `DeploymentManagerOptions`.

**Verification:** Confirmed against source. Neither `AddDeploymentManager` nor
`Host/Program.cs` binds `DeploymentManagerOptions`.

**Recommendation**

Add an `IConfiguration` parameter (or a configure callback) to
`AddDeploymentManager` and bind `DeploymentManagerOptions` to a section such as
`ScadaLink:DeploymentManager`, consistent with the other components.

**Resolution**

Resolved 2026-05-16 (commit pending): `AddDeploymentManager()` now calls
`services.AddOptions<DeploymentManagerOptions>()` so `IOptions<DeploymentManagerOptions>`
is always resolvable, and `Host/Program.cs` binds the
`ScadaLink:DeploymentManager` section (exposed as
`ServiceCollectionExtensions.OptionsSection`) via
`services.Configure<DeploymentManagerOptions>(...)`, the same pattern the Host
uses for `SecurityOptions`/`InboundApiOptions`. An earlier attempt added an
`AddDeploymentManager(IConfiguration)` overload; that was reverted because the
project convention (enforced by `Host.Tests.OptionsTests`) prohibits component
`Add*` methods from depending on `IConfiguration`; the Host owns
configuration binding. Regression tests:
`AddDeploymentManager_RegistersResolvableOptions_WithDefaults`,
`AddDeploymentManager_OptionsBindToConfigurationSection_AsTheHostWires`,
`OptionsSection_MatchesTheConventionalComponentSectionPath`.

### DeploymentManager-009 — Misleading timeout comment on `DeleteInstanceAsync`

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:288` |

**Description**

The XML doc says "Delete fails if site unreachable (30s timeout via
CommunicationOptions)." The actual delete timeout is whatever
`CommunicationOptions.LifecycleTimeout` is configured to (passed inside
`CommunicationService.DeleteInstanceAsync`); the "30s" figure is hard-coded
into the comment and not derived from any constant in this module. If
`LifecycleTimeout` is reconfigured, the comment becomes wrong. It also wrongly
implies the value lives in this module.

**Verification:** Confirmed against source. The `DeleteInstanceAsync` XML doc
quoted a hard-coded "30s" value.

**Recommendation**

Reword to "Delete fails if the site is unreachable within
`CommunicationOptions.LifecycleTimeout`" without quoting a specific number.

**Resolution**

Resolved 2026-05-16 (commit pending): the `DeleteInstanceAsync` XML doc no
longer quotes a hard-coded "30s"; it now states delete fails if the site is
unreachable within `CommunicationOptions.LifecycleTimeout` (and notes the
deadline is applied inside `CommunicationService.DeleteInstanceAsync`).
Documentation-only change; no regression test (a test asserting comment text
would be meaningless).

### DeploymentManager-010 — `SystemArtifactDeploymentRecord` does not persist the deployment ID

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:136,194-211` |

**Description**

`DeployToAllSitesAsync` generates a `deploymentId` (line 136) and returns it in
the `ArtifactDeploymentSummary` and audit log, but the persisted
`SystemArtifactDeploymentRecord` has no field for it (the entity only has `Id`,
`ArtifactType`, `DeployedBy`, `DeployedAt`, `PerSiteStatus`). The deployment ID
that appears in the UI summary and audit log cannot be correlated back to the
stored record. Additionally, each per-site `DeployArtifactsCommand` carries its
own separate GUID (`BuildDeployArtifactsCommandAsync` line 114), so there are in
fact N+1 unrelated IDs for one logical artifact deployment.

**Verification:** Confirmed against source. Each per-site command minted its own
GUID and the persisted record had no way to reference the logical id.

**Recommendation**

Add a `DeploymentId` column to `SystemArtifactDeploymentRecord` and store the
single logical `deploymentId`; reuse that ID (or a derived per-site ID) for the
per-site commands so the audit log, UI summary, and persisted record agree.

**Resolution**

Resolved 2026-05-16 (commit pending): `BuildDeployArtifactsCommandAsync` now
accepts an optional `deploymentId`, and `DeployToAllSitesAsync` passes the one
logical `deploymentId` to every per-site command, so the per-site commands,
the audit log, and the UI summary all reference a single id instead of N+1
unrelated GUIDs (`RetryForSiteAsync`, an independent single-site retry, still
mints its own id). Adding a dedicated `DeploymentId` *column* to
`SystemArtifactDeploymentRecord` was deliberately **not** done: that entity
lives in `ScadaLink.Commons` with its EF mapping in
`ScadaLink.ConfigurationDatabase`, both outside this module's edit scope.
Instead the logical `deploymentId` is embedded in the record's free-form
`PerSiteStatus` JSON payload (`{ DeploymentId, Sites }`), which is fully within
this module's control, so the persisted record is correlatable with the
summary/audit. A follow-up to promote it to a first-class column should be
filed against Commons/ConfigurationDatabase if a queryable index is needed.
Regression tests: `DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId`,
`DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix`,
`RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits`.

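To illustrate how a logical deployment id can travel inside the free-form
`PerSiteStatus` JSON, here is a minimal sketch using `System.Text.Json`. Only
the top-level `{ DeploymentId, Sites }` shape comes from the resolution above;
the `PerSiteStatusPayload` record, the `PerSiteStatusSketch` helper, and the
assumption that `Sites` maps a site name to a status string are hypothetical.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical payload shape: the inner layout of Sites is an assumption.
public sealed record PerSiteStatusPayload(string DeploymentId, Dictionary<string, string> Sites);

public static class PerSiteStatusSketch
{
    public static string Serialize(string deploymentId, Dictionary<string, string> sites) =>
        JsonSerializer.Serialize(new PerSiteStatusPayload(deploymentId, sites));

    // Returns null for payloads written before the id was embedded, so older
    // records remain readable.
    public static string? ReadDeploymentId(string perSiteStatusJson)
    {
        using var doc = JsonDocument.Parse(perSiteStatusJson);
        var root = doc.RootElement;
        return root.ValueKind == JsonValueKind.Object
            && root.TryGetProperty("DeploymentId", out var id)
            && id.ValueKind == JsonValueKind.String
                ? id.GetString()
                : null;
    }
}
```

Embedding the id this way keeps the record correlatable without a schema
change, at the cost that the id cannot be indexed or queried directly, which
is the trade-off the resolution notes for a possible follow-up.
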
### DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path

| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199` |

**Description**

`DeploymentServiceTests` never sets the `CommunicationService` actor, so every
deploy/lifecycle test deliberately stops at the `InvalidOperationException`
thrown by `GetCommunicationActor()` (see lines 118-125, 147). As a result there
is no test covering: a successful deployment (`DeploymentStatus.Success`
response → instance state set to `Enabled`, snapshot stored, audit logged); a
failed-but-handled site response; the `InProgress`-stuck bug
(DeploymentManager-001); successful Disable/Enable/Delete; or the operation
lock actually serializing two concurrent deploys of the same instance. The
critical post-response branch (`DeploymentService.cs:154-184`) and the entire
delete/disable/enable success path are untested. The `AuditLogs` test
(lines 277-289) asserts nothing.

**Verification:** Partially confirmed. By the time this finding was being
resolved, the DeploymentManager-006 fix had already introduced a TestKit-actor
seam (`CreateServiceWithCommActor` + `ReconcileProbeActor`) and successful-deploy
tests. The genuinely-still-missing coverage was: successful Disable/Enable/Delete
paths, per-instance lock serialization during deploy, and the assertionless
`AuditLogs` test — those gaps were addressed.

**Recommendation**

Introduce a seam to inject a fake/substitute communication path (e.g. an
interface over `CommunicationService`, or wire a TestKit actor) so success and
handled-failure paths can be unit tested. Add tests for the stuck-`InProgress`
scenario and for per-instance lock contention during deploy. Make the audit
test assert on `IAuditService.LogAsync`.

**Resolution**

Resolved 2026-05-16 (commit pending): extended the TestKit-actor seam
(`ReconcileProbeActor` now also answers lifecycle commands) and added the
missing coverage — successful Disable/Enable/Delete (state transition + audit
assertions), a successful-deploy audit assertion, and per-instance lock
serialization via a new deferred-reply `SerializationProbeActor` that asserts a
single instance's concurrent deploys never overlap. The assertionless `AuditLogs`
test was replaced with `DeployInstanceAsync_FlatteningFails_DoesNotReachAudit`,
which asserts on `IAuditService.LogAsync`. Regression tests:
`DisableInstanceAsync_SiteSucceeds_SetsDisabledStateAndAudits`,
`EnableInstanceAsync_SiteSucceeds_SetsEnabledStateAndAudits`,
`DeleteInstanceAsync_SiteSucceeds_RemovesRecordAndAudits`,
`DeployInstanceAsync_SiteSucceeds_WritesDeployAuditEntry`,
`DeployInstanceAsync_FlatteningFails_DoesNotReachAudit`,
`DeployInstanceAsync_SameInstance_OperationLockSerializesConcurrentDeploys`.

### DeploymentManager-012 — `LifecycleCommandTimeout` option is dead code

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentManagerOptions.cs:8-9` |

**Description**

`DeploymentManagerOptions.LifecycleCommandTimeout` is declared with a 30s
default and an XML doc, but it is never read anywhere in the codebase
(lifecycle commands rely on `CommunicationOptions.LifecycleTimeout` inside
`CommunicationService`). The option misleads readers into thinking it controls
disable/enable/delete timeouts, when setting it has no effect.

**Verification:** Confirmed against source. A repo-wide grep found exactly one
occurrence of `LifecycleCommandTimeout` — the declaration itself.

**Recommendation**

Remove `LifecycleCommandTimeout`, or actually thread it through to the
lifecycle command calls (e.g. by creating a linked CTS with this timeout in
`DisableInstanceAsync`/`EnableInstanceAsync`/`DeleteInstanceAsync`, the way
`ArtifactDeploymentTimeoutPerSite` is used).

**Resolution**

Resolved 2026-05-16 (commit pending): `LifecycleCommandTimeout` is now actually
threaded through (the option exists for tuning, so it was wired up rather than
deleted). `DisableInstanceAsync`/`EnableInstanceAsync`/`DeleteInstanceAsync`
each create a linked `CancellationTokenSource` with
`CancelAfter(_options.LifecycleCommandTimeout)` — the same pattern
`ArtifactDeploymentService` uses for `ArtifactDeploymentTimeoutPerSite` — and
pass its token to the `CommunicationService` call. Each method now catches the
resulting `TimeoutException`/`OperationCanceledException`, logs a warning, and
returns a `Result.Failure` (previously an `AskTimeoutException` from a hung site
escaped uncaught). The option's XML doc was corrected to describe the real
behaviour.
Regression test:
`DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait`
(asserts a 300 ms `LifecycleCommandTimeout` bounds the wait far below the 30 s
`CommunicationOptions.LifecycleTimeout`; confirmed to fail before the fix —
the call hung the full 30 s and threw `AskTimeoutException`).

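The linked-CTS pattern named in the DeploymentManager-012 resolution can be sketched in isolation as follows. This is a minimal illustration, not the module's code: `LifecycleTimeoutSketch`, `RunBoundedAsync`, and the string return value are assumptions standing in for the real lifecycle methods and their `Result` type.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LifecycleTimeoutSketch
{
    // Runs a site call under a timeout that is linked to the caller's token.
    // Returns null on success, or a failure message if the wait was cut short.
    public static async Task<string?> RunBoundedAsync(
        Func<CancellationToken, Task> siteCall,
        TimeSpan lifecycleCommandTimeout,
        CancellationToken callerToken)
    {
        // The linked source cancels when either the caller cancels or the timeout elapses.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(lifecycleCommandTimeout);
        try
        {
            await siteCall(cts.Token);
            return null;
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            return callerToken.IsCancellationRequested
                ? "Lifecycle command cancelled by caller."
                : $"Lifecycle command timed out after {lifecycleCommandTimeout.TotalMilliseconds} ms.";
        }
    }
}
```

Checking `callerToken.IsCancellationRequested` in the handler is one way to tell a caller-initiated cancel from an elapsed timeout, since both surface as `OperationCanceledException` on the linked token.
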
### DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites

| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:108-111` |

**Description**

`BuildDeployArtifactsCommandAsync` maps `smtp.Credentials` directly into
`SmtpConfigurationArtifact` and that command is sent to every site. Distributing
SMTP credentials to sites is consistent with the design (SMTP configuration is
a deployable artifact), but the credentials travel inside a serialized command
across the inter-cluster transport and are stored on each site's SQLite. There
is no indication the value is encrypted at rest on the site or scrubbed from
logs. Worth confirming the transport is TLS-protected and the site stores the
credential securely; at minimum this should be a conscious, documented decision.

**Recommendation**

Confirm inter-cluster transport encryption covers artifact commands, ensure
`Credentials` is never written to logs, and document the at-rest protection of
SMTP credentials on site SQLite. Consider encrypting the credential field
within the artifact payload.

**Verification (2026-05-16):** Re-triaged against source. The DeploymentManager
side is **clean**: `ArtifactDeploymentService` maps `SmtpConfiguration.Credentials`
into the artifact (which the design explicitly mandates — SMTP configuration is
a deployable artifact) and **never logs it** — the three log statements in
`DeployToAllSitesAsync` only reference `SiteId`, `SiteName`, `DeploymentId`, and
`ex.Message`, never the credential. There is no defect to fix purely within
`src/ScadaLink.DeploymentManager`. The finding's remaining recommendations are
all cross-module and one needs a design decision:
- inter-cluster transport TLS — `ScadaLink.Communication` /
  `ScadaLink.ClusterInfrastructure` (Akka remoting + ClusterClient config);
- at-rest encryption of the credential on site SQLite — `ScadaLink.SiteRuntime`
  artifact store;
- encrypting the credential field inside the artifact payload — needs the
  `SmtpConfigurationArtifact` shape in `ScadaLink.Commons` plus cooperating
  producer (DeploymentManager) and consumer (SiteRuntime) changes, and a
  **key-management design decision** (where the encryption key lives, how it
  is distributed to sites) that cannot be made unilaterally here.

**Status: Open — flagged.** No purely-DeploymentManager fix exists; the work
crosses Communication / SiteRuntime / Commons and requires a key-management
design decision. Severity confirmed Low: with TLS-protected inter-cluster
transport (a separate, assumed-in-place control) and no logging leak, this is a
hardening item, not an active leak.

**Resolution**

Resolved 2026-05-16 (commit `<pending>`). Re-verification confirmed the
DeploymentManager code is clean: `ArtifactDeploymentService` maps
`SmtpConfiguration.Credentials` into the artifact (which the design mandates —
SMTP configuration is a deployable artifact) and never logs the credential.
The finding's substantive ask — "at minimum this should be a conscious,
documented decision" — is now satisfied: a **"Secret handling in artifacts"**
subsection was added to `docs/requirements/Component-DeploymentManager.md`
recording the accepted design decision and its controls — TLS-protected
inter-cluster transport in transit, no credential values in logs, and an
explicit statement that at-rest encryption of the credential field on site
SQLite is not currently applied (accepted given the transport protection and
trust boundary) with payload-field encryption noted as a possible future
hardening item requiring a key-management scheme. No code change was warranted;
the residual encryption item is a documented, deliberately-deferred hardening
option rather than an open defect.

### DeploymentManager-014 — Dead `CreateCommand` helper in artifact tests

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90` |

**Description**

The private static `CreateCommand()` helper is never referenced by any test in
the file. It is dead code that suggests an intended test (e.g. a successful
multi-site artifact deployment) was never written — coverage of
`DeployToAllSitesAsync` is limited to the no-sites failure case, and
`RetryForSiteAsync` and `BuildDeployArtifactsCommandAsync` have no tests at all.

**Verification:** Confirmed against source. The `CreateCommand()` helper had no
callers, and `DeployToAllSitesAsync`/`RetryForSiteAsync` only had the no-sites
failure case.

**Recommendation**

Either remove the unused helper or, preferably, write the missing tests for
`DeployToAllSitesAsync` (per-site success/failure matrix, partial failure) and
`RetryForSiteAsync` using it.

**Resolution**

Resolved 2026-05-16 (commit pending): took the recommendation's preferred
option — removed the dead `CreateCommand()` helper and wrote the missing
coverage instead. `ArtifactDeploymentServiceTests` now extends `TestKit` and
uses a stand-in `ArtifactProbeActor` (records the `DeployArtifactsCommand`s it
receives, replies success or, for a configured failure set, failure) so
`DeployToAllSitesAsync` and `RetryForSiteAsync` are exercised end-to-end past
the communication boundary. New tests:
`DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId` (also
covers DeploymentManager-010), `DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix`
(per-site success/failure matrix), `RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits`.

### DeploymentManager-015 — Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates

| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:631-655` |

**Description**

`TryReconcileWithSiteAsync` (the DeploymentManager-006 query-before-redeploy
path) handles the case where a prior `InProgress`/timeout-`Failed` record exists
and the site reports it already has the target revision hash. In that case it
marks the prior `DeploymentRecord` `Success`, audit-logs `DeployReconciled`, and
returns it — the caller then returns `Result.Success` and **never enters the
normal deploy body**.

The normal success path (`DeployInstanceAsync.cs:215-223`) does three things on
a successful site response: writes the deployment record terminal status, sets
`instance.State = InstanceState.Enabled` + `UpdateInstanceAsync`, and calls
`StoreDeployedSnapshotAsync`. The reconciliation shortcut performs only the
first. Consequently, after a reconciled deployment:

- The instance `State` is left at whatever it was (e.g. `NotDeployed` for a
  first-time deploy that timed out, or `Disabled`) even though the site is
  actually running the configuration — the central state machine and the site
  diverge, and a subsequent `DisableInstanceAsync`/`EnableInstanceAsync` will be
  rejected or allowed incorrectly by `StateTransitionValidator`.
- No `DeployedConfigSnapshot` is created or refreshed. A first-time deploy that
  is resolved purely by reconciliation leaves `GetDeploymentComparisonAsync`
  permanently returning `"No deployed snapshot found for this instance."`, and a
  redeploy reconciliation leaves the stored snapshot showing the *old* config
  even though the deployment record claims `Success` for the new revision.

The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the
deployed snapshot and instance state to reflect the last successful deployment;
the reconciliation path silently breaks both invariants.

**Recommendation**

In the reconciled-success branch of `TryReconcileWithSiteAsync`, perform the
same post-success side effects as the normal path: set `instance.State =
InstanceState.Enabled` (+ `UpdateInstanceAsync`) and call
`StoreDeployedSnapshotAsync` with the target deployment ID / revision hash /
config JSON. Factor the shared post-success logic into one helper so the normal
and reconciliation paths cannot drift. Add a regression test asserting that a
reconciled deployment leaves the instance `Enabled` and a snapshot stored.

**Resolution**

Resolved 2026-05-17 (commit pending): extracted the shared post-success side
effects into `ApplyPostSuccessSideEffectsAsync` (sets instance `State =
Enabled` + `UpdateInstanceAsync`, stores/refreshes the `DeployedConfigSnapshot`)
and invoked it from both the normal deploy success path and the
`TryReconcileWithSiteAsync` reconciled-success branch, so a reconciled
deployment now performs the same instance-state and snapshot updates as a
normal one (`configJson` is now computed before the reconciliation call and
threaded into `TryReconcileWithSiteAsync`). Regression test:
`DeployInstanceAsync_Reconciled_SetsInstanceEnabledAndStoresSnapshot`.

### DeploymentManager-016 — Reconciled prior record keeps its stale `RevisionHash`

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:639-651` |

**Description**

When `TryReconcileWithSiteAsync` reconciles a prior record, it mutates
`prior.Status`, `prior.ErrorMessage`, and `prior.CompletedAt`, but **not**
`prior.RevisionHash`. The reconciliation condition only compares the *site's*
`AppliedRevisionHash` against the *freshly-flattened* `targetRevisionHash` — it
does not require `prior.RevisionHash` to equal either of them.

The prior record can legitimately carry a different revision hash than the
current target: e.g. a deploy timed out at revision `R1`, the template was then
edited so the current flatten yields `R2`, and meanwhile the site actually
applied `R2` through some other path (or `R1` and `R2` are equal-by-content but
the prior record predates a hash recompute). After reconciliation the record's
`Status` is `Success` but its `RevisionHash` still says `R1`, so staleness
checks and any UI that reads `DeploymentRecord.RevisionHash` will report the
instance as deployed at the wrong revision. The audit `DeployReconciled` entry
records `RevisionHash = targetRevisionHash`, contradicting the persisted record.

**Recommendation**

In the reconciled-success branch, also set `prior.RevisionHash =
targetRevisionHash` so the persisted record, the audit entry, and the site's
actual applied revision all agree. Alternatively, only reconcile when
`prior.RevisionHash == targetRevisionHash` and otherwise fall through to a
normal deploy.

**Resolution**

Resolved 2026-05-17 (commit pending): the reconciled-success branch of
`TryReconcileWithSiteAsync` now also sets `prior.RevisionHash =
targetRevisionHash`, so the persisted record, the `DeployReconciled` audit
entry, and the site's actually-applied revision all agree. Regression test:
`DeployInstanceAsync_Reconciled_PriorRecordRevisionHashUpdatedToTarget`.

### DeploymentManager-017 — `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:562-570` |

**Description**

The XML summary on `GetDeploymentStatusAsync` reads: *"WP-2: After
failover/timeout, query site for current deployment state before
re-deploying."* The method body does no such thing — it is a one-line
pass-through to `_repository.GetDeploymentByDeploymentIdAsync`, a pure local DB
read. The query-the-site-before-redeploy behaviour the comment describes was
implemented separately in `TryReconcileWithSiteAsync` (DeploymentManager-006).
The stale comment is a leftover of the original design intent and misleads a
reader into thinking this method contacts the site.

**Recommendation**

Reword the summary to describe what the method actually does — "returns the
current persisted `DeploymentRecord` for the given deployment ID from the
configuration database" — and, if useful, cross-reference
`TryReconcileWithSiteAsync` as the place the site-query reconciliation lives.

**Resolution**

Resolved 2026-05-17 (commit pending): the `GetDeploymentStatusAsync` XML doc
now states it returns the persisted `DeploymentRecord` from the configuration
database as a pure local read, and cross-references `TryReconcileWithSiteAsync`
as where the query-the-site-before-redeploy reconciliation actually lives.
Documentation-only change; no regression test (a test asserting comment text
would be meaningless).

### DeploymentManager-018 — Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover

| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:675-682,721-748` |

**Description**

`TryReconcileWithSiteAsync` calls `ApplyPostSuccessSideEffectsAsync` whenever
the site reports it has the target revision hash, and that helper
unconditionally writes `instance.State = InstanceState.Enabled`. The
reconciliation shortcut only runs when the prior `DeploymentRecord` is
`InProgress` or timeout-`Failed` — exactly the scenarios that survive a central
failover (the in-memory `OperationLockManager` is lost on failover, by design:
*"Lost on central failover (acceptable per design — in-progress treated as
failed)"*).

After such a failover, the per-instance operation lock is gone but the
deployment record is still `InProgress` in the DB. A user can legitimately
issue `DisableInstanceAsync` for the same instance — there is nothing in
`DisableInstanceAsync` that consults the deployment record, only the
`StateTransitionValidator` over `Instance.State`. If the state is `Enabled`
(the typical case when the deploy started), the disable proceeds, the site
honours it (the design states a disabled instance retains its deployed
configuration), and central now persists `Instance.State = Disabled`. The
deployment-record row remains `InProgress` (no one transitioned it). Later the
user retries the deploy: `TryReconcileWithSiteAsync` runs, the site still has
the target revision hash (Disable doesn't change the deployed config), the
prior record is marked `Success`, and `ApplyPostSuccessSideEffectsAsync` writes
`Instance.State = Enabled` — silently overriding the user's explicit Disable.

The same trap exists for any direct DB edit / migration that flipped the state
between the timed-out deploy and the redeploy. The normal deploy path can
defensibly assume `Enabled` after a fresh successful apply, but the
reconciliation path is reconciling *prior* state with *prior* user intent; it
should preserve `Disabled` if that is the current `Instance.State` at the time
of reconciliation, mirroring the design's separation between deploy (config
apply) and disable (subscription/script lifecycle).

**Recommendation**

In the reconciliation branch, do not force `Enabled`. Either:
- Pass a flag/parameter to `ApplyPostSuccessSideEffectsAsync` telling it
  whether to touch state, and skip the state write on the reconciliation path
  (leaving the current `Instance.State` intact, which is already `Enabled`
  for a fresh deploy that timed out and `Disabled` for the user-disabled
  follow-up case); or
- Only set `Enabled` when the current `Instance.State` is `NotDeployed` (i.e.
  the first-deploy timed-out case), and leave existing `Enabled`/`Disabled`
  alone.

Add a regression test where an instance with `Instance.State = Disabled` and a
prior `InProgress` deployment record is reconciled — the resulting
`Instance.State` must remain `Disabled`, and the deployment record must still
be marked `Success`.

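The second option in the DeploymentManager-018 recommendation reduces to a small pure decision that is easy to unit test. The sketch below assumes a three-value `InstanceState` enum built from the member names used in this review; the enum declared here is a stand-in for the real type, and `ReconciliationStateSketch` is a hypothetical name.

```csharp
// Stand-in for the real InstanceState enum; member names are taken from this review.
public enum InstanceState { NotDeployed, Enabled, Disabled }

public static class ReconciliationStateSketch
{
    // On the reconciliation path, promote only a first-time deploy (NotDeployed)
    // to Enabled; an existing Enabled or Disabled state is left untouched.
    public static InstanceState StateAfterReconciliation(InstanceState current) =>
        current == InstanceState.NotDeployed ? InstanceState.Enabled : current;
}
```

Keeping the rule in one function means the regression test described above can assert `Disabled` stays `Disabled` without standing up the actor or repository fakes.
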
### DeploymentManager-019 — Lifecycle command timeout writes no audit entry

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458` |

**Description**

`DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync` each
wrap the `CommunicationService` call in a linked CTS with
`LifecycleCommandTimeout` (DeploymentManager-012). On timeout they log a
warning and `return Result<...>.Failure(...)` — and skip the
`_auditService.LogAsync` call entirely. As a result, an operator-initiated
disable/enable/delete that times out at the site leaves **no audit trail**:
the user, the timestamp, the command id, and the failure mode are not
recorded in the audit log. The deploy path goes out of its way to write a
`DeployFailed` audit entry on the same failure mode
(`DeploymentService.cs:274-276`), with `CancellationToken.None` so the write is
durable; the lifecycle commands do not.

The design lists audit logging as a Deployment Manager responsibility for "all
deployment actions, system-wide artifact deployments, and instance lifecycle
changes" — a timed-out lifecycle command **is** an attempted lifecycle change,
and the operator action is exactly the kind of event the audit log exists to
record.

**Recommendation**

In each of the three `catch (Exception ex) when (ex is TimeoutException or
OperationCanceledException)` blocks, write a `DisableTimeout`/`EnableTimeout`/
`DeleteTimeout` (or use the existing operation name with a failure flag)
audit entry with `CancellationToken.None` so a cancelled outer token does not
prevent the audit write, mirroring `DeployFailed`. Add a unit test asserting
that `DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait`
also produces an audit entry.

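A minimal sketch of the DeploymentManager-019 recommendation follows, showing why the audit write in the timeout handler takes `CancellationToken.None`. Everything here is illustrative: `IAuditSink` is a hypothetical stand-in (the real `IAuditService.LogAsync` signature is not shown in this review), and the `DisableTimeout` action name is one of the options the recommendation suggests.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the audit service; the real signature may differ.
public interface IAuditSink
{
    Task LogAsync(string user, string action, string detail, CancellationToken ct);
}

public sealed class InMemoryAuditSink : IAuditSink
{
    public List<string> Entries { get; } = new();

    public Task LogAsync(string user, string action, string detail, CancellationToken ct)
    {
        // Mimics a store that refuses to write once its token is cancelled.
        ct.ThrowIfCancellationRequested();
        Entries.Add($"{user}:{action}:{detail}");
        return Task.CompletedTask;
    }
}

public static class LifecycleAuditSketch
{
    // Returns true on success; on timeout or cancellation it records an
    // audit entry and returns false instead of failing silently.
    public static async Task<bool> DisableAsync(
        string user, string instanceName,
        Func<CancellationToken, Task> siteCall, IAuditSink audit, CancellationToken ct)
    {
        try
        {
            await siteCall(ct);
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            // CancellationToken.None: the outer token may already be cancelled,
            // and that must not prevent the audit write.
            await audit.LogAsync(user, "DisableTimeout", instanceName, CancellationToken.None);
            return false;
        }

        await audit.LogAsync(user, "Disable", instanceName, ct);
        return true;
    }
}
```

Passing the already-cancelled token to the audit call in the handler would throw again and lose the entry, which is the failure mode the recommendation guards against.
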
### DeploymentManager-020 — `DeployReconciled` audit attributes the action to the prior deployer, not the current user

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:683-686` |

**Description**

In `TryReconcileWithSiteAsync` the audit call is:

```
await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)
```

`prior.DeployedBy` is the user who issued the original (timed-out / stuck)
deployment, not the `user` parameter passed into `DeployInstanceAsync`. The
current user — the one who triggered the redeploy that produced the
reconciliation — is dropped on the floor. For audit forensics this is
misleading: the row will read "user A reconciled their own deployment"
when in fact user B initiated the action that reconciled it.

The original deployer is interesting context, but it should be carried in the
audit-detail object (where `DeploymentId` and `RevisionHash` already live), not
substituted for the actor.

**Recommendation**

Use `user` (the parameter on `DeployInstanceAsync`, threaded through
`TryReconcileWithSiteAsync`) as the audit actor, and include
`OriginalDeployer = prior.DeployedBy` in the detail object so the original
attribution is preserved without misrepresenting who took the action.

### DeploymentManager-021 — `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:107-111` |

**Description**

```
private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
{
    var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
    return site?.SiteIdentifier ?? siteId.ToString();
}
```

If the `Site` row is missing (FK was deleted, race with admin delete, DB
inconsistency), the method silently returns the numeric DB id rendered as a
string. This is then passed to `CommunicationService.{Deploy,Disable,Enable,
Delete}InstanceAsync` and `QueryDeploymentStateAsync` as if it were a real
`SiteIdentifier` (e.g. "site-a"). The communication layer will fail with an
"unknown site" or routing error, producing a confusing diagnostic that hides
the actual problem (no site row).

This is a defensive concern, but every mutating operation in the module goes
through this method, so a stale instance whose site was deleted will produce a
misleading error every time it is touched.

**Recommendation**

Treat a missing site as a hard validation failure: return a
`Result.Failure($"Site with ID {siteId} not found")` early from the calling
operations, instead of fabricating an identifier. The repository already
returns `Site?`, so the null path is type-visible; just don't paper over it.

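The DeploymentManager-021 recommendation can be sketched as a lookup that surfaces the missing row instead of inventing an identifier. The types below are minimal stand-ins, not the real ScadaLink entities: `Site` carries only the two fields this finding mentions, `SiteLookupResult` is a hypothetical substitute for the module's `Result` type, and the dictionary replaces the asynchronous repository call to keep the example self-contained.

```csharp
#nullable enable
using System.Collections.Generic;

// Minimal stand-ins; the real Site entity and Result type live elsewhere in ScadaLink.
public sealed record Site(int Id, string SiteIdentifier);

public sealed record SiteLookupResult(bool IsSuccess, string? SiteIdentifier, string? Error)
{
    public static SiteLookupResult Success(string id) => new(true, id, null);
    public static SiteLookupResult Failure(string error) => new(false, null, error);
}

public static class SiteResolutionSketch
{
    // A missing row becomes an explicit failure rather than a numeric id
    // masquerading as a SiteIdentifier.
    public static SiteLookupResult Resolve(IReadOnlyDictionary<int, Site> sites, int siteId) =>
        sites.TryGetValue(siteId, out var site)
            ? SiteLookupResult.Success(site.SiteIdentifier)
            : SiteLookupResult.Failure($"Site with ID {siteId} not found");
}
```

Callers would check `IsSuccess` before contacting the communication layer, so the error an operator sees names the missing site row directly.
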
### DeploymentManager-022 — `Pending` and `InProgress` are written back-to-back with no intervening work

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:178-194` |

**Description**

`DeployInstanceAsync` does:

```
record.Status = Pending;
AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
record.Status = InProgress;
UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
```

There is no work between the two writes — flattening, validation, and
reconciliation have already completed by line 174. The deploy command is sent
immediately after the `InProgress` write. The `Pending` write therefore costs:
an extra `SaveChangesAsync` round-trip, an extra `IDeploymentStatusNotifier`
invocation (which the CentralUI-006 page renders, so the user briefly sees a
`Pending` flicker before `InProgress`), and an extra row-version bump if EF
optimistic concurrency is enabled on the table.

The design uses `Pending` to mean "queued, not yet sent" and `InProgress` to
mean "sent to site, awaiting response". The code's `Pending` slot has no
queuing — it is set and immediately overwritten — so the state buys nothing
operationally.

**Recommendation**

Either:
- Drop the `Pending` write entirely and create the record directly in
  `InProgress` (one row insert, one notification, simpler UI); or
- Move the `Pending`→`InProgress` transition to bracket actual queueing/work
  (e.g. set `Pending` *before* flattening + reconciliation, set `InProgress`
  immediately before `DeployInstanceAsync` on the comm service) so the two
  states carry distinguishable semantics worth a separate write.

### DeploymentManager-023 — `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173` |

**Description**

`DeployToAllSitesAsync` loops over sites and calls
`BuildDeployArtifactsCommandAsync(site.Id, ...)` for each one. Of the six
artifact sets the method gathers, **only** `dataConnections` is per-site:

- `_templateRepo.GetAllSharedScriptsAsync` — global.
- `_externalSystemRepo.GetAllExternalSystemsAsync` — global, plus
  `GetMethodsByExternalSystemIdAsync` per external system per site.
- `_externalSystemRepo.GetAllDatabaseConnectionsAsync` — global.
- `_notificationRepo.GetAllNotificationListsAsync` — global.
- `_notificationRepo.GetAllSmtpConfigurationsAsync` — global.
- `_siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...)` — **per-site**.

With N sites this issues ≈ 5·N redundant queries on the global sets (plus
M·N method queries, where M is the external-system count). On a hub-and-spoke
deployment with many sites the artifact-deploy path is noticeably slower than
necessary and pins DbContext usage longer than needed. Per CLAUDE.md, the
DbContext is not thread-safe and the per-site commands are already built
sequentially (good); the redundant queries are sequential too, but the
network/round-trip cost is real.

**Recommendation**

Hoist the global queries (shared scripts, external systems + their methods,
DB connections, notification lists, SMTP configurations) out of
`BuildDeployArtifactsCommandAsync`, fetch them once in `DeployToAllSitesAsync`,
and pass them in alongside the site id (or expose a
`BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals)` overload).
`RetryForSiteAsync` (the single-site path) can keep the convenience-overload
behaviour. Add a test using NSubstitute's `.Received()` to assert
`_templateRepo.GetAllSharedScriptsAsync` is called exactly once for an
N-site deployment.

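The hoisting proposed in DeploymentManager-023 can be illustrated with a single global set. This sketch is deliberately reduced: `CountingScriptRepo`, `SiteCommand`, and `ArtifactPrefetchSketch` are hypothetical stand-ins for the real repositories and `DeployArtifactsCommand`, shared scripts are modelled as plain strings, and a hand-rolled call counter takes the place of the NSubstitute `.Received()` assertion so the example has no external dependencies.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in repository that counts how often the global set is queried.
public sealed class CountingScriptRepo
{
    public int Calls { get; private set; }

    public Task<IReadOnlyList<string>> GetAllSharedScriptsAsync()
    {
        Calls++;
        return Task.FromResult<IReadOnlyList<string>>(new[] { "script-a", "script-b" });
    }
}

// Hypothetical per-site command carrying the shared (site-independent) data.
public sealed record SiteCommand(int SiteId, IReadOnlyList<string> SharedScripts);

public static class ArtifactPrefetchSketch
{
    // Fetches the site-independent set once, then builds one command per site from it.
    public static async Task<IReadOnlyList<SiteCommand>> BuildForAllSitesAsync(
        CountingScriptRepo repo, IEnumerable<int> siteIds)
    {
        var sharedScripts = await repo.GetAllSharedScriptsAsync();
        return siteIds.Select(id => new SiteCommand(id, sharedScripts)).ToList();
    }
}
```

With this shape, the query count for the global sets stays constant as sites are added; only the genuinely per-site lookups would remain inside the loop.
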
### DeploymentManager-024 — Test probe actors hold mutable static state across tests

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075`, `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217` |

**Description**

`ReconcileProbeActor.QueryCount` / `DeployCount`, `SerializationProbeActor.MaxConcurrent`
/ `_current`, and `ArtifactProbeActor.Received` are all `static` fields.
Each test's actor constructor resets them, but reset-on-construction only
works as long as no two tests that use the same probe actor run concurrently.
xUnit runs the tests within a single class serially by default, so today's
tests pass. That safety is incidental: if a second test class starts using one
of these probe actors (separate classes run in parallel by default), or the
suite adopts a runner or setting that parallelizes tests within a class, the
counters race, and a deploy in test A could increment `DeployCount` while test B
is asserting on it.

Static state shared across tests is also why a flaky-test investigation here
will be unusually painful: the offending interaction is invisible from any
single test file.

**Recommendation**

Replace the static counters with instance state, hand the actor a probe
recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg`
in each test. Where the simpler counter shape is preferred, pass a
shared-state object into the actor's constructor so each test owns its own
instance — never reach for `static` mutable test state.

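The "shared-state object" variant of the DeploymentManager-024 recommendation might look like the sketch below. To stay free of an Akka dependency it uses a plain class, `FakeDeployProbe`, in place of a real probe actor; both that class and `ProbeCounters` are hypothetical names, and a real actor would receive the counters through its constructor in the same way.

```csharp
using System.Threading;

// Each test creates its own counters object and hands it to its probe; nothing is static.
public sealed class ProbeCounters
{
    private int _deployCount;

    public int DeployCount => Volatile.Read(ref _deployCount);

    // Interlocked keeps the count correct if the probe is hit from several threads.
    public void RecordDeploy() => Interlocked.Increment(ref _deployCount);
}

// Stand-in for a probe actor: a plain class so the sketch has no Akka dependency.
public sealed class FakeDeployProbe
{
    private readonly ProbeCounters _counters;

    public FakeDeployProbe(ProbeCounters counters) => _counters = counters;

    // Called where the real actor would handle a deploy command.
    public void OnDeployCommand() => _counters.RecordDeploy();
}
```

Because each test holds the only reference to its `ProbeCounters`, two tests running at the same time cannot observe each other's counts, regardless of how the test runner schedules them.
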