The largest themed batch — small mechanical fixes across 11 modules.
API / message hygiene:
- Comm-020: SiteAddressCacheLoaded now carries IReadOnlyDictionary /
IReadOnlyList — Akka messages must be immutable.
- Commons-016: BundleSession.MaxUnlockAttempts named constant replaces
magic 3.
- Commons-018: IOperationTrackingStore + IPartitionMaintenance moved from
Interfaces/ root to Interfaces/Services/ (namespace preserved — 9
consumers exceeded the in-prompt move threshold).
- Commons-023: TrackingStatusSnapshot.SourceNode now consistent with the
trailing-optional-with-default pattern used elsewhere.
- SR-022: AuditingDbCommand.DbConnection.set no longer uses reflection —
exposes AuditingDbConnection.Inner via internal API surface.
Dead code / config cleanup:
- ClusterInfra-011: decorative SectionName constant deleted.
- ClusterInfra-014: dead AddClusterInfrastructureActors method + its
"throws-when-called" test deleted.
- Host-021: Microsoft Logging:LogLevel block deleted from appsettings.json
(dead under Serilog).
Fail-loud over fail-silent:
- DM-021: ResolveSiteIdentifierAsync throws on missing site (was silently
substituting a DB id).
- DM-022: dropped transient Pending write — record now lands directly in
InProgress (no UI flicker, one fewer DB write).
- Host-020: LoggerConfigurationFactory emits a Console.Error warning when
both Serilog:MinimumLevel and ScadaLink:Logging:MinimumLevel are set
(ScadaLink remains truth per Host-011).
- SnF-022: NotifyCachedCallObserverAsync logs Warning on unparseable
TrackedOperationId (was silently dropping).
- SnF-023: empty siteId default replaced with $unknown-site sentinel
+ constructor normalisation.
Correctness:
- SCA-001: SupervisorStrategy XML rewritten to match actual
DefaultDecider/Restart semantics (was claiming Resume).
- SCA-003: OnUpsertAsync now restamps IngestedAtUtc on every upsert.
- SR-021: HandleDeployArtifacts now dispatches an internal
ApplyArtifactDataConnectionsToDcl message after the SQLite write so
system-wide artifact-deploy data-connection changes go live
immediately (was requiring a site restart).
- SnF-020: RetryParkedMessageAsync captures the parked row BEFORE the
local write so a concurrent delete can't skip standby replication.
Sentinels / naming collisions:
- HM-021: CentralSiteId changed from "central" to "$central"
(cannot collide: a leading $ is forbidden in real SiteIdentifiers).
Doc / surface cleanups:
- SEL-018: FailedWriteCount promoted to ISiteEventLogger; XML softened
to "Available for future Health Monitoring integration".
- SnF-019: VERIFY outcome — documented parking-after-DefaultMaxRetries
in Component-StoreAndForward.md + DefaultMaxRetries XML (uniform
cap; maxRetries:0 is the unbounded escape hatch).
- SnF-021: Component-StoreAndForward.md no longer claims the tracking
table lives in SnF — it's in SiteRuntime, the interface is in Commons.
- CLI-020: bundle export response parse guarded with try/catch on
JsonException / KeyNotFoundException / FormatException — emits a
clean INVALID_RESPONSE exit instead of a stack trace.
Config:
- ClusterInfra-013: intent comment added to "catastrophic config" test.
- Host-016: appsettings.Site.json second CentralContactPoints entry
removed (was pointing at the SITE's own port); doc-key explains how
to extend.
- Host-018: NodeName added to both shipped per-role configs (was
causing SourceNode to be null on audit rows).
UI:
- CentralUI-029: replaced JS.InvokeAsync<int>("eval", …) with an ES
module import (new wwwroot/js/browser-time.js).
- CentralUI-032: AuditResultsGrid gains a Previous button backed by a
cursor stack.
10+ new regression tests across the affected projects. Build clean;
all suites green. README regenerated: 6 open (was 33).
Session-to-date: 130 of 136 originally-open Theme findings closed.
Code Review — DeploymentManager
| Field | Value |
|---|---|
| Module | src/ScadaLink.DeploymentManager |
| Design doc | docs/requirements/Component-DeploymentManager.md |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | 1eb6e97 |
| Open findings | 0 |
Summary
The DeploymentManager module is small, well-structured, and clearly maps work
packages (WP-N) onto code. The happy paths for instance deployment, lifecycle
commands, artifact broadcast, and staleness comparison are implemented
sensibly, and the operation lock correctly serializes mutating operations per
instance while allowing cross-instance parallelism. However, the review found a
significant cluster of error-handling and resilience gaps: the deployment
record can be left permanently stuck in InProgress when an exception other
than timeout/cancellation is thrown, the catch block writes its failure status
using a cancellation token that may already be cancelled, and the
OperationLockManager leaks one SemaphoreSlim per instance name forever.
There are also two notable design-document adherence gaps: the
"query-the-site-before-redeploy" idempotency requirement is not implemented
(GetDeploymentStatusAsync only reads the local DB), and the "Diff View"
feature is reduced to a bare hash comparison with no added/removed/changed
detail. Configuration is not bound to appsettings.json, leaving one option
entirely dead. Test coverage stops at the communication boundary and never
exercises a successful deployment or the lifecycle success paths.
Re-review 2026-05-17 (commit 39d737e)
Re-reviewed at commit 39d737e after the batch of fixes for
DeploymentManager-001..014. All fourteen prior findings remain Resolved and
verified against source — the broadened catch, non-cancellable cleanup writes,
ref-counted OperationLockManager, query-before-redeploy reconciliation,
structured diff, options binding, and the expanded TestKit-actor test suite are
all present and correct. The module is in markedly better shape than the
first review: error paths are now defensively handled and test coverage is
broad (successful deploy/lifecycle, lock serialization, reconciliation
matrix, artifact per-site matrix).
This re-review found 3 new findings, all clustered on the
DeploymentManager-006 reconciliation path added since the last review. The
reconciliation shortcut (TryReconcileWithSiteAsync) marks a stale prior
record Success when the site already has the target revision, but it does
not perform the side effects the normal success path does — it never
updates the instance State, never refreshes the DeployedConfigSnapshot,
and never corrects the prior record's own RevisionHash (DeploymentManager-015,
DeploymentManager-016). The GetDeploymentStatusAsync XML doc is now stale —
it still describes the query-before-redeploy behaviour that actually moved into
TryReconcileWithSiteAsync (DeploymentManager-017).
Re-review 2026-05-28 (commit 1eb6e97)
Re-reviewed at commit 1eb6e97 after the DeploymentManager-015/016/017 fixes
and a docs-only XML-comment pass. The three prior findings remain Resolved
and verified — ApplyPostSuccessSideEffectsAsync is now invoked from both the
normal success path and TryReconcileWithSiteAsync, the reconciled-success
branch corrects prior.RevisionHash to the target, and GetDeploymentStatusAsync's
XML doc now describes the local-DB-read it actually performs and cross-refs the
reconciliation helper. The DiffService wiring, options binding, ref-counted
operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor
test seam are still in place. The 7 new findings here are not regressions in
the DeploymentManager-015/016 fixes — they are issues uncovered by widening
the lens to the lifecycle paths, reconciliation's interaction with
intentional Disabled state, audit semantics, and operational concerns
(per-site artifact-build cost, Pending→InProgress double-write).
The single notable correctness issue is DeploymentManager-018: the
reconciliation shortcut unconditionally sets instance.State = Enabled via
ApplyPostSuccessSideEffectsAsync. After a central failover that loses the
in-memory operation lock, a user can legitimately Disable an instance whose
prior deploy record is still InProgress; a subsequent redeploy then reconciles
and silently re-enables the instance against the user's explicit intent.
The remaining six findings are medium/low: lifecycle-timeout audit gap
(DeploymentManager-019), audit-user attribution in reconciliation
(DeploymentManager-020), silent fallback in ResolveSiteIdentifierAsync
(DeploymentManager-021), back-to-back Pending→InProgress writes
(DeploymentManager-022), per-site re-query of system-wide artifacts
(DeploymentManager-023), and shared static state across *ProbeActor tests
(DeploymentManager-024).
Checklist coverage
Re-review 2026-05-28 (commit 1eb6e97)
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ✓ | New: reconciliation forces Enabled even if the user disabled the instance in between (DeploymentManager-018). |
| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. No issues. |
| 3 | Concurrency & thread safety | ✓ | OperationLockManager ref-counting verified. Note: test probes hold static state (DeploymentManager-024) — a test concern, not production code. |
| 4 | Error handling & resilience | ✓ | New: Disable/Enable/Delete timeouts return early without writing any audit entry — deploy has DeployFailed, lifecycle has nothing (DeploymentManager-019). |
| 5 | Security | ✓ | No new issues. SMTP credential decision documented (DeploymentManager-013 closed). |
| 6 | Performance & resource management | ✓ | New: BuildDeployArtifactsCommandAsync re-queries every system-wide artifact set per site in DeployToAllSitesAsync (DeploymentManager-023). |
| 7 | Design-document adherence | ✓ | Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on Disabled-state preservation. |
| 8 | Code organization & conventions | ✓ | New: redundant Pending→InProgress back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in ResolveSiteIdentifierAsync (DeploymentManager-021). |
| 9 | Testing coverage | ✓ | New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024). |
| 10 | Documentation & comments | ✓ | New: DeployReconciled audit uses prior.DeployedBy instead of the current user parameter — misleading for forensics (DeploymentManager-020). |
Findings
DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in InProgress
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:141-199 |
Description
DeployInstanceAsync sets the record to InProgress (lines 137-139), then the
try block calls into CommunicationService and the repository. The only
catch filter is when (ex is TimeoutException or OperationCanceledException).
Any other exception — InvalidOperationException (thrown by
CommunicationService.GetCommunicationActor() when the actor is not set), a
JSON serialization error, a deserialization failure of the response, a DB
exception on UpdateDeploymentRecordAsync, or any transport error — escapes the
method. The deployment record remains in DeploymentStatus.InProgress
permanently. Because staleness and the UI both read current status, the
instance is then misreported as "deploying" forever and a re-deploy may be
blocked or misinterpreted. The design explicitly states an interrupted
deployment must be "treated as failed".
Recommendation
Broaden the catch to a general catch (Exception ex) that records
DeploymentStatus.Failed with the error message, audit-logs the failure, and
re-throws or returns a failed Result. Keep the timeout-specific branch only
if a distinct message is desired. Ensure the failure-status write happens for
every exit path out of the try.
Resolution
Resolved 2026-05-16 (commit <pending>): broadened the catch in
DeployInstanceAsync to catch (Exception ex) so any exception (transport,
serialization, DB, InvalidOperationException from an uninitialized
CommunicationService) marks the deployment record Failed with the error
message and audit-logs the failure, instead of escaping and leaving the record
stuck in InProgress. Regression test:
DeployInstanceAsync_CommunicationThrowsUnexpectedException_RecordMarkedFailed.
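The difference between the original filtered catch and the broadened one can be sketched as follows. This is a minimal illustration, not the module's code: DeploymentRecord is a simplified in-memory stand-in, and the sendToSite delegate and both method names are hypothetical.

```csharp
using System;

public enum DeploymentStatus { Pending, InProgress, Success, Failed }

// Hypothetical, simplified stand-in for the persisted deployment record.
public sealed class DeploymentRecord
{
    public DeploymentStatus Status { get; set; } = DeploymentStatus.Pending;
    public string ErrorMessage { get; set; } = "";
}

public static class BroadenedCatchSketch
{
    // Original shape: only timeout/cancellation is handled. Any other
    // exception escapes and the record keeps its InProgress status.
    public static void DeployWithFilteredCatch(DeploymentRecord record, Action sendToSite)
    {
        record.Status = DeploymentStatus.InProgress;
        try
        {
            sendToSite();
            record.Status = DeploymentStatus.Success;
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            MarkFailed(record, ex);
        }
    }

    // Broadened shape: every exception marks the record Failed.
    public static void DeployWithBroadCatch(DeploymentRecord record, Action sendToSite)
    {
        record.Status = DeploymentStatus.InProgress;
        try
        {
            sendToSite();
            record.Status = DeploymentStatus.Success;
        }
        catch (Exception ex)
        {
            MarkFailed(record, ex);
        }
    }

    private static void MarkFailed(DeploymentRecord record, Exception ex)
    {
        record.Status = DeploymentStatus.Failed;
        record.ErrorMessage = ex.Message;
    }
}
```

With the filtered catch, an InvalidOperationException thrown by sendToSite propagates to the caller and the record stays InProgress. With the broad catch, the same exception leaves the record Failed with the exception message attached.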
DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:186-196 |
Description
The catch (Exception ex) when (ex is TimeoutException or OperationCanceledException) block updates the record to Failed and calls
UpdateDeploymentRecordAsync/SaveChangesAsync/LogAsync passing the same
cancellationToken that was just cancelled (an OperationCanceledException
caught here means the token is already in the cancelled state). Those
repository and audit calls will themselves throw OperationCanceledException
before the failure status is persisted, so the record stays InProgress — the
exact bug DeploymentManager-001 describes, reached via the supposedly-handled
path.
Recommendation
Perform the cleanup writes with a fresh, non-cancellable token (e.g.
CancellationToken.None, optionally with an independent short timeout) so the
failure status is durably recorded even when the original operation was
cancelled or timed out.
Resolution
Resolved 2026-05-16 (commit <pending>): the broadened catch block now
performs the failure-status write (UpdateDeploymentRecordAsync,
SaveChangesAsync) and the audit LogAsync with CancellationToken.None
instead of the operation's (possibly-cancelled) token, so the Failed status
is durably recorded even after a timeout/cancellation. The cleanup writes are
themselves wrapped in a try/catch that logs (without masking the original
error) if persistence still fails. Regression test:
DeployInstanceAsync_FailureWrite_UsesNonCancellableToken.
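The token problem can be reproduced in isolation. The sketch below assumes a hypothetical SaveFailedStatusAsync that honours its token the way a repository write typically would; it is not the module's actual repository API.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class NonCancellableCleanupSketch
{
    // Hypothetical stand-in for a persistence call that honours its token.
    public static Task<bool> SaveFailedStatusAsync(CancellationToken token)
    {
        token.ThrowIfCancellationRequested();
        return Task.FromResult(true);
    }

    // Original shape: cleanup reuses the operation's token. If that token
    // is already cancelled, the write throws and the status is never saved.
    public static async Task<bool> CleanupWithOperationTokenAsync(CancellationToken operationToken)
    {
        try
        {
            return await SaveFailedStatusAsync(operationToken);
        }
        catch (OperationCanceledException)
        {
            return false; // failure status was not persisted
        }
    }

    // Fixed shape: cleanup uses CancellationToken.None, so the write runs
    // regardless of the state of the original token.
    public static Task<bool> CleanupWithFreshTokenAsync()
    {
        return SaveFailedStatusAsync(CancellationToken.None);
    }
}
```

If an upper bound on the cleanup is wanted, a separate CancellationTokenSource constructed with a short TimeSpan can supply the token instead of CancellationToken.None, as the recommendation suggests.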
DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170 |
Description
After a successful site response the code calls UpdateDeploymentRecordAsync
(no SaveChanges yet), then UpdateInstanceAsync, then
StoreDeployedSnapshotAsync (which itself issues Add/Update calls), then a
single SaveChangesAsync at line 170. If StoreDeployedSnapshotAsync throws,
the exception is not caught (see DeploymentManager-001) and the
SaveChangesAsync never runs — the instance state, deployment status, and
snapshot are all left unpersisted even though the site has actually applied the
deployment. Central and site are now divergent: the site is running the new
config but central still shows the old state and a non-Success deployment
record.
Verification: Confirmed against source. The DeploymentManager-001 fix made
this strictly worse, not better — after that fix a snapshot-store failure is
caught and the record is flipped from Success back to Failed, so central
reports a failed deployment while the site is running the new config.
Recommendation
Wrap the post-success persistence so that, at minimum, the deployment record's
Success status is committed. Consider committing the status first, then the
instance state and snapshot, so a later failure does not lose the fact that the
site succeeded. Log loudly if the snapshot write fails after a confirmed site
apply.
Resolution
Resolved 2026-05-16 (commit pending): DeployInstanceAsync now commits the
deployment record's terminal status (UpdateDeploymentRecordAsync +
SaveChangesAsync) immediately after the site confirms the apply, before
touching instance state or the deployed-config snapshot. The post-success
instance-state update and StoreDeployedSnapshotAsync are wrapped in a
best-effort try/catch that logs loudly for operator reconciliation but no
longer flips the already-committed Success record back to Failed.
Regression test:
DeployInstanceAsync_SiteSucceeds_SnapshotWriteFails_RecordStillCommittedSuccess.
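The commit-status-first ordering can be sketched as below. Persistence is modelled as a list of committed steps, and storeSnapshot is a hypothetical delegate standing in for the instance-state update and StoreDeployedSnapshotAsync; the real method works against the repository.

```csharp
using System;
using System.Collections.Generic;

public static class CommitStatusFirstSketch
{
    // Returns the steps that were durably committed, in order.
    public static List<string> OnSiteConfirmedApply(Action storeSnapshot)
    {
        var committed = new List<string>();

        // 1. Commit the terminal status immediately after the site confirms.
        committed.Add("record:Success");

        // 2. Best-effort side effects. A failure here is logged for operator
        //    reconciliation but does not undo the committed Success status.
        try
        {
            storeSnapshot();
            committed.Add("snapshot:Stored");
        }
        catch (Exception ex)
        {
            committed.Add("warning:" + ex.Message);
        }

        return committed;
    }
}
```

The point of the ordering is that a snapshot failure can no longer erase the fact that the site applied the deployment.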
DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319 |
Description
In DeleteInstanceAsync, when the site responds Success the code calls
_repository.DeleteInstanceAsync then SaveChangesAsync. If SaveChangesAsync
throws (DB error, concurrency), the exception propagates uncaught: the site has
already destroyed the Instance Actor and removed its config, but the central
instance record still exists. The instance is now un-deletable through the
normal path (the site no longer has it, so a re-issued delete may fail) and is
permanently orphaned. The design states central must not mark the instance
deleted until the site confirms — but it does not address the inverse failure.
Verification: Confirmed against source. DeleteInstanceAsync has no
try/catch around the post-success block, so any exception from
DeleteInstanceAsync/SaveChangesAsync escapes uncaught to the caller.
Recommendation
Catch persistence failures in the post-success block and surface a distinct error indicating the site succeeded but the central record could not be removed, so an operator/retry can reconcile. Consider making the central delete idempotent and retryable independently of the site command.
Resolution
Resolved 2026-05-16 (commit pending): the post-success removal in
DeleteInstanceAsync (DeleteInstanceAsync + SaveChangesAsync) is now
wrapped in a try/catch. A persistence failure no longer escapes uncaught —
it is logged, recorded with a DeleteOrphaned audit entry, and surfaced as a
distinct Result failure stating the site deleted the instance but the central
record is orphaned and must be reconciled. Regression test:
DeleteInstanceAsync_SiteSucceeds_CentralDeleteFails_ReturnsDistinctFailure.
DeploymentManager-005 — OperationLockManager leaks a SemaphoreSlim per instance name
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33 |
Description
AcquireAsync does _locks.GetOrAdd(instanceUniqueName, _ => new SemaphoreSlim(1, 1)) and entries are never removed. Every distinct instance
unique name that is ever deployed/disabled/enabled/deleted permanently adds a
SemaphoreSlim (an IDisposable that allocates a kernel wait handle only if its
AvailableWaitHandle is accessed) to the dictionary. Over the lifetime of a
long-running central process, especially with the bulk "deploy all out-of-date
instances" workflow and instances that are created and deleted over time, this
is an unbounded leak of managed memory and potentially of OS handles. Deleted
instances' semaphores are never reclaimed.
Verification: Confirmed against source. _locks is a ConcurrentDictionary
with no removal path anywhere in the type.
Recommendation
Either accept the leak explicitly and document the expected bounded cardinality of instance names, or implement reclamation: e.g. ref-count handles, then remove and
Dispose() the semaphore when the count reaches zero and the lock is free. At minimum, remove the semaphore entry when an instance is deleted (DeleteInstanceAsync).
Resolution
Resolved 2026-05-16 (commit pending): OperationLockManager now ref-counts each
lock entry. A reference is reserved (creating the entry if needed) before the
SemaphoreSlim.WaitAsync, so concurrent waiters for the same instance share one
semaphore and the entry survives until every waiter/holder has released. When
the reference count reaches zero (on release, timeout, or cancellation) the
entry is removed from the dictionary and the semaphore is Dispose()d, so the
process no longer accumulates one semaphore per distinct instance name.
A TrackedLockCount diagnostic property was added to make reclamation testable.
Regression tests: AcquireAsync_ReleasedLock_RemovesSemaphoreEntry,
AcquireAsync_ManyDistinctInstances_DoesNotAccumulateSemaphores,
AcquireAsync_ContendedLock_KeepsSemaphoreUntilLastReleaseThenReclaims.
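One way to implement the ref-counting described above is sketched below. It is an illustration under stated assumptions, not the module's actual OperationLockManager: the class name RefCountedLockManager, the Entry and Releaser helpers, and the plain lock around a Dictionary are hypothetical choices. Only the TrackedLockCount name is taken from the resolution.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class RefCountedLockManager
{
    private sealed class Entry
    {
        public readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);
        public int RefCount; // waiters plus the current holder
    }

    private readonly object _gate = new object();
    private readonly Dictionary<string, Entry> _locks = new Dictionary<string, Entry>();

    // Diagnostic: number of instance names that currently have an entry.
    public int TrackedLockCount
    {
        get { lock (_gate) { return _locks.Count; } }
    }

    public async Task<IDisposable> AcquireAsync(
        string instanceName, TimeSpan timeout, CancellationToken ct = default)
    {
        Entry entry;
        lock (_gate)
        {
            // Reserve a reference BEFORE waiting so the entry cannot be
            // reclaimed while this caller is still queued on the semaphore.
            if (!_locks.TryGetValue(instanceName, out entry))
            {
                entry = new Entry();
                _locks.Add(instanceName, entry);
            }
            entry.RefCount++;
        }

        bool acquired;
        try
        {
            acquired = await entry.Semaphore.WaitAsync(timeout, ct);
        }
        catch
        {
            // Cancellation: drop the reference without releasing the semaphore.
            Unreference(instanceName, entry, releaseSemaphore: false);
            throw;
        }

        if (!acquired)
        {
            Unreference(instanceName, entry, releaseSemaphore: false);
            throw new TimeoutException("Timed out waiting for lock '" + instanceName + "'.");
        }

        return new Releaser(this, instanceName, entry);
    }

    private void Unreference(string instanceName, Entry entry, bool releaseSemaphore)
    {
        lock (_gate)
        {
            if (releaseSemaphore) entry.Semaphore.Release();
            if (--entry.RefCount == 0)
            {
                // Last waiter/holder is gone: reclaim the entry.
                _locks.Remove(instanceName);
                entry.Semaphore.Dispose();
            }
        }
    }

    private sealed class Releaser : IDisposable
    {
        private readonly RefCountedLockManager _owner;
        private readonly string _name;
        private readonly Entry _entry;
        private int _released;

        public Releaser(RefCountedLockManager owner, string name, Entry entry)
        {
            _owner = owner;
            _name = name;
            _entry = entry;
        }

        public void Dispose()
        {
            // Guard against double-dispose releasing the semaphore twice.
            if (Interlocked.Exchange(ref _released, 1) == 0)
                _owner.Unreference(_name, _entry, releaseSemaphore: true);
        }
    }
}
```

Because a waiter increments the count before calling WaitAsync, the entry can only reach zero references when nobody holds or awaits the semaphore, which is what makes disposing it at that point safe.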
DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:84-200,363-368 |
Description
The design ("Deployment Identity & Idempotency") requires: "After a central
failover or timeout, the Deployment Manager queries the site for current
deployment state before allowing a re-deploy. This prevents duplicate
application and out-of-order config changes." The code never does this.
GetDeploymentStatusAsync only reads the local DeploymentRecord from the DB
(GetDeploymentByDeploymentIdAsync) — it does not contact the site.
DeployInstanceAsync unconditionally generates a new deployment ID and sends a
new DeployInstanceCommand regardless of any prior in-flight or timed-out
deployment. After a timeout where the site actually applied the config, a
re-deploy produces a second deployment with no reconciliation against the
site's current revision hash. Site-side stale-rejection is the only safety
net, and that is not verified here.
Recommendation
Add a site query (a new CommunicationService pattern returning the site's
currently-applied deployment ID / revision hash) and call it before re-deploy
when a prior record for the instance is in InProgress/Failed due to
timeout. Reconcile: if the site already has the target revision, mark the prior
record Success instead of re-sending. Either implement this or update the
design doc to reflect that reconciliation is delegated entirely to site-side
stale-rejection.
Resolution
Resolved 2026-05-16 (commit <pending>): implemented the cross-module
query-the-site-before-redeploy idempotency feature across Commons, SiteRuntime,
Communication, and DeploymentManager — new DeploymentStateQueryRequest /
DeploymentStateQueryResponse contracts, a DeploymentManagerActor handler
answering from the site's deployed-config store, a
CommunicationService.QueryDeploymentStateAsync method routed over the
ClusterClient command/control transport, and reconciliation in
DeployInstanceAsync (TryReconcileWithSiteAsync) that queries the site only
when a prior record is InProgress or Failed due to a timeout, marks the
prior record Success without re-sending if the site already has the target
revision hash, and falls through to a normal deploy (relying on site-side
stale-rejection) when the query fails. Regression tests:
RoundTrip_DeploymentStateQueryRequest_Succeeds,
RoundTrip_DeploymentStateQueryResponse_Deployed_Succeeds,
RoundTrip_DeploymentStateQueryResponse_NotDeployed_NullApplied,
DeploymentStateQuery_DeployedInstance_ReturnsAppliedIdentity,
DeploymentStateQuery_UnknownInstance_ReturnsNotDeployed,
DeploymentStateQuery_ForwardedToDeploymentManager,
QueryDeploymentStateAsync_BeforeInitialization_Throws,
QueryDeploymentStateAsync_SendsEnvelopeAndReturnsResponse,
DeployInstanceAsync_PriorInProgressRecord_SiteHasTargetHash_MarksSuccessWithoutRedeploy,
DeployInstanceAsync_PriorInProgressRecord_SiteHasDifferentHash_ProceedsWithDeploy,
DeployInstanceAsync_PriorFailedTimeoutRecord_QueriesSite,
DeployInstanceAsync_PriorSuccessRecord_SkipsSiteQuery,
DeployInstanceAsync_FreshFirstTimeDeploy_SkipsSiteQuery,
DeployInstanceAsync_PriorInProgressRecord_QueryFails_FallsThroughToDeploy.
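The reconciliation rules in this resolution reduce to a small decision function. The sketch below captures only that decision; the PriorRecord and Decision enums, the Decide name, and the querySiteHash delegate are hypothetical, and the real TryReconcileWithSiteAsync also performs persistence and auditing.

```csharp
using System;

// Hypothetical summary of the most recent deployment record for the instance.
public enum PriorRecord { None, Success, InProgress, FailedTimeout, FailedOther }

public enum Decision { DeployWithoutQuery, MarkPriorSuccess, DeployAfterQuery }

public static class ReconcileSketch
{
    public static Decision Decide(PriorRecord prior, string targetHash, Func<string> querySiteHash)
    {
        // The site is queried only when the prior outcome is unknown:
        // still InProgress, or Failed because of a timeout.
        bool shouldQuery = prior == PriorRecord.InProgress || prior == PriorRecord.FailedTimeout;
        if (!shouldQuery) return Decision.DeployWithoutQuery;

        string siteHash;
        try
        {
            siteHash = querySiteHash();
        }
        catch (Exception)
        {
            // Query failed: fall through to a normal deploy and rely on
            // site-side stale-rejection.
            return Decision.DeployAfterQuery;
        }

        // Site already runs the target revision: no need to re-send.
        return string.Equals(siteHash, targetHash, StringComparison.Ordinal)
            ? Decision.MarkPriorSuccess
            : Decision.DeployAfterQuery;
    }
}
```

Each branch corresponds to one of the DeployInstanceAsync_* regression tests listed above (prior Success and first-time deploys skip the query; matching hash marks success; differing hash or a failed query proceeds with the deploy).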
DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406 |
Description
The design ("Diff View" and "Dependencies" sections) states the Deployment
Manager can request a diff from the Template Engine showing added/removed
members, changed values, and connection-binding changes.
GetDeploymentComparisonAsync and DeploymentComparisonResult only compare two
revision hashes and return a boolean IsStale plus the two hashes. No
added/removed/changed detail is produced, and the Template Engine's diff
capability is not invoked. The UI cannot render a meaningful diff from this
result.
Verification: Confirmed against source. The Template Engine already provides
DiffService + ConfigurationDiff (structured Added/Removed/Changed entries
for attributes, alarms, and scripts, including data connection binding fields),
and DiffService is DI-registered — it was simply never wired into the
Deployment Manager's comparison path.
Recommendation
Either implement a real diff (deserialize the stored
DeployedConfigSnapshot.ConfigurationJson and the freshly flattened config and
invoke the Template Engine's diff service, surfacing structured
added/removed/changed entries), or revise the design doc to scope the feature
down to staleness detection only.
Resolution
Resolved 2026-05-16 (commit pending): GetDeploymentComparisonAsync now
deserializes the stored DeployedConfigSnapshot.ConfigurationJson and runs the
Template Engine DiffService against the freshly flattened current
configuration, attaching the resulting ConfigurationDiff (added/removed/changed
attributes, alarms, scripts) to a new optional Diff property on
DeploymentComparisonResult. DiffService is injected into DeploymentService.
A snapshot that cannot be deserialized (corrupt / older schema) still yields the
hash-based staleness result with a null diff, logged at warning level.
Regression test: GetDeploymentComparisonAsync_ProducesStructuredDiff.
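To show what a structured diff adds over a bare hash comparison, the sketch below computes added/removed/changed keys between two flat key-value configurations. It is a generic stand-in, not the Template Engine's DiffService or ConfigurationDiff: SimpleDiff, DiffSketch, and the flat dictionary shape are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified diff result (not the Template Engine's ConfigurationDiff).
public sealed class SimpleDiff
{
    public List<string> Added { get; } = new List<string>();
    public List<string> Removed { get; } = new List<string>();
    public List<string> Changed { get; } = new List<string>();

    // Any entry at all means the deployed config no longer matches.
    public bool IsStale
    {
        get { return Added.Count + Removed.Count + Changed.Count > 0; }
    }
}

public static class DiffSketch
{
    public static SimpleDiff Compare(
        IReadOnlyDictionary<string, string> deployed,
        IReadOnlyDictionary<string, string> current)
    {
        var diff = new SimpleDiff();

        foreach (var pair in current)
        {
            string oldValue;
            if (!deployed.TryGetValue(pair.Key, out oldValue))
                diff.Added.Add(pair.Key);
            else if (!string.Equals(oldValue, pair.Value, StringComparison.Ordinal))
                diff.Changed.Add(pair.Key);
        }

        foreach (var key in deployed.Keys)
        {
            if (!current.ContainsKey(key))
                diff.Removed.Add(key);
        }

        return diff;
    }
}
```

A hash comparison alone would only yield the equivalent of IsStale; the three lists are what let a UI render which members were added, removed, or changed.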
DeploymentManager-008 — DeploymentManagerOptions is never bound to configuration
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14 |
Description
AddDeploymentManager registers the services but never calls
services.Configure<DeploymentManagerOptions>(configuration.GetSection(...)).
IOptions<DeploymentManagerOptions> therefore always resolves to a
default-constructed instance — the operation-lock and artifact-deployment
timeouts cannot be tuned via appsettings.json, contrary to the CLAUDE.md
convention "Per-component configuration via appsettings.json sections bound
to options classes (Options pattern)." Host/Program.cs binds
SecurityOptions and InboundApiOptions from configuration sections but has
no equivalent for DeploymentManagerOptions.
Verification: Confirmed against source. Neither AddDeploymentManager nor
Host/Program.cs binds DeploymentManagerOptions.
Recommendation
Add an IConfiguration parameter (or a configure callback) to
AddDeploymentManager and bind DeploymentManagerOptions to a section such as
ScadaLink:DeploymentManager, consistent with the other components.
Resolution
Resolved 2026-05-16 (commit pending): AddDeploymentManager() now calls
services.AddOptions<DeploymentManagerOptions>() so IOptions<DeploymentManagerOptions>
is always resolvable, and Host/Program.cs binds the
ScadaLink:DeploymentManager section (exposed as
ServiceCollectionExtensions.OptionsSection) via
services.Configure<DeploymentManagerOptions>(...) — the same pattern the Host
uses for SecurityOptions/InboundApiOptions. An earlier attempt added an
AddDeploymentManager(IConfiguration) overload; that was reverted because the
project convention (enforced by Host.Tests.OptionsTests) forbids component
Add* methods from depending on IConfiguration — the Host owns
configuration binding. Regression tests:
AddDeploymentManager_RegistersResolvableOptions_WithDefaults,
AddDeploymentManager_OptionsBindToConfigurationSection_AsTheHostWires,
OptionsSection_MatchesTheConventionalComponentSectionPath.
DeploymentManager-009 — Misleading timeout comment on DeleteInstanceAsync
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:288 |
Description
The XML doc says "Delete fails if site unreachable (30s timeout via
CommunicationOptions)." The actual delete timeout is whatever
CommunicationOptions.LifecycleTimeout is configured to (passed inside
CommunicationService.DeleteInstanceAsync); the "30s" figure is hard-coded
into the comment and not derived from any constant in this module. If
LifecycleTimeout is reconfigured, the comment becomes wrong. It also wrongly
implies the value lives in this module.
Verification: Confirmed against source. The DeleteInstanceAsync XML doc
quoted a hard-coded "30s" value.
Recommendation
Reword to "Delete fails if the site is unreachable within
CommunicationOptions.LifecycleTimeout" without quoting a specific number.
Resolution
Resolved 2026-05-16 (commit pending): the DeleteInstanceAsync XML doc no
longer quotes a hard-coded "30s" — it now states delete fails if the site is
unreachable within CommunicationOptions.LifecycleTimeout (and notes the
deadline is applied inside CommunicationService.DeleteInstanceAsync).
Documentation-only change; no regression test (a test asserting comment text
would be meaningless).
DeploymentManager-010 — SystemArtifactDeploymentRecord does not persist the deployment ID
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:136,194-211 |
Description
DeployToAllSitesAsync generates a deploymentId (line 136) and returns it in
the ArtifactDeploymentSummary and audit log, but the persisted
SystemArtifactDeploymentRecord has no field for it (the entity only has Id,
ArtifactType, DeployedBy, DeployedAt, PerSiteStatus). The deployment ID
that appears in the UI summary and audit log cannot be correlated back to the
stored record. Additionally each per-site DeployArtifactsCommand carries its
own separate GUID (BuildDeployArtifactsCommandAsync line 114), so there are in
fact N+1 unrelated IDs for one logical artifact deployment.
Verification: Confirmed against source. Each per-site command minted its own GUID and the persisted record had no way to reference the logical id.
Recommendation
Add a DeploymentId column to SystemArtifactDeploymentRecord and store the
single logical deploymentId; reuse that ID (or a derived per-site ID) for the
per-site commands so the audit log, UI summary, and persisted record agree.
Resolution
Resolved 2026-05-16 (commit pending): BuildDeployArtifactsCommandAsync now
accepts an optional deploymentId, and DeployToAllSitesAsync passes the one
logical deploymentId to every per-site command — so the per-site commands,
the audit log, and the UI summary all reference a single id instead of N+1
unrelated GUIDs (RetryForSiteAsync, an independent single-site retry, still
mints its own id). Adding a dedicated DeploymentId column to
SystemArtifactDeploymentRecord was deliberately not done: that entity
lives in ScadaLink.Commons with its EF mapping in
ScadaLink.ConfigurationDatabase, both outside this module's edit scope.
Instead the logical deploymentId is embedded in the record's free-form
PerSiteStatus JSON payload ({ DeploymentId, Sites }), which is fully within
this module's control, so the persisted record is correlatable with the
summary/audit. A follow-up to promote it to a first-class column should be
filed against Commons/ConfigurationDatabase if a queryable index is needed.
Regression tests: DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId,
DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix,
RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits.
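The embedded-id approach can be sketched with System.Text.Json. Only the top-level { DeploymentId, Sites } shape comes from the resolution above; the shape of Sites (site name mapped to a status string) and the helper names are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class PerSiteStatusPayloadSketch
{
    // Serializes the logical deployment id alongside the per-site results.
    // The Dictionary<string, string> shape for Sites is an assumption.
    public static string Build(string deploymentId, Dictionary<string, string> sites)
    {
        return JsonSerializer.Serialize(new { DeploymentId = deploymentId, Sites = sites });
    }

    // Reads the logical deployment id back out of a stored payload, so a
    // persisted record can be correlated with the summary and audit log.
    public static string ReadDeploymentId(string perSiteStatusJson)
    {
        using (JsonDocument doc = JsonDocument.Parse(perSiteStatusJson))
        {
            return doc.RootElement.GetProperty("DeploymentId").GetString();
        }
    }
}
```

The trade-off noted in the resolution still applies: an id inside a JSON payload supports correlation by reading the record, but it is not a queryable, indexable column.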
DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199 |
Description
DeploymentServiceTests never sets the CommunicationService actor, so every
deploy/lifecycle test deliberately stops at the InvalidOperationException
thrown by GetCommunicationActor() (see lines 118-125, 147). As a result there
is no test covering: a successful deployment (DeploymentStatus.Success
response → instance state set to Enabled, snapshot stored, audit logged); a
failed-but-handled site response; the InProgress-stuck bug
(DeploymentManager-001); successful Disable/Enable/Delete; or the operation
lock actually serializing two concurrent deploys of the same instance. The
critical post-response branch (DeploymentService.cs:154-184) and the entire
delete/disable/enable success path are untested. The AuditLogs test
(lines 277-289) asserts nothing.
Verification: Partially confirmed. By the time this finding was being
resolved, the DeploymentManager-006 fix had already introduced a TestKit-actor
seam (CreateServiceWithCommActor + ReconcileProbeActor) and successful-deploy
tests. The genuinely-still-missing coverage was: successful Disable/Enable/Delete
paths, per-instance lock serialization during deploy, and the assertionless
AuditLogs test — those gaps were addressed.
Recommendation
Introduce a seam to inject a fake/substitute communication path (e.g. an
interface over CommunicationService, or wire a TestKit actor) so success and
handled-failure paths can be unit tested. Add tests for the stuck-InProgress
scenario and for per-instance lock contention during deploy. Make the audit
test assert on IAuditService.LogAsync.
Resolution
Resolved 2026-05-16 (commit pending): extended the TestKit-actor seam
(ReconcileProbeActor now also answers lifecycle commands) and added the
missing coverage — successful Disable/Enable/Delete (state transition + audit
assertions), a successful-deploy audit assertion, and per-instance lock
serialization via a new deferred-reply SerializationProbeActor that asserts a
single instance's concurrent deploys never overlap. The assertionless AuditLogs
test was replaced with DeployInstanceAsync_FlatteningFails_DoesNotReachAudit,
which asserts on IAuditService.LogAsync. Regression tests:
DisableInstanceAsync_SiteSucceeds_SetsDisabledStateAndAudits,
EnableInstanceAsync_SiteSucceeds_SetsEnabledStateAndAudits,
DeleteInstanceAsync_SiteSucceeds_RemovesRecordAndAudits,
DeployInstanceAsync_SiteSucceeds_WritesDeployAuditEntry,
DeployInstanceAsync_FlatteningFails_DoesNotReachAudit,
DeployInstanceAsync_SameInstance_OperationLockSerializesConcurrentDeploys.
DeploymentManager-012 — LifecycleCommandTimeout option is dead code
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentManagerOptions.cs:8-9 |
Description
DeploymentManagerOptions.LifecycleCommandTimeout is declared with a 30s
default and an XML doc, but it is never read anywhere in the codebase
(lifecycle commands rely on CommunicationOptions.LifecycleTimeout inside
CommunicationService). The option misleads readers into thinking it controls
disable/enable/delete timeouts, when setting it has no effect.
Verification: Confirmed against source. A repo-wide grep found exactly one
occurrence of LifecycleCommandTimeout — the declaration itself.
Recommendation
Remove LifecycleCommandTimeout, or actually thread it through to the
lifecycle command calls (e.g. by creating a linked CTS with this timeout in
DisableInstanceAsync/EnableInstanceAsync/DeleteInstanceAsync, the way
ArtifactDeploymentTimeoutPerSite is used).
Resolution
Resolved 2026-05-16 (commit pending): LifecycleCommandTimeout is now actually
threaded through (the option exists for tuning, so it was wired up rather than
deleted). DisableInstanceAsync/EnableInstanceAsync/DeleteInstanceAsync
each create a linked CancellationTokenSource with
CancelAfter(_options.LifecycleCommandTimeout) (the same pattern
ArtifactDeploymentService uses for ArtifactDeploymentTimeoutPerSite) and pass
its token to the
CommunicationService call. Each method now catches the resulting
TimeoutException/OperationCanceledException, logs a warning, and returns a
Result.Failure (previously an AskTimeoutException from a hung site escaped
uncaught). The option's XML doc was corrected to describe the real behaviour.
Regression test:
DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait
(asserts a 300 ms LifecycleCommandTimeout bounds the wait far below the 30 s
CommunicationOptions.LifecycleTimeout; confirmed to fail before the fix —
the call hung the full 30 s and threw AskTimeoutException).
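The linked-CTS pattern described in this resolution can be sketched roughly as follows. This is an illustration under assumptions, not the module's code: `LifecycleTimeoutSketch`, `RunWithTimeoutAsync`, and the tuple result are hypothetical, and the `sendToSite` delegate stands in for the CommunicationService call.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LifecycleTimeoutSketch
{
    // Runs a site call under a deadline and converts a timeout or cancellation
    // into a failure result instead of letting the exception escape.
    public static async Task<(bool Ok, string? Error)> RunWithTimeoutAsync(
        Func<CancellationToken, Task> sendToSite,   // stand-in for the CommunicationService call
        TimeSpan lifecycleCommandTimeout,
        CancellationToken callerToken = default)
    {
        // Linked source: cancels when the caller cancels OR when the deadline elapses.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(lifecycleCommandTimeout);
        try
        {
            await sendToSite(cts.Token);
            return (true, null);
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            return (false, $"Lifecycle command timed out or was cancelled: {ex.Message}");
        }
    }
}
```

A call such as `RunWithTimeoutAsync(ct => Task.Delay(Timeout.Infinite, ct), TimeSpan.FromMilliseconds(300))` returns a failure once the deadline passes, which is the behaviour the regression test pins for an unresponsive site.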
DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites
| Severity | Low |
| Category | Security |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:108-111 |
Description
BuildDeployArtifactsCommandAsync maps smtp.Credentials directly into
SmtpConfigurationArtifact and that command is sent to every site. Distributing
SMTP credentials to sites is consistent with the design (SMTP configuration is
a deployable artifact), but the credentials travel inside a serialized command
across the inter-cluster transport and are stored on each site's SQLite. There
is no indication the value is encrypted at rest on the site or scrubbed from
logs. Worth confirming the transport is TLS-protected and the site stores the
credential securely; at minimum this should be a conscious, documented decision.
Recommendation
Confirm inter-cluster transport encryption covers artifact commands, ensure
Credentials is never written to logs, and document the at-rest protection of
SMTP credentials on site SQLite. Consider encrypting the credential field
within the artifact payload.
Verification (2026-05-16): Re-triaged against source. The DeploymentManager
side is clean: ArtifactDeploymentService maps SmtpConfiguration.Credentials
into the artifact (which the design explicitly mandates — SMTP configuration is
a deployable artifact) and never logs it — the three log statements in
DeployToAllSitesAsync only reference SiteId, SiteName, DeploymentId, and
ex.Message, never the credential. There is no defect to fix purely within
src/ScadaLink.DeploymentManager. The finding's remaining recommendations are
all cross-module and one needs a design decision:
- inter-cluster transport TLS: ScadaLink.Communication / ScadaLink.ClusterInfrastructure (Akka remoting + ClusterClient config);
- at-rest encryption of the credential on site SQLite: ScadaLink.SiteRuntime artifact store;
- encrypting the credential field inside the artifact payload: needs the SmtpConfigurationArtifact shape in ScadaLink.Commons plus cooperating producer (DeploymentManager) and consumer (SiteRuntime) changes, and a key-management design decision (where the encryption key lives, how it is distributed to sites) that cannot be made unilaterally here.
Status: Open — flagged. No purely-DeploymentManager fix exists; the work crosses Communication / SiteRuntime / Commons and requires a key-management design decision. Severity confirmed Low: with TLS-protected inter-cluster transport (a separate, assumed-in-place control) and no logging leak, this is a hardening item, not an active leak.
Resolution
Resolved 2026-05-16 (commit <pending>). Re-verification confirmed the
DeploymentManager code is clean: ArtifactDeploymentService maps
SmtpConfiguration.Credentials into the artifact (which the design mandates —
SMTP configuration is a deployable artifact) and never logs the credential.
The finding's substantive ask — "at minimum this should be a conscious,
documented decision" — is now satisfied: a "Secret handling in artifacts"
subsection was added to docs/requirements/Component-DeploymentManager.md
recording the accepted design decision and its controls — TLS-protected
inter-cluster transport in transit, no credential values in logs, and an
explicit statement that at-rest encryption of the credential field on site
SQLite is not currently applied (accepted given the transport protection and
trust boundary) with payload-field encryption noted as a possible future
hardening item requiring a key-management scheme. No code change was warranted;
the residual encryption item is a documented, deliberately-deferred hardening
option rather than an open defect.
DeploymentManager-014 — Dead CreateCommand helper in artifact tests
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90 |
Description
The private static CreateCommand() helper is never referenced by any test in
the file. It is dead code that suggests an intended test (e.g. a successful
multi-site artifact deployment) was never written — coverage of
DeployToAllSitesAsync is limited to the no-sites failure case, and
RetryForSiteAsync and BuildDeployArtifactsCommandAsync have no tests at all.
Verification: Confirmed against source. The CreateCommand() helper had no
callers, and DeployToAllSitesAsync/RetryForSiteAsync only had the no-sites
failure case.
Recommendation
Either remove the unused helper or, preferably, write the missing tests for
DeployToAllSitesAsync (per-site success/failure matrix, partial failure) and
RetryForSiteAsync using it.
Resolution
Resolved 2026-05-16 (commit pending): took the recommendation's preferred
option — removed the dead CreateCommand() helper and wrote the missing
coverage instead. ArtifactDeploymentServiceTests now extends TestKit and
uses a stand-in ArtifactProbeActor (records the DeployArtifactsCommands it
receives, replies success or, for a configured failure set, failure) so
DeployToAllSitesAsync and RetryForSiteAsync are exercised end-to-end past
the communication boundary. New tests:
DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId (also
covers DeploymentManager-010), DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix
(per-site success/failure matrix), RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits.
DeploymentManager-015 — Site-query reconciliation marks a deployment Success but skips instance-state and snapshot updates
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:631-655 |
Description
TryReconcileWithSiteAsync (the DeploymentManager-006 query-before-redeploy
path) handles the case where a prior InProgress/timeout-Failed record exists
and the site reports it already has the target revision hash. In that case it
marks the prior DeploymentRecord Success, audit-logs DeployReconciled, and
returns it — the caller then returns Result.Success and never enters the
normal deploy body.
The normal success path (DeployInstanceAsync, DeploymentService.cs:215-223) does three things on
a successful site response: writes the deployment record terminal status, sets
instance.State = InstanceState.Enabled + UpdateInstanceAsync, and calls
StoreDeployedSnapshotAsync. The reconciliation shortcut performs only the
first. Consequently, after a reconciled deployment:
- The instance State is left at whatever it was (e.g. NotDeployed for a first-time deploy that timed out, or Disabled) even though the site is actually running the configuration. The central state machine and the site diverge, and a subsequent DisableInstanceAsync/EnableInstanceAsync will be rejected or allowed incorrectly by StateTransitionValidator.
- No DeployedConfigSnapshot is created or refreshed. A first-time deploy that is resolved purely by reconciliation leaves GetDeploymentComparisonAsync permanently returning "No deployed snapshot found for this instance.", and a redeploy reconciliation leaves the stored snapshot showing the old config even though the deployment record claims Success for the new revision.
The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the deployed snapshot and instance state to reflect the last successful deployment; the reconciliation path silently breaks both invariants.
Recommendation
In the reconciled-success branch of TryReconcileWithSiteAsync, perform the
same post-success side effects as the normal path: set instance.State = InstanceState.Enabled (+ UpdateInstanceAsync) and call
StoreDeployedSnapshotAsync with the target deployment ID / revision hash /
config JSON. Factor the shared post-success logic into one helper so the normal
and reconciliation paths cannot drift. Add a regression test asserting that a
reconciled deployment leaves the instance Enabled and a snapshot stored.
Resolution
Resolved 2026-05-17 (commit pending): extracted the shared post-success side
effects into ApplyPostSuccessSideEffectsAsync (sets instance State = Enabled + UpdateInstanceAsync, stores/refreshes the DeployedConfigSnapshot)
and invoked it from both the normal deploy success path and the
TryReconcileWithSiteAsync reconciled-success branch, so a reconciled
deployment now performs the same instance-state and snapshot updates as a
normal one (configJson is now computed before the reconciliation call and
threaded into TryReconcileWithSiteAsync). Regression test:
DeployInstanceAsync_Reconciled_SetsInstanceEnabledAndStoresSnapshot.
DeploymentManager-016 — Reconciled prior record keeps its stale RevisionHash
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:639-651 |
Description
When TryReconcileWithSiteAsync reconciles a prior record, it mutates
prior.Status, prior.ErrorMessage, and prior.CompletedAt, but not
prior.RevisionHash. The reconciliation condition only compares the site's
AppliedRevisionHash against the freshly-flattened targetRevisionHash — it
does not require prior.RevisionHash to equal either of them.
The prior record can legitimately carry a different revision hash than the
current target: e.g. a deploy timed out at revision R1, the template was then
edited so the current flatten yields R2, and meanwhile the site actually
applied R2 through some other path (or R1 and R2 are equal-by-content but
the prior record predates a hash recompute). After reconciliation the record's
Status is Success but its RevisionHash still says R1, so staleness
checks and any UI that reads DeploymentRecord.RevisionHash will report the
instance as deployed at the wrong revision. The audit DeployReconciled entry
records RevisionHash = targetRevisionHash, contradicting the persisted record.
Recommendation
In the reconciled-success branch, also set prior.RevisionHash = targetRevisionHash so the persisted record, the audit entry, and the site's
actual applied revision all agree. Alternatively, only reconcile when
prior.RevisionHash == targetRevisionHash and otherwise fall through to a
normal deploy.
Resolution
Resolved 2026-05-17 (commit pending): the reconciled-success branch of
TryReconcileWithSiteAsync now also sets prior.RevisionHash = targetRevisionHash, so the persisted record, the DeployReconciled audit
entry, and the site's actually-applied revision all agree. Regression test:
DeployInstanceAsync_Reconciled_PriorRecordRevisionHashUpdatedToTarget.
DeploymentManager-017 — GetDeploymentStatusAsync XML doc describes behaviour it does not implement
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:562-570 |
Description
The XML summary on GetDeploymentStatusAsync reads: "WP-2: After
failover/timeout, query site for current deployment state before
re-deploying." The method body does no such thing — it is a one-line
pass-through to _repository.GetDeploymentByDeploymentIdAsync, a pure local DB
read. The query-the-site-before-redeploy behaviour the comment describes was
implemented separately in TryReconcileWithSiteAsync (DeploymentManager-006).
The stale comment is a leftover of the original design intent and misleads a
reader into thinking this method contacts the site.
Recommendation
Reword the summary to describe what the method actually does — "returns the
current persisted DeploymentRecord for the given deployment ID from the
configuration database" — and, if useful, cross-reference
TryReconcileWithSiteAsync as the place the site-query reconciliation lives.
Resolution
Resolved 2026-05-17 (commit pending): the GetDeploymentStatusAsync XML doc
now states it returns the persisted DeploymentRecord from the configuration
database as a pure local read, and cross-references TryReconcileWithSiteAsync
as where the query-the-site-before-redeploy reconciliation actually lives.
Documentation-only change; no regression test (a test asserting comment text
would be meaningless).
DeploymentManager-018 — Reconciliation force-sets Enabled, overwriting an intentional Disabled after central failover
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:675-682,721-748 |
Resolution — Added a forceEnabledState parameter to ApplyPostSuccessSideEffectsAsync. The normal deploy path passes true (fresh apply legitimately ends in Enabled); the reconciliation path passes false, so the helper only promotes NotDeployed → Enabled and leaves an existing Disabled (or Enabled) untouched. Regression test DeployInstanceAsync_Reconciled_DisabledInstance_PreservesDisabledState exercises the failover scenario and asserts the prior record still flips to Success while Instance.State stays Disabled.
Description
TryReconcileWithSiteAsync calls ApplyPostSuccessSideEffectsAsync whenever
the site reports it has the target revision hash, and that helper
unconditionally writes instance.State = InstanceState.Enabled. The
reconciliation shortcut only runs when the prior DeploymentRecord is
InProgress or timeout-Failed — exactly the scenarios that survive a central
failover (the in-memory OperationLockManager is lost on failover, by design:
"Lost on central failover (acceptable per design — in-progress treated as
failed)").
After such a failover, the per-instance operation lock is gone but the
deployment record is still InProgress in the DB. A user can legitimately
issue DisableInstanceAsync for the same instance — there is nothing in
DisableInstanceAsync that consults the deployment record, only the
StateTransitionValidator over Instance.State. If the state is Enabled
(the typical case when the deploy started), the disable proceeds, the site
honours it (the design states a disabled instance retains its deployed
configuration), and central now persists Instance.State = Disabled. The
deployment-record row remains InProgress (no one transitioned it). Later the
user retries the deploy: TryReconcileWithSiteAsync runs, the site still has
the target revision hash (Disable doesn't change the deployed config), the
prior record is marked Success, and ApplyPostSuccessSideEffectsAsync writes
Instance.State = Enabled — silently overriding the user's explicit Disable.
The same trap exists for any direct DB edit / migration that flipped the state
between the timed-out deploy and the redeploy. The normal deploy path can
defensibly assume Enabled after a fresh successful apply, but the
reconciliation path is reconciling prior state with prior user intent; it
should preserve Disabled if that is the current Instance.State at the time
of reconciliation, mirroring the design's separation between deploy (config
apply) and disable (subscription/script lifecycle).
Recommendation
In the reconciliation branch, do not force Enabled. Either:
- Pass a flag/parameter to ApplyPostSuccessSideEffectsAsync telling it whether to touch state, and skip the state write on the reconciliation path (leaving the current Instance.State intact, which is already Enabled for a fresh deploy that timed out and Disabled for the user-disabled follow-up case); or
- Only set Enabled when the current Instance.State is NotDeployed (i.e. the first-deploy timed-out case), and leave existing Enabled/Disabled alone.
Add a regression test where an instance with Instance.State = Disabled and a
prior InProgress deployment record is reconciled — the resulting
Instance.State must remain Disabled, and the deployment record must still
be marked Success.
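The state rule adopted in the resolution (force Enabled on a fresh deploy; on reconciliation promote only NotDeployed to Enabled) can be sketched as a pure function. The enum member names come from this finding; the local `InstanceState` enum and the `PostSuccessState.Resolve` helper are hypothetical stand-ins for the real types and for the logic inside ApplyPostSuccessSideEffectsAsync.

```csharp
// Local stand-in for the module's instance state enum (member names taken from the finding).
public enum InstanceState { NotDeployed, Enabled, Disabled }

public static class PostSuccessState
{
    // forceEnabledState == true  : normal deploy path, a fresh apply always ends in Enabled.
    // forceEnabledState == false : reconciliation path, only a first-time (NotDeployed)
    //                              instance is promoted; Enabled/Disabled are preserved.
    public static InstanceState Resolve(InstanceState current, bool forceEnabledState)
    {
        if (forceEnabledState) return InstanceState.Enabled;
        return current == InstanceState.NotDeployed ? InstanceState.Enabled : current;
    }
}
```

With this shape, `Resolve(InstanceState.Disabled, forceEnabledState: false)` leaves the operator's explicit Disable in place while the deployment record can still be marked Success independently.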
DeploymentManager-019 — Lifecycle command timeout writes no audit entry
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458 |
Resolution (2026-05-28): added TryLogLifecycleTimeoutAsync, a private
helper that mirrors the DeployFailed pattern — it calls _auditService.LogAsync
with CancellationToken.None (so the operator's already-cancelled outer
token cannot also prevent the audit write) and stamps the row with the
<Action>TimedOut action name (DisableTimedOut / EnableTimedOut /
DeleteTimedOut), the command id, the configured deadline, and the captured
exception message. Each of DisableInstanceAsync / EnableInstanceAsync /
DeleteInstanceAsync invokes the helper from its
catch (TimeoutException or OperationCanceledException) block before
returning the failure Result. The helper itself try/catches around the
audit write so a failed audit pipeline does not mask the underlying timeout
for the caller — it only logs at Warning. Regression tests
DisableInstanceAsync_LifecycleTimeout_WritesDisableTimedOutAuditEntry,
EnableInstanceAsync_LifecycleTimeout_WritesEnableTimedOutAuditEntry, and
DeleteInstanceAsync_LifecycleTimeout_WritesDeleteTimedOutAuditEntry use the
existing SilentProbeActor to keep the site unresponsive, configure a 300 ms
LifecycleCommandTimeout to bound the wait, and assert the audit log
received the corresponding <Action>TimedOut entry exactly once.
Description
DisableInstanceAsync, EnableInstanceAsync, and DeleteInstanceAsync each
wrap the CommunicationService call in a linked CTS with
LifecycleCommandTimeout (DeploymentManager-012). On timeout they log a
warning and return Result<...>.Failure(...) — and skip the
_auditService.LogAsync call entirely. As a result, an operator-initiated
disable/enable/delete that times out at the site leaves no audit trail:
the user, the timestamp, the command id, and the failure mode are not
recorded in the audit log. The deploy path goes out of its way to write a
DeployFailed audit entry on the same failure mode
(DeploymentService.cs:274-276), with CancellationToken.None so the write is
durable; the lifecycle commands do not.
The design lists audit logging as a Deployment Manager responsibility for "all deployment actions, system-wide artifact deployments, and instance lifecycle changes" — a timed-out lifecycle command is an attempted lifecycle change, and the operator action is exactly the kind of event the audit log exists to record.
Recommendation
In each of the three catch (Exception ex) when (ex is TimeoutException or OperationCanceledException) blocks, write a DisableTimeout/EnableTimeout/
DeleteTimeout (or use the existing operation name with a failure flag)
audit entry with CancellationToken.None so a cancelled outer token does not
prevent the audit write, mirroring DeployFailed. Add a unit test asserting
that DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait
also produces an audit entry.
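A rough sketch of a best-effort timeout audit helper along the lines of TryLogLifecycleTimeoutAsync. The delegate-based signature, the detail string format, and the `LifecycleTimeoutAudit` class are assumptions for illustration; only the `<Action>TimedOut` naming, the use of CancellationToken.None, and the swallow-and-warn behaviour come from the resolution.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LifecycleTimeoutAudit
{
    // Writes an "<operation>TimedOut" audit row. Returns false (after warning)
    // if the audit pipeline itself fails, so the original timeout is never masked.
    public static async Task<bool> TryLogTimeoutAsync(
        Func<string, string, string, CancellationToken, Task> logAsync, // stand-in for the audit service
        string user, string operation, string commandId, TimeSpan deadline, Exception cause,
        Action<string>? warn = null)
    {
        try
        {
            var detail = $"commandId={commandId}; deadline={deadline}; error={cause.Message}";
            // CancellationToken.None: the caller's token may already be cancelled,
            // and that must not prevent the audit write.
            await logAsync(user, operation + "TimedOut", detail, CancellationToken.None);
            return true;
        }
        catch (Exception ex)
        {
            warn?.Invoke($"Audit write for {operation} timeout failed: {ex.Message}");
            return false;
        }
    }
}
```

The helper would be called from each catch block before the failure Result is returned, so the operator's attempted lifecycle change is recorded even when the site never answers.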
DeploymentManager-020 — DeployReconciled audit attributes the action to the prior deployer, not the current user
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:698-712 |
Description
In TryReconcileWithSiteAsync the audit call is:
await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)
prior.DeployedBy is the user who issued the original (timed-out / stuck)
deployment, not the user parameter passed into DeployInstanceAsync. The
current user — the one who triggered the redeploy that produced the
reconciliation — is dropped on the floor. For audit forensics this is
misleading: the row will read "user A reconciled their own deployment"
when in fact user B initiated the action that reconciled it.
The original deployer is interesting context, but it should be carried in the
audit-detail object (where DeploymentId and RevisionHash already live), not
substituted for the actor.
Recommendation
Use user (the parameter on DeployInstanceAsync, threaded through
TryReconcileWithSiteAsync) as the audit actor, and include
OriginalDeployer = prior.DeployedBy in the detail object so the original
attribution is preserved without misrepresenting who took the action.
Resolution (2026-05-28): Threaded the user parameter from
DeployInstanceAsync into TryReconcileWithSiteAsync as a new currentUser
argument (consistent with the DeploymentManager-018 forceEnabledState
parameter-threading pattern) and rewrote the audit call to log
currentUser as the actor with OriginalDeployer = prior.DeployedBy carried
in the detail object. Added test
DeployInstanceAsync_Reconciled_AuditAttributesCurrentUserNotPriorDeployer
that pins the new attribution and asserts the prior deployer is no longer used
as the actor. Tests green (80/80 in DeploymentManager.Tests).
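The corrected attribution can be sketched as below. The `AuditEntry` record and `ReconcileAudit.Build` helper are hypothetical; the "DeployReconciled" action name and the DeploymentId, RevisionHash, and OriginalDeployer detail fields are the ones named in this finding.

```csharp
using System.Collections.Generic;

// Hypothetical audit row: actor, action name, and a free-form detail map.
public sealed record AuditEntry(string Actor, string Action, IReadOnlyDictionary<string, string> Detail);

public static class ReconcileAudit
{
    // The user who triggered the redeploy is the actor; the original deployer
    // is preserved as context in the detail object rather than as the actor.
    public static AuditEntry Build(
        string currentUser, string priorDeployedBy, string deploymentId, string targetRevisionHash)
        => new AuditEntry(currentUser, "DeployReconciled", new Dictionary<string, string>
        {
            ["DeploymentId"] = deploymentId,
            ["RevisionHash"] = targetRevisionHash,
            ["OriginalDeployer"] = priorDeployedBy,
        });
}
```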
DeploymentManager-021 — ResolveSiteIdentifierAsync silently substitutes the DB id when the site row is missing
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:107-111 |
Resolution (2026-05-28): ResolveSiteIdentifierAsync now throws InvalidOperationException ("Site with ID {siteId} not found; cannot resolve its SiteIdentifier for routing.") when the Site row is missing, instead of returning the numeric id rendered as a string. The deploy path's existing try/catch turns the throw into a DeploymentStatus.Failed record carrying the descriptive message (the DeploymentManager-001/-002 cleanup writes the failure with CancellationToken.None); the lifecycle paths (Disable/Enable/Delete) propagate the exception, so the CLI/UI caller surfaces the actual cause to the operator instead of a confusing downstream "unknown site" routing error. The repository contract already returned Site?, so the null path is now handled explicitly at the call site instead of being silently papered over.
Description
private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
{
var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
return site?.SiteIdentifier ?? siteId.ToString();
}
If the Site row is missing (FK was deleted, race with admin delete, DB
inconsistency), the method silently returns the numeric DB id rendered as a
string. This is then passed to CommunicationService.{Deploy,Disable,Enable,Delete}InstanceAsync and QueryDeploymentStateAsync as if it were a real
SiteIdentifier (e.g. "site-a"). The communication layer will fail with an
"unknown site" or routing error, producing a confusing diagnostic that hides
the actual problem (no site row).
This is a defensive concern, but every mutating operation in the module goes through this method, so a stale instance whose site was deleted will produce a misleading error every time it is touched.
Recommendation
Treat a missing site as a hard validation failure: return a
Result.Failure($"Site with ID {siteId} not found") early from the calling
operations, instead of fabricating an identifier. The repository already
returns Site?, so the null path is type-visible; just don't paper over it.
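A minimal sketch of the fail-loud variant, assuming a delegate in place of the site repository. The `Site` record and `SiteIdentifierResolver` class are hypothetical stand-ins; the exception type and message follow the resolution text for this finding.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal stand-in for the Site entity; only the fields needed here.
public sealed record Site(int Id, string SiteIdentifier);

public static class SiteIdentifierResolver
{
    // Throws instead of fabricating an identifier from the numeric DB id.
    public static async Task<string> ResolveAsync(
        Func<int, CancellationToken, Task<Site?>> getSiteById, // stand-in for the repository lookup
        int siteId,
        CancellationToken cancellationToken = default)
    {
        var site = await getSiteById(siteId, cancellationToken);
        return site?.SiteIdentifier
            ?? throw new InvalidOperationException(
                $"Site with ID {siteId} not found; cannot resolve its SiteIdentifier for routing.");
    }
}
```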
DeploymentManager-022 — Pending and InProgress are written back-to-back with no intervening work
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/DeploymentService.cs:178-194 |
Resolution (2026-05-28): The transient Pending write was dropped — the deployment record is now created directly in DeploymentStatus.InProgress, which collapses the start of the deploy into a single AddDeploymentRecordAsync + SaveChangesAsync + NotifyStatusChange (instead of two writes back-to-back). The flattening, validation, and TryReconcileWithSiteAsync round-trip have all completed before the insert, and the deploy command is sent immediately after, so Pending carried no operational meaning between the two writes. InProgress retains its documented "sent to site, awaiting response" semantics. Eliminating the extra SaveChangesAsync round-trip also removes the Pending→InProgress flicker the CentralUI-006 deployment-status page used to render via the second IDeploymentStatusNotifier.NotifyStatusChanged invocation.
Description
DeployInstanceAsync does:
record.Status = Pending;
AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
record.Status = InProgress;
UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
There is no work between the two writes — flattening, validation, and
reconciliation have already completed by line 174. The deploy command is sent
immediately after the InProgress write. The Pending write therefore costs:
an extra SaveChangesAsync round-trip, an extra IDeploymentStatusNotifier
invocation (which the CentralUI-006 page renders, so the user briefly sees a
Pending flicker before InProgress), and an extra row-version bump if EF
optimistic concurrency is enabled on the table.
The design uses Pending to mean "queued, not yet sent" and InProgress to
mean "sent to site, awaiting response". The code's Pending slot has no
queuing — it is set and immediately overwritten — so the state buys nothing
operationally.
Recommendation
Either:
- Drop the Pending write entirely and create the record directly in InProgress (one row insert, one notification, simpler UI); or
- Move the Pending→InProgress transition to bracket actual queueing/work (e.g. set Pending before flattening + reconciliation, set InProgress immediately before DeployInstanceAsync on the comm service) so the two states carry distinguishable semantics worth a separate write.
DeploymentManager-023 — BuildDeployArtifactsCommandAsync re-queries system-wide artifacts once per site
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173 |
Resolution (2026-05-28): Hoisted the global artifact queries (shared scripts, external systems + methods, DB connections, notification lists, SMTP configurations) out of the per-site loop into a new private FetchGlobalArtifactsAsync that produces a GlobalArtifactSnapshot record. DeployToAllSitesAsync now calls it ONCE before the loop and threads the snapshot through a new prefetched-globals overload of BuildDeployArtifactsCommandAsync; the public single-site overload keeps the prior fetch-then-build behaviour for RetryForSiteAsync. Only the per-site data-connection query remains inside the loop. Regression tests DeployToAllSitesAsync_HoistsGlobalArtifactQueriesOutOfPerSiteLoop (three sites; pins exactly-one call to each global getter and one per-site call to GetDataConnectionsBySiteIdAsync) and RetryForSiteAsync_SingleSitePath_StillRunsTheGlobalQueriesOnce (single-site path still owns its own fetch).
Description
DeployToAllSitesAsync loops over sites and calls
BuildDeployArtifactsCommandAsync(site.Id, ...) for each one. Of the six
artifact sets the method gathers, only dataConnections is per-site:
- _templateRepo.GetAllSharedScriptsAsync: global.
- _externalSystemRepo.GetAllExternalSystemsAsync: global, plus GetMethodsByExternalSystemIdAsync per external system per site.
- _externalSystemRepo.GetAllDatabaseConnectionsAsync: global.
- _notificationRepo.GetAllNotificationListsAsync: global.
- _notificationRepo.GetAllSmtpConfigurationsAsync: global.
- _siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...): per-site.
With N sites this issues ≈ 5·N redundant queries on the global sets (plus M·N method queries, where M is the external-system count). On a hub-and-spoke deployment with many sites the artifact-deploy path is noticeably slower than necessary and pins DbContext usage longer than needed. Per CLAUDE.md, the DbContext is not thread-safe and the per-site commands are already built sequentially (good); the redundant queries are sequential too, but the network/round-trip cost is real.
Recommendation
Hoist the global queries (shared scripts, external systems + their methods,
DB connections, notification lists, SMTP configurations) out of
BuildDeployArtifactsCommandAsync, fetch them once in DeployToAllSitesAsync,
and pass them in alongside the site id (or expose a
BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals) overload).
RetryForSiteAsync (the single-site path) can keep the convenience-overload
behaviour. Add a test using NSubstitute's .Received() to assert
_templateRepo.GetAllSharedScriptsAsync is called exactly once for an
N-site deployment.
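
The hoisting described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: CountingRepo is a hypothetical stand-in for the real repositories, only two of the five global sets are modelled, and the method signatures are assumptions chosen to mirror the names used in this finding.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the real repositories; it only counts calls.
public sealed class CountingRepo
{
    public int SharedScriptCalls { get; private set; }
    public int NotificationListCalls { get; private set; }
    public int DataConnectionCalls { get; private set; }

    public Task<IReadOnlyList<string>> GetAllSharedScriptsAsync()
    {
        SharedScriptCalls++;
        return Task.FromResult<IReadOnlyList<string>>(new[] { "shared-script" });
    }

    public Task<IReadOnlyList<string>> GetAllNotificationListsAsync()
    {
        NotificationListCalls++;
        return Task.FromResult<IReadOnlyList<string>>(new[] { "ops-list" });
    }

    public Task<IReadOnlyList<string>> GetDataConnectionsBySiteIdAsync(int siteId)
    {
        DataConnectionCalls++;
        return Task.FromResult<IReadOnlyList<string>>(new[] { $"dc-site-{siteId}" });
    }
}

// Assumed shapes: the real records carry more artifact sets than this.
public sealed record GlobalArtifactSnapshot(
    IReadOnlyList<string> SharedScripts,
    IReadOnlyList<string> NotificationLists);

public sealed record DeployArtifactsCommand(
    int SiteId,
    GlobalArtifactSnapshot Globals,
    IReadOnlyList<string> DataConnections);

public sealed class ArtifactDeploymentSketch
{
    private readonly CountingRepo _repo;

    public ArtifactDeploymentSketch(CountingRepo repo) => _repo = repo;

    // Runs every global query exactly once per call.
    private async Task<GlobalArtifactSnapshot> FetchGlobalArtifactsAsync() =>
        new GlobalArtifactSnapshot(
            await _repo.GetAllSharedScriptsAsync(),
            await _repo.GetAllNotificationListsAsync());

    // Prefetched-globals overload: only the per-site query runs here.
    private async Task<DeployArtifactsCommand> BuildDeployArtifactsCommandAsync(
        int siteId, GlobalArtifactSnapshot globals) =>
        new DeployArtifactsCommand(
            siteId, globals, await _repo.GetDataConnectionsBySiteIdAsync(siteId));

    // Single-site convenience overload: fetch, then build.
    public async Task<DeployArtifactsCommand> BuildDeployArtifactsCommandAsync(int siteId) =>
        await BuildDeployArtifactsCommandAsync(siteId, await FetchGlobalArtifactsAsync());

    public async Task<IReadOnlyList<DeployArtifactsCommand>> DeployToAllSitesAsync(
        IEnumerable<int> siteIds)
    {
        var globals = await FetchGlobalArtifactsAsync(); // hoisted out of the loop
        var commands = new List<DeployArtifactsCommand>();
        foreach (var siteId in siteIds)
            commands.Add(await BuildDeployArtifactsCommandAsync(siteId, globals));
        return commands;
    }
}

public static class HoistDemo
{
    public static async Task Main()
    {
        var repo = new CountingRepo();
        await new ArtifactDeploymentSketch(repo).DeployToAllSitesAsync(new[] { 1, 2, 3 });
        // Global getters run once each; the per-site getter runs once per site.
        Console.WriteLine(
            $"{repo.SharedScriptCalls} {repo.NotificationListCalls} {repo.DataConnectionCalls}");
        // prints "1 1 3"
    }
}
```

In a real test the hand-rolled counters would be replaced by the NSubstitute .Received() assertion the recommendation names; the counters are used here only so the sketch has no package dependencies.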
DeploymentManager-024 — Test probe actors hold mutable static state across tests
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075, tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217 |
Resolution (2026-05-28): Replaced the static counters with per-test instance state. Introduced ReconcileProbeCounters and SerializationProbeCounters (in DeploymentServiceTests) and ArtifactProbeRecorder (in ArtifactDeploymentServiceTests); each probe actor now takes the counter object as its first constructor argument. Every test instantiates a fresh counter local, passes it via Props.Create(() => new ReconcileProbeActor(counters, ...)), and reads the counts directly off counters; no shared static fields remain. ReconcileProbeActor's counter increments now use Interlocked.Increment so that cross-thread increments are atomic, and SerializationProbeActor retains its lock on a per-test Gate. All 85 ScadaLink.DeploymentManager.Tests continue to pass after the refactor.
Description
ReconcileProbeActor.QueryCount / DeployCount, SerializationProbeActor.MaxConcurrent
/ _current, and ArtifactProbeActor.Received are all static fields.
Each test's actor constructor resets them, but reset-on-construction only
works as long as no two tests that share a probe actor run concurrently.
xUnit runs the tests within a single class sequentially by default, so
today's tests pass. Any change that lets two of these tests overlap (for
example, a runner or configuration that parallelizes tests within a class)
makes the counters race: a deploy in test A could increment DeployCount
while test B is asserting on it.
Static state shared across tests is also why a flaky-test investigation here will be unusually painful: the offending interaction is invisible from any single test file.
Recommendation
Replace the static counters with instance state, hand the actor a probe
recipient (an IActorRef to a TestKit probe), and assert via ExpectMsg
in each test. Where the simpler counter shape is preferred, pass a
shared-state object into the actor's constructor so each test owns its own
instance — never reach for static mutable test state.
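
The counter-object variant of this recommendation might look like the sketch below. It assumes the Akka NuGet package is referenced; ProbeCounters, CountingProbeActor, Deploy and Ack are hypothetical names for illustration and are not the project's actual probe types.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;

// Per-test state object (hypothetical): each test creates its own instance.
public sealed class ProbeCounters
{
    private int _deployCount;

    public int DeployCount => Volatile.Read(ref _deployCount);

    // Atomic increment: the actor thread writes, the test thread reads.
    public void RecordDeploy() => Interlocked.Increment(ref _deployCount);
}

// Hypothetical messages for the sketch.
public sealed record Deploy(string InstanceName);
public sealed record Ack();

public sealed class CountingProbeActor : ReceiveActor
{
    public CountingProbeActor(ProbeCounters counters)
    {
        Receive<Deploy>(_ =>
        {
            counters.RecordDeploy();
            Sender.Tell(new Ack()); // reply only after the count is recorded
        });
    }
}

public static class ProbeDemo
{
    public static async Task Main()
    {
        using var system = ActorSystem.Create("probe-demo");
        var timeout = TimeSpan.FromSeconds(3);

        // Two "tests", each owning its own counters and its own actor.
        var countersA = new ProbeCounters();
        var countersB = new ProbeCounters();
        var actorA = system.ActorOf(Props.Create(() => new CountingProbeActor(countersA)));
        var actorB = system.ActorOf(Props.Create(() => new CountingProbeActor(countersB)));

        await actorA.Ask<Ack>(new Deploy("pump-1"), timeout);
        await actorA.Ask<Ack>(new Deploy("pump-2"), timeout);
        await actorB.Ask<Ack>(new Deploy("valve-1"), timeout);

        // Neither count is affected by the other actor's traffic.
        Console.WriteLine($"{countersA.DeployCount} {countersB.DeployCount}");
        // prints "2 1"
    }
}
```

Because each counter object is reachable only from the test that created it, the two actors can run concurrently without either count being disturbed, which is the property the static fields could not give.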