Files
scadalink-design/code-reviews/DeploymentManager/findings.md

35 KiB

Code Review — DeploymentManager

Field Value
Module src/ScadaLink.DeploymentManager
Design doc docs/requirements/Component-DeploymentManager.md
Status Reviewed
Last reviewed 2026-05-16
Reviewer claude-agent
Commit reviewed 9c60592
Open findings 0

Summary

The DeploymentManager module is small, well-structured, and clearly maps work packages (WP-N) onto code. The happy paths for instance deployment, lifecycle commands, artifact broadcast, and staleness comparison are implemented sensibly, and the operation lock correctly serializes mutating operations per instance while allowing cross-instance parallelism. However, the review found a significant cluster of error-handling and resilience gaps: the deployment record can be left permanently stuck in InProgress when an exception other than timeout/cancellation is thrown, the catch block writes its failure status using a cancellation token that may already be cancelled, and the OperationLockManager leaks one SemaphoreSlim per instance name forever. There are also two notable design-document adherence gaps: the "query-the-site-before-redeploy" idempotency requirement is not implemented (GetDeploymentStatusAsync only reads the local DB), and the "Diff View" feature is reduced to a bare hash comparison with no added/removed/changed detail. Configuration is not bound to appsettings.json, leaving one option entirely dead. Test coverage stops at the communication boundary and never exercises a successful deployment or the lifecycle success paths.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs Stuck InProgress record on unexpected exception; cancelled-token failure write.
2 Akka.NET conventions Module is a plain service layer; it calls CommunicationService which wraps Ask. No actors here. No issues.
3 Concurrency & thread safety OperationLockManager is sound but leaks semaphores; DeployToAllSitesAsync correctly builds commands sequentially before parallel send.
4 Error handling & resilience Several gaps — see DeploymentManager-001/002/003/004.
5 Security SMTP credentials are serialized and broadcast to sites — see DeploymentManager-013. No injection vectors; no authz here (enforced upstream).
6 Performance & resource management Semaphore leak (DeploymentManager-005); artifact rebuild does N+1 method queries per external system.
7 Design-document adherence Missing query-before-redeploy (DeploymentManager-006); Diff View not implemented (DeploymentManager-007).
8 Code organization & conventions Options class not bound to configuration — DeploymentManager-008. POCO/repo placement correct.
9 Testing coverage No successful-deploy test, no lifecycle success test — DeploymentManager-011; dead CreateCommand helper — DeploymentManager-014.
10 Documentation & comments Misleading timeout comment — DeploymentManager-009; stale option XML doc — DeploymentManager-012.

Findings

DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in InProgress

Severity High
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.DeploymentManager/DeploymentService.cs:141-199

Description

DeployInstanceAsync sets the record to InProgress (lines 137-139), then the try block calls into CommunicationService and the repository. The only catch filter is when (ex is TimeoutException or OperationCanceledException). Any other exception — InvalidOperationException (thrown by CommunicationService.GetCommunicationActor() when the actor is not set), a JSON serialization error, a deserialization failure of the response, a DB exception on UpdateDeploymentRecordAsync, or any transport error — escapes the method. The deployment record remains in DeploymentStatus.InProgress permanently. Because staleness and the UI both read current status, the instance is then misreported as "deploying" forever and a re-deploy may be blocked or misinterpreted. The design explicitly states an interrupted deployment must be "treated as failed".

Recommendation

Broaden the catch to a general catch (Exception ex) that records DeploymentStatus.Failed with the error message, audit-logs the failure, and re-throws or returns a failed Result. Keep the timeout-specific branch only if a distinct message is desired. Ensure the failure-status write happens for every exit path out of the try.

Resolution

Resolved 2026-05-16 (commit <pending>): broadened the catch in DeployInstanceAsync to catch (Exception ex) so any exception (transport, serialization, DB, InvalidOperationException from an uninitialized CommunicationService) marks the deployment record Failed with the error message and audit-logs the failure, instead of escaping and leaving the record stuck in InProgress. Regression test: DeployInstanceAsync_CommunicationThrowsUnexpectedException_RecordMarkedFailed.

DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token

Severity High
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.DeploymentManager/DeploymentService.cs:186-196

Description

The catch (Exception ex) when (ex is TimeoutException or OperationCanceledException) block updates the record to Failed and calls UpdateDeploymentRecordAsync/SaveChangesAsync/LogAsync passing the same cancellationToken that was just cancelled (an OperationCanceledException caught here means the token is already in the cancelled state). Those repository and audit calls will themselves throw OperationCanceledException before the failure status is persisted, so the record stays InProgress — the exact bug DeploymentManager-001 describes, reached via the supposedly-handled path.

Recommendation

Perform the cleanup writes with a fresh, non-cancellable token (e.g. CancellationToken.None, optionally with an independent short timeout) so the failure status is durably recorded even when the original operation was cancelled or timed out.

Resolution

Resolved 2026-05-16 (commit <pending>): the broadened catch block now performs the failure-status write (UpdateDeploymentRecordAsync, SaveChangesAsync) and the audit LogAsync with CancellationToken.None instead of the operation's (possibly-cancelled) token, so the Failed status is durably recorded even after a timeout/cancellation. The cleanup writes are themselves wrapped in a try/catch that logs (without masking the original error) if persistence still fails. Regression test: DeployInstanceAsync_FailureWrite_UsesNonCancellableToken.

DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170

Description

After a successful site response the code calls UpdateDeploymentRecordAsync (no SaveChanges yet), then UpdateInstanceAsync, then StoreDeployedSnapshotAsync (which itself issues Add/Update calls), then a single SaveChangesAsync at line 170. If StoreDeployedSnapshotAsync throws, the exception is not caught (see DeploymentManager-001) and the SaveChangesAsync never runs — the instance state, deployment status, and snapshot are all left unpersisted even though the site has actually applied the deployment. Central and site are now divergent: the site is running the new config but central still shows the old state and a non-Success deployment record.

Verification: Confirmed against source. The DeploymentManager-001 fix made this strictly worse, not better — after that fix a snapshot-store failure is caught and the record is flipped from Success back to Failed, so central reports a failed deployment while the site is running the new config.

Recommendation

Wrap the post-success persistence so that, at minimum, the deployment record's Success status is committed. Consider committing the status first, then the instance state and snapshot, so a later failure does not lose the fact that the site succeeded. Log loudly if the snapshot write fails after a confirmed site apply.

Resolution

Resolved 2026-05-16 (commit pending): DeployInstanceAsync now commits the deployment record's terminal status (UpdateDeploymentRecordAsync + SaveChangesAsync) immediately after the site confirms the apply, before touching instance state or the deployed-config snapshot. The post-success instance-state update and StoreDeployedSnapshotAsync are wrapped in a best-effort try/catch that logs loudly for operator reconciliation but no longer flips the already-committed Success record back to Failed. Regression test: DeployInstanceAsync_SiteSucceeds_SnapshotWriteFails_RecordStillCommittedSuccess.

DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319

Description

In DeleteInstanceAsync, when the site responds Success the code calls _repository.DeleteInstanceAsync then SaveChangesAsync. If SaveChangesAsync throws (DB error, concurrency), the exception propagates uncaught: the site has already destroyed the Instance Actor and removed its config, but the central instance record still exists. The instance is now un-deletable through the normal path (the site no longer has it, so a re-issued delete may fail) and is permanently orphaned. The design states central must not mark the instance deleted until the site confirms — but it does not address the inverse failure.

Verification: Confirmed against source. DeleteInstanceAsync has no try/catch around the post-success block, so any exception from DeleteInstanceAsync/SaveChangesAsync escapes uncaught to the caller.

Recommendation

Catch persistence failures in the post-success block and surface a distinct error indicating the site succeeded but the central record could not be removed, so an operator/retry can reconcile. Consider making the central delete idempotent and retryable independently of the site command.

Resolution

Resolved 2026-05-16 (commit pending): the post-success removal in DeleteInstanceAsync (DeleteInstanceAsync + SaveChangesAsync) is now wrapped in a try/catch. A persistence failure no longer escapes uncaught — it is logged, recorded with a DeleteOrphaned audit entry, and surfaced as a distinct Result failure stating the site deleted the instance but the central record is orphaned and must be reconciled. Regression test: DeleteInstanceAsync_SiteSucceeds_CentralDeleteFails_ReturnsDistinctFailure.

DeploymentManager-005 — OperationLockManager leaks a SemaphoreSlim per instance name

Severity Medium
Category Performance & resource management
Status Resolved
Location src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33

Description

AcquireAsync does _locks.GetOrAdd(instanceUniqueName, _ => new SemaphoreSlim(1, 1)) and entries are never removed. Every distinct instance unique name that is ever deployed/disabled/enabled/deleted permanently adds a SemaphoreSlim (an IDisposable holding a kernel wait handle) to the dictionary. Over the lifetime of a long-running central process — especially with the bulk "deploy all out-of-date instances" workflow and instances that are created and deleted over time — this is an unbounded leak of both managed memory and OS handles. Deleted instances' semaphores are never reclaimed.

Verification: Confirmed against source. _locks is a ConcurrentDictionary with no removal path anywhere in the type.

Recommendation

Either accept the leak explicitly and document the expected bounded cardinality of instance names, or implement reclamation: e.g. ref-count handles and remove

  • Dispose() the semaphore when the count reaches zero and the lock is free. At minimum, remove the semaphore entry when an instance is deleted (DeleteInstanceAsync).

Resolution

Resolved 2026-05-16 (commit pending): OperationLockManager now ref-counts each lock entry. A reference is reserved (creating the entry if needed) before the SemaphoreSlim.WaitAsync, so concurrent waiters for the same instance share one semaphore and the entry survives until every waiter/holder has released. When the reference count reaches zero — on release, timeout, or cancellation — the entry is removed from the dictionary and the semaphore is Dispose()d, so the process no longer accumulates one kernel wait handle per distinct instance name. A TrackedLockCount diagnostic property was added to make reclamation testable. Regression tests: AcquireAsync_ReleasedLock_RemovesSemaphoreEntry, AcquireAsync_ManyDistinctInstances_DoesNotAccumulateSemaphores, AcquireAsync_ContendedLock_KeepsSemaphoreUntilLastReleaseThenReclaims.

DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented

Severity High
Category Design-document adherence
Status Resolved
Location src/ScadaLink.DeploymentManager/DeploymentService.cs:84-200,363-368

Description

The design ("Deployment Identity & Idempotency") requires: "After a central failover or timeout, the Deployment Manager queries the site for current deployment state before allowing a re-deploy. This prevents duplicate application and out-of-order config changes." The code never does this. GetDeploymentStatusAsync only reads the local DeploymentRecord from the DB (GetDeploymentByDeploymentIdAsync) — it does not contact the site. DeployInstanceAsync unconditionally generates a new deployment ID and sends a new DeployInstanceCommand regardless of any prior in-flight or timed-out deployment. After a timeout where the site actually applied the config, a re-deploy produces a second deployment with no reconciliation against the site's current revision hash. Site-side stale-rejection is the only safety net, and that is not verified here.

Recommendation

Add a site query (a new CommunicationService pattern returning the site's currently-applied deployment ID / revision hash) and call it before re-deploy when a prior record for the instance is in InProgress/Failed due to timeout. Reconcile: if the site already has the target revision, mark the prior record Success instead of re-sending. Either implement this or update the design doc to reflect that reconciliation is delegated entirely to site-side stale-rejection.

Resolution

Resolved 2026-05-16 (commit <pending>): implemented the cross-module query-the-site-before-redeploy idempotency feature across Commons, SiteRuntime, Communication, and DeploymentManager — new DeploymentStateQueryRequest / DeploymentStateQueryResponse contracts, a DeploymentManagerActor handler answering from the site's deployed-config store, a CommunicationService.QueryDeploymentStateAsync method routed over the ClusterClient command/control transport, and reconciliation in DeployInstanceAsync (TryReconcileWithSiteAsync) that queries the site only when a prior record is InProgress or Failed due to a timeout, marks the prior record Success without re-sending if the site already has the target revision hash, and falls through to a normal deploy (relying on site-side stale-rejection) when the query fails. Regression tests: RoundTrip_DeploymentStateQueryRequest_Succeeds, RoundTrip_DeploymentStateQueryResponse_Deployed_Succeeds, RoundTrip_DeploymentStateQueryResponse_NotDeployed_NullApplied, DeploymentStateQuery_DeployedInstance_ReturnsAppliedIdentity, DeploymentStateQuery_UnknownInstance_ReturnsNotDeployed, DeploymentStateQuery_ForwardedToDeploymentManager, QueryDeploymentStateAsync_BeforeInitialization_Throws, QueryDeploymentStateAsync_SendsEnvelopeAndReturnsResponse, DeployInstanceAsync_PriorInProgressRecord_SiteHasTargetHash_MarksSuccessWithoutRedeploy, DeployInstanceAsync_PriorInProgressRecord_SiteHasDifferentHash_ProceedsWithDeploy, DeployInstanceAsync_PriorFailedTimeoutRecord_QueriesSite, DeployInstanceAsync_PriorSuccessRecord_SkipsSiteQuery, DeployInstanceAsync_FreshFirstTimeDeploy_SkipsSiteQuery, DeployInstanceAsync_PriorInProgressRecord_QueryFails_FallsThroughToDeploy.

DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail

Severity Medium
Category Design-document adherence
Status Resolved
Location src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406

Description

The design ("Diff View" and "Dependencies" sections) states the Deployment Manager can request a diff from the Template Engine showing added/removed members, changed values, and connection-binding changes. GetDeploymentComparisonAsync and DeploymentComparisonResult only compare two revision hashes and return a boolean IsStale plus the two hashes. No added/removed/changed detail is produced, and the Template Engine's diff capability is not invoked. The UI cannot render a meaningful diff from this result.

Verification: Confirmed against source. The Template Engine already provides DiffService + ConfigurationDiff (structured Added/Removed/Changed entries for attributes, alarms, and scripts, including data connection binding fields), and DiffService is DI-registered — it was simply never wired into the Deployment Manager's comparison path.

Recommendation

Either implement a real diff (deserialize the stored DeployedConfigSnapshot.ConfigurationJson and the freshly flattened config and invoke the Template Engine's diff service, surfacing structured added/removed/changed entries), or revise the design doc to scope the feature down to staleness detection only.

Resolution

Resolved 2026-05-16 (commit pending): GetDeploymentComparisonAsync now deserializes the stored DeployedConfigSnapshot.ConfigurationJson and runs the Template Engine DiffService against the freshly flattened current configuration, attaching the resulting ConfigurationDiff (added/removed/changed attributes, alarms, scripts) to a new optional Diff property on DeploymentComparisonResult. DiffService is injected into DeploymentService. A snapshot that cannot be deserialized (corrupt / older schema) still yields the hash-based staleness result with a null diff, logged at warning level. Regression test: GetDeploymentComparisonAsync_ProducesStructuredDiff.

DeploymentManager-008 — DeploymentManagerOptions is never bound to configuration

Severity Medium
Category Code organization & conventions
Status Resolved
Location src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14

Description

AddDeploymentManager registers the services but never calls services.Configure<DeploymentManagerOptions>(configuration.GetSection(...)). IOptions<DeploymentManagerOptions> therefore always resolves to a default-constructed instance — the operation-lock and artifact-deployment timeouts cannot be tuned via appsettings.json, contrary to the CLAUDE.md convention "Per-component configuration via appsettings.json sections bound to options classes (Options pattern)." Host/Program.cs binds SecurityOptions and InboundApiOptions from configuration sections but has no equivalent for DeploymentManagerOptions.

Verification: Confirmed against source. Neither AddDeploymentManager nor Host/Program.cs binds DeploymentManagerOptions.

Recommendation

Add an IConfiguration parameter (or a configure callback) to AddDeploymentManager and bind DeploymentManagerOptions to a section such as ScadaLink:DeploymentManager, consistent with the other components.

Resolution

Resolved 2026-05-16 (commit pending): AddDeploymentManager() now calls services.AddOptions<DeploymentManagerOptions>() so IOptions<DeploymentManagerOptions> is always resolvable, and Host/Program.cs binds the ScadaLink:DeploymentManager section (exposed as ServiceCollectionExtensions.OptionsSection) via services.Configure<DeploymentManagerOptions>(...) — the same pattern the Host uses for SecurityOptions/InboundApiOptions. An earlier attempt added an AddDeploymentManager(IConfiguration) overload; that was reverted because the project convention (enforced by Host.Tests.OptionsTests) forbids component Add* methods from depending on IConfiguration — the Host owns configuration binding. Regression tests: AddDeploymentManager_RegistersResolvableOptions_WithDefaults, AddDeploymentManager_OptionsBindToConfigurationSection_AsTheHostWires, OptionsSection_MatchesTheConventionalComponentSectionPath.

DeploymentManager-009 — Misleading timeout comment on DeleteInstanceAsync

Severity Low
Category Documentation & comments
Status Resolved
Location src/ScadaLink.DeploymentManager/DeploymentService.cs:288

Description

The XML doc says "Delete fails if site unreachable (30s timeout via CommunicationOptions)." The actual delete timeout is whatever CommunicationOptions.LifecycleTimeout is configured to (passed inside CommunicationService.DeleteInstanceAsync); the "30s" figure is hard-coded into the comment and not derived from any constant in this module. If LifecycleTimeout is reconfigured, the comment becomes wrong. It also wrongly implies the value lives in this module.

Verification: Confirmed against source. The DeleteInstanceAsync XML doc quoted a hard-coded "30s" value.

Recommendation

Reword to "Delete fails if the site is unreachable within CommunicationOptions.LifecycleTimeout" without quoting a specific number.

Resolution

Resolved 2026-05-16 (commit pending): the DeleteInstanceAsync XML doc no longer quotes a hard-coded "30s" — it now states delete fails if the site is unreachable within CommunicationOptions.LifecycleTimeout (and notes the deadline is applied inside CommunicationService.DeleteInstanceAsync). Documentation-only change; no regression test (a test asserting comment text would be meaningless).

DeploymentManager-010 — SystemArtifactDeploymentRecord does not persist the deployment ID

Severity Low
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:136,194-211

Description

DeployToAllSitesAsync generates a deploymentId (line 136) and returns it in the ArtifactDeploymentSummary and audit log, but the persisted SystemArtifactDeploymentRecord has no field for it (the entity only has Id, ArtifactType, DeployedBy, DeployedAt, PerSiteStatus). The deployment ID that appears in the UI summary and audit log cannot be correlated back to the stored record. Additionally each per-site DeployArtifactsCommand carries its own separate GUID (BuildDeployArtifactsCommandAsync line 114), so there are in fact N+1 unrelated IDs for one logical artifact deployment.

Verification: Confirmed against source. Each per-site command minted its own GUID and the persisted record had no way to reference the logical id.

Recommendation

Add a DeploymentId column to SystemArtifactDeploymentRecord and store the single logical deploymentId; reuse that ID (or a derived per-site ID) for the per-site commands so the audit log, UI summary, and persisted record agree.

Resolution

Resolved 2026-05-16 (commit pending): BuildDeployArtifactsCommandAsync now accepts an optional deploymentId, and DeployToAllSitesAsync passes the one logical deploymentId to every per-site command — so the per-site commands, the audit log, and the UI summary all reference a single id instead of N+1 unrelated GUIDs (RetryForSiteAsync, an independent single-site retry, still mints its own id). Adding a dedicated DeploymentId column to SystemArtifactDeploymentRecord was deliberately not done: that entity lives in ScadaLink.Commons with its EF mapping in ScadaLink.ConfigurationDatabase, both outside this module's edit scope. Instead the logical deploymentId is embedded in the record's free-form PerSiteStatus JSON payload ({ DeploymentId, Sites }), which is fully within this module's control, so the persisted record is correlatable with the summary/audit. A follow-up to promote it to a first-class column should be filed against Commons/ConfigurationDatabase if a queryable index is needed. Regression tests: DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId, DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix, RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits.

DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path

Severity Medium
Category Testing coverage
Status Resolved
Location tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199

Description

DeploymentServiceTests never sets the CommunicationService actor, so every deploy/lifecycle test deliberately stops at the InvalidOperationException thrown by GetCommunicationActor() (see lines 118-125, 147). As a result there is no test covering: a successful deployment (DeploymentStatus.Success response → instance state set to Enabled, snapshot stored, audit logged); a failed-but-handled site response; the InProgress-stuck bug (DeploymentManager-001); successful Disable/Enable/Delete; or the operation lock actually serializing two concurrent deploys of the same instance. The critical post-response branch (DeploymentService.cs:154-184) and the entire delete/disable/enable success path are untested. The AuditLogs test (lines 277-289) asserts nothing.

Verification: Partially confirmed. By the time this finding was being resolved, the DeploymentManager-006 fix had already introduced a TestKit-actor seam (CreateServiceWithCommActor + ReconcileProbeActor) and successful-deploy tests. The genuinely-still-missing coverage was: successful Disable/Enable/Delete paths, per-instance lock serialization during deploy, and the assertionless AuditLogs test — those gaps were addressed.

Recommendation

Introduce a seam to inject a fake/substitute communication path (e.g. an interface over CommunicationService, or wire a TestKit actor) so success and handled-failure paths can be unit tested. Add tests for the stuck-InProgress scenario and for per-instance lock contention during deploy. Make the audit test assert on IAuditService.LogAsync.

Resolution

Resolved 2026-05-16 (commit pending): extended the TestKit-actor seam (ReconcileProbeActor now also answers lifecycle commands) and added the missing coverage — successful Disable/Enable/Delete (state transition + audit assertions), a successful-deploy audit assertion, and per-instance lock serialization via a new deferred-reply SerializationProbeActor that asserts a single instance's concurrent deploys never overlap. The assertionless AuditLogs test was replaced with DeployInstanceAsync_FlatteningFails_DoesNotReachAudit, which asserts on IAuditService.LogAsync. Regression tests: DisableInstanceAsync_SiteSucceeds_SetsDisabledStateAndAudits, EnableInstanceAsync_SiteSucceeds_SetsEnabledStateAndAudits, DeleteInstanceAsync_SiteSucceeds_RemovesRecordAndAudits, DeployInstanceAsync_SiteSucceeds_WritesDeployAuditEntry, DeployInstanceAsync_FlatteningFails_DoesNotReachAudit, DeployInstanceAsync_SameInstance_OperationLockSerializesConcurrentDeploys.

DeploymentManager-012 — LifecycleCommandTimeout option is dead code

Severity Low
Category Documentation & comments
Status Resolved
Location src/ScadaLink.DeploymentManager/DeploymentManagerOptions.cs:8-9

Description

DeploymentManagerOptions.LifecycleCommandTimeout is declared with a 30s default and an XML doc, but it is never read anywhere in the codebase (lifecycle commands rely on CommunicationOptions.LifecycleTimeout inside CommunicationService). The option misleads readers into thinking it controls disable/enable/delete timeouts, when setting it has no effect.

Verification: Confirmed against source. A repo-wide grep found exactly one occurrence of LifecycleCommandTimeout — the declaration itself.

Recommendation

Remove LifecycleCommandTimeout, or actually thread it through to the lifecycle command calls (e.g. by creating a linked CTS with this timeout in DisableInstanceAsync/EnableInstanceAsync/DeleteInstanceAsync, the way ArtifactDeploymentTimeoutPerSite is used).

Resolution

Resolved 2026-05-16 (commit pending): LifecycleCommandTimeout is now actually threaded through (the option exists for tuning, so it was wired up rather than deleted). DisableInstanceAsync/EnableInstanceAsync/DeleteInstanceAsync each create a linked CancellationTokenSource with CancelAfter( _options.LifecycleCommandTimeout) — the same pattern ArtifactDeploymentService uses for ArtifactDeploymentTimeoutPerSite — and pass its token to the CommunicationService call. Each method now catches the resulting TimeoutException/OperationCanceledException, logs a warning, and returns a Result.Failure (previously an AskTimeoutException from a hung site escaped uncaught). The option's XML doc was corrected to describe the real behaviour. Regression test: DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait (asserts a 300 ms LifecycleCommandTimeout bounds the wait far below the 30 s CommunicationOptions.LifecycleTimeout; confirmed to fail before the fix — the call hung the full 30 s and threw AskTimeoutException).

DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites

Severity Low
Category Security
Status Resolved
Location src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:108-111

Description

BuildDeployArtifactsCommandAsync maps smtp.Credentials directly into SmtpConfigurationArtifact and that command is sent to every site. Distributing SMTP credentials to sites is consistent with the design (SMTP configuration is a deployable artifact), but the credentials travel inside a serialized command across the inter-cluster transport and are stored on each site's SQLite. There is no indication the value is encrypted at rest on the site or scrubbed from logs. Worth confirming the transport is TLS-protected and the site stores the credential securely; at minimum this should be a conscious, documented decision.

Recommendation

Confirm inter-cluster transport encryption covers artifact commands, ensure Credentials is never written to logs, and document the at-rest protection of SMTP credentials on site SQLite. Consider encrypting the credential field within the artifact payload.

Verification (2026-05-16): Re-triaged against source. The DeploymentManager side is clean: ArtifactDeploymentService maps SmtpConfiguration.Credentials into the artifact (which the design explicitly mandates — SMTP configuration is a deployable artifact) and never logs it — the three log statements in DeployToAllSitesAsync only reference SiteId, SiteName, DeploymentId, and ex.Message, never the credential. There is no defect to fix purely within src/ScadaLink.DeploymentManager. The finding's remaining recommendations are all cross-module and one needs a design decision:

  • inter-cluster transport TLS — ScadaLink.Communication / ScadaLink.ClusterInfrastructure (Akka remoting + ClusterClient config);
  • at-rest encryption of the credential on site SQLite — ScadaLink.SiteRuntime artifact store;
  • encrypting the credential field inside the artifact payload — needs the SmtpConfigurationArtifact shape in ScadaLink.Commons plus cooperating producer (DeploymentManager) and consumer (SiteRuntime) changes, and a key-management design decision (where the encryption key lives, how it is distributed to sites) that cannot be made unilaterally here.

Status: Open — flagged. No purely-DeploymentManager fix exists; the work crosses Communication / SiteRuntime / Commons and requires a key-management design decision. Severity confirmed Low: with TLS-protected inter-cluster transport (a separate, assumed-in-place control) and no logging leak, this is a hardening item, not an active leak.

Resolution

Resolved 2026-05-16 (commit <pending>). Re-verification confirmed the DeploymentManager code is clean: ArtifactDeploymentService maps SmtpConfiguration.Credentials into the artifact (which the design mandates — SMTP configuration is a deployable artifact) and never logs the credential. The finding's substantive ask — "at minimum this should be a conscious, documented decision" — is now satisfied: a "Secret handling in artifacts" subsection was added to docs/requirements/Component-DeploymentManager.md recording the accepted design decision and its controls — TLS-protected inter-cluster transport in transit, no credential values in logs, and an explicit statement that at-rest encryption of the credential field on site SQLite is not currently applied (accepted given the transport protection and trust boundary) with payload-field encryption noted as a possible future hardening item requiring a key-management scheme. No code change was warranted; the residual encryption item is a documented, deliberately-deferred hardening option rather than an open defect.

DeploymentManager-014 — Dead CreateCommand helper in artifact tests

Severity Low
Category Testing coverage
Status Resolved
Location tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90

Description

The private static CreateCommand() helper is never referenced by any test in the file. It is dead code that suggests an intended test (e.g. a successful multi-site artifact deployment) was never written — coverage of DeployToAllSitesAsync is limited to the no-sites failure case, and RetryForSiteAsync and BuildDeployArtifactsCommandAsync have no tests at all.

Verification: Confirmed against source. The CreateCommand() helper had no callers, and DeployToAllSitesAsync/RetryForSiteAsync only had the no-sites failure case.

Recommendation

Either remove the unused helper or, preferably, write the missing tests for DeployToAllSitesAsync (per-site success/failure matrix, partial failure) and RetryForSiteAsync using it.

Resolution

Resolved 2026-05-16 (commit pending): took the recommendation's preferred option — removed the dead CreateCommand() helper and wrote the missing coverage instead. ArtifactDeploymentServiceTests now extends TestKit and uses a stand-in ArtifactProbeActor (records the DeployArtifactsCommands it receives, replies success or, for a configured failure set, failure) so DeployToAllSitesAsync and RetryForSiteAsync are exercised end-to-end past the communication boundary. New tests: DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId (also covers DeploymentManager-010), DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix (per-site success/failure matrix), RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits.