
Code Review — DeploymentManager

| Field | Value |
|---|---|
| Module | `src/ScadaLink.DeploymentManager` |
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 0 |

Summary

The DeploymentManager module is small, well-structured, and clearly maps work packages (WP-N) onto code. The happy paths for instance deployment, lifecycle commands, artifact broadcast, and staleness comparison are implemented sensibly, and the operation lock correctly serializes mutating operations per instance while allowing cross-instance parallelism. However, the review found a significant cluster of error-handling and resilience gaps: the deployment record can be left permanently stuck in InProgress when an exception other than timeout/cancellation is thrown, the catch block writes its failure status using a cancellation token that may already be cancelled, and the OperationLockManager leaks one SemaphoreSlim per instance name forever. There are also two notable design-document adherence gaps: the "query-the-site-before-redeploy" idempotency requirement is not implemented (GetDeploymentStatusAsync only reads the local DB), and the "Diff View" feature is reduced to a bare hash comparison with no added/removed/changed detail. Configuration is not bound to appsettings.json, leaving one option entirely dead. Test coverage stops at the communication boundary and never exercises a successful deployment or the lifecycle success paths.

Re-review 2026-05-17 (commit `39d737e`)

Re-reviewed at commit 39d737e after the batch of fixes for DeploymentManager-001..014. All fourteen prior findings remain Resolved and verified against source — the broadened catch, non-cancellable cleanup writes, ref-counted OperationLockManager, query-before-redeploy reconciliation, structured diff, options binding, and the expanded TestKit-actor test suite are all present and correct. The module is in markedly better shape than the first review: error paths are now defensively handled and test coverage is broad (successful deploy/lifecycle, lock serialization, reconciliation matrix, artifact per-site matrix).

This re-review found 3 new findings, all clustered on the DeploymentManager-006 reconciliation path added since the last review. The reconciliation shortcut (TryReconcileWithSiteAsync) marks a stale prior record Success when the site already has the target revision, but it does not perform the side effects the normal success path does — it never updates the instance State, never refreshes the DeployedConfigSnapshot, and never corrects the prior record's own RevisionHash (DeploymentManager-015, DeploymentManager-016). The GetDeploymentStatusAsync XML doc is now stale — it still describes the query-before-redeploy behaviour that actually moved into TryReconcileWithSiteAsync (DeploymentManager-017).

Re-review 2026-05-28 (commit `1eb6e97`)

Re-reviewed at commit 1eb6e97 after the DeploymentManager-015/016/017 fixes and a docs-only XML-comment pass. The three prior findings remain Resolved and verified — ApplyPostSuccessSideEffectsAsync is now invoked from both the normal success path and TryReconcileWithSiteAsync, the reconciled-success branch corrects prior.RevisionHash to the target, and GetDeploymentStatusAsync's XML doc now describes the local-DB-read it actually performs and cross-refs the reconciliation helper. The DiffService wiring, options binding, ref-counted operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor test seam are still in place. The 7 new findings here are not regressions in the DeploymentManager-015/016 fixes — they are issues uncovered by widening the lens to the lifecycle paths, reconciliation's interaction with intentional Disabled state, audit semantics, and operational concerns (per-site artifact-build cost, Pending→InProgress double-write).

The single notable correctness issue is DeploymentManager-018: the reconciliation shortcut unconditionally sets instance.State = Enabled via ApplyPostSuccessSideEffectsAsync. After a central failover that loses the in-memory operation lock, a user can legitimately Disable an instance whose prior deploy record is still InProgress; a subsequent redeploy then reconciles and silently re-enables the instance against the user's explicit intent. The remaining six findings are medium/low: lifecycle-timeout audit gap (DeploymentManager-019), audit-user attribution in reconciliation (DeploymentManager-020), silent fallback in ResolveSiteIdentifierAsync (DeploymentManager-021), back-to-back Pending→InProgress writes (DeploymentManager-022), per-site re-query of system-wide artifacts (DeploymentManager-023), and shared static state across *ProbeActor tests (DeploymentManager-024).

Checklist coverage

Re-review 2026-05-28 (commit `1eb6e97`)

| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ✓ | New: reconciliation forces `Enabled` even if the user disabled the instance in between (DeploymentManager-018). |
| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting verified. Note: test probes hold static state (DeploymentManager-024) — a test concern, not production code. |
| 4 | Error handling & resilience | ✓ | New: Disable/Enable/Delete timeouts return early without writing any audit entry — deploy has `DeployFailed`, lifecycle has nothing (DeploymentManager-019). |
| 5 | Security | ✓ | No new issues. SMTP credential decision documented (DeploymentManager-013 closed). |
| 6 | Performance & resource management | ✓ | New: `BuildDeployArtifactsCommandAsync` re-queries every system-wide artifact set per site in `DeployToAllSitesAsync` (DeploymentManager-023). |
| 7 | Design-document adherence | ✓ | Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on `Disabled`-state preservation. |
| 8 | Code organization & conventions | ✓ | New: redundant `Pending`→`InProgress` back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in `ResolveSiteIdentifierAsync` (DeploymentManager-021). |
| 9 | Testing coverage | ✓ | New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024). |
| 10 | Documentation & comments | ✓ | New: `DeployReconciled` audit uses `prior.DeployedBy` instead of the current `user` parameter — misleading for forensics (DeploymentManager-020). |

Findings

DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in `InProgress`


Severity	High
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:141-199`

Description

DeployInstanceAsync sets the record to InProgress (lines 137-139), then the try block calls into CommunicationService and the repository. The only catch filter is when (ex is TimeoutException or OperationCanceledException). Any other exception — InvalidOperationException (thrown by CommunicationService.GetCommunicationActor() when the actor is not set), a JSON serialization error, a deserialization failure of the response, a DB exception on UpdateDeploymentRecordAsync, or any transport error — escapes the method. The deployment record remains in DeploymentStatus.InProgress permanently. Because staleness and the UI both read current status, the instance is then misreported as "deploying" forever and a re-deploy may be blocked or misinterpreted. The design explicitly states an interrupted deployment must be "treated as failed".

Recommendation

Broaden the catch to a general catch (Exception ex) that records DeploymentStatus.Failed with the error message, audit-logs the failure, and re-throws or returns a failed Result. Keep the timeout-specific branch only if a distinct message is desired. Ensure the failure-status write happens for every exit path out of the try.

Resolution

Resolved 2026-05-16 (commit <pending>): broadened the catch in DeployInstanceAsync to catch (Exception ex) so any exception (transport, serialization, DB, InvalidOperationException from an uninitialized CommunicationService) marks the deployment record Failed with the error message and audit-logs the failure, instead of escaping and leaving the record stuck in InProgress. Regression test: DeployInstanceAsync_CommunicationThrowsUnexpectedException_RecordMarkedFailed.
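
The shape of the broadened catch can be sketched as follows. This is a minimal illustration, not the module's code: `DeploymentRecord` here is a hypothetical in-memory stand-in for the persisted record, `DeployGuard.RunAsync` is an invented helper, and the `sendToSite` delegate stands in for the call into CommunicationService.

```csharp
using System;
using System.Threading.Tasks;

public enum DeploymentStatus { Pending, InProgress, Success, Failed }

// Hypothetical stand-in for the persisted deployment record.
public sealed class DeploymentRecord
{
    public DeploymentStatus Status { get; set; } = DeploymentStatus.Pending;
    public string ErrorMessage { get; set; } = "";
}

public static class DeployGuard
{
    // Runs the site call and guarantees the record leaves InProgress on every exit path.
    public static async Task<bool> RunAsync(DeploymentRecord record, Func<Task> sendToSite)
    {
        record.Status = DeploymentStatus.InProgress;
        try
        {
            await sendToSite();
            record.Status = DeploymentStatus.Success;
            return true;
        }
        catch (Exception ex) // any exception, not only TimeoutException / OperationCanceledException
        {
            record.Status = DeploymentStatus.Failed;
            record.ErrorMessage = ex.Message;
            return false;
        }
    }
}
```

With the original narrow filter, an InvalidOperationException from `sendToSite` would skip the catch and leave `Status` at InProgress; with the general catch the record always ends in Success or Failed.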

DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token


Severity	High
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:186-196`

Description

The catch (Exception ex) when (ex is TimeoutException or OperationCanceledException) block updates the record to Failed and calls UpdateDeploymentRecordAsync/SaveChangesAsync/LogAsync passing the same cancellationToken that was just cancelled (an OperationCanceledException caught here means the token is already in the cancelled state). Those repository and audit calls will themselves throw OperationCanceledException before the failure status is persisted, so the record stays InProgress — the exact bug DeploymentManager-001 describes, reached via the supposedly-handled path.

Recommendation

Perform the cleanup writes with a fresh, non-cancellable token (e.g. CancellationToken.None, optionally with an independent short timeout) so the failure status is durably recorded even when the original operation was cancelled or timed out.

Resolution

Resolved 2026-05-16 (commit <pending>): the broadened catch block now performs the failure-status write (UpdateDeploymentRecordAsync, SaveChangesAsync) and the audit LogAsync with CancellationToken.None instead of the operation's (possibly-cancelled) token, so the Failed status is durably recorded even after a timeout/cancellation. The cleanup writes are themselves wrapped in a try/catch that logs (without masking the original error) if persistence still fails. Regression test: DeployInstanceAsync_FailureWrite_UsesNonCancellableToken.
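
To illustrate why the cleanup write needs its own token, here is a small sketch under stated assumptions: `FakeStore` is a hypothetical store whose write honours the token it receives (as repository and audit calls typically do), and `FailureWriter.DeployAsync` is an invented stand-in for the deploy method.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical store whose writes honour the token they are given.
public sealed class FakeStore
{
    public string Status { get; private set; } = "InProgress";

    public Task SaveStatusAsync(string status, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested(); // a cancelled token aborts the write
        Status = status;
        return Task.CompletedTask;
    }
}

public static class FailureWriter
{
    public static async Task DeployAsync(
        FakeStore store, Func<CancellationToken, Task> sendToSite, CancellationToken ct)
    {
        try
        {
            await sendToSite(ct);
            await store.SaveStatusAsync("Success", ct);
        }
        catch (Exception)
        {
            try
            {
                // Cleanup must not reuse ct: it may already be cancelled.
                await store.SaveStatusAsync("Failed", CancellationToken.None);
            }
            catch (Exception cleanupEx)
            {
                // Log only, so the original failure is not masked.
                Console.Error.WriteLine($"Failure-status write failed: {cleanupEx.Message}");
            }
        }
    }
}
```

If the cleanup call passed `ct` instead of `CancellationToken.None`, a cancelled operation would throw again inside the catch and `Status` would stay at InProgress.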

DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write


Severity	Medium
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170`

Description

After a successful site response the code calls UpdateDeploymentRecordAsync (no SaveChanges yet), then UpdateInstanceAsync, then StoreDeployedSnapshotAsync (which itself issues Add/Update calls), then a single SaveChangesAsync at line 170. If StoreDeployedSnapshotAsync throws, the exception is not caught (see DeploymentManager-001) and the SaveChangesAsync never runs — the instance state, deployment status, and snapshot are all left unpersisted even though the site has actually applied the deployment. Central and site are now divergent: the site is running the new config but central still shows the old state and a non-Success deployment record.

Verification: Confirmed against source. The DeploymentManager-001 fix made this strictly worse, not better — after that fix a snapshot-store failure is caught and the record is flipped from Success back to Failed, so central reports a failed deployment while the site is running the new config.

Recommendation

Wrap the post-success persistence so that, at minimum, the deployment record's Success status is committed. Consider committing the status first, then the instance state and snapshot, so a later failure does not lose the fact that the site succeeded. Log loudly if the snapshot write fails after a confirmed site apply.

Resolution

Resolved 2026-05-16 (commit pending): DeployInstanceAsync now commits the deployment record's terminal status (UpdateDeploymentRecordAsync + SaveChangesAsync) immediately after the site confirms the apply, before touching instance state or the deployed-config snapshot. The post-success instance-state update and StoreDeployedSnapshotAsync are wrapped in a best-effort try/catch that logs loudly for operator reconciliation but no longer flips the already-committed Success record back to Failed. Regression test: DeployInstanceAsync_SiteSucceeds_SnapshotWriteFails_RecordStillCommittedSuccess.
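
The ordering described above can be sketched roughly as below. `PostSuccess.Complete` and its two delegates are hypothetical names used only for illustration; the real method works against the repository and logger rather than a list of strings.

```csharp
using System;
using System.Collections.Generic;

public static class PostSuccess
{
    // Commits the terminal status first, then attempts side effects without undoing it.
    public static string Complete(List<string> log, Action commitSuccess, Action applySideEffects)
    {
        commitSuccess(); // status is durable before anything else runs
        log.Add("status committed");
        try
        {
            applySideEffects(); // instance state and deployed-config snapshot
            log.Add("side effects applied");
        }
        catch (Exception ex)
        {
            // Recorded for operator reconciliation; the committed status is left alone.
            log.Add("side effects failed: " + ex.Message);
        }
        return "Success";
    }
}
```

The design choice is that a confirmed site apply is the fact worth preserving: a later snapshot failure is reported, but it no longer rewrites the outcome.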

DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config


Severity	Medium
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319`

Description

In DeleteInstanceAsync, when the site responds Success the code calls _repository.DeleteInstanceAsync then SaveChangesAsync. If SaveChangesAsync throws (DB error, concurrency), the exception propagates uncaught: the site has already destroyed the Instance Actor and removed its config, but the central instance record still exists. The instance is now un-deletable through the normal path (the site no longer has it, so a re-issued delete may fail) and is permanently orphaned. The design states central must not mark the instance deleted until the site confirms — but it does not address the inverse failure.

Verification: Confirmed against source. DeleteInstanceAsync has no try/catch around the post-success block, so any exception from DeleteInstanceAsync/SaveChangesAsync escapes uncaught to the caller.

Recommendation

Catch persistence failures in the post-success block and surface a distinct error indicating the site succeeded but the central record could not be removed, so an operator/retry can reconcile. Consider making the central delete idempotent and retryable independently of the site command.

Resolution

Resolved 2026-05-16 (commit pending): the post-success removal in DeleteInstanceAsync (DeleteInstanceAsync + SaveChangesAsync) is now wrapped in a try/catch. A persistence failure no longer escapes uncaught — it is logged, recorded with a DeleteOrphaned audit entry, and surfaced as a distinct Result failure stating the site deleted the instance but the central record is orphaned and must be reconciled. Regression test: DeleteInstanceAsync_SiteSucceeds_CentralDeleteFails_ReturnsDistinctFailure.

DeploymentManager-005 — `OperationLockManager` leaks a `SemaphoreSlim` per instance name


Severity	Medium
Category	Performance & resource management
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33`

Description

AcquireAsync does _locks.GetOrAdd(instanceUniqueName, _ => new SemaphoreSlim(1, 1)) and entries are never removed. Every distinct instance unique name that is ever deployed/disabled/enabled/deleted permanently adds a SemaphoreSlim (an IDisposable holding a kernel wait handle) to the dictionary. Over the lifetime of a long-running central process — especially with the bulk "deploy all out-of-date instances" workflow and instances that are created and deleted over time — this is an unbounded leak of both managed memory and OS handles. Deleted instances' semaphores are never reclaimed.

Verification: Confirmed against source. _locks is a ConcurrentDictionary with no removal path anywhere in the type.

Recommendation

Either accept the leak explicitly and document the expected bounded cardinality of instance names, or implement reclamation: e.g. ref-count handles, then remove and Dispose() the semaphore when the count reaches zero and the lock is free. At minimum, remove the semaphore entry when an instance is deleted (DeleteInstanceAsync).

Resolution

Resolved 2026-05-16 (commit pending): OperationLockManager now ref-counts each lock entry. A reference is reserved (creating the entry if needed) before the SemaphoreSlim.WaitAsync, so concurrent waiters for the same instance share one semaphore and the entry survives until every waiter/holder has released. When the reference count reaches zero — on release, timeout, or cancellation — the entry is removed from the dictionary and the semaphore is Dispose()d, so the process no longer accumulates one kernel wait handle per distinct instance name. A TrackedLockCount diagnostic property was added to make reclamation testable. Regression tests: AcquireAsync_ReleasedLock_RemovesSemaphoreEntry, AcquireAsync_ManyDistinctInstances_DoesNotAccumulateSemaphores, AcquireAsync_ContendedLock_KeepsSemaphoreUntilLastReleaseThenReclaims.
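
One way to implement the ref-counting described above is sketched here. It is an illustration, not the module's implementation: the class name `RefCountedLocks` is hypothetical, and for brevity it guards a plain `Dictionary` with a single `lock` instead of using the ConcurrentDictionary the original type is built on. Only `TrackedLockCount` is taken from the resolution text.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class RefCountedLocks
{
    private sealed class Entry
    {
        public readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);
        public int RefCount;
    }

    private readonly object _gate = new object();
    private readonly Dictionary<string, Entry> _entries = new Dictionary<string, Entry>();

    // Diagnostic: number of instance names that currently have a live semaphore.
    public int TrackedLockCount { get { lock (_gate) { return _entries.Count; } } }

    public async Task<IDisposable> AcquireAsync(string name, TimeSpan timeout, CancellationToken ct = default)
    {
        Entry entry;
        lock (_gate)
        {
            // Reserve a reference before waiting so the entry outlives every waiter.
            if (!_entries.TryGetValue(name, out entry))
            {
                entry = new Entry();
                _entries[name] = entry;
            }
            entry.RefCount++;
        }

        bool acquired;
        try
        {
            acquired = await entry.Semaphore.WaitAsync(timeout, ct);
        }
        catch
        {
            Release(name, entry, held: false); // cancelled while waiting
            throw;
        }
        if (!acquired)
        {
            Release(name, entry, held: false); // timed out while waiting
            throw new TimeoutException($"Timed out waiting for lock '{name}'.");
        }
        return new Releaser(this, name, entry);
    }

    private void Release(string name, Entry entry, bool held)
    {
        if (held) entry.Semaphore.Release();
        lock (_gate)
        {
            // The last holder or waiter removes the entry and frees the handle.
            if (--entry.RefCount == 0)
            {
                _entries.Remove(name);
                entry.Semaphore.Dispose();
            }
        }
    }

    private sealed class Releaser : IDisposable
    {
        private readonly RefCountedLocks _owner;
        private readonly string _name;
        private readonly Entry _entry;
        private int _disposed;

        public Releaser(RefCountedLocks owner, string name, Entry entry)
        {
            _owner = owner; _name = name; _entry = entry;
        }

        public void Dispose()
        {
            // Guard against double-dispose releasing the semaphore twice.
            if (Interlocked.Exchange(ref _disposed, 1) == 0)
                _owner.Release(_name, _entry, held: true);
        }
    }
}
```

Because a waiter increments the count before it starts waiting, the semaphore cannot be disposed underneath it, and an entry only disappears once nobody holds or awaits the lock.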

DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented


Severity	High
Category	Design-document adherence
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:84-200,363-368`

Description

The design ("Deployment Identity & Idempotency") requires: "After a central failover or timeout, the Deployment Manager queries the site for current deployment state before allowing a re-deploy. This prevents duplicate application and out-of-order config changes." The code never does this. GetDeploymentStatusAsync only reads the local DeploymentRecord from the DB (GetDeploymentByDeploymentIdAsync) — it does not contact the site. DeployInstanceAsync unconditionally generates a new deployment ID and sends a new DeployInstanceCommand regardless of any prior in-flight or timed-out deployment. After a timeout where the site actually applied the config, a re-deploy produces a second deployment with no reconciliation against the site's current revision hash. Site-side stale-rejection is the only safety net, and that is not verified here.

Recommendation

Add a site query (a new CommunicationService pattern returning the site's currently-applied deployment ID / revision hash) and call it before re-deploy when a prior record for the instance is in InProgress/Failed due to timeout. Reconcile: if the site already has the target revision, mark the prior record Success instead of re-sending. Either implement this or update the design doc to reflect that reconciliation is delegated entirely to site-side stale-rejection.

Resolution

Resolved 2026-05-16 (commit <pending>): implemented the cross-module query-the-site-before-redeploy idempotency feature across Commons, SiteRuntime, Communication, and DeploymentManager — new DeploymentStateQueryRequest / DeploymentStateQueryResponse contracts, a DeploymentManagerActor handler answering from the site's deployed-config store, a CommunicationService.QueryDeploymentStateAsync method routed over the ClusterClient command/control transport, and reconciliation in DeployInstanceAsync (TryReconcileWithSiteAsync) that queries the site only when a prior record is InProgress or Failed due to a timeout, marks the prior record Success without re-sending if the site already has the target revision hash, and falls through to a normal deploy (relying on site-side stale-rejection) when the query fails. Regression tests: RoundTrip_DeploymentStateQueryRequest_Succeeds, RoundTrip_DeploymentStateQueryResponse_Deployed_Succeeds, RoundTrip_DeploymentStateQueryResponse_NotDeployed_NullApplied, DeploymentStateQuery_DeployedInstance_ReturnsAppliedIdentity, DeploymentStateQuery_UnknownInstance_ReturnsNotDeployed, DeploymentStateQuery_ForwardedToDeploymentManager, QueryDeploymentStateAsync_BeforeInitialization_Throws, QueryDeploymentStateAsync_SendsEnvelopeAndReturnsResponse, DeployInstanceAsync_PriorInProgressRecord_SiteHasTargetHash_MarksSuccessWithoutRedeploy, DeployInstanceAsync_PriorInProgressRecord_SiteHasDifferentHash_ProceedsWithDeploy, DeployInstanceAsync_PriorFailedTimeoutRecord_QueriesSite, DeployInstanceAsync_PriorSuccessRecord_SkipsSiteQuery, DeployInstanceAsync_FreshFirstTimeDeploy_SkipsSiteQuery, DeployInstanceAsync_PriorInProgressRecord_QueryFails_FallsThroughToDeploy.
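
The reconciliation decision described in this resolution reduces to a small decision function, sketched below. The enums, `Reconciler.Decide`, and the `querySite` delegate are hypothetical simplifications; the real path goes through CommunicationService.QueryDeploymentStateAsync and operates on deployment records.

```csharp
using System;

public enum PriorState { None, Success, InProgress, FailedTimeout, FailedOther }
public enum ReconcileAction { Deploy, MarkPriorSuccess }

public static class Reconciler
{
    // querySite returns the revision hash the site has applied (null if none); it may throw.
    public static ReconcileAction Decide(PriorState prior, string targetHash, Func<string> querySite)
    {
        // Only an ambiguous prior outcome justifies a round trip to the site.
        bool needsQuery = prior == PriorState.InProgress || prior == PriorState.FailedTimeout;
        if (!needsQuery) return ReconcileAction.Deploy;

        try
        {
            string applied = querySite();
            return string.Equals(applied, targetHash, StringComparison.Ordinal)
                ? ReconcileAction.MarkPriorSuccess
                : ReconcileAction.Deploy;
        }
        catch (Exception)
        {
            // Query failed: fall through to a normal deploy and rely on site-side stale-rejection.
            return ReconcileAction.Deploy;
        }
    }
}
```

The cases map onto the regression tests listed above: a prior Success or a first-time deploy skips the query, a matching hash marks the prior record Success without re-sending, and a different hash or a failed query proceeds with the deploy.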

DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail


Severity	Medium
Category	Design-document adherence
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406`

Description

The design ("Diff View" and "Dependencies" sections) states the Deployment Manager can request a diff from the Template Engine showing added/removed members, changed values, and connection-binding changes. GetDeploymentComparisonAsync and DeploymentComparisonResult only compare two revision hashes and return a boolean IsStale plus the two hashes. No added/removed/changed detail is produced, and the Template Engine's diff capability is not invoked. The UI cannot render a meaningful diff from this result.

Verification: Confirmed against source. The Template Engine already provides DiffService + ConfigurationDiff (structured Added/Removed/Changed entries for attributes, alarms, and scripts, including data connection binding fields), and DiffService is DI-registered — it was simply never wired into the Deployment Manager's comparison path.

Recommendation

Either implement a real diff (deserialize the stored DeployedConfigSnapshot.ConfigurationJson and the freshly flattened config and invoke the Template Engine's diff service, surfacing structured added/removed/changed entries), or revise the design doc to scope the feature down to staleness detection only.

Resolution

Resolved 2026-05-16 (commit pending): GetDeploymentComparisonAsync now deserializes the stored DeployedConfigSnapshot.ConfigurationJson and runs the Template Engine DiffService against the freshly flattened current configuration, attaching the resulting ConfigurationDiff (added/removed/changed attributes, alarms, scripts) to a new optional Diff property on DeploymentComparisonResult. DiffService is injected into DeploymentService. A snapshot that cannot be deserialized (corrupt / older schema) still yields the hash-based staleness result with a null diff, logged at warning level. Regression test: GetDeploymentComparisonAsync_ProducesStructuredDiff.
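
To show what a structured result adds over a hash comparison, here is a deliberately simplified sketch. It is not the Template Engine's DiffService or ConfigurationDiff: `SimpleDiff` and `SimpleDiffer` are hypothetical, and configurations are reduced to flat name/value dictionaries, whereas the real diff distinguishes attributes, alarms, and scripts.

```csharp
using System.Collections.Generic;

public sealed class SimpleDiff
{
    public List<string> Added { get; } = new List<string>();
    public List<string> Removed { get; } = new List<string>();
    public List<string> Changed { get; } = new List<string>();
    public bool HasChanges => Added.Count + Removed.Count + Changed.Count > 0;
}

public static class SimpleDiffer
{
    // Compares the deployed snapshot with the freshly flattened configuration.
    public static SimpleDiff Compare(
        IReadOnlyDictionary<string, string> deployed,
        IReadOnlyDictionary<string, string> current)
    {
        var diff = new SimpleDiff();
        foreach (var kv in current)
        {
            if (!deployed.TryGetValue(kv.Key, out var oldValue)) diff.Added.Add(kv.Key);
            else if (oldValue != kv.Value) diff.Changed.Add(kv.Key);
        }
        foreach (var key in deployed.Keys)
        {
            if (!current.ContainsKey(key)) diff.Removed.Add(key);
        }
        return diff;
    }
}
```

A hash mismatch only says that `HasChanges` would be true; the three lists are what a UI needs in order to render which members were added, removed, or changed.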

DeploymentManager-008 — `DeploymentManagerOptions` is never bound to configuration


Severity	Medium
Category	Code organization & conventions
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14`

Description

AddDeploymentManager registers the services but never calls services.Configure<DeploymentManagerOptions>(configuration.GetSection(...)). IOptions<DeploymentManagerOptions> therefore always resolves to a default-constructed instance — the operation-lock and artifact-deployment timeouts cannot be tuned via appsettings.json, contrary to the CLAUDE.md convention "Per-component configuration via appsettings.json sections bound to options classes (Options pattern)." Host/Program.cs binds SecurityOptions and InboundApiOptions from configuration sections but has no equivalent for DeploymentManagerOptions.

Verification: Confirmed against source. Neither AddDeploymentManager nor Host/Program.cs binds DeploymentManagerOptions.

Recommendation

Add an IConfiguration parameter (or a configure callback) to AddDeploymentManager and bind DeploymentManagerOptions to a section such as ScadaLink:DeploymentManager, consistent with the other components.

Resolution

Resolved 2026-05-16 (commit pending): AddDeploymentManager() now calls services.AddOptions<DeploymentManagerOptions>() so IOptions<DeploymentManagerOptions> is always resolvable, and Host/Program.cs binds the ScadaLink:DeploymentManager section (exposed as ServiceCollectionExtensions.OptionsSection) via services.Configure<DeploymentManagerOptions>(...) — the same pattern the Host uses for SecurityOptions/InboundApiOptions. An earlier attempt added an AddDeploymentManager(IConfiguration) overload; that was reverted because the project convention (enforced by Host.Tests.OptionsTests) forbids component Add* methods from depending on IConfiguration — the Host owns configuration binding. Regression tests: AddDeploymentManager_RegistersResolvableOptions_WithDefaults, AddDeploymentManager_OptionsBindToConfigurationSection_AsTheHostWires, OptionsSection_MatchesTheConventionalComponentSectionPath.

DeploymentManager-009 — Misleading timeout comment on `DeleteInstanceAsync`


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:288`

Description

The XML doc says "Delete fails if site unreachable (30s timeout via CommunicationOptions)." The actual delete timeout is whatever CommunicationOptions.LifecycleTimeout is configured to (passed inside CommunicationService.DeleteInstanceAsync); the "30s" figure is hard-coded into the comment and not derived from any constant in this module. If LifecycleTimeout is reconfigured, the comment becomes wrong. It also wrongly implies the value lives in this module.

Verification: Confirmed against source. The DeleteInstanceAsync XML doc quoted a hard-coded "30s" value.

Recommendation

Reword to "Delete fails if the site is unreachable within CommunicationOptions.LifecycleTimeout" without quoting a specific number.

Resolution

Resolved 2026-05-16 (commit pending): the DeleteInstanceAsync XML doc no longer quotes a hard-coded "30s" — it now states delete fails if the site is unreachable within CommunicationOptions.LifecycleTimeout (and notes the deadline is applied inside CommunicationService.DeleteInstanceAsync). Documentation-only change; no regression test (a test asserting comment text would be meaningless).

DeploymentManager-010 — `SystemArtifactDeploymentRecord` does not persist the deployment ID


Severity	Low
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:136,194-211`

Description

DeployToAllSitesAsync generates a deploymentId (line 136) and returns it in the ArtifactDeploymentSummary and audit log, but the persisted SystemArtifactDeploymentRecord has no field for it (the entity only has Id, ArtifactType, DeployedBy, DeployedAt, PerSiteStatus). The deployment ID that appears in the UI summary and audit log cannot be correlated back to the stored record. Additionally each per-site DeployArtifactsCommand carries its own separate GUID (BuildDeployArtifactsCommandAsync line 114), so there are in fact N+1 unrelated IDs for one logical artifact deployment.

Verification: Confirmed against source. Each per-site command minted its own GUID and the persisted record had no way to reference the logical id.

Recommendation

Add a DeploymentId column to SystemArtifactDeploymentRecord and store the single logical deploymentId; reuse that ID (or a derived per-site ID) for the per-site commands so the audit log, UI summary, and persisted record agree.

Resolution

Resolved 2026-05-16 (commit pending): BuildDeployArtifactsCommandAsync now accepts an optional deploymentId, and DeployToAllSitesAsync passes the one logical deploymentId to every per-site command — so the per-site commands, the audit log, and the UI summary all reference a single id instead of N+1 unrelated GUIDs (RetryForSiteAsync, an independent single-site retry, still mints its own id). Adding a dedicated DeploymentId column to SystemArtifactDeploymentRecord was deliberately not done: that entity lives in ScadaLink.Commons with its EF mapping in ScadaLink.ConfigurationDatabase, both outside this module's edit scope. Instead the logical deploymentId is embedded in the record's free-form PerSiteStatus JSON payload ({ DeploymentId, Sites }), which is fully within this module's control, so the persisted record is correlatable with the summary/audit. A follow-up to promote it to a first-class column should be filed against Commons/ConfigurationDatabase if a queryable index is needed. Regression tests: DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId, DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix, RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits.
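
Embedding the logical id in the JSON payload might look roughly like the sketch below. Only the top-level `{ DeploymentId, Sites }` shape comes from the resolution text; the `PerSiteStatusPayload` type, the helper names, and the assumption that `Sites` maps a site identifier to a status string are illustrative guesses.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical payload type; the per-site value shape is an assumption.
public sealed class PerSiteStatusPayload
{
    public string DeploymentId { get; set; } = "";
    public Dictionary<string, string> Sites { get; set; } = new Dictionary<string, string>();
}

public static class PerSiteStatusJson
{
    // Wraps the per-site results together with the single logical deployment id.
    public static string Serialize(string deploymentId, Dictionary<string, string> sites) =>
        JsonSerializer.Serialize(new PerSiteStatusPayload { DeploymentId = deploymentId, Sites = sites });

    // Recovers the logical id so a stored record can be matched to the summary and audit log.
    public static string ReadDeploymentId(string json) =>
        JsonSerializer.Deserialize<PerSiteStatusPayload>(json)?.DeploymentId ?? "";
}
```

The trade-off noted in the resolution still applies: an id stored inside a JSON column can be read back for correlation, but it cannot be indexed or queried the way a dedicated column could.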

DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path


Severity	Medium
Category	Testing coverage
Status	Resolved
Location	`tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199`

Description

DeploymentServiceTests never sets the CommunicationService actor, so every deploy/lifecycle test deliberately stops at the InvalidOperationException thrown by GetCommunicationActor() (see lines 118-125, 147). As a result there is no test covering: a successful deployment (DeploymentStatus.Success response → instance state set to Enabled, snapshot stored, audit logged); a failed-but-handled site response; the InProgress-stuck bug (DeploymentManager-001); successful Disable/Enable/Delete; or the operation lock actually serializing two concurrent deploys of the same instance. The critical post-response branch (DeploymentService.cs:154-184) and the entire delete/disable/enable success path are untested. The AuditLogs test (lines 277-289) asserts nothing.

Verification: Partially confirmed. By the time this finding was being resolved, the DeploymentManager-006 fix had already introduced a TestKit-actor seam (CreateServiceWithCommActor + ReconcileProbeActor) and successful-deploy tests. The genuinely-still-missing coverage was: successful Disable/Enable/Delete paths, per-instance lock serialization during deploy, and the assertionless AuditLogs test — those gaps were addressed.

Recommendation

Introduce a seam to inject a fake/substitute communication path (e.g. an interface over CommunicationService, or wire a TestKit actor) so success and handled-failure paths can be unit tested. Add tests for the stuck-InProgress scenario and for per-instance lock contention during deploy. Make the audit test assert on IAuditService.LogAsync.

Resolution

Resolved 2026-05-16 (commit pending): extended the TestKit-actor seam (ReconcileProbeActor now also answers lifecycle commands) and added the missing coverage — successful Disable/Enable/Delete (state transition + audit assertions), a successful-deploy audit assertion, and per-instance lock serialization via a new deferred-reply SerializationProbeActor that asserts a single instance's concurrent deploys never overlap. The assertionless AuditLogs test was replaced with DeployInstanceAsync_FlatteningFails_DoesNotReachAudit, which asserts on IAuditService.LogAsync. Regression tests: DisableInstanceAsync_SiteSucceeds_SetsDisabledStateAndAudits, EnableInstanceAsync_SiteSucceeds_SetsEnabledStateAndAudits, DeleteInstanceAsync_SiteSucceeds_RemovesRecordAndAudits, DeployInstanceAsync_SiteSucceeds_WritesDeployAuditEntry, DeployInstanceAsync_FlatteningFails_DoesNotReachAudit, DeployInstanceAsync_SameInstance_OperationLockSerializesConcurrentDeploys.
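
The lock-serialization assertion can be illustrated without Akka. The sketch below is not the module's OperationLockManager or its SerializationProbeActor: InstanceLockSketch and SerializationCheckSketch are hypothetical stand-ins that show the idea of holding a per-instance lock during a deferred reply and recording the peak number of overlapping operations.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical per-instance lock: one SemaphoreSlim per instance id.
public sealed class InstanceLockSketch
{
    private readonly ConcurrentDictionary<int, SemaphoreSlim> _locks = new();

    public async Task RunExclusiveAsync(int instanceId, Func<Task> operation)
    {
        var gate = _locks.GetOrAdd(instanceId, _ => new SemaphoreSlim(1, 1));
        await gate.WaitAsync();
        try { await operation(); }
        finally { gate.Release(); }
    }
}

public static class SerializationCheckSketch
{
    // Starts n concurrent operations on one instance and returns the highest
    // number observed running at the same time (1 means fully serialized).
    public static async Task<int> MaxOverlapAsync(InstanceLockSketch locks, int instanceId, int n)
    {
        int current = 0, max = 0;
        var tasks = new Task[n];
        for (var i = 0; i < n; i++)
        {
            tasks[i] = Task.Run(() => locks.RunExclusiveAsync(instanceId, async () =>
            {
                var now = Interlocked.Increment(ref current);
                // Record the peak with a compare-and-swap loop so updates are not lost.
                int seen;
                while (now > (seen = Volatile.Read(ref max)) &&
                       Interlocked.CompareExchange(ref max, now, seen) != seen) { }
                await Task.Delay(10); // hold the lock briefly, like a deferred reply
                Interlocked.Decrement(ref current);
            }));
        }
        await Task.WhenAll(tasks);
        return max;
    }
}
```

A test in this style would assert that MaxOverlapAsync returns 1 for a single instance id; the delay inside the critical section is what gives overlapping callers a chance to be observed if the lock were missing.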

DeploymentManager-012 — `LifecycleCommandTimeout` option is dead code


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentManagerOptions.cs:8-9`

Description

DeploymentManagerOptions.LifecycleCommandTimeout is declared with a 30s default and an XML doc, but it is never read anywhere in the codebase (lifecycle commands rely on CommunicationOptions.LifecycleTimeout inside CommunicationService). The option misleads readers into thinking it controls disable/enable/delete timeouts, when setting it has no effect.

Verification: Confirmed against source. A repo-wide grep found exactly one occurrence of LifecycleCommandTimeout — the declaration itself.

Recommendation

Remove LifecycleCommandTimeout, or actually thread it through to the lifecycle command calls (e.g. by creating a linked CTS with this timeout in DisableInstanceAsync/EnableInstanceAsync/DeleteInstanceAsync, the way ArtifactDeploymentTimeoutPerSite is used).

Resolution

Resolved 2026-05-16 (commit pending): LifecycleCommandTimeout is now actually threaded through (the option exists for tuning, so it was wired up rather than deleted). DisableInstanceAsync/EnableInstanceAsync/DeleteInstanceAsync each create a linked CancellationTokenSource with CancelAfter(_options.LifecycleCommandTimeout), the same pattern ArtifactDeploymentService uses for ArtifactDeploymentTimeoutPerSite, and pass its token to the CommunicationService call. Each method now catches the resulting TimeoutException/OperationCanceledException, logs a warning, and returns a Result.Failure (previously an AskTimeoutException from a hung site escaped uncaught). The option's XML doc was corrected to describe the real behaviour. Regression test: DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait (asserts a 300 ms LifecycleCommandTimeout bounds the wait far below the 30 s CommunicationOptions.LifecycleTimeout; confirmed to fail before the fix, when the call hung the full 30 s and threw AskTimeoutException).
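
The linked-CTS pattern described in this resolution might look roughly like the following. It is a minimal sketch, not the module's code: LifecycleTimeoutSketch, RunWithTimeoutAsync and the tuple result are hypothetical stand-ins for the real lifecycle method, the Result type and the CommunicationService call.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LifecycleTimeoutSketch
{
    // Bounds siteCall by timeout and converts a timeout or cancellation
    // into a failure value instead of letting the exception escape.
    public static async Task<(bool Ok, string Error)> RunWithTimeoutAsync(
        Func<CancellationToken, Task> siteCall,
        TimeSpan timeout,
        CancellationToken callerToken = default)
    {
        // Linked source: cancels when the caller cancels OR the timeout elapses.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(timeout);
        try
        {
            await siteCall(cts.Token);
            return (true, "");
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            return (false, "Lifecycle command timed out or was cancelled.");
        }
    }
}
```

Passing a call that never completes, for example `t => Task.Delay(Timeout.Infinite, t)`, with a short timeout returns a failure once the timeout elapses, which is the behaviour the regression test pins with its 300 ms setting.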

DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites


Severity	Low
Category	Security
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:108-111`

Description

BuildDeployArtifactsCommandAsync maps smtp.Credentials directly into SmtpConfigurationArtifact and that command is sent to every site. Distributing SMTP credentials to sites is consistent with the design (SMTP configuration is a deployable artifact), but the credentials travel inside a serialized command across the inter-cluster transport and are stored on each site's SQLite. There is no indication the value is encrypted at rest on the site or scrubbed from logs. Worth confirming the transport is TLS-protected and the site stores the credential securely; at minimum this should be a conscious, documented decision.

Recommendation

Confirm inter-cluster transport encryption covers artifact commands, ensure Credentials is never written to logs, and document the at-rest protection of SMTP credentials on site SQLite. Consider encrypting the credential field within the artifact payload.

Verification (2026-05-16): Re-triaged against source. The DeploymentManager side is clean: ArtifactDeploymentService maps SmtpConfiguration.Credentials into the artifact (which the design explicitly mandates — SMTP configuration is a deployable artifact) and never logs it — the three log statements in DeployToAllSitesAsync only reference SiteId, SiteName, DeploymentId, and ex.Message, never the credential. There is no defect to fix purely within src/ScadaLink.DeploymentManager. The finding's remaining recommendations are all cross-module and one needs a design decision:

- inter-cluster transport TLS: ScadaLink.Communication / ScadaLink.ClusterInfrastructure (Akka remoting + ClusterClient config);
- at-rest encryption of the credential on site SQLite: ScadaLink.SiteRuntime artifact store;
- encrypting the credential field inside the artifact payload: needs the SmtpConfigurationArtifact shape in ScadaLink.Commons plus cooperating producer (DeploymentManager) and consumer (SiteRuntime) changes, and a key-management design decision (where the encryption key lives, how it is distributed to sites) that cannot be made unilaterally here.

Status: Open — flagged. No purely-DeploymentManager fix exists; the work crosses Communication / SiteRuntime / Commons and requires a key-management design decision. Severity confirmed Low: with TLS-protected inter-cluster transport (a separate, assumed-in-place control) and no logging leak, this is a hardening item, not an active leak.

Resolution

Resolved 2026-05-16 (commit <pending>). Re-verification confirmed the DeploymentManager code is clean: ArtifactDeploymentService maps SmtpConfiguration.Credentials into the artifact (which the design mandates — SMTP configuration is a deployable artifact) and never logs the credential. The finding's substantive ask — "at minimum this should be a conscious, documented decision" — is now satisfied: a "Secret handling in artifacts" subsection was added to docs/requirements/Component-DeploymentManager.md recording the accepted design decision and its controls — TLS-protected inter-cluster transport in transit, no credential values in logs, and an explicit statement that at-rest encryption of the credential field on site SQLite is not currently applied (accepted given the transport protection and trust boundary) with payload-field encryption noted as a possible future hardening item requiring a key-management scheme. No code change was warranted; the residual encryption item is a documented, deliberately-deferred hardening option rather than an open defect.

DeploymentManager-014 — Dead `CreateCommand` helper in artifact tests


Severity	Low
Category	Testing coverage
Status	Resolved
Location	`tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90`

Description

The private static CreateCommand() helper is never referenced by any test in the file. It is dead code that suggests an intended test (e.g. a successful multi-site artifact deployment) was never written — coverage of DeployToAllSitesAsync is limited to the no-sites failure case, and RetryForSiteAsync and BuildDeployArtifactsCommandAsync have no tests at all.

Verification: Confirmed against source. The CreateCommand() helper had no callers, and DeployToAllSitesAsync/RetryForSiteAsync only had the no-sites failure case.

Recommendation

Either remove the unused helper or, preferably, write the missing tests for DeployToAllSitesAsync (per-site success/failure matrix, partial failure) and RetryForSiteAsync using it.

Resolution

Resolved 2026-05-16 (commit pending): took the recommendation's preferred option — removed the dead CreateCommand() helper and wrote the missing coverage instead. ArtifactDeploymentServiceTests now extends TestKit and uses a stand-in ArtifactProbeActor (records the DeployArtifactsCommands it receives, replies success or, for a configured failure set, failure) so DeployToAllSitesAsync and RetryForSiteAsync are exercised end-to-end past the communication boundary. New tests: DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId (also covers DeploymentManager-010), DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix (per-site success/failure matrix), RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits.

DeploymentManager-015 — Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates


Severity	High
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:631-655`

Description

TryReconcileWithSiteAsync (the DeploymentManager-006 query-before-redeploy path) handles the case where a prior InProgress/timeout-Failed record exists and the site reports it already has the target revision hash. In that case it marks the prior DeploymentRecord Success, audit-logs DeployReconciled, and returns it — the caller then returns Result.Success and never enters the normal deploy body.

The normal success path (DeployInstanceAsync, DeploymentService.cs:215-223) does three things on a successful site response: writes the deployment record terminal status, sets instance.State = InstanceState.Enabled + UpdateInstanceAsync, and calls StoreDeployedSnapshotAsync. The reconciliation shortcut performs only the first. Consequently, after a reconciled deployment:

- The instance State is left at whatever it was (e.g. NotDeployed for a first-time deploy that timed out, or Disabled) even though the site is actually running the configuration. The central state machine and the site diverge, and a subsequent DisableInstanceAsync/EnableInstanceAsync will be rejected or allowed incorrectly by StateTransitionValidator.
- No DeployedConfigSnapshot is created or refreshed. A first-time deploy that is resolved purely by reconciliation leaves GetDeploymentComparisonAsync permanently returning "No deployed snapshot found for this instance.", and a redeploy reconciliation leaves the stored snapshot showing the old config even though the deployment record claims Success for the new revision.

The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the deployed snapshot and instance state to reflect the last successful deployment; the reconciliation path silently breaks both invariants.

Recommendation

In the reconciled-success branch of TryReconcileWithSiteAsync, perform the same post-success side effects as the normal path: set instance.State = InstanceState.Enabled (+ UpdateInstanceAsync) and call StoreDeployedSnapshotAsync with the target deployment ID / revision hash / config JSON. Factor the shared post-success logic into one helper so the normal and reconciliation paths cannot drift. Add a regression test asserting that a reconciled deployment leaves the instance Enabled and a snapshot stored.

Resolution

Resolved 2026-05-17 (commit pending): extracted the shared post-success side effects into ApplyPostSuccessSideEffectsAsync (sets instance State = Enabled + UpdateInstanceAsync, stores/refreshes the DeployedConfigSnapshot) and invoked it from both the normal deploy success path and the TryReconcileWithSiteAsync reconciled-success branch, so a reconciled deployment now performs the same instance-state and snapshot updates as a normal one (configJson is now computed before the reconciliation call and threaded into TryReconcileWithSiteAsync). Regression test: DeployInstanceAsync_Reconciled_SetsInstanceEnabledAndStoresSnapshot.

DeploymentManager-016 — Reconciled prior record keeps its stale `RevisionHash`


Severity	Medium
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:639-651`

Description

When TryReconcileWithSiteAsync reconciles a prior record, it mutates prior.Status, prior.ErrorMessage, and prior.CompletedAt, but not prior.RevisionHash. The reconciliation condition only compares the site's AppliedRevisionHash against the freshly-flattened targetRevisionHash — it does not require prior.RevisionHash to equal either of them.

The prior record can legitimately carry a different revision hash than the current target: e.g. a deploy timed out at revision R1, the template was then edited so the current flatten yields R2, and meanwhile the site actually applied R2 through some other path (or R1 and R2 are equal-by-content but the prior record predates a hash recompute). After reconciliation the record's Status is Success but its RevisionHash still says R1, so staleness checks and any UI that reads DeploymentRecord.RevisionHash will report the instance as deployed at the wrong revision. The audit DeployReconciled entry records RevisionHash = targetRevisionHash, contradicting the persisted record.

Recommendation

In the reconciled-success branch, also set prior.RevisionHash = targetRevisionHash so the persisted record, the audit entry, and the site's actual applied revision all agree. Alternatively, only reconcile when prior.RevisionHash == targetRevisionHash and otherwise fall through to a normal deploy.

Resolution

Resolved 2026-05-17 (commit pending): the reconciled-success branch of TryReconcileWithSiteAsync now also sets prior.RevisionHash = targetRevisionHash, so the persisted record, the DeployReconciled audit entry, and the site's actually-applied revision all agree. Regression test: DeployInstanceAsync_Reconciled_PriorRecordRevisionHashUpdatedToTarget.

DeploymentManager-017 — `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:562-570`

Description

The XML summary on GetDeploymentStatusAsync reads: "WP-2: After failover/timeout, query site for current deployment state before re-deploying." The method body does no such thing — it is a one-line pass-through to _repository.GetDeploymentByDeploymentIdAsync, a pure local DB read. The query-the-site-before-redeploy behaviour the comment describes was implemented separately in TryReconcileWithSiteAsync (DeploymentManager-006). The stale comment is a leftover of the original design intent and misleads a reader into thinking this method contacts the site.

Recommendation

Reword the summary to describe what the method actually does — "returns the current persisted DeploymentRecord for the given deployment ID from the configuration database" — and, if useful, cross-reference TryReconcileWithSiteAsync as the place the site-query reconciliation lives.

Resolution

Resolved 2026-05-17 (commit pending): the GetDeploymentStatusAsync XML doc now states it returns the persisted DeploymentRecord from the configuration database as a pure local read, and cross-references TryReconcileWithSiteAsync as where the query-the-site-before-redeploy reconciliation actually lives. Documentation-only change; no regression test (a test asserting comment text would be meaningless).

DeploymentManager-018 — Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover


Severity	High
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:675-682,721-748`

Resolution — Added a forceEnabledState parameter to ApplyPostSuccessSideEffectsAsync. The normal deploy path passes true (fresh apply legitimately ends in Enabled); the reconciliation path passes false, so the helper only promotes NotDeployed → Enabled and leaves an existing Disabled (or Enabled) untouched. Regression test DeployInstanceAsync_Reconciled_DisabledInstance_PreservesDisabledState exercises the failover scenario and asserts the prior record still flips to Success while Instance.State stays Disabled.

Description

TryReconcileWithSiteAsync calls ApplyPostSuccessSideEffectsAsync whenever the site reports it has the target revision hash, and that helper unconditionally writes instance.State = InstanceState.Enabled. The reconciliation shortcut only runs when the prior DeploymentRecord is InProgress or timeout-Failed — exactly the scenarios that survive a central failover (the in-memory OperationLockManager is lost on failover, by design: "Lost on central failover (acceptable per design — in-progress treated as failed)").

After such a failover, the per-instance operation lock is gone but the deployment record is still InProgress in the DB. A user can legitimately issue DisableInstanceAsync for the same instance — there is nothing in DisableInstanceAsync that consults the deployment record, only the StateTransitionValidator over Instance.State. If the state is Enabled (the typical case when the deploy started), the disable proceeds, the site honours it (the design states a disabled instance retains its deployed configuration), and central now persists Instance.State = Disabled. The deployment-record row remains InProgress (no one transitioned it). Later the user retries the deploy: TryReconcileWithSiteAsync runs, the site still has the target revision hash (Disable doesn't change the deployed config), the prior record is marked Success, and ApplyPostSuccessSideEffectsAsync writes Instance.State = Enabled — silently overriding the user's explicit Disable.

The same trap exists for any direct DB edit / migration that flipped the state between the timed-out deploy and the redeploy. The normal deploy path can defensibly assume Enabled after a fresh successful apply, but the reconciliation path is reconciling prior state with prior user intent; it should preserve Disabled if that is the current Instance.State at the time of reconciliation, mirroring the design's separation between deploy (config apply) and disable (subscription/script lifecycle).

Recommendation

In the reconciliation branch, do not force Enabled. Either:

- Pass a flag/parameter to ApplyPostSuccessSideEffectsAsync telling it whether to touch state, and skip the state write on the reconciliation path (leaving the current Instance.State intact, which is already Enabled for a fresh deploy that timed out and Disabled for the user-disabled follow-up case); or
- Only set Enabled when the current Instance.State is NotDeployed (i.e. the first-deploy timed-out case), and leave existing Enabled/Disabled alone.

Add a regression test where an instance with Instance.State = Disabled and a prior InProgress deployment record is reconciled — the resulting Instance.State must remain Disabled, and the deployment record must still be marked Success.
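
The state rule that the resolution adopted (force Enabled on a normal deploy, promote only NotDeployed on reconciliation) reduces to a small pure function. The sketch below is an illustration under assumptions: the enum is a local copy with only the three members named in this finding, and PostSuccessStateSketch.Resolve is a hypothetical helper, not the body of ApplyPostSuccessSideEffectsAsync.

```csharp
// Local stand-in for the real InstanceState enum (only the members named above).
public enum InstanceState { NotDeployed, Enabled, Disabled }

public static class PostSuccessStateSketch
{
    // forceEnabledState = true  : normal deploy path, a fresh apply always ends Enabled.
    // forceEnabledState = false : reconciliation path, only NotDeployed is promoted;
    //                             an existing Enabled or Disabled is left untouched.
    public static InstanceState Resolve(InstanceState current, bool forceEnabledState)
    {
        if (forceEnabledState) return InstanceState.Enabled;
        return current == InstanceState.NotDeployed ? InstanceState.Enabled : current;
    }
}
```

With this shape, the failover scenario described above resolves as `Resolve(InstanceState.Disabled, forceEnabledState: false)`, which returns Disabled, so the operator's explicit disable survives reconciliation.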

DeploymentManager-019 — Lifecycle command timeout writes no audit entry


Severity	Medium
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458`

Resolution (2026-05-28): added TryLogLifecycleTimeoutAsync, a private helper that mirrors the DeployFailed pattern — it calls _auditService.LogAsync with CancellationToken.None (so the operator's already-cancelled outer token cannot also prevent the audit write) and stamps the row with the <Action>TimedOut action name (DisableTimedOut / EnableTimedOut / DeleteTimedOut), the command id, the configured deadline, and the captured exception message. Each of DisableInstanceAsync / EnableInstanceAsync / DeleteInstanceAsync invokes the helper from its catch (TimeoutException or OperationCanceledException) block before returning the failure Result. The helper itself try/catches around the audit write so a failed audit pipeline does not mask the underlying timeout for the caller — it only logs at Warning. Regression tests DisableInstanceAsync_LifecycleTimeout_WritesDisableTimedOutAuditEntry, EnableInstanceAsync_LifecycleTimeout_WritesEnableTimedOutAuditEntry, and DeleteInstanceAsync_LifecycleTimeout_WritesDeleteTimedOutAuditEntry use the existing SilentProbeActor to keep the site unresponsive, configure a 300 ms LifecycleCommandTimeout to bound the wait, and assert the audit log received the corresponding <Action>TimedOut entry exactly once.

Description

DisableInstanceAsync, EnableInstanceAsync, and DeleteInstanceAsync each wrap the CommunicationService call in a linked CTS with LifecycleCommandTimeout (DeploymentManager-012). On timeout they log a warning and return Result<...>.Failure(...) — and skip the _auditService.LogAsync call entirely. As a result, an operator-initiated disable/enable/delete that times out at the site leaves no audit trail: the user, the timestamp, the command id, and the failure mode are not recorded in the audit log. The deploy path goes out of its way to write a DeployFailed audit entry on the same failure mode (DeploymentService.cs:274-276), with CancellationToken.None so the write is durable; the lifecycle commands do not.

The design lists audit logging as a Deployment Manager responsibility for "all deployment actions, system-wide artifact deployments, and instance lifecycle changes" — a timed-out lifecycle command is an attempted lifecycle change, and the operator action is exactly the kind of event the audit log exists to record.

Recommendation

In each of the three catch (Exception ex) when (ex is TimeoutException or OperationCanceledException) blocks, write a DisableTimeout/EnableTimeout/DeleteTimeout (or use the existing operation name with a failure flag) audit entry with CancellationToken.None so a cancelled outer token does not prevent the audit write, mirroring DeployFailed. Add a unit test asserting that the unresponsive-site scenario exercised by DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait also produces an audit entry.
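
A best-effort audit write of the kind described in this finding could be sketched as below. The names are hypothetical: AuditLogSketch is a simplified delegate standing in for IAuditService.LogAsync (whose real signature may differ), and TryLogTimeoutAsync only approximates what the resolution says TryLogLifecycleTimeoutAsync does.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-in for the audit service call; not the real signature.
public delegate Task AuditLogSketch(string user, string action, string detail, CancellationToken ct);

public static class LifecycleAuditSketch
{
    // Writes an "<Action>TimedOut" audit row for a timed-out lifecycle command.
    public static async Task TryLogTimeoutAsync(
        AuditLogSketch audit, string user, string action, Exception cause, Action<string> warn)
    {
        try
        {
            // CancellationToken.None: the outer token may already be cancelled,
            // and that must not prevent the audit row from being written.
            await audit(user, action + "TimedOut", cause.Message, CancellationToken.None);
        }
        catch (Exception auditEx)
        {
            // A failing audit pipeline must not mask the original timeout; only warn.
            warn($"Audit write for {action}TimedOut failed: {auditEx.Message}");
        }
    }
}
```

The caller would invoke this from its timeout catch block and then return its failure result as before, so the audit attempt never changes what the operator sees.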

DeploymentManager-020 — `DeployReconciled` audit attributes the action to the prior deployer, not the current user


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:698-712`

Description

In TryReconcileWithSiteAsync the audit call is:

await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)

prior.DeployedBy is the user who issued the original (timed-out / stuck) deployment, not the user parameter passed into DeployInstanceAsync. The current user — the one who triggered the redeploy that produced the reconciliation — is dropped on the floor. For audit forensics this is misleading: the row will read "user A reconciled their own deployment" when in fact user B initiated the action that reconciled it.

The original deployer is interesting context, but it should be carried in the audit-detail object (where DeploymentId and RevisionHash already live), not substituted for the actor.

Recommendation

Use user (the parameter on DeployInstanceAsync, threaded through TryReconcileWithSiteAsync) as the audit actor, and include OriginalDeployer = prior.DeployedBy in the detail object so the original attribution is preserved without misrepresenting who took the action.

Resolution (2026-05-28): Threaded the user parameter from DeployInstanceAsync into TryReconcileWithSiteAsync as a new currentUser argument (consistent with the DeploymentManager-018 forceEnabledState parameter-threading pattern) and rewrote the audit call to log currentUser as the actor with OriginalDeployer = prior.DeployedBy carried in the detail object. Added test DeployInstanceAsync_Reconciled_AuditAttributesCurrentUserNotPriorDeployer that pins the new attribution and asserts the prior deployer is no longer used as the actor. Tests green (80/80 in DeploymentManager.Tests).

DeploymentManager-021 — `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing


Severity	Low
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:107-111`

Resolution (2026-05-28): ResolveSiteIdentifierAsync now throws InvalidOperationException ("Site with ID {siteId} not found; cannot resolve its SiteIdentifier for routing.") when the Site row is missing, instead of returning the numeric id rendered as a string. The deploy path's existing try/catch turns the throw into a DeploymentStatus.Failed record carrying the descriptive message (the DeploymentManager-001/-002 cleanup writes the failure with CancellationToken.None); the lifecycle paths (Disable/Enable/Delete) propagate the exception so the CLI/UI caller surfaces the actual cause to the operator rather than seeing a confusing downstream "unknown site" routing error. The repository contract already returned Site?, so the null path is now type-visible at the call site instead of silently papered over.

Description

private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
{
    var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
    return site?.SiteIdentifier ?? siteId.ToString();
}

If the Site row is missing (FK was deleted, race with admin delete, DB inconsistency), the method silently returns the numeric DB id rendered as a string. This is then passed to CommunicationService.{Deploy,Disable,Enable,Delete}InstanceAsync and QueryDeploymentStateAsync as if it were a real SiteIdentifier (e.g. "site-a"). The communication layer will fail with an "unknown site" or routing error, producing a confusing diagnostic that hides the actual problem (no site row).

This is a defensive concern, but every mutating operation in the module goes through this method, so a stale instance whose site was deleted will produce a misleading error every time it is touched.

Recommendation

Treat a missing site as a hard validation failure: return a Result.Failure($"Site with ID {siteId} not found") early from the calling operations, instead of fabricating an identifier. The repository already returns Site?, so the null path is type-visible; just don't paper over it.
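
For comparison with the snippet in the description, a fail-loud variant might look like the sketch below. It follows the throwing approach described in the resolution rather than the Result.Failure variant recommended here. SiteSketch and the getSiteById delegate are hypothetical stand-ins for the real Site entity and _siteRepository.GetSiteByIdAsync; the exception message is the one quoted in the resolution.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal stand-in for the real Site entity; only the member used here.
public sealed record SiteSketch(string SiteIdentifier);

public static class SiteResolutionSketch
{
    public static async Task<string> ResolveSiteIdentifierAsync(
        Func<int, CancellationToken, Task<SiteSketch?>> getSiteById,
        int siteId,
        CancellationToken cancellationToken = default)
    {
        var site = await getSiteById(siteId, cancellationToken);
        // Fail loudly rather than fabricating an identifier from the numeric DB id.
        return site?.SiteIdentifier
            ?? throw new InvalidOperationException(
                $"Site with ID {siteId} not found; cannot resolve its SiteIdentifier for routing.");
    }
}
```

The only behavioural change from the original is the right-hand side of the null-coalescing operator: a throw expression replaces siteId.ToString(), so a missing row surfaces at the point where it is detected.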

DeploymentManager-022 — `Pending` and `InProgress` are written back-to-back with no intervening work


Severity	Low
Category	Code organization & conventions
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/DeploymentService.cs:178-194`

Resolution (2026-05-28): The transient Pending write was dropped — the deployment record is now created directly in DeploymentStatus.InProgress, which collapses the start of the deploy into a single AddDeploymentRecordAsync + SaveChangesAsync + NotifyStatusChange (instead of two writes back-to-back). The flattening, validation, and TryReconcileWithSiteAsync round-trip have all completed before the insert, and the deploy command is sent immediately after, so Pending carried no operational meaning between the two writes. InProgress retains its documented "sent to site, awaiting response" semantics. Eliminating the extra SaveChangesAsync round-trip also removes the Pending→InProgress flicker the CentralUI-006 deployment-status page used to render via the second IDeploymentStatusNotifier.NotifyStatusChanged invocation.

Description

DeployInstanceAsync does:

record.Status = Pending;
AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
record.Status = InProgress;
UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);

There is no work between the two writes — flattening, validation, and reconciliation have already completed by line 174. The deploy command is sent immediately after the InProgress write. The Pending write therefore costs: an extra SaveChangesAsync round-trip, an extra IDeploymentStatusNotifier invocation (which the CentralUI-006 page renders, so the user briefly sees a Pending flicker before InProgress), and an extra row-version bump if EF optimistic concurrency is enabled on the table.

The design uses Pending to mean "queued, not yet sent" and InProgress to mean "sent to site, awaiting response". The code's Pending slot has no queuing — it is set and immediately overwritten — so the state buys nothing operationally.

Recommendation

Either:

- Drop the Pending write entirely and create the record directly in InProgress (one row insert, one notification, simpler UI); or
- Move the Pending→InProgress transition to bracket actual queueing/work (e.g. set Pending before flattening + reconciliation, set InProgress immediately before DeployInstanceAsync on the comm service) so the two states carry distinguishable semantics worth a separate write.

DeploymentManager-023 — `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site


Severity	Low
Category	Performance & resource management
Status	Resolved
Location	`src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173`

Resolution (2026-05-28): Hoisted the global artifact queries (shared scripts, external systems + methods, DB connections, notification lists, SMTP configurations) out of the per-site loop into a new private FetchGlobalArtifactsAsync that produces a GlobalArtifactSnapshot record. DeployToAllSitesAsync now calls it ONCE before the loop and threads the snapshot through a new prefetched-globals overload of BuildDeployArtifactsCommandAsync; the public single-site overload keeps the prior fetch-then-build behaviour for RetryForSiteAsync. Only the per-site data-connection query remains inside the loop. Regression tests DeployToAllSitesAsync_HoistsGlobalArtifactQueriesOutOfPerSiteLoop (three sites; pins exactly-one call to each global getter and one per-site call to GetDataConnectionsBySiteIdAsync) and RetryForSiteAsync_SingleSitePath_StillRunsTheGlobalQueriesOnce (single-site path still owns its own fetch).

Description

DeployToAllSitesAsync loops over sites and calls BuildDeployArtifactsCommandAsync(site.Id, ...) for each one. Of the six artifact sets the method gathers, only dataConnections is per-site:

- _templateRepo.GetAllSharedScriptsAsync: global.
- _externalSystemRepo.GetAllExternalSystemsAsync: global, plus GetMethodsByExternalSystemIdAsync per external system per site.
- _externalSystemRepo.GetAllDatabaseConnectionsAsync: global.
- _notificationRepo.GetAllNotificationListsAsync: global.
- _notificationRepo.GetAllSmtpConfigurationsAsync: global.
- _siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...): per-site.

With N sites this issues ≈ 5·N redundant queries on the global sets (plus M·N method queries, where M is the external-system count). On a hub-and-spoke deployment with many sites the artifact-deploy path is noticeably slower than necessary and pins DbContext usage longer than needed. Per CLAUDE.md, the DbContext is not thread-safe and the per-site commands are already built sequentially (good); the redundant queries are sequential too, but the network/round-trip cost is real.

Recommendation

Hoist the global queries (shared scripts, external systems + their methods, DB connections, notification lists, SMTP configurations) out of BuildDeployArtifactsCommandAsync, fetch them once in DeployToAllSitesAsync, and pass them in alongside the site id (or expose a BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals) overload). RetryForSiteAsync (the single-site path) can keep the convenience-overload behaviour. Add a test using NSubstitute's .Received() to assert _templateRepo.GetAllSharedScriptsAsync is called exactly once for an N-site deployment.
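
The hoisting itself is a small restructuring. The sketch below shows the shape under assumptions: GlobalsSketch, SiteCommandSketch and ArtifactHoistSketch are hypothetical, the snapshot is trimmed to a single global set for brevity, and the two delegates stand in for the repository calls listed in the description.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Trimmed-down, hypothetical shapes; the real snapshot carries five global artifact sets.
public sealed record GlobalsSketch(IReadOnlyList<string> SharedScripts);
public sealed record SiteCommandSketch(
    int SiteId, IReadOnlyList<string> SharedScripts, IReadOnlyList<string> DataConnections);

public static class ArtifactHoistSketch
{
    public static async Task<List<SiteCommandSketch>> BuildAllAsync(
        IReadOnlyList<int> siteIds,
        Func<Task<GlobalsSketch>> fetchGlobals,                      // same result for every site
        Func<int, Task<IReadOnlyList<string>>> fetchDataConnections) // genuinely per-site
    {
        // Fetch the site-independent data once, before the loop.
        var globals = await fetchGlobals();

        var commands = new List<SiteCommandSketch>();
        foreach (var siteId in siteIds)
        {
            // Only the per-site query stays inside the loop. The loop remains
            // sequential because a DbContext must not be used concurrently.
            var connections = await fetchDataConnections(siteId);
            commands.Add(new SiteCommandSketch(siteId, globals.SharedScripts, connections));
        }
        return commands;
    }
}
```

Counting invocations of the two delegates for a three-site call gives one global fetch and three per-site fetches, which mirrors what the recommended Received()-style assertion would check against the real repositories.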

DeploymentManager-024 — Test probe actors hold mutable static state across tests


Severity	Low
Category	Testing coverage
Status	Resolved
Location	`tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075`, `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217`

Resolution (2026-05-28): Replaced the static counters with per-test instance state. Introduced ReconcileProbeCounters and SerializationProbeCounters (in DeploymentServiceTests) and ArtifactProbeRecorder (in ArtifactDeploymentServiceTests); each probe actor now takes the counter object as its first constructor argument. Every test instantiates a fresh counter local, passes it via Props.Create(() => new ReconcileProbeActor(counters, ...)), and reads the counts directly off counters. No shared static fields remain. ReconcileProbeActor's counter increments now use Interlocked.Increment so that cross-thread updates are atomic, and SerializationProbeActor retains its lock on a per-test Gate. All 85 ScadaLink.DeploymentManager.Tests continue to pass after the refactor.

Description

ReconcileProbeActor.QueryCount / DeployCount, SerializationProbeActor.MaxConcurrent / _current, and ArtifactProbeActor.Received are all static fields. Each test's actor constructor resets them, but reset-on-construction only works as long as no two tests in the same class run concurrently. xUnit runs the tests within a single class sequentially by default, so today's tests pass. Any change that lets tests of the same class overlap (a different runner, or a framework version or setting that parallelizes within a class) makes the counters race: a deploy in test A could increment DeployCount while test B is asserting on it.

Static state shared across tests is also why a flaky-test investigation here will be unusually painful: the offending interaction is invisible from any single test file.
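The hazard can be reproduced deterministically by replaying one problematic interleaving on a single thread. The sketch below is illustrative only: `StaticCounterProbe` is a hypothetical plain class standing in for a probe actor (no Akka dependency), and the interleaving is hand-ordered rather than produced by a real parallel test run.

```csharp
using System;

// Hypothetical stand-in for a probe actor whose counter is a static field.
public sealed class StaticCounterProbe
{
    public static int DeployCount;

    // Reset-on-construction: every new probe wipes the shared counter.
    public StaticCounterProbe() => DeployCount = 0;

    public void HandleDeploy() => DeployCount++;
}

public static class InterleavingDemo
{
    // Replays, in order, steps that two overlapping tests could interleave.
    public static int CountObservedByTestA()
    {
        var probeA = new StaticCounterProbe(); // test A arranges
        probeA.HandleDeploy();                 // test A acts: the count is now 1

        _ = new StaticCounterProbe();          // test B starts and resets the shared count

        return StaticCounterProbe.DeployCount; // test A asserts: it expected 1
    }
}
```

Here `CountObservedByTestA()` returns 0 although test A performed one deploy, and nothing in test A's own source explains why.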

Recommendation

Replace the static counters with instance state, hand the actor a probe recipient (an IActorRef to a TestKit probe), and assert via ExpectMsg in each test. Where the simpler counter shape is preferred, pass a shared-state object into the actor's constructor so each test owns its own instance — never reach for static mutable test state.
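The simpler counter shape from the recommendation might look like the sketch below. It is an assumption-laden illustration: `ProbeCounters`, `CountingProbe`, and `IsolationDemo` are hypothetical names, and a plain class stands in for the Akka actor so the block needs no external packages. In the real tests the counter object would be passed through `Props.Create`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Per-test counter holder: each test creates its own instance.
public sealed class ProbeCounters
{
    private int _deployCount;

    public int DeployCount => Volatile.Read(ref _deployCount);

    // Atomic increment, safe when messages are handled on different threads.
    public void RecordDeploy() => Interlocked.Increment(ref _deployCount);
}

// Plain class standing in for the probe actor; it receives the counters it should use.
public sealed class CountingProbe
{
    private readonly ProbeCounters _counters;

    public CountingProbe(ProbeCounters counters) => _counters = counters;

    public void HandleDeploy() => _counters.RecordDeploy();
}

public static class IsolationDemo
{
    // Simulates one test: own counters, own probe, returns the count it observed.
    public static int RunTest(int deploys)
    {
        var counters = new ProbeCounters();
        var probe = new CountingProbe(counters);
        Parallel.For(0, deploys, _ => probe.HandleDeploy());
        return counters.DeployCount;
    }

    // Runs two simulated tests at the same time; neither can see the other's counts.
    public static (int A, int B) RunTwoTestsConcurrently()
    {
        var a = Task.Run(() => RunTest(1000));
        var b = Task.Run(() => RunTest(250));
        Task.WaitAll(a, b);
        return (a.Result, b.Result);
    }
}
```

Because each simulated test owns its `ProbeCounters` instance, `RunTwoTestsConcurrently()` yields 1000 and 250 regardless of how the two runs interleave.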


Code Review — DeploymentManager

Summary

Re-review 2026-05-17 (commit 39d737e)

Re-review 2026-05-28 (commit 1eb6e97)

Checklist coverage

Re-review 2026-05-28 (commit 1eb6e97)

Findings

DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in InProgress

DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token

DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write

DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config

DeploymentManager-005 — OperationLockManager leaks a SemaphoreSlim per instance name

DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented

DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail

DeploymentManager-008 — DeploymentManagerOptions is never bound to configuration

DeploymentManager-009 — Misleading timeout comment on DeleteInstanceAsync

DeploymentManager-010 — SystemArtifactDeploymentRecord does not persist the deployment ID

DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path

DeploymentManager-012 — LifecycleCommandTimeout option is dead code

DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites

DeploymentManager-014 — Dead CreateCommand helper in artifact tests

DeploymentManager-015 — Site-query reconciliation marks a deployment Success but skips instance-state and snapshot updates

DeploymentManager-016 — Reconciled prior record keeps its stale RevisionHash

DeploymentManager-017 — GetDeploymentStatusAsync XML doc describes behaviour it does not implement

DeploymentManager-018 — Reconciliation force-sets Enabled, overwriting an intentional Disabled after central failover

DeploymentManager-019 — Lifecycle command timeout writes no audit entry

DeploymentManager-020 — DeployReconciled audit attributes the action to the prior deployer, not the current user

DeploymentManager-021 — ResolveSiteIdentifierAsync silently substitutes the DB id when the site row is missing

DeploymentManager-022 — Pending and InProgress are written back-to-back with no intervening work

DeploymentManager-023 — BuildDeployArtifactsCommandAsync re-queries system-wide artifacts once per site

DeploymentManager-024 — Test probe actors hold mutable static state across tests
