Joseph Doherty d39089f4ed docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).

2026-06-20 18:02:32 -04:00


Code Review — DeploymentManager

Field	Value
Module	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager`
Design doc	`docs/requirements/Component-DeploymentManager.md`
Status	Reviewed
Last reviewed	2026-06-20
Reviewer	claude-agent
Commit reviewed	`4307c381`
Open findings	0

Summary

The DeploymentManager module is small, well-structured, and clearly maps work packages (WP-N) onto code. The happy paths for instance deployment, lifecycle commands, artifact broadcast, and staleness comparison are implemented sensibly, and the operation lock correctly serializes mutating operations per instance while allowing cross-instance parallelism. However, the review found a significant cluster of error-handling and resilience gaps: the deployment record can be left permanently stuck in InProgress when an exception other than timeout/cancellation is thrown, the catch block writes its failure status using a cancellation token that may already be cancelled, and the OperationLockManager leaks one SemaphoreSlim per instance name forever. There are also two notable design-document adherence gaps: the "query-the-site-before-redeploy" idempotency requirement is not implemented (GetDeploymentStatusAsync only reads the local DB), and the "Diff View" feature is reduced to a bare hash comparison with no added/removed/changed detail. Configuration is not bound to appsettings.json, leaving one option entirely dead. Test coverage stops at the communication boundary and never exercises a successful deployment or the lifecycle success paths.

Re-review 2026-05-17 (commit `39d737e`)

Re-reviewed at commit 39d737e after the batch of fixes for DeploymentManager-001..014. All fourteen prior findings remain Resolved and verified against source — the broadened catch, non-cancellable cleanup writes, ref-counted OperationLockManager, query-before-redeploy reconciliation, structured diff, options binding, and the expanded TestKit-actor test suite are all present and correct. The module is in markedly better shape than the first review: error paths are now defensively handled and test coverage is broad (successful deploy/lifecycle, lock serialization, reconciliation matrix, artifact per-site matrix).

This re-review found 3 new findings, all clustered on the DeploymentManager-006 reconciliation path added since the last review. The reconciliation shortcut (TryReconcileWithSiteAsync) marks a stale prior record Success when the site already has the target revision, but it does not perform the side effects the normal success path does — it never updates the instance State, never refreshes the DeployedConfigSnapshot, and never corrects the prior record's own RevisionHash (DeploymentManager-015, DeploymentManager-016). The GetDeploymentStatusAsync XML doc is now stale — it still describes the query-before-redeploy behaviour that actually moved into TryReconcileWithSiteAsync (DeploymentManager-017).

Re-review 2026-05-28 (commit `1eb6e97`)

Re-reviewed at commit 1eb6e97 after the DeploymentManager-015/016/017 fixes and a docs-only XML-comment pass. The three prior findings remain Resolved and verified — ApplyPostSuccessSideEffectsAsync is now invoked from both the normal success path and TryReconcileWithSiteAsync, the reconciled-success branch corrects prior.RevisionHash to the target, and GetDeploymentStatusAsync's XML doc now describes the local-DB-read it actually performs and cross-refs the reconciliation helper. The DiffService wiring, options binding, ref-counted operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor test seam are still in place. The 7 new findings here are not regressions in the DeploymentManager-015/016 fixes — they are issues uncovered by widening the lens to the lifecycle paths, reconciliation's interaction with intentional Disabled state, audit semantics, and operational concerns (per-site artifact-build cost, Pending→InProgress double-write).

The single notable correctness issue is DeploymentManager-018: the reconciliation shortcut unconditionally sets instance.State = Enabled via ApplyPostSuccessSideEffectsAsync. After a central failover that loses the in-memory operation lock, a user can legitimately Disable an instance whose prior deploy record is still InProgress; a subsequent redeploy then reconciles and silently re-enables the instance against the user's explicit intent. The remaining six findings are medium/low: lifecycle-timeout audit gap (DeploymentManager-019), audit-user attribution in reconciliation (DeploymentManager-020), silent fallback in ResolveSiteIdentifierAsync (DeploymentManager-021), back-to-back Pending→InProgress writes (DeploymentManager-022), per-site re-query of system-wide artifacts (DeploymentManager-023), and shared static state across *ProbeActor tests (DeploymentManager-024).

Re-review 2026-06-20 (commit `4307c381`) — full review

Re-reviewed the whole current module at HEAD after the rename, the cert-broadcast / Transport IStaleInstanceProbe work, and milestone changes. DeploymentManager-001..024 all remain Resolved and verified against source: the ref-counted OperationLockManager, the broadened/non-cancellable failure writes, the ApplyPostSuccessSideEffectsAsync shared helper with forceEnabledState (Disabled-preservation), the lifecycle-timeout audit helper, the structured diff with List-value normalization, the hoisted global artifact fetch, and the instance-state-aware reconciliation are all present and correct. The two cross-module/architectural seams the prompt called out (the TrustServerCert/RemoveServerCert broadcast-to-both-nodes and the DeploymentManagerActor deploy-state query handler) live in SiteRuntime / Communication / CentralUI, not this module, so they are out of scope here.

This review found 3 new findings. The material one is DeploymentManager-025: the system-wide artifact path still fetches and broadcasts notification lists and SMTP configurations (including SMTP credentials) to every site, in direct contradiction of the now-explicit design decision that these are central-only and "no SMTP credential is ever distributed to sites" (Component-DeploymentManager.md lines 142-146; CLAUDE.md notification-central-only decision). This supersedes the earlier accepted-deployable-artifact framing of the closed DeploymentManager-013. The remaining two are DeploymentManager-026 and DeploymentManager-027. DeploymentManager-026: deployment records are insert-only, so a new row accumulates per instance on every deploy, contradicting "only current status stored, no history table", and the OrderByDescending(DeployedAt) read has no tiebreaker for rows written in the same tick. DeploymentManager-027: the artifact tests assert the forbidden notif/SMTP shipping, cementing the DeploymentManager-025 violation.
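
The missing tiebreaker in DeploymentManager-026 can be illustrated with a minimal, self-contained sketch. The `DeploymentRow` record and its `Id` column are hypothetical stand-ins rather than the module's actual entity, and the sketch assumes some monotonically increasing key is available to break ties.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Two rows for the same instance written with an identical timestamp.
var tick = new DateTime(2026, 6, 20, 18, 0, 0, DateTimeKind.Utc);
var rows = new List<DeploymentRow>
{
    new(Id: 1, InstanceId: 7, DeployedAt: tick, Status: "Failed"),
    new(Id: 2, InstanceId: 7, DeployedAt: tick, Status: "Success"),
};

// With DeployedAt alone, which tied row comes first depends on source order
// (LINQ to Objects) or on the database provider. The secondary key makes the
// "current" row deterministic.
var current = rows
    .Where(r => r.InstanceId == 7)
    .OrderByDescending(r => r.DeployedAt)
    .ThenByDescending(r => r.Id)
    .First();

Console.WriteLine($"{current.Id}: {current.Status}");

// Hypothetical row shape; Id is assumed to be a monotonically increasing key.
record DeploymentRow(int Id, int InstanceId, DateTime DeployedAt, string Status);
```

Here the later row (Id 2) wins even though both rows share a timestamp. Whether the real fix should add a tiebreaker or switch to a single upserted row per instance is a design decision this sketch does not settle.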

Checklist coverage

Re-review 2026-06-20 (commit `4307c381`)

#	Category	Examined	Notes
1	Correctness & logic bugs	✓	New: deployment records are insert-only — `DeployInstanceAsync` Adds a new row per deploy; reconciliation's `GetCurrentDeploymentStatusAsync` orders by `DeployedAt` with no tiebreaker (DeploymentManager-026).
2	Akka.NET conventions	✓	Module remains a plain service layer; no actors. The deploy-state-query/cert-broadcast actors live in SiteRuntime, out of scope. No issues.
3	Concurrency & thread safety	✓	`OperationLockManager` ref-counting + gate re-verified; `DeployToAllSitesAsync` prebuilds per-site commands before the parallel phase (no shared DbContext under `Task.WhenAll`). No issues.
4	Error handling & resilience	✓	Failure-status writes use `CancellationToken.None`; lifecycle timeouts now audit; delete-orphan path surfaced. No new issues.
5	Security	✓	New: SMTP credentials are still serialized into the per-site artifact command and broadcast to every site, which the current design forbids outright (DeploymentManager-025).
6	Performance & resource management	✓	Global artifact queries hoisted (DM-023 resolved). Deployment-record row growth is unbounded per instance (part of DeploymentManager-026).
7	Design-document adherence	✓	New: notification lists + SMTP configs are still treated as deployable artifacts, contradicting the "central-only, never distributed to sites" design (DeploymentManager-025).
8	Code organization & conventions	✓	Options bound via Host; `OptionsSection` constant correct. No new issues.
9	Testing coverage	✓	Broad and current. New: artifact tests assert the forbidden notif/SMTP shipping (DeploymentManager-027).
10	Documentation & comments	✓	`ArtifactDeploymentService` class XML doc still lists notification lists + SMTP as broadcast artifacts (stale vs design — folded into DeploymentManager-025).

Re-review 2026-05-28 (commit `1eb6e97`)

#	Category	Examined	Notes
1	Correctness & logic bugs	✓	New: reconciliation forces `Enabled` even if the user disabled the instance in between (DeploymentManager-018).
2	Akka.NET conventions	✓	Module remains a plain service layer; no actors. No issues.
3	Concurrency & thread safety	✓	`OperationLockManager` ref-counting verified. Note: test probes hold static state (DeploymentManager-024) — a test concern, not production code.
4	Error handling & resilience	✓	New: Disable/Enable/Delete timeouts return early without writing any audit entry — deploy has `DeployFailed`, lifecycle has nothing (DeploymentManager-019).
5	Security	✓	No new issues. SMTP credential decision documented (DeploymentManager-013 closed).
6	Performance & resource management	✓	New: `BuildDeployArtifactsCommandAsync` re-queries every system-wide artifact set per site in `DeployToAllSitesAsync` (DeploymentManager-023).
7	Design-document adherence	✓	Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on `Disabled`-state preservation.
8	Code organization & conventions	✓	New: redundant `Pending`→`InProgress` back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in `ResolveSiteIdentifierAsync` (DeploymentManager-021).
9	Testing coverage	✓	New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024).
10	Documentation & comments	✓	New: `DeployReconciled` audit uses `prior.DeployedBy` instead of the current `user` parameter — misleading for forensics (DeploymentManager-020).

Findings

DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in `InProgress`


Severity	High
Category	Error handling & resilience
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:141-199`

Description

DeployInstanceAsync sets the record to InProgress (lines 137-139), then the try block calls into CommunicationService and the repository. The only catch filter is when (ex is TimeoutException or OperationCanceledException). Any other exception — InvalidOperationException (thrown by CommunicationService.GetCommunicationActor() when the actor is not set), a JSON serialization error, a deserialization failure of the response, a DB exception on UpdateDeploymentRecordAsync, or any transport error — escapes the method. The deployment record remains in DeploymentStatus.InProgress permanently. Because staleness and the UI both read current status, the instance is then misreported as "deploying" forever and a re-deploy may be blocked or misinterpreted. The design explicitly states an interrupted deployment must be "treated as failed".

Recommendation

Broaden the catch to a general catch (Exception ex) that records DeploymentStatus.Failed with the error message, audit-logs the failure, and re-throws or returns a failed Result. Keep the timeout-specific branch only if a distinct message is desired. Ensure the failure-status write happens for every exit path out of the try.

Resolution

Resolved 2026-05-16 (commit <pending>): broadened the catch in DeployInstanceAsync to catch (Exception ex) so any exception (transport, serialization, DB, InvalidOperationException from an uninitialized CommunicationService) marks the deployment record Failed with the error message and audit-logs the failure, instead of escaping and leaving the record stuck in InProgress. Regression test: DeployInstanceAsync_CommunicationThrowsUnexpectedException_RecordMarkedFailed.
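
As an illustration of the broadened catch, the following is a minimal sketch and not the module's code. `DeploymentRecord`, `DeployStatus`, and `Deployer` are hypothetical stand-ins, and `sendToSite` represents the call into the communication layer.

```csharp
using System;
using System.Threading.Tasks;

// Simulate a send that fails with a non-timeout exception.
var rec = new DeploymentRecord();
var ok = await Deployer.DeployAsync(rec,
    () => throw new InvalidOperationException("communication actor not set"));

Console.WriteLine($"{rec.Status}: {rec.Error} (ok={ok})");

enum DeployStatus { Pending, InProgress, Success, Failed }

class DeploymentRecord
{
    public DeployStatus Status = DeployStatus.Pending;
    public string Error = "";
}

static class Deployer
{
    // Hypothetical outline of the deploy flow.
    public static async Task<bool> DeployAsync(DeploymentRecord rec, Func<Task> sendToSite)
    {
        rec.Status = DeployStatus.InProgress;
        try
        {
            await sendToSite();
            rec.Status = DeployStatus.Success;
            return true;
        }
        // Previously filtered: when (ex is TimeoutException or OperationCanceledException).
        // Catching Exception guarantees every exit path leaves a terminal status.
        catch (Exception ex)
        {
            rec.Status = DeployStatus.Failed;
            rec.Error = ex.Message;
            return false;
        }
    }
}
```

With the original filter, the `InvalidOperationException` above would escape and leave the record in `InProgress`.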

DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token


Severity	High
Category	Error handling & resilience
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:186-196`

Description

The catch (Exception ex) when (ex is TimeoutException or OperationCanceledException) block updates the record to Failed and calls UpdateDeploymentRecordAsync/SaveChangesAsync/LogAsync passing the same cancellationToken that was just cancelled (an OperationCanceledException caught here means the token is already in the cancelled state). Those repository and audit calls will themselves throw OperationCanceledException before the failure status is persisted, so the record stays InProgress — the exact bug DeploymentManager-001 describes, reached via the supposedly-handled path.

Recommendation

Perform the cleanup writes with a fresh, non-cancellable token (e.g. CancellationToken.None, optionally with an independent short timeout) so the failure status is durably recorded even when the original operation was cancelled or timed out.

Resolution

Resolved 2026-05-16 (commit <pending>): the broadened catch block now performs the failure-status write (UpdateDeploymentRecordAsync, SaveChangesAsync) and the audit LogAsync with CancellationToken.None instead of the operation's (possibly-cancelled) token, so the Failed status is durably recorded even after a timeout/cancellation. The cleanup writes are themselves wrapped in a try/catch that logs (without masking the original error) if persistence still fails. Regression test: DeployInstanceAsync_FailureWrite_UsesNonCancellableToken.
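
The effect of the token choice can be reproduced in isolation. This is a sketch under stated assumptions: `FakeStore` is a hypothetical stand-in for the repository, and it honours the token the way an asynchronous store typically would.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

using var cts = new CancellationTokenSource();
cts.Cancel(); // the operation's token is already cancelled when the catch block runs

var store = new FakeStore();

// Reusing the cancelled token: the cleanup write itself is abandoned.
try
{
    await store.SaveStatusAsync("Failed", cts.Token);
}
catch (OperationCanceledException)
{
    Console.WriteLine($"write with operation token cancelled, status still {store.Status}");
}

// Using CancellationToken.None: the failure status is persisted.
await store.SaveStatusAsync("Failed", CancellationToken.None);
Console.WriteLine($"persisted status: {store.Status}");

// Hypothetical repository stand-in.
class FakeStore
{
    public string Status { get; private set; } = "InProgress";

    public async Task SaveStatusAsync(string status, CancellationToken ct)
    {
        await Task.Delay(10, ct); // throws TaskCanceledException when ct is cancelled
        Status = status;
    }
}
```

If an unbounded cleanup write is a concern, a fresh `CancellationTokenSource` with its own short timeout is one alternative to `CancellationToken.None`, as the recommendation above suggests.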

DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write


Severity	Medium
Category	Error handling & resilience
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:155-170`

Description

After a successful site response the code calls UpdateDeploymentRecordAsync (no SaveChanges yet), then UpdateInstanceAsync, then StoreDeployedSnapshotAsync (which itself issues Add/Update calls), then a single SaveChangesAsync at line 170. If StoreDeployedSnapshotAsync throws, the exception is not caught (see DeploymentManager-001) and the SaveChangesAsync never runs — the instance state, deployment status, and snapshot are all left unpersisted even though the site has actually applied the deployment. Central and site are now divergent: the site is running the new config but central still shows the old state and a non-Success deployment record.

Verification: Confirmed against source. The DeploymentManager-001 fix made this strictly worse, not better — after that fix a snapshot-store failure is caught and the record is flipped from Success back to Failed, so central reports a failed deployment while the site is running the new config.

Recommendation

Wrap the post-success persistence so that, at minimum, the deployment record's Success status is committed. Consider committing the status first, then the instance state and snapshot, so a later failure does not lose the fact that the site succeeded. Log loudly if the snapshot write fails after a confirmed site apply.

Resolution

Resolved 2026-05-16 (commit pending): DeployInstanceAsync now commits the deployment record's terminal status (UpdateDeploymentRecordAsync + SaveChangesAsync) immediately after the site confirms the apply, before touching instance state or the deployed-config snapshot. The post-success instance-state update and StoreDeployedSnapshotAsync are wrapped in a best-effort try/catch that logs loudly for operator reconciliation but no longer flips the already-committed Success record back to Failed. Regression test: DeployInstanceAsync_SiteSucceeds_SnapshotWriteFails_RecordStillCommittedSuccess.
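
The ordering described above (commit the terminal status first, then treat the remaining writes as best effort) can be sketched as follows. `PostSuccess` and the log strings are hypothetical; each string stands in for a persistence call.

```csharp
using System;
using System.Collections.Generic;

var log = PostSuccess.Run(
    storeSnapshot: () => throw new InvalidOperationException("db unavailable"));
Console.WriteLine(string.Join(" | ", log));

static class PostSuccess
{
    // Hypothetical outline of the post-success ordering.
    public static List<string> Run(Action storeSnapshot)
    {
        var log = new List<string>();

        // 1. Commit the terminal status as soon as the site confirms the apply.
        log.Add("commit: record=Success");

        // 2. Best-effort side effects. A failure here is reported for operator
        //    reconciliation but never flips the committed Success back to Failed.
        try
        {
            storeSnapshot();
            log.Add("commit: instance state + snapshot");
        }
        catch (Exception ex)
        {
            log.Add($"warn: snapshot write failed after confirmed apply: {ex.Message}");
        }

        return log;
    }
}
```

The trade-off is that central can briefly hold a Success record without a matching snapshot, which is why the failure is logged loudly rather than swallowed.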

DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config


Severity	Medium
Category	Error handling & resilience
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:312-319`

Description

In DeleteInstanceAsync, when the site responds Success the code calls _repository.DeleteInstanceAsync then SaveChangesAsync. If SaveChangesAsync throws (DB error, concurrency), the exception propagates uncaught: the site has already destroyed the Instance Actor and removed its config, but the central instance record still exists. The instance is now un-deletable through the normal path (the site no longer has it, so a re-issued delete may fail) and is permanently orphaned. The design states central must not mark the instance deleted until the site confirms — but it does not address the inverse failure.

Verification: Confirmed against source. DeleteInstanceAsync has no try/catch around the post-success block, so any exception from DeleteInstanceAsync/SaveChangesAsync escapes uncaught to the caller.

Recommendation

Catch persistence failures in the post-success block and surface a distinct error indicating the site succeeded but the central record could not be removed, so an operator/retry can reconcile. Consider making the central delete idempotent and retryable independently of the site command.

Resolution

Resolved 2026-05-16 (commit pending): the post-success removal in DeleteInstanceAsync (DeleteInstanceAsync + SaveChangesAsync) is now wrapped in a try/catch. A persistence failure no longer escapes uncaught — it is logged, recorded with a DeleteOrphaned audit entry, and surfaced as a distinct Result failure stating the site deleted the instance but the central record is orphaned and must be reconciled. Regression test: DeleteInstanceAsync_SiteSucceeds_CentralDeleteFails_ReturnsDistinctFailure.

DeploymentManager-005 — `OperationLockManager` leaks a `SemaphoreSlim` per instance name


Severity	Medium
Category	Performance & resource management
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/OperationLockManager.cs:15-33`

Description

AcquireAsync does _locks.GetOrAdd(instanceUniqueName, _ => new SemaphoreSlim(1, 1)) and entries are never removed. Every distinct instance unique name that is ever deployed/disabled/enabled/deleted permanently adds a SemaphoreSlim (an IDisposable holding a kernel wait handle) to the dictionary. Over the lifetime of a long-running central process — especially with the bulk "deploy all out-of-date instances" workflow and instances that are created and deleted over time — this is an unbounded leak of both managed memory and OS handles. Deleted instances' semaphores are never reclaimed.

Verification: Confirmed against source. _locks is a ConcurrentDictionary with no removal path anywhere in the type.

Recommendation

Either accept the leak explicitly and document the expected bounded cardinality of instance names, or implement reclamation: e.g. ref-count handles, then remove and Dispose() the semaphore when the count reaches zero and the lock is free. At minimum, remove the semaphore entry when an instance is deleted (DeleteInstanceAsync).

Resolution

Resolved 2026-05-16 (commit pending): OperationLockManager now ref-counts each lock entry. A reference is reserved (creating the entry if needed) before the SemaphoreSlim.WaitAsync, so concurrent waiters for the same instance share one semaphore and the entry survives until every waiter/holder has released. When the reference count reaches zero — on release, timeout, or cancellation — the entry is removed from the dictionary and the semaphore is Dispose()d, so the process no longer accumulates one kernel wait handle per distinct instance name. A TrackedLockCount diagnostic property was added to make reclamation testable. Regression tests: AcquireAsync_ReleasedLock_RemovesSemaphoreEntry, AcquireAsync_ManyDistinctInstances_DoesNotAccumulateSemaphores, AcquireAsync_ContendedLock_KeepsSemaphoreUntilLastReleaseThenReclaims.
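
A minimal sketch of the ref-counting technique described above. It is not the module's `OperationLockManager`: the `RefCountedLocks` class, its internal layout, and the omission of a timeout parameter are assumptions made for illustration. Only the `TrackedLockCount` name is borrowed from the resolution text.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

var locks = new RefCountedLocks();
using (await locks.AcquireAsync("instance-a"))
{
    Console.WriteLine($"while held: {locks.TrackedLockCount}");
}
Console.WriteLine($"after release: {locks.TrackedLockCount}");

sealed class RefCountedLocks
{
    private sealed class Entry
    {
        public readonly SemaphoreSlim Semaphore = new(1, 1);
        public int RefCount;
    }

    private readonly object _gate = new();
    private readonly Dictionary<string, Entry> _entries = new();

    public int TrackedLockCount { get { lock (_gate) return _entries.Count; } }

    public async Task<IDisposable> AcquireAsync(string key, CancellationToken ct = default)
    {
        Entry entry;
        lock (_gate)
        {
            // Reserve a reference before waiting so the entry cannot be
            // reclaimed while this caller is still queued on the semaphore.
            if (!_entries.TryGetValue(key, out entry))
            {
                entry = new Entry();
                _entries[key] = entry;
            }
            entry.RefCount++;
        }

        try
        {
            await entry.Semaphore.WaitAsync(ct);
        }
        catch
        {
            Unreference(key, entry, held: false); // cancelled before acquiring
            throw;
        }
        return new Releaser(() => Unreference(key, entry, held: true));
    }

    private void Unreference(string key, Entry entry, bool held)
    {
        if (held) entry.Semaphore.Release();
        lock (_gate)
        {
            // The last holder or waiter removes the entry and frees the handle.
            if (--entry.RefCount == 0)
            {
                _entries.Remove(key);
                entry.Semaphore.Dispose();
            }
        }
    }

    private sealed class Releaser : IDisposable
    {
        private readonly Action _release;
        private int _disposed;
        public Releaser(Action release) => _release = release;

        // Guard against a double Dispose releasing the semaphore twice.
        public void Dispose()
        {
            if (Interlocked.Exchange(ref _disposed, 1) == 0) _release();
        }
    }
}
```

Because the reference is taken under the same gate that performs removal, a new caller either finds the live entry or creates a fresh one; it can never wait on a semaphore that has already been disposed.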

DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented


Severity	High
Category	Design-document adherence
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:84-200,363-368`

Description

The design ("Deployment Identity & Idempotency") requires: "After a central failover or timeout, the Deployment Manager queries the site for current deployment state before allowing a re-deploy. This prevents duplicate application and out-of-order config changes." The code never does this. GetDeploymentStatusAsync only reads the local DeploymentRecord from the DB (GetDeploymentByDeploymentIdAsync) — it does not contact the site. DeployInstanceAsync unconditionally generates a new deployment ID and sends a new DeployInstanceCommand regardless of any prior in-flight or timed-out deployment. After a timeout where the site actually applied the config, a re-deploy produces a second deployment with no reconciliation against the site's current revision hash. Site-side stale-rejection is the only safety net, and that is not verified here.

Recommendation

Add a site query (a new CommunicationService pattern returning the site's currently-applied deployment ID / revision hash) and call it before re-deploy when a prior record for the instance is in InProgress/Failed due to timeout. Reconcile: if the site already has the target revision, mark the prior record Success instead of re-sending. Either implement this or update the design doc to reflect that reconciliation is delegated entirely to site-side stale-rejection.

Resolution

Resolved 2026-05-16 (commit <pending>): implemented the cross-module query-the-site-before-redeploy idempotency feature across Commons, SiteRuntime, Communication, and DeploymentManager — new DeploymentStateQueryRequest / DeploymentStateQueryResponse contracts, a DeploymentManagerActor handler answering from the site's deployed-config store, a CommunicationService.QueryDeploymentStateAsync method routed over the ClusterClient command/control transport, and reconciliation in DeployInstanceAsync (TryReconcileWithSiteAsync) that queries the site only when a prior record is InProgress or Failed due to a timeout, marks the prior record Success without re-sending if the site already has the target revision hash, and falls through to a normal deploy (relying on site-side stale-rejection) when the query fails. Regression tests: RoundTrip_DeploymentStateQueryRequest_Succeeds, RoundTrip_DeploymentStateQueryResponse_Deployed_Succeeds, RoundTrip_DeploymentStateQueryResponse_NotDeployed_NullApplied, DeploymentStateQuery_DeployedInstance_ReturnsAppliedIdentity, DeploymentStateQuery_UnknownInstance_ReturnsNotDeployed, DeploymentStateQuery_ForwardedToDeploymentManager, QueryDeploymentStateAsync_BeforeInitialization_Throws, QueryDeploymentStateAsync_SendsEnvelopeAndReturnsResponse, DeployInstanceAsync_PriorInProgressRecord_SiteHasTargetHash_MarksSuccessWithoutRedeploy, DeployInstanceAsync_PriorInProgressRecord_SiteHasDifferentHash_ProceedsWithDeploy, DeployInstanceAsync_PriorFailedTimeoutRecord_QueriesSite, DeployInstanceAsync_PriorSuccessRecord_SkipsSiteQuery, DeployInstanceAsync_FreshFirstTimeDeploy_SkipsSiteQuery, DeployInstanceAsync_PriorInProgressRecord_QueryFails_FallsThroughToDeploy.
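
The reconciliation rule reads more easily as a small decision function. The sketch below is an interpretation of the text above, not the body of TryReconcileWithSiteAsync. All type and parameter names are hypothetical, and it assumes an InProgress prior record always triggers the query while a Failed one does so only when the failure was a timeout.

```csharp
using System;

int queries = 0;
string QuerySite() { queries++; return "rev-2"; } // the site reports its applied revision hash

Console.WriteLine(Reconciler.Decide(PriorStatus.InProgress, false, "rev-2", QuerySite));
Console.WriteLine(Reconciler.Decide(PriorStatus.InProgress, false, "rev-3", QuerySite));
Console.WriteLine(Reconciler.Decide(PriorStatus.Success, false, "rev-2", QuerySite));
Console.WriteLine($"site queries: {queries}");

enum PriorStatus { None, Success, InProgress, Failed }
enum Decision { Deploy, MarkPriorSuccess }

static class Reconciler
{
    public static Decision Decide(
        PriorStatus prior, bool priorTimedOut, string targetHash, Func<string> querySite)
    {
        // Only an interrupted prior deployment justifies a round trip to the site.
        bool shouldQuery = prior == PriorStatus.InProgress
            || (prior == PriorStatus.Failed && priorTimedOut);
        if (!shouldQuery) return Decision.Deploy;

        try
        {
            // The site already runs the target revision: no need to re-send.
            return querySite() == targetHash ? Decision.MarkPriorSuccess : Decision.Deploy;
        }
        catch (Exception)
        {
            // Query failed: fall through to a normal deploy and rely on
            // site-side stale-rejection.
            return Decision.Deploy;
        }
    }
}
```

The three calls cover the matched-hash, different-hash, and prior-success cases from the regression test list; only the first two contact the site.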

DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail


Severity	Medium
Category	Design-document adherence
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:334-358,401-406`

Description

The design ("Diff View" and "Dependencies" sections) states the Deployment Manager can request a diff from the Template Engine showing added/removed members, changed values, and connection-binding changes. GetDeploymentComparisonAsync and DeploymentComparisonResult only compare two revision hashes and return a boolean IsStale plus the two hashes. No added/removed/changed detail is produced, and the Template Engine's diff capability is not invoked. The UI cannot render a meaningful diff from this result.

Verification: Confirmed against source. The Template Engine already provides DiffService + ConfigurationDiff (structured Added/Removed/Changed entries for attributes, alarms, and scripts, including data connection binding fields), and DiffService is DI-registered — it was simply never wired into the Deployment Manager's comparison path.

Recommendation

Either implement a real diff (deserialize the stored DeployedConfigSnapshot.ConfigurationJson and the freshly flattened config and invoke the Template Engine's diff service, surfacing structured added/removed/changed entries), or revise the design doc to scope the feature down to staleness detection only.

Resolution

Resolved 2026-05-16 (commit pending): GetDeploymentComparisonAsync now deserializes the stored DeployedConfigSnapshot.ConfigurationJson and runs the Template Engine DiffService against the freshly flattened current configuration, attaching the resulting ConfigurationDiff (added/removed/changed attributes, alarms, scripts) to a new optional Diff property on DeploymentComparisonResult. DiffService is injected into DeploymentService. A snapshot that cannot be deserialized (corrupt / older schema) still yields the hash-based staleness result with a null diff, logged at warning level. Regression test: GetDeploymentComparisonAsync_ProducesStructuredDiff.
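
To show what "structured added/removed/changed entries" adds over a bare hash comparison, here is a deliberately simplified sketch. It is not the Template Engine's DiffService or ConfigurationDiff: `SimpleDiff` and `DiffResult` are hypothetical, and the configuration is reduced to a flat name-to-value map.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var deployed = new Dictionary<string, string> { ["Speed"] = "10", ["Mode"] = "Auto" };
var current = new Dictionary<string, string> { ["Speed"] = "12", ["Limit"] = "99" };

var diff = SimpleDiff.Compare(deployed, current);
Console.WriteLine($"added:   {string.Join(", ", diff.Added)}");
Console.WriteLine($"removed: {string.Join(", ", diff.Removed)}");
Console.WriteLine($"changed: {string.Join(", ", diff.Changed)}");

// Hypothetical result shape: member names grouped by kind of change.
record DiffResult(List<string> Added, List<string> Removed, List<string> Changed);

static class SimpleDiff
{
    public static DiffResult Compare(
        IReadOnlyDictionary<string, string> oldConfig,
        IReadOnlyDictionary<string, string> newConfig)
    {
        // Present only in the new configuration.
        var added = newConfig.Keys
            .Where(k => !oldConfig.ContainsKey(k))
            .OrderBy(k => k, StringComparer.Ordinal).ToList();

        // Present only in the deployed configuration.
        var removed = oldConfig.Keys
            .Where(k => !newConfig.ContainsKey(k))
            .OrderBy(k => k, StringComparer.Ordinal).ToList();

        // Present in both, with a different value.
        var changed = oldConfig.Keys
            .Where(k => newConfig.TryGetValue(k, out var v) && v != oldConfig[k])
            .OrderBy(k => k, StringComparer.Ordinal).ToList();

        return new DiffResult(added, removed, changed);
    }
}
```

A hash comparison on the same input could only report that the two configurations differ; the grouped result is what lets a UI render the diff.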

DeploymentManager-008 — `DeploymentManagerOptions` is never bound to configuration


Severity	Medium
Category	Code organization & conventions
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ServiceCollectionExtensions.cs:7-14`

Description

AddDeploymentManager registers the services but never calls services.Configure<DeploymentManagerOptions>(configuration.GetSection(...)). IOptions<DeploymentManagerOptions> therefore always resolves to a default-constructed instance — the operation-lock and artifact-deployment timeouts cannot be tuned via appsettings.json, contrary to the CLAUDE.md convention "Per-component configuration via appsettings.json sections bound to options classes (Options pattern)." Host/Program.cs binds SecurityOptions and InboundApiOptions from configuration sections but has no equivalent for DeploymentManagerOptions.

Verification: Confirmed against source. Neither AddDeploymentManager nor Host/Program.cs binds DeploymentManagerOptions.

Recommendation

Add an IConfiguration parameter (or a configure callback) to AddDeploymentManager and bind DeploymentManagerOptions to a section such as ScadaBridge:DeploymentManager, consistent with the other components.

Resolution

Resolved 2026-05-16 (commit pending): AddDeploymentManager() now calls services.AddOptions<DeploymentManagerOptions>() so IOptions<DeploymentManagerOptions> is always resolvable, and Host/Program.cs binds the ScadaBridge:DeploymentManager section (exposed as ServiceCollectionExtensions.OptionsSection) via services.Configure<DeploymentManagerOptions>(...) — the same pattern the Host uses for SecurityOptions/InboundApiOptions. An earlier attempt added an AddDeploymentManager(IConfiguration) overload; that was reverted because the project convention (enforced by Host.Tests.OptionsTests) forbids component Add* methods from depending on IConfiguration — the Host owns configuration binding. Regression tests: AddDeploymentManager_RegistersResolvableOptions_WithDefaults, AddDeploymentManager_OptionsBindToConfigurationSection_AsTheHostWires, OptionsSection_MatchesTheConventionalComponentSectionPath.

DeploymentManager-009 — Misleading timeout comment on `DeleteInstanceAsync`


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:288`

Description

The XML doc says "Delete fails if site unreachable (30s timeout via CommunicationOptions)." The actual delete timeout is whatever CommunicationOptions.LifecycleTimeout is configured to (passed inside CommunicationService.DeleteInstanceAsync); the "30s" figure is hard-coded into the comment and not derived from any constant in this module. If LifecycleTimeout is reconfigured, the comment becomes wrong. It also wrongly implies the value lives in this module.

Verification: Confirmed against source. The DeleteInstanceAsync XML doc quoted a hard-coded "30s" value.

Recommendation

Reword to "Delete fails if the site is unreachable within CommunicationOptions.LifecycleTimeout" without quoting a specific number.

Resolution

Resolved 2026-05-16 (commit pending): the DeleteInstanceAsync XML doc no longer quotes a hard-coded "30s" — it now states delete fails if the site is unreachable within CommunicationOptions.LifecycleTimeout (and notes the deadline is applied inside CommunicationService.DeleteInstanceAsync). Documentation-only change; no regression test (a test asserting comment text would be meaningless).

DeploymentManager-010 — `SystemArtifactDeploymentRecord` does not persist the deployment ID


Severity	Low
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:136,194-211`

Description

DeployToAllSitesAsync generates a deploymentId (line 136) and returns it in the ArtifactDeploymentSummary and audit log, but the persisted SystemArtifactDeploymentRecord has no field for it (the entity only has Id, ArtifactType, DeployedBy, DeployedAt, PerSiteStatus). The deployment ID that appears in the UI summary and audit log therefore cannot be correlated back to the stored record. Additionally, each per-site DeployArtifactsCommand carries its own separate GUID (BuildDeployArtifactsCommandAsync line 114), so one logical artifact deployment in fact produces N+1 unrelated IDs.

Verification: Confirmed against source. Each per-site command minted its own GUID and the persisted record had no way to reference the logical id.

Recommendation

Add a DeploymentId column to SystemArtifactDeploymentRecord and store the single logical deploymentId; reuse that ID (or a derived per-site ID) for the per-site commands so the audit log, UI summary, and persisted record agree.

Resolution

Resolved 2026-05-16 (commit pending): BuildDeployArtifactsCommandAsync now accepts an optional deploymentId, and DeployToAllSitesAsync passes the one logical deploymentId to every per-site command — so the per-site commands, the audit log, and the UI summary all reference a single id instead of N+1 unrelated GUIDs (RetryForSiteAsync, an independent single-site retry, still mints its own id). Adding a dedicated DeploymentId column to SystemArtifactDeploymentRecord was deliberately not done: that entity lives in ZB.MOM.WW.ScadaBridge.Commons with its EF mapping in ZB.MOM.WW.ScadaBridge.ConfigurationDatabase, both outside this module's edit scope. Instead the logical deploymentId is embedded in the record's free-form PerSiteStatus JSON payload ({ DeploymentId, Sites }), which is fully within this module's control, so the persisted record is correlatable with the summary/audit. A follow-up to promote it to a first-class column should be filed against Commons/ConfigurationDatabase if a queryable index is needed. Regression tests: DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId, DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix, RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits.
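
The `{ DeploymentId, Sites }` payload can be sketched with System.Text.Json. Only the two top-level property names come from the resolution text. The `PerSiteStatusPayload` record and the shape of `Sites` (site name mapped to a status string) are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

var payload = new PerSiteStatusPayload(
    DeploymentId: Guid.NewGuid().ToString("N"),
    Sites: new Dictionary<string, string>
    {
        ["site-a"] = "Success",
        ["site-b"] = "Failed",
    });

// Serialized form that would be stored in the free-form PerSiteStatus column.
string json = JsonSerializer.Serialize(payload);
Console.WriteLine(json);

// Reading it back recovers the logical deployment id for correlation
// with the summary and the audit log.
var roundTripped = JsonSerializer.Deserialize<PerSiteStatusPayload>(json);
Console.WriteLine(roundTripped?.DeploymentId == payload.DeploymentId);

// Hypothetical payload shape.
record PerSiteStatusPayload(string DeploymentId, Dictionary<string, string> Sites);
```

An id embedded in a JSON payload can be correlated but not indexed, which is the limitation the proposed Commons/ConfigurationDatabase follow-up would address.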

DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path


Severity	Medium
Category	Testing coverage
Status	Resolved
Location	`tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199`

Description

DeploymentServiceTests never sets the CommunicationService actor, so every deploy/lifecycle test deliberately stops at the InvalidOperationException thrown by GetCommunicationActor() (see lines 118-125, 147). As a result there is no test covering: a successful deployment (DeploymentStatus.Success response → instance state set to Enabled, snapshot stored, audit logged); a failed-but-handled site response; the InProgress-stuck bug (DeploymentManager-001); successful Disable/Enable/Delete; or the operation lock actually serializing two concurrent deploys of the same instance. The critical post-response branch (DeploymentService.cs:154-184) and the entire delete/disable/enable success path are untested. The AuditLogs test (lines 277-289) asserts nothing.

Verification: Partially confirmed. By the time this finding was being resolved, the DeploymentManager-006 fix had already introduced a TestKit-actor seam (CreateServiceWithCommActor + ReconcileProbeActor) and successful-deploy tests. The genuinely-still-missing coverage was: successful Disable/Enable/Delete paths, per-instance lock serialization during deploy, and the assertionless AuditLogs test — those gaps were addressed.

Recommendation

Introduce a seam to inject a fake/substitute communication path (e.g. an interface over CommunicationService, or wire a TestKit actor) so success and handled-failure paths can be unit tested. Add tests for the stuck-InProgress scenario and for per-instance lock contention during deploy. Make the audit test assert on IAuditService.LogAsync.

Resolution

Resolved 2026-05-16 (commit pending): extended the TestKit-actor seam (ReconcileProbeActor now also answers lifecycle commands) and added the missing coverage — successful Disable/Enable/Delete (state transition + audit assertions), a successful-deploy audit assertion, and per-instance lock serialization via a new deferred-reply SerializationProbeActor that asserts a single instance's concurrent deploys never overlap. The assertionless AuditLogs test was replaced with DeployInstanceAsync_FlatteningFails_DoesNotReachAudit, which asserts on IAuditService.LogAsync. Regression tests: DisableInstanceAsync_SiteSucceeds_SetsDisabledStateAndAudits, EnableInstanceAsync_SiteSucceeds_SetsEnabledStateAndAudits, DeleteInstanceAsync_SiteSucceeds_RemovesRecordAndAudits, DeployInstanceAsync_SiteSucceeds_WritesDeployAuditEntry, DeployInstanceAsync_FlatteningFails_DoesNotReachAudit, DeployInstanceAsync_SameInstance_OperationLockSerializesConcurrentDeploys.

DeploymentManager-012 — `LifecycleCommandTimeout` option is dead code


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentManagerOptions.cs:8-9`

Description

DeploymentManagerOptions.LifecycleCommandTimeout is declared with a 30s default and an XML doc, but it is never read anywhere in the codebase (lifecycle commands rely on CommunicationOptions.LifecycleTimeout inside CommunicationService). The option misleads readers into thinking it controls disable/enable/delete timeouts, when setting it has no effect.

Verification: Confirmed against source. A repo-wide grep found exactly one occurrence of LifecycleCommandTimeout — the declaration itself.

Recommendation

Remove LifecycleCommandTimeout, or actually thread it through to the lifecycle command calls (e.g. by creating a linked CTS with this timeout in DisableInstanceAsync/EnableInstanceAsync/DeleteInstanceAsync, the way ArtifactDeploymentTimeoutPerSite is used).

Resolution

Resolved 2026-05-16 (commit pending): LifecycleCommandTimeout is now actually threaded through (the option exists for tuning, so it was wired up rather than deleted). DisableInstanceAsync/EnableInstanceAsync/DeleteInstanceAsync each create a linked CancellationTokenSource with CancelAfter(_options.LifecycleCommandTimeout), the same pattern ArtifactDeploymentService uses for ArtifactDeploymentTimeoutPerSite, and pass its token to the CommunicationService call. Each method now catches the resulting TimeoutException/OperationCanceledException, logs a warning, and returns a Result.Failure (previously an AskTimeoutException from a hung site escaped uncaught). The option's XML doc was corrected to describe the real behaviour. Regression test: DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait (asserts a 300 ms LifecycleCommandTimeout bounds the wait far below the 30 s CommunicationOptions.LifecycleTimeout; confirmed to fail before the fix, when the call hung the full 30 s and threw AskTimeoutException).
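The linked-token pattern this resolution describes can be sketched roughly as follows. `LifecycleTimeoutSketch`, `RunWithTimeoutAsync`, and the tuple return type are hypothetical stand-ins for the module's real methods and its Result type; only `CancellationTokenSource.CreateLinkedTokenSource` and `CancelAfter` are actual .NET APIs.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LifecycleTimeoutSketch
{
    // Runs a site call under a deadline. Returns (true, null) on success,
    // or (false, message) when the deadline or the caller's token fires.
    public static async Task<(bool Ok, string? Error)> RunWithTimeoutAsync(
        Func<CancellationToken, Task> siteCall,
        TimeSpan lifecycleCommandTimeout,
        CancellationToken outerToken)
    {
        // The linked source cancels when either the caller cancels or the timeout elapses.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outerToken);
        cts.CancelAfter(lifecycleCommandTimeout);
        try
        {
            await siteCall(cts.Token);
            return (true, null);
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            // Convert the exception into a failure value instead of letting it escape.
            return (false, $"Lifecycle command did not complete within {lifecycleCommandTimeout}: {ex.Message}");
        }
    }
}
```

A call such as `RunWithTimeoutAsync(ct => Task.Delay(Timeout.Infinite, ct), TimeSpan.FromMilliseconds(300), CancellationToken.None)` would model the unresponsive-site case that the regression test covers.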

DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites


Severity	Low
Category	Security
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:108-111`

Description

BuildDeployArtifactsCommandAsync maps smtp.Credentials directly into SmtpConfigurationArtifact and that command is sent to every site. Distributing SMTP credentials to sites is consistent with the design (SMTP configuration is a deployable artifact), but the credentials travel inside a serialized command across the inter-cluster transport and are stored on each site's SQLite. There is no indication the value is encrypted at rest on the site or scrubbed from logs. Worth confirming the transport is TLS-protected and the site stores the credential securely; at minimum this should be a conscious, documented decision.

Recommendation

Confirm inter-cluster transport encryption covers artifact commands, ensure Credentials is never written to logs, and document the at-rest protection of SMTP credentials on site SQLite. Consider encrypting the credential field within the artifact payload.

Verification (2026-05-16): Re-triaged against source. The DeploymentManager side is clean: ArtifactDeploymentService maps SmtpConfiguration.Credentials into the artifact (which the design explicitly mandates — SMTP configuration is a deployable artifact) and never logs it — the three log statements in DeployToAllSitesAsync only reference SiteId, SiteName, DeploymentId, and ex.Message, never the credential. There is no defect to fix purely within src/ZB.MOM.WW.ScadaBridge.DeploymentManager. The finding's remaining recommendations are all cross-module and one needs a design decision:

inter-cluster transport TLS — ZB.MOM.WW.ScadaBridge.Communication / ZB.MOM.WW.ScadaBridge.ClusterInfrastructure (Akka remoting + ClusterClient config);
at-rest encryption of the credential on site SQLite — ZB.MOM.WW.ScadaBridge.SiteRuntime artifact store;
encrypting the credential field inside the artifact payload — needs the SmtpConfigurationArtifact shape in ZB.MOM.WW.ScadaBridge.Commons plus cooperating producer (DeploymentManager) and consumer (SiteRuntime) changes, and a key-management design decision (where the encryption key lives, how it is distributed to sites) that cannot be made unilaterally here.

Status: Open — flagged. No purely-DeploymentManager fix exists; the work crosses Communication / SiteRuntime / Commons and requires a key-management design decision. Severity confirmed Low: with TLS-protected inter-cluster transport (a separate, assumed-in-place control) and no logging leak, this is a hardening item, not an active leak.

Resolution

Resolved 2026-05-16 (commit <pending>). Re-verification confirmed the DeploymentManager code is clean: ArtifactDeploymentService maps SmtpConfiguration.Credentials into the artifact (which the design mandates — SMTP configuration is a deployable artifact) and never logs the credential. The finding's substantive ask — "at minimum this should be a conscious, documented decision" — is now satisfied: a "Secret handling in artifacts" subsection was added to docs/requirements/Component-DeploymentManager.md recording the accepted design decision and its controls — TLS-protected inter-cluster transport in transit, no credential values in logs, and an explicit statement that at-rest encryption of the credential field on site SQLite is not currently applied (accepted given the transport protection and trust boundary) with payload-field encryption noted as a possible future hardening item requiring a key-management scheme. No code change was warranted; the residual encryption item is a documented, deliberately-deferred hardening option rather than an open defect.

DeploymentManager-014 — Dead `CreateCommand` helper in artifact tests


Severity	Low
Category	Testing coverage
Status	Resolved
Location	`tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90`

Description

The private static CreateCommand() helper is never referenced by any test in the file. It is dead code that suggests an intended test (e.g. a successful multi-site artifact deployment) was never written — coverage of DeployToAllSitesAsync is limited to the no-sites failure case, and RetryForSiteAsync and BuildDeployArtifactsCommandAsync have no tests at all.

Verification: Confirmed against source. The CreateCommand() helper had no callers, and DeployToAllSitesAsync/RetryForSiteAsync only had the no-sites failure case.

Recommendation

Either remove the unused helper or, preferably, write the missing tests for DeployToAllSitesAsync (per-site success/failure matrix, partial failure) and RetryForSiteAsync using it.

Resolution

Resolved 2026-05-16 (commit pending): took the recommendation's preferred option — removed the dead CreateCommand() helper and wrote the missing coverage instead. ArtifactDeploymentServiceTests now extends TestKit and uses a stand-in ArtifactProbeActor (records the DeployArtifactsCommands it receives, replies success or, for a configured failure set, failure) so DeployToAllSitesAsync and RetryForSiteAsync are exercised end-to-end past the communication boundary. New tests: DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId (also covers DeploymentManager-010), DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix (per-site success/failure matrix), RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits.

DeploymentManager-015 — Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates


Severity	High
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:631-655`

Description

TryReconcileWithSiteAsync (the DeploymentManager-006 query-before-redeploy path) handles the case where a prior InProgress/timeout-Failed record exists and the site reports it already has the target revision hash. In that case it marks the prior DeploymentRecord Success, audit-logs DeployReconciled, and returns it — the caller then returns Result.Success and never enters the normal deploy body.

The normal success path (DeployInstanceAsync, DeploymentService.cs:215-223) does three things on a successful site response: writes the deployment record's terminal status, sets instance.State = InstanceState.Enabled and calls UpdateInstanceAsync, and calls StoreDeployedSnapshotAsync. The reconciliation shortcut performs only the first. Consequently, after a reconciled deployment:

The instance State is left at whatever it was (e.g. NotDeployed for a first-time deploy that timed out, or Disabled) even though the site is actually running the configuration — the central state machine and the site diverge, and a subsequent DisableInstanceAsync/EnableInstanceAsync will be rejected or allowed incorrectly by StateTransitionValidator.
No DeployedConfigSnapshot is created or refreshed. A first-time deploy that is resolved purely by reconciliation leaves GetDeploymentComparisonAsync permanently returning "No deployed snapshot found for this instance.", and a redeploy reconciliation leaves the stored snapshot showing the old config even though the deployment record claims Success for the new revision.

The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the deployed snapshot and instance state to reflect the last successful deployment; the reconciliation path silently breaks both invariants.

Recommendation

In the reconciled-success branch of TryReconcileWithSiteAsync, perform the same post-success side effects as the normal path: set instance.State = InstanceState.Enabled (+ UpdateInstanceAsync) and call StoreDeployedSnapshotAsync with the target deployment ID / revision hash / config JSON. Factor the shared post-success logic into one helper so the normal and reconciliation paths cannot drift. Add a regression test asserting that a reconciled deployment leaves the instance Enabled and a snapshot stored.

Resolution

Resolved 2026-05-17 (commit pending): extracted the shared post-success side effects into ApplyPostSuccessSideEffectsAsync (sets instance State = Enabled + UpdateInstanceAsync, stores/refreshes the DeployedConfigSnapshot) and invoked it from both the normal deploy success path and the TryReconcileWithSiteAsync reconciled-success branch, so a reconciled deployment now performs the same instance-state and snapshot updates as a normal one (configJson is now computed before the reconciliation call and threaded into TryReconcileWithSiteAsync). Regression test: DeployInstanceAsync_Reconciled_SetsInstanceEnabledAndStoresSnapshot.

DeploymentManager-016 — Reconciled prior record keeps its stale `RevisionHash`


Severity	Medium
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:639-651`

Description

When TryReconcileWithSiteAsync reconciles a prior record, it mutates prior.Status, prior.ErrorMessage, and prior.CompletedAt, but not prior.RevisionHash. The reconciliation condition only compares the site's AppliedRevisionHash against the freshly-flattened targetRevisionHash — it does not require prior.RevisionHash to equal either of them.

The prior record can legitimately carry a different revision hash than the current target: e.g. a deploy timed out at revision R1, the template was then edited so the current flatten yields R2, and meanwhile the site actually applied R2 through some other path (or R1 and R2 are equal-by-content but the prior record predates a hash recompute). After reconciliation the record's Status is Success but its RevisionHash still says R1, so staleness checks and any UI that reads DeploymentRecord.RevisionHash will report the instance as deployed at the wrong revision. The audit DeployReconciled entry records RevisionHash = targetRevisionHash, contradicting the persisted record.

Recommendation

In the reconciled-success branch, also set prior.RevisionHash = targetRevisionHash so the persisted record, the audit entry, and the site's actual applied revision all agree. Alternatively, only reconcile when prior.RevisionHash == targetRevisionHash and otherwise fall through to a normal deploy.

Resolution

Resolved 2026-05-17 (commit pending): the reconciled-success branch of TryReconcileWithSiteAsync now also sets prior.RevisionHash = targetRevisionHash, so the persisted record, the DeployReconciled audit entry, and the site's actually-applied revision all agree. Regression test: DeployInstanceAsync_Reconciled_PriorRecordRevisionHashUpdatedToTarget.

DeploymentManager-017 — `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:562-570`

Description

The XML summary on GetDeploymentStatusAsync reads: "WP-2: After failover/timeout, query site for current deployment state before re-deploying." The method body does no such thing — it is a one-line pass-through to _repository.GetDeploymentByDeploymentIdAsync, a pure local DB read. The query-the-site-before-redeploy behaviour the comment describes was implemented separately in TryReconcileWithSiteAsync (DeploymentManager-006). The stale comment is a leftover of the original design intent and misleads a reader into thinking this method contacts the site.

Recommendation

Reword the summary to describe what the method actually does — "returns the current persisted DeploymentRecord for the given deployment ID from the configuration database" — and, if useful, cross-reference TryReconcileWithSiteAsync as the place the site-query reconciliation lives.

Resolution

Resolved 2026-05-17 (commit pending): the GetDeploymentStatusAsync XML doc now states it returns the persisted DeploymentRecord from the configuration database as a pure local read, and cross-references TryReconcileWithSiteAsync as where the query-the-site-before-redeploy reconciliation actually lives. Documentation-only change; no regression test (a test asserting comment text would be meaningless).

DeploymentManager-018 — Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover


Severity	High
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:675-682,721-748`

Resolution — Added a forceEnabledState parameter to ApplyPostSuccessSideEffectsAsync. The normal deploy path passes true (fresh apply legitimately ends in Enabled); the reconciliation path passes false, so the helper only promotes NotDeployed → Enabled and leaves an existing Disabled (or Enabled) untouched. Regression test DeployInstanceAsync_Reconciled_DisabledInstance_PreservesDisabledState exercises the failover scenario and asserts the prior record still flips to Success while Instance.State stays Disabled.

Description

TryReconcileWithSiteAsync calls ApplyPostSuccessSideEffectsAsync whenever the site reports it has the target revision hash, and that helper unconditionally writes instance.State = InstanceState.Enabled. The reconciliation shortcut only runs when the prior DeploymentRecord is InProgress or timeout-Failed — exactly the scenarios that survive a central failover (the in-memory OperationLockManager is lost on failover, by design: "Lost on central failover (acceptable per design — in-progress treated as failed)").

After such a failover, the per-instance operation lock is gone but the deployment record is still InProgress in the DB. A user can legitimately issue DisableInstanceAsync for the same instance — there is nothing in DisableInstanceAsync that consults the deployment record, only the StateTransitionValidator over Instance.State. If the state is Enabled (the typical case when the deploy started), the disable proceeds, the site honours it (the design states a disabled instance retains its deployed configuration), and central now persists Instance.State = Disabled. The deployment-record row remains InProgress (no one transitioned it). Later the user retries the deploy: TryReconcileWithSiteAsync runs, the site still has the target revision hash (Disable doesn't change the deployed config), the prior record is marked Success, and ApplyPostSuccessSideEffectsAsync writes Instance.State = Enabled — silently overriding the user's explicit Disable.

The same trap exists for any direct DB edit / migration that flipped the state between the timed-out deploy and the redeploy. The normal deploy path can defensibly assume Enabled after a fresh successful apply, but the reconciliation path is reconciling prior state with prior user intent; it should preserve Disabled if that is the current Instance.State at the time of reconciliation, mirroring the design's separation between deploy (config apply) and disable (subscription/script lifecycle).

Recommendation

In the reconciliation branch, do not force Enabled. Either:

Pass a flag/parameter to ApplyPostSuccessSideEffectsAsync telling it whether to touch state, and skip the state write on the reconciliation path (leaving the current Instance.State intact, which is already Enabled for a fresh deploy that timed out and Disabled for the user-disabled follow-up case); or
Only set Enabled when the current Instance.State is NotDeployed (i.e. the first-deploy timed-out case), and leave existing Enabled/Disabled alone.

Add a regression test where an instance with Instance.State = Disabled and a prior InProgress deployment record is reconciled — the resulting Instance.State must remain Disabled, and the deployment record must still be marked Success.
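The state rule adopted in the resolution for this finding (force Enabled on a normal deploy, promote only NotDeployed on reconciliation) can be expressed as a small pure function. This is a sketch under assumptions: the enum is re-declared locally for the example, and `PostSuccessStateSketch.ResolveState` is a hypothetical name, not the body of ApplyPostSuccessSideEffectsAsync.

```csharp
// Local re-declaration for the example; the member names follow this review.
public enum InstanceState { NotDeployed, Enabled, Disabled }

public static class PostSuccessStateSketch
{
    // forceEnabledState = true  : normal deploy path, a fresh apply ends in Enabled.
    // forceEnabledState = false : reconciliation path, only NotDeployed is promoted.
    public static InstanceState ResolveState(InstanceState current, bool forceEnabledState)
    {
        if (forceEnabledState)
        {
            return InstanceState.Enabled;
        }

        // Preserve an existing Enabled or Disabled state chosen by the operator.
        return current == InstanceState.NotDeployed ? InstanceState.Enabled : current;
    }
}
```

Keeping the decision in one function makes the failover case directly checkable: `ResolveState(InstanceState.Disabled, false)` yields `Disabled`, while the same input with `true` yields `Enabled`.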

DeploymentManager-019 — Lifecycle command timeout writes no audit entry


Severity	Medium
Category	Error handling & resilience
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458`

Resolution (2026-05-28): added TryLogLifecycleTimeoutAsync, a private helper that mirrors the DeployFailed pattern — it calls _auditService.LogAsync with CancellationToken.None (so the operator's already-cancelled outer token cannot also prevent the audit write) and stamps the row with the <Action>TimedOut action name (DisableTimedOut / EnableTimedOut / DeleteTimedOut), the command id, the configured deadline, and the captured exception message. Each of DisableInstanceAsync / EnableInstanceAsync / DeleteInstanceAsync invokes the helper from its catch (TimeoutException or OperationCanceledException) block before returning the failure Result. The helper itself try/catches around the audit write so a failed audit pipeline does not mask the underlying timeout for the caller — it only logs at Warning. Regression tests DisableInstanceAsync_LifecycleTimeout_WritesDisableTimedOutAuditEntry, EnableInstanceAsync_LifecycleTimeout_WritesEnableTimedOutAuditEntry, and DeleteInstanceAsync_LifecycleTimeout_WritesDeleteTimedOutAuditEntry use the existing SilentProbeActor to keep the site unresponsive, configure a 300 ms LifecycleCommandTimeout to bound the wait, and assert the audit log received the corresponding <Action>TimedOut entry exactly once.
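A minimal sketch of the helper's two key properties (the audit write uses CancellationToken.None, and a failing audit pipeline is swallowed and only logged) might look like this. `IAuditSink`, `RecordingAuditSink`, and `LifecycleAuditSketch` are hypothetical types invented for the example; the real IAuditService.LogAsync signature is not shown in this review.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical audit abstraction standing in for the real audit service.
public interface IAuditSink
{
    Task LogAsync(string user, string action, string detail, CancellationToken ct);
}

// In-memory sink used to observe what was written.
public sealed class RecordingAuditSink : IAuditSink
{
    public List<(string User, string Action, string Detail)> Entries { get; } = new();

    public Task LogAsync(string user, string action, string detail, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        Entries.Add((user, action, detail));
        return Task.CompletedTask;
    }
}

public static class LifecycleAuditSketch
{
    // Writes "<Action>TimedOut" with CancellationToken.None so an already-cancelled
    // caller token cannot block the audit write. Returns false if the write fails.
    public static async Task<bool> TryLogTimeoutAsync(
        IAuditSink audit, string user, string action, string commandId,
        Exception timeout, Action<string> warn)
    {
        try
        {
            await audit.LogAsync(user, $"{action}TimedOut",
                $"CommandId={commandId}; Error={timeout.Message}", CancellationToken.None);
            return true;
        }
        catch (Exception ex)
        {
            // A broken audit pipeline must not mask the original timeout.
            warn($"Audit write for {action}TimedOut failed: {ex.Message}");
            return false;
        }
    }
}
```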

Description

DisableInstanceAsync, EnableInstanceAsync, and DeleteInstanceAsync each wrap the CommunicationService call in a linked CTS with LifecycleCommandTimeout (DeploymentManager-012). On timeout they log a warning and return Result<...>.Failure(...) — and skip the _auditService.LogAsync call entirely. As a result, an operator-initiated disable/enable/delete that times out at the site leaves no audit trail: the user, the timestamp, the command id, and the failure mode are not recorded in the audit log. The deploy path goes out of its way to write a DeployFailed audit entry on the same failure mode (DeploymentService.cs:274-276), with CancellationToken.None so the write is durable; the lifecycle commands do not.

The design lists audit logging as a Deployment Manager responsibility for "all deployment actions, system-wide artifact deployments, and instance lifecycle changes" — a timed-out lifecycle command is an attempted lifecycle change, and the operator action is exactly the kind of event the audit log exists to record.

Recommendation

In each of the three catch (Exception ex) when (ex is TimeoutException or OperationCanceledException) blocks, write a DisableTimeout/EnableTimeout/DeleteTimeout audit entry (or use the existing operation name with a failure flag) with CancellationToken.None so a cancelled outer token does not prevent the audit write, mirroring DeployFailed. Add a unit test asserting that the DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait scenario also produces an audit entry.

DeploymentManager-020 — `DeployReconciled` audit attributes the action to the prior deployer, not the current user


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:698-712`

Description

In TryReconcileWithSiteAsync the audit call is:

await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)

prior.DeployedBy is the user who issued the original (timed-out / stuck) deployment, not the user parameter passed into DeployInstanceAsync. The current user — the one who triggered the redeploy that produced the reconciliation — is dropped on the floor. For audit forensics this is misleading: the row will read "user A reconciled their own deployment" when in fact user B initiated the action that reconciled it.

The original deployer is interesting context, but it should be carried in the audit-detail object (where DeploymentId and RevisionHash already live), not substituted for the actor.

Recommendation

Use user (the parameter on DeployInstanceAsync, threaded through TryReconcileWithSiteAsync) as the audit actor, and include OriginalDeployer = prior.DeployedBy in the detail object so the original attribution is preserved without misrepresenting who took the action.

Resolution (2026-05-28): Threaded the user parameter from DeployInstanceAsync into TryReconcileWithSiteAsync as a new currentUser argument (consistent with the DeploymentManager-018 forceEnabledState parameter-threading pattern) and rewrote the audit call to log currentUser as the actor with OriginalDeployer = prior.DeployedBy carried in the detail object. Added test DeployInstanceAsync_Reconciled_AuditAttributesCurrentUserNotPriorDeployer that pins the new attribution and asserts the prior deployer is no longer used as the actor. Tests green (80/80 in DeploymentManager.Tests).

DeploymentManager-021 — `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing


Severity	Low
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:107-111`

Resolution (2026-05-28): ResolveSiteIdentifierAsync now throws InvalidOperationException ("Site with ID {siteId} not found; cannot resolve its SiteIdentifier for routing.") when the Site row is missing, instead of returning the numeric id rendered as a string. The deploy path's existing try/catch turns the throw into a DeploymentStatus.Failed record carrying the descriptive message (the DeploymentManager-001/-002 cleanup write the failure with CancellationToken.None); the lifecycle paths (Disable/Enable/Delete) propagate the exception so the CLI/UI caller surfaces the actual cause to the operator rather than seeing a confusing downstream "unknown site" routing error. The repository contract already returned Site?, so the null path is now type-visible at the call site instead of silently papered over.

Description

private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
{
    var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
    return site?.SiteIdentifier ?? siteId.ToString();
}

If the Site row is missing (FK was deleted, race with admin delete, DB inconsistency), the method silently returns the numeric DB id rendered as a string. This is then passed to CommunicationService.{Deploy,Disable,Enable,Delete}InstanceAsync and QueryDeploymentStateAsync as if it were a real SiteIdentifier (e.g. "site-a"). The communication layer will fail with an "unknown site" or routing error, producing a confusing diagnostic that hides the actual problem (no site row).

This is a defensive concern, but every mutating operation in the module goes through this method, so a stale instance whose site was deleted will produce a misleading error every time it is touched.

Recommendation

Treat a missing site as a hard validation failure: return a Result.Failure($"Site with ID {siteId} not found") early from the calling operations, instead of fabricating an identifier. The repository already returns Site?, so the null path is type-visible; just don't paper over it.
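For comparison with the original snippet in the Description, here is a sketch of the throwing variant that the resolution for this finding records. The `Site` record and the delegate-based lookup are hypothetical simplifications of the repository call; the exception message is the one quoted in the resolution.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-in for the Site entity; only the fields the example needs.
public sealed record Site(int Id, string SiteIdentifier);

public static class SiteResolutionSketch
{
    // Resolves the routing identifier, failing loudly when the site row is missing
    // instead of fabricating an identifier from the numeric database ID.
    public static async Task<string> ResolveSiteIdentifierAsync(
        Func<int, CancellationToken, Task<Site?>> getSiteById,
        int siteId,
        CancellationToken cancellationToken)
    {
        var site = await getSiteById(siteId, cancellationToken);
        if (site is null)
        {
            throw new InvalidOperationException(
                $"Site with ID {siteId} not found; cannot resolve its SiteIdentifier for routing.");
        }

        return site.SiteIdentifier;
    }
}
```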

DeploymentManager-022 — `Pending` and `InProgress` are written back-to-back with no intervening work


Severity	Low
Category	Code organization & conventions
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:178-194`

Resolution (2026-05-28): The transient Pending write was dropped — the deployment record is now created directly in DeploymentStatus.InProgress, which collapses the start of the deploy into a single AddDeploymentRecordAsync + SaveChangesAsync + NotifyStatusChange (instead of two writes back-to-back). The flattening, validation, and TryReconcileWithSiteAsync round-trip have all completed before the insert, and the deploy command is sent immediately after, so Pending carried no operational meaning between the two writes. InProgress retains its documented "sent to site, awaiting response" semantics. Eliminating the extra SaveChangesAsync round-trip also removes the Pending→InProgress flicker the CentralUI-006 deployment-status page used to render via the second IDeploymentStatusNotifier.NotifyStatusChanged invocation.

Description

DeployInstanceAsync does:

record.Status = Pending;
AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
record.Status = InProgress;
UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);

There is no work between the two writes — flattening, validation, and reconciliation have already completed by line 174. The deploy command is sent immediately after the InProgress write. The Pending write therefore costs: an extra SaveChangesAsync round-trip, an extra IDeploymentStatusNotifier invocation (which the CentralUI-006 page renders, so the user briefly sees a Pending flicker before InProgress), and an extra row-version bump if EF optimistic concurrency is enabled on the table.

The design uses Pending to mean "queued, not yet sent" and InProgress to mean "sent to site, awaiting response". The code's Pending slot has no queuing — it is set and immediately overwritten — so the state buys nothing operationally.

Recommendation

Either:

Drop the Pending write entirely and create the record directly in InProgress (one row insert, one notification, simpler UI); or
Move the Pending→InProgress transition to bracket actual queueing/work (e.g. set Pending before flattening + reconciliation, set InProgress immediately before DeployInstanceAsync on the comm service) so the two states carry distinguishable semantics worth a separate write.

DeploymentManager-023 — `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site


Severity	Low
Category	Performance & resource management
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173`

Resolution (2026-05-28): Hoisted the global artifact queries (shared scripts, external systems + methods, DB connections, notification lists, SMTP configurations) out of the per-site loop into a new private FetchGlobalArtifactsAsync that produces a GlobalArtifactSnapshot record. DeployToAllSitesAsync now calls it ONCE before the loop and threads the snapshot through a new prefetched-globals overload of BuildDeployArtifactsCommandAsync; the public single-site overload keeps the prior fetch-then-build behaviour for RetryForSiteAsync. Only the per-site data-connection query remains inside the loop. Regression tests DeployToAllSitesAsync_HoistsGlobalArtifactQueriesOutOfPerSiteLoop (three sites; pins exactly-one call to each global getter and one per-site call to GetDataConnectionsBySiteIdAsync) and RetryForSiteAsync_SingleSitePath_StillRunsTheGlobalQueriesOnce (single-site path still owns its own fetch).

Description

DeployToAllSitesAsync loops over sites and calls BuildDeployArtifactsCommandAsync(site.Id, ...) for each one. Of the six artifact sets the method gathers, only dataConnections is per-site:

_templateRepo.GetAllSharedScriptsAsync — global.
_externalSystemRepo.GetAllExternalSystemsAsync — global, plus GetMethodsByExternalSystemIdAsync per external system per site.
_externalSystemRepo.GetAllDatabaseConnectionsAsync — global.
_notificationRepo.GetAllNotificationListsAsync — global.
_notificationRepo.GetAllSmtpConfigurationsAsync — global.
_siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...) — per-site.

With N sites this issues 5·N queries against the global sets, of which 5·(N−1) are redundant, plus M·N method queries (M being the external-system count), of which M·(N−1) are redundant. On a hub-and-spoke deployment with many sites the artifact-deploy path is slower than necessary and pins DbContext usage longer than needed. Per CLAUDE.md, the DbContext is not thread-safe and the per-site commands are already built sequentially (good); the redundant queries are sequential too, but the network/round-trip cost is real.

Recommendation

Hoist the global queries (shared scripts, external systems + their methods, DB connections, notification lists, SMTP configurations) out of BuildDeployArtifactsCommandAsync, fetch them once in DeployToAllSitesAsync, and pass them in alongside the site id (or expose a BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals) overload). RetryForSiteAsync (the single-site path) can keep the convenience-overload behaviour. Add a test using NSubstitute's .Received() to assert _templateRepo.GetAllSharedScriptsAsync is called exactly once for an N-site deployment.
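The hoisting recommended above can be sketched as follows. This is a minimal, self-contained illustration rather than the module's code: CountingGlobalRepo, HoistingSketch, and the string-typed artifact lists are hypothetical stand-ins, and GlobalArtifactSnapshot / DeployArtifactsCommand are heavily simplified versions of the types named in this finding. A hand-rolled call counter takes the place of NSubstitute's .Received() so the example has no package dependencies.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the global repositories; it counts calls so the
// "fetched exactly once" property can be checked without a mocking library.
public sealed class CountingGlobalRepo
{
    public int SharedScriptCalls { get; private set; }

    public Task<IReadOnlyList<string>> GetAllSharedScriptsAsync()
    {
        SharedScriptCalls++;
        return Task.FromResult<IReadOnlyList<string>>(new[] { "scriptA", "scriptB" });
    }
}

// Simplified snapshot of the site-independent artifacts.
public sealed record GlobalArtifactSnapshot(IReadOnlyList<string> SharedScripts);

// Simplified command: the shared snapshot data plus the one per-site set.
public sealed record DeployArtifactsCommand(
    int SiteId,
    IReadOnlyList<string> SharedScripts,
    IReadOnlyList<string> DataConnections);

public static class HoistingSketch
{
    // The only query that genuinely depends on the site (illustrative data).
    private static Task<IReadOnlyList<string>> GetDataConnectionsBySiteIdAsync(int siteId) =>
        Task.FromResult<IReadOnlyList<string>>(new[] { $"conn-for-site-{siteId}" });

    public static async Task<List<DeployArtifactsCommand>> BuildForAllSitesAsync(
        CountingGlobalRepo repo, IEnumerable<int> siteIds)
    {
        // Global sets are fetched once, before the per-site loop.
        var globals = new GlobalArtifactSnapshot(await repo.GetAllSharedScriptsAsync());

        var commands = new List<DeployArtifactsCommand>();
        foreach (var siteId in siteIds)
        {
            // Per-site data stays inside the loop and runs sequentially.
            var connections = await GetDataConnectionsBySiteIdAsync(siteId);
            commands.Add(new DeployArtifactsCommand(siteId, globals.SharedScripts, connections));
        }
        return commands;
    }

    public static async Task Main()
    {
        var repo = new CountingGlobalRepo();
        var commands = await BuildForAllSitesAsync(repo, new[] { 1, 2, 3 });
        Console.WriteLine($"{commands.Count} commands, {repo.SharedScriptCalls} global query");
    }
}
```

In the real test the counter check would presumably be written as a Received(1) assertion on the substitute; the shape of the assertion is the same.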

DeploymentManager-024 — Test probe actors hold mutable static state across tests


Severity	Low
Category	Testing coverage
Status	Resolved
Location	`tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075`, `tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217`

Resolution (2026-05-28): Replaced the static counters with per-test instance state. Introduced ReconcileProbeCounters and SerializationProbeCounters (in DeploymentServiceTests) and ArtifactProbeRecorder (in ArtifactDeploymentServiceTests); each probe actor now takes the counter object as its first constructor argument. Every test instantiates a fresh counter local, passes it via Props.Create(() => new ReconcileProbeActor(counters, ...)), and reads the counts directly off counters; no shared static fields remain. ReconcileProbeActor's counter increments now use Interlocked.Increment so that cross-thread updates are atomic, and SerializationProbeActor retains its lock on a per-test Gate. All 85 ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests continue to pass after the refactor.

Description

ReconcileProbeActor.QueryCount / DeployCount, SerializationProbeActor.MaxConcurrent / _current, and ArtifactProbeActor.Received are all static fields. Each test's actor constructor resets them, but reset-on-construction only works as long as no two tests that use the same actor type run concurrently. By default xUnit treats each test class as its own collection and runs the tests inside a collection sequentially, so today's tests pass. That safety is incidental: reuse one of these probe actors from a second test class (classes run in parallel with each other by default), or adopt a runner configuration that executes the tests of one class concurrently, and the counters race. A deploy in test A could then increment DeployCount while test B is asserting on it.

Static state shared across tests is also why a flaky-test investigation here will be unusually painful: the offending interaction is invisible from any single test file.

Recommendation

Replace the static counters with instance state, hand the actor a probe recipient (an IActorRef to a TestKit probe), and assert via ExpectMsg in each test. Where the simpler counter shape is preferred, pass a shared-state object into the actor's constructor so each test owns its own instance — never reach for static mutable test state.
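The simpler counter shape from the recommendation might look like the sketch below. It deliberately leaves Akka.NET out so that it runs on the base class library alone: ProbeCounters and ProbeStandIn are hypothetical names, and ProbeStandIn only mimics a probe actor that receives its state through the constructor.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Per-test counter object: each test creates its own instance, so nothing
// is shared between tests.
public sealed class ProbeCounters
{
    private int _deployCount;

    public int DeployCount => Volatile.Read(ref _deployCount);

    // Atomic increment; safe when messages are handled on other threads.
    public void RecordDeploy() => Interlocked.Increment(ref _deployCount);
}

// Stand-in for a probe actor: it is handed its counters instead of writing
// to static fields.
public sealed class ProbeStandIn
{
    private readonly ProbeCounters _counters;

    public ProbeStandIn(ProbeCounters counters) => _counters = counters;

    public void OnDeployMessage() => _counters.RecordDeploy();
}

public static class PerTestStateSketch
{
    // Simulates one test body: fresh counters, then `messages` deploy messages
    // delivered from multiple threads.
    public static int RunOneTest(int messages)
    {
        var counters = new ProbeCounters();
        var probe = new ProbeStandIn(counters);
        Parallel.For(0, messages, _ => probe.OnDeployMessage());
        return counters.DeployCount;
    }

    public static void Main()
    {
        // Two "tests" running at the same time do not see each other's counts.
        var a = Task.Run(() => RunOneTest(100));
        var b = Task.Run(() => RunOneTest(250));
        Task.WaitAll(a, b);
        Console.WriteLine($"{a.Result} {b.Result}"); // prints "100 250"
    }
}
```

With static fields the two concurrent runs would share a single counter and neither could rely on its own total; with per-instance state each run observes only its own messages.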

DeploymentManager-025 — Notification lists and SMTP configurations (with credentials) are still broadcast to every site, contradicting the central-only design


Severity	High
Category	Design-document adherence
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:13-22,128,130,150-151,181-188`

Description

The design now states explicitly, in two authoritative places, that notification lists and SMTP configuration are not deployable artifacts:

docs/requirements/Component-DeploymentManager.md:142-146: "Notification lists and SMTP configuration are not deployable artifacts — they are central-only definitions managed by the Notification Service ... Notification delivery happens on the central cluster, so no notification artifact or SMTP credential is ever distributed to sites."
CLAUDE.md "External Integrations" decision: "Notification delivery is central-only ... Notification lists and SMTP config are no longer deployed to sites; recipient resolution happens at central, at delivery time."

ArtifactDeploymentService still does the opposite. FetchGlobalArtifactsAsync queries _notificationRepo.GetAllNotificationListsAsync and GetAllSmtpConfigurationsAsync (lines 150-151), maps them into NotificationListArtifacts and SmtpConfigurationArtifacts — the SMTP artifact carrying smtp.Credentials verbatim (line 188) — and BuildDeployArtifactsCommandAsync places both into the DeployArtifactsCommand sent to every site (lines 128, 130). The site side persists them: SiteReplicationActor (lines 192-201) and DeploymentManagerActor (lines 1383-1419) loop over command.NotificationLists and command.SmtpConfigurations and write them to site SQLite via SiteNotificationRepository.

This is the precise scenario the design says must never happen: SMTP credentials travel across the inter-cluster transport and land on every site's SQLite. It supersedes the framing of the now-closed DeploymentManager-013, which accepted SMTP-as-deployable-artifact as a documented design decision — the design has since flipped to forbid distribution entirely, so this is a fresh divergence, not the same finding. The class-level XML doc (lines 13-22) is correspondingly stale: it still advertises "notification lists ... and SMTP configurations" as artifacts the service "broadcasts ... to all sites."

Secondary defect in the same mapping: NotificationListArtifact is built from nl.Recipients.Where(r => r.EmailAddress is not null) (line 182), which silently drops every SMS-only recipient (PhoneNumber set, EmailAddress null) of a NotificationType.Sms list. Even if list distribution were intended, the SMS recipient set would be lost — but since lists must not be distributed at all, this is subsumed by the primary fix.

Recommendation

Stop fetching, mapping, and shipping notification lists and SMTP configurations from the artifact deployment path. Drop the _notificationRepo queries from FetchGlobalArtifactsAsync, pass null (or empty) for the NotificationLists and SmtpConfigurations fields of DeployArtifactsCommand, and update the class XML doc to remove both from the artifact list. The message fields can remain on DeployArtifactsCommand for additive compatibility but must never be populated from central. Coordinate removal of the consuming code in SiteRuntime (SiteReplicationActor, DeploymentManagerActor, SiteNotificationRepository) in the same session per the project's "design + code + tests travel together" rule. Update the contradicting tests (see DeploymentManager-027).

Resolution

Resolved 2026-06-20 (commit fd618cf1): central FetchGlobalArtifactsAsync no longer queries or ships notification lists / SMTP configs (passes null; DeployArtifactsCommand fields kept for contract compatibility), and the site purges any already-persisted notification_lists / smtp_configurations rows (clearing the plaintext SMTP password) on both apply paths — enforcing the central-only delivery design. Verified no site runtime/delivery path reads this config.

DeploymentManager-026 — `DeployInstanceAsync` inserts a new deployment record every deploy; per-instance rows accumulate and the "current status" read has no tiebreaker


Severity	Medium
Category	Design-document adherence
Status	Resolved
Location	`src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:215-225`, `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:55-61`

Description

DeployInstanceAsync always creates a brand-new DeploymentRecord and calls _repository.AddDeploymentRecordAsync(record, …) (line 223); it never reuses or updates the instance's existing record. There is no update-in-place or delete-prior path on the deploy flow, so every successful or failed deployment of an instance leaves its predecessor row behind. Over the lifetime of the system the DeploymentRecords table grows without bound, one row per deploy attempt per instance, and the growth is amplified by the bulk "deploy all out-of-date instances" workflow and by repeated redeploys after timeouts.

This contradicts the design's "Deployment Status Persistence" section (Component-DeploymentManager.md:106-109): "Only the current deployment status per instance is stored in the configuration database ... No deployment history table — the audit log (via IAuditService) already captures every deployment action." The audit log is the history; the deployment-record table is supposed to hold only the current status. The implementation instead keeps an ad-hoc, unindexed history there.

The accumulation also makes the reconciliation read order-sensitive. TryReconcileWithSiteAsync reads the "prior" record via GetCurrentDeploymentStatusAsync, which is OrderByDescending(d => d.DeployedAt).FirstOrDefault() with no secondary sort key (e.g. ThenByDescending(d => d.Id)). DeployedAt is a DateTimeOffset stamped with DateTimeOffset.UtcNow at record creation; two records inserted within the same clock tick (rapid redeploy, or a redeploy immediately after a timed-out attempt) tie on DeployedAt, and SQL Server's choice between equal sort keys is undefined. Reconciliation could then read the wrong prior record (e.g. an older Success instead of the latest stuck InProgress), skipping the intended site query, or vice-versa.

Recommendation

Either (a) make the deploy path upsert the instance's single current record (update-in-place when one exists, insert only on first deploy) so the table holds exactly one row per instance per the design, or (b) if multiple rows are deliberately retained, add a deterministic tiebreaker to GetCurrentDeploymentStatusAsync (OrderByDescending(d => d.DeployedAt) .ThenByDescending(d => d.Id)) and document the retention/cleanup story so the table does not grow unbounded. Option (a) aligns with the design and is preferred.
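Option (b)'s tiebreaker can be illustrated with an in-memory sketch. DeploymentRecordRow and TiebreakerSketch are hypothetical, and the example assumes Id is a monotonically increasing key so that a higher Id means a later insert. It runs on LINQ to Objects, whose sorts are stable, so without the second key the tie here would resolve by list order; against SQL Server no such guarantee exists, which is the problem described above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down shape of a deployment record row.
public sealed record DeploymentRecordRow(int Id, string Status, DateTimeOffset DeployedAt);

public static class TiebreakerSketch
{
    // Newest DeployedAt wins; rows that tie on DeployedAt fall back to the
    // highest Id (assumed to be the most recently inserted row).
    public static DeploymentRecordRow? GetCurrent(IEnumerable<DeploymentRecordRow> rows) =>
        rows.OrderByDescending(d => d.DeployedAt)
            .ThenByDescending(d => d.Id)
            .FirstOrDefault();

    public static void Main()
    {
        var tick = new DateTimeOffset(2026, 6, 20, 12, 0, 0, TimeSpan.Zero);
        var rows = new[]
        {
            new DeploymentRecordRow(41, "Success", tick),
            new DeploymentRecordRow(42, "InProgress", tick), // same tick, inserted later
            new DeploymentRecordRow(40, "Success", tick.AddMinutes(-5)),
        };

        Console.WriteLine(GetCurrent(rows)?.Status); // prints "InProgress"
    }
}
```

The tiebreaker only makes the read deterministic; it does nothing about row accumulation, which is why option (a) is still the one that matches the design.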

Resolution

Resolved 2026-06-20 (commit fd618cf1): added a deterministic .ThenByDescending(d => d.Id) tiebreaker to GetCurrentDeploymentStatusAsync so same-tick deployment records resolve to the newest row. Insert-per-deploy behaviour unchanged (history-vs-upsert remains a separate design question).

DeploymentManager-027 — Artifact tests assert that notification lists and SMTP configs are shipped, cementing the DeploymentManager-025 design violation


Severity	Low
Category	Testing coverage
Status	Resolved
Location	`tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:173-174,200-201`

Description

ArtifactDeploymentServiceTests asserts, via NSubstitute, that the artifact deployment path queries the forbidden artifact sets exactly once per deployment:

await _notificationRepo.Received(1).GetAllNotificationListsAsync(Arg.Any<CancellationToken>());
await _notificationRepo.Received(1).GetAllSmtpConfigurationsAsync(Arg.Any<CancellationToken>());

(both in the DeployToAllSitesAsync global-query-hoisting test at 173-174 and the RetryForSiteAsync test at 200-201). These assertions pin the exact behaviour the current design forbids (DeploymentManager-025): they will keep the service shipping notification lists and SMTP configs to sites and will actively block the fix — removing the queries makes these Received(1) assertions fail. Tests that lock in a design violation are worse than no test, because they make the correct change look like a regression.

Recommendation

When DeploymentManager-025 is fixed, change these to DidNotReceive().GetAllNotificationListsAsync(...) / DidNotReceive().GetAllSmtpConfigurationsAsync(...) (and assert the DeployArtifactsCommand's NotificationLists / SmtpConfigurations fields are null/empty) so the tests enforce the central-only design instead of contradicting it.
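As a rough picture of what the flipped assertions enforce, the sketch below uses a hand-rolled spy in place of NSubstitute's DidNotReceive() so that it stays dependency-free. NotificationRepoSpy, CentralOnlySketch, and the string-typed fields are hypothetical simplifications; only the two field names on DeployArtifactsCommand come from this review.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hand-rolled spy standing in for the mocked notification repository.
public sealed class NotificationRepoSpy
{
    public int ListQueries { get; private set; }
    public int SmtpQueries { get; private set; }

    public IReadOnlyList<string> GetAllNotificationLists()
    {
        ListQueries++;
        return Array.Empty<string>();
    }

    public IReadOnlyList<string> GetAllSmtpConfigurations()
    {
        SmtpQueries++;
        return Array.Empty<string>();
    }
}

// Simplified command: the two central-only fields remain on the contract
// for compatibility but are never populated from central.
public sealed record DeployArtifactsCommand(
    IReadOnlyList<string> SharedScripts,
    IReadOnlyList<string>? NotificationLists,
    IReadOnlyList<string>? SmtpConfigurations);

public static class CentralOnlySketch
{
    // The builder still has access to the repository but must not query it.
    public static DeployArtifactsCommand Build(
        NotificationRepoSpy repo, IReadOnlyList<string> sharedScripts) =>
        new DeployArtifactsCommand(sharedScripts, NotificationLists: null, SmtpConfigurations: null);

    public static void Main()
    {
        var spy = new NotificationRepoSpy();
        var command = Build(spy, new[] { "scriptA" });

        // Counterpart of DidNotReceive(): neither forbidden query ran.
        if (spy.ListQueries != 0 || spy.SmtpQueries != 0)
            throw new InvalidOperationException("notification data was queried");

        // The shipped command carries no notification or SMTP payload.
        if (command.NotificationLists is not null || command.SmtpConfigurations is not null)
            throw new InvalidOperationException("central-only data was shipped");

        Console.WriteLine("central-only invariants hold");
    }
}
```

Asserting both the absent queries and the null fields matters: a builder could skip the repository yet still populate the fields from a cache, and only the second check would catch that.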

Resolution

Resolved 2026-06-20 (commit fd618cf1): the artifact tests no longer assert the forbidden notification/SMTP shipping — flipped Received(1) → DidNotReceive() and added assertions that the shipped command's NotificationLists/SmtpConfigurations are null.
