Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. The 4 findings still pending are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites; DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Code Review — DeploymentManager
| Field | Value |
|---|---|
| Module | src/ZB.MOM.WW.ScadaBridge.DeploymentManager |
| Design doc | docs/requirements/Component-DeploymentManager.md |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | 4307c381 |
| Open findings | 0 |
Summary
The DeploymentManager module is small, well-structured, and clearly maps work
packages (WP-N) onto code. The happy paths for instance deployment, lifecycle
commands, artifact broadcast, and staleness comparison are implemented
sensibly, and the operation lock correctly serializes mutating operations per
instance while allowing cross-instance parallelism. However, the review found a
significant cluster of error-handling and resilience gaps: the deployment
record can be left permanently stuck in InProgress when an exception other
than timeout/cancellation is thrown, the catch block writes its failure status
using a cancellation token that may already be cancelled, and the
OperationLockManager leaks one SemaphoreSlim per instance name forever.
There are also two notable design-document adherence gaps: the
"query-the-site-before-redeploy" idempotency requirement is not implemented
(GetDeploymentStatusAsync only reads the local DB), and the "Diff View"
feature is reduced to a bare hash comparison with no added/removed/changed
detail. Configuration is not bound to appsettings.json, leaving one option
entirely dead. Test coverage stops at the communication boundary and never
exercises a successful deployment or the lifecycle success paths.
Re-review 2026-05-17 (commit 39d737e)
Re-reviewed at commit 39d737e after the batch of fixes for
DeploymentManager-001..014. All fourteen prior findings remain Resolved and
verified against source — the broadened catch, non-cancellable cleanup writes,
ref-counted OperationLockManager, query-before-redeploy reconciliation,
structured diff, options binding, and the expanded TestKit-actor test suite are
all present and correct. The module is in markedly better shape than the
first review: error paths are now defensively handled and test coverage is
broad (successful deploy/lifecycle, lock serialization, reconciliation
matrix, artifact per-site matrix).
This re-review found 3 new findings, all clustered on the
DeploymentManager-006 reconciliation path added since the last review. The
reconciliation shortcut (TryReconcileWithSiteAsync) marks a stale prior
record Success when the site already has the target revision, but it does
not perform the side effects the normal success path does — it never
updates the instance State, never refreshes the DeployedConfigSnapshot,
and never corrects the prior record's own RevisionHash (DeploymentManager-015,
DeploymentManager-016). The GetDeploymentStatusAsync XML doc is now stale —
it still describes the query-before-redeploy behaviour that actually moved into
TryReconcileWithSiteAsync (DeploymentManager-017).
Re-review 2026-05-28 (commit 1eb6e97)
Re-reviewed at commit 1eb6e97 after the DeploymentManager-015/016/017 fixes
and a docs-only XML-comment pass. The three prior findings remain Resolved
and verified — ApplyPostSuccessSideEffectsAsync is now invoked from both the
normal success path and TryReconcileWithSiteAsync, the reconciled-success
branch corrects prior.RevisionHash to the target, and GetDeploymentStatusAsync's
XML doc now describes the local-DB-read it actually performs and cross-refs the
reconciliation helper. The DiffService wiring, options binding, ref-counted
operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor
test seam are still in place. The 7 new findings here are not regressions in
the DeploymentManager-015/016 fixes — they are issues uncovered by widening
the lens to the lifecycle paths, reconciliation's interaction with
intentional Disabled state, audit semantics, and operational concerns
(per-site artifact-build cost, Pending→InProgress double-write).
The single notable correctness issue is DeploymentManager-018: the
reconciliation shortcut unconditionally sets instance.State = Enabled via
ApplyPostSuccessSideEffectsAsync. After a central failover that loses the
in-memory operation lock, a user can legitimately Disable an instance whose
prior deploy record is still InProgress; a subsequent redeploy then reconciles
and silently re-enables the instance against the user's explicit intent.
The remaining six findings are medium/low: lifecycle-timeout audit gap
(DeploymentManager-019), audit-user attribution in reconciliation
(DeploymentManager-020), silent fallback in ResolveSiteIdentifierAsync
(DeploymentManager-021), back-to-back Pending→InProgress writes
(DeploymentManager-022), per-site re-query of system-wide artifacts
(DeploymentManager-023), and shared static state across *ProbeActor tests
(DeploymentManager-024).
Re-review 2026-06-20 (commit 4307c381) — full review
Re-reviewed the whole current module at HEAD after the rename, the
cert-broadcast / Transport IStaleInstanceProbe work, and milestone changes.
DeploymentManager-001..024 all remain Resolved and verified against source —
the ref-counted OperationLockManager, the broadened/non-cancellable failure
writes, the ApplyPostSuccessSideEffectsAsync shared helper with
forceEnabledState (Disabled-preservation), the lifecycle-timeout audit helper,
the structured diff with List-value normalization, the hoisted global artifact
fetch, and the instance-state-aware reconciliation are all present and correct.
The two flagged cross-module/architectural seams the prompt called out — the
TrustServerCert/RemoveServerCert broadcast-to-both-nodes and the
DeploymentManagerActor deploy-state query handler — live in SiteRuntime /
Communication / CentralUI, not this module, so they are out of scope here.
This review found 3 new findings. The material one is DeploymentManager-025:
the system-wide artifact path still fetches and broadcasts notification lists
and SMTP configurations (including SMTP credentials) to every site, in direct
contradiction of the now-explicit design decision that these are central-only
and "no SMTP credential is ever distributed to sites" (Component-DeploymentManager.md
lines 142-146; CLAUDE.md notification-central-only decision). This supersedes
the earlier accepted-deployable-artifact framing of the closed
DeploymentManager-013. The remaining two are DeploymentManager-026 (deployment
records are insert-only: every deploy adds a new row per instance, contradicting
"only current status stored, no history table", and the same-tick
OrderByDescending(DeployedAt) read has no tiebreaker) and DeploymentManager-027
(the artifact tests assert the forbidden notification-list/SMTP shipping,
cementing the DeploymentManager-025 violation).
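The missing tiebreaker in DeploymentManager-026 can be illustrated with a small sketch. This is only one possible way to make the "latest record" read deterministic, not the module's actual query: `DeploymentRowSketch` is a hypothetical row shape, and the sketch assumes a monotonically increasing `Id` that can break ties between rows sharing the same `DeployedAt` tick.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape; the real DeploymentRecord entity is not shown in this review.
public sealed record DeploymentRowSketch(int Id, DateTimeOffset DeployedAt, string Status);

public static class LatestRecordSketch
{
    // Returns the most recent row, or null when there are none.
    public static DeploymentRowSketch? Latest(IEnumerable<DeploymentRowSketch> rows) =>
        rows.OrderByDescending(r => r.DeployedAt)
            // Assumption: Id grows with insertion order, so it disambiguates
            // rows written within the same timestamp tick.
            .ThenByDescending(r => r.Id)
            .FirstOrDefault();
}
```

Without the secondary key, two rows written in the same tick can come back in either order, so which one counts as "current" would depend on storage order rather than on the data.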
Checklist coverage
Re-review 2026-06-20 (commit 4307c381)
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ✓ | New: deployment records are insert-only — DeployInstanceAsync Adds a new row per deploy; reconciliation's GetCurrentDeploymentStatusAsync orders by DeployedAt with no tiebreaker (DeploymentManager-026). |
| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. The deploy-state-query/cert-broadcast actors live in SiteRuntime, out of scope. No issues. |
| 3 | Concurrency & thread safety | ✓ | OperationLockManager ref-counting + gate re-verified; DeployToAllSitesAsync prebuilds per-site commands before the parallel phase (no shared DbContext under Task.WhenAll). No issues. |
| 4 | Error handling & resilience | ✓ | Failure-status writes use CancellationToken.None; lifecycle timeouts now audit; delete-orphan path surfaced. No new issues. |
| 5 | Security | ✓ | New: SMTP credentials are still serialized into the per-site artifact command and broadcast to every site, which the current design forbids outright (DeploymentManager-025). |
| 6 | Performance & resource management | ✓ | Global artifact queries hoisted (DM-023 resolved). Deployment-record row growth is unbounded per instance (part of DeploymentManager-026). |
| 7 | Design-document adherence | ✓ | New: notification lists + SMTP configs are still treated as deployable artifacts, contradicting the "central-only, never distributed to sites" design (DeploymentManager-025). |
| 8 | Code organization & conventions | ✓ | Options bound via Host; OptionsSection constant correct. No new issues. |
| 9 | Testing coverage | ✓ | Broad and current. New: artifact tests assert the forbidden notif/SMTP shipping (DeploymentManager-027). |
| 10 | Documentation & comments | ✓ | ArtifactDeploymentService class XML doc still lists notification lists + SMTP as broadcast artifacts (stale vs design — folded into DeploymentManager-025). |
Re-review 2026-05-28 (commit 1eb6e97)
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ✓ | New: reconciliation forces Enabled even if the user disabled the instance in between (DeploymentManager-018). |
| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. No issues. |
| 3 | Concurrency & thread safety | ✓ | OperationLockManager ref-counting verified. Note: test probes hold static state (DeploymentManager-024) — a test concern, not production code. |
| 4 | Error handling & resilience | ✓ | New: Disable/Enable/Delete timeouts return early without writing any audit entry — deploy has DeployFailed, lifecycle has nothing (DeploymentManager-019). |
| 5 | Security | ✓ | No new issues. SMTP credential decision documented (DeploymentManager-013 closed). |
| 6 | Performance & resource management | ✓ | New: BuildDeployArtifactsCommandAsync re-queries every system-wide artifact set per site in DeployToAllSitesAsync (DeploymentManager-023). |
| 7 | Design-document adherence | ✓ | Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on Disabled-state preservation. |
| 8 | Code organization & conventions | ✓ | New: redundant Pending→InProgress back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in ResolveSiteIdentifierAsync (DeploymentManager-021). |
| 9 | Testing coverage | ✓ | New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024). |
| 10 | Documentation & comments | ✓ | New: DeployReconciled audit uses prior.DeployedBy instead of the current user parameter — misleading for forensics (DeploymentManager-020). |
Findings
DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in InProgress
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:141-199 |
Description
DeployInstanceAsync sets the record to InProgress (lines 137-139), then the
try block calls into CommunicationService and the repository. The only
catch filter is when (ex is TimeoutException or OperationCanceledException).
Any other exception — InvalidOperationException (thrown by
CommunicationService.GetCommunicationActor() when the actor is not set), a
JSON serialization error, a deserialization failure of the response, a DB
exception on UpdateDeploymentRecordAsync, or any transport error — escapes the
method. The deployment record remains in DeploymentStatus.InProgress
permanently. Because staleness and the UI both read current status, the
instance is then misreported as "deploying" forever and a re-deploy may be
blocked or misinterpreted. The design explicitly states an interrupted
deployment must be "treated as failed".
Recommendation
Broaden the catch to a general catch (Exception ex) that records
DeploymentStatus.Failed with the error message, audit-logs the failure, and
re-throws or returns a failed Result. Keep the timeout-specific branch only
if a distinct message is desired. Ensure the failure-status write happens for
every exit path out of the try.
Resolution
Resolved 2026-05-16 (commit <pending>): broadened the catch in
DeployInstanceAsync to catch (Exception ex) so any exception (transport,
serialization, DB, InvalidOperationException from an uninitialized
CommunicationService) marks the deployment record Failed with the error
message and audit-logs the failure, instead of escaping and leaving the record
stuck in InProgress. Regression test:
DeployInstanceAsync_CommunicationThrowsUnexpectedException_RecordMarkedFailed.
DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:186-196 |
Description
The catch (Exception ex) when (ex is TimeoutException or OperationCanceledException) block updates the record to Failed and calls
UpdateDeploymentRecordAsync/SaveChangesAsync/LogAsync passing the same
cancellationToken that was just cancelled (an OperationCanceledException
caught here means the token is already in the cancelled state). Those
repository and audit calls will themselves throw OperationCanceledException
before the failure status is persisted, so the record stays InProgress — the
exact bug DeploymentManager-001 describes, reached via the supposedly-handled
path.
Recommendation
Perform the cleanup writes with a fresh, non-cancellable token (e.g.
CancellationToken.None, optionally with an independent short timeout) so the
failure status is durably recorded even when the original operation was
cancelled or timed out.
Resolution
Resolved 2026-05-16 (commit <pending>): the broadened catch block now
performs the failure-status write (UpdateDeploymentRecordAsync,
SaveChangesAsync) and the audit LogAsync with CancellationToken.None
instead of the operation's (possibly-cancelled) token, so the Failed status
is durably recorded even after a timeout/cancellation. The cleanup writes are
themselves wrapped in a try/catch that logs (without masking the original
error) if persistence still fails. Regression test:
DeployInstanceAsync_FailureWrite_UsesNonCancellableToken.
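The token problem and its fix can be illustrated with a minimal, self-contained sketch. `FakeRepository` and `SaveStatusAsync` are hypothetical stand-ins, not the module's repository API; only the token handling in the catch block mirrors the resolution described above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the repository; not the module's real API.
public sealed class FakeRepository
{
    public string Status { get; private set; } = "InProgress";

    public Task SaveStatusAsync(string status, CancellationToken ct)
    {
        // Like most async I/O, refuse to start once the token is cancelled.
        ct.ThrowIfCancellationRequested();
        Status = status;
        return Task.CompletedTask;
    }
}

public static class FailureWriteSketch
{
    public static async Task<string> DeployAsync(FakeRepository repo, CancellationToken ct)
    {
        try
        {
            // Simulates the site call observing a cancellation or timeout.
            ct.ThrowIfCancellationRequested();
            await repo.SaveStatusAsync("Success", ct);
        }
        catch (Exception ex)
        {
            try
            {
                // Cleanup must not reuse 'ct': it may already be cancelled,
                // which would make this write throw before persisting.
                await repo.SaveStatusAsync("Failed", CancellationToken.None);
            }
            catch (Exception cleanupEx)
            {
                // Report the cleanup failure without masking the original error.
                Console.Error.WriteLine(
                    $"Cleanup failed: {cleanupEx.Message}; original: {ex.Message}");
            }
        }
        return repo.Status;
    }
}
```

Passing the original token to the cleanup write would throw `OperationCanceledException` inside the catch block and leave the status at `InProgress`, which is the behaviour this finding describes.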
DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:155-170 |
Description
After a successful site response the code calls UpdateDeploymentRecordAsync
(no SaveChanges yet), then UpdateInstanceAsync, then
StoreDeployedSnapshotAsync (which itself issues Add/Update calls), then a
single SaveChangesAsync at line 170. If StoreDeployedSnapshotAsync throws,
the exception is not caught (see DeploymentManager-001) and the
SaveChangesAsync never runs — the instance state, deployment status, and
snapshot are all left unpersisted even though the site has actually applied the
deployment. Central and site are now divergent: the site is running the new
config but central still shows the old state and a non-Success deployment
record.
Verification: Confirmed against source. The DeploymentManager-001 fix made
this strictly worse, not better — after that fix a snapshot-store failure is
caught and the record is flipped from Success back to Failed, so central
reports a failed deployment while the site is running the new config.
Recommendation
Wrap the post-success persistence so that, at minimum, the deployment record's
Success status is committed. Consider committing the status first, then the
instance state and snapshot, so a later failure does not lose the fact that the
site succeeded. Log loudly if the snapshot write fails after a confirmed site
apply.
Resolution
Resolved 2026-05-16 (commit pending): DeployInstanceAsync now commits the
deployment record's terminal status (UpdateDeploymentRecordAsync +
SaveChangesAsync) immediately after the site confirms the apply, before
touching instance state or the deployed-config snapshot. The post-success
instance-state update and StoreDeployedSnapshotAsync are wrapped in a
best-effort try/catch that logs loudly for operator reconciliation but no
longer flips the already-committed Success record back to Failed.
Regression test:
DeployInstanceAsync_SiteSucceeds_SnapshotWriteFails_RecordStillCommittedSuccess.
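The ordering described in the resolution (commit the terminal status first, then treat everything else as best-effort) can be sketched as follows. The delegates `commitStatus` and `sideEffects` are hypothetical placeholders for the status write and for the instance-state and snapshot updates; the real method signatures are not shown in this review.

```csharp
using System;
using System.Collections.Generic;

public static class CommitOrderSketch
{
    // 'commitStatus' and 'sideEffects' are hypothetical delegates standing in for
    // the status write and the instance-state / snapshot updates.
    public static string FinishDeployment(
        Action<string> commitStatus,
        Action sideEffects,
        List<string> log)
    {
        // 1. Durably record what the site confirmed before anything else can fail.
        commitStatus("Success");

        try
        {
            // 2. Best-effort follow-up work.
            sideEffects();
        }
        catch (Exception ex)
        {
            // 3. Log for operator reconciliation; do not revert the committed status.
            log.Add($"Post-success side effect failed: {ex.Message}");
        }

        return "Success";
    }
}
```

The point of the ordering is that a later failure can no longer erase the fact that the site applied the deployment; at worst it leaves a logged inconsistency for an operator to reconcile.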
DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:312-319 |
Description
In DeleteInstanceAsync, when the site responds Success the code calls
_repository.DeleteInstanceAsync then SaveChangesAsync. If SaveChangesAsync
throws (DB error, concurrency), the exception propagates uncaught: the site has
already destroyed the Instance Actor and removed its config, but the central
instance record still exists. The instance is now un-deletable through the
normal path (the site no longer has it, so a re-issued delete may fail) and is
permanently orphaned. The design states central must not mark the instance
deleted until the site confirms — but it does not address the inverse failure.
Verification: Confirmed against source. DeleteInstanceAsync has no
try/catch around the post-success block, so any exception from
DeleteInstanceAsync/SaveChangesAsync escapes uncaught to the caller.
Recommendation
Catch persistence failures in the post-success block and surface a distinct error indicating the site succeeded but the central record could not be removed, so an operator/retry can reconcile. Consider making the central delete idempotent and retryable independently of the site command.
Resolution
Resolved 2026-05-16 (commit pending): the post-success removal in
DeleteInstanceAsync (DeleteInstanceAsync + SaveChangesAsync) is now
wrapped in a try/catch. A persistence failure no longer escapes uncaught —
it is logged, recorded with a DeleteOrphaned audit entry, and surfaced as a
distinct Result failure stating the site deleted the instance but the central
record is orphaned and must be reconciled. Regression test:
DeleteInstanceAsync_SiteSucceeds_CentralDeleteFails_ReturnsDistinctFailure.
DeploymentManager-005 — OperationLockManager leaks a SemaphoreSlim per instance name
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/OperationLockManager.cs:15-33 |
Description
AcquireAsync does _locks.GetOrAdd(instanceUniqueName, _ => new SemaphoreSlim(1, 1)) and entries are never removed. Every distinct instance
unique name that is ever deployed/disabled/enabled/deleted permanently adds a
SemaphoreSlim (an IDisposable holding a kernel wait handle) to the
dictionary. Over the lifetime of a long-running central process — especially
with the bulk "deploy all out-of-date instances" workflow and instances that
are created and deleted over time — this is an unbounded leak of both managed
memory and OS handles. Deleted instances' semaphores are never reclaimed.
Verification: Confirmed against source. _locks is a ConcurrentDictionary
with no removal path anywhere in the type.
Recommendation
Either accept the leak explicitly and document the expected bounded cardinality
of instance names, or implement reclamation: e.g. ref-count handles and remove
and Dispose() the semaphore when the count reaches zero and the lock is free. At
minimum, remove the semaphore entry when an instance is deleted
(DeleteInstanceAsync).
Resolution
Resolved 2026-05-16 (commit pending): OperationLockManager now ref-counts each
lock entry. A reference is reserved (creating the entry if needed) before the
SemaphoreSlim.WaitAsync, so concurrent waiters for the same instance share one
semaphore and the entry survives until every waiter/holder has released. When
the reference count reaches zero — on release, timeout, or cancellation — the
entry is removed from the dictionary and the semaphore is Dispose()d, so the
process no longer accumulates one kernel wait handle per distinct instance name.
A TrackedLockCount diagnostic property was added to make reclamation testable.
Regression tests: AcquireAsync_ReleasedLock_RemovesSemaphoreEntry,
AcquireAsync_ManyDistinctInstances_DoesNotAccumulateSemaphores,
AcquireAsync_ContendedLock_KeepsSemaphoreUntilLastReleaseThenReclaims.
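A minimal sketch of the ref-counting scheme described above, under stated assumptions: `RefCountedLockManager` and its internals are illustrative names rather than the module's implementation, and a plain `Dictionary` guarded by one lock is used here for simplicity. Only `TrackedLockCount` and the reserve-before-wait ordering are taken from the resolution text.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class RefCountedLockManager
{
    private sealed class Entry
    {
        public readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);
        public int RefCount;
    }

    private readonly object _gate = new object();
    private readonly Dictionary<string, Entry> _locks = new Dictionary<string, Entry>();

    // Diagnostic: number of keys that currently have a holder or a waiter.
    public int TrackedLockCount
    {
        get { lock (_gate) { return _locks.Count; } }
    }

    public async Task<IDisposable> AcquireAsync(
        string key, TimeSpan timeout, CancellationToken ct = default)
    {
        Entry entry;
        lock (_gate)
        {
            // Reserve a reference BEFORE waiting so the entry cannot be
            // reclaimed while this caller is still queued on the semaphore.
            if (!_locks.TryGetValue(key, out var existing))
            {
                existing = new Entry();
                _locks[key] = existing;
            }
            existing.RefCount++;
            entry = existing;
        }

        bool acquired;
        try
        {
            acquired = await entry.Semaphore.WaitAsync(timeout, ct);
        }
        catch
        {
            // Cancelled while waiting: give the reference back.
            Unreference(key, entry, releaseSemaphore: false);
            throw;
        }

        if (!acquired)
        {
            Unreference(key, entry, releaseSemaphore: false);
            throw new TimeoutException($"Lock '{key}' was not acquired within {timeout}.");
        }

        return new Releaser(this, key, entry);
    }

    private void Unreference(string key, Entry entry, bool releaseSemaphore)
    {
        lock (_gate)
        {
            if (releaseSemaphore) entry.Semaphore.Release();
            if (--entry.RefCount == 0)
            {
                // No holder and no waiter remains: reclaim the entry.
                _locks.Remove(key);
                entry.Semaphore.Dispose();
            }
        }
    }

    private sealed class Releaser : IDisposable
    {
        private readonly RefCountedLockManager _owner;
        private readonly string _key;
        private readonly Entry _entry;
        private int _disposed;

        public Releaser(RefCountedLockManager owner, string key, Entry entry)
        {
            _owner = owner;
            _key = key;
            _entry = entry;
        }

        public void Dispose()
        {
            // Guard against double-dispose releasing the semaphore twice.
            if (Interlocked.Exchange(ref _disposed, 1) == 0)
                _owner.Unreference(_key, _entry, releaseSemaphore: true);
        }
    }
}
```

A caller would typically wrap the handle in a `using` statement so the reference is returned on every exit path. Because the count is only decremented under the same gate that creates entries, a new waiter can never pick up a semaphore that is about to be disposed.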
DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented
| Field | Value |
|---|---|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:84-200,363-368 |
Description
The design ("Deployment Identity & Idempotency") requires: "After a central
failover or timeout, the Deployment Manager queries the site for current
deployment state before allowing a re-deploy. This prevents duplicate
application and out-of-order config changes." The code never does this.
GetDeploymentStatusAsync only reads the local DeploymentRecord from the DB
(GetDeploymentByDeploymentIdAsync) — it does not contact the site.
DeployInstanceAsync unconditionally generates a new deployment ID and sends a
new DeployInstanceCommand regardless of any prior in-flight or timed-out
deployment. After a timeout where the site actually applied the config, a
re-deploy produces a second deployment with no reconciliation against the
site's current revision hash. Site-side stale-rejection is the only safety
net, and that is not verified here.
Recommendation
Add a site query (a new CommunicationService pattern returning the site's
currently-applied deployment ID / revision hash) and call it before re-deploy
when a prior record for the instance is in InProgress/Failed due to
timeout. Reconcile: if the site already has the target revision, mark the prior
record Success instead of re-sending. Either implement this or update the
design doc to reflect that reconciliation is delegated entirely to site-side
stale-rejection.
Resolution
Resolved 2026-05-16 (commit <pending>): implemented the cross-module
query-the-site-before-redeploy idempotency feature across Commons, SiteRuntime,
Communication, and DeploymentManager — new DeploymentStateQueryRequest /
DeploymentStateQueryResponse contracts, a DeploymentManagerActor handler
answering from the site's deployed-config store, a
CommunicationService.QueryDeploymentStateAsync method routed over the
ClusterClient command/control transport, and reconciliation in
DeployInstanceAsync (TryReconcileWithSiteAsync) that queries the site only
when a prior record is InProgress or Failed due to a timeout, marks the
prior record Success without re-sending if the site already has the target
revision hash, and falls through to a normal deploy (relying on site-side
stale-rejection) when the query fails. Regression tests:
RoundTrip_DeploymentStateQueryRequest_Succeeds,
RoundTrip_DeploymentStateQueryResponse_Deployed_Succeeds,
RoundTrip_DeploymentStateQueryResponse_NotDeployed_NullApplied,
DeploymentStateQuery_DeployedInstance_ReturnsAppliedIdentity,
DeploymentStateQuery_UnknownInstance_ReturnsNotDeployed,
DeploymentStateQuery_ForwardedToDeploymentManager,
QueryDeploymentStateAsync_BeforeInitialization_Throws,
QueryDeploymentStateAsync_SendsEnvelopeAndReturnsResponse,
DeployInstanceAsync_PriorInProgressRecord_SiteHasTargetHash_MarksSuccessWithoutRedeploy,
DeployInstanceAsync_PriorInProgressRecord_SiteHasDifferentHash_ProceedsWithDeploy,
DeployInstanceAsync_PriorFailedTimeoutRecord_QueriesSite,
DeployInstanceAsync_PriorSuccessRecord_SkipsSiteQuery,
DeployInstanceAsync_FreshFirstTimeDeploy_SkipsSiteQuery,
DeployInstanceAsync_PriorInProgressRecord_QueryFails_FallsThroughToDeploy.
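The reconciliation rules in the resolution reduce to a small decision function, sketched below. The enums, `ReconcileSketch`, and the `querySite` delegate are hypothetical names introduced for illustration; the real TryReconcileWithSiteAsync works against the repository and CommunicationService, which are not shown here.

```csharp
#nullable enable
using System;

// Hypothetical classification of the prior record for this instance.
public enum PriorStatus { None, Success, InProgress, FailedTimeout, FailedOther }

public enum ReconcileAction { Deploy, MarkPriorSuccess }

public static class ReconcileSketch
{
    // Only an interrupted or timed-out prior deploy leaves the site's state unknown.
    public static bool ShouldQuerySite(PriorStatus prior) =>
        prior == PriorStatus.InProgress || prior == PriorStatus.FailedTimeout;

    // 'querySite' is a hypothetical delegate returning the revision hash the site
    // has applied (null when nothing is deployed); it may throw if the site is
    // unreachable.
    public static ReconcileAction Decide(
        PriorStatus prior, string targetHash, Func<string?> querySite)
    {
        if (!ShouldQuerySite(prior)) return ReconcileAction.Deploy;

        string? applied;
        try
        {
            applied = querySite();
        }
        catch (Exception)
        {
            // Query failed: fall through to a normal deploy and rely on
            // site-side stale-rejection.
            return ReconcileAction.Deploy;
        }

        return string.Equals(applied, targetHash, StringComparison.Ordinal)
            ? ReconcileAction.MarkPriorSuccess
            : ReconcileAction.Deploy;
    }
}
```

Each branch corresponds to one of the regression tests listed above: a fresh or previously successful instance skips the query, a matching hash marks the prior record Success without re-sending, and a differing hash or failed query proceeds with the deploy.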
DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:334-358,401-406 |
Description
The design ("Diff View" and "Dependencies" sections) states the Deployment
Manager can request a diff from the Template Engine showing added/removed
members, changed values, and connection-binding changes.
GetDeploymentComparisonAsync and DeploymentComparisonResult only compare two
revision hashes and return a boolean IsStale plus the two hashes. No
added/removed/changed detail is produced, and the Template Engine's diff
capability is not invoked. The UI cannot render a meaningful diff from this
result.
Verification: Confirmed against source. The Template Engine already provides
DiffService + ConfigurationDiff (structured Added/Removed/Changed entries
for attributes, alarms, and scripts, including data connection binding fields),
and DiffService is DI-registered — it was simply never wired into the
Deployment Manager's comparison path.
Recommendation
Either implement a real diff (deserialize the stored
DeployedConfigSnapshot.ConfigurationJson and the freshly flattened config and
invoke the Template Engine's diff service, surfacing structured
added/removed/changed entries), or revise the design doc to scope the feature
down to staleness detection only.
Resolution
Resolved 2026-05-16 (commit pending): GetDeploymentComparisonAsync now
deserializes the stored DeployedConfigSnapshot.ConfigurationJson and runs the
Template Engine DiffService against the freshly flattened current
configuration, attaching the resulting ConfigurationDiff (added/removed/changed
attributes, alarms, scripts) to a new optional Diff property on
DeploymentComparisonResult. DiffService is injected into DeploymentService.
A snapshot that cannot be deserialized (corrupt / older schema) still yields the
hash-based staleness result with a null diff, logged at warning level.
Regression test: GetDeploymentComparisonAsync_ProducesStructuredDiff.
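To make the difference between a bare hash comparison and a structured diff concrete, here is a simplified sketch over flat key/value configurations. It is not the Template Engine's DiffService or its ConfigurationDiff type; `DiffSketchResult` and `DiffSketch` are hypothetical, and real configurations carry richer attribute, alarm, and script entries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical result shape: keys grouped by the kind of change.
public sealed record DiffSketchResult(
    IReadOnlyList<string> Added,
    IReadOnlyList<string> Removed,
    IReadOnlyList<string> Changed)
{
    public bool HasChanges => Added.Count + Removed.Count + Changed.Count > 0;
}

public static class DiffSketch
{
    public static DiffSketchResult Compare(
        IReadOnlyDictionary<string, string> deployed,
        IReadOnlyDictionary<string, string> current)
    {
        // Present now, absent from the deployed snapshot.
        var added = current.Keys
            .Where(k => !deployed.ContainsKey(k))
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();

        // Present in the deployed snapshot, absent now.
        var removed = deployed.Keys
            .Where(k => !current.ContainsKey(k))
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();

        // Present in both, with a different value.
        var changed = current
            .Where(kv => deployed.TryGetValue(kv.Key, out var old) && old != kv.Value)
            .Select(kv => kv.Key)
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();

        return new DiffSketchResult(added, removed, changed);
    }
}
```

A hash comparison can only answer whether `HasChanges` would be true; the three lists are what a UI needs to render which members were added, removed, or modified.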
DeploymentManager-008 — DeploymentManagerOptions is never bound to configuration
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ServiceCollectionExtensions.cs:7-14 |
Description
AddDeploymentManager registers the services but never calls
services.Configure<DeploymentManagerOptions>(configuration.GetSection(...)).
IOptions<DeploymentManagerOptions> therefore always resolves to a
default-constructed instance — the operation-lock and artifact-deployment
timeouts cannot be tuned via appsettings.json, contrary to the CLAUDE.md
convention "Per-component configuration via appsettings.json sections bound
to options classes (Options pattern)." Host/Program.cs binds
SecurityOptions and InboundApiOptions from configuration sections but has
no equivalent for DeploymentManagerOptions.
Verification: Confirmed against source. Neither AddDeploymentManager nor
Host/Program.cs binds DeploymentManagerOptions.
Recommendation
Add an IConfiguration parameter (or a configure callback) to
AddDeploymentManager and bind DeploymentManagerOptions to a section such as
ScadaBridge:DeploymentManager, consistent with the other components.
Resolution
Resolved 2026-05-16 (commit pending): AddDeploymentManager() now calls
services.AddOptions<DeploymentManagerOptions>() so IOptions<DeploymentManagerOptions>
is always resolvable, and Host/Program.cs binds the
ScadaBridge:DeploymentManager section (exposed as
ServiceCollectionExtensions.OptionsSection) via
services.Configure<DeploymentManagerOptions>(...) — the same pattern the Host
uses for SecurityOptions/InboundApiOptions. An earlier attempt added an
AddDeploymentManager(IConfiguration) overload; that was reverted because the
project convention (enforced by Host.Tests.OptionsTests) forbids component
Add* methods from depending on IConfiguration — the Host owns
configuration binding. Regression tests:
AddDeploymentManager_RegistersResolvableOptions_WithDefaults,
AddDeploymentManager_OptionsBindToConfigurationSection_AsTheHostWires,
OptionsSection_MatchesTheConventionalComponentSectionPath.
DeploymentManager-009 — Misleading timeout comment on DeleteInstanceAsync
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:288 |
Description
The XML doc says "Delete fails if site unreachable (30s timeout via
CommunicationOptions)." The actual delete timeout is whatever
CommunicationOptions.LifecycleTimeout is configured to (passed inside
CommunicationService.DeleteInstanceAsync); the "30s" figure is hard-coded
into the comment and not derived from any constant in this module. If
LifecycleTimeout is reconfigured, the comment becomes wrong. It also wrongly
implies the value lives in this module.
Verification: Confirmed against source. The DeleteInstanceAsync XML doc
quoted a hard-coded "30s" value.
Recommendation
Reword to "Delete fails if the site is unreachable within
CommunicationOptions.LifecycleTimeout" without quoting a specific number.
Resolution
Resolved 2026-05-16 (commit pending): the DeleteInstanceAsync XML doc no
longer quotes a hard-coded "30s" — it now states delete fails if the site is
unreachable within CommunicationOptions.LifecycleTimeout (and notes the
deadline is applied inside CommunicationService.DeleteInstanceAsync).
Documentation-only change; no regression test (a test asserting comment text
would be meaningless).
DeploymentManager-010 — SystemArtifactDeploymentRecord does not persist the deployment ID
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:136,194-211 |
Description
DeployToAllSitesAsync generates a deploymentId (line 136) and returns it in
the ArtifactDeploymentSummary and audit log, but the persisted
SystemArtifactDeploymentRecord has no field for it (the entity only has Id,
ArtifactType, DeployedBy, DeployedAt, PerSiteStatus). The deployment ID
that appears in the UI summary and audit log cannot be correlated back to the
stored record. Additionally each per-site DeployArtifactsCommand carries its
own separate GUID (BuildDeployArtifactsCommandAsync line 114), so there are in
fact N+1 unrelated IDs for one logical artifact deployment.
Verification: Confirmed against source. Each per-site command minted its own GUID and the persisted record had no way to reference the logical id.
Recommendation
Add a DeploymentId column to SystemArtifactDeploymentRecord and store the
single logical deploymentId; reuse that ID (or a derived per-site ID) for the
per-site commands so the audit log, UI summary, and persisted record agree.
Resolution
Resolved 2026-05-16 (commit pending): BuildDeployArtifactsCommandAsync now
accepts an optional deploymentId, and DeployToAllSitesAsync passes the one
logical deploymentId to every per-site command — so the per-site commands,
the audit log, and the UI summary all reference a single id instead of N+1
unrelated GUIDs (RetryForSiteAsync, an independent single-site retry, still
mints its own id). Adding a dedicated DeploymentId column to
SystemArtifactDeploymentRecord was deliberately not done: that entity
lives in ZB.MOM.WW.ScadaBridge.Commons with its EF mapping in
ZB.MOM.WW.ScadaBridge.ConfigurationDatabase, both outside this module's edit scope.
Instead the logical deploymentId is embedded in the record's free-form
PerSiteStatus JSON payload ({ DeploymentId, Sites }), which is fully within
this module's control, so the persisted record is correlatable with the
summary/audit. A follow-up to promote it to a first-class column should be
filed against Commons/ConfigurationDatabase if a queryable index is needed.
Regression tests: DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId,
DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix,
RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits.
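The PerSiteStatus envelope described in the resolution above could look roughly like the following. This is a minimal sketch, not the module's code: PerSiteStatusEnvelope, PerSiteStatusSketch, and the string-to-string Sites map are hypothetical names and shapes chosen for illustration; only the { DeploymentId, Sites } layout comes from the finding.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical envelope mirroring the { DeploymentId, Sites } layout described
// above. The real payload types are not shown in this review.
public sealed record PerSiteStatusEnvelope(string DeploymentId, Dictionary<string, string> Sites);

public static class PerSiteStatusSketch
{
    // Serialize the single logical deployment id together with per-site outcomes.
    public static string Write(string deploymentId, Dictionary<string, string> sites) =>
        JsonSerializer.Serialize(new PerSiteStatusEnvelope(deploymentId, sites));

    // Recover the logical deployment id from a stored PerSiteStatus string so the
    // persisted record can be correlated with the summary and the audit log.
    public static string? ReadDeploymentId(string perSiteStatusJson) =>
        JsonSerializer.Deserialize<PerSiteStatusEnvelope>(perSiteStatusJson)?.DeploymentId;
}
```

Correlation then amounts to calling PerSiteStatusSketch.ReadDeploymentId on the stored PerSiteStatus value and comparing the result with the id shown in the summary. Because the id lives inside a JSON blob, it is not indexable, which is the trade-off the resolution notes.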
DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199 |
Description
DeploymentServiceTests never sets the CommunicationService actor, so every
deploy/lifecycle test deliberately stops at the InvalidOperationException
thrown by GetCommunicationActor() (see lines 118-125, 147). As a result there
is no test covering: a successful deployment (DeploymentStatus.Success
response → instance state set to Enabled, snapshot stored, audit logged); a
failed-but-handled site response; the InProgress-stuck bug
(DeploymentManager-001); successful Disable/Enable/Delete; or the operation
lock actually serializing two concurrent deploys of the same instance. The
critical post-response branch (DeploymentService.cs:154-184) and the entire
delete/disable/enable success path are untested. The AuditLogs test
(lines 277-289) asserts nothing.
Verification: Partially confirmed. By the time this finding was being
resolved, the DeploymentManager-006 fix had already introduced a TestKit-actor
seam (CreateServiceWithCommActor + ReconcileProbeActor) and successful-deploy
tests. The genuinely-still-missing coverage was: successful Disable/Enable/Delete
paths, per-instance lock serialization during deploy, and the assertionless
AuditLogs test — those gaps were addressed.
Recommendation
Introduce a seam to inject a fake/substitute communication path (e.g. an
interface over CommunicationService, or wire a TestKit actor) so success and
handled-failure paths can be unit tested. Add tests for the stuck-InProgress
scenario and for per-instance lock contention during deploy. Make the audit
test assert on IAuditService.LogAsync.
Resolution
Resolved 2026-05-16 (commit pending): extended the TestKit-actor seam
(ReconcileProbeActor now also answers lifecycle commands) and added the
missing coverage — successful Disable/Enable/Delete (state transition + audit
assertions), a successful-deploy audit assertion, and per-instance lock
serialization via a new deferred-reply SerializationProbeActor that asserts a
single instance's concurrent deploys never overlap. The assertionless AuditLogs
test was replaced with DeployInstanceAsync_FlatteningFails_DoesNotReachAudit,
which asserts on IAuditService.LogAsync. Regression tests:
DisableInstanceAsync_SiteSucceeds_SetsDisabledStateAndAudits,
EnableInstanceAsync_SiteSucceeds_SetsEnabledStateAndAudits,
DeleteInstanceAsync_SiteSucceeds_RemovesRecordAndAudits,
DeployInstanceAsync_SiteSucceeds_WritesDeployAuditEntry,
DeployInstanceAsync_FlatteningFails_DoesNotReachAudit,
DeployInstanceAsync_SameInstance_OperationLockSerializesConcurrentDeploys.
DeploymentManager-012 — LifecycleCommandTimeout option is dead code
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentManagerOptions.cs:8-9 |
Description
DeploymentManagerOptions.LifecycleCommandTimeout is declared with a 30s
default and an XML doc, but it is never read anywhere in the codebase
(lifecycle commands rely on CommunicationOptions.LifecycleTimeout inside
CommunicationService). The option misleads readers into thinking it controls
disable/enable/delete timeouts, when setting it has no effect.
Verification: Confirmed against source. A repo-wide grep found exactly one
occurrence of LifecycleCommandTimeout — the declaration itself.
Recommendation
Remove LifecycleCommandTimeout, or actually thread it through to the
lifecycle command calls (e.g. by creating a linked CTS with this timeout in
DisableInstanceAsync/EnableInstanceAsync/DeleteInstanceAsync, the way
ArtifactDeploymentTimeoutPerSite is used).
Resolution
Resolved 2026-05-16 (commit pending): LifecycleCommandTimeout is now actually
threaded through (the option exists for tuning, so it was wired up rather than
deleted). DisableInstanceAsync/EnableInstanceAsync/DeleteInstanceAsync
each create a linked CancellationTokenSource with
CancelAfter(_options.LifecycleCommandTimeout) (the same pattern ArtifactDeploymentService
uses for ArtifactDeploymentTimeoutPerSite) and pass its token to the
CommunicationService call. Each method now catches the resulting
TimeoutException/OperationCanceledException, logs a warning, and returns a
Result.Failure (previously an AskTimeoutException from a hung site escaped
uncaught). The option's XML doc was corrected to describe the real behaviour.
Regression test:
DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait
(asserts a 300 ms LifecycleCommandTimeout bounds the wait far below the 30 s
CommunicationOptions.LifecycleTimeout; confirmed to fail before the fix —
the call hung the full 30 s and threw AskTimeoutException).
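The linked-CTS deadline pattern described in this resolution can be sketched as below. It is an illustration under stated assumptions: LifecycleDeadlineSketch and RunAsync are hypothetical names, the site call is modelled as a plain delegate rather than the real CommunicationService API, and the failure is returned as a message string instead of the module's Result type.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LifecycleDeadlineSketch
{
    // Runs siteCall under a deadline that is linked to the caller's token.
    // Returns null on success, or a failure message when the deadline
    // (or the caller's own cancellation) fires first.
    public static async Task<string?> RunAsync(
        Func<CancellationToken, Task> siteCall,
        TimeSpan lifecycleCommandTimeout,
        CancellationToken outerToken)
    {
        // The linked source cancels if either the outer token or the timer fires.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outerToken);
        cts.CancelAfter(lifecycleCommandTimeout);
        try
        {
            await siteCall(cts.Token);
            return null;
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            // Convert the timeout into a handled failure instead of letting it escape.
            return $"Lifecycle command did not complete within {lifecycleCommandTimeout.TotalMilliseconds} ms.";
        }
    }
}
```

Passing ct => Task.Delay(Timeout.Infinite, ct) as the site call simulates an unresponsive site: the wait is bounded by lifecycleCommandTimeout rather than by whatever longer deadline the callee applies internally.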
DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites
| Severity | Low |
| Category | Security |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:108-111 |
Description
BuildDeployArtifactsCommandAsync maps smtp.Credentials directly into
SmtpConfigurationArtifact and that command is sent to every site. Distributing
SMTP credentials to sites is consistent with the design (SMTP configuration is
a deployable artifact), but the credentials travel inside a serialized command
across the inter-cluster transport and are stored on each site's SQLite. There
is no indication the value is encrypted at rest on the site or scrubbed from
logs. Worth confirming the transport is TLS-protected and the site stores the
credential securely; at minimum this should be a conscious, documented decision.
Recommendation
Confirm inter-cluster transport encryption covers artifact commands, ensure
Credentials is never written to logs, and document the at-rest protection of
SMTP credentials on site SQLite. Consider encrypting the credential field
within the artifact payload.
Verification (2026-05-16): Re-triaged against source. The DeploymentManager
side is clean: ArtifactDeploymentService maps SmtpConfiguration.Credentials
into the artifact (which the design explicitly mandates — SMTP configuration is
a deployable artifact) and never logs it — the three log statements in
DeployToAllSitesAsync only reference SiteId, SiteName, DeploymentId, and
ex.Message, never the credential. There is no defect to fix purely within
src/ZB.MOM.WW.ScadaBridge.DeploymentManager. The finding's remaining recommendations are
all cross-module and one needs a design decision:
- inter-cluster transport TLS: ZB.MOM.WW.ScadaBridge.Communication / ZB.MOM.WW.ScadaBridge.ClusterInfrastructure (Akka remoting + ClusterClient config);
- at-rest encryption of the credential on site SQLite: ZB.MOM.WW.ScadaBridge.SiteRuntime artifact store;
- encrypting the credential field inside the artifact payload: needs the SmtpConfigurationArtifact shape in ZB.MOM.WW.ScadaBridge.Commons plus cooperating producer (DeploymentManager) and consumer (SiteRuntime) changes, and a key-management design decision (where the encryption key lives, how it is distributed to sites) that cannot be made unilaterally here.
Status: Open — flagged. No purely-DeploymentManager fix exists; the work crosses Communication / SiteRuntime / Commons and requires a key-management design decision. Severity confirmed Low: with TLS-protected inter-cluster transport (a separate, assumed-in-place control) and no logging leak, this is a hardening item, not an active leak.
Resolution
Resolved 2026-05-16 (commit pending). Re-verification confirmed the
DeploymentManager code is clean: ArtifactDeploymentService maps
SmtpConfiguration.Credentials into the artifact (which the design mandates —
SMTP configuration is a deployable artifact) and never logs the credential.
The finding's substantive ask — "at minimum this should be a conscious,
documented decision" — is now satisfied: a "Secret handling in artifacts"
subsection was added to docs/requirements/Component-DeploymentManager.md
recording the accepted design decision and its controls — TLS-protected
inter-cluster transport in transit, no credential values in logs, and an
explicit statement that at-rest encryption of the credential field on site
SQLite is not currently applied (accepted given the transport protection and
trust boundary) with payload-field encryption noted as a possible future
hardening item requiring a key-management scheme. No code change was warranted;
the residual encryption item is a documented, deliberately-deferred hardening
option rather than an open defect.
DeploymentManager-014 — Dead CreateCommand helper in artifact tests
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90 |
Description
The private static CreateCommand() helper is never referenced by any test in
the file. It is dead code that suggests an intended test (e.g. a successful
multi-site artifact deployment) was never written — coverage of
DeployToAllSitesAsync is limited to the no-sites failure case, and
RetryForSiteAsync and BuildDeployArtifactsCommandAsync have no tests at all.
Verification: Confirmed against source. The CreateCommand() helper had no
callers, and DeployToAllSitesAsync/RetryForSiteAsync only had the no-sites
failure case.
Recommendation
Either remove the unused helper or, preferably, write the missing tests for
DeployToAllSitesAsync (per-site success/failure matrix, partial failure) and
RetryForSiteAsync using it.
Resolution
Resolved 2026-05-16 (commit pending): took the recommendation's preferred
option — removed the dead CreateCommand() helper and wrote the missing
coverage instead. ArtifactDeploymentServiceTests now extends TestKit and
uses a stand-in ArtifactProbeActor (records the DeployArtifactsCommands it
receives, replies success or, for a configured failure set, failure) so
DeployToAllSitesAsync and RetryForSiteAsync are exercised end-to-end past
the communication boundary. New tests:
DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId (also
covers DeploymentManager-010), DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix
(per-site success/failure matrix), RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits.
DeploymentManager-015 — Site-query reconciliation marks a deployment Success but skips instance-state and snapshot updates
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:631-655 |
Description
TryReconcileWithSiteAsync (the DeploymentManager-006 query-before-redeploy
path) handles the case where a prior InProgress/timeout-Failed record exists
and the site reports it already has the target revision hash. In that case it
marks the prior DeploymentRecord Success, audit-logs DeployReconciled, and
returns it — the caller then returns Result.Success and never enters the
normal deploy body.
The normal success path (DeployInstanceAsync, DeploymentService.cs:215-223) does three things on
a successful site response: writes the deployment record terminal status, sets
instance.State = InstanceState.Enabled + UpdateInstanceAsync, and calls
StoreDeployedSnapshotAsync. The reconciliation shortcut performs only the
first. Consequently, after a reconciled deployment:
- The instance State is left at whatever it was (e.g. NotDeployed for a first-time deploy that timed out, or Disabled) even though the site is actually running the configuration, so the central state machine and the site diverge, and a subsequent DisableInstanceAsync/EnableInstanceAsync will be rejected or allowed incorrectly by StateTransitionValidator.
- No DeployedConfigSnapshot is created or refreshed. A first-time deploy that is resolved purely by reconciliation leaves GetDeploymentComparisonAsync permanently returning "No deployed snapshot found for this instance.", and a redeploy reconciliation leaves the stored snapshot showing the old config even though the deployment record claims Success for the new revision.
The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the deployed snapshot and instance state to reflect the last successful deployment; the reconciliation path silently breaks both invariants.
Recommendation
In the reconciled-success branch of TryReconcileWithSiteAsync, perform the
same post-success side effects as the normal path: set instance.State = InstanceState.Enabled (+ UpdateInstanceAsync) and call
StoreDeployedSnapshotAsync with the target deployment ID / revision hash /
config JSON. Factor the shared post-success logic into one helper so the normal
and reconciliation paths cannot drift. Add a regression test asserting that a
reconciled deployment leaves the instance Enabled and a snapshot stored.
Resolution
Resolved 2026-05-17 (commit pending): extracted the shared post-success side
effects into ApplyPostSuccessSideEffectsAsync (sets instance State = Enabled + UpdateInstanceAsync, stores/refreshes the DeployedConfigSnapshot)
and invoked it from both the normal deploy success path and the
TryReconcileWithSiteAsync reconciled-success branch, so a reconciled
deployment now performs the same instance-state and snapshot updates as a
normal one (configJson is now computed before the reconciliation call and
threaded into TryReconcileWithSiteAsync). Regression test:
DeployInstanceAsync_Reconciled_SetsInstanceEnabledAndStoresSnapshot.
DeploymentManager-016 — Reconciled prior record keeps its stale RevisionHash
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:639-651 |
Description
When TryReconcileWithSiteAsync reconciles a prior record, it mutates
prior.Status, prior.ErrorMessage, and prior.CompletedAt, but not
prior.RevisionHash. The reconciliation condition only compares the site's
AppliedRevisionHash against the freshly-flattened targetRevisionHash — it
does not require prior.RevisionHash to equal either of them.
The prior record can legitimately carry a different revision hash than the
current target: e.g. a deploy timed out at revision R1, the template was then
edited so the current flatten yields R2, and meanwhile the site actually
applied R2 through some other path (or R1 and R2 are equal-by-content but
the prior record predates a hash recompute). After reconciliation the record's
Status is Success but its RevisionHash still says R1, so staleness
checks and any UI that reads DeploymentRecord.RevisionHash will report the
instance as deployed at the wrong revision. The audit DeployReconciled entry
records RevisionHash = targetRevisionHash, contradicting the persisted record.
Recommendation
In the reconciled-success branch, also set prior.RevisionHash = targetRevisionHash so the persisted record, the audit entry, and the site's
actual applied revision all agree. Alternatively, only reconcile when
prior.RevisionHash == targetRevisionHash and otherwise fall through to a
normal deploy.
Resolution
Resolved 2026-05-17 (commit pending): the reconciled-success branch of
TryReconcileWithSiteAsync now also sets prior.RevisionHash = targetRevisionHash, so the persisted record, the DeployReconciled audit
entry, and the site's actually-applied revision all agree. Regression test:
DeployInstanceAsync_Reconciled_PriorRecordRevisionHashUpdatedToTarget.
DeploymentManager-017 — GetDeploymentStatusAsync XML doc describes behaviour it does not implement
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:562-570 |
Description
The XML summary on GetDeploymentStatusAsync reads: "WP-2: After
failover/timeout, query site for current deployment state before
re-deploying." The method body does no such thing — it is a one-line
pass-through to _repository.GetDeploymentByDeploymentIdAsync, a pure local DB
read. The query-the-site-before-redeploy behaviour the comment describes was
implemented separately in TryReconcileWithSiteAsync (DeploymentManager-006).
The stale comment is a leftover of the original design intent and misleads a
reader into thinking this method contacts the site.
Recommendation
Reword the summary to describe what the method actually does — "returns the
current persisted DeploymentRecord for the given deployment ID from the
configuration database" — and, if useful, cross-reference
TryReconcileWithSiteAsync as the place the site-query reconciliation lives.
Resolution
Resolved 2026-05-17 (commit pending): the GetDeploymentStatusAsync XML doc
now states it returns the persisted DeploymentRecord from the configuration
database as a pure local read, and cross-references TryReconcileWithSiteAsync
as where the query-the-site-before-redeploy reconciliation actually lives.
Documentation-only change; no regression test (a test asserting comment text
would be meaningless).
DeploymentManager-018 — Reconciliation force-sets Enabled, overwriting an intentional Disabled after central failover
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:675-682,721-748 |
Resolution — Added a forceEnabledState parameter to ApplyPostSuccessSideEffectsAsync. The normal deploy path passes true (fresh apply legitimately ends in Enabled); the reconciliation path passes false, so the helper only promotes NotDeployed → Enabled and leaves an existing Disabled (or Enabled) untouched. Regression test DeployInstanceAsync_Reconciled_DisabledInstance_PreservesDisabledState exercises the failover scenario and asserts the prior record still flips to Success while Instance.State stays Disabled.
Description
TryReconcileWithSiteAsync calls ApplyPostSuccessSideEffectsAsync whenever
the site reports it has the target revision hash, and that helper
unconditionally writes instance.State = InstanceState.Enabled. The
reconciliation shortcut only runs when the prior DeploymentRecord is
InProgress or timeout-Failed — exactly the scenarios that survive a central
failover (the in-memory OperationLockManager is lost on failover, by design:
"Lost on central failover (acceptable per design — in-progress treated as
failed)").
After such a failover, the per-instance operation lock is gone but the
deployment record is still InProgress in the DB. A user can legitimately
issue DisableInstanceAsync for the same instance — there is nothing in
DisableInstanceAsync that consults the deployment record, only the
StateTransitionValidator over Instance.State. If the state is Enabled
(the typical case when the deploy started), the disable proceeds, the site
honours it (the design states a disabled instance retains its deployed
configuration), and central now persists Instance.State = Disabled. The
deployment-record row remains InProgress (no one transitioned it). Later the
user retries the deploy: TryReconcileWithSiteAsync runs, the site still has
the target revision hash (Disable doesn't change the deployed config), the
prior record is marked Success, and ApplyPostSuccessSideEffectsAsync writes
Instance.State = Enabled — silently overriding the user's explicit Disable.
The same trap exists for any direct DB edit / migration that flipped the state
between the timed-out deploy and the redeploy. The normal deploy path can
defensibly assume Enabled after a fresh successful apply, but the
reconciliation path is reconciling prior state with prior user intent; it
should preserve Disabled if that is the current Instance.State at the time
of reconciliation, mirroring the design's separation between deploy (config
apply) and disable (subscription/script lifecycle).
Recommendation
In the reconciliation branch, do not force Enabled. Either:
- Pass a flag/parameter to ApplyPostSuccessSideEffectsAsync telling it whether to touch state, and skip the state write on the reconciliation path (leaving the current Instance.State intact, which is already Enabled for a fresh deploy that timed out and Disabled for the user-disabled follow-up case); or
- Only set Enabled when the current Instance.State is NotDeployed (i.e. the first-deploy timed-out case), and leave existing Enabled/Disabled alone.
Add a regression test where an instance with Instance.State = Disabled and a
prior InProgress deployment record is reconciled — the resulting
Instance.State must remain Disabled, and the deployment record must still
be marked Success.
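The state rule adopted in the resolution (normal deploy forces Enabled; reconciliation only promotes NotDeployed) reduces to a small pure function. The sketch below is illustrative only: InstanceStateSketch and NextState are hypothetical names, and the enum lists just the three states mentioned in this finding, which may not be the full set in the codebase.

```csharp
using System;

// Only the three states named in this finding; the real enum may have more.
public enum InstanceState { NotDeployed, Enabled, Disabled }

public static class InstanceStateSketch
{
    // forceEnabledState = true  : normal deploy path, a fresh apply ends in Enabled.
    // forceEnabledState = false : reconciliation path, promote NotDeployed only and
    //                             leave an existing Enabled/Disabled untouched.
    public static InstanceState NextState(InstanceState current, bool forceEnabledState) =>
        forceEnabledState || current == InstanceState.NotDeployed
            ? InstanceState.Enabled
            : current;
}
```

Under this rule a reconciled deployment of a Disabled instance keeps Disabled, which is the failover scenario the regression test pins.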
DeploymentManager-019 — Lifecycle command timeout writes no audit entry
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458 |
Resolution (2026-05-28): added TryLogLifecycleTimeoutAsync, a private
helper that mirrors the DeployFailed pattern — it calls _auditService.LogAsync
with CancellationToken.None (so the operator's already-cancelled outer
token cannot also prevent the audit write) and stamps the row with the
<Action>TimedOut action name (DisableTimedOut / EnableTimedOut /
DeleteTimedOut), the command id, the configured deadline, and the captured
exception message. Each of DisableInstanceAsync / EnableInstanceAsync /
DeleteInstanceAsync invokes the helper from its
catch (TimeoutException or OperationCanceledException) block before
returning the failure Result. The helper itself try/catches around the
audit write so a failed audit pipeline does not mask the underlying timeout
for the caller — it only logs at Warning. Regression tests
DisableInstanceAsync_LifecycleTimeout_WritesDisableTimedOutAuditEntry,
EnableInstanceAsync_LifecycleTimeout_WritesEnableTimedOutAuditEntry, and
DeleteInstanceAsync_LifecycleTimeout_WritesDeleteTimedOutAuditEntry use the
existing SilentProbeActor to keep the site unresponsive, configure a 300 ms
LifecycleCommandTimeout to bound the wait, and assert the audit log
received the corresponding <Action>TimedOut entry exactly once.
Description
DisableInstanceAsync, EnableInstanceAsync, and DeleteInstanceAsync each
wrap the CommunicationService call in a linked CTS with
LifecycleCommandTimeout (DeploymentManager-012). On timeout they log a
warning and return Result<...>.Failure(...) — and skip the
_auditService.LogAsync call entirely. As a result, an operator-initiated
disable/enable/delete that times out at the site leaves no audit trail:
the user, the timestamp, the command id, and the failure mode are not
recorded in the audit log. The deploy path goes out of its way to write a
DeployFailed audit entry on the same failure mode
(DeploymentService.cs:274-276), with CancellationToken.None so the write is
durable; the lifecycle commands do not.
The design lists audit logging as a Deployment Manager responsibility for "all deployment actions, system-wide artifact deployments, and instance lifecycle changes" — a timed-out lifecycle command is an attempted lifecycle change, and the operator action is exactly the kind of event the audit log exists to record.
Recommendation
In each of the three catch (Exception ex) when (ex is TimeoutException or OperationCanceledException) blocks, write a DisableTimeout/EnableTimeout/
DeleteTimeout (or use the existing operation name with a failure flag)
audit entry with CancellationToken.None so a cancelled outer token does not
prevent the audit write, mirroring DeployFailed. Add a unit test asserting
that the timeout scenario exercised by
DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait
also produces an audit entry.
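A best-effort audit write on the timeout path, in the spirit of the helper described in the resolution, might be sketched as follows. The AuditWriter delegate is a hypothetical stand-in for IAuditService.LogAsync (whose real signature is not shown in this review), and TimeoutAuditSketch, TryLogTimeoutAsync, and the warn callback are likewise illustrative names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutAuditSketch
{
    // Hypothetical stand-in for IAuditService.LogAsync.
    public delegate Task AuditWriter(string user, string action, string detail, CancellationToken ct);

    // Writes "<action>TimedOut" to the audit log. Returns false (after warning)
    // if the audit pipeline itself fails, so the original timeout is never masked.
    public static async Task<bool> TryLogTimeoutAsync(
        AuditWriter audit, string user, string action, Exception timeout, Action<string> warn)
    {
        try
        {
            // CancellationToken.None: the caller's token may already be cancelled,
            // and that must not prevent the audit row from being written.
            await audit(user, action + "TimedOut", timeout.Message, CancellationToken.None);
            return true;
        }
        catch (Exception auditFailure)
        {
            warn("Audit write for timed-out lifecycle command failed: " + auditFailure.Message);
            return false;
        }
    }
}
```

The caller would invoke this from its timeout catch block and then return its failure result regardless of the boolean, since the audit outcome should not change what the operator sees.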
DeploymentManager-020 — DeployReconciled audit attributes the action to the prior deployer, not the current user
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:698-712 |
Description
In TryReconcileWithSiteAsync the audit call is:
await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)
prior.DeployedBy is the user who issued the original (timed-out / stuck)
deployment, not the user parameter passed into DeployInstanceAsync. The
current user — the one who triggered the redeploy that produced the
reconciliation — is dropped on the floor. For audit forensics this is
misleading: the row will read "user A reconciled their own deployment"
when in fact user B initiated the action that reconciled it.
The original deployer is interesting context, but it should be carried in the
audit-detail object (where DeploymentId and RevisionHash already live), not
substituted for the actor.
Recommendation
Use user (the parameter on DeployInstanceAsync, threaded through
TryReconcileWithSiteAsync) as the audit actor, and include
OriginalDeployer = prior.DeployedBy in the detail object so the original
attribution is preserved without misrepresenting who took the action.
Resolution (2026-05-28): Threaded the user parameter from
DeployInstanceAsync into TryReconcileWithSiteAsync as a new currentUser
argument (consistent with the DeploymentManager-018 forceEnabledState
parameter-threading pattern) and rewrote the audit call to log
currentUser as the actor with OriginalDeployer = prior.DeployedBy carried
in the detail object. Added test
DeployInstanceAsync_Reconciled_AuditAttributesCurrentUserNotPriorDeployer
that pins the new attribution and asserts the prior deployer is no longer used
as the actor. Tests green (80/80 in DeploymentManager.Tests).
DeploymentManager-021 — ResolveSiteIdentifierAsync silently substitutes the DB id when the site row is missing
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:107-111 |
Resolution (2026-05-28): ResolveSiteIdentifierAsync now throws InvalidOperationException ("Site with ID {siteId} not found; cannot resolve its SiteIdentifier for routing.") when the Site row is missing, instead of returning the numeric id rendered as a string. The deploy path's existing try/catch turns the throw into a DeploymentStatus.Failed record carrying the descriptive message (the DeploymentManager-001/-002 cleanup writes the failure with CancellationToken.None); the lifecycle paths (Disable/Enable/Delete) propagate the exception so the CLI/UI caller surfaces the actual cause to the operator rather than seeing a confusing downstream "unknown site" routing error. The repository contract already returned Site?, so the null path is now type-visible at the call site instead of silently papered over.
Description
private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
{
    var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
    return site?.SiteIdentifier ?? siteId.ToString();
}
If the Site row is missing (FK was deleted, race with admin delete, DB
inconsistency), the method silently returns the numeric DB id rendered as a
string. This is then passed to CommunicationService.{Deploy,Disable,Enable,Delete}InstanceAsync and QueryDeploymentStateAsync as if it were a real
SiteIdentifier (e.g. "site-a"). The communication layer will fail with an
"unknown site" or routing error, producing a confusing diagnostic that hides
the actual problem (no site row).
This is a defensive concern, but every mutating operation in the module goes through this method, so a stale instance whose site was deleted will produce a misleading error every time it is touched.
Recommendation
Treat a missing site as a hard validation failure: return a
Result.Failure($"Site with ID {siteId} not found") early from the calling
operations, instead of fabricating an identifier. The repository already
returns Site?, so the null path is type-visible; just don't paper over it.
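The fail-fast alternative to the fabricated identifier can be sketched as below. This is a simplified, synchronous illustration: SiteResolutionSketch and ResolveSiteIdentifier are hypothetical names, and the repository lookup is modelled as a delegate returning the identifier or null rather than the real GetSiteByIdAsync call.

```csharp
#nullable enable
using System;

public static class SiteResolutionSketch
{
    // lookup returns the site's identifier, or null when no Site row exists.
    public static string ResolveSiteIdentifier(int siteId, Func<int, string?> lookup)
    {
        var identifier = lookup(siteId);
        if (identifier is null)
        {
            // Surface the real problem instead of routing to a made-up identifier.
            throw new InvalidOperationException(
                $"Site with ID {siteId} not found; cannot resolve its SiteIdentifier for routing.");
        }
        return identifier;
    }
}
```

Whether to throw or to return a failure result is a design choice; the point in either case is that the missing row is reported at the place it is detected.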
DeploymentManager-022 — Pending and InProgress are written back-to-back with no intervening work
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:178-194 |
Resolution (2026-05-28): The transient Pending write was dropped — the deployment record is now created directly in DeploymentStatus.InProgress, which collapses the start of the deploy into a single AddDeploymentRecordAsync + SaveChangesAsync + NotifyStatusChange (instead of two writes back-to-back). The flattening, validation, and TryReconcileWithSiteAsync round-trip have all completed before the insert, and the deploy command is sent immediately after, so Pending carried no operational meaning between the two writes. InProgress retains its documented "sent to site, awaiting response" semantics. Eliminating the extra SaveChangesAsync round-trip also removes the Pending→InProgress flicker the CentralUI-006 deployment-status page used to render via the second IDeploymentStatusNotifier.NotifyStatusChanged invocation.
Description
DeployInstanceAsync does:
record.Status = Pending;
AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
record.Status = InProgress;
UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
There is no work between the two writes — flattening, validation, and
reconciliation have already completed by line 174. The deploy command is sent
immediately after the InProgress write. The Pending write therefore costs:
an extra SaveChangesAsync round-trip, an extra IDeploymentStatusNotifier
invocation (which the CentralUI-006 page renders, so the user briefly sees a
Pending flicker before InProgress), and an extra row-version bump if EF
optimistic concurrency is enabled on the table.
The design uses Pending to mean "queued, not yet sent" and InProgress to
mean "sent to site, awaiting response". The code's Pending slot has no
queuing — it is set and immediately overwritten — so the state buys nothing
operationally.
Recommendation
Either:
- Drop the Pending write entirely and create the record directly in InProgress (one row insert, one notification, simpler UI); or
- Move the Pending→InProgress transition to bracket actual queueing/work (e.g. set Pending before flattening + reconciliation, set InProgress immediately before DeployInstanceAsync on the comm service) so the two states carry distinguishable semantics worth a separate write.
DeploymentManager-023 — BuildDeployArtifactsCommandAsync re-queries system-wide artifacts once per site
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173 |
Resolution (2026-05-28): Hoisted the global artifact queries (shared scripts, external systems + methods, DB connections, notification lists, SMTP configurations) out of the per-site loop into a new private FetchGlobalArtifactsAsync that produces a GlobalArtifactSnapshot record. DeployToAllSitesAsync now calls it ONCE before the loop and threads the snapshot through a new prefetched-globals overload of BuildDeployArtifactsCommandAsync; the public single-site overload keeps the prior fetch-then-build behaviour for RetryForSiteAsync. Only the per-site data-connection query remains inside the loop. Regression tests DeployToAllSitesAsync_HoistsGlobalArtifactQueriesOutOfPerSiteLoop (three sites; pins exactly-one call to each global getter and one per-site call to GetDataConnectionsBySiteIdAsync) and RetryForSiteAsync_SingleSitePath_StillRunsTheGlobalQueriesOnce (single-site path still owns its own fetch).
Description
DeployToAllSitesAsync loops over sites and calls
BuildDeployArtifactsCommandAsync(site.Id, ...) for each one. Of the six
artifact sets the method gathers, only dataConnections is per-site:
- _templateRepo.GetAllSharedScriptsAsync: global.
- _externalSystemRepo.GetAllExternalSystemsAsync: global, plus GetMethodsByExternalSystemIdAsync per external system per site.
- _externalSystemRepo.GetAllDatabaseConnectionsAsync: global.
- _notificationRepo.GetAllNotificationListsAsync: global.
- _notificationRepo.GetAllSmtpConfigurationsAsync: global.
- _siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...): per-site.
With N sites this issues 5·N queries against the global sets where 5 would suffice, plus M·N method queries where M would suffice (M being the external-system count). On a hub-and-spoke deployment with many sites the artifact-deploy path is slower than necessary and holds the DbContext longer than needed. Per CLAUDE.md, the DbContext is not thread-safe, so the per-site commands are correctly built sequentially; the redundant queries are therefore sequential too, and each one adds a full database round-trip.
Recommendation
Hoist the global queries (shared scripts, external systems + their methods,
DB connections, notification lists, SMTP configurations) out of
BuildDeployArtifactsCommandAsync, fetch them once in DeployToAllSitesAsync,
and pass them in alongside the site id (or expose a
BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals) overload).
RetryForSiteAsync (the single-site path) can keep the convenience-overload
behaviour. Add a test using NSubstitute's .Received() to assert
_templateRepo.GetAllSharedScriptsAsync is called exactly once for an
N-site deployment.
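The hoisting can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's actual code: IGlobalArtifactSource, GlobalSnapshot, CountingSource, and ArtifactDeploySketch are hypothetical stand-ins for the repositories and command builder named above, and only one of the five global sets is modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for the global (site-independent) repository queries.
public interface IGlobalArtifactSource
{
    Task<IReadOnlyList<string>> GetAllSharedScriptsAsync();
}

// Hypothetical snapshot of everything that does not vary per site.
public sealed record GlobalSnapshot(IReadOnlyList<string> SharedScripts);

// Test double that counts how often the global query runs.
public sealed class CountingSource : IGlobalArtifactSource
{
    public int SharedScriptCalls { get; private set; }

    public Task<IReadOnlyList<string>> GetAllSharedScriptsAsync()
    {
        SharedScriptCalls++;
        IReadOnlyList<string> scripts = new[] { "script-a", "script-b" };
        return Task.FromResult(scripts);
    }
}

public static class ArtifactDeploySketch
{
    // Fetch the global sets once, before the per-site loop.
    public static async Task<GlobalSnapshot> FetchGlobalsAsync(IGlobalArtifactSource source)
        => new GlobalSnapshot(await source.GetAllSharedScriptsAsync());

    // The per-site build reuses the prefetched snapshot; only site-specific
    // data (omitted here) would still be queried inside the loop.
    public static string BuildCommand(int siteId, GlobalSnapshot globals)
        => $"site {siteId}: {globals.SharedScripts.Count} shared scripts";

    public static async Task<List<string>> DeployToAllSitesAsync(
        IEnumerable<int> siteIds, IGlobalArtifactSource source)
    {
        var globals = await FetchGlobalsAsync(source); // one query, not one per site
        return siteIds.Select(id => BuildCommand(id, globals)).ToList();
    }
}
```

Calling DeployToAllSitesAsync with three site ids against a CountingSource leaves SharedScriptCalls at 1, which is the same property the recommended Received() assertion would pin against the real repository.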
DeploymentManager-024 — Test probe actors hold mutable static state across tests
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075, tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217 |
Resolution (2026-05-28): Replaced the static counters with per-test instance state. Introduced ReconcileProbeCounters and SerializationProbeCounters (in DeploymentServiceTests) and ArtifactProbeRecorder (in ArtifactDeploymentServiceTests); each probe actor now takes the counter object as its first constructor argument. Every test instantiates a fresh counter local, passes it via Props.Create(() => new ReconcileProbeActor(counters, ...)), and reads the counts directly off counters; no shared static fields remain. ReconcileProbeActor's counter increments now use Interlocked.Increment so that cross-thread updates are atomic, and SerializationProbeActor retains its lock on a per-test Gate. All 85 ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests continue to pass after the refactor.
Description
ReconcileProbeActor.QueryCount / DeployCount, SerializationProbeActor.MaxConcurrent
/ _current, and ArtifactProbeActor.Received are all static fields.
Each test's actor constructor resets them, but reset-on-construction only
works as long as no two tests that share a probe type run concurrently. xUnit
runs the tests of a single class sequentially by default (each class is its own
test collection), so today's tests pass. Any change that lets two such tests
overlap (reusing a probe actor from a second test class, or adopting a runner or
extension that parallelizes within a class) makes the counters race: a deploy in
test A could increment DeployCount while test B is asserting on it.
Static state shared across tests is also why a flaky-test investigation here will be unusually painful: the offending interaction is invisible from any single test file.
Recommendation
Replace the static counters with instance state, hand the actor a probe
recipient (an IActorRef to a TestKit probe), and assert via ExpectMsg
in each test. Where the simpler counter shape is preferred, pass a
shared-state object into the actor's constructor so each test owns its own
instance — never reach for static mutable test state.
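The counter-object variant of this recommendation might look like the sketch below. ProbeCounters, DeployProbe, and ProbeSketch are hypothetical names, and a plain class stands in for the Akka actor so the example stays self-contained; the point is only that state is injected per test rather than held in static fields.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical per-test counter object: each test creates its own instance,
// so no state is shared between tests.
public sealed class ProbeCounters
{
    private int _deployCount;

    public int DeployCount => Volatile.Read(ref _deployCount);

    // Atomic increment; safe when the probe handles messages on another thread.
    public void RecordDeploy() => Interlocked.Increment(ref _deployCount);
}

// Hypothetical stand-in for a probe actor: it receives the counters through
// its constructor instead of writing to static fields.
public sealed class DeployProbe
{
    private readonly ProbeCounters _counters;

    public DeployProbe(ProbeCounters counters) => _counters = counters;

    public void OnDeployMessage() => _counters.RecordDeploy();
}

public static class ProbeSketch
{
    // Simulates one test: a fresh counter object, a probe bound to it, and
    // several deploy messages handled concurrently.
    public static int RunTest(int messages)
    {
        var counters = new ProbeCounters();
        var probe = new DeployProbe(counters);
        Parallel.For(0, messages, _ => probe.OnDeployMessage());
        return counters.DeployCount;
    }
}
```

Because every RunTest call owns its ProbeCounters instance, two simulated tests cannot observe each other's increments even if they run at the same time.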
DeploymentManager-025 — Notification lists and SMTP configurations (with credentials) are still broadcast to every site, contradicting the central-only design
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:13-22,128,130,150-151,181-188 |
Description
The design now states explicitly, in two authoritative places, that notification lists and SMTP configuration are not deployable artifacts:
- docs/requirements/Component-DeploymentManager.md:142-146: "Notification lists and SMTP configuration are not deployable artifacts — they are central-only definitions managed by the Notification Service ... Notification delivery happens on the central cluster, so no notification artifact or SMTP credential is ever distributed to sites."
- CLAUDE.md "External Integrations" decision: "Notification delivery is central-only ... Notification lists and SMTP config are no longer deployed to sites; recipient resolution happens at central, at delivery time."
ArtifactDeploymentService still does the opposite. FetchGlobalArtifactsAsync
queries _notificationRepo.GetAllNotificationListsAsync and
GetAllSmtpConfigurationsAsync (lines 150-151), maps them into
NotificationListArtifacts and SmtpConfigurationArtifacts — the SMTP artifact
carrying smtp.Credentials verbatim (line 188) — and BuildDeployArtifactsCommandAsync
places both into the DeployArtifactsCommand sent to every site (lines 128,
130). The site side persists them: SiteReplicationActor (lines 192-201) and
DeploymentManagerActor (lines 1383-1419) loop over command.NotificationLists
and command.SmtpConfigurations and write them to site SQLite via
SiteNotificationRepository.
This is the precise scenario the design says must never happen: SMTP credentials travel across the inter-cluster transport and land on every site's SQLite. It supersedes the framing of the now-closed DeploymentManager-013, which accepted SMTP-as-deployable-artifact as a documented design decision — the design has since flipped to forbid distribution entirely, so this is a fresh divergence, not the same finding. The class-level XML doc (lines 13-22) is correspondingly stale: it still advertises "notification lists ... and SMTP configurations" as artifacts the service "broadcasts ... to all sites."
Secondary defect in the same mapping: NotificationListArtifact is built from
nl.Recipients.Where(r => r.EmailAddress is not null) (line 182), which silently
drops every SMS-only recipient (PhoneNumber set, EmailAddress null) of a
NotificationType.Sms list. Even if list distribution were intended, the SMS
recipient set would be lost — but since lists must not be distributed at all, this
is subsumed by the primary fix.
Recommendation
Stop fetching, mapping, and shipping notification lists and SMTP configurations
from the artifact deployment path. Drop the _notificationRepo queries from
FetchGlobalArtifactsAsync, pass null (or empty) for the NotificationLists
and SmtpConfigurations fields of DeployArtifactsCommand, and update the class
XML doc to remove both from the artifact list. The message fields can remain on
DeployArtifactsCommand for additive compatibility but must never be populated
from central. Coordinate removal of the consuming code in SiteRuntime
(SiteReplicationActor, DeploymentManagerActor, SiteNotificationRepository)
in the same session per the project's "design + code + tests travel together"
rule. Update the contradicting tests (see DeploymentManager-027).
Resolution
Resolved 2026-06-20 (commit fd618cf1): central FetchGlobalArtifactsAsync no longer queries or ships notification lists / SMTP configs (passes null; DeployArtifactsCommand fields kept for contract compatibility), and the site purges any already-persisted notification_lists / smtp_configurations rows (clearing the plaintext SMTP password) on both apply paths — enforcing the central-only delivery design. Verified no site runtime/delivery path reads this config.
DeploymentManager-026 — DeployInstanceAsync inserts a new deployment record every deploy; per-instance rows accumulate and the "current status" read has no tiebreaker
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:215-225, src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:55-61 |
Description
DeployInstanceAsync always creates a brand-new DeploymentRecord and calls
_repository.AddDeploymentRecordAsync(record, …) (line 223) — it never reuses or
updates the instance's existing record. There is no update-in-place or
delete-prior path on the deploy flow, so every successful or failed deployment of
an instance leaves its predecessor row behind. Over the life of a central process
— amplified by the bulk "deploy all out-of-date instances" workflow and by repeated
redeploys after timeouts — the DeploymentRecords table grows without bound, one
row per deploy attempt per instance.
This contradicts the design's "Deployment Status Persistence" section
(Component-DeploymentManager.md:106-109): "Only the current deployment
status per instance is stored in the configuration database ... No deployment
history table — the audit log (via IAuditService) already captures every
deployment action." The audit log is the history; the deployment-record table is
supposed to hold only the current status. The implementation instead keeps an
ad-hoc, unindexed history there.
The accumulation also makes the reconciliation read order-sensitive.
TryReconcileWithSiteAsync reads the "prior" record via
GetCurrentDeploymentStatusAsync, which is
OrderByDescending(d => d.DeployedAt).FirstOrDefault() with no secondary sort
key (e.g. ThenByDescending(d => d.Id)). DeployedAt is a DateTimeOffset
stamped with DateTimeOffset.UtcNow at record creation; two records inserted
within the same clock tick (rapid redeploy, or a redeploy immediately after a
timed-out attempt) tie on DeployedAt, and SQL Server's choice between equal
sort keys is undefined. Reconciliation could then read the wrong prior record
(e.g. an older Success instead of the latest stuck InProgress), skipping the
intended site query, or vice-versa.
Recommendation
Either (a) make the deploy path upsert the instance's single current record
(update-in-place when one exists, insert only on first deploy) so the table holds
exactly one row per instance per the design, or (b) if multiple rows are
deliberately retained, add a deterministic tiebreaker to
GetCurrentDeploymentStatusAsync (OrderByDescending(d => d.DeployedAt).ThenByDescending(d => d.Id)) and document the retention/cleanup story so the
table does not grow unbounded. Option (a) aligns with the design and is preferred.
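The effect of the option (b) tiebreaker can be illustrated with LINQ to Objects. DeploymentRow and CurrentStatusSketch are hypothetical, trimmed-down stand-ins for the real entity and repository method, and the sketch assumes Id increases with insertion order. In-memory LINQ sorting is stable, whereas a database ORDER BY without a secondary key gives no such guarantee, so the explicit second key is what makes the result deterministic in both settings.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down shape of a deployment record.
public sealed record DeploymentRow(int Id, string Status, DateTimeOffset DeployedAt);

public static class CurrentStatusSketch
{
    // Newest timestamp first; on a timestamp tie the higher Id wins,
    // assuming Id is monotonically increasing with insertion order.
    public static DeploymentRow? Current(IEnumerable<DeploymentRow> rows) =>
        rows.OrderByDescending(r => r.DeployedAt)
            .ThenByDescending(r => r.Id)
            .FirstOrDefault();
}
```

Given two rows stamped with the same DeployedAt, Current returns the one with the larger Id regardless of the order in which the rows are supplied; for an empty sequence it returns null.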
Resolution
Resolved 2026-06-20 (commit fd618cf1): added a deterministic .ThenByDescending(d => d.Id) tiebreaker to GetCurrentDeploymentStatusAsync so same-tick deployment records resolve to the newest row. Insert-per-deploy behaviour unchanged (history-vs-upsert remains a separate design question).
DeploymentManager-027 — Artifact tests assert that notification lists and SMTP configs are shipped, cementing the DeploymentManager-025 design violation
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:173-174,200-201 |
Description
ArtifactDeploymentServiceTests asserts, via NSubstitute, that the artifact
deployment path queries the forbidden artifact sets exactly once per deployment:
await _notificationRepo.Received(1).GetAllNotificationListsAsync(Arg.Any<CancellationToken>());
await _notificationRepo.Received(1).GetAllSmtpConfigurationsAsync(Arg.Any<CancellationToken>());
(both in the DeployToAllSitesAsync global-query-hoisting test at 173-174 and the
RetryForSiteAsync test at 200-201). These assertions pin the exact behaviour the
current design forbids (DeploymentManager-025): they will keep the service shipping
notification lists and SMTP configs to sites and will actively block the fix —
removing the queries makes these Received(1) assertions fail. Tests that lock in
a design violation are worse than no test, because they make the correct change
look like a regression.
Recommendation
When DeploymentManager-025 is fixed, change these to
DidNotReceive().GetAllNotificationListsAsync(...) /
DidNotReceive().GetAllSmtpConfigurationsAsync(...) (and assert the
DeployArtifactsCommand's NotificationLists / SmtpConfigurations fields are
null/empty) so the tests enforce the central-only design instead of contradicting
it.
Resolution
Resolved 2026-06-20 (commit fd618cf1): the artifact tests no longer assert the forbidden notification/SMTP shipping — flipped Received(1) → DidNotReceive() and added assertions that the shipped command's NotificationLists/SmtpConfigurations are null.