fix(deployment-manager): resolve DeploymentManager-003..011 — atomic status commit, orphan-delete handling, semaphore reclamation, structured diff, options binding, lifecycle test coverage

This commit is contained in:
Joseph Doherty
2026-05-16 21:11:24 -04:00
parent c9b236e507
commit 8c67ffad2a
8 changed files with 760 additions and 44 deletions

View File

@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 11 |
| Open findings | 5 |
## Summary
@@ -134,7 +134,7 @@ error) if persistence still fails. Regression test:
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170` |
**Description**
@@ -150,6 +150,11 @@ deployment. Central and site are now divergent: the site is running the new
config but central still shows the old state and a non-`Success` deployment
record.
**Verification:** Confirmed against source. The DeploymentManager-001 fix made
this strictly worse, not better — after that fix a snapshot-store failure is
caught and the record is flipped from `Success` back to `Failed`, so central
reports a *failed* deployment while the site is running the new config.
**Recommendation**
Wrap the post-success persistence so that, at minimum, the deployment record's
@@ -160,7 +165,15 @@ apply.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending): `DeployInstanceAsync` now commits the
deployment record's terminal status (`UpdateDeploymentRecordAsync` +
`SaveChangesAsync`) immediately after the site confirms the apply, *before*
touching instance state or the deployed-config snapshot. The post-success
instance-state update and `StoreDeployedSnapshotAsync` are wrapped in a
best-effort `try`/`catch` that logs loudly for operator reconciliation but no
longer flips the already-committed `Success` record back to `Failed`.
Regression test:
`DeployInstanceAsync_SiteSucceeds_SnapshotWriteFails_RecordStillCommittedSuccess`.
### DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config
@@ -168,7 +181,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319` |
**Description**
@@ -182,6 +195,10 @@ normal path (the site no longer has it, so a re-issued delete may fail) and is
permanently orphaned. The design states central must not mark the instance
deleted until the site confirms — but it does not address the inverse failure.
**Verification:** Confirmed against source. `DeleteInstanceAsync` has no
`try`/`catch` around the post-success block, so any exception from
`DeleteInstanceAsync`/`SaveChangesAsync` escapes uncaught to the caller.
**Recommendation**
Catch persistence failures in the post-success block and surface a distinct
@@ -191,7 +208,13 @@ idempotent and retryable independently of the site command.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending): the post-success removal in
`DeleteInstanceAsync` (`DeleteInstanceAsync` + `SaveChangesAsync`) is now
wrapped in a `try`/`catch`. A persistence failure no longer escapes uncaught —
it is logged, recorded with a `DeleteOrphaned` audit entry, and surfaced as a
distinct `Result` failure stating the site deleted the instance but the central
record is orphaned and must be reconciled. Regression test:
`DeleteInstanceAsync_SiteSucceeds_CentralDeleteFails_ReturnsDistinctFailure`.
### DeploymentManager-005 — `OperationLockManager` leaks a `SemaphoreSlim` per instance name
@@ -199,7 +222,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33` |
**Description**
@@ -213,6 +236,9 @@ with the bulk "deploy all out-of-date instances" workflow and instances that
are created and deleted over time — this is an unbounded leak of both managed
memory and OS handles. Deleted instances' semaphores are never reclaimed.
**Verification:** Confirmed against source. `_locks` is a `ConcurrentDictionary`
with no removal path anywhere in the type.
**Recommendation**
Either accept the leak explicitly and document the expected bounded cardinality
@@ -223,7 +249,17 @@ At minimum, remove the semaphore entry when an instance is deleted
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending): `OperationLockManager` now ref-counts each
lock entry. A reference is reserved (creating the entry if needed) before the
`SemaphoreSlim.WaitAsync`, so concurrent waiters for the same instance share one
semaphore and the entry survives until every waiter/holder has released. When
the reference count reaches zero — on release, timeout, or cancellation — the
entry is removed from the dictionary and the semaphore is `Dispose()`d, so the
process no longer accumulates one kernel wait handle per distinct instance name.
A `TrackedLockCount` diagnostic property was added to make reclamation testable.
Regression tests: `AcquireAsync_ReleasedLock_RemovesSemaphoreEntry`,
`AcquireAsync_ManyDistinctInstances_DoesNotAccumulateSemaphores`,
`AcquireAsync_ContendedLock_KeepsSemaphoreUntilLastReleaseThenReclaims`.
### DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented
@@ -294,7 +330,7 @@ stale-rejection) when the query fails. Regression tests:
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406` |
**Description**
@@ -308,6 +344,12 @@ added/removed/changed detail is produced, and the Template Engine's diff
capability is not invoked. The UI cannot render a meaningful diff from this
result.
**Verification:** Confirmed against source. The Template Engine already provides
`DiffService` + `ConfigurationDiff` (structured Added/Removed/Changed entries
for attributes, alarms, and scripts, including data connection binding fields),
and `DiffService` is DI-registered — it was simply never wired into the
Deployment Manager's comparison path.
**Recommendation**
Either implement a real diff (deserialize the stored
@@ -318,7 +360,15 @@ down to staleness detection only.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending): `GetDeploymentComparisonAsync` now
deserializes the stored `DeployedConfigSnapshot.ConfigurationJson` and runs the
Template Engine `DiffService` against the freshly flattened current
configuration, attaching the resulting `ConfigurationDiff` (added/removed/changed
attributes, alarms, scripts) to a new optional `Diff` property on
`DeploymentComparisonResult`. `DiffService` is injected into `DeploymentService`.
A snapshot that cannot be deserialized (corrupt / older schema) still yields the
hash-based staleness result with a null diff, logged at warning level.
Regression test: `GetDeploymentComparisonAsync_ProducesStructuredDiff`.
### DeploymentManager-008 — `DeploymentManagerOptions` is never bound to configuration
@@ -326,7 +376,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14` |
**Description**
@@ -341,6 +391,9 @@ to options classes (Options pattern)." `Host/Program.cs` binds
`SecurityOptions` and `InboundApiOptions` from configuration sections but has
no equivalent for `DeploymentManagerOptions`.
**Verification:** Confirmed against source. Neither `AddDeploymentManager` nor
`Host/Program.cs` binds `DeploymentManagerOptions`.
**Recommendation**
Add an `IConfiguration` parameter (or a configure callback) to
@@ -349,7 +402,18 @@ Add an `IConfiguration` parameter (or a configure callback) to
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending): added an
`AddDeploymentManager(IServiceCollection, IConfiguration)` overload that binds
`DeploymentManagerOptions` to the `ScadaLink:DeploymentManager` configuration
section (exposed as `ServiceCollectionExtensions.OptionsSection`). The
`Microsoft.Extensions.Options.ConfigurationExtensions` package was added to the
project. The original parameterless overload is retained for callers/tests that
do not bind configuration. Regression tests:
`AddDeploymentManager_WithConfiguration_BindsDeploymentManagerOptions`,
`AddDeploymentManager_WithConfiguration_MissingSection_UsesDefaults`. Note: a
one-line follow-up in `Host/Program.cs` (call the new overload with
`builder.Configuration`) is required to take effect at runtime — that file is
outside this module's edit scope and is surfaced for the Host owner.
### DeploymentManager-009 — Misleading timeout comment on `DeleteInstanceAsync`
@@ -415,7 +479,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Status | Resolved |
| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199` |
**Description**
@@ -432,6 +496,13 @@ critical post-response branch (`DeploymentService.cs:154-184`) and the entire
delete/disable/enable success path are untested. The `AuditLogs` test
(lines 277-289) asserts nothing.
**Verification:** Partially confirmed. By the time this finding was being
resolved, the DeploymentManager-006 fix had already introduced a TestKit-actor
seam (`CreateServiceWithCommActor` + `ReconcileProbeActor`) and successful-deploy
tests. The genuinely-still-missing coverage was: successful Disable/Enable/Delete
paths, per-instance lock serialization during deploy, and the assertionless
`AuditLogs` test — those gaps were addressed.
**Recommendation**
Introduce a seam to inject a fake/substitute communication path (e.g. an
@@ -442,7 +513,20 @@ test assert on `IAuditService.LogAsync`.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending): extended the TestKit-actor seam
(`ReconcileProbeActor` now also answers lifecycle commands) and added the
missing coverage — successful Disable/Enable/Delete (state transition + audit
assertions), a successful-deploy audit assertion, and per-instance lock
serialization via a new deferred-reply `SerializationProbeActor` that asserts a
single instance's concurrent deploys never overlap. The assertionless `AuditLogs`
test was replaced with `DeployInstanceAsync_FlatteningFails_DoesNotReachAudit`,
which asserts on `IAuditService.LogAsync`. Regression tests:
`DisableInstanceAsync_SiteSucceeds_SetsDisabledStateAndAudits`,
`EnableInstanceAsync_SiteSucceeds_SetsEnabledStateAndAudits`,
`DeleteInstanceAsync_SiteSucceeds_RemovesRecordAndAudits`,
`DeployInstanceAsync_SiteSucceeds_WritesDeployAuditEntry`,
`DeployInstanceAsync_FlatteningFails_DoesNotReachAudit`,
`DeployInstanceAsync_SameInstance_OperationLockSerializesConcurrentDeploys`.
### DeploymentManager-012 — `LifecycleCommandTimeout` option is dead code