fix(deployment-manager): resolve DeploymentManager-003..011 — atomic status commit, orphan-delete handling, semaphore reclamation, structured diff, options binding, lifecycle test coverage
@@ -8,7 +8,7 @@
 | Last reviewed | 2026-05-16 |
 | Reviewer | claude-agent |
 | Commit reviewed | `9c60592` |
-| Open findings | 11 |
+| Open findings | 5 |

 ## Summary

@@ -134,7 +134,7 @@ error) if persistence still fails. Regression test:
 |--|--|
 | Severity | Medium |
 | Category | Error handling & resilience |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170` |

 **Description**
@@ -150,6 +150,11 @@ deployment. Central and site are now divergent: the site is running the new
 config but central still shows the old state and a non-`Success` deployment
 record.

+**Verification:** Confirmed against source. The DeploymentManager-001 fix made
+this strictly worse, not better — after that fix a snapshot-store failure is
+caught and the record is flipped from `Success` back to `Failed`, so central
+reports a *failed* deployment while the site is running the new config.
+
 **Recommendation**

 Wrap the post-success persistence so that, at minimum, the deployment record's
@@ -160,7 +165,15 @@ apply.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending): `DeployInstanceAsync` now commits the
+deployment record's terminal status (`UpdateDeploymentRecordAsync` +
+`SaveChangesAsync`) immediately after the site confirms the apply, *before*
+touching instance state or the deployed-config snapshot. The post-success
+instance-state update and `StoreDeployedSnapshotAsync` are wrapped in a
+best-effort `try`/`catch` that logs loudly for operator reconciliation but no
+longer flips the already-committed `Success` record back to `Failed`.
+Regression test:
+`DeployInstanceAsync_SiteSucceeds_SnapshotWriteFails_RecordStillCommittedSuccess`.
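
Review note: the ordering described in the DeploymentManager-003 resolution can be sketched roughly as below. This is a minimal illustration under assumptions, not the code in `DeploymentService.cs`: `IDeploymentStore`, `CommitStatusAsync`, `StoreSnapshotAsync`, and `FinishSuccessfulDeployAsync` are hypothetical names introduced only for this example.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the real repository and snapshot store.
// Only the ordering of the calls is the point of this sketch.
public interface IDeploymentStore
{
    Task CommitStatusAsync(string deploymentId, string status);
    Task StoreSnapshotAsync(string deploymentId, string configJson);
}

public static class DeployFlow
{
    public static async Task<string> FinishSuccessfulDeployAsync(
        IDeploymentStore store,
        string deploymentId,
        string configJson,
        Action<string> logError)
    {
        // 1. The site has confirmed the apply, so the terminal status is
        //    committed first, before any other post-success persistence.
        await store.CommitStatusAsync(deploymentId, "Success");

        // 2. Best-effort follow-up work. A failure here is logged for
        //    operator reconciliation and never rewrites the committed status.
        try
        {
            await store.StoreSnapshotAsync(deploymentId, configJson);
        }
        catch (Exception ex)
        {
            logError($"Snapshot write failed for {deploymentId}: {ex.Message}");
        }

        return "Success";
    }
}
```

The design choice is that central's record follows what the site actually did; a later snapshot failure becomes an operator task rather than a false `Failed` status.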

 ### DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config

@@ -168,7 +181,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Error handling & resilience |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319` |

 **Description**
@@ -182,6 +195,10 @@ normal path (the site no longer has it, so a re-issued delete may fail) and is
 permanently orphaned. The design states central must not mark the instance
 deleted until the site confirms — but it does not address the inverse failure.

+**Verification:** Confirmed against source. `DeleteInstanceAsync` has no
+`try`/`catch` around the post-success block, so any exception from
+`DeleteInstanceAsync`/`SaveChangesAsync` escapes uncaught to the caller.
+
 **Recommendation**

 Catch persistence failures in the post-success block and surface a distinct
@@ -191,7 +208,13 @@ idempotent and retryable independently of the site command.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending): the post-success removal in
+`DeleteInstanceAsync` (`DeleteInstanceAsync` + `SaveChangesAsync`) is now
+wrapped in a `try`/`catch`. A persistence failure no longer escapes uncaught —
+it is logged, recorded with a `DeleteOrphaned` audit entry, and surfaced as a
+distinct `Result` failure stating the site deleted the instance but the central
+record is orphaned and must be reconciled. Regression test:
+`DeleteInstanceAsync_SiteSucceeds_CentralDeleteFails_ReturnsDistinctFailure`.

 ### DeploymentManager-005 — `OperationLockManager` leaks a `SemaphoreSlim` per instance name

@@ -199,7 +222,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Performance & resource management |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33` |

 **Description**
@@ -213,6 +236,9 @@ with the bulk "deploy all out-of-date instances" workflow and instances that
 are created and deleted over time — this is an unbounded leak of both managed
 memory and OS handles. Deleted instances' semaphores are never reclaimed.

+**Verification:** Confirmed against source. `_locks` is a `ConcurrentDictionary`
+with no removal path anywhere in the type.
+
 **Recommendation**

 Either accept the leak explicitly and document the expected bounded cardinality
@@ -223,7 +249,17 @@ At minimum, remove the semaphore entry when an instance is deleted

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending): `OperationLockManager` now ref-counts each
+lock entry. A reference is reserved (creating the entry if needed) before the
+`SemaphoreSlim.WaitAsync`, so concurrent waiters for the same instance share one
+semaphore and the entry survives until every waiter/holder has released. When
+the reference count reaches zero — on release, timeout, or cancellation — the
+entry is removed from the dictionary and the semaphore is `Dispose()`d, so the
+process no longer accumulates one kernel wait handle per distinct instance name.
+A `TrackedLockCount` diagnostic property was added to make reclamation testable.
+Regression tests: `AcquireAsync_ReleasedLock_RemovesSemaphoreEntry`,
+`AcquireAsync_ManyDistinctInstances_DoesNotAccumulateSemaphores`,
+`AcquireAsync_ContendedLock_KeepsSemaphoreUntilLastReleaseThenReclaims`.
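
Review note: the ref-counting scheme in the DeploymentManager-005 resolution can be sketched as below. This is one possible shape, assuming a plain `Dictionary` guarded by a lock; it is not the actual `OperationLockManager` code. Apart from `TrackedLockCount` and `AcquireAsync`, which the resolution and test names mention, every name here (`RefCountedKeyedLock`, `Entry`, `Releaser`, `DropReference`) is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class RefCountedKeyedLock
{
    private sealed class Entry
    {
        public readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);
        public int RefCount; // waiters plus the current holder; guarded by _gate
    }

    private readonly Dictionary<string, Entry> _entries = new Dictionary<string, Entry>();
    private readonly object _gate = new object();

    // Diagnostic: number of keys that currently have a live semaphore.
    public int TrackedLockCount
    {
        get { lock (_gate) { return _entries.Count; } }
    }

    public async Task<IDisposable> AcquireAsync(
        string key, TimeSpan timeout, CancellationToken ct = default)
    {
        Entry entry;
        lock (_gate)
        {
            // Reserve a reference before waiting, so the entry cannot be
            // reclaimed while this caller is queued on the semaphore.
            if (!_entries.TryGetValue(key, out var found))
            {
                found = new Entry();
                _entries[key] = found;
            }
            found.RefCount++;
            entry = found;
        }

        try
        {
            if (!await entry.Semaphore.WaitAsync(timeout, ct))
                throw new TimeoutException($"Timed out waiting for lock '{key}'.");
        }
        catch
        {
            // Timeout or cancellation: give back the reserved reference.
            DropReference(key, entry);
            throw;
        }

        return new Releaser(this, key, entry);
    }

    private void Release(string key, Entry entry)
    {
        entry.Semaphore.Release();
        DropReference(key, entry);
    }

    private void DropReference(string key, Entry entry)
    {
        lock (_gate)
        {
            // The last reference removes the entry and disposes the semaphore.
            if (--entry.RefCount == 0)
            {
                _entries.Remove(key);
                entry.Semaphore.Dispose();
            }
        }
    }

    private sealed class Releaser : IDisposable
    {
        private readonly RefCountedKeyedLock _owner;
        private readonly string _key;
        private readonly Entry _entry;
        private int _disposed;

        public Releaser(RefCountedKeyedLock owner, string key, Entry entry)
        {
            _owner = owner;
            _key = key;
            _entry = entry;
        }

        public void Dispose()
        {
            // Idempotent: only the first Dispose releases the lock.
            if (Interlocked.Exchange(ref _disposed, 1) == 0)
                _owner.Release(_key, _entry);
        }
    }
}
```

Incrementing the count under the same lock that guards removal is what keeps a queued waiter from racing with reclamation: an entry is only disposed once no holder or waiter references it.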

 ### DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented

@@ -294,7 +330,7 @@ stale-rejection) when the query fails. Regression tests:
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406` |

 **Description**
@@ -308,6 +344,12 @@ added/removed/changed detail is produced, and the Template Engine's diff
 capability is not invoked. The UI cannot render a meaningful diff from this
 result.

+**Verification:** Confirmed against source. The Template Engine already provides
+`DiffService` + `ConfigurationDiff` (structured Added/Removed/Changed entries
+for attributes, alarms, and scripts, including data connection binding fields),
+and `DiffService` is DI-registered — it was simply never wired into the
+Deployment Manager's comparison path.
+
 **Recommendation**

 Either implement a real diff (deserialize the stored
@@ -318,7 +360,15 @@ down to staleness detection only.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending): `GetDeploymentComparisonAsync` now
+deserializes the stored `DeployedConfigSnapshot.ConfigurationJson` and runs the
+Template Engine `DiffService` against the freshly flattened current
+configuration, attaching the resulting `ConfigurationDiff` (added/removed/changed
+attributes, alarms, scripts) to a new optional `Diff` property on
+`DeploymentComparisonResult`. `DiffService` is injected into `DeploymentService`.
+A snapshot that cannot be deserialized (corrupt / older schema) still yields the
+hash-based staleness result with a null diff, logged at warning level.
+Regression test: `GetDeploymentComparisonAsync_ProducesStructuredDiff`.

 ### DeploymentManager-008 — `DeploymentManagerOptions` is never bound to configuration

@@ -326,7 +376,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Code organization & conventions |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14` |

 **Description**
@@ -341,6 +391,9 @@ to options classes (Options pattern)." `Host/Program.cs` binds
 `SecurityOptions` and `InboundApiOptions` from configuration sections but has
 no equivalent for `DeploymentManagerOptions`.

+**Verification:** Confirmed against source. Neither `AddDeploymentManager` nor
+`Host/Program.cs` binds `DeploymentManagerOptions`.
+
 **Recommendation**

 Add an `IConfiguration` parameter (or a configure callback) to
@@ -349,7 +402,18 @@ Add an `IConfiguration` parameter (or a configure callback) to

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending): added an
+`AddDeploymentManager(IServiceCollection, IConfiguration)` overload that binds
+`DeploymentManagerOptions` to the `ScadaLink:DeploymentManager` configuration
+section (exposed as `ServiceCollectionExtensions.OptionsSection`). The
+`Microsoft.Extensions.Options.ConfigurationExtensions` package was added to the
+project. The original parameterless overload is retained for callers/tests that
+do not bind configuration. Regression tests:
+`AddDeploymentManager_WithConfiguration_BindsDeploymentManagerOptions`,
+`AddDeploymentManager_WithConfiguration_MissingSection_UsesDefaults`. Note: a
+one-line follow-up in `Host/Program.cs` (call the new overload with
+`builder.Configuration`) is required to take effect at runtime — that file is
+outside this module's edit scope and is surfaced for the Host owner.
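
Review note: a binding overload of the kind described in the DeploymentManager-008 resolution might look like the sketch below. It assumes `OptionsSection` is a `const string` and uses a made-up `ExampleTimeoutSeconds` property, since the real `DeploymentManagerOptions` members and the service registrations are not shown in this diff. `Configure<TOptions>(IConfiguration)` comes from the `Microsoft.Extensions.Options.ConfigurationExtensions` package named above.

```csharp
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

// Illustrative options type; the property is hypothetical.
public sealed class DeploymentManagerOptions
{
    public int ExampleTimeoutSeconds { get; set; } = 30;
}

public static class ServiceCollectionExtensions
{
    // Assumed to be a constant; the resolution only says it is exposed here.
    public const string OptionsSection = "ScadaLink:DeploymentManager";

    // Parameterless overload, kept for callers that do not bind configuration.
    public static IServiceCollection AddDeploymentManager(this IServiceCollection services)
    {
        // Real service registrations are elided in this sketch.
        return services;
    }

    // Overload that binds the options to the configuration section.
    // A missing section leaves the property defaults in place.
    public static IServiceCollection AddDeploymentManager(
        this IServiceCollection services, IConfiguration configuration)
    {
        services.Configure<DeploymentManagerOptions>(
            configuration.GetSection(OptionsSection));
        return services.AddDeploymentManager();
    }
}
```

Under these assumptions, the Host follow-up mentioned in the resolution would be a call along the lines of `builder.Services.AddDeploymentManager(builder.Configuration);` in `Host/Program.cs`.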

 ### DeploymentManager-009 — Misleading timeout comment on `DeleteInstanceAsync`

@@ -415,7 +479,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Testing coverage |
-| Status | Open |
+| Status | Resolved |
 | Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199` |

 **Description**
@@ -432,6 +496,13 @@ critical post-response branch (`DeploymentService.cs:154-184`) and the entire
 delete/disable/enable success path are untested. The `AuditLogs` test
 (lines 277-289) asserts nothing.

+**Verification:** Partially confirmed. By the time this finding was being
+resolved, the DeploymentManager-006 fix had already introduced a TestKit-actor
+seam (`CreateServiceWithCommActor` + `ReconcileProbeActor`) and successful-deploy
+tests. The genuinely-still-missing coverage was: successful Disable/Enable/Delete
+paths, per-instance lock serialization during deploy, and the assertionless
+`AuditLogs` test — those gaps were addressed.
+
 **Recommendation**

 Introduce a seam to inject a fake/substitute communication path (e.g. an
@@ -442,7 +513,20 @@ test assert on `IAuditService.LogAsync`.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending): extended the TestKit-actor seam
+(`ReconcileProbeActor` now also answers lifecycle commands) and added the
+missing coverage — successful Disable/Enable/Delete (state transition + audit
+assertions), a successful-deploy audit assertion, and per-instance lock
+serialization via a new deferred-reply `SerializationProbeActor` that asserts a
+single instance's concurrent deploys never overlap. The assertionless `AuditLogs`
+test was replaced with `DeployInstanceAsync_FlatteningFails_DoesNotReachAudit`,
+which asserts on `IAuditService.LogAsync`. Regression tests:
+`DisableInstanceAsync_SiteSucceeds_SetsDisabledStateAndAudits`,
+`EnableInstanceAsync_SiteSucceeds_SetsEnabledStateAndAudits`,
+`DeleteInstanceAsync_SiteSucceeds_RemovesRecordAndAudits`,
+`DeployInstanceAsync_SiteSucceeds_WritesDeployAuditEntry`,
+`DeployInstanceAsync_FlatteningFails_DoesNotReachAudit`,
+`DeployInstanceAsync_SameInstance_OperationLockSerializesConcurrentDeploys`.

 ### DeploymentManager-012 — `LifecycleCommandTimeout` option is dead code
