Deployment Manager
The Deployment Manager is the central-side pipeline that takes a validated, flattened instance configuration from the Template Engine, ships it to a site via the Communication Layer, and tracks the result. It also handles the instance lifecycle commands (disable, enable, delete) and system-wide artifact distribution to all connected sites.
Overview
Deployment Manager (#2) runs exclusively on the central cluster. The site-side counterpart — the Deployment Manager singleton inside Site Runtime — receives and applies what central sends; that actor's design is covered in Site Runtime (#3).
The component code lives in src/ZB.MOM.WW.ScadaBridge.DeploymentManager/:
- DeploymentService: per-instance deploy, disable, enable, delete, diff, and status queries.
- ArtifactDeploymentService: system-wide artifact broadcast and per-site retry.
- FlatteningPipeline: wraps the Template Engine's FlatteningService, ValidationService, and RevisionHashService into a single call used by DeploymentService.
- OperationLockManager: ref-counted per-instance SemaphoreSlim(1,1) that serialises all mutating operations on one instance.
- StateTransitionValidator: encodes the allowed state-transition matrix for InstanceState.
- DeploymentStatusNotifier: singleton in-process event broadcaster that pushes DeploymentStatusChange to the Central UI's Blazor circuits instead of letting them poll.
Registration entry point: ServiceCollectionExtensions.AddDeploymentManager. Options are bound from ScadaBridge:DeploymentManager in appsettings.json.
Key Concepts
Deployment identity
Every instance deployment carries two correlated identifiers:
- DeploymentId: a new Guid (formatted "N") minted by DeploymentService at the start of each DeployInstanceAsync call.
- RevisionHash: computed by the Template Engine's RevisionHashService over the fully resolved FlattenedConfiguration. The hash captures the template state at the moment of flattening, so concurrent last-write-wins template edits do not affect an in-flight deployment.
The pair travels inside DeployInstanceCommand to the site. The site uses the DeploymentId to detect an already-applied identical command (idempotent re-delivery) and uses the RevisionHash to reject a stale configuration that predates what is already running.
Central stores the RevisionHash on DeploymentRecord and, after a confirmed success, on DeployedConfigSnapshot. Comparing the snapshot hash against the current-template hash determines whether an instance is stale without a site round-trip.
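The identity pair and the staleness check described above can be sketched as follows. This is a minimal illustration, not the actual DeploymentService code: the DeploymentIdentity class and its method names are hypothetical, and the ordinal string comparison is an assumption, since the real hash format is not shown here.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the real component.
public static class DeploymentIdentity
{
    // A new Guid formatted "N": 32 hex digits with no hyphens or braces.
    public static string NewDeploymentId() => Guid.NewGuid().ToString("N");

    // An instance is stale when the hash stored on its last successful snapshot
    // differs from the hash of the freshly flattened current template state.
    // Assumption: hashes are compared as ordinal strings.
    public static bool IsStale(string snapshotRevisionHash, string currentRevisionHash) =>
        !string.Equals(snapshotRevisionHash, currentRevisionHash, StringComparison.Ordinal);
}
```

Because both hashes are available on the central side, a check like this needs no call to the site.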
Per-instance operation lock
OperationLockManager holds a Dictionary<string, LockEntry> keyed by instance UniqueName. Each LockEntry wraps a SemaphoreSlim(1,1) with a reference count so the semaphore is created on first contention and disposed when the last waiter clears. The lock covers all four mutating operations — deploy, disable, enable, delete — so they can never interleave on a single instance. Operations on different instances proceed in parallel.
Lock acquisition throws TimeoutException after DeploymentManagerOptions.OperationLockTimeout (default 5 s). The operation lock is in-memory and is therefore lost on a central failover; the design treats any in-progress deployment at failover time as failed.
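A ref-counted keyed lock of the kind described above might look like the sketch below. It is an assumption-laden illustration rather than the real OperationLockManager: the class name KeyedOperationLock, the Entry and Releaser types, and the ActiveKeyCount property are all hypothetical. Only SemaphoreSlim(1,1), the reference counting, and the TimeoutException come from the description.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch; all type and member names here are hypothetical.
public sealed class KeyedOperationLock
{
    private sealed class Entry
    {
        public readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);
        public int RefCount; // holders plus waiters
    }

    private readonly Dictionary<string, Entry> _entries =
        new Dictionary<string, Entry>(StringComparer.Ordinal);
    private readonly object _gate = new object();

    public int ActiveKeyCount { get { lock (_gate) { return _entries.Count; } } }

    public async Task<IDisposable> AcquireAsync(
        string key, TimeSpan timeout, CancellationToken cancellationToken = default)
    {
        Entry entry;
        lock (_gate)
        {
            if (!_entries.TryGetValue(key, out var found))
            {
                found = new Entry();
                _entries[key] = found;
            }
            entry = found;
            entry.RefCount++;
        }

        bool acquired;
        try
        {
            acquired = await entry.Semaphore.WaitAsync(timeout, cancellationToken);
        }
        catch
        {
            Unref(key, entry); // cancelled while waiting
            throw;
        }

        if (!acquired)
        {
            Unref(key, entry);
            throw new TimeoutException($"Timed out waiting for the operation lock on '{key}'.");
        }
        return new Releaser(this, key, entry);
    }

    // Drops one reference; the last reference removes and disposes the semaphore.
    private void Unref(string key, Entry entry)
    {
        lock (_gate)
        {
            if (--entry.RefCount == 0)
            {
                _entries.Remove(key);
                entry.Semaphore.Dispose();
            }
        }
    }

    private sealed class Releaser : IDisposable
    {
        private readonly KeyedOperationLock _owner;
        private readonly string _key;
        private readonly Entry _entry;
        private int _disposed;

        public Releaser(KeyedOperationLock owner, string key, Entry entry)
        {
            _owner = owner; _key = key; _entry = entry;
        }

        public void Dispose()
        {
            if (Interlocked.Exchange(ref _disposed, 1) == 1) return; // idempotent
            _entry.Semaphore.Release();
            _owner.Unref(_key, _entry);
        }
    }
}
```

In this sketch a second caller on the same key waits up to the timeout and then fails, while a caller on a different key is never blocked.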
State transition rules
StateTransitionValidator enforces the following matrix:
| InstanceState | Deploy | Disable | Enable | Delete |
|---|---|---|---|---|
| NotDeployed | Yes | No | No | Yes |
| Enabled | Yes | Yes | No | Yes |
| Disabled | Yes* | No | Yes | Yes |
* Deploying from Disabled transitions the instance to Enabled on confirmed success.
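The matrix can be encoded compactly as a pattern match. The sketch below is illustrative only: the InstanceState members are inferred from the table rows, while the InstanceOperation enum and the TransitionMatrix class are hypothetical names, not the real StateTransitionValidator API.

```csharp
using System;

// Members inferred from the table above; the real enum may differ.
public enum InstanceState { NotDeployed, Enabled, Disabled }

// Hypothetical enum naming the four mutating operations.
public enum InstanceOperation { Deploy, Disable, Enable, Delete }

public static class TransitionMatrix
{
    public static bool IsAllowed(InstanceState state, InstanceOperation operation) =>
        (state, operation) switch
        {
            // Deploy and Delete are allowed from every state.
            (_, InstanceOperation.Deploy) => true,
            (_, InstanceOperation.Delete) => true,
            // Disable only makes sense for an enabled instance, and vice versa.
            (InstanceState.Enabled, InstanceOperation.Disable) => true,
            (InstanceState.Disabled, InstanceOperation.Enable) => true,
            _ => false,
        };
}
```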
Optimistic concurrency on deployment status
DeploymentRecord carries a RowVersion byte[] column. EF Core uses this as an optimistic-concurrency token on every UPDATE and DELETE. A concurrent write to the same record surfaces as DbUpdateConcurrencyException rather than silently overwriting the peer's state.
Failover and in-progress deployments
The operation lock is in-memory. If the active central node fails mid-deployment, the new active node has no lock and no knowledge of what the site received. The DeploymentRecord is left InProgress (or Failed if the failure path ran before the node died). Before allowing a re-deploy, DeploymentService calls TryReconcileWithSiteAsync, which queries the site for its currently-applied revision hash and reconciles rather than re-sending if the site already has the target revision.
Architecture
Instance deploy pipeline
DeployInstanceAsync executes the following sequence:
1. Load and validate state: loads the Instance from IDeploymentManagerRepository and checks the transition via StateTransitionValidator.
2. Acquire operation lock: OperationLockManager.AcquireAsync blocks competing operations on the same instance.
3. Flatten and validate: IFlatteningPipeline.FlattenAndValidateAsync runs the Template Engine pipeline and returns a FlatteningPipelineResult containing the FlattenedConfiguration, RevisionHash, and a ValidationResult. Semantic validation failures (call targets, argument types, trigger operand types, connection binding completeness) are returned to the caller before any record is written.
4. Pre-deploy site reconciliation: when the prior DeploymentRecord for the instance is InProgress, or Failed with a timeout marker ("Communication failure:"), the service queries the site via CommunicationService.QueryDeploymentStateAsync. If the site already holds the target revision hash, the prior record is updated to Success and no new deployment is sent.
5. Write InProgress record: a single DeploymentRecord insert directly at InProgress status (no transient Pending hop). IDeploymentStatusNotifier.NotifyStatusChanged fires to push the status to the UI.
6. Send DeployInstanceCommand: the command carries DeploymentId, InstanceUniqueName, RevisionHash, FlattenedConfigurationJson, DeployedBy, and Timestamp.
7. Commit terminal status: the DeploymentRecord is updated to Success or Failed and saved before any post-success side effects run. This ordering ensures the recorded outcome can never be lost if a post-success write fails.
8. Post-success side effects: ApplyPostSuccessSideEffectsAsync sets Instance.State = Enabled (or preserves Disabled on the reconciliation path) and upserts the DeployedConfigSnapshot. These writes are best-effort: a failure here is logged at Error but does not flip the already-committed Success record back to Failed.
9. Audit log: IAuditService.LogAsync records Deploy / DeployFailed / DeployReconciled with the DeploymentId, status, and user.
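The ordering between committing the terminal status and running the post-success side effects can be sketched as below. This is a simplified illustration under assumptions: the TerminalStatusOrdering class and the three delegates are hypothetical stand-ins for the repository save, ApplyPostSuccessSideEffectsAsync, and the logger.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of "commit first, then best-effort side effects".
public static class TerminalStatusOrdering
{
    // Returns true when the side effects also succeeded, false when they failed
    // after the outcome had already been committed.
    public static async Task<bool> CommitThenApplyAsync(
        Func<Task> commitSuccess,
        Func<Task> applySideEffects,
        Action<Exception> logError)
    {
        // A failure here propagates: nothing has been recorded as Success yet.
        await commitSuccess();

        try
        {
            await applySideEffects();
            return true;
        }
        catch (Exception ex)
        {
            // Best-effort: log and keep the committed Success outcome.
            logError(ex);
            return false;
        }
    }
}
```

The point of this shape is that a side-effect failure can only be observed after the Success record is durable, so it cannot overwrite the recorded outcome.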
Any exception in the site round-trip (steps 6–7) writes DeploymentStatus.Failed using CancellationToken.None so a cancelled outer token cannot prevent the failure record from being persisted:
```csharp
// DeploymentService.DeployInstanceAsync: exception handler
var isTimeout = ex is TimeoutException or OperationCanceledException;
record.Status = DeploymentStatus.Failed;
record.ErrorMessage = isTimeout
    ? $"{TimeoutFailurePrefix} {ex.Message}"
    : $"Deployment error: {ex.Message}";
record.CompletedAt = DateTimeOffset.UtcNow;
// CancellationToken.None: the failure record must persist even if the caller's token is cancelled.
await _repository.UpdateDeploymentRecordAsync(record, CancellationToken.None);
await _repository.SaveChangesAsync(CancellationToken.None);
NotifyStatusChange(record);
```
The TimeoutFailurePrefix constant ("Communication failure:") is the marker that ShouldQuerySiteBeforeRedeploy checks on the next deploy attempt.
Pre-deploy site reconciliation
TryReconcileWithSiteAsync is invoked only when a prior deployment record exists and ShouldQuerySiteBeforeRedeploy returns true:
```csharp
private static bool ShouldQuerySiteBeforeRedeploy(DeploymentRecord prior) =>
    prior.Status == DeploymentStatus.InProgress
    || (prior.Status == DeploymentStatus.Failed
        && prior.ErrorMessage != null
        && prior.ErrorMessage.StartsWith(TimeoutFailurePrefix, StringComparison.Ordinal));
```
If the site responds that it is running the target RevisionHash, the stale prior record is updated to Success (with the hash corrected to the target), ApplyPostSuccessSideEffectsAsync runs with forceEnabledState: false to avoid undoing an intentional disable, and the caller receives the reconciled record. A query failure falls through to a normal deploy; the site's own stale-rejection logic is the safety net.
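The outcome of the reconciliation query reduces to a small decision, sketched below. The ReconcileDecision enum and ReconcileSketch class are hypothetical, and representing a failed or empty site query as a null hash is an assumption made for the example.

```csharp
#nullable enable
using System;

// Hypothetical names; the real method returns a reconciled record instead.
public enum ReconcileDecision { MarkPriorSuccess, SendNewDeployment }

public static class ReconcileSketch
{
    // siteRevisionHash is null when the site query failed or the site
    // reported no applied revision (an assumption of this sketch).
    public static ReconcileDecision Decide(string? siteRevisionHash, string targetRevisionHash) =>
        siteRevisionHash is not null
        && string.Equals(siteRevisionHash, targetRevisionHash, StringComparison.Ordinal)
            ? ReconcileDecision.MarkPriorSuccess
            : ReconcileDecision.SendNewDeployment;
}
```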
Deployed config snapshot and diff
DeployedConfigSnapshot is a one-per-instance row that stores the DeploymentId, RevisionHash, and the full FlattenedConfiguration JSON as of the last confirmed success. DeploymentService.GetDeploymentComparisonAsync re-flattens the current template state, compares the hash, and feeds both configs to DiffService.ComputeDiff if the hashes differ, producing a ConfigurationDiff with added, removed, and changed attributes, alarms, scripts, and connection bindings.
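To illustrate the added, removed, and changed categories, here is a minimal keyed diff over two name-to-value maps. It is not DiffService.ComputeDiff: the KeyedDiff record and KeyedDiffer class are hypothetical, and flattening each configuration section to a string dictionary is an assumption made to keep the example small.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical result type: names only, sorted for stable output.
public sealed record KeyedDiff(
    IReadOnlyList<string> Added,
    IReadOnlyList<string> Removed,
    IReadOnlyList<string> Changed);

public static class KeyedDiffer
{
    public static KeyedDiff Compute(
        IReadOnlyDictionary<string, string> deployed,
        IReadOnlyDictionary<string, string> current)
    {
        // Present now, absent in the deployed snapshot.
        var added = current.Keys.Where(k => !deployed.ContainsKey(k))
            .OrderBy(k => k, StringComparer.Ordinal).ToList();

        // Present in the deployed snapshot, absent now.
        var removed = deployed.Keys.Where(k => !current.ContainsKey(k))
            .OrderBy(k => k, StringComparer.Ordinal).ToList();

        // Present on both sides with a different value.
        var changed = current
            .Where(kv => deployed.TryGetValue(kv.Key, out var old)
                         && !string.Equals(old, kv.Value, StringComparison.Ordinal))
            .Select(kv => kv.Key)
            .OrderBy(k => k, StringComparer.Ordinal).ToList();

        return new KeyedDiff(added, removed, changed);
    }
}
```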
Artifact deployment
ArtifactDeploymentService.DeployToAllSitesAsync deploys the full system-wide artifact set to every site in parallel. It fetches system-wide artifacts (shared scripts, external systems with serialised methods, database connections, notification lists, SMTP configurations) once via FetchGlobalArtifactsAsync before the per-site loop, so the same global data is not re-queried for each of the N sites. Per-site data connections are fetched inside each per-site command build because they legitimately vary per site.
All per-site DeployArtifactsCommand messages share one DeploymentId so the audit log, UI summary, and persisted SystemArtifactDeploymentRecord all reference the same logical deployment. Each site runs under a cts.CancelAfter(ArtifactDeploymentTimeoutPerSite) linked source. Successful sites are never rolled back on other failures; individual failed sites are retryable via RetryForSiteAsync.
```csharp
// ArtifactDeploymentService: parallel per-site dispatch
var tasks = sites.Select(async site =>
{
    // Per-site deadline, linked to the caller's token.
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    cts.CancelAfter(_options.ArtifactDeploymentTimeoutPerSite);
    var command = siteCommands[site.Id];
    var response = await _communicationService.DeployArtifactsAsync(
        site.SiteIdentifier, command, cts.Token);
    return new SiteArtifactResult(
        site.SiteIdentifier, site.Name, response.Success, response.ErrorMessage);
}).ToList();
```
Cross-site artifact version skew is supported by design: a site that missed an artifact deployment continues operating with its current versions until an operator retries.
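One way to keep a slow or failing site from affecting the others is to turn each per-site exception or timeout into a failed result before awaiting the batch. The sketch below shows that idea in isolation. The SiteResult record, the PerSiteFanOut class, the send delegate, and the "Timed out" message are hypothetical, and whether the real service handles exceptions this way is not stated in this document.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for SiteArtifactResult.
public sealed record SiteResult(string SiteIdentifier, bool Success, string? ErrorMessage);

public static class PerSiteFanOut
{
    public static async Task<IReadOnlyList<SiteResult>> RunAsync(
        IEnumerable<string> siteIdentifiers,
        Func<string, CancellationToken, Task> send,
        TimeSpan perSiteTimeout,
        CancellationToken cancellationToken = default)
    {
        var tasks = siteIdentifiers.Select(async site =>
        {
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            cts.CancelAfter(perSiteTimeout);
            try
            {
                await send(site, cts.Token);
                return new SiteResult(site, true, null);
            }
            catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
            {
                // The per-site deadline fired, not the caller's token.
                return new SiteResult(site, false, "Timed out");
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                return new SiteResult(site, false, ex.Message);
            }
        }).ToList();

        // Every task completes with a result unless the caller cancelled.
        return await Task.WhenAll(tasks);
    }
}
```

With this shape, cancellation by the caller still propagates as an exception, while a single site's timeout or error is reported as one failed entry in the summary.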
Status notification
DeploymentStatusNotifier is a DI singleton that exposes event Action<DeploymentStatusChange>? StatusChanged. DeploymentService calls NotifyStatusChanged at every point a DeploymentRecord status is written. The Central UI's deployment page subscribes at render time and re-renders over its Blazor Server SignalR circuit without polling. Each subscriber is invoked individually inside a try/catch so a disposed Blazor circuit cannot break the deployment pipeline.
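Invoking each subscriber individually can be done by walking the delegate's invocation list, as in the sketch below. The StatusChange record and IsolatedNotifier class are hypothetical stand-ins for DeploymentStatusChange and DeploymentStatusNotifier, and returning a delivered count is an addition made only so the behaviour is observable.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for DeploymentStatusChange.
public sealed record StatusChange(string DeploymentId, string Status);

public sealed class IsolatedNotifier
{
    public event Action<StatusChange>? StatusChanged;

    // Returns how many subscribers handled the change without throwing.
    public int Notify(StatusChange change)
    {
        var handlers = StatusChanged; // snapshot; safe against concurrent unsubscribe
        if (handlers is null) return 0;

        var delivered = 0;
        foreach (var handler in handlers.GetInvocationList())
        {
            try
            {
                ((Action<StatusChange>)handler).Invoke(change);
                delivered++;
            }
            catch (Exception)
            {
                // A faulty subscriber (for example a disposed UI circuit) is skipped;
                // a real implementation would log the exception here.
            }
        }
        return delivered;
    }
}
```

Raising the event directly would stop at the first throwing subscriber, which is why the invocation list is iterated by hand.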
Usage
DeploymentService and ArtifactDeploymentService are scoped services, typically resolved by ManagementService actor handlers (triggered by MgmtDeployArtifactsCommand, GetDeploymentDiffCommand, and the instance lifecycle commands) or directly by Central UI Blazor components. Engineers interact through the Central UI; automated bulk operations (deploy all stale instances) decompose into individual DeployInstanceAsync calls.
Lifecycle commands (DisableInstanceAsync, EnableInstanceAsync, DeleteInstanceAsync) follow the same lock-then-command pattern as deploy, with LifecycleCommandTimeout applied as a linked CancellationTokenSource deadline:
```csharp
// DeploymentService: lifecycle command pattern (disable shown)
using var lockHandle = await _lockManager.AcquireAsync(
    instance.UniqueName, _options.OperationLockTimeout, cancellationToken);
// Deadline for the site round-trip, linked to the caller's token.
using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
cts.CancelAfter(_options.LifecycleCommandTimeout);
response = await _communicationService.DisableInstanceAsync(siteId, command, cts.Token);
```
A timeout on a lifecycle command writes a DisableTimedOut / EnableTimedOut / DeleteTimedOut audit entry via TryLogLifecycleTimeoutAsync using CancellationToken.None, mirroring the DeployFailed audit pattern. The Instance state in the central DB is updated only after the site confirms success; a timeout leaves the DB state unchanged.
Delete is stricter than disable/enable: if the site confirms but the central DeleteInstanceAsync repository call subsequently fails, the instance record is orphaned. The service logs at Error, records a DeleteOrphaned audit entry, and returns a descriptive failure so an operator can reconcile — it does not retry automatically.
Configuration
Options are registered via AddDeploymentManager and bound from ScadaBridge:DeploymentManager.
| Key | Default | Description |
|---|---|---|
| OperationLockTimeout | 00:00:05 | Maximum wait for the per-instance operation lock before throwing TimeoutException. |
| LifecycleCommandTimeout | 00:00:30 | Maximum round-trip for a disable, enable, or delete command before the operation is declared timed out. |
| ArtifactDeploymentTimeoutPerSite | 00:02:00 | Per-site deadline for a DeployArtifactsCommand response. Sites exceeding this are recorded as failed; others are unaffected. |
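For orientation, the three keys map naturally onto a plain options class with TimeSpan properties. The sketch below mirrors the documented keys and defaults; the class name DeploymentManagerOptionsSketch is hypothetical, and the real DeploymentManagerOptions may carry additional members or attributes.

```csharp
using System;

// Hypothetical mirror of the documented options and their defaults.
public sealed class DeploymentManagerOptionsSketch
{
    // Maximum wait for the per-instance operation lock.
    public TimeSpan OperationLockTimeout { get; set; } = TimeSpan.FromSeconds(5);

    // Maximum round-trip for a disable, enable, or delete command.
    public TimeSpan LifecycleCommandTimeout { get; set; } = TimeSpan.FromSeconds(30);

    // Per-site deadline for a DeployArtifactsCommand response.
    public TimeSpan ArtifactDeploymentTimeoutPerSite { get; set; } = TimeSpan.FromMinutes(2);
}
```

The .NET configuration binder converts strings in the hh:mm:ss form shown in the table into TimeSpan values, so an override under ScadaBridge:DeploymentManager can be written in the same format as the defaults.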
Dependencies & Interactions
- Template Engine (#1): FlatteningPipeline delegates to FlatteningService, ValidationService, and RevisionHashService. Template state is captured at flatten time; last-write-wins edits made after flatten do not affect the in-flight deployment. DiffService.ComputeDiff powers the deployment diff view.
- Configuration Database (#17): owns the EF Core implementation of IDeploymentManagerRepository, which stores DeploymentRecord, DeployedConfigSnapshot, and SystemArtifactDeploymentRecord. IAuditService (also registered by the Configuration Database component) writes all deployment audit rows.
- Central–Site Communication (#5): CommunicationService provides DeployInstanceAsync, QueryDeploymentStateAsync, DeployArtifactsAsync, DisableInstanceAsync, EnableInstanceAsync, and DeleteInstanceAsync. The communication layer routes by SiteIdentifier (string), not DB id; DeploymentService.ResolveSiteIdentifierAsync resolves the numeric SiteId to its SiteIdentifier before each cross-cluster call and treats a missing site row as a hard failure.
- Commons (#16): owns DeploymentRecord, DeployedConfigSnapshot, SystemArtifactDeploymentRecord, DeploymentStatus, InstanceState, DeployInstanceCommand, DeployArtifactsCommand, DeploymentStateQueryRequest/Response, InstanceLifecycleResponse, and the IDeploymentManagerRepository interface.
- Site Runtime (#3): receives DeployInstanceCommand and DeployArtifactsCommand via the Communication Layer. Site-side apply is all-or-nothing per instance: the Deployment Manager singleton at the site stores the config, compiles all scripts, and creates or replaces the Instance Actor as a unit. A failure at any step is reported back with the specific error message and the previous configuration remains active.
- Central UI (#9): engineers trigger deployments, view diffs, manage instance lifecycle, and deploy system-wide artifacts through the UI. The deployment status page subscribes to IDeploymentStatusNotifier.StatusChanged for real-time push updates via Blazor Server SignalR.
- Management Service (#18): the actor-layer entry point for deployment commands received over ClusterClient. It resolves DeploymentService and ArtifactDeploymentService from a per-message DI scope and forwards MgmtDeployArtifactsCommand, GetDeploymentDiffCommand, and instance lifecycle requests.
- Security & Auth (#10): the Deployment role is required for all deploy and artifact operations; site-scoped permissions are enforced by the Central UI and Management Service before commands reach DeploymentService.
Troubleshooting
An instance is stuck InProgress after a central failover
The operation lock is in-memory. On failover the new active node has no lock entry, and the deployment record remains InProgress. When the engineer issues a re-deploy, TryReconcileWithSiteAsync queries the site; if the site already applied the config the record is updated to Success without re-sending. If the site did not apply it, a new deployment proceeds. No manual DB edits are required in the normal failover case.
A deployment record shows Failed with "Communication failure:"
The site round-trip timed out or was cancelled before a response arrived. The site may or may not have applied the config. On the next deploy attempt the reconciliation query determines the ground truth. If the query also fails (site unreachable), a new DeployInstanceCommand is sent; the site rejects it with "already applied" if it ran the previous one.
DeleteOrphaned audit entry
The site destroyed the Instance Actor but the central DB removal failed. The instance record exists in the central DB but has no corresponding site actor. It cannot be deleted through the normal UI path (the site will reject the delete command because the instance does not exist). Reconcile by removing the central record directly via the Management API or database, referencing the CommandId in the audit entry.
Artifact deployment partially failed
DeployToAllSitesAsync returns an ArtifactDeploymentSummary with per-site SiteArtifactResult. Failed sites do not block or roll back successful ones. Use RetryForSiteAsync when the failed site is reachable again; it re-fetches all global artifacts and re-sends to the single site.