Deployment Manager

The Deployment Manager is the central-side pipeline that takes a validated, flattened instance configuration from the Template Engine, ships it to a site via the Communication Layer, and tracks the result. It also handles instance lifecycle commands (disable, enable, delete) and system-wide artifact distribution to all connected sites.

Overview

Deployment Manager (#2) runs exclusively on the central cluster. The site-side counterpart — the Deployment Manager singleton inside Site Runtime — receives and applies what central sends; that actor's design is covered in Site Runtime (#3).

The component code lives in src/ZB.MOM.WW.ScadaBridge.DeploymentManager/:

  • DeploymentService — per-instance deploy, disable, enable, delete, diff, and status queries.
  • ArtifactDeploymentService — system-wide artifact broadcast and per-site retry.
  • FlatteningPipeline — wraps the Template Engine's FlatteningService, ValidationService, and RevisionHashService into a single call used by DeploymentService.
  • OperationLockManager — ref-counted per-instance SemaphoreSlim(1,1) that serialises all mutating operations on one instance.
  • StateTransitionValidator — encodes the allowed state-transition matrix for InstanceState.
  • DeploymentStatusNotifier — singleton in-process event broadcaster that pushes DeploymentStatusChange to the Central UI's Blazor circuits instead of letting them poll.

Registration entry point: ServiceCollectionExtensions.AddDeploymentManager. Options are bound from ScadaBridge:DeploymentManager in appsettings.json.

Key Concepts

Deployment identity

Every instance deployment carries two correlated identifiers:

  • DeploymentId — a new Guid (formatted "N") minted by DeploymentService at the start of each DeployInstanceAsync call.
  • RevisionHash — computed by the Template Engine's RevisionHashService over the fully resolved FlattenedConfiguration. The hash captures the template state at the moment of flattening, so concurrent last-write-wins template edits do not affect an in-flight deployment.

The pair travels inside DeployInstanceCommand to the site. The site uses the DeploymentId to detect an already-applied identical command (idempotent re-delivery) and uses the RevisionHash to reject a stale configuration that predates what is already running.

Central stores the RevisionHash on DeploymentRecord and, after a confirmed success, on DeployedConfigSnapshot. Comparing the snapshot hash against the current-template hash determines whether an instance is stale without a site round-trip.
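As a minimal sketch of the two identifiers, the snippet below mints a DeploymentId in "N" format and performs the central-side staleness comparison. The type and method names (DeploymentIdentity, DeploymentIdentitySketch, IsStale) are hypothetical, and the revision hash is treated as an opaque string because the hashing algorithm used by RevisionHashService is not described here.

```csharp
#nullable enable
using System;

// Hypothetical helper types for illustration; not the component's actual API.
public sealed record DeploymentIdentity(string DeploymentId, string RevisionHash);

public static class DeploymentIdentitySketch
{
    // Mints a fresh DeploymentId in "N" format: 32 hex digits, no hyphens or braces.
    public static DeploymentIdentity Create(string revisionHash) =>
        new(Guid.NewGuid().ToString("N"), revisionHash);

    // An instance is stale when the hash stored on the last successful snapshot
    // differs from the hash of the freshly flattened template state.
    // A null snapshot hash means no confirmed deployment exists yet, which this
    // sketch (an assumption) does not count as stale.
    public static bool IsStale(string? snapshotHash, string currentHash) =>
        snapshotHash is not null
        && !string.Equals(snapshotHash, currentHash, StringComparison.Ordinal);
}
```

Because both hashes are already held centrally, the comparison needs no site round-trip.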

Per-instance operation lock

OperationLockManager holds a Dictionary<string, LockEntry> keyed by instance UniqueName. Each LockEntry wraps a SemaphoreSlim(1,1) with a reference count, so the semaphore is created on first use and disposed once the last holder or waiter has released it. The lock covers all four mutating operations (deploy, disable, enable, delete), so they can never interleave on a single instance. Operations on different instances proceed in parallel.

Lock acquisition throws TimeoutException after DeploymentManagerOptions.OperationLockTimeout (default 5 s). The operation lock is in-memory and is therefore lost on a central failover; the design treats any in-progress deployment at failover time as failed.
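The ref-counted pattern described above can be sketched as follows. This is an illustration under assumptions, not the actual OperationLockManager: the class name KeyedLockSketch, the Releaser handle, and the exact cleanup order are all hypothetical.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch only; the real OperationLockManager may differ.
public sealed class KeyedLockSketch
{
    private sealed class LockEntry
    {
        public readonly SemaphoreSlim Semaphore = new(1, 1);
        public int RefCount; // holders plus waiters; guarded by _gate
    }

    private readonly object _gate = new();
    private readonly Dictionary<string, LockEntry> _entries = new(StringComparer.Ordinal);

    public async Task<IDisposable> AcquireAsync(
        string key, TimeSpan timeout, CancellationToken cancellationToken = default)
    {
        LockEntry entry;
        lock (_gate)
        {
            // Create the entry on first use and count this caller as a holder or waiter.
            if (!_entries.TryGetValue(key, out var existing))
            {
                existing = new LockEntry();
                _entries[key] = existing;
            }
            entry = existing;
            entry.RefCount++;
        }

        bool acquired;
        try
        {
            acquired = await entry.Semaphore.WaitAsync(timeout, cancellationToken);
        }
        catch
        {
            // Cancelled while waiting: give back the reference taken above.
            Release(key, entry, releaseSemaphore: false);
            throw;
        }

        if (!acquired)
        {
            Release(key, entry, releaseSemaphore: false);
            throw new TimeoutException($"Timed out waiting for the operation lock on '{key}'.");
        }

        return new Releaser(this, key, entry);
    }

    private void Release(string key, LockEntry entry, bool releaseSemaphore)
    {
        if (releaseSemaphore) entry.Semaphore.Release();
        lock (_gate)
        {
            // Remove and dispose the semaphore once nobody holds or waits on it.
            if (--entry.RefCount == 0)
            {
                _entries.Remove(key);
                entry.Semaphore.Dispose();
            }
        }
    }

    private sealed class Releaser : IDisposable
    {
        private readonly KeyedLockSketch _owner;
        private readonly string _key;
        private readonly LockEntry _entry;
        private int _disposed;

        public Releaser(KeyedLockSketch owner, string key, LockEntry entry) =>
            (_owner, _key, _entry) = (owner, key, entry);

        public void Dispose()
        {
            // Guard against a double Dispose releasing the semaphore twice.
            if (Interlocked.Exchange(ref _disposed, 1) == 0)
                _owner.Release(_key, _entry, releaseSemaphore: true);
        }
    }
}
```

Counting waiters as well as the holder is what makes disposal safe: the semaphore is only disposed when the count reaches zero, so no pending WaitAsync can observe a disposed semaphore.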

State transition rules

StateTransitionValidator enforces the following matrix:

| InstanceState | Deploy | Disable | Enable | Delete |
| --- | --- | --- | --- | --- |
| NotDeployed | Yes | No | No | Yes |
| Enabled | Yes | Yes | No | Yes |
| Disabled | Yes* | No | Yes | Yes |

* Deploying from Disabled transitions the instance to Enabled on confirmed success.
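The matrix above can be encoded compactly as a switch over (state, operation) pairs. The sketch below is one possible encoding; the enum and class names are hypothetical stand-ins (the real InstanceState is owned by Commons), and the actual StateTransitionValidator may be structured differently.

```csharp
using System;

// Hypothetical stand-ins for illustration; the real InstanceState lives in Commons.
public enum InstanceState { NotDeployed, Enabled, Disabled }
public enum InstanceOperation { Deploy, Disable, Enable, Delete }

public static class StateTransitionSketch
{
    // Returns true when the operation is allowed from the given state,
    // following the transition matrix row by row.
    public static bool IsAllowed(InstanceState state, InstanceOperation operation) =>
        (state, operation) switch
        {
            (_, InstanceOperation.Deploy) => true,                      // Deploy column is all Yes
            (_, InstanceOperation.Delete) => true,                      // Delete column is all Yes
            (InstanceState.Enabled, InstanceOperation.Disable) => true, // only Enabled can be disabled
            (InstanceState.Disabled, InstanceOperation.Enable) => true, // only Disabled can be enabled
            _ => false,
        };
}
```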

Optimistic concurrency on deployment status

DeploymentRecord carries a RowVersion byte[] column. EF Core uses this as an optimistic-concurrency token on every UPDATE and DELETE. A concurrent write to the same record surfaces as DbUpdateConcurrencyException rather than silently overwriting the peer's state.
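To make the row-version check concrete, here is an in-memory illustration of the same idea. It deliberately does not use EF Core: the VersionedStoreSketch class and its members are hypothetical, and a numeric version stands in for the byte[] RowVersion. EF Core performs the equivalent comparison by including the concurrency token in the WHERE clause of the UPDATE or DELETE it generates.

```csharp
using System;
using System.Collections.Generic;

// In-memory illustration of optimistic concurrency; not the EF Core implementation.
public sealed class VersionedStoreSketch
{
    private readonly Dictionary<int, (string Status, long Version)> _rows = new();

    // Inserts a row at version 1 and returns that version to the caller.
    public long Insert(int id, string status)
    {
        _rows[id] = (status, 1);
        return 1;
    }

    // Succeeds only if the caller still holds the version it originally read.
    public bool TryUpdate(int id, string newStatus, long expectedVersion, out long newVersion)
    {
        newVersion = 0;
        if (!_rows.TryGetValue(id, out var row) || row.Version != expectedVersion)
            return false; // a peer wrote first; EF Core would throw DbUpdateConcurrencyException here

        newVersion = row.Version + 1;
        _rows[id] = (newStatus, newVersion);
        return true;
    }
}
```

Two writers that both read version 1 illustrate the behaviour: the first update succeeds and bumps the version, and the second is rejected instead of overwriting the first.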

Failover and in-progress deployments

The operation lock is in-memory. If the active central node fails mid-deployment, the new active node has no lock and no knowledge of what the site received. The DeploymentRecord is left InProgress (or Failed if the failure path ran before the node died). Before allowing a re-deploy, DeploymentService calls TryReconcileWithSiteAsync, which queries the site for its currently-applied revision hash and reconciles rather than re-sending if the site already has the target revision.

Architecture

Instance deploy pipeline

DeployInstanceAsync executes the following sequence:

  1. Load and validate state: loads the Instance from IDeploymentManagerRepository and checks the transition via StateTransitionValidator.
  2. Acquire operation lock: OperationLockManager.AcquireAsync blocks competing operations on the same instance.
  3. Flatten and validate: IFlatteningPipeline.FlattenAndValidateAsync runs the Template Engine pipeline and returns a FlatteningPipelineResult containing the FlattenedConfiguration, RevisionHash, and a ValidationResult. Semantic validation failures (call targets, argument types, trigger operand types, connection binding completeness) are returned to the caller before any record is written.
  4. Pre-deploy site reconciliation: when the prior DeploymentRecord for the instance is InProgress or Failed with a timeout marker ("Communication failure:"), the service queries the site via CommunicationService.QueryDeploymentStateAsync. If the site already holds the target revision hash, the prior record is updated to Success and no new deployment is sent.
  5. Write InProgress record: a single DeploymentRecord insert directly at InProgress status (no transient Pending hop). IDeploymentStatusNotifier.NotifyStatusChanged fires to push the status to the UI.
  6. Send DeployInstanceCommand: the command carries DeploymentId, InstanceUniqueName, RevisionHash, FlattenedConfigurationJson, DeployedBy, and Timestamp.
  7. Commit terminal status: the DeploymentRecord is updated to Success or Failed and saved before any post-success side effects run. This ordering ensures the recorded outcome can never be lost if a post-success write fails.
  8. Post-success side effects: ApplyPostSuccessSideEffectsAsync sets Instance.State = Enabled (or preserves Disabled on the reconciliation path) and upserts the DeployedConfigSnapshot. These writes are best-effort: a failure here is logged at Error but does not flip the already-committed Success record back to Failed.
  9. Audit log: IAuditService.LogAsync records Deploy / DeployFailed / DeployReconciled with the DeploymentId, status, and user.

Any exception in the site round-trip (steps 6 and 7) writes DeploymentStatus.Failed using CancellationToken.None so a cancelled outer token cannot prevent the failure record from being persisted:

// DeploymentService.DeployInstanceAsync — exception handler
var isTimeout = ex is TimeoutException or OperationCanceledException;

record.Status = DeploymentStatus.Failed;
record.ErrorMessage = isTimeout
    ? $"{TimeoutFailurePrefix} {ex.Message}"
    : $"Deployment error: {ex.Message}";
record.CompletedAt = DateTimeOffset.UtcNow;

await _repository.UpdateDeploymentRecordAsync(record, CancellationToken.None);
await _repository.SaveChangesAsync(CancellationToken.None);
NotifyStatusChange(record);

The TimeoutFailurePrefix constant ("Communication failure:") is the marker that ShouldQuerySiteBeforeRedeploy checks on the next deploy attempt.

Pre-deploy site reconciliation

TryReconcileWithSiteAsync is invoked only when a prior deployment record exists and ShouldQuerySiteBeforeRedeploy returns true:

private static bool ShouldQuerySiteBeforeRedeploy(DeploymentRecord prior) =>
    prior.Status == DeploymentStatus.InProgress
    || (prior.Status == DeploymentStatus.Failed
        && prior.ErrorMessage != null
        && prior.ErrorMessage.StartsWith(TimeoutFailurePrefix, StringComparison.Ordinal));

If the site responds that it is running the target RevisionHash, the stale prior record is updated to Success (with the hash corrected to the target), ApplyPostSuccessSideEffectsAsync runs with forceEnabledState: false to avoid undoing an intentional disable, and the caller receives the reconciled record. A query failure falls through to a normal deploy; the site's own stale-rejection logic is the safety net.
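The decision at the heart of reconciliation reduces to a single comparison, sketched below. The ReconcileOutcome enum and Decide method are hypothetical names, and representing a failed or empty site query as a null hash is an assumption made for this illustration.

```csharp
#nullable enable
using System;

// Hypothetical names for illustration; not the service's actual types.
public enum ReconcileOutcome { MarkPriorSuccess, SendNewDeployment }

public static class ReconcileSketch
{
    // siteRevisionHash is null when the site query failed or the site reported
    // no applied revision (an assumption of this sketch).
    public static ReconcileOutcome Decide(string? siteRevisionHash, string targetRevisionHash) =>
        siteRevisionHash is not null
        && string.Equals(siteRevisionHash, targetRevisionHash, StringComparison.Ordinal)
            ? ReconcileOutcome.MarkPriorSuccess   // site already runs the target: fix up the prior record
            : ReconcileOutcome.SendNewDeployment; // otherwise fall through to a normal deploy
}
```

Treating a query failure the same as a hash mismatch keeps the fall-through path simple, since the site's own stale-rejection logic remains the safety net.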

Deployed config snapshot and diff

DeployedConfigSnapshot is a one-per-instance row that stores the DeploymentId, RevisionHash, and the full FlattenedConfiguration JSON as of the last confirmed success. DeploymentService.GetDeploymentComparisonAsync re-flattens the current template state, compares the hash, and feeds both configs to DiffService.ComputeDiff if the hashes differ, producing a ConfigurationDiff with added, removed, and changed attributes, alarms, scripts, and connection bindings.
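The added/removed/changed classification can be illustrated on a single category of items. The sketch below diffs two name-to-value maps; it is a simplified stand-in, not DiffService.ComputeDiff itself, and the DictionaryDiff record and DiffSketch class are hypothetical. The real ConfigurationDiff covers attributes, alarms, scripts, and connection bindings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical simplified result type; the real ConfigurationDiff is richer.
public sealed record DictionaryDiff(
    IReadOnlyList<string> Added,
    IReadOnlyList<string> Removed,
    IReadOnlyList<string> Changed);

public static class DiffSketch
{
    public static DictionaryDiff Compute(
        IReadOnlyDictionary<string, string> deployed,
        IReadOnlyDictionary<string, string> current)
    {
        // Present now but absent from the deployed snapshot.
        var added = current.Keys
            .Where(k => !deployed.ContainsKey(k))
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();

        // Present in the deployed snapshot but absent now.
        var removed = deployed.Keys
            .Where(k => !current.ContainsKey(k))
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();

        // Present on both sides with a different value.
        var changed = current
            .Where(kv => deployed.TryGetValue(kv.Key, out var old)
                         && !string.Equals(old, kv.Value, StringComparison.Ordinal))
            .Select(kv => kv.Key)
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();

        return new DictionaryDiff(added, removed, changed);
    }
}
```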

Artifact deployment

ArtifactDeploymentService.DeployToAllSitesAsync deploys the full system-wide artifact set to every site in parallel. It fetches system-wide artifacts (shared scripts, external systems with serialised methods, database connections, notification lists, SMTP configurations) once via FetchGlobalArtifactsAsync before the per-site loop, so the same data is not re-queried for each of the N sites. Per-site data connections are fetched inside each per-site command build because they legitimately vary per site.

All per-site DeployArtifactsCommand messages share one DeploymentId so the audit log, UI summary, and persisted SystemArtifactDeploymentRecord all reference the same logical deployment. Each site runs under a cts.CancelAfter(ArtifactDeploymentTimeoutPerSite) linked source. Successful sites are never rolled back on other failures; individual failed sites are retryable via RetryForSiteAsync.

// ArtifactDeploymentService — parallel per-site dispatch
var tasks = sites.Select(async site =>
{
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    cts.CancelAfter(_options.ArtifactDeploymentTimeoutPerSite);

    var command = siteCommands[site.Id];
    var response = await _communicationService.DeployArtifactsAsync(
        site.SiteIdentifier, command, cts.Token);

    return new SiteArtifactResult(
        site.SiteIdentifier, site.Name, response.Success, response.ErrorMessage);
}).ToList();

Cross-site artifact version skew is supported by design: a site that missed an artifact deployment continues operating with its current versions until an operator retries.
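A sketch of the full fan-out, including how per-site failures might be captured so that one site cannot fault the whole batch. The names here are hypothetical (SiteResultSketch, ArtifactFanOutSketch, and the sendAsync delegate standing in for the transport call), and mapping exceptions to failed results inside each task is an assumption about how "successful sites are never rolled back" could be achieved.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical result type for illustration.
public sealed record SiteResultSketch(string SiteIdentifier, bool Success, string? ErrorMessage);

public static class ArtifactFanOutSketch
{
    public static async Task<IReadOnlyList<SiteResultSketch>> DeployToAllAsync(
        IEnumerable<string> siteIdentifiers,
        Func<string, CancellationToken, Task<bool>> sendAsync, // hypothetical transport call
        TimeSpan perSiteTimeout,
        CancellationToken cancellationToken = default)
    {
        var tasks = siteIdentifiers.Select(async site =>
        {
            // Each site gets its own deadline, linked to the caller's token.
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            cts.CancelAfter(perSiteTimeout);
            try
            {
                var ok = await sendAsync(site, cts.Token);
                return new SiteResultSketch(site, ok, ok ? null : "Site reported failure.");
            }
            catch (OperationCanceledException)
            {
                return new SiteResultSketch(site, false, "Timed out or cancelled.");
            }
            catch (Exception ex)
            {
                // Convert the fault into a per-site failure so other sites are unaffected.
                return new SiteResultSketch(site, false, ex.Message);
            }
        }).ToList();

        // No task can fault here because every exception was mapped to a result above.
        return await Task.WhenAll(tasks);
    }
}
```

Task.WhenAll preserves the order of the input tasks, so results line up with the order of the site list.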

Status notification

DeploymentStatusNotifier is a DI singleton that exposes event Action<DeploymentStatusChange>? StatusChanged. DeploymentService calls NotifyStatusChanged at every point a DeploymentRecord status is written. The Central UI's deployment page subscribes at render time and re-renders over its Blazor Server SignalR circuit without polling. Each subscriber is invoked individually inside a try/catch so a disposed Blazor circuit cannot break the deployment pipeline.
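Invoking each subscriber individually, as described above, can be done by walking the delegate's invocation list. The following is a minimal sketch with hypothetical type names; the real notifier presumably logs through the application's logger rather than the console.

```csharp
#nullable enable
using System;

// Hypothetical payload type for illustration.
public sealed record StatusChangeSketch(string DeploymentId, string Status);

public sealed class StatusNotifierSketch
{
    public event Action<StatusChangeSketch>? StatusChanged;

    public void NotifyStatusChanged(StatusChangeSketch change)
    {
        // Snapshot the delegate so subscribe/unsubscribe during dispatch is safe.
        var handlers = StatusChanged;
        if (handlers is null) return;

        foreach (var handler in handlers.GetInvocationList())
        {
            try
            {
                ((Action<StatusChangeSketch>)handler)(change);
            }
            catch (Exception ex)
            {
                // One failing subscriber (for example a disposed UI circuit)
                // must not stop the remaining subscribers or the caller.
                Console.Error.WriteLine($"Status subscriber failed: {ex.Message}");
            }
        }
    }
}
```

Invoking the multicast delegate directly would stop at the first throwing subscriber; iterating the invocation list is what isolates them from one another.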

Usage

DeploymentService and ArtifactDeploymentService are scoped services, typically resolved by ManagementService actor handlers (triggered by MgmtDeployArtifactsCommand, GetDeploymentDiffCommand, and the instance lifecycle commands) or directly by Central UI Blazor components. Engineers interact through the Central UI; automated bulk operations (deploy all stale instances) decompose into individual DeployInstanceAsync calls.

Lifecycle commands (DisableInstanceAsync, EnableInstanceAsync, DeleteInstanceAsync) follow the same lock-then-command pattern as deploy, with LifecycleCommandTimeout applied as a linked CancellationTokenSource deadline:

// DeploymentService — lifecycle command pattern (disable shown)
using var lockHandle = await _lockManager.AcquireAsync(
    instance.UniqueName, _options.OperationLockTimeout, cancellationToken);

using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
cts.CancelAfter(_options.LifecycleCommandTimeout);
response = await _communicationService.DisableInstanceAsync(siteId, command, cts.Token);

A timeout on a lifecycle command writes a DisableTimedOut / EnableTimedOut / DeleteTimedOut audit entry via TryLogLifecycleTimeoutAsync using CancellationToken.None, mirroring the DeployFailed audit pattern. The Instance state in the central DB is updated only after the site confirms success; a timeout leaves the DB state unchanged.

Delete is stricter than disable/enable: if the site confirms but the central DeleteInstanceAsync repository call subsequently fails, the instance record is orphaned. The service logs at Error, records a DeleteOrphaned audit entry, and returns a descriptive failure so an operator can reconcile — it does not retry automatically.

Configuration

Options are registered via AddDeploymentManager and bound from ScadaBridge:DeploymentManager.

| Key | Default | Description |
| --- | --- | --- |
| OperationLockTimeout | 00:00:05 | Maximum wait for the per-instance operation lock before throwing TimeoutException. |
| LifecycleCommandTimeout | 00:00:30 | Maximum round-trip for a disable, enable, or delete command before the operation is declared timed out. |
| ArtifactDeploymentTimeoutPerSite | 00:02:00 | Per-site deadline for a DeployArtifactsCommand response. Sites exceeding this are recorded as failed; others are unaffected. |
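For orientation, an options class matching the keys and defaults above might look like the sketch below. The property names and defaults come from the table; the class name DeploymentManagerOptionsSketch and the use of settable TimeSpan properties are assumptions about the shape of DeploymentManagerOptions.

```csharp
using System;

// Sketch of an options class with the documented defaults; the real
// DeploymentManagerOptions may differ in shape.
public sealed class DeploymentManagerOptionsSketch
{
    // Maximum wait for the per-instance operation lock.
    public TimeSpan OperationLockTimeout { get; set; } = TimeSpan.FromSeconds(5);

    // Maximum round-trip for a disable, enable, or delete command.
    public TimeSpan LifecycleCommandTimeout { get; set; } = TimeSpan.FromSeconds(30);

    // Per-site deadline for a DeployArtifactsCommand response.
    public TimeSpan ArtifactDeploymentTimeoutPerSite { get; set; } = TimeSpan.FromMinutes(2);
}
```

The .NET configuration binder converts strings such as "00:00:05" into TimeSpan values, so overrides under ScadaBridge:DeploymentManager can use the same hh:mm:ss notation as the table.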

Dependencies & Interactions

  • Template Engine (#1): FlatteningPipeline delegates to FlatteningService, ValidationService, and RevisionHashService. Template state is captured at flatten time; last-write-wins edits made after flatten do not affect the in-flight deployment. DiffService.ComputeDiff powers the deployment diff view.
  • Configuration Database (#17): owns the EF Core implementation of IDeploymentManagerRepository, which stores DeploymentRecord, DeployedConfigSnapshot, and SystemArtifactDeploymentRecord. IAuditService (also registered by the Configuration Database component) writes all deployment audit rows.
  • Central-Site Communication (#5): CommunicationService provides DeployInstanceAsync, QueryDeploymentStateAsync, DeployArtifactsAsync, DisableInstanceAsync, EnableInstanceAsync, and DeleteInstanceAsync. The communication layer routes by SiteIdentifier (string), not by database id, so DeploymentService.ResolveSiteIdentifierAsync resolves the numeric SiteId to its SiteIdentifier before each cross-cluster call and treats a missing site row as a hard failure.
  • Commons (#16): owns DeploymentRecord, DeployedConfigSnapshot, SystemArtifactDeploymentRecord, DeploymentStatus, InstanceState, DeployInstanceCommand, DeployArtifactsCommand, DeploymentStateQueryRequest/Response, InstanceLifecycleResponse, and the IDeploymentManagerRepository interface.
  • Site Runtime (#3): receives DeployInstanceCommand and DeployArtifactsCommand via the Communication Layer. Site-side apply is all-or-nothing per instance: the Deployment Manager singleton at the site stores the config, compiles all scripts, and creates or replaces the Instance Actor as a unit. A failure at any step is reported back with the specific error message and the previous configuration remains active.
  • Central UI (#9): engineers trigger deployments, view diffs, manage instance lifecycle, and deploy system-wide artifacts through the UI. The deployment status page subscribes to IDeploymentStatusNotifier.StatusChanged for real-time push updates via Blazor Server SignalR.
  • Management Service (#18): the actor-layer entry point for deployment commands received over ClusterClient. It resolves DeploymentService and ArtifactDeploymentService from a per-message DI scope and forwards MgmtDeployArtifactsCommand, GetDeploymentDiffCommand, and instance lifecycle requests.
  • Security & Auth (#10): the Deployment role is required for all deploy and artifact operations; site-scoped permissions are enforced by the Central UI and Management Service before commands reach DeploymentService.

Troubleshooting

An instance is stuck InProgress after a central failover

The operation lock is in-memory. On failover the new active node has no lock entry, and the deployment record remains InProgress. When the engineer issues a re-deploy, TryReconcileWithSiteAsync queries the site; if the site already applied the config the record is updated to Success without re-sending. If the site did not apply it, a new deployment proceeds. No manual DB edits are required in the normal failover case.

A deployment record shows Failed with "Communication failure:"

The site round-trip timed out or was cancelled before a response arrived. The site may or may not have applied the config. On the next deploy attempt the reconciliation query determines the ground truth. If the query also fails (site unreachable), a new DeployInstanceCommand is sent; the site rejects it with "already applied" if it ran the previous one.

DeleteOrphaned audit entry

The site destroyed the Instance Actor but the central DB removal failed. The instance record exists in the central DB but has no corresponding site actor. It cannot be deleted through the normal UI path (the site will reject the delete command because the instance does not exist). Reconcile by removing the central record directly via the Management API or database, referencing the CommandId in the audit entry.

Artifact deployment partially failed

DeployToAllSitesAsync returns an ArtifactDeploymentSummary with per-site SiteArtifactResult. Failed sites do not block or roll back successful ones. Use RetryForSiteAsync when the failed site is reachable again; it re-fetches all global artifacts and re-sends to the single site.