# Deployment Manager
The Deployment Manager is the central-side pipeline that takes a validated, flattened instance configuration from the Template Engine, ships it to a site via the Communication Layer, and tracks the result. It also handles the full set of instance lifecycle commands and distributes system-wide artifacts to all connected sites.
## Overview
Deployment Manager (#2) runs exclusively on the central cluster. The site-side counterpart — the Deployment Manager singleton inside Site Runtime — receives and applies what central sends; that actor's design is covered in Site Runtime (#3).
The component code lives in `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/`:
- `DeploymentService` — per-instance deploy, disable, enable, delete, diff, and status queries.
- `ArtifactDeploymentService` — system-wide artifact broadcast and per-site retry.
- `FlatteningPipeline` — wraps the Template Engine's `FlatteningService`, `ValidationService`, and `RevisionHashService` into a single call used by `DeploymentService`.
- `OperationLockManager` — ref-counted per-instance `SemaphoreSlim(1,1)` that serialises all mutating operations on one instance.
- `StateTransitionValidator` — encodes the allowed state-transition matrix for `InstanceState`.
- `DeploymentStatusNotifier` — singleton in-process event broadcaster that pushes `DeploymentStatusChange` to the Central UI's Blazor circuits instead of letting them poll.
Registration entry point: `ServiceCollectionExtensions.AddDeploymentManager`. Options are bound from `ScadaBridge:DeploymentManager` in `appsettings.json`.
## Key Concepts
### Deployment identity
Every instance deployment carries two correlated identifiers:
- **`DeploymentId`** — a new `Guid` (formatted `"N"`) minted by `DeploymentService` at the start of each `DeployInstanceAsync` call.
- **`RevisionHash`** — computed by the Template Engine's `RevisionHashService` over the fully resolved `FlattenedConfiguration`. The hash captures the template state at the moment of flattening, so concurrent last-write-wins template edits do not affect an in-flight deployment.
The pair travels inside `DeployInstanceCommand` to the site. The site uses the `DeploymentId` to detect an already-applied identical command (idempotent re-delivery) and uses the `RevisionHash` to reject a stale configuration that predates what is already running.
Central stores the `RevisionHash` on `DeploymentRecord` and, after a confirmed success, on `DeployedConfigSnapshot`. Comparing the snapshot hash against the current-template hash determines whether an instance is stale without a site round-trip.
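
The identity pair and the staleness check can be sketched as follows. This is a minimal illustration, not the project's code: `DeploymentIdentity` and `DeploymentIdentitySketch` are hypothetical names, and treating a missing snapshot as "not stale" is an assumption made only for this example.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the (DeploymentId, RevisionHash) pair described above.
public sealed record DeploymentIdentity(string DeploymentId, string RevisionHash);

public static class DeploymentIdentitySketch
{
    // Guid format "N" yields 32 hexadecimal digits with no hyphens or braces.
    public static DeploymentIdentity Create(string revisionHash) =>
        new DeploymentIdentity(Guid.NewGuid().ToString("N"), revisionHash);

    // An instance is stale when its last confirmed snapshot hash differs from
    // the hash of the current template state. No site round-trip is involved.
    // Assumption: an instance with no snapshot is reported as not stale.
    public static bool IsStale(string? snapshotHash, string currentTemplateHash) =>
        snapshotHash is not null
        && !string.Equals(snapshotHash, currentTemplateHash, StringComparison.Ordinal);
}
```

The comparison is ordinal because a hash is an opaque token; culture-aware comparison would add nothing here.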
### Per-instance operation lock
`OperationLockManager` holds a `Dictionary<string, LockEntry>` keyed by instance `UniqueName`. Each `LockEntry` wraps a `SemaphoreSlim(1,1)` with a reference count so the semaphore is created on first contention and disposed when the last waiter clears. The lock covers all four mutating operations — deploy, disable, enable, delete — so they can never interleave on a single instance. Operations on different instances proceed in parallel.
Lock acquisition throws `TimeoutException` after `DeploymentManagerOptions.OperationLockTimeout` (default 5 s). The operation lock is in-memory and is therefore lost on a central failover; the design treats any in-progress deployment at failover time as failed.
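
A ref-counted keyed lock of the kind described above might look like the sketch below. It is an assumption-laden illustration rather than the real `OperationLockManager`: `KeyedLockSketch`, its members, and the `EntryCount` property are hypothetical, and the actual class may manage entry lifetime differently.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; not the project's OperationLockManager.
public sealed class KeyedLockSketch
{
    private sealed class LockEntry
    {
        public readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);
        public int RefCount; // holders plus waiters; guarded by _gate
    }

    private readonly object _gate = new object();
    private readonly Dictionary<string, LockEntry> _entries = new Dictionary<string, LockEntry>();

    // Exposed only so the cleanup behaviour can be observed.
    public int EntryCount { get { lock (_gate) { return _entries.Count; } } }

    public async Task<IDisposable> AcquireAsync(
        string key, TimeSpan timeout, CancellationToken cancellationToken = default)
    {
        LockEntry entry;
        lock (_gate)
        {
            if (!_entries.TryGetValue(key, out var existing))
            {
                existing = new LockEntry();
                _entries[key] = existing;
            }
            existing.RefCount++;
            entry = existing;
        }

        var acquired = false;
        try
        {
            acquired = await entry.Semaphore.WaitAsync(timeout, cancellationToken);
            if (!acquired)
                throw new TimeoutException($"Timed out waiting for the operation lock on '{key}'.");
            return new Releaser(this, key, entry);
        }
        finally
        {
            // Timeout or cancellation: drop our reference without releasing the semaphore.
            if (!acquired) Unreference(key, entry, held: false);
        }
    }

    private void Unreference(string key, LockEntry entry, bool held)
    {
        if (held) entry.Semaphore.Release();
        lock (_gate)
        {
            // The last holder or waiter removes and disposes the entry.
            if (--entry.RefCount == 0)
            {
                _entries.Remove(key);
                entry.Semaphore.Dispose();
            }
        }
    }

    private sealed class Releaser : IDisposable
    {
        private KeyedLockSketch? _owner;
        private readonly string _key;
        private readonly LockEntry _entry;

        public Releaser(KeyedLockSketch owner, string key, LockEntry entry)
        {
            _owner = owner; _key = key; _entry = entry;
        }

        // Idempotent: a second Dispose is a no-op.
        public void Dispose() =>
            Interlocked.Exchange(ref _owner, null)?.Unreference(_key, _entry, held: true);
    }
}
```

Counting waiters as well as holders is what makes disposal safe: an entry is removed only when nobody holds the semaphore and nobody is queued on it.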
### State transition rules
`StateTransitionValidator` enforces the following matrix:
| `InstanceState` | Deploy | Disable | Enable | Delete |
|-----------------|--------|---------|--------|--------|
| `NotDeployed` | Yes | No | No | Yes |
| `Enabled` | Yes | Yes | No | Yes |
| `Disabled` | Yes* | No | Yes | Yes |
\* Deploying from `Disabled` transitions the instance to `Enabled` on confirmed success.
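
The matrix above can be encoded as a single pattern match. The sketch below is illustrative only: the `InstanceState` members are taken from the table, while `InstanceOperation` and `TransitionMatrixSketch` are hypothetical names that may not match the real `StateTransitionValidator`.

```csharp
using System;

// Members mirror the rows of the table above.
public enum InstanceState { NotDeployed, Enabled, Disabled }

// Hypothetical enum for the four mutating operations (the table columns).
public enum InstanceOperation { Deploy, Disable, Enable, Delete }

public static class TransitionMatrixSketch
{
    public static bool IsAllowed(InstanceState state, InstanceOperation operation) =>
        (state, operation) switch
        {
            // Deploy and Delete are allowed from every state.
            (_, InstanceOperation.Deploy) => true,
            (_, InstanceOperation.Delete) => true,
            // Disable only makes sense for a running instance.
            (InstanceState.Enabled, InstanceOperation.Disable) => true,
            // Enable only makes sense for a disabled instance.
            (InstanceState.Disabled, InstanceOperation.Enable) => true,
            _ => false,
        };
}
```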
### Optimistic concurrency on deployment status
`DeploymentRecord` carries a `RowVersion byte[]` column. EF Core uses this as an optimistic-concurrency token on every `UPDATE` and `DELETE`. A concurrent write to the same record surfaces as `DbUpdateConcurrencyException` rather than silently overwriting the peer's state.
### Failover and in-progress deployments
The operation lock is in-memory. If the active central node fails mid-deployment, the new active node has no lock and no knowledge of what the site received. The `DeploymentRecord` is left `InProgress` (or `Failed` if the failure path ran before the node died). Before allowing a re-deploy, `DeploymentService` calls `TryReconcileWithSiteAsync`, which queries the site for its currently-applied revision hash and reconciles rather than re-sending if the site already has the target revision.
## Architecture
### Instance deploy pipeline
`DeployInstanceAsync` executes the following sequence:
1. **Load and validate state**: loads the `Instance` from `IDeploymentManagerRepository` and checks the transition via `StateTransitionValidator`.
2. **Acquire operation lock**: `OperationLockManager.AcquireAsync` blocks competing operations on the same instance.
3. **Flatten and validate**: `IFlatteningPipeline.FlattenAndValidateAsync` runs the Template Engine pipeline and returns a `FlatteningPipelineResult` containing the `FlattenedConfiguration`, `RevisionHash`, and a `ValidationResult`. Semantic validation failures (call targets, argument types, trigger operand types, connection binding completeness) are returned to the caller before any record is written.
4. **Pre-deploy site reconciliation**: when the prior `DeploymentRecord` for the instance is `InProgress` or `Failed` with a timeout marker (`"Communication failure:"`), the service queries the site via `CommunicationService.QueryDeploymentStateAsync`. If the site already holds the target revision hash, the prior record is updated to `Success` and no new deployment is sent.
5. **Write `InProgress` record**: a single `DeploymentRecord` insert directly at `InProgress` status (no transient `Pending` hop). `IDeploymentStatusNotifier.NotifyStatusChanged` fires to push the status to the UI.
6. **Send `DeployInstanceCommand`**: the command carries `DeploymentId`, `InstanceUniqueName`, `RevisionHash`, `FlattenedConfigurationJson`, `DeployedBy`, and `Timestamp`.
7. **Commit terminal status**: the `DeploymentRecord` is updated to `Success` or `Failed` and saved before any post-success side effects run. This ordering ensures the recorded outcome can never be lost if a post-success write fails.
8. **Post-success side effects**: `ApplyPostSuccessSideEffectsAsync` sets `Instance.State = Enabled` (or preserves `Disabled` on the reconciliation path) and upserts the `DeployedConfigSnapshot`. These writes are best-effort: a failure here is logged at `Error` but does not flip the already-committed `Success` record back to `Failed`.
9. **Audit log**: `IAuditService.LogAsync` records `Deploy` / `DeployFailed` / `DeployReconciled` with the `DeploymentId`, status, and user.
Any exception in the site round-trip (steps 6-7) writes `DeploymentStatus.Failed` using `CancellationToken.None` so a cancelled outer token cannot prevent the failure record from being persisted:
```csharp
// DeploymentService.DeployInstanceAsync — exception handler
var isTimeout = ex is TimeoutException or OperationCanceledException;
record.Status = DeploymentStatus.Failed;
record.ErrorMessage = isTimeout
    ? $"{TimeoutFailurePrefix} {ex.Message}"
    : $"Deployment error: {ex.Message}";
record.CompletedAt = DateTimeOffset.UtcNow;
await _repository.UpdateDeploymentRecordAsync(record, CancellationToken.None);
await _repository.SaveChangesAsync(CancellationToken.None);
NotifyStatusChange(record);
```
The `TimeoutFailurePrefix` constant (`"Communication failure:"`) is the marker that `ShouldQuerySiteBeforeRedeploy` checks on the next deploy attempt.
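
The ordering in steps 7 and 8 of the pipeline (commit the terminal status first, then run best-effort side effects) can be sketched in isolation. Everything here is hypothetical scaffolding: `TerminalStatusSketch` and its delegate parameters stand in for the repository, snapshot, and logging calls, and statuses are plain strings for brevity.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of "commit terminal status, then best-effort side effects".
public static class TerminalStatusSketch
{
    public static async Task<string> CompleteAsync(
        bool siteSucceeded,
        Func<string, Task> commitStatus,   // stands in for update + save of the record
        Func<Task> applySideEffects,       // stands in for instance state + snapshot writes
        Action<Exception> logError)
    {
        var status = siteSucceeded ? "Success" : "Failed";

        // The outcome is made durable before anything else can fail.
        await commitStatus(status);
        if (!siteSucceeded) return status;

        try
        {
            await applySideEffects();
        }
        catch (Exception ex)
        {
            // Logged only; the committed status is not flipped back to Failed.
            logError(ex);
        }

        return status;
    }
}
```

With this shape, a failing snapshot write leaves a `Success` record plus an error log entry, which matches the behaviour described for step 8.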
### Pre-deploy site reconciliation
`TryReconcileWithSiteAsync` is invoked only when a prior deployment record exists and `ShouldQuerySiteBeforeRedeploy` returns true:
```csharp
private static bool ShouldQuerySiteBeforeRedeploy(DeploymentRecord prior) =>
    prior.Status == DeploymentStatus.InProgress
    || (prior.Status == DeploymentStatus.Failed
        && prior.ErrorMessage != null
        && prior.ErrorMessage.StartsWith(TimeoutFailurePrefix, StringComparison.Ordinal));
```
If the site responds that it is running the target `RevisionHash`, the stale prior record is updated to `Success` (with the hash corrected to the target), `ApplyPostSuccessSideEffectsAsync` runs with `forceEnabledState: false` to avoid undoing an intentional disable, and the caller receives the reconciled record. A query failure falls through to a normal deploy; the site's own stale-rejection logic is the safety net.
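
The decision described above reduces to a small function: query the site, reconcile on a hash match, and fall through to a normal deploy on a mismatch or a failed query. The sketch below is a simplification under stated assumptions: `ReconcileSketch` and `ReconcileOutcome` are hypothetical, the site query is modelled as a delegate returning the applied hash, and catching every exception is a choice made for the example.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

public enum ReconcileOutcome { Reconciled, SendNewDeployment }

// Hypothetical sketch of the reconcile-or-redeploy decision.
public static class ReconcileSketch
{
    public static async Task<ReconcileOutcome> TryReconcileAsync(
        Func<Task<string?>> querySiteRevisionHash, string targetRevisionHash)
    {
        string? siteHash;
        try
        {
            siteHash = await querySiteRevisionHash();
        }
        catch (Exception)
        {
            // Query failed (for example, site unreachable): fall through to a
            // normal deploy and rely on the site's own stale-rejection logic.
            return ReconcileOutcome.SendNewDeployment;
        }

        return string.Equals(siteHash, targetRevisionHash, StringComparison.Ordinal)
            ? ReconcileOutcome.Reconciled
            : ReconcileOutcome.SendNewDeployment;
    }
}
```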
### Deployed config snapshot and diff
`DeployedConfigSnapshot` is a one-per-instance row that stores the `DeploymentId`, `RevisionHash`, and the full `FlattenedConfiguration` JSON as of the last confirmed success. `DeploymentService.GetDeploymentComparisonAsync` re-flattens the current template state, compares the hash, and feeds both configs to `DiffService.ComputeDiff` if the hashes differ, producing a `ConfigurationDiff` with added, removed, and changed attributes, alarms, scripts, and connection bindings.
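
To illustrate the added/removed/changed classification, the sketch below diffs two flat name-to-value maps. It is not `DiffService.ComputeDiff`: the real `ConfigurationDiff` covers attributes, alarms, scripts, and connection bindings, whereas `DiffSketch` and `DiffSketchResult` are hypothetical and treat every item as a string pair.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified result shape.
public sealed record DiffSketchResult(
    IReadOnlyList<string> Added,
    IReadOnlyList<string> Removed,
    IReadOnlyList<string> Changed);

public static class DiffSketch
{
    public static DiffSketchResult Compute(
        IReadOnlyDictionary<string, string> deployed,
        IReadOnlyDictionary<string, string> current)
    {
        // Present now, absent from the deployed snapshot.
        var added = current.Keys
            .Where(key => !deployed.ContainsKey(key))
            .OrderBy(key => key, StringComparer.Ordinal)
            .ToList();

        // Present in the deployed snapshot, absent now.
        var removed = deployed.Keys
            .Where(key => !current.ContainsKey(key))
            .OrderBy(key => key, StringComparer.Ordinal)
            .ToList();

        // Present on both sides with a different value.
        var changed = current
            .Where(pair => deployed.TryGetValue(pair.Key, out var old)
                           && !string.Equals(old, pair.Value, StringComparison.Ordinal))
            .Select(pair => pair.Key)
            .OrderBy(key => key, StringComparer.Ordinal)
            .ToList();

        return new DiffSketchResult(added, removed, changed);
    }
}
```

Sorting the keys keeps the output stable between runs, which is convenient for a diff view and for tests.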
### Artifact deployment
`ArtifactDeploymentService.DeployToAllSitesAsync` deploys the full system-wide artifact set to every site in parallel. It fetches system-wide artifacts (shared scripts, external systems with serialised methods, database connections, notification lists, SMTP configurations) once via `FetchGlobalArtifactsAsync` before the per-site loop, so the same data is not re-queried for each of the N sites. Per-site data connections are fetched inside each per-site command build because they legitimately vary per site.
All per-site `DeployArtifactsCommand` messages share one `DeploymentId` so the audit log, UI summary, and persisted `SystemArtifactDeploymentRecord` all reference the same logical deployment. Each site runs under a `cts.CancelAfter(ArtifactDeploymentTimeoutPerSite)` linked source. Successful sites are never rolled back on other failures; individual failed sites are retryable via `RetryForSiteAsync`.
```csharp
// ArtifactDeploymentService — parallel per-site dispatch
var tasks = sites.Select(async site =>
{
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    cts.CancelAfter(_options.ArtifactDeploymentTimeoutPerSite);
    var command = siteCommands[site.Id];
    var response = await _communicationService.DeployArtifactsAsync(
        site.SiteIdentifier, command, cts.Token);
    return new SiteArtifactResult(
        site.SiteIdentifier, site.Name, response.Success, response.ErrorMessage);
}).ToList();
```
```
Cross-site artifact version skew is supported by design: a site that missed an artifact deployment continues operating with its current versions until an operator retries.
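
The failure isolation described in this section (one slow or failing site never affects the others) can be sketched as below. This is a hedged illustration, not the service's code: `ParallelDispatchSketch`, `SiteResultSketch`, and the `sendToSite` delegate are hypothetical, and converting exceptions into failed results inside each task is an assumption about how isolation could be achieved.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical per-site result; the real SiteArtifactResult carries more fields.
public sealed record SiteResultSketch(string SiteIdentifier, bool Success, string? ErrorMessage);

public static class ParallelDispatchSketch
{
    public static async Task<IReadOnlyList<SiteResultSketch>> DeployToAllAsync(
        IEnumerable<string> siteIdentifiers,
        Func<string, CancellationToken, Task> sendToSite,
        TimeSpan perSiteTimeout,
        CancellationToken cancellationToken = default)
    {
        var tasks = siteIdentifiers.Select(async site =>
        {
            // Each site gets its own deadline, linked to the caller's token.
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            cts.CancelAfter(perSiteTimeout);
            try
            {
                await sendToSite(site, cts.Token);
                return new SiteResultSketch(site, true, null);
            }
            catch (OperationCanceledException)
            {
                return new SiteResultSketch(site, false, "Timed out or cancelled");
            }
            catch (Exception ex)
            {
                // A failing site becomes a failed result instead of faulting WhenAll.
                return new SiteResultSketch(site, false, ex.Message);
            }
        }).ToList();

        // Results come back in the same order as the input sites.
        return await Task.WhenAll(tasks);
    }
}
```

Because every task completes with a result rather than an exception, the caller always receives one entry per site and can offer a retry for exactly the failed ones.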
### Status notification
`DeploymentStatusNotifier` is a DI singleton that exposes `event Action<DeploymentStatusChange>? StatusChanged`. `DeploymentService` calls `NotifyStatusChanged` at every point a `DeploymentRecord` status is written. The Central UI's deployment page subscribes at render time and re-renders over its Blazor Server SignalR circuit without polling. Each subscriber is invoked individually inside a try/catch so a disposed Blazor circuit cannot break the deployment pipeline.
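
Invoking each subscriber individually, as described above, is typically done by walking the delegate's invocation list. The sketch below shows that pattern with hypothetical types (`StatusNotifierSketch`, `StatusChangeSketch`); the real notifier presumably logs the swallowed exception, which is omitted here.

```csharp
#nullable enable
using System;

// Hypothetical payload; the real DeploymentStatusChange may carry more fields.
public sealed record StatusChangeSketch(string DeploymentId, string Status);

public sealed class StatusNotifierSketch
{
    public event Action<StatusChangeSketch>? StatusChanged;

    public void NotifyStatusChanged(StatusChangeSketch change)
    {
        // Snapshot the delegate so concurrent unsubscribes cannot null it mid-call.
        var handlers = StatusChanged;
        if (handlers is null) return;

        foreach (Action<StatusChangeSketch> handler in handlers.GetInvocationList())
        {
            try
            {
                handler(change);
            }
            catch (Exception)
            {
                // One faulty subscriber (for example, a disposed UI circuit)
                // must not stop the remaining subscribers or the publisher.
            }
        }
    }
}
```

Calling the multicast delegate directly would stop at the first throwing handler, which is why the invocation list is iterated by hand.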
## Usage
`DeploymentService` and `ArtifactDeploymentService` are scoped services, typically resolved by `ManagementService` actor handlers (triggered by `MgmtDeployArtifactsCommand`, `GetDeploymentDiffCommand`, and the instance lifecycle commands) or directly by Central UI Blazor components. Engineers interact through the Central UI; automated bulk operations (deploy all stale instances) decompose into individual `DeployInstanceAsync` calls.
Lifecycle commands (`DisableInstanceAsync`, `EnableInstanceAsync`, `DeleteInstanceAsync`) follow the same lock-then-command pattern as deploy, with `LifecycleCommandTimeout` applied as a linked `CancellationTokenSource` deadline:
```csharp
// DeploymentService — lifecycle command pattern (disable shown)
using var lockHandle = await _lockManager.AcquireAsync(
    instance.UniqueName, _options.OperationLockTimeout, cancellationToken);
using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
cts.CancelAfter(_options.LifecycleCommandTimeout);
response = await _communicationService.DisableInstanceAsync(siteId, command, cts.Token);
```
A timeout on a lifecycle command writes a `DisableTimedOut` / `EnableTimedOut` / `DeleteTimedOut` audit entry via `TryLogLifecycleTimeoutAsync` using `CancellationToken.None`, mirroring the `DeployFailed` audit pattern. The `Instance` state in the central DB is updated only after the site confirms success; a timeout leaves the DB state unchanged.
Delete is stricter than disable/enable: if the site confirms but the central `DeleteInstanceAsync` repository call subsequently fails, the instance record is orphaned. The service logs at `Error`, records a `DeleteOrphaned` audit entry, and returns a descriptive failure so an operator can reconcile — it does not retry automatically.
## Configuration
Options are registered via `AddDeploymentManager` and bound from `ScadaBridge:DeploymentManager`.
| Key | Default | Description |
|-----|---------|-------------|
| `OperationLockTimeout` | `00:00:05` | Maximum wait for the per-instance operation lock before throwing `TimeoutException`. |
| `LifecycleCommandTimeout` | `00:00:30` | Maximum round-trip for a disable, enable, or delete command before the operation is declared timed out. |
| `ArtifactDeploymentTimeoutPerSite` | `00:02:00` | Per-site deadline for a `DeployArtifactsCommand` response. Sites exceeding this are recorded as failed; others are unaffected. |
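
For orientation, an options class with the defaults from the table might be shaped as follows. The property names come from the table; the class name `DeploymentManagerOptionsSketch` is hypothetical and the real `DeploymentManagerOptions` may differ in detail.

```csharp
using System;

// Hypothetical mirror of the options table above.
public sealed class DeploymentManagerOptionsSketch
{
    // Maximum wait for the per-instance operation lock.
    public TimeSpan OperationLockTimeout { get; set; } = TimeSpan.FromSeconds(5);

    // Maximum round-trip for a disable, enable, or delete command.
    public TimeSpan LifecycleCommandTimeout { get; set; } = TimeSpan.FromSeconds(30);

    // Per-site deadline for a DeployArtifactsCommand response.
    public TimeSpan ArtifactDeploymentTimeoutPerSite { get; set; } = TimeSpan.FromMinutes(2);
}
```

The `hh:mm:ss` strings in the table are the standard `TimeSpan` text form, so `TimeSpan.Parse("00:02:00")` yields the same value as `TimeSpan.FromMinutes(2)`.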
## Dependencies & Interactions
- [Template Engine (#1)](./TemplateEngine.md) — `FlatteningPipeline` delegates to `FlatteningService`, `ValidationService`, and `RevisionHashService`. Template state is captured at flatten time; last-write-wins edits made after flatten do not affect the in-flight deployment. `DiffService.ComputeDiff` powers the deployment diff view.
- [Configuration Database (#17)](./ConfigurationDatabase.md) — owns the EF Core implementation of `IDeploymentManagerRepository`, which stores `DeploymentRecord`, `DeployedConfigSnapshot`, and `SystemArtifactDeploymentRecord`. `IAuditService` (also registered by the Configuration Database component) writes all deployment audit rows.
- [Central-Site Communication (#5)](./Communication.md) — `CommunicationService` provides `DeployInstanceAsync`, `QueryDeploymentStateAsync`, `DeployArtifactsAsync`, `DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync`. The communication layer routes by `SiteIdentifier` (string), not DB id; `DeploymentService.ResolveSiteIdentifierAsync` resolves the numeric `SiteId` before each cross-cluster call and treats a missing site row as a hard failure.
- [Commons (#16)](./Commons.md) — owns `DeploymentRecord`, `DeployedConfigSnapshot`, `SystemArtifactDeploymentRecord`, `DeploymentStatus`, `InstanceState`, `DeployInstanceCommand`, `DeployArtifactsCommand`, `DeploymentStateQueryRequest/Response`, `InstanceLifecycleResponse`, and the `IDeploymentManagerRepository` interface.
- [Site Runtime (#3)](./SiteRuntime.md) — receives `DeployInstanceCommand` and `DeployArtifactsCommand` via the Communication Layer. Site-side apply is all-or-nothing per instance: the Deployment Manager singleton at the site stores the config, compiles all scripts, and creates or replaces the Instance Actor as a unit. A failure at any step is reported back with the specific error message and the previous configuration remains active.
- [Central UI (#9)](./CentralUI.md) — engineers trigger deployments, view diffs, manage instance lifecycle, and deploy system-wide artifacts through the UI. The deployment status page subscribes to `IDeploymentStatusNotifier.StatusChanged` for real-time push updates via Blazor Server SignalR.
- [Management Service (#18)](./ManagementService.md) — the actor-layer entry point for deployment commands received over ClusterClient. It resolves `DeploymentService` and `ArtifactDeploymentService` from a per-message DI scope and forwards `MgmtDeployArtifactsCommand`, `GetDeploymentDiffCommand`, and instance lifecycle requests.
- [Security & Auth (#10)](./Security.md) — the Deployment role is required for all deploy and artifact operations; site-scoped permissions are enforced by the Central UI and Management Service before commands reach `DeploymentService`.
## Troubleshooting
### An instance is stuck InProgress after a central failover
The operation lock is in-memory. On failover the new active node has no lock entry, and the deployment record remains `InProgress`. When the engineer issues a re-deploy, `TryReconcileWithSiteAsync` queries the site; if the site already applied the config the record is updated to `Success` without re-sending. If the site did not apply it, a new deployment proceeds. No manual DB edits are required in the normal failover case.
### A deployment record shows Failed with "Communication failure:"
The site round-trip timed out or was cancelled before a response arrived. The site may or may not have applied the config. On the next deploy attempt the reconciliation query determines the ground truth. If the query also fails (site unreachable), a new `DeployInstanceCommand` is sent; the site rejects it with "already applied" if it ran the previous one.
### DeleteOrphaned audit entry
The site destroyed the Instance Actor but the central DB removal failed. The instance record exists in the central DB but has no corresponding site actor. It cannot be deleted through the normal UI path (the site will reject the delete command because the instance does not exist). Reconcile by removing the central record directly via the Management API or database, referencing the `CommandId` in the audit entry.
### Artifact deployment partially failed
`DeployToAllSitesAsync` returns an `ArtifactDeploymentSummary` with per-site `SiteArtifactResult`. Failed sites do not block or roll back successful ones. Use `RetryForSiteAsync` when the failed site is reachable again; it re-fetches all global artifacts and re-sends to the single site.
## Related Documentation
- [Deployment Manager design specification](../requirements/Component-DeploymentManager.md)
- [Template Engine](./TemplateEngine.md)
- [Site Runtime](./SiteRuntime.md)
- [Configuration Database](./ConfigurationDatabase.md)
- [Central-Site Communication](./Communication.md)
- [Commons](./Commons.md)
- [Central UI](./CentralUI.md)
- [Management Service](./ManagementService.md)
- [Security & Auth](./Security.md)