Final cross-bundle reviewer identified 7 inconsistencies that the per-bundle reviewers couldn't see; all fixed in one logical commit.

Critical:
- HighLevelReqs AL-3: drop 'then upsert-on-newer-status' — AuditLog is strictly append-only (correct for SiteCalls/Notifications, wrong for the immutable AuditLog shadow).
- Component-AuditLog Error rate KPI: align with HealthMonitoring's exclusion list (Success/Delivered/Enqueued) rather than just non-Success; otherwise every Delivered notification or Enqueued cached call would be counted as an error.

Important:
- Component-AuditLog line 154: ISiteAuditWriter -> IAuditWriter (canonical name per Commons and the rest of this doc).
- Component-AuditLog Central direct-write paragraph: convert remaining slash notation (ApiInbound/Completed, Notification/Attempt, Notification/Terminal) to dot notation used everywhere else.
- Component-ClusterInfrastructure: scope SiteCallAuditActor to reconciliation + KPIs + Retry/Discard relay; cached-telemetry ingest is AuditLogIngestActor's role per Combined Telemetry contract.
- Component-CentralUI Audit Log page: state the OperationalAudit read permission and the read-vs-export split (matching CLI doc).
- Component-NotificationOutbox: add never-fail-the-action invariant for dispatcher audit writes.

Minor:
- Component-InboundAPI: 'Non-blocking semantics' was ambiguous (could be read as async); reword to 'Fail-soft' — the write is still synchronous before flush, but failures are caught and don't change the response.
- Component-CLI: realign audit-query/audit-export flags to actually match the Central UI Audit Log filter set (channel, kind, status, site, instance, target, actor, correlation-id, errors-only); drop --user and --entity-id which are IAuditService concepts, not Audit Log columns.
- Component-AuditLog KPI tile names: 'Volume/Error rate/Backlog' -> 'Audit volume/Audit error rate/Audit backlog' (matches Central UI and Health Monitoring); drop the two orphan KPIs (Top inbound callers, Top outbound 5xx) that were never surfaced anywhere.
- Component-AuditLog Interactions: re-attribute DbOutbound emissions to ESG (where Database.* lives) with a note that Site Runtime is the API surface for scripts.
- HighLevelReqs AL-12: drop 'and reconciliation operations' (CLI has no reconcile command; reconciliation is an internal self-healing pull). Add note that verify-chain becomes operational once AL-11's hash chain ships.

174 lines
12 KiB
Markdown
# Component: Cluster Infrastructure

## Purpose

The Cluster Infrastructure component manages the Akka.NET cluster setup, active/standby node roles, failover detection, and the foundational runtime environment on which all other components run. It provides the base layer for both central and site clusters.

## Location

Both central and site clusters.

## Responsibilities

- Bootstrap the Akka.NET actor system on each node.
- Form a two-node cluster (active/standby) using Akka.NET Cluster.
- Manage leader election and role assignment (active vs. standby).
- Detect node failures and trigger failover.
- Provide the Akka.NET remoting infrastructure for inter-cluster communication.
- Support cluster singleton hosting (used by the Site Runtime Deployment Manager singleton on site clusters).
- Manage Windows service lifecycle (start, stop, restart) on each node.

## Implementation Note — Code Placement

This component is a **design responsibility**, not a single buildable project that
contains all of the code. The cluster-infrastructure responsibilities above are
realised across two projects:

- **`src/ScadaLink.ClusterInfrastructure`** owns the cluster **configuration model**:
  the `ClusterOptions` POCO (seed nodes, roles, remoting/gRPC ports, failure-detection
  timings, split-brain settings) bound from `appsettings.json` via the Options pattern.
- **`src/ScadaLink.Host`** owns the cluster **bootstrap and runtime wiring**: it
  builds the Akka.NET HOCON from `ClusterOptions`, starts the `ActorSystem`,
  configures the keep-oldest split-brain resolver (`down-if-alone = on`), wires
  `CoordinatedShutdown` into the service lifecycle, and provides active-node /
  cluster-membership health checks. See `Component-Host.md` (REQ-HOST-*) for detail.

This split is deliberate — the Host is the single deployable binary and the only
project that performs Akka.NET bootstrap, so the cluster bring-up lives there
alongside role-based component registration. The `ClusterInfrastructure` project
remains the home of the configuration contract that the Host consumes.

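
As a rough illustration of how a configuration contract like `ClusterOptions` could be turned into Akka.NET HOCON, the Python sketch below models the options as a dataclass and renders a config string. The field names, the `render_hocon` helper, the `scadalink` system name, and the mapping of the 10-second failure-detection threshold onto `acceptable-heartbeat-pause` are assumptions made for this example. The real POCO and Host wiring are C# and may differ, and the HOCON keys should be checked against the Akka.NET version in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterOptions:
    """Hypothetical mirror of the ClusterOptions POCO (field names are assumptions)."""
    hostname: str
    remoting_port: int
    seed_nodes: List[str]
    roles: List[str]
    heartbeat_interval_s: int = 2
    failure_threshold_s: int = 10   # assumed to map to acceptable-heartbeat-pause
    stable_after_s: int = 15
    down_if_alone: bool = True
    min_nr_of_members: int = 1


def render_hocon(o: ClusterOptions) -> str:
    """Render the options as an Akka.NET-style HOCON string."""
    seeds = ", ".join(f'"{s}"' for s in o.seed_nodes)
    roles = ", ".join(f'"{r}"' for r in o.roles)
    on_off = "on" if o.down_if_alone else "off"
    return f"""akka {{
  actor.provider = cluster
  remote.dot-netty.tcp {{
    hostname = "{o.hostname}"
    port = {o.remoting_port}
  }}
  cluster {{
    seed-nodes = [{seeds}]
    roles = [{roles}]
    min-nr-of-members = {o.min_nr_of_members}
    failure-detector {{
      heartbeat-interval = {o.heartbeat_interval_s}s
      acceptable-heartbeat-pause = {o.failure_threshold_s}s
    }}
    split-brain-resolver {{
      active-strategy = keep-oldest
      stable-after = {o.stable_after_s}s
      keep-oldest.down-if-alone = {on_off}
    }}
  }}
}}"""


if __name__ == "__main__":
    options = ClusterOptions(
        hostname="central-a",
        remoting_port=8081,
        seed_nodes=["akka.tcp://scadalink@central-a:8081",
                    "akka.tcp://scadalink@central-b:8081"],
        roles=["Central"],
    )
    print(render_hocon(options))
```

Keeping the options as plain data and rendering the HOCON in one place reflects the split described above: one project defines the contract, the other consumes it at bootstrap.
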
## Cluster Topology

### Central Cluster
- Two nodes forming an Akka.NET cluster.
- One active node runs all central components (Template Engine, Deployment Manager, Central UI, etc.).
- One standby node is ready to take over on failover.
- Connected to MS SQL databases (Config DB, Machine Data DB).

### Site Cluster (per site)
- Two nodes forming an Akka.NET cluster.
- One active node runs all site components (Site Runtime, Data Connection Layer, Store-and-Forward Engine, etc.).
- The Site Runtime Deployment Manager runs as an **Akka.NET cluster singleton** on the active node, owning the full Instance Actor hierarchy.
- One standby node receives replicated store-and-forward data and is ready to take over.
- Connected to local SQLite databases (store-and-forward buffer, event logs, deployed configurations).
- Connected to machines via data connections (OPC UA).

## Cluster Singletons

Akka.NET cluster singletons run on the active node of their cluster and migrate on failover. Each singleton listed here is owned by the named component; this component (Cluster Infrastructure) provides only the hosting, supervision, and active-node placement guarantee.

### Central singletons (active central node)

- **`NotificationOutboxActor`** — owned by Notification Outbox (#21). Drives the central notification dispatch loop against the `Notifications` table.
- **`SiteCallAuditActor`** — owned by Site Call Audit (#22). Owns the operational `SiteCalls` table: drives periodic reconciliation pulls for `CachedCall` / `CachedWrite` lifecycle, computes KPIs, and relays operator Retry/Discard actions to the owning site. Note: ingest of cached-call telemetry is performed by `AuditLogIngestActor` (#23) in one transaction with the immutable `AuditLog` insert — see Component-AuditLog.md, Cached Operations — Combined Telemetry.
- **`AuditLogIngestActor`** — owned by Audit Log (#23). Receives gRPC telemetry batches of `AuditEvent` rows from sites and performs insert-if-not-exists on `EventId` against the central `AuditLog` table. For cached-call telemetry (which carries both audit-row content and operational-state fields in a single packet), the ingest performs the `AuditLog` insert and the `SiteCalls` upsert in **one transaction** — see Component-AuditLog.md for the combined-telemetry contract.
- **`SiteAuditReconciliationActor`** — owned by Audit Log (#23). Periodic per-site pull (default every 5 minutes) that self-heals missed audit telemetry by asking each site for its oldest `ForwardState = 'Pending'` row and issuing a `PullAuditEvents(sinceUtc, batchSize)` when a non-draining backlog is detected.
- **`AuditLogPurgeActor`** — owned by Audit Log (#23). Daily partition-switch purge against `ps_AuditLog_Month`; switches out any partition older than `AuditLog:RetentionDays` and emits an `AuditLog:Purged` event. Also rolls the partition scheme forward each month so the next month's partition exists ahead of time.

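
The combined ingest performed by `AuditLogIngestActor` (insert-if-not-exists on `EventId`, plus the `SiteCalls` upsert, in one transaction) can be sketched as follows. This is a minimal illustration that uses an in-memory SQLite database as a stand-in for the central MS SQL database. The column names other than `EventId`, the `CallId` key, the event dictionary shape, and the "only overwrite with a newer timestamp" rule on `SiteCalls` are assumptions for the example, not the documented schema.

```python
import sqlite3

# Hypothetical, heavily reduced schema; the real tables live in MS SQL.
SCHEMA = """
CREATE TABLE AuditLog (
    EventId TEXT PRIMARY KEY,
    Kind    TEXT NOT NULL,
    Status  TEXT NOT NULL
);
CREATE TABLE SiteCalls (
    CallId     TEXT PRIMARY KEY,
    Status     TEXT NOT NULL,
    UpdatedUtc TEXT NOT NULL
);
"""


def ingest_cached_call(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply one cached-call telemetry event atomically.

    Returns True if the audit row was new, False if it was a duplicate.
    """
    with conn:  # one transaction: commit on success, roll back on any exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO AuditLog (EventId, Kind, Status) VALUES (?, ?, ?)",
            (event["event_id"], event["kind"], event["status"]),
        )
        inserted = cur.rowcount == 1
        # Upsert the operational row; ISO-8601 UTC strings compare chronologically.
        conn.execute(
            """INSERT INTO SiteCalls (CallId, Status, UpdatedUtc) VALUES (?, ?, ?)
               ON CONFLICT(CallId) DO UPDATE SET
                   Status = excluded.Status,
                   UpdatedUtc = excluded.UpdatedUtc
               WHERE excluded.UpdatedUtc > SiteCalls.UpdatedUtc""",
            (event["call_id"], event["status"], event["occurred_utc"]),
        )
    return inserted


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    evt = {"event_id": "e1", "call_id": "c1", "kind": "CachedCall",
           "status": "Enqueued", "occurred_utc": "2024-01-01T00:00:00Z"}
    print(ingest_cached_call(db, evt))  # first delivery
    print(ingest_cached_call(db, evt))  # redelivery of the same batch
```

Because the audit insert is keyed on `EventId`, a site can safely resend a batch after a dropped acknowledgement; the duplicate is ignored and the append-only audit table is left unchanged.
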
### Site singletons (active site node, per site cluster)

- **Site Runtime Deployment Manager** — owned by Site Runtime (#3). Owns the full Instance Actor hierarchy; re-creates it on failover from local SQLite.
- **`SiteAuditTelemetryActor`** — owned by Audit Log (#23). Drains the local site `AuditLog` SQLite's `ForwardState = 'Pending'` rows to central in batches via the existing `SiteStream` gRPC channel; cadence is short (default 5 s) when the queue is non-empty and longer (default 30 s) when idle. Runs on a **dedicated dispatcher** so it does not compete with the script blocking-I/O dispatcher (per Component-AuditLog.md, Ingestion Paths → Telemetry forward).

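
The two-speed drain cadence described for `SiteAuditTelemetryActor` can be sketched as a single tick function that forwards one batch and returns the delay until the next tick. This is an illustration only: the `drain_once` name, the `send` callback, the batch size of 100, and the `'Forwarded'` state value are assumptions; only `ForwardState = 'Pending'` and the 5 s / 30 s defaults come from the text above.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def drain_once(
    rows: List[Row],
    send: Callable[[List[Row]], bool],
    batch_size: int = 100,
    busy_s: float = 5.0,
    idle_s: float = 30.0,
) -> float:
    """Forward one batch of Pending rows and return seconds until the next tick."""
    batch = [r for r in rows if r["ForwardState"] == "Pending"][:batch_size]
    # Rows are only marked as forwarded once the send is acknowledged.
    if batch and send(batch):
        for r in batch:
            r["ForwardState"] = "Forwarded"  # hypothetical state name
    still_pending = any(r["ForwardState"] == "Pending" for r in rows)
    return busy_s if still_pending else idle_s


if __name__ == "__main__":
    queue = [{"EventId": f"e{i}", "ForwardState": "Pending"} for i in range(3)]
    print(drain_once(queue, send=lambda b: True, batch_size=2))  # backlog remains
    print(drain_once(queue, send=lambda b: True, batch_size=2))  # queue drained
```

A failed send leaves the rows in `Pending`, so the next tick stays on the short interval and retries the same batch.
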
## Failover Behavior

### Detection
- Akka.NET Cluster monitors node health via heartbeat.
- If the active node becomes unreachable, the standby node detects the failure and promotes itself to active.

### Central Failover
- The new active node takes over all central responsibilities.
- In-progress deployments are treated as **failed** — engineers must retry.
- The UI session may be interrupted — users reconnect to the new active node.
- No message buffering at central — no state to recover beyond what's in MS SQL.

### Site Failover
- The new active node takes over:
  - The Deployment Manager singleton restarts and re-creates the full Instance Actor hierarchy by reading deployed configurations from local SQLite. Each Instance Actor spawns its child Script and Alarm Actors.
  - Data collection (Data Connection Layer re-establishes subscriptions as Instance Actors register their data source references).
  - Store-and-forward delivery (buffer is already replicated locally).
- Active debug view streams from central are interrupted — the engineer must re-open them.
- Health reporting resumes from the new active node.
- Alarm states are re-evaluated from incoming values (alarm state is in-memory only).

## Split-Brain Resolution

The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:

- On a network partition, the node that has been in the cluster longest remains active. The younger node downs itself.
- **Stable-after duration**: 15 seconds. The cluster membership must remain stable (no changes) for 15 seconds before the resolver acts to down unreachable nodes. This prevents premature downing during startup or rolling restarts.
- **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership — at most one node runs the cluster singleton at any time.

## Single-Node Operation

`akka.cluster.min-nr-of-members` must be set to **1**. After failover, only one node is running. If set to 2, the surviving node waits for a second member before allowing the Cluster Singleton (Site Runtime Deployment Manager) to start — blocking all data collection and script execution indefinitely.

## Failure Detection Timing

Configurable defaults for heartbeat and failure detection:

| Setting | Default | Description |
|---------|---------|-------------|
| Heartbeat interval | 2 seconds | Frequency of health check messages between nodes |
| Failure detection threshold | 10 seconds | Time without heartbeat before a node is considered unreachable |
| Stable-after (split-brain) | 15 seconds | Time cluster must be stable before resolver acts |
| **Total failover time** | **~25 seconds** | Detection (10s) + stable-after (15s) + singleton restart |

These values balance failover speed with stability — fast enough that data collection gaps are small, tolerant enough that brief network hiccups don't trigger unnecessary failovers.

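
The arithmetic behind the total in the table can be written out as a small helper, which is handy when evaluating alternative settings. This is a sketch using the defaults listed above; the function names are hypothetical, and the singleton restart time is left as a parameter because the table does not quantify it.

```python
def failover_budget_s(
    detection_s: float = 10.0,
    stable_after_s: float = 15.0,
    singleton_restart_s: float = 0.0,
) -> float:
    """Approximate seconds from losing the active node to the singleton running again."""
    return detection_s + stable_after_s + singleton_restart_s


def missed_heartbeats(detection_s: float = 10.0, heartbeat_interval_s: float = 2.0) -> int:
    """Number of whole heartbeat intervals that fit inside the detection threshold."""
    return int(detection_s // heartbeat_interval_s)


if __name__ == "__main__":
    print(failover_budget_s())   # 25.0 with the documented defaults
    print(missed_heartbeats())   # 5 consecutive missed heartbeats
```

With the defaults, a node is declared unreachable after roughly five missed heartbeats, and the resolver acts 15 seconds later; any singleton restart time is added on top of the 25 seconds.
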
## Dual-Node Recovery

If both nodes in a cluster fail simultaneously (e.g., site power outage):

1. **No manual intervention required.** Since both nodes are configured as seed nodes, whichever node starts first forms a new cluster. The second node joins when it starts.
2. **State recovery** (each node has its own local copy of all required data):
   - **Site clusters**: The Deployment Manager singleton reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy. Store-and-forward buffers are already persisted locally. Alarm states re-evaluate from incoming data values.
   - **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.

## Graceful Shutdown & Singleton Handover

When a node is stopped for planned maintenance (Windows Service stop), `CoordinatedShutdown` triggers a **graceful leave** from the cluster. This enables the Cluster Singleton (Site Runtime Deployment Manager) to hand over to the other node in seconds (limited by the hand-over retry interval) rather than waiting out the full unplanned-failover window (~25 seconds: failure detection plus stable-after).

Configuration required:
- `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`
- `akka.cluster.run-coordinated-shutdown-when-down = on`

The Host component wires `CoordinatedShutdown` into the Windows Service lifecycle (see REQ-HOST-9).

## Node Configuration

Each node is configured with:
- **Cluster seed nodes**: **Both nodes** are seed nodes — each node lists both itself and its partner. Either node can start first and form the cluster; the other joins when it starts. No startup ordering dependency.
- **Cluster role**: Central or Site (plus site identifier for site clusters).
- **Akka.NET remoting**: Hostname/port for inter-node and inter-cluster communication (default 8081 central, 8082 site).
- **gRPC port** (site nodes only): Dedicated HTTP/2 port for the SiteStreamGrpcServer (default 8083). Separate from the Akka remoting port — gRPC uses Kestrel, Akka uses its own TCP transport.
- **Local storage paths**: SQLite database locations (site nodes only).

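
To make the per-node settings concrete, the sketch below assembles them for one node of a two-node cluster. It is illustrative only: the `node_config` helper, the dictionary keys, the `scadalink` system name, and the host names are hypothetical. Listing the node itself first in the seed list is also an assumption of this sketch, chosen because Akka only lets the first-listed seed node form a new cluster on its own.

```python
from typing import Dict, Optional

DEFAULT_REMOTING_PORT = {"Central": 8081, "Site": 8082}  # defaults from the list above
DEFAULT_GRPC_PORT = 8083                                 # site nodes only


def node_config(
    system: str,
    role: str,
    self_host: str,
    partner_host: str,
    site_id: Optional[str] = None,
) -> Dict[str, object]:
    """Build the settings for one node of a two-node cluster."""
    if role not in DEFAULT_REMOTING_PORT:
        raise ValueError(f"unknown role: {role}")
    if role == "Site" and not site_id:
        raise ValueError("site clusters need a site identifier")
    port = DEFAULT_REMOTING_PORT[role]
    cfg: Dict[str, object] = {
        "role": role,
        "remoting_port": port,
        # Both nodes are seeds; this node lists itself first (assumption, see lead-in).
        "seed_nodes": [
            f"akka.tcp://{system}@{self_host}:{port}",
            f"akka.tcp://{system}@{partner_host}:{port}",
        ],
    }
    if role == "Site":
        cfg["site_id"] = site_id
        cfg["grpc_port"] = DEFAULT_GRPC_PORT
    return cfg


if __name__ == "__main__":
    a = node_config("scadalink", "Site", "site1-a", "site1-b", site_id="site1")
    b = node_config("scadalink", "Site", "site1-b", "site1-a", site_id="site1")
    print(a["seed_nodes"])
    print(sorted(a["seed_nodes"]) == sorted(b["seed_nodes"]))  # same two seeds on both nodes
```
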
## Windows Service

- Each node runs as a **Windows service** for automatic startup and recovery.
- Service configuration includes Akka.NET cluster settings and component-specific configuration.

## Platform

- **OS**: Windows Server.
- **Runtime**: .NET (Akka.NET).
- **Cluster**: Akka.NET Cluster (application-level, not Windows Server Failover Clustering).

## Dependencies

- **Akka.NET**: Core actor system, cluster, remoting, and cluster singleton libraries.
- **Windows**: Service hosting, networking.
- **MS SQL** (central only): Database connectivity.
- **SQLite** (sites only): Local storage.

## Interactions

- **All components**: Every component runs within the Akka.NET actor system managed by this infrastructure.
- **Site Runtime**: The Deployment Manager singleton relies on Akka.NET cluster singleton support provided by this infrastructure.
- **Communication Layer**: Built on top of the Akka.NET remoting provided here.
- **Health Monitoring**: Reports node status (active/standby) as a health metric.