Joseph Doherty fd618cf1dc fix(review): full code-review remediation — 5 High + Medium/Low across 16 modules

Remediation from the full per-module code review at 4307c381 (findings recorded
separately in code-reviews/).

Highs fixed:
- DeploymentManager-025/SiteRuntime-031: stop broadcasting notification lists + SMTP
  configs (incl. credentials) to sites; site purges already-persisted rows on apply
  (enforces the central-only delivery design; clears plaintext SMTP creds at rest).
- DataConnectionLayer-023: guard the native-alarm subscribe path against the
  mid-flight-unsubscribe adapter-feed leak (mirrors the DCL-021 tag-path fix).
- SiteEventLogging-024: normalize From/To query bounds to UTC (the -016 fix the
  audit trail claimed but never committed).
- KpiHistory-001: add an in-flight guard to the recorder sample tick.
- ScriptAnalysis-001: harden the trust analyzer's TPA-absent fallback (resolve
  forbidden anchors in the minimal reference set; warn on degraded mode) — anchors
  added to validation references only, never the compile gate.
(InboundAPI-026 left to the feat/ipsen-movein effort per owner decision.)

Medium/Low: DM-026 deterministic deploy-status tiebreaker; SR-027/028/029/030
native-alarm leak/phantom-active/delete-during-redeploy fixes; AL-013/014/016;
TE-024 (folder-mutation audit rows now persisted)/025; SF-025 gauge-provider
clear-on-stop; ESG-025/026; SEC-023/024/025; SCA-007/008/009; plus doc/test
accuracy COM-023/024, HOST-025/026, HM-024/025, NS-027/028.

Full-solution build 0 warnings; ~3560 tests across 18 touched suites green.

2026-06-20 17:55:12 -04:00


Component: Central–Site Communication

Purpose

The Communication component manages all messaging between the central cluster and site clusters. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, notification submission, and remote queries (parked messages, event logs). Two transports are used: Akka.NET ClusterClient for command/control messaging, and gRPC for real-time data streaming (attribute values, alarm states) plus the unary audit ingest and reconciliation RPCs on SiteStreamService.

Location

Both central and site clusters. Each side has communication actors that handle message routing.

Responsibilities

Resolve site addresses (Akka remoting and gRPC) from the configuration database and maintain a cached address map.
Establish and maintain cross-cluster connections using Akka.NET ClusterClient/ClusterClientReceptionist for command/control.
Establish and maintain per-site gRPC streaming connections for real-time data delivery (site→central).
Route messages between central and site clusters in a hub-and-spoke topology.
Broker requests from external systems (via central) to sites and return responses.
Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
Detect site connectivity status for health monitoring.
Host the SiteStreamGrpcServer on site nodes (Kestrel HTTP/2) to serve real-time event streams.
Manage per-site SiteStreamGrpcClient instances on central nodes via SiteStreamGrpcClientFactory.

Communication Patterns

1. Deployment (Central → Site)

Pattern: Request/Response.
Central sends a flattened configuration to a site.
Site Runtime receives the configuration, compiles scripts, creates/updates Instance Actors, and responds with success/failure.
No buffering at central — if the site is unreachable, the deployment fails immediately.

2. Instance Lifecycle Commands (Central → Site)

Pattern: Request/Response.
Central sends disable, enable, or delete commands for specific instances.
Site Runtime processes the command and responds with success/failure.
If the site is unreachable, the command fails immediately (no buffering).

3. System-Wide Artifact Deployment (Central → Site(s))

Pattern: Broadcast with per-site acknowledgment (deploy to all sites), or targeted to a single site (per-site deployment).
When shared scripts, external system definitions, database connections, or data connections are explicitly deployed, central sends them to the target site(s). (Notification lists and SMTP configuration are central-only and are not deployed to sites.)
Each site acknowledges receipt and reports success/failure independently.
Shared script deployment triggers immediate recompilation on the site — the site's SharedScriptLibrary replaces its in-memory compiled code, making updated shared scripts available to all running instances without redeployment. Other artifact types (external systems, database connections, etc.) are stored but do not require recompilation.

4. Integration Routing (External System → Central → Site → Central → External System)

Pattern: Request/Response (brokered).
External system sends a request to central (e.g., MES requests machine values).
Central routes the request to the appropriate site.
Site reads values from the Instance Actor and responds.
Central returns the response to the external system.

5. Recipe/Command Delivery (External System → Central → Site)

Pattern: Fire-and-forget with acknowledgment.
External system sends a command to central (e.g., recipe manager sends recipe).
Central routes to the site.
Site applies and acknowledges.

6. Debug Streaming (Site → Central)

Pattern: Subscribe/push with initial snapshot. Two transports: ClusterClient for the subscribe/unsubscribe handshake and initial snapshot, gRPC server-streaming for ongoing real-time events.
A DebugStreamBridgeActor (one per active debug session) is created on the central cluster by the DebugStreamService. The bridge actor first opens a gRPC server-streaming subscription to the site via SiteStreamGrpcClient, then sends a SubscribeDebugViewRequest to the site via CentralCommunicationActor (ClusterClient). The site's InstanceActor replies with an initial snapshot via the ClusterClient reply path.
gRPC stream (real-time events): The site's SiteStreamGrpcServer receives the gRPC SubscribeInstance call and creates a StreamRelayActor that subscribes to SiteStreamManager for the requested instance. Events (AttributeValueChanged, AlarmStateChanged) flow from SiteStreamManager → StreamRelayActor → Channel<SiteStreamEvent> (bounded, 1000, DropOldest) → gRPC response stream → SiteStreamGrpcClient on central → DebugStreamBridgeActor.
The DebugStreamEvent message type no longer exists — events are not routed through ClusterClient. SiteCommunicationActor and CentralCommunicationActor have no role in streaming event delivery.
The bridge actor forwards received events to the consumer via callbacks (Blazor component or SignalR hub).
Snapshot-to-stream handoff: The gRPC stream is opened before the snapshot request to avoid missing events. The consumer applies the snapshot as baseline, then replays buffered gRPC events with timestamps newer than the snapshot (timestamp-based dedup).
Attribute value stream messages: [InstanceUniqueName].[AttributePath].[AttributeName], value, quality, timestamp.
Alarm state stream messages: [InstanceUniqueName].[AlarmName], state (active/normal), priority, timestamp.
Central sends an unsubscribe request via ClusterClient when the debug session ends. The gRPC stream is cancelled. The site's StreamRelayActor is stopped and the SiteStreamManager subscription is removed.
The stream is session-based and temporary.
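
The snapshot-to-stream handoff described above can be sketched as follows. This is a minimal, language-neutral illustration in Python rather than the component's .NET code. It assumes the snapshot carries a single timestamp, and the `StreamEvent` type, the `apply_snapshot_then_replay` function, and the `Pump01.Flow.Value` key are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StreamEvent:
    key: str            # e.g. "Pump01.Flow.Value" (hypothetical attribute key)
    value: object
    timestamp: datetime


def apply_snapshot_then_replay(snapshot_values, snapshot_time, buffered_events):
    """Use the snapshot as the baseline, then replay only newer buffered events."""
    state = dict(snapshot_values)
    for event in sorted(buffered_events, key=lambda e: e.timestamp):
        if event.timestamp > snapshot_time:  # timestamp-based dedup
            state[event.key] = event.value
    return state


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
buffered = [
    # Arrived on the stream before the snapshot was taken: already reflected in it.
    StreamEvent("Pump01.Flow.Value", 10.0, t0 - timedelta(seconds=1)),
    # Arrived after the snapshot: must be applied on top of the baseline.
    StreamEvent("Pump01.Flow.Value", 12.5, t0 + timedelta(seconds=1)),
]
state = apply_snapshot_then_replay({"Pump01.Flow.Value": 11.0}, t0, buffered)
print(state)
```

Because the stream is opened first, the buffer can contain events that predate the snapshot; the timestamp comparison is what keeps them from overwriting newer snapshot values.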

Site-Side gRPC Streaming Components

SiteStreamGrpcServer: gRPC service (SiteStreamService.SiteStreamServiceBase) hosted on each site node via Kestrel HTTP/2 on a dedicated port (default 8083). Implements the SubscribeInstance RPC. For each subscription, creates a StreamRelayActor that subscribes to SiteStreamManager, bridges events through a Channel<SiteStreamEvent> to the gRPC response stream. Tracks active subscriptions by correlation_id — duplicate IDs cancel the old stream. Enforces a max concurrent stream limit (default 100). Rejects streams with StatusCode.Unavailable before the actor system is ready.
StreamRelayActor: Short-lived actor created per gRPC subscription. Receives domain events (AttributeValueChanged, AlarmStateChanged) from SiteStreamManager, converts them to protobuf SiteStreamEvent messages, and writes to the Channel<SiteStreamEvent> writer. Stopped when the gRPC stream is cancelled or the client disconnects.
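
The subscription bookkeeping described above (duplicate `correlation_id` cancels the old stream, a cap on concurrent streams, and a bounded drop-oldest buffer per stream) can be modelled roughly as below. This is a sketch in Python, not the server's actual code. The class names are hypothetical, and raising `RuntimeError` at the limit is an assumption, since the text does not say how the real server rejects the call.

```python
from collections import deque


class Subscription:
    """One active stream, keyed by its correlation_id."""

    def __init__(self, correlation_id, buffer_size=1000):
        self.correlation_id = correlation_id
        self.cancelled = False
        # deque(maxlen=N) discards the oldest item when full,
        # which mirrors a bounded channel in DropOldest mode.
        self.buffer = deque(maxlen=buffer_size)

    def cancel(self):
        self.cancelled = True


class SubscriptionRegistry:
    def __init__(self, max_streams=100):
        self.max_streams = max_streams
        self.active = {}

    def subscribe(self, correlation_id, buffer_size=1000):
        old = self.active.pop(correlation_id, None)
        if old is not None:
            old.cancel()  # a duplicate ID replaces and cancels the old stream
        elif len(self.active) >= self.max_streams:
            raise RuntimeError("max concurrent streams reached")  # assumed behaviour
        sub = Subscription(correlation_id, buffer_size)
        self.active[correlation_id] = sub
        return sub


registry = SubscriptionRegistry(max_streams=2)
first = registry.subscribe("session-1", buffer_size=3)
second = registry.subscribe("session-1", buffer_size=3)  # same ID again
for n in range(5):
    second.buffer.append(n)  # the two oldest items are dropped
print(first.cancelled, len(registry.active), list(second.buffer))
```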

Central-Side Debug Stream Components

DebugStreamService: Singleton service that manages debug stream sessions. Resolves instance ID to unique name and site, creates and tears down DebugStreamBridgeActor instances, and provides a clean API for both Blazor components and the SignalR hub. Injects SiteStreamGrpcClientFactory for gRPC stream creation.
DebugStreamBridgeActor: One per active debug session. Opens a gRPC streaming subscription via SiteStreamGrpcClient and receives real-time events via callback. Also receives the initial DebugViewSnapshot via ClusterClient. Forwards all events to the consumer via callbacks. Handles gRPC stream errors with reconnection logic: tries the other site node endpoint, retries with backoff (max 3 retries), terminates the session if all retries fail.
SiteStreamGrpcClient: Per-site gRPC client that manages GrpcChannel instances and streaming subscriptions. Reads from the gRPC response stream in a background task, converts protobuf messages to domain events, and invokes the onEvent callback.
SiteStreamGrpcClientFactory: Caches per-site SiteStreamGrpcClient instances. Reads GrpcNodeAAddress / GrpcNodeBAddress from the Site entity (loaded by CentralCommunicationActor). Falls back to NodeB if NodeA connection fails. Disposes clients on site removal or address change.
DebugStreamHub: SignalR hub at /hubs/debug-stream for external consumers (e.g., CLI). Authenticates via Basic Auth + LDAP and requires the Deployment role. Server-to-client methods: OnSnapshot, OnAttributeChanged, OnAlarmChanged, OnStreamTerminated.

gRPC Proto Definition

The streaming protocol is defined in sitestream.proto (src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto):

Service: SiteStreamService, hosted on each site node by SiteStreamGrpcServer, exposes five RPCs: the original real-time server-streaming subscription and four unary request/response calls added by the Audit Log (#23) and Site Call Audit (#22) components. The unary calls are separate from the command/control ClusterClient channel, so gRPC on this service is no longer used only for real-time streaming:
- SubscribeInstance(InstanceStreamRequest) returns (stream SiteStreamEvent) — the real-time debug stream (§6); the only server-streaming RPC.
- IngestAuditEvents(AuditEventBatch) returns (IngestAck) — central-side ingest receiving surface for Audit Log (#23) telemetry; routes the batch to the central AuditLogIngestActor proxy and returns the accepted EventIds. (The production push path is still ClusterClient via ClusterClientSiteAuditClient; this RPC is the gRPC-receiving counterpart.)
- IngestCachedTelemetry(CachedTelemetryBatch) returns (IngestAck) — ingest receiving surface for the combined cached-call telemetry packet (audit row + SiteCalls operational upsert written in one transaction).
- PullAuditEvents(PullAuditEventsRequest) returns (PullAuditEventsResponse) — central→site reconciliation pull for the Audit Log self-heal feed; the site serves Pending/Forwarded rows from its ISiteAuditQueue.
- PullSiteCalls(PullSiteCallsRequest) returns (PullSiteCallsResponse) — central→site reconciliation pull for the Site Call Audit (#22) self-heal feed; the site serves operation-tracking rows changed since a cursor from its IOperationTrackingStore. A separate RPC from PullAuditEvents because the tracking store is the operational source of truth, distinct from the site audit queue.
Messages: InstanceStreamRequest (correlation_id, instance_unique_name), SiteStreamEvent (correlation_id, oneof event: AttributeValueUpdate, AlarmStateUpdate); AuditEventDto/AuditEventBatch/IngestAck for ingest; CachedTelemetryPacket/CachedTelemetryBatch (each packet pairing an AuditEventDto with a SiteCallOperationalDto); PullAuditEventsRequest/PullAuditEventsResponse and PullSiteCallsRequest/PullSiteCallsResponse (each request carries since_utc + batch_size; each response carries more_available to signal a saturated batch).
The oneof event pattern is extensible — future event types (health metrics, connection state changes) are added as new fields without breaking existing consumers.
Proto field numbers are never reused; new RPCs and message fields are appended additively. Old clients ignore unknown oneof variants.
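
To illustrate why an additive `oneof` is safe for older consumers, the sketch below models a consumer that dispatches on the variant name and skips variants it does not know. It is a Python illustration only: decoded events are represented as plain dictionaries, and `HealthMetricUpdate` is a hypothetical future variant, not one defined in sitestream.proto.

```python
def dispatch(event, handlers):
    """Route one decoded stream event by its oneof variant name.

    Returns False, without raising, when the variant is unknown to this consumer.
    """
    handler = handlers.get(event["variant"])
    if handler is None:
        return False
    handler(event["payload"])
    return True


seen = []
handlers = {
    "AttributeValueUpdate": lambda p: seen.append(("attribute", p)),
    "AlarmStateUpdate": lambda p: seen.append(("alarm", p)),
}

events = [
    {"variant": "AttributeValueUpdate", "payload": {"value": 42}},
    {"variant": "HealthMetricUpdate", "payload": {"cpu": 0.3}},  # hypothetical future variant
    {"variant": "AlarmStateUpdate", "payload": {"state": "active"}},
]
handled = [dispatch(e, handlers) for e in events]
print(handled)
```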

Enriched AlarmStateUpdate (Native Alarm Mirror)

AlarmStateUpdate carries the read-only native alarm mirror (Computed, native OPC UA, and native MxAccess Gateway alarms) to central over the existing gRPC real-time stream — no new transport, no command/control round-trip. The message was extended additively: existing fields 1–7 are unchanged, and fields 8–23 carry the enriched native-alarm state. Old clients that only read fields 1–7 continue to work; new fields are populated only where the source provides them.

| Field | # | Type | Meaning |
| --- | --- | --- | --- |
| `kind` | 8 | string | Alarm origin: `Computed`, `NativeOpcUa`, or `NativeMxAccess`. |
| `active` | 9 | bool | Alarm condition is active. |
| `acknowledged` | 10 | bool | Alarm has been acknowledged. |
| `confirmed` | 11 | bool | Alarm has been confirmed. The domain `Confirmed` (`bool?`) collapses to a definite bool on the wire. |
| `shelve_state` | 12 | string | `Unshelved`, `OneShotShelved`, `TimedShelved`, or `PermanentShelved`. |
| `suppressed` | 13 | bool | Alarm is suppressed by the source system. |
| `source_reference` | 14 | string | Source node / tag reference. |
| `alarm_type_name` | 15 | string | Native alarm type name. |
| `category` | 16 | string | Alarm category. |
| `operator_user` | 17 | string | User who last acted on the alarm. |
| `operator_comment` | 18 | string | Operator comment from the last action. |
| `original_raise_time` | 19 | Timestamp | First-raise time of the underlying condition (nullable on the wire). |
| `current_value` | 20 | string | Current process value associated with the alarm. |
| `limit_value` | 21 | string | Limit / setpoint value that the alarm evaluates against. |
| `native_source_canonical_name` | 22 | string | Native binding canonical name; empty for computed alarms. |
| `is_configured_placeholder` | 23 | bool | Marks a quiet-binding placeholder row. Snapshot-only: see the relay note below; on the live gRPC stream this is always `false`. |

Server-side mapping (StreamRelayActor.HandleAlarmStateChanged): maps the enriched domain AlarmStateChanged event — Kind + AlarmConditionState + native metadata — out to the proto AlarmStateUpdate. The nullable original_raise_time is emitted only when present, and shelve_state is mapped from the domain shelve enum to its wire string via a new AlarmShelveStateCodec (string↔enum, defaulting to Unshelved). The domain Confirmed (bool?) is collapsed to a definite bool for field 11.
Placeholder rows are dropped at the relay: is_configured_placeholder (field 23) is a Debug View snapshot-only concept emitted by InstanceActor.BuildAlarmStatesSnapshot for quiet bindings — it is never a real alarm transition (its timestamp may be DateTimeOffset.MinValue, the Protobuf Timestamp lower boundary). StreamRelayActor.HandleAlarmStateChanged therefore returns early — never relaying a placeholder row to the live gRPC stream — so field 23 is always false on the live stream and only ever carries true in the snapshot path.
Client-side mapping (SiteStreamGrpcClient.ConvertToDomainEvent): reconstructs the domain AlarmStateChanged from the proto — Kind is parsed via ParseAlarmKind, the Condition is rebuilt with severity taken from the existing wire priority, and native metadata is repopulated from fields 8–23 (native_source_canonical_name → NativeSourceCanonicalName, is_configured_placeholder → IsConfiguredPlaceholder) — so central-side consumers receive the same domain event the site emitted.
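
The server-side mapping rules above (shelve-state codec with an `Unshelved` default, nullable `Confirmed` collapsed to a bool, `original_raise_time` emitted only when present, placeholder rows dropped) can be sketched as follows. This is a Python illustration, not the real `StreamRelayActor` or `AlarmShelveStateCodec`. Alarms are modelled as dictionaries, and mapping a null `Confirmed` to `False` is an assumption, since the text only says it collapses to a definite bool.

```python
from enum import Enum


class ShelveState(Enum):
    UNSHELVED = "Unshelved"
    ONE_SHOT_SHELVED = "OneShotShelved"
    TIMED_SHELVED = "TimedShelved"
    PERMANENT_SHELVED = "PermanentShelved"


def shelve_to_wire(state):
    return state.value


def shelve_from_wire(text):
    """Parse the wire string, defaulting to Unshelved for unknown or empty input."""
    try:
        return ShelveState(text)
    except ValueError:
        return ShelveState.UNSHELVED


def to_wire(alarm):
    """Map a domain alarm (dict) to wire fields; return None for placeholder rows."""
    if alarm.get("is_configured_placeholder"):
        return None  # placeholders are snapshot-only and never relayed live
    update = {
        "kind": alarm["kind"],
        "confirmed": bool(alarm.get("confirmed")),  # assumption: None -> False
        "shelve_state": shelve_to_wire(alarm["shelve_state"]),
        "is_configured_placeholder": False,
    }
    if alarm.get("original_raise_time") is not None:
        update["original_raise_time"] = alarm["original_raise_time"]
    return update


live = to_wire({
    "kind": "NativeOpcUa",
    "confirmed": None,
    "shelve_state": ShelveState.TIMED_SHELVED,
    "original_raise_time": None,
})
placeholder = to_wire({"kind": "NativeOpcUa", "is_configured_placeholder": True})
print(live, placeholder)
```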

Regeneration is manual (macOS-only). sitestream.proto is not auto-compiled: the <Protobuf> include is commented out in the .csproj, and the generated C# is vendored under SiteStreamGrpc/. To regenerate after editing the proto: toggle the <Protobuf> include on, build so Grpc.Tools regenerates the C#, copy the generated files into SiteStreamGrpc/, then re-comment the include. Adding AlarmStateUpdate fields 8–23 and the four unary RPCs (IngestAuditEvents, IngestCachedTelemetry, PullAuditEvents, PullSiteCalls) plus their message types followed this process.

gRPC Connection Keepalive

Three layers of dead-client detection prevent orphan streams on site nodes:

| Layer | Detects | Timeline | Mechanism |
| --- | --- | --- | --- |
| TCP RST | Clean process death, connection close | 1–5s | OS-level TCP, `WriteAsync` throws |
| gRPC keepalive PING | Network partition, silent crash, firewall drop | ~25s | HTTP/2 PING frames, `CancellationToken` fires |
| Session timeout | Misconfigured keepalive, long-lived zombie streams | 4 hours | `CancellationTokenSource.CancelAfter` |

Keepalive settings are configurable via CommunicationOptions:

GrpcKeepAlivePingDelay: 15 seconds (default)
GrpcKeepAlivePingTimeout: 10 seconds (default)
GrpcMaxStreamLifetime: 4 hours (default)
GrpcMaxConcurrentStreams: 100 (default)
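
As a worked check of the timeline column, the ~25s figure for the keepalive layer appears to be the ping delay plus the ping timeout. The short Python calculation below assumes that reading; the variable names are illustrative and are not the `CommunicationOptions` API.

```python
from datetime import timedelta

ping_delay = timedelta(seconds=15)        # GrpcKeepAlivePingDelay default
ping_timeout = timedelta(seconds=10)      # GrpcKeepAlivePingTimeout default
max_stream_lifetime = timedelta(hours=4)  # GrpcMaxStreamLifetime default

# Worst case for a silent failure: wait a full ping delay on an idle
# connection, then wait the full timeout for the missing PING ack.
keepalive_detection = ping_delay + ping_timeout

print(keepalive_detection.total_seconds())
print(max_stream_lifetime.total_seconds())
```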

6a. Debug Snapshot (Central → Site)

Pattern: Request/Response (one-shot, no subscription).
Central sends a DebugSnapshotRequest (identified by instance unique name) to the site.
The site's Deployment Manager routes the request to the Instance Actor by unique name.
Instance Actor builds and returns a DebugViewSnapshot with all current attribute values and alarm states (same payload as the streaming initial snapshot).
No subscription is created; no stream is established.
Uses the 30-second QueryTimeout.

7. Health Reporting (Site → Central)

Pattern: Periodic push.
Sites periodically send health metrics (connection status, node status, buffer depth, script error rates, alarm evaluation error rates) to central.

8. Remote Queries (Central → Site)

Pattern: Request/Response.
Central queries sites for:
- Parked messages (store-and-forward dead letters).
- Site event logs.
- Instance debug snapshots (attribute values and alarm states).
Central can also send management commands:
- Retry or discard parked messages and parked cached calls — central sends RetryParkedOperation / DiscardParkedOperation (keyed by TrackedOperationId) to the owning site, which applies the change to its S&F buffer and tracking table.

9. Notification Submission (Site → Central)

Pattern: Fire-and-forget with acknowledgment.
The site Store-and-Forward Engine sends a NotificationSubmit message to central carrying the notification — NotificationId, target list name, subject, body, and source provenance.
Central ingests the submission with an insert-if-not-exists on NotificationId and acknowledges after the row is persisted to the Notifications table in the central configuration database. The site S&F engine clears the buffered message only on that ack.
The NotificationId GUID — generated at the site — is the idempotency key. The handoff is at-least-once: a re-sent submission after a lost ack is harmless because central's insert-if-not-exists treats the duplicate as a no-op.
Transport: ClusterClient (site→central command/control), consistent with how other site→central messages are sent.
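
The idempotent ingest described above can be sketched with an in-memory SQLite table standing in for the central configuration database. This is a Python illustration under stated assumptions: only `NotificationId` and the `Notifications` table name come from the text, the other column names are hypothetical, and `INSERT OR IGNORE` is SQLite's way of expressing insert-if-not-exists rather than the statement the real component uses.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Notifications ("
    " NotificationId TEXT PRIMARY KEY,"
    " ListName TEXT, Subject TEXT, Body TEXT)"  # hypothetical columns
)


def ingest(conn, notification_id, list_name, subject, body):
    """Insert if not exists; return True (the ack) once the row is persisted."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO Notifications VALUES (?, ?, ?, ?)",
            (notification_id, list_name, subject, body),
        )
    return True


notification_id = str(uuid.uuid4())  # generated at the site
ack1 = ingest(conn, notification_id, "Operators", "Tank high", "Level above limit")
# The first ack was lost, so the site re-sends the same submission.
ack2 = ingest(conn, notification_id, "Operators", "Tank high", "Level above limit")

row_count = conn.execute("SELECT COUNT(*) FROM Notifications").fetchone()[0]
print(ack1, ack2, row_count)
```

Both submissions are acknowledged, but only one row exists, which is what makes the at-least-once handoff safe.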

10. Cached Call Telemetry (Site → Central)

Pattern: Fire-and-forget telemetry with a periodic reconciliation pull.
The site Store-and-Forward Engine emits a CachedCallTelemetry message to central on every cached-call lifecycle transition (Pending → Retrying → Delivered / Parked / Failed / Discarded). The first telemetry event for an operation carries its initial status — Pending when a transient failure has buffered the call, or directly Delivered/Failed for a cached call that never buffers. The message carries the TrackedOperationId, source site, Kind (the TrackedOperationKind enum), target summary, status, retry count, last error, key timestamps, and source provenance.
Emission is best-effort and at-least-once, idempotent on TrackedOperationId — central's Site Call Audit component ingests with insert-if-not-exists then upsert-on-newer-status, so a re-sent or out-of-order event is harmless.
Reconciliation pull: because telemetry is best-effort, the central Site Call Audit component periodically — and on site reconnect — pulls the changed rows back from each site over the PullSiteCalls unary gRPC RPC on SiteStreamService (not a ClusterClient round-trip). Central sends a PullSiteCallsRequest (since_utc cursor + batch_size); the site reads its IOperationTrackingStore and replies with a PullSiteCallsResponse carrying the matching operation-tracking rows (as SiteCallOperationalDtos) plus a more_available flag that signals a saturated batch so central advances the cursor and pulls again. Any telemetry missed during a disconnect self-heals through this pull. The Audit Log (#23) reconciliation feed uses the sibling PullAuditEvents RPC the same way.
Central audit is an eventually-consistent mirror — the site's operation tracking table remains the source of truth for cached-call status (Tracking.Status(id) is always answered site-locally).
Transport: the push telemetry emission rides ClusterClient (site→central command/control), consistent with how other site→central messages are sent; the reconciliation pull rides the gRPC unary PullSiteCalls RPC (central→site request/response). The two paths are complementary — push is the fast, best-effort feed; pull is the slower self-heal backfill.
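
The reconciliation pull loop can be sketched as below. This Python model is illustrative only: `serve_pull` and `reconcile` are hypothetical names, timestamps are plain integers, and advancing the cursor to the newest `updated_utc` in each batch is an assumption about how central moves the `since_utc` cursor. A real implementation would also need a rule for rows that share a timestamp across a batch boundary.

```python
def serve_pull(rows, since_utc, batch_size):
    """Site side: rows changed after the cursor, oldest first, capped at batch_size."""
    changed = sorted(
        (r for r in rows if r["updated_utc"] > since_utc),
        key=lambda r: r["updated_utc"],
    )
    batch = changed[:batch_size]
    return {"rows": batch, "more_available": len(changed) > len(batch)}


def reconcile(rows_at_site, mirror, since_utc, batch_size):
    """Central side: pull until the site reports no more rows; return the new cursor."""
    while True:
        response = serve_pull(rows_at_site, since_utc, batch_size)
        for row in response["rows"]:
            mirror[row["id"]] = row  # stand-in for the idempotent upsert
            since_utc = max(since_utc, row["updated_utc"])
        if not response["more_available"]:
            return since_utc


site_rows = [{"id": f"op-{n}", "updated_utc": n} for n in range(1, 6)]
mirror = {}
cursor = reconcile(site_rows, mirror, since_utc=0, batch_size=2)
print(cursor, sorted(mirror))
```

With five changed rows and a batch size of two, the loop issues three pulls: two saturated batches with `more_available` set, then a final partial batch.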

Topology

%%{init: {'theme':'base', 'themeVariables': {'textColor':'#111111','lineColor':'#555555','edgeLabelBackground':'#ffffff','fontSize':'15px'}}}%%
flowchart LR
    subgraph Central["Central Cluster"]
        CCA["ClusterClient<br/>(command/control)"]
        CCB["ClusterClient<br/>(command/control)"]
        CCN["ClusterClient<br/>(command/control)"]
        GRPCC["SiteStreamGrpcClient<br/>(real-time data)"]
    end

    subgraph SiteA["Site A Cluster"]
        SACOMM["SiteCommunicationActor<br/>(via Receptionist)"]
        SAGRPC["SiteStreamGrpcServer<br/>(Kestrel HTTP/2, port 8083)"]
        SACC["ClusterClient to Central<br/>(CentralCommunicationActor)"]
    end

    subgraph SiteB["Site B Cluster"]
        SBCOMM["SiteCommunicationActor<br/>(via Receptionist)"]
        SBGRPC["SiteStreamGrpcServer"]
    end

    subgraph SiteN["Site N Cluster"]
        SNCOMM["SiteCommunicationActor<br/>(via Receptionist)"]
        SNGRPC["SiteStreamGrpcServer"]
    end

    CCA -->|command/control| SACOMM
    CCB -->|command/control| SBCOMM
    CCN -->|command/control| SNCOMM

    SAGRPC -->|"gRPC stream (real-time data)"| GRPCC
    SBGRPC -->|gRPC stream| GRPCC
    SNGRPC -->|gRPC stream| GRPCC

    SACC -.->|command/control| Central

    NOTE["Sites do NOT communicate with each other.<br/>All inter-cluster communication flows through Central."]

    classDef start fill:#d5e8d4,stroke:#82b366,color:#111111;
    classDef proc fill:#dae8fc,stroke:#6c8ebf,color:#111111;
    classDef dec fill:#fff2cc,stroke:#d6b656,color:#111111;
    classDef alt fill:#e1d5e7,stroke:#9673a6,color:#111111;
    classDef muted fill:#f5f5f5,stroke:#999999,color:#666666;
    class CCA,CCB,CCN,SACOMM,SACC,SBCOMM,SNCOMM dec
    class GRPCC,SAGRPC,SBGRPC,SNGRPC start
    class NOTE muted
    class Central proc
    class SiteA,SiteB,SiteN alt

Sites do not communicate with each other.
All inter-cluster communication flows through central.
Both CentralCommunicationActor and SiteCommunicationActor are registered with their cluster's ClusterClientReceptionist for cross-cluster discovery.

Site Address Resolution

Central discovers site addresses through the configuration database, not runtime registration:

Each site record in the Sites table includes optional NodeAAddress and NodeBAddress fields containing base Akka addresses of the site's cluster nodes (e.g., akka.tcp://scadabridge@host:port), and optional GrpcNodeAAddress and GrpcNodeBAddress fields containing gRPC endpoints (e.g., http://host:8083).
The CentralCommunicationActor loads all site addresses from the database at startup and creates one ClusterClient per site, configured with both NodeA and NodeB as contact points. The SiteStreamGrpcClientFactory uses GrpcNodeAAddress / GrpcNodeBAddress to create per-site gRPC channels for streaming.
The address cache is refreshed every 60 seconds and on-demand when site records are added, edited, or deleted via the Central UI or CLI. ClusterClient instances are recreated when contact points change.
When routing a message to a site, central sends via ClusterClient.Send("/user/site-communication", msg). ClusterClient handles failover between NodeA and NodeB internally — there is no application-level NodeA preference/NodeB fallback logic.
Heartbeats from sites serve health monitoring only — they do not serve as a registration or address discovery mechanism.
If no addresses are configured for a site, messages to that site are dropped and the caller's Ask times out.
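
One way to picture the cache refresh is as a diff between the cached address map and the freshly loaded one, which decides which per-site clients to create, recreate, or remove. The Python sketch below is an illustration under assumptions: `plan_client_changes` is a hypothetical helper, and the host names and ports are placeholders.

```python
def plan_client_changes(cached, loaded):
    """Compare contact points per site and decide what to do with each client."""
    create, recreate, remove = [], [], []
    for site in sorted(set(cached) | set(loaded)):
        old = cached.get(site, ())
        new = loaded.get(site, ())
        if new and not old:
            create.append(site)      # site gained addresses
        elif old and not new:
            remove.append(site)      # site deleted or addresses cleared
        elif old != new:
            recreate.append(site)    # contact points changed
    return {"create": create, "recreate": recreate, "remove": remove}


cached = {
    "SiteA": ("akka.tcp://scadabridge@a1:8081", "akka.tcp://scadabridge@a2:8081"),
    "SiteB": ("akka.tcp://scadabridge@b1:8081",),
}
loaded = {
    "SiteA": ("akka.tcp://scadabridge@a1:8081", "akka.tcp://scadabridge@a3:8081"),
    "SiteC": ("akka.tcp://scadabridge@c1:8081",),
}
plan = plan_client_changes(cached, loaded)
print(plan)
```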

Site → Central Communication

Site nodes configure a list of CentralContactPoints (both central node addresses) instead of a single CentralActorPath.
The site creates a ClusterClient using the central contact points and sends heartbeats, health reports, and other messages via ClusterClient.Send("/user/central-communication", msg).
ClusterClient handles automatic failover between central nodes — if the active central node goes down, the site's ClusterClient reconnects to the standby node transparently.

Message Timeouts

Each request/response pattern has a default timeout that can be overridden in configuration:

| Pattern | Default Timeout | Rationale |
| --- | --- | --- |
| 1. Deployment | 120 seconds | Script compilation at the site can be slow |
| 2. Instance Lifecycle | 30 seconds | Stopping and starting actors is fast |
| 3. System-Wide Artifacts | 120 seconds per site | Includes shared script recompilation |
| 4. Integration Routing | 30 seconds | External system waiting for response; Inbound API per-method timeout may cap this further |
| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |
| 9. Notification Submission | 30 seconds | Fire-and-forget with ack; central acks after persisting the row |
| 10. Cached Call Telemetry | 30 seconds | Telemetry emission (ClusterClient) is fire-and-forget; the reconciliation pull is the unary gRPC `PullSiteCalls` request/response (its deadline is the gRPC call timeout, not the Akka ask) |

Timeouts use the Akka.NET ask pattern. If no response is received within the timeout, the caller receives a timeout failure.
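
The ask-with-timeout behaviour can be illustrated with a small asyncio analogue. This is not Akka.NET code: `ask`, `healthy_site`, and `unreachable_site` are hypothetical names, and returning a failure dictionary on timeout is a choice made for this sketch.

```python
import asyncio


async def ask(send, message, timeout_seconds):
    """Send a request and wait for the reply, failing after timeout_seconds."""
    try:
        return await asyncio.wait_for(send(message), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # No retry and no buffering: the caller just sees a failure.
        return {"ok": False, "error": "timeout"}


async def healthy_site(message):
    return {"ok": True, "echo": message}


async def unreachable_site(message):
    await asyncio.sleep(3600)  # never answers within the timeout


result_ok = asyncio.run(ask(healthy_site, "disable Pump01", 0.05))
result_timeout = asyncio.run(ask(unreachable_site, "disable Pump01", 0.05))
print(result_ok, result_timeout)
```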

Transport Configuration

Akka.NET remoting provides the underlying transport for both intra-cluster communication and ClusterClient connections. The following transport-level settings are explicitly configured (not left to framework defaults) for predictable behavior:

Transport heartbeat interval: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
Failure detection threshold: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
Reconnection: ClusterClient handles reconnection and failover between contact points automatically for cross-cluster communication. No custom reconnection logic is required.

These settings should be tuned for the expected network conditions between central and site clusters.

Application-Level Correlation

All request/response messages include an application-level correlation ID to ensure correct pairing of requests and responses, even across reconnection events:

Deployments include a deployment ID and revision hash for idempotency (see Deployment Manager).
Lifecycle commands include a command ID for deduplication.
Remote queries include a query ID for response correlation.

This provides protocol-level safety beyond Akka.NET's transport guarantees, which may not hold across disconnect/reconnect cycles.
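
Pairing responses to requests by correlation ID can be sketched as a table of pending requests. The Python model below is illustrative and uses hypothetical names (`PendingRequests`, `register`, `resolve`). It shows how a late or duplicate response, for example one arriving after a reconnect, is ignored rather than matched to the wrong request.

```python
import uuid


class PendingRequests:
    def __init__(self):
        self._pending = {}

    def register(self, request):
        """Record an outgoing request and return its correlation ID."""
        correlation_id = str(uuid.uuid4())
        self._pending[correlation_id] = request
        return correlation_id

    def resolve(self, correlation_id, response):
        """Pair a response with its request; return None for unknown or duplicate IDs."""
        request = self._pending.pop(correlation_id, None)
        if request is None:
            return None
        return (request, response)


pending = PendingRequests()
query_id = pending.register({"type": "ParkedMessagesQuery", "site": "SiteA"})

matched = pending.resolve(query_id, {"count": 3})
duplicate = pending.resolve(query_id, {"count": 3})   # same reply delivered twice
unknown = pending.resolve("not-a-known-id", {"count": 0})
print(matched, duplicate, unknown)
```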

Message Ordering

Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee — messages to a given site are processed in the order they are sent. Callers do not need to handle out-of-order delivery.

ManagementActor and ClusterClient

The ManagementActor is registered at the well-known path /user/management on central nodes and advertised via ClusterClientReceptionist. External tools (primarily the CLI) connect using Akka.NET ClusterClient, which contacts the receptionist to discover the ManagementActor. This is a separate ClusterClient usage from the inter-cluster ClusterClient connections used for central-site messaging — the CLI does not participate in cluster membership or affect the hub-and-spoke topology.

Connection Failure Behavior

Disconnect is detected at the transport layer, never via an application-level signal from central. There is no ConnectionStateChanged-style synchronous notification: the central coordinator does not maintain a model of "this site is up / down" because the two transports already report unavailability at their natural cadence.

In-flight command/control messages (ClusterClient + Ask): When a connection drops while a request is in flight (e.g., a deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is no automatic retry or buffering at central — the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages. An in-progress deployment whose round-trip exceeds the Ask timeout (default 120 s at CommunicationService.DeployInstanceAsync) surfaces as DeploymentStatus.Failed to the caller.
Debug streams (gRPC): A gRPC stream interruption is detected either by a TCP reset (clean disconnects, within a few seconds) or, for silent failures such as a network partition, by the HTTP/2 keepalive PING (~25 s). Either signal triggers the reconnection logic in the DebugStreamBridgeActor. The bridge actor attempts to reconnect to the other site node endpoint (NodeB if NodeA failed, or vice versa), with up to 3 retries and a 5-second backoff. If all retries fail, the consumer is notified via OnStreamTerminated and the bridge actor is stopped. Events during the reconnection gap are lost, which is acceptable for a real-time debug view. On successful reconnection, the consumer can request a fresh snapshot to re-sync state.
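
The debug-stream reconnection policy can be sketched as below. This is a Python illustration, not the `DebugStreamBridgeActor` implementation. It assumes the backoff applies between attempts rather than before the first one, and that attempts alternate between the two node endpoints. The endpoint host names are placeholders, and `sleep` is injectable so the example runs without waiting.

```python
import time


def reconnect(connect, endpoints, failed_index, max_retries=3,
              backoff_seconds=5.0, sleep=time.sleep):
    """Alternate between endpoints; return the one that connected, or None."""
    index = failed_index
    for attempt in range(max_retries):
        if attempt > 0:
            sleep(backoff_seconds)             # back off between retries
        index = (index + 1) % len(endpoints)   # switch to the other node
        if connect(endpoints[index]):
            return endpoints[index]
    return None  # caller terminates the session (OnStreamTerminated)


endpoints = ["http://site-a-node-a:8083", "http://site-a-node-b:8083"]
attempts, sleeps = [], []


def only_node_a_recovers(endpoint):
    attempts.append(endpoint)
    return endpoint == endpoints[0]


# NodeA (index 0) just failed: NodeB is tried first, then NodeA again.
connected = reconnect(only_node_a_recovers, endpoints, failed_index=0,
                      sleep=sleeps.append)
print(connected, attempts, sleeps)
```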

Failover Behavior

Central failover: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Site ClusterClients automatically reconnect to the standby central node via their configured contact points.
Site failover: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central's per-site ClusterClient automatically reconnects to the surviving site node. Ongoing debug streams are interrupted and must be re-established by the engineer.

Dependencies

Akka.NET Remoting + ClusterClient: Provides the command/control transport layer. ClusterClient/ClusterClientReceptionist used for cross-cluster command/control messaging (deployments, lifecycle, subscribe/unsubscribe handshake, snapshots).
gRPC (Grpc.AspNetCore + Grpc.Net.Client): Provides the real-time data streaming transport. Site nodes host a gRPC server (SiteStreamGrpcServer); central nodes create per-site gRPC clients (SiteStreamGrpcClient).
Cluster Infrastructure: Manages node roles and failover detection.
Configuration Database: Provides site node addresses (NodeAAddress, NodeBAddress for Akka remoting; GrpcNodeAAddress, GrpcNodeBAddress for gRPC streaming) for address resolution.
Site Runtime (SiteStreamManager): The SiteStreamGrpcServer subscribes to SiteStreamManager to receive real-time events for gRPC delivery.
ISiteAuditQueue (site-local): Handed to SiteStreamGrpcServer (post-construction, on site roles) so the PullAuditEvents RPC can read the site's Pending/Forwarded audit rows to serve the Audit Log (#23) reconciliation pull. Null when not wired (central-only host) — the handler then returns an empty response.
IOperationTrackingStore (site-local): Handed to SiteStreamGrpcServer (post-construction, on site roles) so the PullSiteCalls RPC can read operation-tracking rows changed since a cursor to serve the Site Call Audit (#22) reconciliation pull. Null when not wired — the handler returns an empty response.
AuditLogIngestActor proxy (central): Handed to SiteStreamGrpcServer after the central cluster singleton starts; the IngestAuditEvents / IngestCachedTelemetry RPCs route ingested batches to it. Null when not yet wired — the handler returns an empty IngestAck so the caller treats it as transient and retries.

Interactions

Deployment Manager (central): Uses communication to deliver configurations, lifecycle commands, and system-wide artifacts, and receive status.
Site Runtime: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
Central UI: Debug view requests and remote queries flow through communication.
Health Monitoring: Receives periodic health reports from sites.
Store-and-Forward Engine (site): Parked message queries/commands are routed through communication. Also emits CachedCallTelemetry (push, ClusterClient) and serves the PullSiteCalls gRPC reconciliation pull from its IOperationTrackingStore, and receives relayed RetryParkedOperation / DiscardParkedOperation commands.
Site Call Audit (central): Receives cached-call telemetry and issues the PullSiteCalls gRPC reconciliation pulls to sites; relays parked-operation Retry/Discard commands to sites through communication.
Audit Log (#23): Sites forward audit-event telemetry (push) and serve the PullAuditEvents gRPC reconciliation pull from their ISiteAuditQueue; the central AuditLogIngestActor is the ingest target for both the push path and the combined cached-call telemetry packet.
Site Event Logging: Event log queries are routed through communication.
Management Service: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, which is a separate channel from inter-cluster remoting.
