Remediation from the full per-module code review at 4307c381 (findings recorded
separately in code-reviews/).
Highs fixed:
- DeploymentManager-025/SiteRuntime-031: stop broadcasting notification lists + SMTP
configs (incl. credentials) to sites; site purges already-persisted rows on apply
(enforces the central-only delivery design; clears plaintext SMTP creds at rest).
- DataConnectionLayer-023: guard the native-alarm subscribe path against the
mid-flight-unsubscribe adapter-feed leak (mirrors the DCL-021 tag-path fix).
- SiteEventLogging-024: normalize From/To query bounds to UTC (the -016 fix the
audit trail claimed but never committed).
- KpiHistory-001: add an in-flight guard to the recorder sample tick.
- ScriptAnalysis-001: harden the trust analyzer's TPA-absent fallback (resolve
forbidden anchors in the minimal reference set; warn on degraded mode) — anchors
added to validation references only, never the compile gate.
(InboundAPI-026 left to the feat/ipsen-movein effort per owner decision.)
Medium/Low: DM-026 deterministic deploy-status tiebreaker; SR-027/028/029/030
native-alarm leak/phantom-active/delete-during-redeploy fixes; AL-013/014/016;
TE-024 (folder-mutation audit rows now persisted)/025; SF-025 gauge-provider
clear-on-stop; ESG-025/026; SEC-023/024/025; SCA-007/008/009; plus doc/test
accuracy COM-023/024, HOST-025/026, HM-024/025, NS-027/028.
Full-solution build 0 warnings; ~3560 tests across 18 touched suites green.
Component: Central–Site Communication
Purpose
The Communication component manages all messaging between the central cluster and site clusters. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, notification submission, and remote queries (parked messages, event logs). Two transports are used: Akka.NET ClusterClient for command/control messaging and gRPC server-streaming for real-time data (attribute values, alarm states).
Location
Both central and site clusters. Each side has communication actors that handle message routing.
Responsibilities
- Resolve site addresses (Akka remoting and gRPC) from the configuration database and maintain a cached address map.
- Establish and maintain cross-cluster connections using Akka.NET ClusterClient/ClusterClientReceptionist for command/control.
- Establish and maintain per-site gRPC streaming connections for real-time data delivery (site→central).
- Route messages between central and site clusters in a hub-and-spoke topology.
- Broker requests from external systems (via central) to sites and return responses.
- Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
- Detect site connectivity status for health monitoring.
- Host the SiteStreamGrpcServer on site nodes (Kestrel HTTP/2) to serve real-time event streams.
- Manage per-site SiteStreamGrpcClient instances on central nodes via SiteStreamGrpcClientFactory.
Communication Patterns
1. Deployment (Central → Site)
- Pattern: Request/Response.
- Central sends a flattened configuration to a site.
- Site Runtime receives, compiles scripts, creates/updates Instance Actors, and responds with success/failure.
- No buffering at central — if the site is unreachable, the deployment fails immediately.
2. Instance Lifecycle Commands (Central → Site)
- Pattern: Request/Response.
- Central sends disable, enable, or delete commands for specific instances.
- Site Runtime processes the command and responds with success/failure.
- If the site is unreachable, the command fails immediately (no buffering).
3. System-Wide Artifact Deployment (Central → Site(s))
- Pattern: Broadcast with per-site acknowledgment (deploy to all sites), or targeted to a single site (per-site deployment).
- When shared scripts, external system definitions, database connections, or data connections are explicitly deployed, central sends them to the target site(s). (Notification lists and SMTP configuration are central-only and are not deployed to sites.)
- Each site acknowledges receipt and reports success/failure independently.
- Shared script deployment triggers immediate recompilation on the site: the site's SharedScriptLibrary replaces its in-memory compiled code, making updated shared scripts available to all running instances without redeployment. Other artifact types (external systems, database connections, etc.) are stored but do not require recompilation.
4. Integration Routing (External System → Central → Site → Central → External System)
- Pattern: Request/Response (brokered).
- External system sends a request to central (e.g., MES requests machine values).
- Central routes the request to the appropriate site.
- Site reads values from the Instance Actor and responds.
- Central returns the response to the external system.
5. Recipe/Command Delivery (External System → Central → Site)
- Pattern: Fire-and-forget with acknowledgment.
- External system sends a command to central (e.g., recipe manager sends recipe).
- Central routes to the site.
- Site applies and acknowledges.
6. Debug Streaming (Site → Central)
- Pattern: Subscribe/push with initial snapshot. Two transports: ClusterClient for the subscribe/unsubscribe handshake and initial snapshot, gRPC server-streaming for ongoing real-time events.
- A DebugStreamBridgeActor (one per active debug session) is created on the central cluster by the DebugStreamService. The bridge actor first opens a gRPC server-streaming subscription to the site via SiteStreamGrpcClient, then sends a SubscribeDebugViewRequest to the site via CentralCommunicationActor (ClusterClient). The site's InstanceActor replies with an initial snapshot via the ClusterClient reply path.
- gRPC stream (real-time events): The site's SiteStreamGrpcServer receives the gRPC SubscribeInstance call and creates a StreamRelayActor that subscribes to SiteStreamManager for the requested instance. Events (AttributeValueChanged, AlarmStateChanged) flow from SiteStreamManager → StreamRelayActor → Channel<SiteStreamEvent> (bounded, 1000, DropOldest) → gRPC response stream → SiteStreamGrpcClient on central → DebugStreamBridgeActor.
- The DebugStreamEvent message type no longer exists; events are not routed through ClusterClient. SiteCommunicationActor and CentralCommunicationActor have no role in streaming event delivery.
- The bridge actor forwards received events to the consumer via callbacks (Blazor component or SignalR hub).
- Snapshot-to-stream handoff: The gRPC stream is opened before the snapshot request to avoid missing events. The consumer applies the snapshot as baseline, then replays buffered gRPC events with timestamps newer than the snapshot (timestamp-based dedup).
- Attribute value stream messages: [InstanceUniqueName].[AttributePath].[AttributeName], value, quality, timestamp.
- Alarm state stream messages: [InstanceUniqueName].[AlarmName], state (active/normal), priority, timestamp.
- Central sends an unsubscribe request via ClusterClient when the debug session ends. The gRPC stream is cancelled. The site's StreamRelayActor is stopped and the SiteStreamManager subscription is removed.
- The stream is session-based and temporary.
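The snapshot-to-stream handoff above can be sketched in a few lines. This is a language-neutral illustration in Python, not the C# implementation: the StreamEvent type, the function name, and the idea of a single snapshot timestamp are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamEvent:
    key: str          # e.g. "Pump1.Motor.Speed" (hypothetical attribute key)
    value: object
    timestamp: float  # any clock comparable with the snapshot timestamp


def apply_snapshot_then_replay(snapshot, snapshot_time, buffered_events):
    """Return the view state after the snapshot-to-stream handoff."""
    state = dict(snapshot)                  # the snapshot is the baseline
    for event in buffered_events:           # events buffered while waiting for the snapshot
        if event.timestamp > snapshot_time:
            state[event.key] = event.value  # newer than the snapshot: apply
        # otherwise the snapshot already reflects it, so it is skipped (dedup)
    return state


snapshot = {"Pump1.Motor.Speed": 40, "Pump1.Motor.Temp": 71}
buffered = [
    StreamEvent("Pump1.Motor.Speed", 39, timestamp=9.5),   # older than the snapshot
    StreamEvent("Pump1.Motor.Speed", 42, timestamp=10.5),  # newer than the snapshot
]
state = apply_snapshot_then_replay(snapshot, snapshot_time=10.0, buffered_events=buffered)
```

Opening the stream first means an event can arrive twice (once in the buffer, once inside the snapshot), which is why the replay filters by timestamp rather than applying every buffered event.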
Site-Side gRPC Streaming Components
- SiteStreamGrpcServer: gRPC service (SiteStreamService.SiteStreamServiceBase) hosted on each site node via Kestrel HTTP/2 on a dedicated port (default 8083). Implements the SubscribeInstance RPC. For each subscription, creates a StreamRelayActor that subscribes to SiteStreamManager and bridges events through a Channel<SiteStreamEvent> to the gRPC response stream. Tracks active subscriptions by correlation_id; duplicate IDs cancel the old stream. Enforces a max concurrent stream limit (default 100). Rejects streams with StatusCode.Unavailable before the actor system is ready.
- StreamRelayActor: Short-lived actor created per gRPC subscription. Receives domain events (AttributeValueChanged, AlarmStateChanged) from SiteStreamManager, converts them to protobuf SiteStreamEvent messages, and writes to the Channel<SiteStreamEvent> writer. Stopped when the gRPC stream is cancelled or the client disconnects.
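The bounded DropOldest behaviour of the relay channel can be illustrated with a small Python stand-in. The class below is hypothetical and only mimics the policy (when the buffer is full, the oldest event is discarded to make room); it is not the .NET Channel<T> API.

```python
from collections import deque


class DropOldestChannel:
    """Bounded buffer that discards the oldest item when full."""

    def __init__(self, capacity=1000):
        self._items = deque(maxlen=capacity)  # a full deque evicts from the left on append
        self.dropped = 0                      # count of evicted items, for diagnostics

    def write(self, item):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1                 # the append below will evict the oldest item
        self._items.append(item)

    def read_all(self):
        items = list(self._items)
        self._items.clear()
        return items


channel = DropOldestChannel(capacity=3)
for n in range(5):
    channel.write(n)
remaining = channel.read_all()
```

With this policy a slow consumer never blocks the relay actor; it sees the most recent events and loses the oldest ones, which suits a live debug view.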
Central-Side Debug Stream Components
- DebugStreamService: Singleton service that manages debug stream sessions. Resolves instance ID to unique name and site, creates and tears down DebugStreamBridgeActor instances, and provides a clean API for both Blazor components and the SignalR hub. Injects SiteStreamGrpcClientFactory for gRPC stream creation.
- DebugStreamBridgeActor: One per active debug session. Opens a gRPC streaming subscription via SiteStreamGrpcClient and receives real-time events via callback. Also receives the initial DebugViewSnapshot via ClusterClient. Forwards all events to the consumer via callbacks. Handles gRPC stream errors with reconnection logic: tries the other site node endpoint, retries with backoff (max 3 retries), terminates the session if all retries fail.
- SiteStreamGrpcClient: Per-site gRPC client that manages GrpcChannel instances and streaming subscriptions. Reads from the gRPC response stream in a background task, converts protobuf messages to domain events, and invokes the onEvent callback.
- SiteStreamGrpcClientFactory: Caches per-site SiteStreamGrpcClient instances. Reads GrpcNodeAAddress/GrpcNodeBAddress from the Site entity (loaded by CentralCommunicationActor). Falls back to NodeB if NodeA connection fails. Disposes clients on site removal or address change.
- DebugStreamHub: SignalR hub at /hubs/debug-stream for external consumers (e.g., CLI). Authenticates via Basic Auth + LDAP and requires the Deployment role. Server-to-client methods: OnSnapshot, OnAttributeChanged, OnAlarmChanged, OnStreamTerminated.
gRPC Proto Definition
The streaming protocol is defined in sitestream.proto (src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto):
- Service: SiteStreamService, hosted on each site node by SiteStreamGrpcServer, exposes five RPCs. One is the original real-time server-streaming subscription; the other four are unary request/response calls added by the Audit Log (#23) and Site Call Audit (#22) components. A unary call is request/response and is distinct from the command/control ClusterClient channel, so gRPC on this service is no longer real-time-stream-only:
  - SubscribeInstance(InstanceStreamRequest) returns (stream SiteStreamEvent): the real-time debug stream (§6); the only server-streaming RPC.
  - IngestAuditEvents(AuditEventBatch) returns (IngestAck): central-side ingest receiving surface for Audit Log (#23) telemetry; routes the batch to the central AuditLogIngestActor proxy and returns the accepted EventIds. (The production push path is still ClusterClient via ClusterClientSiteAuditClient; this RPC is the gRPC-receiving counterpart.)
  - IngestCachedTelemetry(CachedTelemetryBatch) returns (IngestAck): ingest receiving surface for the combined cached-call telemetry packet (audit row + SiteCalls operational upsert written in one transaction).
  - PullAuditEvents(PullAuditEventsRequest) returns (PullAuditEventsResponse): central→site reconciliation pull for the Audit Log self-heal feed; the site serves Pending/Forwarded rows from its ISiteAuditQueue.
  - PullSiteCalls(PullSiteCallsRequest) returns (PullSiteCallsResponse): central→site reconciliation pull for the Site Call Audit (#22) self-heal feed; the site serves operation-tracking rows changed since a cursor from its IOperationTrackingStore. A separate RPC from PullAuditEvents because the tracking store is the operational source of truth, distinct from the site audit queue.
- Messages: InstanceStreamRequest (correlation_id, instance_unique_name), SiteStreamEvent (correlation_id, oneof event: AttributeValueUpdate, AlarmStateUpdate); AuditEventDto / AuditEventBatch / IngestAck for ingest; CachedTelemetryPacket / CachedTelemetryBatch (each packet pairing an AuditEventDto with a SiteCallOperationalDto); PullAuditEventsRequest / PullAuditEventsResponse and PullSiteCallsRequest / PullSiteCallsResponse (each request carries since_utc + batch_size; each response carries more_available to signal a saturated batch).
- The oneof event pattern is extensible: future event types (health metrics, connection state changes) are added as new fields without breaking existing consumers.
- Proto field numbers are never reused; new RPCs and message fields are appended additively. Old clients ignore unknown oneof variants.
Enriched AlarmStateUpdate (Native Alarm Mirror)
AlarmStateUpdate carries the read-only native alarm mirror (Computed, native OPC UA, and native MxAccess Gateway alarms) to central over the existing gRPC real-time stream — no new transport, no command/control round-trip. The message was extended additively: existing fields 1–7 are unchanged, and fields 8–23 carry the enriched native-alarm state. Old clients that only read fields 1–7 continue to work; new fields are populated only where the source provides them.
| Field | # | Type | Meaning |
|---|---|---|---|
| kind | 8 | string | Alarm origin: Computed, NativeOpcUa, or NativeMxAccess. |
| active | 9 | bool | Alarm condition is active. |
| acknowledged | 10 | bool | Alarm has been acknowledged. |
| confirmed | 11 | bool | Alarm has been confirmed. The domain Confirmed (bool?) collapses to a definite bool on the wire. |
| shelve_state | 12 | string | Unshelved, OneShotShelved, TimedShelved, or PermanentShelved. |
| suppressed | 13 | bool | Alarm is suppressed by the source system. |
| source_reference | 14 | string | Source node / tag reference. |
| alarm_type_name | 15 | string | Native alarm type name. |
| category | 16 | string | Alarm category. |
| operator_user | 17 | string | User who last acted on the alarm. |
| operator_comment | 18 | string | Operator comment from the last action. |
| original_raise_time | 19 | Timestamp | First-raise time of the underlying condition (nullable on the wire). |
| current_value | 20 | string | Current process value associated with the alarm. |
| limit_value | 21 | string | Limit / setpoint value that the alarm evaluates against. |
| native_source_canonical_name | 22 | string | Native binding canonical name; empty for computed alarms. |
| is_configured_placeholder | 23 | bool | Marks a quiet-binding placeholder row. Snapshot-only (see the relay note below); on the live gRPC stream this is always false. |
- Server-side mapping (StreamRelayActor.HandleAlarmStateChanged): maps the enriched domain AlarmStateChanged event (Kind + AlarmConditionState + native metadata) out to the proto AlarmStateUpdate. The nullable original_raise_time is emitted only when present, and shelve_state is mapped from the domain shelve enum to its wire string via a new AlarmShelveStateCodec (string↔enum, defaulting to Unshelved). The domain Confirmed (bool?) is collapsed to a definite bool for field 11.
- Placeholder rows are dropped at the relay: is_configured_placeholder (field 23) is a Debug View snapshot-only concept emitted by InstanceActor.BuildAlarmStatesSnapshot for quiet bindings; it is never a real alarm transition (its timestamp may be DateTimeOffset.MinValue, the Protobuf Timestamp lower boundary). StreamRelayActor.HandleAlarmStateChanged therefore returns early, never relaying a placeholder row to the live gRPC stream, so field 23 is always false on the live stream and only ever carries true in the snapshot path.
- Client-side mapping (SiteStreamGrpcClient.ConvertToDomainEvent): reconstructs the domain AlarmStateChanged from the proto. Kind is parsed via ParseAlarmKind, the Condition is rebuilt with severity taken from the existing wire priority, and native metadata is repopulated from fields 8–23 (native_source_canonical_name → NativeSourceCanonicalName, is_configured_placeholder → IsConfiguredPlaceholder), so central-side consumers receive the same domain event the site emitted.
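As a rough illustration of the shelve-state codec and the Confirmed collapse described above, the following Python sketch shows a string↔enum mapping that defaults to Unshelved. The names are hypothetical and do not mirror the real AlarmShelveStateCodec; in particular, mapping an unknown (null) Confirmed to false is an assumption, since the text only says the value collapses to a definite bool.

```python
from enum import Enum
from typing import Optional


class ShelveState(Enum):
    UNSHELVED = "Unshelved"
    ONE_SHOT_SHELVED = "OneShotShelved"
    TIMED_SHELVED = "TimedShelved"
    PERMANENT_SHELVED = "PermanentShelved"


def encode_shelve_state(state: ShelveState) -> str:
    """Domain enum -> wire string."""
    return state.value


def decode_shelve_state(wire: str) -> ShelveState:
    """Wire string -> domain enum, defaulting to Unshelved for empty or unknown input."""
    try:
        return ShelveState(wire)
    except ValueError:
        return ShelveState.UNSHELVED


def collapse_confirmed(confirmed: Optional[bool]) -> bool:
    # Assumption: an unknown (None) confirmation is sent as False on the wire.
    return bool(confirmed)
```

Defaulting on decode keeps an older or partially populated message readable: a missing shelve_state string simply reads as Unshelved instead of failing the conversion.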
Regeneration is manual (macOS-only). sitestream.proto is not auto-compiled: the <Protobuf> include is commented out in the .csproj, and the generated C# is vendored under SiteStreamGrpc/. To regenerate after editing the proto: toggle the <Protobuf> include on, build so Grpc.Tools regenerates the C#, copy the generated files into SiteStreamGrpc/, then re-comment the include. Adding AlarmStateUpdate fields 8–23 and the four unary RPCs (IngestAuditEvents, IngestCachedTelemetry, PullAuditEvents, PullSiteCalls) plus their message types followed this process.
gRPC Connection Keepalive
Three layers of dead-client detection prevent orphan streams on site nodes:
| Layer | Detects | Timeline | Mechanism |
|---|---|---|---|
| TCP RST | Clean process death, connection close | 1–5s | OS-level TCP, WriteAsync throws |
| gRPC keepalive PING | Network partition, silent crash, firewall drop | ~25s | HTTP/2 PING frames, CancellationToken fires |
| Session timeout | Misconfigured keepalive, long-lived zombie streams | 4 hours | CancellationTokenSource.CancelAfter |
Keepalive settings are configurable via CommunicationOptions:
- GrpcKeepAlivePingDelay: 15 seconds (default)
- GrpcKeepAlivePingTimeout: 10 seconds (default)
- GrpcMaxStreamLifetime: 4 hours (default)
- GrpcMaxConcurrentStreams: 100 (default)
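The ~25 s figure in the keepalive row is consistent with the two defaults added together. The arithmetic below is a sketch under that reading (one full ping delay on an idle connection, then one full ping timeout waiting for the acknowledgement); the variable names are illustrative, not the CommunicationOptions API.

```python
from datetime import timedelta

ping_delay = timedelta(seconds=15)        # GrpcKeepAlivePingDelay default
ping_timeout = timedelta(seconds=10)      # GrpcKeepAlivePingTimeout default
max_stream_lifetime = timedelta(hours=4)  # GrpcMaxStreamLifetime default

# Worst case for a silent peer: wait out the delay, send a PING, wait out the timeout.
worst_case_detection = ping_delay + ping_timeout
```

Shortening either setting shortens detection time at the cost of more PING traffic; the session timeout stays as the last-resort bound regardless.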
6a. Debug Snapshot (Central → Site)
- Pattern: Request/Response (one-shot, no subscription).
- Central sends a DebugSnapshotRequest (identified by instance unique name) to the site.
- Site's Deployment Manager routes to the Instance Actor by unique name.
- Instance Actor builds and returns a DebugViewSnapshot with all current attribute values and alarm states (same payload as the streaming initial snapshot).
- No subscription is created; no stream is established.
- Uses the 30-second QueryTimeout.
7. Health Reporting (Site → Central)
- Pattern: Periodic push.
- Sites periodically send health metrics (connection status, node status, buffer depth, script error rates, alarm evaluation error rates) to central.
8. Remote Queries (Central → Site)
- Pattern: Request/Response.
- Central queries sites for:
- Parked messages (store-and-forward dead letters).
- Site event logs.
- Instance debug snapshots (attribute values and alarm states).
- Central can also send management commands:
- Retry or discard parked messages and parked cached calls: central sends RetryParkedOperation/DiscardParkedOperation (keyed by TrackedOperationId) to the owning site, which applies the change to its S&F buffer and tracking table.
9. Notification Submission (Site → Central)
- Pattern: Fire-and-forget with acknowledgment.
- The site Store-and-Forward Engine sends a NotificationSubmit message to central carrying the notification: NotificationId, target list name, subject, body, and source provenance.
- Central ingests the submission with an insert-if-not-exists on NotificationId and acknowledges after the row is persisted to the Notifications table in the central configuration database. The site S&F engine clears the buffered message only on that ack.
- The NotificationId GUID, generated at the site, is the idempotency key. The handoff is at-least-once: a re-sent submission after a lost ack is harmless because central's insert-if-not-exists treats the duplicate as a no-op.
- Transport: ClusterClient (site→central command/control), consistent with how other site→central messages are sent.
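The insert-if-not-exists handoff can be sketched with an in-memory SQLite table. This is only an illustration of the idempotency idea: the column set, the ingest function, and the ack shape are assumptions, and the real central store is the configuration database, not SQLite.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE Notifications ("
    " NotificationId TEXT PRIMARY KEY, ListName TEXT, Subject TEXT, Body TEXT)"
)


def ingest(submission):
    """Persist the submission if it is new, then return an ack for its ID."""
    with db:  # commits on exit, so the ack is only produced after the row is durable
        db.execute(
            "INSERT OR IGNORE INTO Notifications VALUES (?, ?, ?, ?)",
            (submission["id"], submission["list"], submission["subject"], submission["body"]),
        )
    return {"ack": submission["id"]}


submission = {"id": str(uuid.uuid4()), "list": "Operators", "subject": "High temp", "body": "Tank 3"}
first_ack = ingest(submission)
second_ack = ingest(submission)  # re-send after a lost ack: treated as a no-op
row_count = db.execute("SELECT COUNT(*) FROM Notifications").fetchone()[0]
```

Because the duplicate still produces an ack, the site can clear its buffered message after the retry even though the first ack never arrived.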
10. Cached Call Telemetry (Site → Central)
- Pattern: Fire-and-forget telemetry with a periodic reconciliation pull.
- The site Store-and-Forward Engine emits a CachedCallTelemetry message to central on every cached-call lifecycle transition (Pending → Retrying → Delivered / Parked / Failed / Discarded). The first telemetry event for an operation carries its initial status: Pending when a transient failure has buffered the call, or directly Delivered/Failed for a cached call that never buffers. The message carries the TrackedOperationId, source site, Kind (the TrackedOperationKind enum), target summary, status, retry count, last error, key timestamps, and source provenance.
- Emission is best-effort and at-least-once, idempotent on TrackedOperationId: central's Site Call Audit component ingests with insert-if-not-exists then upsert-on-newer-status, so a re-sent or out-of-order event is harmless.
- Reconciliation pull: because telemetry is best-effort, the central Site Call Audit component periodically (and on site reconnect) pulls the changed rows back from each site over the PullSiteCalls unary gRPC RPC on SiteStreamService (not a ClusterClient round-trip). Central sends a PullSiteCallsRequest (since_utc cursor + batch_size); the site reads its IOperationTrackingStore and replies with a PullSiteCallsResponse carrying the matching operation-tracking rows (as SiteCallOperationalDtos) plus a more_available flag that signals a saturated batch so central advances the cursor and pulls again. Any telemetry missed during a disconnect self-heals through this pull. The Audit Log (#23) reconciliation feed uses the sibling PullAuditEvents RPC the same way.
- Central audit is an eventually-consistent mirror: the site's operation tracking table remains the source of truth for cached-call status (Tracking.Status(id) is always answered site-locally).
- Transport: the push telemetry emission rides ClusterClient (site→central command/control), consistent with how other site→central messages are sent; the reconciliation pull rides the gRPC unary PullSiteCalls RPC (central→site request/response). The two paths are complementary: push is the fast, best-effort feed; pull is the slower self-heal backfill.
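The cursor-driven reconciliation pull can be sketched as a loop that keeps requesting batches while more_available is set. The FakeSite class, the Row shape, and the rule for advancing the cursor (to the newest change timestamp in the batch) are assumptions for illustration; only since_utc, batch_size, and more_available come from the protocol description above.

```python
from dataclasses import dataclass


@dataclass
class Row:
    operation_id: str
    updated_utc: int  # hypothetical change timestamp used as the cursor


class FakeSite:
    """Stand-in for the site side of PullSiteCalls."""

    def __init__(self, rows):
        self._rows = sorted(rows, key=lambda r: r.updated_utc)

    def pull_site_calls(self, since_utc, batch_size):
        changed = [r for r in self._rows if r.updated_utc > since_utc]
        batch = changed[:batch_size]
        return batch, len(changed) > batch_size  # (rows, more_available)


def reconcile(site, since_utc, batch_size):
    """Pull until the site reports no more rows; return the rows and the new cursor."""
    pulled = []
    while True:
        rows, more_available = site.pull_site_calls(since_utc, batch_size)
        pulled.extend(rows)
        if rows:
            # Assumes distinct timestamps; ties across a batch boundary would need a tiebreaker.
            since_utc = rows[-1].updated_utc
        if not more_available:
            return pulled, since_utc


site = FakeSite([Row(f"op-{n}", updated_utc=n) for n in range(1, 6)])
pulled, cursor = reconcile(site, since_utc=0, batch_size=2)
```

Since central ingests idempotently on TrackedOperationId, re-pulling rows that were already received via push is harmless, so the cursor only has to be conservative rather than exact.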
Topology
%%{init: {'theme':'base', 'themeVariables': {'textColor':'#111111','lineColor':'#555555','edgeLabelBackground':'#ffffff','fontSize':'15px'}}}%%
flowchart LR
subgraph Central["Central Cluster"]
CCA["ClusterClient<br/>(command/control)"]
CCB["ClusterClient<br/>(command/control)"]
CCN["ClusterClient<br/>(command/control)"]
GRPCC["SiteStreamGrpcClient<br/>(real-time data)"]
end
subgraph SiteA["Site A Cluster"]
SACOMM["SiteCommunicationActor<br/>(via Receptionist)"]
SAGRPC["SiteStreamGrpcServer<br/>(Kestrel HTTP/2, port 8083)"]
SACC["ClusterClient to Central<br/>(CentralCommunicationActor)"]
end
subgraph SiteB["Site B Cluster"]
SBCOMM["SiteCommunicationActor<br/>(via Receptionist)"]
SBGRPC["SiteStreamGrpcServer"]
end
subgraph SiteN["Site N Cluster"]
SNCOMM["SiteCommunicationActor<br/>(via Receptionist)"]
SNGRPC["SiteStreamGrpcServer"]
end
CCA -->|command/control| SACOMM
CCB -->|command/control| SBCOMM
CCN -->|command/control| SNCOMM
SAGRPC -->|"gRPC stream (real-time data)"| GRPCC
SBGRPC -->|gRPC stream| GRPCC
SNGRPC -->|gRPC stream| GRPCC
SACC -.->|command/control| Central
NOTE["Sites do NOT communicate with each other.<br/>All inter-cluster communication flows through Central."]
classDef start fill:#d5e8d4,stroke:#82b366,color:#111111;
classDef proc fill:#dae8fc,stroke:#6c8ebf,color:#111111;
classDef dec fill:#fff2cc,stroke:#d6b656,color:#111111;
classDef alt fill:#e1d5e7,stroke:#9673a6,color:#111111;
classDef muted fill:#f5f5f5,stroke:#999999,color:#666666;
class CCA,CCB,CCN,SACOMM,SACC,SBCOMM,SNCOMM dec
class GRPCC,SAGRPC,SBGRPC,SNGRPC start
class NOTE muted
class Central proc
class SiteA,SiteB,SiteN alt
- Sites do not communicate with each other.
- All inter-cluster communication flows through central.
- Both CentralCommunicationActor and SiteCommunicationActor are registered with their cluster's ClusterClientReceptionist for cross-cluster discovery.
Site Address Resolution
Central discovers site addresses through the configuration database, not runtime registration:
- Each site record in the Sites table includes optional NodeAAddress and NodeBAddress fields containing base Akka addresses of the site's cluster nodes (e.g., akka.tcp://scadabridge@host:port), and optional GrpcNodeAAddress and GrpcNodeBAddress fields containing gRPC endpoints (e.g., http://host:8083).
- The CentralCommunicationActor loads all site addresses from the database at startup and creates one ClusterClient per site, configured with both NodeA and NodeB as contact points. The SiteStreamGrpcClientFactory uses GrpcNodeAAddress/GrpcNodeBAddress to create per-site gRPC channels for streaming.
- The address cache is refreshed every 60 seconds and on-demand when site records are added, edited, or deleted via the Central UI or CLI. ClusterClient instances are recreated when contact points change.
- When routing a message to a site, central sends via ClusterClient.Send("/user/site-communication", msg). ClusterClient handles failover between NodeA and NodeB internally; there is no application-level NodeA preference/NodeB fallback logic.
- Heartbeats from sites serve health monitoring only; they do not serve as a registration or address discovery mechanism.
- If no addresses are configured for a site, messages to that site are dropped and the caller's Ask times out.
Site → Central Communication
- Site nodes configure a list of CentralContactPoints (both central node addresses) instead of a single CentralActorPath.
- The site creates a ClusterClient using the central contact points and sends heartbeats, health reports, and other messages via ClusterClient.Send("/user/central-communication", msg).
- ClusterClient handles automatic failover between central nodes: if the active central node goes down, the site's ClusterClient reconnects to the standby node transparently.
Message Timeouts
Each request/response pattern has a default timeout that can be overridden in configuration:
| Pattern | Default Timeout | Rationale |
|---|---|---|
| 1. Deployment | 120 seconds | Script compilation at the site can be slow |
| 2. Instance Lifecycle | 30 seconds | Stopping and starting actors is fast |
| 3. System-Wide Artifacts | 120 seconds per site | Includes shared script recompilation |
| 4. Integration Routing | 30 seconds | External system waiting for response; Inbound API per-method timeout may cap this further |
| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |
| 9. Notification Submission | 30 seconds | Fire-and-forget with ack; central acks after persisting the row |
| 10. Cached Call Telemetry | 30 seconds | Telemetry emission (ClusterClient) is fire-and-forget; the reconciliation pull is the unary gRPC PullSiteCalls request/response (its deadline is the gRPC call timeout, not the Akka ask) |
Timeouts use the Akka.NET ask pattern. If no response is received within the timeout, the caller receives a timeout failure.
Transport Configuration
Akka.NET remoting provides the underlying transport for both intra-cluster communication and ClusterClient connections. The following transport-level settings are explicitly configured (not left to framework defaults) for predictable behavior:
- Transport heartbeat interval: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
- Failure detection threshold: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
- Reconnection: ClusterClient handles reconnection and failover between contact points automatically for cross-cluster communication. No custom reconnection logic is required.
These settings should be tuned for the expected network conditions between central and site clusters.
Application-Level Correlation
All request/response messages include an application-level correlation ID to ensure correct pairing of requests and responses, even across reconnection events:
- Deployments include a deployment ID and revision hash for idempotency (see Deployment Manager).
- Lifecycle commands include a command ID for deduplication.
- Remote queries include a query ID for response correlation.
This provides protocol-level safety beyond Akka.NET's transport guarantees, which may not hold across disconnect/reconnect cycles.
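A minimal sketch of application-level correlation combined with an ask-style timeout follows, using Python's asyncio rather than Akka.NET. The Correlator class and the message shape are hypothetical; the point is that a response is matched by its ID, and a response with an unknown or expired ID is dropped instead of being paired with the wrong request.

```python
import asyncio
import uuid


class Correlator:
    def __init__(self):
        self._pending = {}  # correlation ID -> future awaiting the response

    async def ask(self, send, payload, timeout):
        correlation_id = str(uuid.uuid4())
        future = asyncio.get_running_loop().create_future()
        self._pending[correlation_id] = future
        try:
            send({"correlation_id": correlation_id, "payload": payload})
            return await asyncio.wait_for(future, timeout)  # raises on timeout
        finally:
            self._pending.pop(correlation_id, None)  # forget the request either way

    def on_response(self, response):
        future = self._pending.get(response["correlation_id"])
        if future is not None and not future.done():
            future.set_result(response["payload"])
            return True
        return False  # unknown or late response: dropped


async def demo():
    correlator = Correlator()
    loop = asyncio.get_running_loop()

    def send(message):
        # Fake site: reply shortly afterwards, echoing the correlation ID.
        reply = {"correlation_id": message["correlation_id"], "payload": "ok:" + message["payload"]}
        loop.call_later(0.01, correlator.on_response, reply)

    result = await correlator.ask(send, "query", timeout=1.0)
    stale = correlator.on_response({"correlation_id": "unknown", "payload": "late"})
    return result, stale


result, stale = asyncio.run(demo())
```

A reply that arrives after a reconnect for a request that already timed out falls into the "unknown" branch, which is the case the correlation ID is there to catch.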
Message Ordering
Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee — messages to a given site are processed in the order they are sent. Callers do not need to handle out-of-order delivery.
ManagementActor and ClusterClient
The ManagementActor is registered at the well-known path /user/management on central nodes and advertised via ClusterClientReceptionist. External tools (primarily the CLI) connect using Akka.NET ClusterClient, which contacts the receptionist to discover the ManagementActor. This is a separate ClusterClient usage from the inter-cluster ClusterClient connections used for central-site messaging — the CLI does not participate in cluster membership or affect the hub-and-spoke topology.
Connection Failure Behavior
Disconnect is detected at the transport layer, never via an application-level signal from central. There is no ConnectionStateChanged-style synchronous notification: the central coordinator does not maintain a model of "this site is up / down" because the two transports already report unavailability at their natural cadence.
- In-flight command/control messages (ClusterClient + Ask): When a connection drops while a request is in flight (e.g., a deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is no automatic retry or buffering at central; the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages. An in-progress deployment whose round-trip exceeds the Ask timeout (default 120 s at CommunicationService.DeployInstanceAsync) surfaces as DeploymentStatus.Failed to the caller.
- Debug streams (gRPC): Any gRPC stream interruption is detected by the HTTP/2 keepalive PING (~25 s) and triggers reconnection logic in the DebugStreamBridgeActor. The bridge actor attempts to reconnect to the other site node endpoint (NodeB if NodeA failed, or vice versa), with up to 3 retries and 5-second backoff. If all retries fail, the consumer is notified via OnStreamTerminated and the bridge actor is stopped. Events during the reconnection gap are lost (acceptable for real-time debug view). On successful reconnection, the consumer can request a fresh snapshot to re-sync state.
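The debug-stream reconnection policy can be sketched as a bounded retry loop. This Python version is illustrative only: the function signature is hypothetical, and two details are assumptions not stated above, namely that the first attempt happens immediately (backoff applies only between attempts) and that each retry alternates to the other endpoint.

```python
import time


def reconnect(endpoints, failed_index, connect, max_retries=3, backoff_seconds=5.0, sleep=time.sleep):
    """Try to re-open the stream on alternating endpoints; return None if all attempts fail."""
    index = failed_index
    for attempt in range(max_retries):
        index = (index + 1) % len(endpoints)  # switch to the other node
        if attempt > 0:
            sleep(backoff_seconds)            # back off before each further attempt
        try:
            return connect(endpoints[index])
        except ConnectionError:
            continue
    return None  # caller notifies the consumer (OnStreamTerminated) and stops the bridge


attempts = []
waits = []


def flaky_connect(endpoint):
    # Hypothetical connector that fails twice before succeeding.
    attempts.append(endpoint)
    if len(attempts) < 3:
        raise ConnectionError(endpoint)
    return "stream@" + endpoint


stream = reconnect(["http://node-a:8083", "http://node-b:8083"], 0, flaky_connect, sleep=waits.append)
```

Injecting the sleep function keeps the sketch testable without real delays; after a successful reconnect the consumer would still need a fresh snapshot, because events during the gap are lost.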
Failover Behavior
- Central failover: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Site ClusterClients automatically reconnect to the standby central node via their configured contact points.
- Site failover: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central's per-site ClusterClient automatically reconnects to the surviving site node. Ongoing debug streams are interrupted and must be re-established by the engineer.
Dependencies
- Akka.NET Remoting + ClusterClient: Provides the command/control transport layer. ClusterClient/ClusterClientReceptionist used for cross-cluster command/control messaging (deployments, lifecycle, subscribe/unsubscribe handshake, snapshots).
- gRPC (Grpc.AspNetCore + Grpc.Net.Client): Provides the real-time data streaming transport. Site nodes host a gRPC server (SiteStreamGrpcServer); central nodes create per-site gRPC clients (SiteStreamGrpcClient).
- Cluster Infrastructure: Manages node roles and failover detection.
- Configuration Database: Provides site node addresses (NodeAAddress, NodeBAddress for Akka remoting; GrpcNodeAAddress, GrpcNodeBAddress for gRPC streaming) for address resolution.
- Site Runtime (SiteStreamManager): The SiteStreamGrpcServer subscribes to SiteStreamManager to receive real-time events for gRPC delivery.
- ISiteAuditQueue (site-local): Handed to SiteStreamGrpcServer (post-construction, on site roles) so the PullAuditEvents RPC can read the site's Pending/Forwarded audit rows to serve the Audit Log (#23) reconciliation pull. Null when not wired (central-only host); the handler then returns an empty response.
- IOperationTrackingStore (site-local): Handed to SiteStreamGrpcServer (post-construction, on site roles) so the PullSiteCalls RPC can read operation-tracking rows changed since a cursor to serve the Site Call Audit (#22) reconciliation pull. Null when not wired; the handler returns an empty response.
- AuditLogIngestActor proxy (central): Handed to SiteStreamGrpcServer after the central cluster singleton starts; the IngestAuditEvents/IngestCachedTelemetry RPCs route ingested batches to it. Null when not yet wired; the handler returns an empty IngestAck so the caller treats it as transient and retries.
Interactions
- Deployment Manager (central): Uses communication to deliver configurations, lifecycle commands, and system-wide artifacts, and receive status.
- Site Runtime: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
- Central UI: Debug view requests and remote queries flow through communication.
- Health Monitoring: Receives periodic health reports from sites.
- Store-and-Forward Engine (site): Parked message queries/commands are routed through communication. Also emits CachedCallTelemetry (push, ClusterClient) and serves the PullSiteCalls gRPC reconciliation pull from its IOperationTrackingStore, and receives relayed RetryParkedOperation/DiscardParkedOperation commands.
- Site Call Audit (central): Receives cached-call telemetry and issues the PullSiteCalls gRPC reconciliation pulls to sites; relays parked-operation Retry/Discard commands to sites through communication.
- Audit Log (#23): Sites forward audit-event telemetry (push) and serve the PullAuditEvents gRPC reconciliation pull from their ISiteAuditQueue; the central AuditLogIngestActor is the ingest target for both the push path and the combined cached-call telemetry packet.
- Site Event Logging: Event log queries are routed through communication.
- Management Service: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, which is a separate channel from inter-cluster remoting.