scadalink-design/docs/requirements/Component-Communication.md
Joseph Doherty 416a03b782 feat: complete gRPC streaming channel — site host, docker config, docs, integration tests
Switch site host to WebApplicationBuilder with Kestrel HTTP/2 gRPC server,
add GrpcPort/keepalive config, wire SiteStreamManager as ISiteStreamSubscriber,
expose gRPC ports in docker-compose, add site seed script, update all 10
requirement docs + CLAUDE.md + README.md for the new dual-transport architecture.
2026-03-21 12:38:33 -04:00


Component: CentralSite Communication

Purpose

The Communication component manages all messaging between the central cluster and site clusters. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, and remote queries (parked messages, event logs). Two transports are used: Akka.NET ClusterClient for command/control messaging and gRPC server-streaming for real-time data (attribute values, alarm states).

Location

Both central and site clusters. Each side has communication actors that handle message routing.

Responsibilities

  • Resolve site addresses (Akka remoting and gRPC) from the configuration database and maintain a cached address map.
  • Establish and maintain cross-cluster connections using Akka.NET ClusterClient/ClusterClientReceptionist for command/control.
  • Establish and maintain per-site gRPC streaming connections for real-time data delivery (site→central).
  • Route messages between central and site clusters in a hub-and-spoke topology.
  • Broker requests from external systems (via central) to sites and return responses.
  • Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
  • Detect site connectivity status for health monitoring.
  • Host the SiteStreamGrpcServer on site nodes (Kestrel HTTP/2) to serve real-time event streams.
  • Manage per-site SiteStreamGrpcClient instances on central nodes via SiteStreamGrpcClientFactory.

Communication Patterns

1. Deployment (Central → Site)

  • Pattern: Request/Response.
  • Central sends a flattened configuration to a site.
  • Site Runtime receives, compiles scripts, creates/updates Instance Actors, and responds with success/failure.
  • No buffering at central — if the site is unreachable, the deployment fails immediately.

2. Instance Lifecycle Commands (Central → Site)

  • Pattern: Request/Response.
  • Central sends disable, enable, or delete commands for specific instances.
  • Site Runtime processes the command and responds with success/failure.
  • If the site is unreachable, the command fails immediately (no buffering).

3. System-Wide Artifact Deployment (Central → Site(s))

  • Pattern: Broadcast with per-site acknowledgment (deploy to all sites), or targeted to a single site (per-site deployment).
  • When shared scripts, external system definitions, database connections, data connections, notification lists, or SMTP configuration are explicitly deployed, central sends them to the target site(s).
  • Each site acknowledges receipt and reports success/failure independently.

4. Integration Routing (External System → Central → Site → Central → External System)

  • Pattern: Request/Response (brokered).
  • External system sends a request to central (e.g., MES requests machine values).
  • Central routes the request to the appropriate site.
  • Site reads values from the Instance Actor and responds.
  • Central returns the response to the external system.

5. Recipe/Command Delivery (External System → Central → Site)

  • Pattern: Fire-and-forget with acknowledgment.
  • External system sends a command to central (e.g., recipe manager sends recipe).
  • Central routes to the site.
  • Site applies and acknowledges.

6. Debug Streaming (Site → Central)

  • Pattern: Subscribe/push with initial snapshot. Two transports: ClusterClient for the subscribe/unsubscribe handshake and initial snapshot, gRPC server-streaming for ongoing real-time events.
  • A DebugStreamBridgeActor (one per active debug session) is created on the central cluster by the DebugStreamService. The bridge actor first opens a gRPC server-streaming subscription to the site via SiteStreamGrpcClient, then sends a SubscribeDebugViewRequest to the site via CentralCommunicationActor (ClusterClient). The site's InstanceActor replies with an initial snapshot via the ClusterClient reply path.
  • gRPC stream (real-time events): The site's SiteStreamGrpcServer receives the gRPC SubscribeInstance call and creates a StreamRelayActor that subscribes to SiteStreamManager for the requested instance. Events (AttributeValueChanged, AlarmStateChanged) flow from SiteStreamManager → StreamRelayActor → Channel<SiteStreamEvent> (bounded, 1000, DropOldest) → gRPC response stream → SiteStreamGrpcClient on central → DebugStreamBridgeActor.
  • The DebugStreamEvent message type no longer exists — events are not routed through ClusterClient. SiteCommunicationActor and CentralCommunicationActor have no role in streaming event delivery.
  • The bridge actor forwards received events to the consumer via callbacks (Blazor component or SignalR hub).
  • Snapshot-to-stream handoff: The gRPC stream is opened before the snapshot request to avoid missing events. The consumer applies the snapshot as baseline, then replays buffered gRPC events with timestamps newer than the snapshot (timestamp-based dedup).
  • Attribute value stream messages: [InstanceUniqueName].[AttributePath].[AttributeName], value, quality, timestamp.
  • Alarm state stream messages: [InstanceUniqueName].[AlarmName], state (active/normal), priority, timestamp.
  • Central sends an unsubscribe request via ClusterClient when the debug session ends. The gRPC stream is cancelled. The site's StreamRelayActor is stopped and the SiteStreamManager subscription is removed.
  • The stream is session-based and temporary.
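The snapshot-to-stream handoff above can be sketched as follows. This is a minimal Python model of the timestamp-based dedup (the real bridge is an Akka.NET actor; all names here are illustrative, not the actual ScadaLink API):

```python
# Sketch of the snapshot-to-stream handoff: apply the snapshot as the
# baseline, then replay only buffered gRPC events strictly newer than
# the snapshot timestamp. Hypothetical names for illustration only.
from dataclasses import dataclass

@dataclass
class AttributeEvent:
    path: str         # [InstanceUniqueName].[AttributePath].[AttributeName]
    value: object
    quality: str
    timestamp: float  # epoch seconds

def apply_handoff(snapshot_values, snapshot_ts, buffered_events):
    """Return the merged state: snapshot baseline plus newer events."""
    state = dict(snapshot_values)                # baseline from the snapshot
    for ev in sorted(buffered_events, key=lambda e: e.timestamp):
        if ev.timestamp > snapshot_ts:           # timestamp-based dedup
            state[ev.path] = ev.value
    return state

# One buffered event predates the snapshot (discarded), one postdates it
# (overrides the baseline value).
snap = {"Pump1.Motor.Speed": 100, "Pump1.Motor.Temp": 40}
events = [
    AttributeEvent("Pump1.Motor.Speed", 90, "Good", 9.0),   # stale
    AttributeEvent("Pump1.Motor.Temp", 45, "Good", 11.0),   # newer
]
print(apply_handoff(snap, 10.0, events))
# {'Pump1.Motor.Speed': 100, 'Pump1.Motor.Temp': 45}
```

Because the gRPC stream is opened before the snapshot request, any event that races the snapshot lands in the buffer and is either deduplicated (older timestamp) or replayed (newer timestamp), so no update is lost or double-applied.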

Site-Side gRPC Streaming Components

  • SiteStreamGrpcServer: gRPC service (SiteStreamService.SiteStreamServiceBase) hosted on each site node via Kestrel HTTP/2 on a dedicated port (default 8083). Implements the SubscribeInstance RPC. For each subscription, creates a StreamRelayActor that subscribes to SiteStreamManager, bridges events through a Channel<SiteStreamEvent> to the gRPC response stream. Tracks active subscriptions by correlation_id — duplicate IDs cancel the old stream. Enforces a max concurrent stream limit (default 100). Rejects streams with StatusCode.Unavailable before the actor system is ready.
  • StreamRelayActor: Short-lived actor created per gRPC subscription. Receives domain events (AttributeValueChanged, AlarmStateChanged) from SiteStreamManager, converts them to protobuf SiteStreamEvent messages, and writes to the Channel<SiteStreamEvent> writer. Stopped when the gRPC stream is cancelled or the client disconnects.
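The bounded, drop-oldest channel between SiteStreamManager and the gRPC response stream can be modeled as below. In the real system this is a .NET Channel&lt;SiteStreamEvent&gt; with BoundedChannelFullMode.DropOldest; a Python deque with a maxlen gives the same semantics for illustration:

```python
# Minimal model of the bounded (capacity 1000, DropOldest) event channel.
# A deque with maxlen evicts the oldest item when full, so a slow gRPC
# reader loses the oldest events rather than blocking the producer.
from collections import deque

class DropOldestChannel:
    def __init__(self, capacity=1000):
        self._buf = deque(maxlen=capacity)

    def write(self, event):
        self._buf.append(event)   # never blocks; drops oldest when full

    def read_all(self):
        items = list(self._buf)
        self._buf.clear()
        return items

ch = DropOldestChannel(capacity=3)
for i in range(5):                # write 5 events into a capacity-3 channel
    ch.write(i)
print(ch.read_all())              # the oldest two events (0, 1) were dropped
# [2, 3, 4]
```

Drop-oldest is the right policy for a real-time debug view: the newest attribute values and alarm states are what matter, and a stale backlog is worthless once a fresher value exists.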

Central-Side Debug Stream Components

  • DebugStreamService: Singleton service that manages debug stream sessions. Resolves instance ID to unique name and site, creates and tears down DebugStreamBridgeActor instances, and provides a clean API for both Blazor components and the SignalR hub. Injects SiteStreamGrpcClientFactory for gRPC stream creation.
  • DebugStreamBridgeActor: One per active debug session. Opens a gRPC streaming subscription via SiteStreamGrpcClient and receives real-time events via callback. Also receives the initial DebugViewSnapshot via ClusterClient. Forwards all events to the consumer via callbacks. Handles gRPC stream errors with reconnection logic: tries the other site node endpoint, retries with backoff (max 3 retries), terminates the session if all retries fail.
  • SiteStreamGrpcClient: Per-site gRPC client that manages GrpcChannel instances and streaming subscriptions. Reads from the gRPC response stream in a background task, converts protobuf messages to domain events, and invokes the onEvent callback.
  • SiteStreamGrpcClientFactory: Caches per-site SiteStreamGrpcClient instances. Reads GrpcNodeAAddress / GrpcNodeBAddress from the Site entity (loaded by CentralCommunicationActor). Falls back to NodeB if NodeA connection fails. Disposes clients on site removal or address change.
  • DebugStreamHub: SignalR hub at /hubs/debug-stream for external consumers (e.g., CLI). Authenticates via Basic Auth + LDAP and requires the Deployment role. Server-to-client methods: OnSnapshot, OnAttributeChanged, OnAlarmChanged, OnStreamTerminated.

gRPC Proto Definition

The streaming protocol is defined in sitestream.proto (src/ScadaLink.Communication/Protos/sitestream.proto):

  • Service: SiteStreamService with a single RPC SubscribeInstance(InstanceStreamRequest) returns (stream SiteStreamEvent).
  • Messages: InstanceStreamRequest (correlation_id, instance_unique_name), SiteStreamEvent (correlation_id, oneof event: AttributeValueUpdate, AlarmStateUpdate).
  • The oneof event pattern is extensible — future event types (health metrics, connection state changes) are added as new fields without breaking existing consumers.
  • Proto field numbers are never reused. Old clients ignore unknown oneof variants.
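A sketch of what sitestream.proto looks like, based on the description above. The service and message names come from this document; the field numbers and the fields of the two update sub-messages are assumptions for illustration, not the actual file:

```proto
syntax = "proto3";

// Illustrative sketch only — field numbers and sub-message fields are assumed.
service SiteStreamService {
  rpc SubscribeInstance (InstanceStreamRequest) returns (stream SiteStreamEvent);
}

message InstanceStreamRequest {
  string correlation_id = 1;
  string instance_unique_name = 2;
}

message SiteStreamEvent {
  string correlation_id = 1;
  oneof event {
    AttributeValueUpdate attribute_value = 2;
    AlarmStateUpdate alarm_state = 3;
    // Future event types (health metrics, connection state) extend the oneof
    // with new field numbers; old clients ignore unknown variants.
  }
}

message AttributeValueUpdate {
  string path = 1;       // [InstanceUniqueName].[AttributePath].[AttributeName]
  string value = 2;
  string quality = 3;
  int64 timestamp = 4;
}

message AlarmStateUpdate {
  string path = 1;       // [InstanceUniqueName].[AlarmName]
  string state = 2;      // active / normal
  int32 priority = 3;
  int64 timestamp = 4;
}
```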

gRPC Connection Keepalive

Three layers of dead-client detection prevent orphan streams on site nodes:

| Layer | Detects | Timeline | Mechanism |
| --- | --- | --- | --- |
| TCP RST | Clean process death, connection close | 15 s | OS-level TCP; WriteAsync throws |
| gRPC keepalive PING | Network partition, silent crash, firewall drop | ~25 s | HTTP/2 PING frames; CancellationToken fires |
| Session timeout | Misconfigured keepalive, long-lived zombie streams | 4 hours | CancellationTokenSource.CancelAfter |

Keepalive settings are configurable via CommunicationOptions:

  • GrpcKeepAlivePingDelay: 15 seconds (default)
  • GrpcKeepAlivePingTimeout: 10 seconds (default)
  • GrpcMaxStreamLifetime: 4 hours (default)
  • GrpcMaxConcurrentStreams: 100 (default)
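As one possible shape, the CommunicationOptions defaults above might appear in appsettings.json like this. The section name and TimeSpan string format are assumptions; consult the actual options binding for the real shape:

```json
{
  "Communication": {
    "GrpcKeepAlivePingDelay": "00:00:15",
    "GrpcKeepAlivePingTimeout": "00:00:10",
    "GrpcMaxStreamLifetime": "04:00:00",
    "GrpcMaxConcurrentStreams": 100
  }
}
```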

6a. Debug Snapshot (Central → Site)

  • Pattern: Request/Response (one-shot, no subscription).
  • Central sends a DebugSnapshotRequest (identified by instance unique name) to the site.
  • Site's Deployment Manager routes to the Instance Actor by unique name.
  • Instance Actor builds and returns a DebugViewSnapshot with all current attribute values and alarm states (same payload as the streaming initial snapshot).
  • No subscription is created; no stream is established.
  • Uses the 30-second QueryTimeout.

7. Health Reporting (Site → Central)

  • Pattern: Periodic push.
  • Sites periodically send health metrics (connection status, node status, buffer depth, script error rates, alarm evaluation error rates) to central.

8. Remote Queries (Central → Site)

  • Pattern: Request/Response.
  • Central queries sites for:
    • Parked messages (store-and-forward dead letters).
    • Site event logs.
    • Instance debug snapshots (attribute values and alarm states).
  • Central can also send management commands:
    • Retry or discard parked messages.

Topology

Central Cluster
  ├── ClusterClient → Site A Cluster (SiteCommunicationActor via Receptionist)  [command/control]
  ├── ClusterClient → Site B Cluster (SiteCommunicationActor via Receptionist)  [command/control]
  ├── ClusterClient → Site N Cluster (SiteCommunicationActor via Receptionist)  [command/control]
  │
  ├── SiteStreamGrpcClient ◄── gRPC stream ── Site A (SiteStreamGrpcServer)     [real-time data]
  ├── SiteStreamGrpcClient ◄── gRPC stream ── Site B (SiteStreamGrpcServer)     [real-time data]
  └── SiteStreamGrpcClient ◄── gRPC stream ── Site N (SiteStreamGrpcServer)     [real-time data]

Site Clusters
  ├── ClusterClient → Central Cluster (CentralCommunicationActor via Receptionist)  [command/control]
  └── SiteStreamGrpcServer (Kestrel HTTP/2, port 8083) → serves gRPC streams       [real-time data]
  • Sites do not communicate with each other.
  • All inter-cluster communication flows through central.
  • Both CentralCommunicationActor and SiteCommunicationActor are registered with their cluster's ClusterClientReceptionist for cross-cluster discovery.

Site Address Resolution

Central discovers site addresses through the configuration database, not runtime registration:

  • Each site record in the Sites table includes optional NodeAAddress and NodeBAddress fields containing base Akka addresses of the site's cluster nodes (e.g., akka.tcp://scadalink@host:port), and optional GrpcNodeAAddress and GrpcNodeBAddress fields containing gRPC endpoints (e.g., http://host:8083).
  • The CentralCommunicationActor loads all site addresses from the database at startup and creates one ClusterClient per site, configured with both NodeA and NodeB as contact points. The SiteStreamGrpcClientFactory uses GrpcNodeAAddress / GrpcNodeBAddress to create per-site gRPC channels for streaming.
  • The address cache is refreshed every 60 seconds and on-demand when site records are added, edited, or deleted via the Central UI or CLI. ClusterClient instances are recreated when contact points change.
  • When routing a message to a site, central sends via ClusterClient.Send("/user/site-communication", msg). ClusterClient handles failover between NodeA and NodeB internally — there is no application-level NodeA preference/NodeB fallback logic.
  • Heartbeats from sites serve health monitoring only — they do not serve as a registration or address discovery mechanism.
  • If no addresses are configured for a site, messages to that site are dropped and the caller's Ask times out.

Site → Central Communication

  • Site nodes configure a list of CentralContactPoints (both central node addresses) instead of a single CentralActorPath.
  • The site creates a ClusterClient using the central contact points and sends heartbeats, health reports, and other messages via ClusterClient.Send("/user/central-communication", msg).
  • ClusterClient handles automatic failover between central nodes — if the active central node goes down, the site's ClusterClient reconnects to the standby node transparently.

Message Timeouts

Each request/response pattern has a default timeout that can be overridden in configuration:

| Pattern | Default Timeout | Rationale |
| --- | --- | --- |
| 1. Deployment | 120 seconds | Script compilation at the site can be slow |
| 2. Instance Lifecycle | 30 seconds | Stopping/starting actors is fast |
| 3. System-Wide Artifacts | 120 seconds per site | Includes shared script recompilation |
| 4. Integration Routing | 30 seconds | External system waiting for response; Inbound API per-method timeout may cap this further |
| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |

Timeouts use the Akka.NET ask pattern. If no response is received within the timeout, the caller receives a timeout failure.
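The ask-with-timeout semantics can be sketched as below. This is a Python/asyncio model of the behavior, not the actual Akka.NET API (which uses Ask&lt;T&gt; with a TimeSpan); the function names are illustrative:

```python
# Sketch of ask-with-timeout: send a request, await the response, and
# surface a timeout failure to the caller if no response arrives in time.
import asyncio

async def ask(send_request, timeout_seconds):
    """Model of the ask pattern: the caller always gets a response or
    a timeout failure, never silence."""
    try:
        return await asyncio.wait_for(send_request(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        raise TimeoutError(f"no response within {timeout_seconds}s")

async def slow_site():
    await asyncio.sleep(0.2)   # site takes longer than the configured timeout
    return "ok"

try:
    asyncio.run(ask(slow_site, timeout_seconds=0.05))
except TimeoutError as e:
    print(e)
```

Note that a timeout does not mean the site failed to process the message; it only means no response arrived in time, which is why the correlation IDs described below are needed for idempotency.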

Transport Configuration

Akka.NET remoting provides the underlying transport for both intra-cluster communication and ClusterClient connections. The following transport-level settings are explicitly configured (not left to framework defaults) for predictable behavior:

  • Transport heartbeat interval: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
  • Failure detection threshold: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
  • Reconnection: ClusterClient handles reconnection and failover between contact points automatically for cross-cluster communication. No custom reconnection logic is required.

These settings should be tuned for the expected network conditions between central and site clusters.

Application-Level Correlation

All request/response messages include an application-level correlation ID to ensure correct pairing of requests and responses, even across reconnection events:

  • Deployments include a deployment ID and revision hash for idempotency (see Deployment Manager).
  • Lifecycle commands include a command ID for deduplication.
  • Remote queries include a query ID for response correlation.

This provides protocol-level safety beyond Akka.NET's transport guarantees, which may not hold across disconnect/reconnect cycles.
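The request/response pairing can be sketched as below. This is an illustrative Python model of correlation-ID bookkeeping (the real messages are Akka.NET/protobuf types; names here are hypothetical):

```python
# Sketch of application-level correlation: in-flight requests are keyed
# by a correlation ID, so responses pair correctly and duplicates
# (e.g. a late retransmission after a reconnect) are ignored.
import uuid

class Correlator:
    def __init__(self):
        self._pending = {}   # correlation_id -> in-flight request description

    def send(self, description):
        cid = str(uuid.uuid4())        # deployment/command/query ID
        self._pending[cid] = description
        return cid

    def on_response(self, cid, payload):
        """Pair a response with its request; return None for unknown or
        already-handled IDs (deduplication)."""
        request = self._pending.pop(cid, None)
        if request is None:
            return None
        return (request, payload)

c = Correlator()
cid = c.send("disable instance Pump1")
print(c.on_response(cid, "success"))   # ('disable instance Pump1', 'success')
print(c.on_response(cid, "success"))   # None — duplicate ignored
```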

Message Ordering

Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee — messages to a given site are processed in the order they are sent. Callers do not need to handle out-of-order delivery.

ManagementActor and ClusterClient

The ManagementActor is registered at the well-known path /user/management on central nodes and advertised via ClusterClientReceptionist. External tools (primarily the CLI) connect using Akka.NET ClusterClient, which contacts the receptionist to discover the ManagementActor. This is a separate ClusterClient usage from the inter-cluster ClusterClient connections used for central-site messaging — the CLI does not participate in cluster membership or affect the hub-and-spoke topology.

Connection Failure Behavior

  • In-flight messages: When a connection drops while a request is in flight (e.g., deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is no automatic retry or buffering at central — the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
  • Debug streams: Any gRPC stream interruption triggers reconnection logic in the DebugStreamBridgeActor. The bridge actor attempts to reconnect to the other site node endpoint (NodeB if NodeA failed, or vice versa), with up to 3 retries and 5-second backoff. If all retries fail, the consumer is notified via OnStreamTerminated and the bridge actor is stopped. Events during the reconnection gap are lost (acceptable for real-time debug view). On successful reconnection, the consumer can request a fresh snapshot to re-sync state.
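The reconnection policy for debug streams can be sketched as follows. This is a Python model of the logic described above (alternate to the other site node endpoint, up to 3 retries with 5-second backoff, then notify the consumer); the function names and injected connect/notify callables are illustrative, not the actual DebugStreamBridgeActor API:

```python
# Sketch of the debug-stream reconnection policy. endpoints[0] is the
# endpoint that just failed; each attempt alternates to the other node.
import time

def reconnect(endpoints, connect, notify_terminated,
              max_retries=3, backoff_seconds=5, sleep=time.sleep):
    """Try the other endpoint first, alternating on each retry."""
    for attempt in range(max_retries):
        endpoint = endpoints[(attempt + 1) % len(endpoints)]  # start with the other node
        try:
            return connect(endpoint)       # success: resume streaming
        except ConnectionError:
            sleep(backoff_seconds)         # backoff between attempts
    notify_terminated()                    # e.g. OnStreamTerminated to the consumer
    return None

# Simulated usage: NodeA keeps failing, NodeB accepts the stream.
def connect(endpoint):
    if "node-a" in endpoint:
        raise ConnectionError(endpoint)
    return f"stream via {endpoint}"

result = reconnect(["http://node-a:8083", "http://node-b:8083"],
                   connect, notify_terminated=lambda: None,
                   sleep=lambda s: None)   # no real sleeping in the example
print(result)   # stream via http://node-b:8083
```

After a successful reconnect the consumer requests a fresh snapshot, since events during the gap are lost by design.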

Failover Behavior

  • Central failover: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Site ClusterClients automatically reconnect to the standby central node via their configured contact points.
  • Site failover: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central's per-site ClusterClient automatically reconnects to the surviving site node. Ongoing debug streams are interrupted and must be re-established by the engineer.

Dependencies

  • Akka.NET Remoting + ClusterClient: Provides the command/control transport layer. ClusterClient/ClusterClientReceptionist used for cross-cluster command/control messaging (deployments, lifecycle, subscribe/unsubscribe handshake, snapshots).
  • gRPC (Grpc.AspNetCore + Grpc.Net.Client): Provides the real-time data streaming transport. Site nodes host a gRPC server (SiteStreamGrpcServer); central nodes create per-site gRPC clients (SiteStreamGrpcClient).
  • Cluster Infrastructure: Manages node roles and failover detection.
  • Configuration Database: Provides site node addresses (NodeAAddress, NodeBAddress for Akka remoting; GrpcNodeAAddress, GrpcNodeBAddress for gRPC streaming) for address resolution.
  • Site Runtime (SiteStreamManager): The SiteStreamGrpcServer subscribes to SiteStreamManager to receive real-time events for gRPC delivery.

Interactions

  • Deployment Manager (central): Uses communication to deliver configurations, lifecycle commands, and system-wide artifacts, and receive status.
  • Site Runtime: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
  • Central UI: Debug view requests and remote queries flow through communication.
  • Health Monitoring: Receives periodic health reports from sites.
  • Store-and-Forward Engine (site): Parked message queries/commands are routed through communication.
  • Site Event Logging: Event log queries are routed through communication.
  • Management Service: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, which is a separate channel from inter-cluster remoting.